I’ve found that many people see Python, the programming language, much the same way they see Pythons, the scary-ass snakes. If you haven’t dealt with it before, chances are you’re going to think it’s not to be trusted. There’s no way I could convince my mom, for instance, to open up a .py file on her computer.
She will, however, open up a Word document no questions asked.
So I’m abandoning web scraping in Python momentarily to work through a VBA scraper for anyone out there with trust issues. I’m calling it a a jerry rig because no one in their right mind uses VBA to scrape sites—it’s wildly error prone and forces you to use Internet Explorer. But if you’re willing to power through all that, it works. So… Yay.
My test run is going to grab a bunch of movie scripts from www.imsdb.com. For our purposes, I’m just having it dump all the HTML (tags included) into Word.
Fun barrier to entry here. A lot of the scripts I found online didn’t mention that you need to add a few libraries for this to work. If you find yourself encountering an inordinate number of inexplicable errors—that might be your issue. Especially if the error vaguely references a missing object.
So before getting started, go to Tools and then References in the Visual Basic Editor. In the window that pops up, make sure the “Microsoft HTML Object Library” and “Microsoft Internet Controls” boxes are checked.
Ok, after doing that, you may need to restart Excel or Word to make it work.
A Basic Scrape
Microsoft has a pretty extensive discussion on using Internet Explorer Objects, which honestly gave me a raging headache. Long story short, I couldn’t work that out, so I went to this thread for a basic template.
First, create the Internet Explorer object.
Dim appIE As Object Set appIE = CreateObject("InternetExplorer.Application")
Then you feed that object a URL within a With Block.
With appIE .Navigate "http://www.imsdb.com/scripts/Deadpool.html" .Visible = True End With
Now, see that bit where it says “.Visible”? That’s whether you want to see the browser. I recommend keeping that set to true. The moment I switched it to false, the entire macro stopped working.
I have absolutely no idea why.
Do While appIE.Busy DoEvents Loop
Now, I’ve done some Googling on DoEvents. That’s a solid two hours of my life I’ll never get back.* The one important thing of note is that the macro threw an error when I took it out. So leave it in.
*Brief Summary of my Findings (feel free to skip if you already figured this out)*
According to the Microsoft documentation, the DoEvents Function “yields execution so that the operating system can process other events.” Looking at my scraper, I thought to myself, “I have no other processes. What else would I need to be doing? Why are you allowing my computer to get distracted? JUST SCRAPE THE DATA, VBA.” Then I calmed down and wondered why DoEvents would be in a Do While Loop.
Because, while Internet Explorer is busy (getting the webpage), I do want the operating system to process other events. And more importantly, I need to make sure that webpage is loaded before attempting to paste the page’s HTML anywhere. Essentially, those three lines are halting the script from going any further until Explorer has done it’s business and has the HTML ready to go.
*End of Summary*
That takes us to the last line, which can be easily customized whether you want the HTML in Excel or Word . For ease of use, I opted to just put the text into the ActiveDocument.
That’s the easy part. The real fun comes when you plug that into a giant loop to download several pages at a time.
Ok, one thing to note—when you start to bulk process, it’ll only work if you have a list of URLs to pull from or if the URL is easily predictable. I know which scripts I want. I also happen to know that the URL for the scripts on each of those is formatted like this:
http://www.imsdb.com/scripts/ [MOVIE NAME] .html
Movie names with more than one word are have hyphens instead of spaces, like Blade-Runner. So our first order of business in adapting the scraper to handle bulk requests is creating an array of the movie names. Like this one:
Dim Movies() Movies = Array("Alien", "Avatar", "Blade-Runner", "Gattaca", "Ghostbusters", "Jurassic-Park", "Looper", "Lost-in-Space", "Sphere", "Signs", "Spider-Man", "Terminator", "V-for-Vendetta")
That way, we can plug the array into a loop specifying its upper and lower bounds as the start and end.
For x = LBound(Movies) To UBound(Movies) Next x
And within that For loop, we can have VBA automatically stitch together the URL.
Script = "http://www.imsdb.com/scripts/" & Movies(x) & ".html"
And the file name for the Word document so that it can be saved separately.
FilePath = "C:\Users\Desktop\Scripts\" & Movies(x) & ".doc" ActiveDocument.SaveAs2 FilePath
Ah, I lied. One other thing to note. The HTML isn’t always going to load. Or at least, it didn’t for me. It could be because my internet connection is slow or because VBA is plowing through the code so fast, it’s trying to save the document before the pages load properly.
To head off potential issues, we’re going to add an On Error GoTo line. I found that the best place for it is immediately after the line that attempts to insert the HTML into the active document.
ActiveDocument.Content.InsertAfter Text:=appIE.Document.body.innerHTML On Error GoTo ErrorHandler
So what should we do if that line doesn’t work? I’m thinking a warning would be sufficient. Just to let me know to go back and look at those separately.
ErrorHandler: MsgBox Movies(x) & " could not be loaded." Resume
Fun note. Once I plugged this in, I stopped getting object errors (presumably from not having inner HTML within appIE), but oddly enough, Word still managed to import the HTML in spite of throwing out an ErrorHandler box.
Because… Reasons. Anyway. That’s about it.
The full script is available at Gist.
The end result will hopefully be this: