Python: How to Write a Web Scraper


Back from vacation! I did a version of this in AutoHotkey a while back, but this one’s in Python. It does require importing modules; check out this post for that. You’ll want both requests and Beautiful Soup installed before getting started (I’m using the beautifulsoup4 package, which you import as bs4).

Step 1: Get the HTML

Webpages are made up of HTML (Hypertext Markup Language). Requests is how Python grabs that from specific sites. You can use requests.get like so:

import requests
res = requests.get('https://gifguide2code.wordpress.com/')

Step 1.5: You can skip this bit (it’s really unnecessary)

Ok, so say you want to write the worst web scraper of all time. Here’s the script for that:

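import requests
# Download the page
res = requests.get('https://gifguide2code.wordpress.com/')
# Open the output file in binary mode, since iter_content hands back bytes
dataDump = open('C:\\Users\\Desktop\\Wordpress.txt', 'wb')
# Write the response to disk 100,000 bytes at a time
for chunk in res.iter_content(100000):
    dataDump.write(chunk)
dataDump.close()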

Basically, it takes ALL of the HTML and throws it into a text document. You can use iter_content() to loop through the HTML 100,000 bytes at a time and dump everything, including the links, the addresses to images, and the formatting instructions. Why 100,000 bytes at a time, you ask? I have no idea. I got it from Al Sweigart and did not ask questions.

That’ll get you this horrendous mess here:

[Screenshot: the raw HTML dump]

You may have noticed that this is wildly unhelpful. Ideally it needs to be written in a way that doesn’t make you want to scratch your eyes out.
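If you do want to keep the dump-everything approach around, here’s a rough sketch of the same chunked write wrapped in a function. The names dump_chunks, fake_chunks, and demo_path are made up for this example, and the fake chunks stand in for what res.iter_content(100000) would hand you, so it runs without touching the network.

```python
import os
import tempfile

def dump_chunks(chunks, path):
    """Write an iterable of bytes objects to path and return the byte count."""
    total = 0
    # 'wb' because the chunks are raw bytes. The with-block closes the file
    # even if one of the writes fails partway through.
    with open(path, 'wb') as out:
        for chunk in chunks:
            total += out.write(chunk)
    return total

# Stand-in for res.iter_content(100000): any iterable of bytes works here.
fake_chunks = [b'<html>', b'<body>hi</body>', b'</html>']
demo_path = os.path.join(tempfile.gettempdir(), 'dump_chunks_demo.txt')
written = dump_chunks(fake_chunks, demo_path)
print(written, 'bytes written to', demo_path)
```

With a real response, you’d pass res.iter_content(100000) as the first argument and your own file path as the second.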

Step 2: Cut out the pieces you actually need

Figuring out what part of the HTML you want isn’t as difficult as you might think. You do need a vague notion of how HTML works, but you can also get pretty far by using the Page Inspector in Firefox.

Just right-click the part you want, hit Inspect Element, and it’ll tell you what to look for.

[Screenshot: the Firefox Page Inspector]

You may notice there are a bunch of <>’s floating around in there. Those are tags. You can check out the different types here.

Now, I want to pull out the titles of all the posts on my main page. Those are nested within the <h2> tags.

Once you know which tags you want, you can push the HTML text into a BeautifulSoup object and use .find_all to collect the specific pieces within those tags.

import requests, bs4
res = requests.get('https://gifguide2code.wordpress.com/')
res.raise_for_status()  # stop here if the download failed
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print(soup.find_all('h2'))

What does that look like?

[Screenshot: the printed list of <h2> tags]

The find_all method gets you a collection of the <h2>’s in the HTML. Keep in mind, this is not a single string. That means you need to loop through the collection to get at what’s inside each individual <h2>. When you do that, you can ask for just the text, which gets rid of all the links, image references, and formatting tags.

import requests, bs4
res = requests.get('https://gifguide2code.wordpress.com/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
headers = soup.find_all('h2')

# Each h is one <h2> tag; .text is just the words inside it
for h in headers:
    print(h.text)

That’ll get you here:

[Screenshot: the blog titles as plain text]
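If you’d like to poke at find_all without hitting a live site, here’s a minimal sketch that runs on a small made-up HTML string. The sample_html snippet and its post titles are invented for illustration, and it assumes you have beautifulsoup4 installed.

```python
import bs4

# Made-up HTML standing in for a downloaded page.
sample_html = """
<html><body>
  <h2><a href="/post-one">First Post</a></h2>
  <h2><a href="/post-two">Second <em>Post</em></a></h2>
  <p>Not a title.</p>
</body></html>
"""

soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# find_all returns a list-like collection of tags, not one big string.
headers = soup.find_all('h2')

# .text drops the <a> and <em> tags and keeps only the words inside them.
titles = [h.text.strip() for h in headers]
print(titles)
```

The <p> gets ignored because find_all was only asked for <h2> tags, and the links and emphasis tags inside each heading disappear once you ask for the text.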

Step 3: Dump that into a text document

This is the easy part. All you have to do is specify a text document to pop it into. And you’re done. Here’s the complete script for reference:

import requests, bs4
res = requests.get('https://gifguide2code.wordpress.com/')
res.raise_for_status()
dataDump = open('C:\\Users\\Desktop\\Wordpress.txt', 'w')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
headers = soup.find_all('h2')

# Write each title on its own line so they don't run together
for h in headers:
    dataDump.write(h.text.strip() + '\n')
dataDump.close()

print('Done.')
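One possible variation on the file-writing step, sketched here with stand-in data so it runs offline. The write_titles function, sample_titles, and demo_path are hypothetical names for this example. It uses a with-block so the file closes on its own, and it asks for UTF-8 explicitly, on the assumption that some titles contain characters (curly quotes, for instance) that the default encoding on your machine can’t write.

```python
import os
import tempfile

def write_titles(titles, path):
    """Write each non-blank title on its own line and return how many were written."""
    count = 0
    # encoding='utf-8' avoids depending on the platform's default encoding.
    with open(path, 'w', encoding='utf-8') as out:
        for title in titles:
            cleaned = title.strip()
            if cleaned:  # skip headings that are empty once trimmed
                out.write(cleaned + '\n')
                count += 1
    return count

# Stand-in for [h.text for h in headers] from the script above.
sample_titles = ['  First Post ', '', 'Second Post\n']
demo_path = os.path.join(tempfile.gettempdir(), 'titles_demo.txt')
saved = write_titles(sample_titles, demo_path)
print(saved, 'titles saved to', demo_path)
```

In the real script, you’d pass the text of each <h2> as the first argument and your own file path as the second.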
