Python: How to Write a Web Scraper


Back from vacation! I did a version of this in AutoHotkey a while back, but this one’s in Python. It does require importing modules—check out this post for that. You’ll want both requests and beautifulsoup (I’m using beautifulsoup4) before getting started.

Step 1: Get the HTML

Webpages are made up of HTML (Hypertext Markup Language). Requests is how Python grabs that from specific sites. You can use requests.get like so:

import requests
res = requests.get('https://gifguide2code.wordpress.com/')

Step 1.5: You can skip this bit (it’s really unnecessary)

Ok, so say you want to write the worst web scraper of all time. Here’s the script for that:

import requests
res = requests.get('https://gifguide2code.wordpress.com/')
# 'wb' = write in binary mode, since iter_content() hands back bytes
dataDump = open('C:\\Users\\Desktop\\Wordpress.txt', 'wb')
for chunk in res.iter_content(100000):
    dataDump.write(chunk)
dataDump.close()

Basically, it takes ALL of the HTML and throws it into a text document. You can use iter_content() to loop through the HTML 100,000 bytes at a time and dump everything including the links, the addresses to images, the formatting instructions. Why 100,000 bytes at a time you ask? I have no idea. I got it from Al Sweigart and did not ask questions.

That’ll get you this horrendous mess here:

HTML

You may have noticed that this is wildly unhelpful. Ideally it needs to be written in a way that doesn’t make you want to scratch your eyes out.
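About that 100,000: the chunk size only decides how big each piece handed to write() is. Whatever number you pick, the same bytes end up in the file. Here's a rough sketch of the chunking idea using a made-up byte string instead of a real download. The names iter_chunks and fake_page are invented for illustration, and this is not how requests implements iter_content, just the same read-a-piece-at-a-time pattern.

```python
import io

def iter_chunks(stream, chunk_size):
    """Yield successive pieces of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object means the stream is exhausted
            break
        yield chunk

# Stand-in for a downloaded page: 250,000 bytes of made-up data.
fake_page = b"x" * 250_000

chunks = list(iter_chunks(io.BytesIO(fake_page), 100_000))
print([len(c) for c in chunks])       # size of each piece
print(b"".join(chunks) == fake_page)  # the pieces add back up to the original
```

With 250,000 bytes and a chunk size of 100,000 you get two full pieces and one leftover piece of 50,000.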

Step 2: Cut out the pieces you actually need

Figuring out what part of the HTML you want isn’t as difficult as you might think. You do need a vague notion of how HTML works, but you can also get pretty far by using the Page Inspector in Firefox.

Just right click the part you want, hit Inspect Element, and it’ll tell you what to look for.

Inspection Editor

You may notice there are a bunch of <>’s floating around in there. Those are tags. You can check out the different types here.

Now, I want to pull out all the titles to posts on my main page. Those are nested within the <h2> tags.

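Once you know which tags you want, you can push the HTML text into a BeautifulSoup object and use .find_all to collect the specific pieces within those tags. The res.raise_for_status() line is a safety check: it stops the script with an error if the download failed (a 404, say) instead of letting it carry on with a bad page.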

import requests, bs4
res = requests.get('https://gifguide2code.wordpress.com/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print(soup.find_all('h2'))

The find_all method gets you a collection of all the <h2>’s in the HTML. Keep in mind, this is not a single string, and printing it shows every tag with its markup still attached. That means you need to loop through the collection to get at each individual <h2>. When you do that, you can ask for just the text, which gets rid of all the links, image references, and formatting tags.

import requests, bs4
res = requests.get('https://gifguide2code.wordpress.com/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
headers = soup.find_all('h2')

for h in headers:
    print(h.text)

That’ll get you here:

Blog Titles
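If you want to poke at find_all without hitting a live site, you can feed BeautifulSoup a string. This is a minimal sketch on a tiny hand-written page; sample_html and its contents are made up for illustration, and the real blog's markup will look different.

```python
import bs4

# A tiny hand-written page standing in for the real blog HTML
# (an assumption for illustration, not the live page's markup).
sample_html = """
<html><body>
  <h2><a href="/post-one">First <em>Post</em></a></h2>
  <p>Some body text.</p>
  <h2><a href="/post-two">Second Post</a></h2>
</body></html>
"""

soup = bs4.BeautifulSoup(sample_html, 'html.parser')
headers = soup.find_all('h2')

print(len(headers))     # how many <h2> tags were found
print(headers[0])       # one whole tag, markup included
print(headers[0].text)  # just the text inside it

# Collect the text of every <h2>, trimming stray whitespace at the ends.
titles = [h.text.strip() for h in headers]
print(titles)
```

The links and the <em> tag disappear from the text, which is the whole point of asking for .text instead of printing the tag itself.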

Step 3: Dump that into a text document

This is the easy part. All you have to do is specify a text document to pop it into. And you’re done. Here’s the complete script for reference:

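import requests, bs4
res = requests.get('https://gifguide2code.wordpress.com/')
res.raise_for_status()
# utf-8 so curly quotes or emoji in titles don't trip up the write
dataDump = open('C:\\Users\\Desktop\\Wordpress.txt', 'w', encoding='utf-8')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
headers = soup.find_all('h2')

for h in headers:
    # write() doesn't add line breaks, so put each title on its own line
    dataDump.write(h.text + '\n')
dataDump.close()

print('Done.')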
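One thing worth knowing about the file step: write() does not add a line break on its own, so each title needs a '\n' if you want one per line. Below is a small sketch of just that step, using a with block so the file gets closed even if something goes wrong partway through. The write_titles helper, the sample titles, and the temp-folder path are all made up for illustration.

```python
import os
import tempfile

def write_titles(titles, path):
    """Write each title on its own line to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as out:
        for title in titles:
            out.write(title + '\n')

# Made-up titles and a throwaway path in the system temp folder.
sample_titles = ['First Post', 'Second Post']
out_path = os.path.join(tempfile.gettempdir(), 'titles_example.txt')
write_titles(sample_titles, out_path)

# Read the file back to check that each title landed on its own line.
with open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

In the full script you could pass in the text of each <h2> and your own output path instead of the sample values.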

9 Replies to “Python: How to Write a Web Scraper”

ARJ says:

January 1, 2018 at 9:34 am

a beautiful soup tuto for next time ^^_


1. gifguide2code says:
  
  January 2, 2018 at 4:56 am
  
  Beautiful Soup is a wily one–I’m working on a nice, long tutorial for that one 🙂
  
  
micropi says:

January 15, 2018 at 8:21 pm

Very useful post! I am using it to get random text to train my word prediction AI 🙂


1. gifguide2code says:
  
  January 16, 2018 at 1:13 am
  
  Ooooooh, that’s marvelous! You posting new tutorials soon? I try to keep an eye out for posts of the DIY slot machine variety 🙂
  
  
  1. micropi says:
    
    January 19, 2018 at 8:31 pm
    
    I haven’t posted much recently because I’ve been so busy, but hopefully I’ll start again soon!
    
    
Raven Hon says:

January 25, 2018 at 8:15 am

Great, I posted a Python web scraper post as well, but in Scrapy (http://www.codeastar.com/web-scraping-python/) :]]


1. gifguide2code says:
  
  January 26, 2018 at 4:53 am
  
  You got a cool blog going over there Raven–you an illustrator as well? I appreciate posts that mesh cartooning and coding.
  
  
  1. Raven Hon says:
    
    January 27, 2018 at 2:46 am
    
    Thank you, Gif. No, I am a hobbyist in drawing. People always think coding is serious, but I think we can do serious things with fun.
    
    
    1. gifguide2code says:
      
      January 27, 2018 at 5:21 pm
      
      Yeah, if there were more cartoon guides to coding, I’d be WAY further along than I am now.
      
