Python: What the heck is Regex?


Regex is one of those massive pains in the ass that’s just too helpful not to learn. I won’t pretend it’s fun. I’d put Regex at the same level of No-Dear-God-Why-Is-This-Happening as an enema or a certified letter from the IRS. But once you wrap your head around it, you’ll want to use it all the time because it makes everything else just… stupid easy.

Regex stands for Regular Expression. Basically, it’s a way to find data when you don’t know exactly what you’re looking for.

Say you’ve got a giant block of text. Your Google Location JSON data, for instance. JSON files are gross.

Oh. So. Ugly.

It’s the kind of thing you would probably rather have in an Excel spreadsheet. With the Timestamp in column A, Latitude in column B, and Longitude in C.

But how do you make that happen without descending into a dark and dangerous black hole of copy/paste/repeat?

The RegEx Module: FindAll Function

For right now, I’m going to focus on the re.findall() function, which returns a list of every bit of text in your string that matched your search pattern.

You’ll start off with the import re line. Then you’ll define a search pattern using this:

import re
testString = ""
x = re.findall("MY SEARCH PATTERN", testString)
print(x)

What do you mean by Search Pattern?

The search pattern, in its most basic format, will just return a string if you ask it to. Let’s try it with the first line of my JSON data:

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

A Simple Example
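Something along these lines, using the word latitudeE7 as an arbitrary pattern (any literal text would do). Ask for a plain string, and you get that plain string back once for every time it shows up:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# A literal pattern only matches that exact text, character for character.
x = re.findall("latitudeE7", testString)
print(x)  # ['latitudeE7']
```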

That by itself—not helpful. But combine it with some special characters, and you can grab all of your coordinates.

Special Characters?

In JSON, the latitude is formatted like so:

latitudeE7 : 258965182

Basically, it’s the text “latitudeE7 : ” followed by 9 digits. You can search for that using a special sequence that tells Python to look for a digit, or 9 of them.

The special sequence for a digit is “\d”.

Slightly more complicated example
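Here’s a minimal sketch of that search, run against the same test string. The r in front of the pattern is my addition: it marks a raw string, so Python leaves the backslash alone and hands “\d” to the regex engine as-is.

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# The label "latitudeE7 : " followed by nine \d placeholders, one per digit.
x = re.findall(r"latitudeE7 : \d\d\d\d\d\d\d\d\d", testString)
print(x)  # ['latitudeE7 : 258965182']
```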

Cool, right? There are also metacharacters that can bulk up your search, allowing you to be more vague and Python to catch more things.

For instance, you can use the “|” character to indicate “either or.”

Either Or Example
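A quick sketch of the idea, with two labels I picked for illustration. Python reads the string left to right and reports whichever side of the “|” matches at each spot:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Match either the latitude label or the longitude label.
x = re.findall("latitudeE7|longitudeE7", testString)
print(x)  # ['latitudeE7', 'longitudeE7']
```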

What’s a Wildcard?

Possibly the coolest metacharacter in the RegEx toolbox is the period. It’s the equivalent of the joker in a game of poker. The wildcard can stand for any character (except a new line), so if you don’t know exactly how something is spelled out but you do know how many characters it contains, you’re in business.

For instance, my JSON latitudes generally end in E7. But I don’t know that they all do, and I don’t want to spend time combing through all my data to make sure. So instead, I’m just replacing those characters with dots.

Wildcard
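Roughly like this, assuming the same test string as before. The two dots stand in for the “E7”, and would just as happily stand in for any other two characters:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Each dot matches exactly one character, whatever it happens to be.
x = re.findall(r"latitude.. : \d\d\d\d\d\d\d\d\d", testString)
print(x)  # ['latitudeE7 : 258965182']
```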

A word of warning. Vague isn’t always better.
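To see why, here’s a small sketch with a pattern I made up for the occasion. Swap the first two letters of “latitude” for dots, and the search drags in “altitude” too, which is probably not what you were after:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Two wildcards followed by "titude": loose enough to catch more than one word.
x = re.findall("..titude", testString)
print(x)  # ['latitude', 'altitude']
```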

The Problem with the *

Another magic metacharacter is the asterisk. When you put an *, Python will match 0 or more instances of whatever came before it.

Whoa, Nelly, you might say. What happens if you combine the wildcard period with the asterisk? That means any character, any number of times. On its own, this just returns the entire string (plus an empty match tacked on the end, because zero characters counts as a match too).

Asterisk
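A minimal sketch of the dot-plus-asterisk combo on the test string. One quirk worth knowing: findall hands back two items here, the whole string and then an empty string, because “zero characters” is still a legal match once the end of the text is reached.

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# ".*" means any character, zero or more times.
x = re.findall(".*", testString)

print(x[0] == testString)  # True: the first match swallows the whole string
print(len(x))              # 2: the second item is an empty match at the very end
```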

An important thing to remember here is that it won’t return every possible run of characters (because that would generate a shit ton of results). The asterisk is greedy: it’ll return the longest match it can find.

More Asterisks
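Two sketches of that, with start and end markers I chose for illustration. The first grabs everything between two known words. The second shows the longest-match behavior: “E7” appears twice after “latitude”, and the pattern runs all the way to the last one.

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Known start ("timestampMs"), known finish ("accuracy"), anything in between.
between = re.findall("timestampMs.*accuracy", testString)
print(between)
# ['timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy']

# The match stretches to the LAST "E7" it can reach, not the first.
longest = re.findall("latitude.*E7", testString)
print(longest)  # ['latitudeE7 : 258965182,longitudeE7']
```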

Useful if you know what’s at the start and finish of your desired substring, especially if you don’t know how many characters will be in between.

Hey, you said you were gonna fix my JSON data?

Right. We’re going to focus on digit sequences and wildcards.

If you haven’t already, check out this primer on how Google Location JSON data is formatted. Basically, the file is one giant key/value pair. The key is “locations” and the value is a massive array of nested JSON objects containing coordinates and timestamps. For our purposes, we’re only going to register the first set of coordinates and their timestamp.

First, we’ll use json.load() to convert the JSON file to a Python dictionary, and we’ll prep a csv. Change the filepaths to wherever you’ve got the JSON file stored (I also changed the extension to .txt) and where you want the csv to populate.

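import re
import json
import csv

JSON_data="C:/Location History.txt"
CSV_file="C:/Location History.csv"

# Parse the JSON file into a Python dictionary
with open(JSON_data) as f:
    data = json.load(f)

# Open the csv we're going to write the results into
csv_f=open(CSV_file, 'w')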

Now we’ve got a Python dictionary called data, and all its associated information is in a single key, “locations”. Let’s loop through that and have regex scrape out the timestamp, latitude and longitude:

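for x in range(len(data["locations"])):
    # str() turns each location record into text that regex can search
    regex_Time=re.findall(r".timestampMs.: .\d\d\d\d\d\d\d\d\d\d\d\d\d.", str(data["locations"][x]))
    regex_Lat=re.findall(r".latitude...: \d\d\d\d\d\d\d\d\d", str(data["locations"][x]))
    regex_Long=re.findall(r".longitude...: -\d\d\d\d\d\d\d\d\d", str(data["locations"][x]))
    # Slice off the labels and write one timestamp,latitude,longitude row
    csv_f.write(regex_Time[0][16:-1] + "," + regex_Lat[0][14:] + "," + regex_Long[0][15:]+'\n')

# The parentheses matter: without them the file is never actually closed
csv_f.close()
print('Done')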
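If you’re wondering why those patterns have dots hugging the key names, it helps to look at what str() does to a single record. The record below is made up from the values in my test string, and I’m assuming the timestamp is stored as text while the coordinates are stored as plain numbers, so treat it as a sketch rather than a copy of the real file:

```python
import re

# Hypothetical record, shaped like one entry of data["locations"].
record = {"timestampMs": "1548600561029", "latitudeE7": 258965182, "longitudeE7": -801625674}

# str() wraps the keys (and any text values) in single quotes.
as_text = str(record)
print(as_text)
# {'timestampMs': '1548600561029', 'latitudeE7': 258965182, 'longitudeE7': -801625674}

# The dots in the pattern soak up those quote marks.
regex_Time = re.findall(r".timestampMs.: .\d\d\d\d\d\d\d\d\d\d\d\d\d.", as_text)
print(regex_Time)  # ["'timestampMs': '1548600561029'"]
```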

Once Regex has done its thing, you can slice out the words “timestamp,” “latitude,” and “longitude” by using brackets, like so [  :  ]. The first number will indicate the position where you want the new string to start, and the second will tell it where to end. Negative numbers have Python start the count from the end of the string.

Basically, the line regex_Time[0][16:-1] above asks Python to grab the first instance of a timestamp, which looks like this:

'timestampMs': '1234567891011'

The opening 'timestampMs': ' is 16 characters, and the trailing quote mark is the last character of the string. So [16:-1] will return only our target numbers: 1234567891011.
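You can check the counting yourself with a couple of throwaway strings. These are stand-ins I typed out by hand to mimic what the regex returns, not output from the real file:

```python
# Stand-ins for regex_Time[0] and regex_Lat[0].
time_match = "'timestampMs': '1234567891011'"
lat_match = "'latitudeE7': 258965182"

# Skip the 16-character label, drop the closing quote mark.
print(time_match[16:-1])  # 1234567891011

# Skip the 14-character label, keep everything through the end.
print(lat_match[14:])     # 258965182
```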

Here’s a summary of how the whole script ties together.

Code Explained

After running it, I had a csv file with 219336 results. Which is… Still not pretty, but in the next few posts, I’ll be working on making it more readable.

The Results

Here’s an easily copy/pastable gist of the script. Happy data diving!
