Python: What the heck is Regex?

Regex is one of those massive pains in the ass that’s just too helpful not to learn. I won’t pretend it’s fun. I’d put Regex at the same level of No-Dear-God-Why-Is-This-Happening as an enema or a certified letter from the IRS. But once you wrap your head around it, you’ll want to use it all the time because it makes everything else just… stupid easy.

Regex stands for Regular Expression. Basically, it’s a way to find data when you don’t know exactly what you’re looking for.

Say you’ve got a giant block of text. Your Google Location JSON data, for instance. JSON files are gross.

Oh. So. Ugly.

It’s the kind of thing you would probably rather have in an Excel spreadsheet. With the Timestamp in column A, Latitude in column B, and Longitude in C.

But how do you make that happen without descending into a dark and dangerous black hole of copy/paste/repeat?

The RegEx Module: The findall() Function

For right now, I’m going to focus on the findall() function, which returns a list of every bit of text that matched your search pattern.

You’ll start off with the import re line. Then you’ll define a search pattern using this:

import re
testString = ""
x = re.findall("MY SEARCH PATTERN", testString)
print(x)

What Do You Mean by Search Pattern?

The search pattern, in its most basic format, will just return a string if you ask it to. Let’s try it with the first line of my JSON data:

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

A Simple Example
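Here’s roughly what that looks like as code. It’s a minimal sketch that assumes the sample line above is the whole input, and the pattern is nothing but plain text:

```python
import re

# First line of the location data, as quoted above
testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# A pattern with no special characters only matches itself
x = re.findall("latitudeE7", testString)
print(x)  # ['latitudeE7']
```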

That by itself—not helpful. But combine it with some special characters, and you can grab all of your coordinates.

Special Characters?

In JSON, the latitude is formatted like so:

latitudeE7 : 258965182

Basically, it’s the text “latitudeE7 : ” followed by 9 digits. You can search for that using a special sequence that tells Python to look for a digit, or 9 of them.

The special sequence for a digit is “\d”.

Slightly more complicated example
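As a sketch, here’s the latitude search written out with nine \d sequences in a row. The r in front of the pattern is my addition: it marks a raw string, so Python passes the backslashes to the regex engine untouched.

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# The literal label, then one \d for each of the nine digits
x = re.findall(r"latitudeE7 : \d\d\d\d\d\d\d\d\d", testString)
print(x)  # ['latitudeE7 : 258965182']
```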

Cool, right? There are also metacharacters that can bulk up your search, allowing you to be more vague and Python to catch more things.

For instance, you can use the “|” character to indicate “either or.”

Either Or Example
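A small sketch of the either/or idea, using the same sample line. The two alternatives here are just the two coordinate labels; any two patterns would work the same way.

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Match whichever of the two labels shows up, in the order they appear
x = re.findall("latitudeE7|longitudeE7", testString)
print(x)  # ['latitudeE7', 'longitudeE7']
```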

What’s a Wildcard?

Possibly the coolest metacharacter in the RegEx toolbox is the period. It’s the equivalent of the joker in a game of poker. The wildcard can stand for any character (except a new line), so if you don’t know exactly how something is spelled out but you do know how many characters it contains, you’re in business.

For instance, my JSON latitude keys generally end in E7. But I don’t know that they all do, and I don’t want to spend time combing through all my data to make sure. So instead, I’m just replacing those characters with dots.

Wildcard
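Sketched out in code, with two dots standing in for the “E7” at the end of the label:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Each dot matches exactly one character, whatever it happens to be
x = re.findall(r"latitude.. : \d\d\d\d\d\d\d\d\d", testString)
print(x)  # ['latitudeE7 : 258965182']
```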

A second really important metacharacter is the ?, which will match 0 or 1 instances of whatever is placed before it.
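Here’s a quick sketch of the ? at work on the sample line. The same -? lets one pattern shape handle a number with a minus sign and a number without one:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# -? means "a minus sign if there is one, nothing if there isn't"
lat = re.findall(r"latitudeE7 : -?\d\d\d\d\d\d\d\d\d", testString)
lon = re.findall(r"longitudeE7 : -?\d\d\d\d\d\d\d\d\d", testString)
print(lat)  # ['latitudeE7 : 258965182']
print(lon)  # ['longitudeE7 : -801625674']
```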

The Problem with the *

Another magic metacharacter is the asterisk. When you put an *, Python will match 0 or more instances of whatever came before it.

Whoa, Nelly, you might say. What happens if you combine the wildcard period with the asterisk? That means any character, any number of times. On its own, this just returns the entire string (findall also tacks an empty match onto the end of the list).

Asterisk
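A minimal sketch of the dot-plus-asterisk combo on the sample line. The empty string at the end of the list is findall reporting one last zero-length match after it has already swallowed everything:

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# .* grabs as many characters as it can, which here is the whole line
x = re.findall(".*", testString)
print(x[0] == testString)  # True
print(len(x))              # 2 (the whole line, then an empty match)
```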

An important thing to remember here is it won’t return every possible substring that fits the pattern (because that would generate a shit ton of results). It’ll return the longest one it can find.

More Asterisks

Useful if you know what’s at the start and finish of your desired substring, especially if you don’t know how many characters will be in between.
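To sketch that out: the pattern below starts at “timestampMs” and ends at “E7”. There are two E7s in the sample line, and because the asterisk grabs the longest match it can, the result runs to the second one.

```python
import re

testString = " locations : [ {timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7 : -801625674,accuracy : 23,altitude : -26,verticalAccuracy : 2"

# Known start, known end, anything at all in between
x = re.findall("timestampMs.*E7", testString)
print(x)  # ['timestampMs : 1548600561029,latitudeE7 : 258965182,longitudeE7']
```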

So, What’s the Full Script Look Like?

First, a short story. My initial attempt at this script had a massive bug in it because, as you might notice from the screenshots above, I tailored the RegEx to my JSON coordinates. My JSON coordinates all live in the United States, where the latitude is positive and the longitude is negative. It wasn’t until someone commented that taking away the negative sign before the longitude fixed the script for them that I realized the script was useless for anyone outside of North America. My takeaway from the experience: Just because a script works for you doesn’t mean it’s not broken. Also, latitude and longitude are confusing.

TL;DR: I accidentally excluded 3/4 of the world’s coordinates by including a - in a RegEx search pattern. Oops.
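Here’s a small sketch of that bug. The two longitude strings are made up for illustration (one west of Greenwich, one east), and the “strict” pattern is my reconstruction of the mistake, not the original script:

```python
import re

# Hypothetical sample values, not taken from real location data
west = "longitudeE7 : -801625674"
east = "longitudeE7 : 134049540"

strict = r"longitudeE7 : -\d\d\d\d\d\d\d\d\d"     # minus sign required
flexible = r"longitudeE7 : -?\d\d\d\d\d\d\d\d\d"  # minus sign optional

print(re.findall(strict, west))    # ['longitudeE7 : -801625674']
print(re.findall(strict, east))    # []
print(re.findall(flexible, east))  # ['longitudeE7 : 134049540']
```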

Moving on.

If you haven’t already, check out this primer on how Google Location JSON data is formatted. Basically, the file is one giant key/value pair. The key is “locations” and the value is a massive array of nested JSON objects containing coordinates and timestamps. For our purposes, we’re only going to register the first set of coordinates and their timestamp.
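To make that structure concrete, here’s a sketch with a one-record stand-in for the file. The sample string is hypothetical: it’s trimmed to a few fields from the line quoted earlier, and I’m assuming the timestamp is stored as a quoted string.

```python
import json

# Hypothetical one-record sample; a real export holds many more records and fields
sample = '{"locations": [{"timestampMs": "1548600561029", "latitudeE7": 258965182, "longitudeE7": -801625674, "accuracy": 23}]}'

data = json.loads(sample)
print(list(data.keys()))  # ['locations']

# The value under "locations" is a list of dictionaries, one per reading
first = data["locations"][0]
print(first["latitudeE7"])  # 258965182
```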

First, we’ll use json.load() to convert the JSON file to a Python dictionary, and we’ll prep a csv. Change the filepaths to wherever you’ve got the JSON file stored (I also changed the extension to .txt) and where you want the csv to populate.

import re
import json
import csv

JSON_data="C:/Location History.txt"
CSV_file="C:/Location History.csv"

with open(JSON_data) as f:
    data = json.load(f)
csv_f=open(CSV_file, 'w')

Now we’ve got a Python dictionary called data, and all its associated information is under a single key, “locations”. Let’s loop through that and have regex scrape out the timestamp, latitude and longitude. The -? before the digits will catch a negative sign if it pops up before the longitude or latitude.

for x in range(len(data["locations"])):
    # str() turns each location dictionary into text the patterns can search
    regex_Time=re.findall(r".timestampMs.: .\d\d\d\d\d\d\d\d\d\d\d\d\d.", str(data["locations"][x]))
    regex_Lat=re.findall(r".latitude...: -?\d\d\d\d\d\d\d\d\d", str(data["locations"][x]))
    regex_Long=re.findall(r".longitude...: -?\d\d\d\d\d\d\d\d\d", str(data["locations"][x]))
    csv_f.write(regex_Time[0] + "," + regex_Lat[0] + "," + regex_Long[0]+'\n')

# close() needs its parentheses, otherwise the file is never actually closed
csv_f.close()
print('Done')

Here’s a summary of how the whole script ties together:

Code Explained

After running it, I had a csv file with 219336 results. Which is… Still not pretty, but keep your eyes peeled for another update in the near future.
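One thing to watch for: if a record is missing any of the three fields, findall hands back an empty list and the [0] lookup raises an IndexError. Here’s a hedged sketch of one way to skip those records instead of crashing. The locations list is a hypothetical stand-in for data["locations"], and its second record is invented to lack timestampMs.

```python
import re

# Hypothetical stand-in for data["locations"]; the second record has no timestampMs
locations = [
    {"timestampMs": "1548600561029", "latitudeE7": 258965182, "longitudeE7": -801625674},
    {"latitudeE7": 258965182, "longitudeE7": -801625674},
]

rows = []
skipped = 0
for record in locations:
    text = str(record)
    regex_Time = re.findall(r".timestampMs.: .\d\d\d\d\d\d\d\d\d\d\d\d\d.", text)
    regex_Lat = re.findall(r".latitude...: -?\d\d\d\d\d\d\d\d\d", text)
    regex_Long = re.findall(r".longitude...: -?\d\d\d\d\d\d\d\d\d", text)
    # An empty list is falsy, so this only keeps records where all three matched
    if regex_Time and regex_Lat and regex_Long:
        rows.append(regex_Time[0] + "," + regex_Lat[0] + "," + regex_Long[0])
    else:
        skipped += 1

print(len(rows), skipped)  # 1 1
```

Counting the skipped records at least tells you how much of the file the patterns failed to read.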

Here’s an easily copy/pastable gist of the script. Happy data diving!

7 Replies to “Python: What the heck is Regex?”

  1. Still the same problem here List index out of range

    c:/Local/Temp/JSON2CSV-Google_Location_History.py
    Traceback (most recent call last):
    File "c:/Local/Temp/JSON2CSV-Google_Location_History.py", line 16, in <module>
    csv_f.write(regex_Time[0] + "," + regex_Lat[0] + "," + regex_Long[0]+'\n')
    IndexError: list index out of range

    1. Hi ChoasTyp! Thanks for the link–I just saw the rundown on the bug over at Github. Looks like Google isn’t using UNIX time anymore. And they’ve got a source field now. I’m going to have to write an updated tutorial on it.
