Mark Needham

Thoughts on Software Development

Archive for the ‘python’ tag

Luigi: Defining dynamic requirements (on output files)

without comments

In my last blog post I showed how to convert a JSON document containing meetup groups into a CSV file using Luigi, the Python library for building data pipelines. As well as creating that CSV file I wanted to go back to the meetup.com API and download all the members of those groups.

This was a rough flow of what I wanted to do:

  • Take JSON document containing all groups
  • Parse that document and for each group:
    • Call the /members endpoint
    • Save each one of those files as a JSON file
  • Iterate over all those JSON files and create a members CSV file

In the previous post we created the GroupsToJSON task which calls the /groups endpoint on the meetup API and creates the file /tmp/groups.json.

Our new task has that as its initial requirement:

class MembersToCSV(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

But we also want to create a requirement on a task that will make those calls to the /members endpoint and store the result in a JSON file.

One of the patterns that Luigi imposes on us is that each task should only create one file, so we actually have a requirement on a collection of tasks rather than just one. It took me a little while to get my head around that!

We don’t know the parameters of those tasks at compile time – we can only calculate them by parsing the JSON file produced by GroupsToJSON.

In Luigi terminology what we want to create is a dynamic requirement. A dynamic requirement is defined inside the run method of a task and can rely on the output of any tasks specified in the requires method, which is exactly what we need.

This code does the delegating part of the job:

class MembersToCSV(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
 
    def run(self):
        outputs = []
        for input in self.input():
            with input.open('r') as group_file:
                groups_json = json.load(group_file)
                groups = [str(group['id']) for group in groups_json]
 
 
                for group_id in groups:
                    members = MembersToJSON(group_id, self.key)
                    outputs.append(members.output().path)
                    yield members
 
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

Inside our run method we iterate over the output of GroupsToJSON (which is our input), yield a MembersToJSON task for each group, and collect those tasks’ output paths in the outputs list that we’ll use later.

MembersToJSON looks like this:

class MembersToJSON(luigi.Task):
    group_id = luigi.IntParameter()
    key = luigi.Parameter()
 
 
    def run(self):
        results = []
        uri = "https://api.meetup.com/2/members?&group_id={0}&key={1}".format(self.group_id, self.key)
        # page through the /members endpoint until there's no 'next' link
        while uri is not None:
            r = requests.get(uri)
            response = r.json()
            results.extend(response["results"])
            uri = response["meta"]["next"] if response["meta"]["next"] else None
 
 
        with self.output().open("w") as output:
            json.dump(results, output)
 
    def output(self):
        return luigi.LocalTarget("/tmp/members/{0}.json".format(self.group_id))

This task generates one file per group containing a list of all the members of that group.

We can now go back to MembersToCSV and convert those JSON files into a single CSV file:

class MembersToCSV(luigi.Task):
    out_path = "/tmp/members.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
 
    def run(self):
        outputs = []
        for input in self.input():
            with input.open('r') as group_file:
                groups_json = json.load(group_file)
                groups = [str(group['id']) for group in groups_json]
 
 
                for group_id in groups:
                    members = MembersToJSON(group_id, self.key)
                    outputs.append(members.output().path)
                    yield members
 
        with self.output().open("w") as output:
            writer = csv.writer(output, delimiter=",")
            writer.writerow(["id", "name", "joined", "topics", "groupId"])
 
            for path in outputs:
                group_id = path.split("/")[-1].replace(".json", "")
                with open(path) as json_data:
                    d = json.load(json_data)
                    for member in d:
                        topic_ids = ";".join([str(topic["id"]) for topic in member["topics"]])
                        if "name" in member:
                            writer.writerow([member["id"], member["name"], member["joined"], topic_ids, group_id])
 
    def output(self):
        return luigi.LocalTarget(self.out_path)
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

We then just need to add our new task as a requirement of the wrapper task:
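Based on the Meetup wrapper task from the previous post, that ends up looking something like this:

import os
 
class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")
 
    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")
 
        yield GroupsToCSV(key, lat, lon)
        yield MembersToCSV(key, lat, lon)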

And we’re ready to roll:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup --workers 3

We’ve set the number of workers here because the calls to the /members endpoint can be executed in parallel and there are ~600 of them to make.

All the code from both blog posts is available as a gist if you want to play around with it.

Any questions/advice let me know in the comments or I’m @markhneedham on twitter.

Written by Mark Needham

March 28th, 2017 at 5:39 am

Posted in Python


Luigi: An ExternalProgramTask example – Converting JSON to CSV

without comments

I’ve been playing around with the Python library Luigi, which is used to build pipelines of batch jobs, and I struggled to find an example of an ExternalProgramTask, so this is my attempt at filling that void.

Luigi - the Python data library for building data science pipelines

I’m building a little data pipeline to get data from the meetup.com API and put it into CSV files that can be loaded into Neo4j using the LOAD CSV command.

The first task I created calls the /groups endpoint and saves the result into a JSON file:

import luigi
import requests
import json
from collections import Counter
 
class GroupsToJSON(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def run(self):
        seed_topic = "nosql"
        uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(seed_topic, self.lat, self.lon, self.key)
 
        r = requests.get(uri)
        all_topics = [topic["urlkey"]  for result in r.json()["results"] for topic in result["topics"]]
        c = Counter(all_topics)
 
        topics = [entry[0] for entry in c.most_common(10)]
 
        groups = {}
        for topic in topics:
            uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(topic, self.lat, self.lon, self.key)
            r = requests.get(uri)
            for group in r.json()["results"]:
                groups[group["id"]] = group
 
        with self.output().open('w') as groups_file:
            json.dump(list(groups.values()), groups_file, indent=4, sort_keys=True)
 
    def output(self):
        return luigi.LocalTarget("/tmp/groups.json")

We define a few parameters at the top of the class which will be passed in when this task is executed. The most interesting lines of the run function are the last couple where we write the JSON to a file. self.output() refers to the target defined in the output function which in this case is /tmp/groups.json.

Now we need to create a task to convert that JSON file into CSV format. The jq command line tool does this job well so we’ll use that. The following task does the job:

from luigi.contrib.external_program import ExternalProgramTask
 
class GroupsToCSV(ExternalProgramTask):
    file_path = "/tmp/groups.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def program_args(self):
        return ["./groups.sh", self.input()[0].path, self.output().path]
 
    def output(self):
        return luigi.LocalTarget(self.file_path)
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

groups.sh

#!/bin/bash
 
in=${1}
out=${2}
 
echo "id,name,urlname,link,rating,created,description,organiserName,organiserMemberId" > ${out}
jq -r '.[] | [.id, .name, .urlname, .link, .rating, .created, .description, .organizer.name, .organizer.member_id] | @csv' ${in} >> ${out}

I wanted to call jq directly from the Python code but I couldn’t figure out how to do it, so putting that code in a shell script is my workaround.
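For what it’s worth, one way to skip the shell script would be to shell out to jq from a plain luigi.Task using subprocess. This is a rough sketch rather than what I actually ran – it assumes jq is on the PATH, and the class name and output path here are made up:

import subprocess
 
class GroupsToCSVDirect(luigi.Task):
    file_path = "/tmp/groups_direct.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def run(self):
        # same filter as groups.sh, but run via subprocess instead of a wrapper script
        jq_filter = '.[] | [.id, .name, .urlname, .link, .rating, .created, .description, .organizer.name, .organizer.member_id] | @csv'
        rows = subprocess.check_output(["jq", "-r", jq_filter, self.input()[0].path]).decode("utf-8")
        with self.output().open("w") as groups_file:
            groups_file.write("id,name,urlname,link,rating,created,description,organiserName,organiserMemberId\n")
            groups_file.write(rows)
 
    def output(self):
        return luigi.LocalTarget(self.file_path)
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)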

The last piece of the puzzle is a wrapper task that launches the others:

import os
 
class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")
 
    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")
 
        yield GroupsToCSV(key, lat, lon)

Now we’re ready to run the tasks:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
DEBUG: Checking if GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   PENDING
DEBUG: Checking if GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
INFO: Running command: ./groups.sh /tmp/groups.json /tmp/groups.csv
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   Meetup()
Running Meetup
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      Meetup()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====
 
Scheduled 3 tasks of which:
* 3 ran successfully:
    - 1 GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
    - 1 GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
    - 1 Meetup()
 
This progress looks :) because there were no failed tasks or missing external dependencies
 
===== Luigi Execution Summary =====

Looks good! Let’s quickly look at our CSV file:

$ head -n10 /tmp/groups.csv 
id,name,urlname,link,rating,created,description,organiserName,organiserMemberId
1114381,"London NoSQL, MySQL, Open Source Community","london-nosql-mysql","https://www.meetup.com/london-nosql-mysql/",4.28,1208505614000,"<p>Meet others in London interested in NoSQL, MySQL, and Open Source Databases.</p>","Sinead Lawless",185675230
1561841,"Enterprise Search London Meetup","es-london","https://www.meetup.com/es-london/",4.66,1259157419000,"<p>Enterprise Search London is a meetup for anyone interested in building search and discovery experiences — from intranet search and site search, to advanced discovery applications and beyond.</p>
<p>Disclaimer: This meetup is NOT about SEO or search engine marketing.</p>
<p><strong>What people are saying:</strong></p>
<ul>
<li><span>""Join this meetup if you have a passion for enterprise search and user experience that you would like to share with other able-minded practitioners."" — Vegard Sandvold</span></li>
<li><span>""Full marks for vision and execution. Looking forward to the next Meetup."" — Martin White</span></li>
<li><span>“Consistently excellent” — Helen Lippell</span></li>
</ul>

Sweet! And what if we run it again?

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=172768377, workers=1, host=Marks-MBP-4, username=markneedham, pid=4531) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====
 
Scheduled 1 tasks of which:
* 1 present dependencies were encountered:
    - 1 Meetup()
 
Did not run any tasks
This progress looks :) because there were no failed tasks or missing external dependencies
 
===== Luigi Execution Summary =====

As expected nothing happens since our dependencies are already satisfied, and we have our first Luigi pipeline up and running.
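Luigi works this out by checking whether each task’s output already exists. A quick way to see that from a Python shell (assuming the tasks from the blog module are imported; the parameter values just mirror the run above):

task = GroupsToCSV(key="xxx", lat="51.5072", lon="0.1275")
print(task.output().exists())  # True - /tmp/groups.csv is already there
print(task.complete())         # True - the default complete() just checks that the output exists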

Written by Mark Needham

March 25th, 2017 at 2:09 pm

Posted in Python


Python 3: TypeError: Object of type ‘dict_values’ is not JSON serializable

without comments

I’ve recently upgraded to Python 3 (I know, took me a while!) and realised that one of my scripts that writes JSON to a file no longer works!

This is a simplified version of what I’m doing:

>>> import json
>>> x = {"mark": {"name": "Mark"}, "michael": {"name": "Michael"}  } 
>>> json.dumps(x.values())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'dict_values' is not JSON serializable

Python 2.7 would be perfectly happy:

>>> json.dumps(x.values())
'[{"name": "Michael"}, {"name": "Mark"}]'

The difference is in the results returned by the values method:

# Python 2.7.10
>>> x.values()
[{'name': 'Michael'}, {'name': 'Mark'}]
 
# Python 3.6.0
>>> x.values()
dict_values([{'name': 'Mark'}, {'name': 'Michael'}])
>>>

Python 3 no longer returns a list; instead we get a dict_values view wrapped around the data.

Luckily this is easy to resolve – we just need to wrap the call to values with a call to list:

>>> json.dumps(list(x.values()))
'[{"name": "Mark"}, {"name": "Michael"}]'

This version works with Python 2.7 as well so if I accidentally run the script with an old version the world isn’t going to explode.
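Another option that seems to work is telling json.dumps how to serialise the dict_values type itself via the default parameter:

>>> json.dumps(x.values(), default=list)
'[{"name": "Mark"}, {"name": "Michael"}]'

default is only consulted for objects the encoder doesn’t already know how to handle, so on Python 2.7 – where values() returns a plain list – it never gets called and everything still works.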

Written by Mark Needham

March 19th, 2017 at 4:40 pm

Posted in Python


scikit-learn: Clustering and the curse of dimensionality

with one comment

In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.

But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”).

Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
def distances(a, b):
    return np.linalg.norm(a-b), cosine_similarity([a, b])[0][1]
 
def mixed(n_zeros, n_ones):
    return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0)
 
def ones(n_ones):
    return np.repeat([1], n_ones)
 
print distances(mixed(2, 2), ones(4))
print distances(mixed(3, 3), ones(6))
print distances(mixed(50, 50), ones(100))
print distances(mixed(300, 300), ones(600))
 
(1.4142135623730951, 0.70710678118654746)
(1.7320508075688772, 0.70710678118654768)
(7.0710678118654755, 0.70710678118654757)
(17.320508075688775, 0.70710678118654746)

The euclidean distance for the 600 item vector (17.32) is more than 12x larger than for the one containing 4 items (1.41), despite both pairs having exactly the same cosine similarity score.
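A quick way to convince ourselves that the inflation is purely down to the number of dimensions: if we normalise each vector to unit length first, the euclidean distance stops growing. A rough sketch building on the functions above:

def normalised_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.linalg.norm(a - b)
 
print normalised_distance(mixed(2, 2), ones(4))
print normalised_distance(mixed(300, 300), ones(600))
# both come out at ~0.765, which lines up with the identical cosine similarity scores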

Having convinced myself that reducing the dimensionality of the vectors could make a difference I reduced the size of the episodes vectors using the TruncatedSVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episodes vectors:

from sklearn.decomposition import TruncatedSVD
 
n_components = 2
reducer = TruncatedSVD(n_components=n_components)
reducer.fit(all)
new_all = reducer.transform(all)
print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum()))
 
2: Percentage explained: 0.124579183633

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would be an interesting place to start. I’m not sure exactly what the reducer.explained_variance_ratio_ value means so I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.
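One thing that does help a little is seeing how the explained variance grows as we add components. A quick sketch, re-using the all matrix of episodes from the previous post:

for n_components in [2, 3, 5, 10, 30, 100]:
    reducer = TruncatedSVD(n_components=n_components)
    reducer.fit(all)
    print("%d: Percentage explained: %s" % (n_components, reducer.explained_variance_ratio_.sum()))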

For now though let’s try out the clustering algorithm again and see how it gets on:

from sklearn.cluster import KMeans
from sklearn import metrics
 
for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(new_all)
    silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000)
 
    print n_clusters, silhouette_avg
 
2 0.559681096025
3 0.498456585461
4 0.524704352941
5 0.441580592398
6 0.44703058946
7 0.447895331824
8 0.433698007009
9 0.459874485986

This time our silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score:

[Screenshot: table showing how to interpret silhouette scores]

We have a couple of cluster sizes which fit into the ‘reasonable structure’ category and a few just on the edge of fitting into it.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Any questions/thoughts/advice do let me know in the comments.

Written by Mark Needham

August 27th, 2016 at 8:32 pm

Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

without comments

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:

:play http://guides.neo4j.com/got

Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based on character appearances. We’ll be using scikit-learn‘s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

Episode 1 = [1, 1, 0]
Episode 2 = [0, 1, 1]

We could then calculate the cosine similarity between these two episodes like this:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> one = [1,1,0]
>>> two = [0,1,1]
 
>>> cosine_similarity([one, two])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t mentioned at all – each character is implicitly represented by a position in the array. This means that when we use our real dataset we need to ensure that the characters are in the same order for each episode, otherwise the calculation will be meaningless!

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

from neo4j.v1 import GraphDatabase, basic_auth
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo"))
session = driver.session()
 
rows = session.run("""
    MATCH (c:Character), (e:Episode)
    OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e)
    RETURN e, c, appearance
    ORDER BY e.id, c.id""")

We can iterate through the rows to see what the output looks like:

>>> for row in rows:
        print row
 
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

episodes = {}
for row in rows:
    if episodes.get(row["e"]["id"]) is None:
        if row["appearance"] is None:
            episodes[row["e"]["id"]] = [0]
        else:
            episodes[row["e"]["id"]] = [1]
    else:
        if row["appearance"] is None:
            episodes[row["e"]["id"]].append(0)
        else:
            episodes[row["e"]["id"]].append(1)

Here’s an example of one entry in the matrix:

>>> len(episodes)
60
 
>>> len(episodes[1])
638
 
>>> episodes[1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.
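One way to make that less tricky would be to record the character names in the same order while building the matrix. Here’s a sketch of an alternative version of the loop above (it assumes the query has been re-run, since the rows cursor will already have been consumed):

characters = []
episodes = {}
for row in rows:
    appeared = 0 if row["appearance"] is None else 1
    episodes.setdefault(row["e"]["id"], []).append(appeared)
    if len(episodes) == 1:
        # the characters come back in the same order for every episode,
        # so we only need to record their names once
        characters.append(row["c"]["name"])
 
# characters[i] now tells us which character position i in each vector refers to
print [name for name, flag in zip(characters, episodes[1]) if flag == 1]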

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

>>> all = episodes.values()
 
>>> cosine_similarity(all[0:1], all)[0]
array([ 1.        ,  0.69637306,  0.48196269,  0.54671752,  0.48196269,
        0.44733753,  0.31707317,  0.42340087,  0.34989921,  0.43314808,
        0.36597766,  0.18421252,  0.30961158,  0.2328101 ,  0.30616181,
        0.41905818,  0.36842504,  0.35338088,  0.18376917,  0.3569686 ,
        0.2328101 ,  0.34539847,  0.25043516,  0.31707317,  0.25329221,
        0.33342786,  0.34921515,  0.2174909 ,  0.2533473 ,  0.28429311,
        0.23026565,  0.22310537,  0.22365301,  0.23816275,  0.28242289,
        0.16070148,  0.24847093,  0.21434648,  0.03582872,  0.21189672,
        0.15460414,  0.17161693,  0.15460414,  0.17494961,  0.1234662 ,
        0.21426863,  0.21434648,  0.18748505,  0.15308091,  0.20161946,
        0.19877675,  0.30920827,  0.21058466,  0.19127301,  0.24607943,
        0.18033393,  0.17734311,  0.16296707,  0.18740851,  0.23995201])

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
0 1.0
1 0.696373059207
3 0.546717521051
2 0.481962692712
4 0.481962692712

Or we can see how similar the last episode of season 6 is compared to the others:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
59 1.0
52 0.500670191678
46 0.449085146211
43 0.448218732478
49 0.446296233312

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following code creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

>>> episode_mapping = {}
>>> for idx, episode_id in enumerate(episodes):
        episode_mapping[idx] = episode_id
 
>>> for idx, episode_id in enumerate(episodes):
        similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0]
        for other_idx, similarity_score in enumerate(similarity_matrix):
            other_episode_id = episode_mapping[other_idx]
            print episode_id, other_episode_id, similarity_score
            if episode_id != other_episode_id:
                session.run("""
                    MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}})
                    MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2)
                    ON CREATE SET similarity.score = {similarityScore}
                    """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score})
 
>>> session.close()

The episode_mapping dictionary is needed to map from matrix indices back to episode ids, e.g. index 0 corresponds to episode 1.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score DESC
LIMIT 10
 
╒═════╤════╤══════════════════╕
│ep1  │ep2 │score             │
╞═════╪════╪══════════════════╡
│S1E2 │S1E1│0.6963730592072543│
├─────┼────┼──────────────────┤
│S1E4 │S1E3│0.6914173051223086│
├─────┼────┼──────────────────┤
│S1E9 │S1E8│0.6869464497590777│
├─────┼────┼──────────────────┤
│S2E10│S2E8│0.6869037302955034│
├─────┼────┼──────────────────┤
│S3E7 │S3E6│0.6819943394704735│
├─────┼────┼──────────────────┤
│S2E7 │S2E6│0.6813598225089799│
├─────┼────┼──────────────────┤
│S1E10│S1E9│0.6796436827080401│
├─────┼────┼──────────────────┤
│S1E5 │S1E4│0.6698105143372364│
├─────┼────┼──────────────────┤
│S1E10│S1E8│0.6624062584864754│
├─────┼────┼──────────────────┤
│S4E5 │S4E4│0.6518358737330705│
└─────┴────┴──────────────────┘

And the least similar?

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score
LIMIT 10
 
╒════╤════╤═══════════════════╕
│ep1 │ep2 │score              │
╞════╪════╪═══════════════════╡
│S4E9│S1E5│0                  │
├────┼────┼───────────────────┤
│S4E9│S1E6│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E2│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E4│0                  │
├────┼────┼───────────────────┤
│S5E6│S4E9│0                  │
├────┼────┼───────────────────┤
│S6E8│S4E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E6│0                  │
├────┼────┼───────────────────┤
│S3E9│S2E9│0.03181423814878889│
├────┼────┼───────────────────┤
│S4E9│S1E1│0.03582871819500093│
└────┴────┴───────────────────┘

The output of this query suggests that there are no common characters between 8 pairs of episodes which at first glance sounds surprising. Let’s write a query to check that finding:

MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode)
WHERE episode1.season = 4 AND episode1.number = 9 AND episode2.season = 1 AND episode2.number = 5
return episode1, episode2
 
(no changes, no rows)

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up in 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

Written by Mark Needham

August 22nd, 2016 at 9:12 pm

Python: matplotlib, seaborn, virtualenv – Python is not installed as a framework

without comments

Over the weekend I was following The Marketing Technologist’s content based recommender tutorial but ran into the following exception when trying to import the seaborn library:

$ python 5_content_based_recommender/run.py 
Traceback (most recent call last):
  File "5_content_based_recommender/run.py", line 14, in <module>
    import seaborn as sns
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/__init__.py", line 6, in <module>
    from .rcmod import *
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/rcmod.py", line 8, in <module>
    from . import palettes, _orig_rc_params
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/palettes.py", line 12, in <module>
    from .utils import desaturate, set_hls_values, get_color_cycle
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/utils.py", line 12, in <module>
    import matplotlib.pyplot as plt
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/pyplot.py", line 114, in <module>
    _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
    globals(),locals(),[backend_name],0)
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/backends/backend_macosx.py", line 24, in <module>
    from matplotlib.backends import _macosx
RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

We can see from the stacktrace that seaborn calls matplotlib so that’s where the problem lies. There’s even a page on the matplotlib website suggesting some workarounds.

I’ve come across this error before and been unable to get any of the suggestions to work, but this time I was successful. I needed to create the following function in my bash profile file:


~/.bash_profile

function frameworkpython {
    if [[ ! -z "$VIRTUAL_ENV" ]]; then
        PYTHONHOME=$VIRTUAL_ENV /usr/bin/python "$@"
    else
        /usr/bin/python "$@"
    fi
}

And call that function instead of my virtualenv’s python:

$ frameworkpython 5_content_based_recommender/run.py

This time the matplotlib visualisation works:

[Screenshot: the matplotlib/seaborn visualisation from the tutorial]

#win

Written by Mark Needham

August 14th, 2016 at 6:56 pm

Posted in Python


scikit-learn: TF/IDF and cosine similarity for computer science papers

without comments

A couple of months ago I downloaded the meta data for a few thousand computer science papers so that I could try and write a mini recommendation engine to tell me what paper I should read next.

Since I don’t have any data on which people read each paper a collaborative filtering approach is ruled out, so I thought I could try content based filtering instead.

Let’s quickly check the Wikipedia definition of content based filtering:

In a content-based recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes.

In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present).

We’re going to focus on the finding similar items part of the algorithm and we’ll start simple by calculating the similarity of items based on their titles. We’d probably get better results if we used the full text of the papers or at least the abstracts but that data isn’t as available.

We’re going to take the following approach to work out the similarity between any pair of papers:

for each paper:
  generate a TF/IDF vector of the terms in the paper's title
  calculate the cosine similarity of each paper's TF/IDF vector with every other paper's TF/IDF vector

This is very easy to do using the Python scikit-learn library and I’ve actually done the first part of the process while doing some exploratory analysis of interesting phrases in the TV show How I Met Your Mother.

Let’s get started.

We’ve got one file per paper which contains the title of the paper. We first need to iterate through that directory and build an array containing the papers:

import glob
 
corpus = []
for file in glob.glob("papers/*.txt"):
    with open(file, "r") as paper:
        corpus.append((file, paper.read()))

Next we’ll build a TF/IDF matrix for each paper:

from sklearn.feature_extraction.text import TfidfVectorizer
 
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')
tfidf_matrix =  tf.fit_transform([content for file, content in corpus])

Next we’ll write a function that will find us the top n similar papers based on cosine similarity:

from sklearn.metrics.pairwise import linear_kernel
 
def find_similar(tfidf_matrix, index, top_n = 5):
    cosine_similarities = linear_kernel(tfidf_matrix[index:index+1], tfidf_matrix).flatten()
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index]
    return [(index, cosine_similarities[index]) for index in related_docs_indices][0:top_n]

Let’s try it out:

>>> corpus[1619]
('papers/221215.txt', 'TOTEM: a reliable ordered delivery protocol for interconnected local-area networks')
 
>>> for index, score in find_similar(tfidf_matrix, 1619):
       print score, corpus[index]
 
0.917540397202 ('papers/852338.txt', 'A reliable ordered delivery protocol for interconnected local area networks')
0.248736845733 ('papers/800897.txt', 'Interconnection of broadband local area networks')
0.207309089025 ('papers/103726.txt', 'High-speed local area networks and their performance: a survey')
0.204166719869 ('papers/161736.txt', 'High-speed switch scheduling for local-area networks')
0.198514433132 ('papers/627363.txt', 'Algorithms for Distributed Query Processing in Broadcast Local Area Networks')

It’s pretty good for finding duplicate papers!

>>> corpus[1599]
('papers/217470.txt', 'A reliable multicast framework for light-weight sessions and application level framing')
 
>>> for index, score in find_similar(tfidf_matrix, 1599):
       print score, corpus[index]
 
1.0            ('papers/270863.txt', 'A reliable multicast framework for light-weight sessions and application level framing')
0.139643354066 ('papers/218325.txt', 'The KryptoKnight family of light-weight protocols for authentication and key distribution')
0.134763799612 ('papers/1251445.txt', 'ALMI: an application level multicast infrastructure')
0.117630311817 ('papers/125160.txt', 'Ordered and reliable multicast communication')
0.117630311817 ('papers/128741.txt', 'Ordered and reliable multicast communication')

But sometimes it identifies duplicates that aren’t identical:

>>> corpus[5784]
('papers/RFC2616.txt', 'Hypertext Transfer Protocol -- HTTP/1.1')
 
>>> for index, score in find_similar(tfidf_matrix, 5784):
       print score, corpus[index]
 
1.0 ('papers/RFC1945.txt', 'Hypertext Transfer Protocol -- HTTP/1.0')
1.0 ('papers/RFC2068.txt', 'Hypertext Transfer Protocol -- HTTP/1.1')
0.232865694216 ('papers/131844.txt', 'XTP: the Xpress Transfer Protocol')
0.138876842331 ('papers/RFC1866.txt', 'Hypertext Markup Language - 2.0')
0.104775586915 ('papers/760249.txt', 'On the transfer of control between contexts')

Having said that, if you read and liked the HTTP 1.0 RFC then the HTTP 1.1 RFC probably isn’t a bad recommendation.

There are obviously also some papers that get identified as being similar which aren’t. I created a CSV file containing 5 similar papers for each paper as long as the similarity is greater than 0.5. You can see the script that generates that file on github as well.
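For illustration, here’s a sketch of how that could be done with the find_similar function from above (the output file name here is made up; the real script is in the github repo):

import csv
 
with open("/tmp/similar_papers.csv", "w") as similarities_file:
    writer = csv.writer(similarities_file, delimiter=",")
    writer.writerow(["paper", "similar_paper", "score"])
    for index, (file_name, title) in enumerate(corpus):
        for other_index, score in find_similar(tfidf_matrix, index):
            if score > 0.5:
                writer.writerow([title, corpus[other_index][1], score])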

That’s as far as I’ve got for now but there are a couple of things I’m going to explore next:

  • How do we know if the similarity suggestions are any good? How do we measure good? Would using a term counting vector work better than TF/IDF?
  • Similarity based on abstracts as well as/instead of titles

All the code from this post for calculating similarities and writing them to CSV is on github as well so feel free to play around with it.

Written by Mark Needham

July 27th, 2016 at 2:45 am

Posted in Python


Python: Scraping elements relative to each other with BeautifulSoup

without comments

Last week we hosted a Game of Thrones based intro to Cypher at the Women Who Code London meetup and in preparation had to scrape the wiki to build a dataset.

I’ve built lots of datasets this way and it’s a painless experience as long as the pages make liberal use of CSS classes and/or IDs.

Unfortunately the Game of Thrones wiki doesn’t really do that so I had to find another way to extract the data I wanted – extracting elements based on their position relative to more prominent elements on the page.

For example, I wanted to extract Arya Stark‘s allegiances which look like this on the page:

[Screenshot: the Allegiance section of Arya Stark’s wiki page]

We don’t have a direct route to her allegiances but we do have an indirect path via the h3 element with the text ‘Allegiance’.

The following code gets us the ‘Allegiance’ element:

from bs4 import BeautifulSoup
 
file_name = "Arya_Stark"
wikia = BeautifulSoup(open("data/wikia/characters/{0}".format(file_name), "r"), "html.parser")
allegiance_element = [tag for tag in wikia.find_all('h3') if tag.text == "Allegiance"]
 
> print allegiance_element
[<h3 class="pi-data-label pi-secondary-font">Allegiance</h3>]

Now we need to work out the relative position of the div containing the houses. It’s inside the same parent div so I thought it’d probably be the next sibling:

next_element = allegiance_element[0].next_sibling
 
> print next_element

Nope. Nothing! Hmmm, wonder why:

> print next_element.name, type(next_element)
None <class 'bs4.element.NavigableString'>

Ah, empty string. Maybe it’s the one after that?

next_element = allegiance_element[0].next_sibling.next_sibling
 
> print next_element.contents
[<a href="/wiki/House_Stark" title="House Stark">House Stark</a>, <br/>, <a href="/wiki/Faceless_Men" title="Faceless Men">Faceless Men</a>, u' (Formerly)']

Hoorah! After this it became a case of working out how the text was structured and pulling out what I wanted.
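For example, something along these lines pulls out just the allegiance names (a rough sketch which ignores the ‘(Formerly)’ suffix):

allegiances = [element.text for element in next_element if element.name == "a"]
 
> print allegiances
[u'House Stark', u'Faceless Men']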

The code I ended up with is on github if you want to recreate it yourself.

Written by Mark Needham

July 11th, 2016 at 6:01 am

Posted in Python


Python: BeautifulSoup – Insert tag

without comments

I’ve been scraping the Game of Thrones wiki in preparation for a meetup at Women Who Code next week and while attempting to extract character allegiances I wanted to insert missing line breaks to separate different allegiances.

I initially tried creating a line break like this:

>>> from bs4 import BeautifulSoup
>>> tag = BeautifulSoup("<br />", "html.parser")
>>> tag
<br/>

It looks like it should work but later on in my script I check the ‘name’ attribute to work out whether I’ve got a line break and it doesn’t return the value I expected it to:

>>> tag.name
u'[document]'

My script assumes it’s going to return the string ‘br’ so I needed another way of creating the tag. The following does the trick:

>>> from bs4 import Tag
>>> tag = Tag(name = "br")
>>> tag
<br></br>
>>> tag.name
'br'
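And to actually insert the tag into a document, insert_after does the job. A quick sketch with some made up markup:

from bs4 import BeautifulSoup, Tag
 
soup = BeautifulSoup("<a>House Stark</a><a>Faceless Men</a>", "html.parser")
soup.find("a").insert_after(Tag(name="br"))
# the new br element now sits between the two links in soup.contents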

That’s all for now, back to scraping for me!

Written by Mark Needham

June 30th, 2016 at 9:28 pm

Posted in Python


Python: Regex – matching foreign characters/unicode letters

without comments

I’ve been back in the land of screen scraping this week extracting data from the Game of Thrones wiki and needed to write a regular expression to pull out characters and actors.

Here are some examples of the format of the data:

Peter Dinklage as Tyrion Lannister
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
Filip Lozić as Young Nobleman
Morgan C. Jones as a Braavosi captain
Adewale Akinnuoye-Agbaje as Malko

So the pattern is:

<actor> as <character>

optionally followed by some other text that we’re not interested in.

The output I want to get is:

Peter Dinklage, Tyrion Lannister
Daniel Naprous, Oznak zo Pahl
Filip Lozić, Young Nobleman
Morgan C. Jones, a Braavosi captain
Adewale Akinnuoye-Agbaje, Malko

I started using the ‘split’ command on the word ‘as’ but that broke down when I realised some of the characters had the letters ‘as’ in the middle of their name. So regex it is!

This was my first attempt:

import re
 
strings = [
    "Peter Dinklage as Tyrion Lannister",
    "Filip Lozić as Young Nobleman",
    "Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)",
    "Morgan C. Jones as a Braavosi captain",
    "Adewale Akinnuoye-Agbaje as Malko"
]
 
regex = "([A-Za-z\-'\. ]*) as ([A-Za-z\-'\. ]*)"
 
for string in strings:
    print string
    match = re.match( regex, string)
    if match is not None:
        print match.groups()
    else:
        print "FAIL"
	print ""
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Filip Lozić as Young Nobleman
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

It works for 4 of the 5 scenarios but not for Filip Lozić. The ‘ć’ character causes the issue so we need to be able to match foreign characters, which the charset I defined in the regex doesn’t capture.

I came across this Stack Overflow post which said that in some regex libraries you can use ‘\p{L}’ to match all letters. I gave that a try:

regex = "([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)"

And then re-ran the script:

Peter Dinklage as Tyrion Lannister
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
FAIL
 
Filip Lozić as Young Nobleman
FAIL
 
Morgan C. Jones as a Braavosi captain
FAIL
 
Adewale Akinnuoye-Agbaje as Malko
FAIL

Hmmm, not sure if I did it wrong or if that isn’t available in Python. I’ll assume the latter but feel free to correct me in the comments and I’ll update the post.
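As an aside, I believe the third party regex module (pip install regex) does understand \p{L}, so that would be another option. A sketch, using a unicode literal so the ‘ć’ is treated as a single letter:

import regex
 
pattern = u"([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)"
print regex.match(pattern, u"Filip Lozi\u0107 as Young Nobleman").groups()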

I went searching again and found this post which suggested another approach:

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means “Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore”.

Let’s try plugging that in:

regex = "([A-Za-z\-'\.^\W\d_ ]*) as ([A-Za-z\-'\.^\W\d_ ]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous as Oznak zo Pahl(credited', 'Stunt Performer)')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

So that’s fixed Filip but now Daniel Naprous is being incorrectly parsed.

For Attempt #4 I decided to try excluding what I don’t want instead:

regex = "([^0-9\(]*) as ([^0-9\(]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

That does the job but has exposed my lack of regex skillz. If you know a better way let me know in the comments.

Written by Mark Needham

June 18th, 2016 at 7:38 am

Posted in Python
