Mark Needham

Thoughts on Software Development

Archive for the ‘machine-learning’ tag

scikit-learn: Clustering and the curse of dimensionality

with one comment

In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.

But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”).

Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def distances(a, b):
    return np.linalg.norm(a-b), cosine_similarity([a, b])[0][1]
def mixed(n_zeros, n_ones):
    return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0)
def ones(n_ones):
    return np.repeat([1], n_ones)
print distances(mixed(2, 2), ones(4))
print distances(mixed(3, 3), ones(6))
print distances(mixed(50, 50), ones(100))
print distances(mixed(300, 300), ones(600))
(1.4142135623730951, 0.70710678118654746)
(1.7320508075688772, 0.70710678118654768)
(7.0710678118654755, 0.70710678118654757)
(17.320508075688775, 0.70710678118654746)

The euclidean distance for the 600 item vector is 17x larger than for the one containing 4 items despite having the same similarity score.

Having convinced myself that reducing the dimensionality of the vectors could make a difference I reduced the size of the episodes vectors using the the Truncated SVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episodes vectors:

from sklearn.decomposition import TruncatedSVD
n_components = 2
reducer = TruncatedSVD(n_components=n_components)
new_all = reducer.transform(all)
print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum()))
2: Percentage explained: 0.124579183633

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would an interesting place to start. I’m not sure exactly what the output of the reducer.explained_variance_ratio_ function means so I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.

For now though let’s try out the clustering algorithm again and see how it gets on:

from sklearn.cluster import KMeans
for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(new_all)
    silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000)
    print n_clusters, silhouette_avg
2 0.559681096025
3 0.498456585461
4 0.524704352941
5 0.441580592398
6 0.44703058946
7 0.447895331824
8 0.433698007009
9 0.459874485986

This time out silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score:

2016 08 27 21 18 14

We have a couple of cluster sizes which fit in the ‘reasonable structure’ and a few just on the edge of fitting in that category.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Any questions/thoughts/advice do let me know in the comments.

Written by Mark Needham

August 27th, 2016 at 8:32 pm

Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

without comments

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:


Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based character appearances. We’ll be using scitkit-learn‘s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

Episode 1 = [1, 1, 0]
Episode 2 = [0, 1, 1]

We could then calculate the cosine similarity between these two episodes like this:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> one = [1,1,0]
>>> two = [0,1,1]
>>> cosine_similarity([one, two])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t even mentioned at all, they are implicitly a position in the array. This means that when we use our real dataset we need to ensure that the characters are in the same order for each episode, otherwise the calculation will be meaningless!

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

from neo4j.v1 import GraphDatabase, basic_auth
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo"))
session = driver.session()
rows ="""
    MATCH (c:Character), (e:Episode)
    OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e)
    RETURN e, c, appearance
    ORDER BY,""")

We can iterate through the rows to see what the output looks like:

>>> for row in rows:
        print row
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

episodes = {}
for row in rows:
    if episodes.get(row["e"]["id"]) is None:
        if row["appearance"] is None:
            episodes[row["e"]["id"]] = [0]
            episodes[row["e"]["id"]] = [1]
        if row["appearance"] is None:

Here’s an example of one entry in the matrix:

>>> len(episodes)
>>> len(episodes[1])
>>> episodes[1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

>>> all = episodes.values()
>>> cosine_similarity(all[0:1], all)[0]
array([ 1.        ,  0.69637306,  0.48196269,  0.54671752,  0.48196269,
        0.44733753,  0.31707317,  0.42340087,  0.34989921,  0.43314808,
        0.36597766,  0.18421252,  0.30961158,  0.2328101 ,  0.30616181,
        0.41905818,  0.36842504,  0.35338088,  0.18376917,  0.3569686 ,
        0.2328101 ,  0.34539847,  0.25043516,  0.31707317,  0.25329221,
        0.33342786,  0.34921515,  0.2174909 ,  0.2533473 ,  0.28429311,
        0.23026565,  0.22310537,  0.22365301,  0.23816275,  0.28242289,
        0.16070148,  0.24847093,  0.21434648,  0.03582872,  0.21189672,
        0.15460414,  0.17161693,  0.15460414,  0.17494961,  0.1234662 ,
        0.21426863,  0.21434648,  0.18748505,  0.15308091,  0.20161946,
        0.19877675,  0.30920827,  0.21058466,  0.19127301,  0.24607943,
        0.18033393,  0.17734311,  0.16296707,  0.18740851,  0.23995201])

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
0 1.0
1 0.696373059207
3 0.546717521051
2 0.481962692712
4 0.481962692712

Or we can see how similar the last episode of season 6 is compared to the others:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
59 1.0
52 0.500670191678
46 0.449085146211
43 0.448218732478
49 0.446296233312

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following query creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

>>> episode_mapping = {}
>>> for idx, episode_id in enumerate(episodes):
        episode_mapping[idx] = episode_id
>>> for idx, episode_id in enumerate(episodes):
        similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0]
        for other_idx, similarity_score in enumerate(similarity_matrix):
            other_episode_id = episode_mapping[other_idx]
            print episode_id, other_episode_id, similarity_score
            if episode_id != other_episode_id:
                    MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}})
                    MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2)
                    ON CREATE SET similarity.score = {similarityScore}
                    """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score})

The episode_mapping dictionary is needed to map from episode ids to indices e.g. episode 1 is at index 0.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score DESC
│ep1  │ep2 │score             │
│S1E2 │S1E1│0.6963730592072543│
│S1E4 │S1E3│0.6914173051223086│
│S1E9 │S1E8│0.6869464497590777│
│S3E7 │S3E6│0.6819943394704735│
│S2E7 │S2E6│0.6813598225089799│
│S1E5 │S1E4│0.6698105143372364│
│S4E5 │S4E4│0.6518358737330705│

And the least popular?

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score
│ep1 │ep2 │score              │
│S4E9│S1E5│0                  │
│S4E9│S1E6│0                  │
│S4E9│S4E2│0                  │
│S4E9│S2E9│0                  │
│S4E9│S2E4│0                  │
│S5E6│S4E9│0                  │
│S6E8│S4E9│0                  │
│S4E9│S4E6│0                  │

The output of this query suggests that there are no common characters between 8 pairs of episodes which at first glance sounds surprising. Let’s write a query to check that finding:

MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode)
WHERE episode1.season = 4 AND episode1.number = 9 AND episode2.season = 1 AND episode2.number = 5
return episode1, episode2
(no changes, no rows)

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up on 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

Written by Mark Needham

August 22nd, 2016 at 9:12 pm

Mahout: Exception in thread “main” java.lang.IllegalArgumentException: Wrong FS: file:/… expected: hdfs://

without comments

I’ve been playing around with Mahout over the last couple of days to see how well it works for content based filtering.

I started following a mini tutorial from Stack Overflow but ran into trouble on the first step:

bin/mahout seqdirectory \
--input file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo \
--output file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out \
-c UTF-8 \
-chunk 64 \
-prefix mah
16/07/21 21:19:20 INFO AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo], --keyPrefix=[mah], --method=[mapreduce], --output=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out], --startPhase=[0], --tempDir=[temp]}
16/07/21 21:19:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/21 21:19:20 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/07/21 21:19:20 INFO deprecation: is deprecated. Instead, use
16/07/21 21:19:20 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo, expected: hdfs://localhost:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(
	at org.apache.mahout.text.SequenceFilesFromDirectory.runMapReduce(
	at org.apache.mahout.text.SequenceFilesFromDirectory.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(
	at org.apache.hadoop.util.ProgramDriver.driver(
	at org.apache.mahout.driver.MahoutDriver.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at org.apache.hadoop.util.RunJar.main(

I was trying to run the command against the local file system on my laptop which should have been possible according to the instructions. I couldn’t find any flag I could pass any flag that I could pass to Mahout to tell it not to use HDFS but I eventually stumbled on someone else experiencing a similar problem.

It turns out the last time I was playing around with Hadoop, in late 2015, I’d actually configured that and had completely forgotten. I needed to comment out the following config:



I commented that property out and all was happy with the (Hadoop) world again.

Written by Mark Needham

July 21st, 2016 at 5:57 pm

Posted in Hadoop

Tagged with ,

How I met your mother: Story arcs

without comments

After weeks of playing around with various algorithms to extract story arcs in How I met your mother I’ve come to the conclusion that I don’t yet have the skills to completely automate this process so I’m going to change my approach.

The new plan is to treat the outputs of the algorithms as suggestions for possible themes but then have a manual step where I extract what I think are interesting themes in the series.

A theme can consist of a single word or a phrase and the idea is that once a story arc is identified we’ll search over the corpus and find the episodes where that phrase occurs.

We can then generate a CSV file of (story arc) -> (episodeId), store that into our HIMYM graph and use the story arc as another factor for episode similarity.

I ended up with the following script to work out which episodes contained a story arc:

find_term() {
  episodes=$(grep --color -iE "${searchTerm}" data/import/sentences.csv | awk -F"," '{ print $2  }' | sort | uniq)
  for episode in ${episodes}; do
    echo ${arc},${episode}
find_term "Bro Code" "bro code"
find_term "Legendary" "legen(.*)ary"
find_term "Slutty Pumpkin" "slutty pumpkin"
find_term "Magician's Code" "magician's code"
find_term "Thanksgiving" "thanksgiving"
find_term "The Playbook" "playbook"
find_term "Slap Bet" "slap bet"
find_term "Wrestlers and Robots" "wrestlers"
find_term "Robin Sparkles" "sparkles"
find_term "Blue French Horn" "blue french horn"
find_term "Olive Theory" "olive"
find_term "Thank You Linus" "thank you, linus"
find_term "Have you met...?" "have you met"
find_term "Laser Tag" "laser tag"
find_term "Goliath National Bank" "goliath national bank"
find_term "Challenge Accepted" "challenge accepted"
find_term "Best Man" "best man"

If we run this script we’ll see something like the following:

$ ./scripts/
Bro Code,14
Bro Code,155
Bro Code,170
Bro Code,188
Bro Code,201
Bro Code,61
Bro Code,64
Best Man,208
Best Man,30
Best Man,32
Best Man,41
Best Man,42

I pulled out these themes by eyeballing the output of the following scripts:

  • TF/IDF – calculates TF/IDF scores for ngrams. This helps find important themes in the context of a single episode. I then did some manual searching to see how many of those themes existed in other episodes
  • Weighted Term Frequency – this returns a weighted term frequency for ngrams of different lengths. The weights are determined by the skewed random discrete distribution I wrote about earlier in the week. I ran it with different skews and ngram lengths.
  • Named entity extraction – this pulls out any phrases that are named entities. It mostly pulled out names of people (which I used as a stop word list in some other algorithms) but also revealed a couple of themes.
  • Topic modelling – I used mallet to extract topics across the corpus. Most of them didn’t make much sense to me but there were a few which identified themes that I recognised.

I can’t remember off the top of my head if any obvious themes have been missed so if you know HIMYM better than me let me know and I’ll try and work out why those didn’t surface.

Next I want to see how these scripts fare against some other TV shows and see how quickly I can extract themes for those. It’d also be cool if I can make the whole process a bit more automated.

Written by Mark Needham

April 3rd, 2015 at 11:31 pm

Posted in Machine Learning

Tagged with

Kaggle Titanic: Python pandas attempt

with one comment

Nathan and I have been looking at Kaggle’s Titanic problem and while working through the Python tutorial Nathan pointed out that we could greatly simplify the code if we used pandas instead.

The problem we had with numpy is that you use integers to reference columns. We spent a lot of time being thoroughly confused as to why something wasn’t working only to realise we were using the wrong column.

The algorithm scores an accuracy of 77.99% and the way it works is that we build a ‘survival table’ which works out an average survival rate of people based on 3 features:

  • Passenger Class
  • Passenger Fare (grouped into those who paid 0-9, 10-19, 20-29, 30+)
  • Gender

It looks like this:

2013 10 30 07 05 03

And the code that creates that is:

import pandas as pd
def addrow(df, row):
    return df.append(pd.DataFrame(row), ignore_index=True)
def fare_in_bucket(fare, fare_bracket_size, bucket):
    return (fare > bucket * fare_bracket_size) & (fare <= ((bucket+1) * fare_bracket_size))
def build_survival_table(training_file):    
    fare_ceiling = 40
    train_df = pd.read_csv(training_file)
    train_df[train_df['Fare'] >= 39.0] = 39.0
    fare_bracket_size = 10
    number_of_price_brackets = fare_ceiling / fare_bracket_size
    number_of_classes = 3 #There were 1st, 2nd and 3rd classes on board 
    survival_table = pd.DataFrame(columns=['Sex', 'Pclass', 'PriceDist', 'Survived', 'NumberOfPeople'])
    for pclass in range(1, number_of_classes + 1): # add 1 to handle 0 start
        for bucket in range(0, number_of_price_brackets):
            for sex in ['female', 'male']:
                survival = train_df[(train_df['Sex'] == sex) 
                                    & (train_df['Pclass'] == pclass) 
                                    & fare_in_bucket(train_df["Fare"], fare_bracket_size, bucket)]
                row = [dict(Pclass=pclass, Sex=sex, PriceDist = bucket, 
                            Survived = round(survival['Survived'].mean()), 
                            NumberOfPeople = survival.count()[0]) ]
                survival_table = addrow(survival_table, row)
    return survival_table.fillna(0)
survival_table = build_survival_table("train.csv")

where ‘train.csv’ is structured like so:

$ head -n5 train.csv 
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

After we’ve built that we iterate through the test data set and look up each person in the table and find their survival rate.

def select_bucket(fare):
    if (fare >= 0 and fare < 10):
        return 0
    elif (fare >= 10 and fare < 20):
        return 1
    elif (fare >= 20 and fare < 30):
        return 2
        return 3
def calculate_survival(survival_table, row):
    survival_row = survival_table[(survival_table["Sex"] == row["Sex"]) & (survival_table["Pclass"] == row["Pclass"]) & (survival_table["PriceDist"] == select_bucket(row["Fare"]))]
    return int(survival_row["Survived"].iat[0])    
test_df = pd.read_csv('test.csv')             
test_df["Survived"] = test_df.apply(lambda row: calculate_survival(survival_table, row), axis=1)

I wrote up the difficulties we had working out how to append the ‘Survived’ column if you want more detail.

‘test.csv’ looks like this:

$ head -n5 test.csv 
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S

We then write out the survival value for each customer along with their ID:

test_df.to_csv("result.csv", cols=['PassengerId', 'Survived'], index=False)
$ head -n5 result.csv 

I’ve pasted the code as a gist for those who want to see it all as one.

Next step: introduce some real machine learning, probably using scikit-learn unless there’s something else we should be using?

Written by Mark Needham

October 30th, 2013 at 7:26 am

Posted in Machine Learning

Tagged with ,

Feature Extraction/Selection – What I’ve learnt so far

with one comment

A couple of weeks ago I wrote about some feature extraction work that I’d done on the Kaggle Digit Recognizer data set and having realised that I had no idea what I was doing I thought I should probably learn a bit more.

I came across Dunja Mladenic’s ‘Dimensionality Reduction by Feature Selection in Machine Learning‘ presentation in which she sweeps across the landscape of feature selection and explains how everything fits together.

The talk starts off by going through some reasons that we’d want to use dimensionality reduce/feature selection:

  • Improve the prediction performance
  • Improve learning efficiency
  • Provide faster predictors possibly requesting less information on the original data
  • Reduce complexity of the learned results, enable better understanding of the underlying process

Mladenic suggests that there are a few ways we can go about reducing the dimensionality of data:

  • Selecting a subset of the original features
  • Constructing features to replace the original features
  • Using background knowledge to construct new features to be used in addition to the original features

The talk focuses on the first of these and a lot of it focuses on how we can go about using feature selection as a pre-processing step on our data sets.

The approach seems to involve either starting with all the features and removing them one at a time and seeing how the outcome is affected or starting with none of the features and adding them one at a time.

However, about half way through the talk Mladenic points out that some algorithms actually have feature selection built into them so there’s no need to have the pre-processing step.

I think this is the case with random forests of decision trees because the decision trees are constructed by taking into account which features give the greatest information gain so low impact features are less likely to be used.

I previously wrote a blog post describing how I removed all the features with zero variance from the data set and after submitting a random forest trained on the new data set we saw no change in accuracy which proved the point.

I also came across an interesting paper by Isabelle Guyon & Andre Elisseeff titled ‘An Introduction to Variable and Feature Selection‘ which has a flow chart-ish set of questions to help you work out where to start.

One of the things I picked up from reading this paper is that if you have domain knowledge then you might be able to construct a better set of features by making use of this knowledge.

Another suggestion is to come up with a variable ranking for each feature i.e. how much that feature contributes to the outcome/prediction. This is something also suggested in the Coursera Data Analysis course and in R we can use the glm function to help work this out.

The authors also point out that we should separate the problem of model selection (i.e. working out which features to use) from the problem of testing our classifier.

To test the classifier we’d most likely keep a test set aside but we shouldn’t use this data for testing feature selection, rather we should use the training data. Cross validation probably works best here.

There’s obviously more covered in the presentation & paper than what I’ve covered here but I’ve found that in general the material I’ve come across tends to drift towards being quite abstract/theoretical and therefore quite difficult for me to follow.

If anyone has come across any articles/books which explain how to go about feature selection using an example I’d love to read it/them!

Written by Mark Needham

February 10th, 2013 at 3:42 pm

Posted in Machine Learning

Tagged with

Kaggle Digit Recognizer: A feature extraction #fail

with 4 comments

I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction in this context means that we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try and persuade me that we should try it out but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

5. The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I’ve highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non zero pixels, the number of 255 pixels, the average number of pixels and the average of the middle pixels of a number.

The code reads like this:

initial <- read.csv("train.csv", header = TRUE)
initial$nonZeros <- apply(initial, 1, function(entries) length(Filter(function (x) x != 0, entries)))
initial$fullHouses <- apply(initial, 1, function(entries) length(Filter(function (x) x == 255, entries)))
initial$meanPixels <- apply(initial, 1, mean)
initial$middlePixels <- apply(initial[,200:500], 1, mean)

I then wrote those features out into a CSV file like so:

newFeatures <- subset(initial, select=c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(file="feature-extraction.txt", newFeatures, row.names=FALSE, sep=",")

I then created a 100 tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.

Unfortunately the accuracy on the cross validation set (10% of the training data) was only 24% which is pretty useless so it’s back to the drawing board!

Our next task is to try and work out whether we can derive some features which have a stronger correlation with the label values or combining the new features with the existing pixel values to see if that has any impact.

As you can probably tell I don’t really understand how you should go about extracting features so if anybody has ideas or papers/articles I can read to learn more please let me know in the comments!

Written by Mark Needham

January 31st, 2013 at 11:24 pm

Posted in Machine Learning

Tagged with ,

Mahout: Parallelising the creation of DecisionTrees

with 5 comments

A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser Problem and after seeing how long it took to create forests with 500+ trees I wanted to see if this could be sped up by parallelising the process.

From looking at the DecisionTree it seemed like it should be possible to create lots of small forests and then combine them together.

After unsuccessfully trying to achieve this by directly using DecisionForest I decided to just copy all the code from that class into my own version which allowed me to achieve this.

The code to build up the forest ends up looking like this:

List<Node> trees = new ArrayList<Node>();
MultiDecisionForest forest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
MultiDecisionForest forest = new MultiDecisionForest(trees);

We can then use forest to classify values in a test data set and it seems to work reasonably well.

I wanted to try and avoid putting any threading code in so I made use of GNU parallel which is available on Mac OS X with a brew install parallel and on Ubuntu by adding the following repository to /etc/apt/sources.list

deb oneiric main 
deb-src oneiric main

…followed by a apt-get update and apt-get install parallel.

I then wrote a script to parallelise the creation of the forests:

startTime=`date '+%s'`
seq 1 ${numberOfRuns} | parallel -P 8 "./"
endTime=`date '+%s'`
echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: " $(expr $endTime - $startTime)

java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar

It should be possible to achieve this by using the parallel option in xargs but unfortunately I wasn’t able to achieve the same success with that command.

I hadn’t come across the seq command until today but it works quite well here for allowing us to specify how many times we want to call the script.

I was probably able to achieve about a 30% speed increase when running this on my Air. There was a greater increase running on a high CPU AWS instance although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.

Sadly even with a new classifier with a massive number of trees I didn’t see an improvement over the Weka random forest using AdaBoost which I wrote about a month ago. We had an accuracy of 96.282% here compared to 96.529% with the Weka version.

Written by Mark Needham

December 27th, 2012 at 12:08 am

Posted in Machine Learning

Tagged with ,

Weka: Saving and loading classifiers

with 3 comments

In our continued machine learning travels Jen and I have been building some classifiers using Weka and one thing we wanted to do was save the classifier and then reuse it later.

There is documentation for how to do this from the command line but we’re doing everything programatically and wanted to be able to save our classifiers from Java code.

As it turns out it’s not too tricky when you know which classes to call and saving a classifier to a file is as simple as this:

MultilayerPerceptron classifier = new MultilayerPerceptron();
classifier.buildClassifier(instances); // instances gets passed in from elsewhere
Debug.saveToFile("/path/to/weka-neural-network", classifier);

If we want to load that classifier up we can make use of the SerializedClassifier class like so:

SerializedClassifier classifier = new SerializedClassifier();
classifier.setModelFile(new File("/path/to/weka-neural-network"));


Written by Mark Needham

December 12th, 2012 at 12:04 am

Posted in Machine Learning

Tagged with ,

Kaggle Digit Recognizer: Weka AdaBoost attempt

without comments

In our latest attempt at Kaggle’s Digit Recognizer Jen and I decided to try out boosting on our random forest algorithm, an approach that Jen had come across in a talk at the Clojure Conj.

We couldn’t find any documentation that it was possible to apply boosting to Mahout’s random forest algorithm but we knew it was possible with Weka so we decided to use that instead!

As I understand it the way that boosting works in the context of random forests is that each of the trees in the forest will be assigned a weight based on how accurately it’s able to classify the data set and these weights are then used in the voting stage.

There’s a more detailed explanation of the algorithm in this paper.

We had the following code to train the random forest:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) {
        FastVector attributes = attributes();
        Instances instances = new Instances("digit recognizer", attributes, 40000);
        String[] trainingDataValues = KaggleInputReader.fileAsStringArray("data/train.csv");
        for (String trainingDataValue : trainingDataValues) {
            Instance instance = createInstance(trainingDataValue);
        Classifier classifier = buildClassifier(instances);
    private static Classifier buildClassifier(Instances instances) throws Exception {
        RandomForest randomForest = new RandomForest();
        MultiBoostAB multiBoostAB = new MultiBoostAB();
        return multiBoostAB;
    private static FastVector attributes() {
        FastVector attributes = new FastVector();
        for (int i = 0; i <= 783; i++) {
            attributes.addElement(new Attribute("pixel" + i));
        return attributes;
    private static Attribute digit() {
        FastVector possibleClasses = new FastVector(10);
        return new Attribute("label", possibleClasses, 0);

The code in the KaggleInputReader is used to process the CSV file and is the same as that included in a previous post so I won’t bother including it in this post.

The Weka API is slightly different to the Mahout one in that we have to tell it the names of all the labels that a combination of features belong to whereas with Mahout it seems to work it out for you.

Wf use the RandomForest class to build up our trees and then wrap it in the MultiBoostAB class to apply the boosting. There is another class we could use to do this called AdaBoostM1 but they both seem to give similar results so we stuck with this one.

Once we’d trained the classifier up we ran it against our test data set like so:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) {
        String[] testDataValues = KaggleInputReader.fileAsStringArray("data/test.csv");
        FileWriter fileWriter = new FileWriter("weka-attempts/out-" + System.currentTimeMillis() + ".txt");
        PrintWriter out = new PrintWriter(fileWriter);
        for (String testDataValue : testDataValues) {
            Iteration iterate = iterate(testDataValue, classifier, instances);
            out.println((int) iterate.getPrediction());
            System.out.println("Actual: " + iterate.getActual() + ", Prediction: " + iterate.getPrediction());
    private static Iteration iterate(String testDataValue, Classifier classifier, Instances instances) throws Exception {
        Instance predictMe = createTestDataBasedInstanceToPredict(testDataValue, instances);
        double prediction = classifier.classifyInstance(predictMe);
        return new Iteration(new Double(testDataValue.split(",")[0]), prediction);
    private static Instance createTestDataBasedInstanceToPredict(String testDataValue, Instances instances) {
        String[] columns = testDataValue.split(",");
        Instance instance = new Instance(785);
        for (int i = 0; i < columns.length; i++) {
            instance.setValue(new Attribute("pixel" + i, i+1), new Double(columns[i]));
        return instance;

We got an accuracy of 96.529% with this code which is 0.2% higher than we managed with the Mahout Random forest without any boosting. The full code for this solution is on github as always!

We still haven’t managed to get an accuracy higher than the default solution provided by Kaggle so any suggestions about what else to try are welcome!

We’ve been playing around with neural networks using encog but they seem a bit magical and the moment and it’s difficult to work out why they don’t work when you don’t get the result you expect!

Written by Mark Needham

November 29th, 2012 at 5:09 pm

Posted in Machine Learning

Tagged with ,