Mark Needham

Thoughts on Software Development

scikit-learn: First steps with log_loss

Over the last week I’ve spent a little bit of time playing around with the data in the Kaggle TalkingData Mobile User Demographics competition, and came across a notebook written by dune_dweller showing how to run a logistic regression algorithm on the dataset.

The metric used to evaluate the output in this competition is multi class logarithmic loss, which is implemented by the log_loss function in the scikit-learn library.

I’ve not used it before so I created a small example to get to grips with it.

Let’s say we have 3 rows to predict and we happen to know that they should be labelled ‘bam’, ‘ham’, and ‘spam’ respectively:

`>>> actual_labels = ["bam", "ham", "spam"]`

To work out the log loss score we need to make a prediction for what we think each label actually is. We do this by passing an array containing a probability between 0 and 1 for each possible label.

For example, if we think the first label is definitely ‘bam’ then we’d pass [1, 0, 0], whereas if we thought it had a 50-50 chance of being ‘bam’ or ‘spam’ then we might pass [0.5, 0, 0.5]. As far as I can tell the values get sorted into (alphabetical) order, so we need to provide our predictions in the same order.
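We can see what that alphabetical ordering looks like without calling sklearn at all. This little sketch mimics the ordering (it effectively does what np.unique does internally); it isn't a call into sklearn itself:

```python
actual_labels = ["bam", "ham", "spam"]

# log_loss works out its column order by sorting the distinct labels,
# so the prediction columns correspond to the sorted label set
column_order = sorted(set(actual_labels))
print(column_order)

# so a prediction of [0.5, 0, 0.5] for a row means:
# 50% 'bam', 0% 'ham', 50% 'spam'
for label, prob in zip(column_order, [0.5, 0, 0.5]):
    print(label, prob)
```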

Let’s give it a try. First we’ll import the function:

`>>> from sklearn.metrics import log_loss`

Now let’s see what score we get if we make a perfect prediction:

```
>>> log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
2.1094237467877998e-15
```

What about if we make a completely wrong prediction?

```
>>> log_loss(actual_labels, [[0, 0, 1], [1, 0, 0], [0, 1, 0]])
34.538776394910684
```

We can reverse engineer this score to work out the probability that we’ve predicted the correct class.

If we look at the case where the average log loss exceeds 1, it is when log(pij) < -1, where i is the true class. This means that the predicted probability for that class would be less than exp(-1), or around 0.368. So a log loss greater than 1 can be expected whenever your model gives less than roughly a 37% probability estimate for the correct class.
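We can verify that break-even point directly. If the model assigns exactly exp(-1) to the true class, that row contributes exactly 1 to the average log loss:

```python
import math

# the break-even probability: below this, a row contributes
# more than 1 to the average log loss
threshold = math.exp(-1)
print(threshold)             # ~0.368

# the log loss contribution for a row predicted at that probability
print(-math.log(threshold))  # 1
```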

This is the formula for log loss:

logloss = -1/N * Σi Σj yij * log(pij)

in which yij is 1 for the correct class and 0 for the other classes, and pij is the probability assigned to that class.

The interesting thing about this formula is that we only care about the correct class. The yij value of 0 cancels out the wrong classes.
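To make the formula concrete, here’s a hand-rolled version (my own sketch, not sklearn’s implementation; sklearn also clips probabilities so that log(0) can never happen, which mine doesn’t bother with):

```python
import math

def my_log_loss(actual, predicted, classes):
    # only the correct class contributes: yij is 1 there and 0 elsewhere,
    # so each row adds log(probability assigned to the true label)
    total = sum(math.log(probs[classes.index(label)])
                for label, probs in zip(actual, predicted))
    return -total / len(actual)

classes = ["bam", "ham", "spam"]
perfect = my_log_loss(["bam", "ham", "spam"],
                      [[1, 0, 0], [0, 1, 0], [0, 0, 1]], classes)
print(perfect)  # exactly zero, since we don't clip like sklearn does
```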

In our two examples so far we actually already know the probability estimate for the correct class – 100% in the first case and 0% in the second case, but we can plug in the numbers to check we end up with the same result.

First we need to work out what value would have been passed to the log function, which is easy in this case: yij is 1 only for the correct class, so we just take the log of the probability we predicted for that class.

```
>>> import math

# every prediction exactly right
>>> math.log(1)
0.0
>>> math.exp(0)
1.0

# every prediction completely wrong
>>> math.log(0.000000001)
-20.72326583694641
>>> math.exp(-20.72326583694641)
1.0000000000000007e-09
```

I used a really small value instead of 0 in the second example because math.log(0) raises a ValueError: log(x) tends towards negative infinity as x approaches 0.
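sklearn’s log_loss guards against this by clipping the probabilities away from 0 and 1 (an eps of 1e-15 in the versions I’ve used), which is also exactly where the 34.538776394910684 score for the completely wrong prediction comes from. A minimal sketch of that clipping idea:

```python
import math

def clipped_log(p, eps=1e-15):
    # clip p into [eps, 1 - eps] so we never take log of exactly 0 or 1
    p = max(eps, min(1 - eps, p))
    return math.log(p)

# a completely wrong prediction per row: log(eps) instead of log(0),
# which reproduces the 34.54 figure above
print(-clipped_log(0))
```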

Let’s try another example where we have less certainty:

```
>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673
```

We’ll have to do a bit more work to figure out what value was being passed to the log function this time, but not too much. This is roughly the calculation being performed:

```
# 0.363548039673 = -1/3 * (log(0.8) + log(0.6) + log(0.7))

>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673
```

In this case, on average our probability estimate would be:

```
# we put in the negative value since we multiplied by -1/N
>>> math.exp(-0.363548039673)
0.6952053289772744
```

We had 60%, 70%, and 80% accuracy for our 3 labels so an overall probability of 69.5% seems about right.
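Another way to look at this: exp(-log loss) is the geometric mean of the probabilities assigned to the correct classes, which is why 60%, 70% and 80% average out to 69.5% rather than the arithmetic 70%. A quick check:

```python
import math

# probabilities assigned to the correct class for each row
probs = [0.8, 0.6, 0.7]

product = 1.0
for p in probs:
    product *= p
geometric_mean = product ** (1.0 / len(probs))
print(geometric_mean)

# recovering the same number from the log loss score
print(math.exp(-0.363548039673))
```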

One more example. This time we’ll add a fourth, very certain (90%) prediction for ‘spam’:

```
>>> print log_loss(["bam", "ham", "spam", "spam"],
                   [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]])
0.299001158669

>>> math.exp(-0.299001158669)
0.741558550213609
```

74% accuracy overall, sounds about right!

Written by Mark Needham

September 14th, 2016 at 5:33 am

Posted in Machine Learning, Python


scikit-learn: Clustering and the curse of dimensionality


In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.

But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”).

Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

```
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def distances(a, b):
    return np.linalg.norm(a - b), cosine_similarity([a, b])[0][1]

def mixed(n_zeros, n_ones):
    return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0)

def ones(n_ones):
    return np.repeat([1], n_ones)

print distances(mixed(2, 2), ones(4))
print distances(mixed(3, 3), ones(6))
print distances(mixed(50, 50), ones(100))
print distances(mixed(300, 300), ones(600))

(1.4142135623730951, 0.70710678118654746)
(1.7320508075688772, 0.70710678118654768)
(7.0710678118654755, 0.70710678118654757)
(17.320508075688775, 0.70710678118654746)
```

The euclidean distance for the 600-item vector (17.32) is over 12x larger than for the one containing 4 items (1.41), despite the two pairs having exactly the same cosine similarity score.
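The pattern in those numbers is no accident: when half the components differ by 1, the euclidean distance is sqrt(d/2), so it grows with the square root of the dimensionality while the cosine similarity stays fixed. A quick check without numpy:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# same relative disagreement (half the components differ by 1)
# at growing dimensionality: the distance grows as sqrt(d / 2)
for d in [4, 6, 100, 600]:
    a = [1] * (d // 2) + [0] * (d - d // 2)  # half ones, half zeros
    b = [1] * d                              # all ones
    print(d, euclidean(a, b), math.sqrt(d / 2))
```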

Having convinced myself that reducing the dimensionality of the vectors could make a difference, I reduced the size of the episode vectors using the Truncated SVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episodes vectors:

```
from sklearn.decomposition import TruncatedSVD

n_components = 2
reducer = TruncatedSVD(n_components=n_components)
reducer.fit(all)
new_all = reducer.transform(all)
print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum()))

2: Percentage explained: 0.124579183633
```

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would be an interesting place to start. I’m also not sure exactly what the reducer.explained_variance_ratio_ attribute means, so I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.
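For what it’s worth, explained_variance_ratio_ is an array with one entry per component, and its sum is the fraction of the data’s variance the reduced representation retains, so 0.12 means our 2 components only capture about 12% of the variance. A sketch of the underlying idea using a plain SVD on my own toy data (not the episodes matrix; note this mirrors PCA’s definition, while TruncatedSVD computes its ratio slightly differently since it doesn’t centre the data):

```python
import numpy as np

# toy data: one dominant direction plus a tiny noisy second one
rng = np.random.default_rng(42)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.01 * rng.normal(size=(200, 1))])

# explained variance ratio: squared singular values of the
# centred data, normalised to sum to 1
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
ratio = s ** 2 / np.sum(s ** 2)
print(ratio)  # first component dominates
```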

For now though let’s try out the clustering algorithm again and see how it gets on:

```
from sklearn.cluster import KMeans

for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(new_all)
    silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000)

    print n_clusters, silhouette_avg

2 0.559681096025
3 0.498456585461
4 0.524704352941
5 0.441580592398
6 0.44703058946
7 0.447895331824
8 0.433698007009
9 0.459874485986
```

This time our silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score:

We have a couple of cluster sizes which fit in the ‘reasonable structure’ and a few just on the edge of fitting in that category.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Written by Mark Needham

August 27th, 2016 at 8:32 pm

scikit-learn: Trying to find clusters of Game of Thrones episodes

In my last post I showed how to find similar Game of Thrones episodes based on the characters that appear in different episodes. This allowed us to find similar episodes on an episode by episode basis, but I was curious whether there were groups of similar episodes that we could identify.

scikit-learn provides several clustering algorithms that can run over our episode vectors and hopefully find clusters of similar episodes. A clustering algorithm groups similar documents together, where similarity is based on calculating a ‘distance’ between documents. Documents separated by a small distance would be in the same cluster, whereas if there’s a large distance between episodes then they’d probably be in different clusters.

The simplest variant is K-means clustering:

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.

The output from the algorithm is a list of labels which correspond to the cluster assigned to each episode.

Let’s give it a try on the Game of Thrones episodes. We’ll start from the 2 dimensional array of episodes/character appearances that we created in the previous post.

```
>>> all.shape
(60, 638)

>>> all
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
```

We have a 60 (episodes) x 638 (characters) array which we can now plug into the K-means clustering algorithm:

```
>>> from sklearn.cluster import KMeans

>>> n_clusters = 3
>>> km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
>>> cluster_labels = km.fit_predict(all)

>>> cluster_labels
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
```

cluster_labels is an array containing a label for each episode in the all array. The spread of these labels is as follows:

```
>>> import numpy as np
>>> np.bincount(cluster_labels)
array([19, 12, 29])
```

i.e. 19 episodes in cluster 0, 12 in cluster 1, and 29 in cluster 2.

How do we know if the clustering is any good?

Ideally we’d have some labelled training data which we could compare our labels against, but since we don’t we can measure the effectiveness of our clustering by calculating inter-centroidal separation and intra-cluster variance.

i.e. how close are the episodes to other episodes in the same cluster vs how close are they to episodes in the closest different cluster.

scikit-learn gives us a function that we can use to calculate this score – the silhouette coefficient.

The output of this function is a score between -1 and 1.

• A score of 1 means that our clustering has worked well and a document is far away from the boundary of another cluster.
• A score of -1 means that our document should have been placed in another cluster.
• A score of 0 means that the document is very close to the decision boundary between two clusters.
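For a single point the coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. A tiny hand-rolled version on four 1-D points of my own (not sklearn’s implementation) shows why tight, well-separated clusters score close to 1:

```python
def silhouette(points, labels, index):
    # a = mean distance to the other points in the same cluster
    # b = mean distance to the points in the nearest other cluster
    same = [p for i, (p, l) in enumerate(zip(points, labels))
            if l == labels[index] and i != index]
    other = {}
    for p, l in zip(points, labels):
        if l != labels[index]:
            other.setdefault(l, []).append(p)
    a = sum(abs(points[index] - p) for p in same) / len(same)
    b = min(sum(abs(points[index] - p) for p in ps) / len(ps)
            for ps in other.values())
    return (b - a) / max(a, b)

# two tight, well-separated 1-D clusters: every point scores close to 1
points = [0.0, 0.1, 10.0, 10.1]
labels = [0, 0, 1, 1]
print([round(silhouette(points, labels, i), 3) for i in range(4)])
```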

I tried calculating this coefficient for some different values of K. This is what I found:

```
from sklearn import metrics

for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(all)

    silhouette_avg = metrics.silhouette_score(all, cluster_labels, sample_size=1000)
    sample_silhouette_values = metrics.silhouette_samples(all, cluster_labels)

    print n_clusters, silhouette_avg

2 0.0798610142955
3 0.0648416081725
4 0.0390877994786
5 0.020165277756
6 0.030557856406
7 0.0389677156458
8 0.0590721834989
9 0.0466170527996
```

The best score we manage here is 0.07 when we set the number of clusters to 2. Even our highest score is much lower than the lowest score on the documentation page!

I tried it out with some higher values of K but only saw a score over 0.5 once I set the number of clusters to 40, which would mean at most 1 or 2 episodes per cluster.

At the moment our episode arrays contain 638 elements so they’re too long to visualise on a 2D silhouette plot. We’d need to apply a dimensionality reduction algorithm before doing that.

In summary it looks like character co-occurrence isn’t a good way to cluster episodes. I’m curious what would happen if we flip the array on its head and try and cluster the characters instead, but that’s for another day.

If anyone spots anything that I’ve missed when reading the output of the algorithm let me know in the comments. I’m just learning by experimentation at the moment.

Written by Mark Needham

August 25th, 2016 at 10:07 pm

Posted in Machine Learning, Python


Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:

`:play http://guides.neo4j.com/got`

Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based on character appearances. We’ll be using scikit-learn’s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

```
Episode 1 = [1, 1, 0]
Episode 2 = [0, 1, 1]
```

We could then calculate the cosine similarity between these two episodes like this:

```
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> one = [1, 1, 0]
>>> two = [0, 1, 1]

>>> cosine_similarity([one, two])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])
```

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t mentioned at all; each character is implicitly a position in the array. This means that when we use our real dataset we need to ensure that the characters are in the same order for each episode, otherwise the calculation will be meaningless!
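A quick illustration of why the ordering matters, using a hand-rolled cosine similarity (just the dot product divided by the product of the norms):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norms

# characters encoded in the order [A, B, C] for both episodes
print(cosine([1, 1, 0], [0, 1, 1]))  # 0.5, as above

# same two episodes, but episode 2 accidentally encoded in the
# order [C, B, A]: the episodes now wrongly look identical
print(cosine([1, 1, 0], [1, 1, 0]))  # ~1.0
```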

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

```
from neo4j.v1 import GraphDatabase, basic_auth

driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo"))
session = driver.session()

rows = session.run("""
MATCH (c:Character), (e:Episode)
OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e)
RETURN e, c, appearance
ORDER BY e.id, c.id""")
```

We can iterate through the rows to see what the output looks like:

```
>>> for row in rows:
        print row

<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>
```

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

```
episodes = {}
for row in rows:
    if episodes.get(row["e"]["id"]) is None:
        if row["appearance"] is None:
            episodes[row["e"]["id"]] = [0]
        else:
            episodes[row["e"]["id"]] = [1]
    else:
        if row["appearance"] is None:
            episodes[row["e"]["id"]].append(0)
        else:
            episodes[row["e"]["id"]].append(1)
```

Here’s an example of one entry in the matrix:

```
>>> len(episodes)
60

>>> len(episodes[1])
638

>>> episodes[1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...]
```

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

```
>>> all = episodes.values()

>>> cosine_similarity(all[0:1], all)[0]
array([ 1.        ,  0.69637306,  0.48196269,  0.54671752,  0.48196269,
        0.44733753,  0.31707317,  0.42340087,  0.34989921,  0.43314808,
        0.36597766,  0.18421252,  0.30961158,  0.2328101 ,  0.30616181,
        0.41905818,  0.36842504,  0.35338088,  0.18376917,  0.3569686 ,
        0.2328101 ,  0.34539847,  0.25043516,  0.31707317,  0.25329221,
        0.33342786,  0.34921515,  0.2174909 ,  0.2533473 ,  0.28429311,
        0.23026565,  0.22310537,  0.22365301,  0.23816275,  0.28242289,
        0.16070148,  0.24847093,  0.21434648,  0.03582872,  0.21189672,
        0.15460414,  0.17161693,  0.15460414,  0.17494961,  0.1234662 ,
        0.21426863,  0.21434648,  0.18748505,  0.15308091,  0.20161946,
        0.19877675,  0.30920827,  0.21058466,  0.19127301,  0.24607943,
        0.18033393,  0.17734311,  0.16296707,  0.18740851,  0.23995201])
```

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

```
>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]),
                             key=lambda x: x[1], reverse=True)[:5]:
        print idx, score

0 1.0
1 0.696373059207
3 0.546717521051
2 0.481962692712
4 0.481962692712
```

Or we can see how similar the last episode of season 6 is compared to the others:

```
>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]),
                             key=lambda x: x[1], reverse=True)[:5]:
        print idx, score

59 1.0
52 0.500670191678
46 0.449085146211
43 0.448218732478
49 0.446296233312
```

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following query creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

```
>>> episode_mapping = {}
>>> for idx, episode_id in enumerate(episodes):
        episode_mapping[idx] = episode_id

>>> for idx, episode_id in enumerate(episodes):
        similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0]
        for other_idx, similarity_score in enumerate(similarity_matrix):
            other_episode_id = episode_mapping[other_idx]
            print episode_id, other_episode_id, similarity_score
            if episode_id != other_episode_id:
                session.run("""
                    MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}})
                    MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2)
                    ON CREATE SET similarity.score = {similarityScore}
                """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score})

>>> session.close()
```

The episode_mapping dictionary is needed to map from episode ids to indices e.g. episode 1 is at index 0.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

```
MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1,
       "S" + episode2.season + "E" + episode2.number AS ep2,
       similarity.score AS score
ORDER BY similarity.score DESC
LIMIT 10

╒═════╤════╤══════════════════╕
│ep1  │ep2 │score             │
╞═════╪════╪══════════════════╡
│S1E2 │S1E1│0.6963730592072543│
│S1E4 │S1E3│0.6914173051223086│
│S1E9 │S1E8│0.6869464497590777│
│S2E10│S2E8│0.6869037302955034│
│S3E7 │S3E6│0.6819943394704735│
│S2E7 │S2E6│0.6813598225089799│
│S1E10│S1E9│0.6796436827080401│
│S1E5 │S1E4│0.6698105143372364│
│S1E10│S1E8│0.6624062584864754│
│S4E5 │S4E4│0.6518358737330705│
└─────┴────┴──────────────────┘
```

And the least similar?

```
MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1,
       "S" + episode2.season + "E" + episode2.number AS ep2,
       similarity.score AS score
ORDER BY similarity.score
LIMIT 10

╒════╤════╤═══════════════════╕
│ep1 │ep2 │score              │
╞════╪════╪═══════════════════╡
│S4E9│S1E5│0                  │
│S4E9│S1E6│0                  │
│S4E9│S4E2│0                  │
│S4E9│S2E9│0                  │
│S4E9│S2E4│0                  │
│S5E6│S4E9│0                  │
│S6E8│S4E9│0                  │
│S4E9│S4E6│0                  │
│S3E9│S2E9│0.03181423814878889│
│S4E9│S1E1│0.03582871819500093│
└────┴────┴───────────────────┘
```

The output of this query suggests that there are no common characters between 8 pairs of episodes which at first glance sounds surprising. Let’s write a query to check that finding:

```
MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode)
WHERE episode1.season = 4 AND episode1.number = 9
  AND episode2.season = 1 AND episode2.number = 5
RETURN episode1, episode2

(no changes, no rows)
```

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up on 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

Written by Mark Needham

August 22nd, 2016 at 9:12 pm

How I met your mother: Story arcs

After weeks of playing around with various algorithms to extract story arcs in How I met your mother I’ve come to the conclusion that I don’t yet have the skills to completely automate this process so I’m going to change my approach.

The new plan is to treat the outputs of the algorithms as suggestions for possible themes but then have a manual step where I extract what I think are interesting themes in the series.

A theme can consist of a single word or a phrase and the idea is that once a story arc is identified we’ll search over the corpus and find the episodes where that phrase occurs.

We can then generate a CSV file of (story arc) -> (episodeId), store that into our HIMYM graph and use the story arc as another factor for episode similarity.

I ended up with the following script to work out which episodes contained a story arc:

```
#!/bin/bash

find_term() {
  arc=${1}
  searchTerm=${2}
  episodes=$(grep --color -iE "${searchTerm}" data/import/sentences.csv | awk -F"," '{ print $2 }' | sort | uniq)
  for episode in ${episodes}; do
    echo ${arc},${episode}
  done
}

find_term "Bro Code" "bro code"
find_term "Legendary" "legen(.*)ary"
find_term "Slutty Pumpkin" "slutty pumpkin"
find_term "Magician's Code" "magician's code"
find_term "Thanksgiving" "thanksgiving"
find_term "The Playbook" "playbook"
find_term "Slap Bet" "slap bet"
find_term "Wrestlers and Robots" "wrestlers"
find_term "Robin Sparkles" "sparkles"
find_term "Blue French Horn" "blue french horn"
find_term "Olive Theory" "olive"
find_term "Thank You Linus" "thank you, linus"
find_term "Have you met...?" "have you met"
find_term "Laser Tag" "laser tag"
find_term "Goliath National Bank" "goliath national bank"
find_term "Challenge Accepted" "challenge accepted"
find_term "Best Man" "best man"
```

If we run this script we’ll see something like the following:

```
$ ./scripts/arcs.sh
Bro Code,14
Bro Code,155
Bro Code,170
Bro Code,188
Bro Code,201
Bro Code,61
Bro Code,64
Legendary,117
Legendary,120
Legendary,122
Legendary,136
Legendary,137
Legendary,15
Legendary,152
Legendary,157
Legendary,162
Legendary,171
...
Best Man,208
Best Man,30
Best Man,32
Best Man,41
Best Man,42
```

I pulled out these themes by eyeballing the output of the following scripts:

• TF/IDF – calculates TF/IDF scores for ngrams. This helps find important themes in the context of a single episode. I then did some manual searching to see how many of those themes existed in other episodes
• Weighted Term Frequency – this returns a weighted term frequency for ngrams of different lengths. The weights are determined by the skewed random discrete distribution I wrote about earlier in the week. I ran it with different skews and ngram lengths.
• Named entity extraction – this pulls out any phrases that are named entities. It mostly pulled out names of people (which I used as a stop word list in some other algorithms) but also revealed a couple of themes.
• Topic modelling – I used mallet to extract topics across the corpus. Most of them didn’t make much sense to me but there were a few which identified themes that I recognised.

I can’t remember off the top of my head if any obvious themes have been missed so if you know HIMYM better than me let me know and I’ll try and work out why those didn’t surface.

Next I want to see how these scripts fare against some other TV shows and see how quickly I can extract themes for those. It’d also be cool if I can make the whole process a bit more automated.

Written by Mark Needham

April 3rd, 2015 at 11:31 pm

Posted in Machine Learning


Topic Modelling: Working out the optimal number of topics

In my continued exploration of topic modelling I came across The Programming Historian blog and a post showing how to derive topics from a corpus using the Java library mallet.

The instructions on the blog make it very easy to get up and running but as with other libraries I’ve used, you have to specify how many topics the corpus consists of. I’m never sure what value to select but the authors make the following suggestion:

How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run the train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, then we take that as a signal that we need to increase the number of topics; the settings were too coarse.

There are computational ways of searching for this, including using MALLETs hlda command, but for the reader of this tutorial, it is probably just quicker to cycle through a number of iterations (but for more see Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Science, 101, 5228-5235).

Since I haven’t yet had the time to dive into the paper or explore how to use the appropriate option in mallet I thought I’d do some variations on the stop words and number of topics and see how that panned out.

As I understand it, the idea is to try and get a uniform spread of topics -> documents i.e. we don’t want all the documents to have the same topic otherwise any topic similarity calculations we run won’t be that interesting.

I tried running mallet with 10, 15, 20 and 30 topics and also varied the stop words used. I had one version which just stripped out the main characters and the word ‘narrator’ & another where I stripped out the top 20% of words by occurrence and any words that appeared fewer than 10 times.

The reason for doing this was that it should identify interesting phrases across episodes better than TF/IDF can while not just selecting the most popular words across the whole corpus.
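The second stop word list can be built with a simple frequency count. A rough sketch, using the thresholds above (the token data here is made up):

```python
# Sketch: derive a stop word list from corpus frequencies - the top 20%
# of words by occurrence plus anything appearing fewer than 10 times.
from collections import Counter

tokens = ["ted"] * 50 + ["robin"] * 30 + ["umbrella"] * 12 + ["pineapple"] * 2

counts = Counter(tokens)
by_frequency = [word for word, _ in counts.most_common()]

# Top 20% of distinct words by occurrence (at least one)
top_20_percent = by_frequency[: max(1, len(by_frequency) // 5)]
rare = [word for word, count in counts.items() if count < 10]

stop_words = set(top_20_percent) | set(rare)
print(sorted(stop_words))
```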

I used mallet from the command line and ran it in two parts.

1. Generate the model
2. Work out the allocation of topics and documents based on hyper parameters

I wrote a script to help me out:

```
#!/bin/sh

train_model() {
        ./mallet-2.0.7/bin/mallet import-dir \
        --input mallet-2.0.7/sample-data/himym \
        --output ${2} \
        --keep-sequence \
        --remove-stopwords \
        --extra-stopwords ${1}
}

extract_topics() {
        ./mallet-2.0.7/bin/mallet train-topics \
        --input ${2} --num-topics ${1} \
        --optimize-interval 20 \
        --output-state himym-topic-state.gz \
        --output-topic-keys output/himym_${1}_${3}_keys.txt \
        --output-doc-topics output/himym_${1}_${3}_composition.txt
}

train_model "stop_words.txt" "output/himym.mallet"
train_model "main-words-stop.txt" "output/himym.main.words.stop.mallet"

extract_topics 10 "output/himym.mallet" "all.stop.words"
extract_topics 15 "output/himym.mallet" "all.stop.words"
extract_topics 20 "output/himym.mallet" "all.stop.words"
extract_topics 30 "output/himym.mallet" "all.stop.words"

extract_topics 10 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 15 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 20 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 30 "output/himym.main.words.stop.mallet" "main.stop.words"
```

As you can see, this script first generates a bunch of models from text files in ‘mallet-2.0.7/sample-data/himym’ – there is one file per episode of HIMYM. We then use that model to generate differently sized topic models.

The output is two files; one containing a list of topics and another describing what percentage of the words in each document come from each topic.

```
$ cat output/himym_10_all.stop.words_keys.txt

0 0.08929 back brad natalie loretta monkey show call classroom mitch put brunch betty give shelly tyler interview cigarette mc laren
1 0.05256 zoey jerry arthur back randy arcadian gael simon blauman blitz call boats becky appartment amy gary made steve boat
2 0.06338 back claudia trudy doug int abby call carl stuart voix rachel stacy jenkins cindy vo katie waitress holly front
3 0.06792 tony wendy royce back jersey jed waitress bluntly lucy made subtitle film curt mosley put laura baggage officer bell
4 0.21609 back give patrice put find show made bilson nick call sam shannon appartment fire robots top basketball wrestlers jinx
5 0.07385 blah bob back thanksgiving ericksen maggie judy pj valentine amanda made call mickey marcus give put dishes juice int
6 0.04638 druthers karen back jen punchy jeanette lewis show jim give pr dah made cougar call jessica sparkles find glitter
7 0.05751 nora mike pete scooter back magazine tiffany cootes garrison kevin halloween henrietta pumpkin slutty made call bottles gruber give
8 0.07321 ranjit back sandy mary burger call find mall moby heather give goat truck made put duck found stangel penelope
9 0.31692 back give call made find put move found quinn part ten original side ellen chicago italy locket mine show
```
```
$ head -n 10 output/himym_10_all.stop.words_composition.txt
#doc name topic proportion ...
0 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/1.txt 0 0.70961794636687 9 0.1294699168584466 8 0.07950442338871108 2 0.07192178481473664 4 0.008360809510263838 5 2.7862560133367015E-4 3 2.562409242784946E-4 7 2.1697378721335337E-4 1 1.982849604752168E-4 6 1.749937876710496E-4
1 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/10.txt 2 0.9811551470820473 9 0.016716882136209997 4 6.794128563082893E-4 0 2.807350575301132E-4 5 2.3219634098530471E-4 8 2.3018997315244256E-4 3 2.1354177341696056E-4 7 1.8081798384467614E-4 1 1.6524340216541808E-4 6 1.4583339433951297E-4
2 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/100.txt 2 0.724061485807234 4 0.13624729774423758 0 0.13546964196228636 9 0.0019436342339785994 5 4.5291919356563914E-4 8 4.490055982996677E-4 3 4.1653183421485213E-4 7 3.5270123154213927E-4 1 3.2232165301666123E-4 6 2.8446074162457316E-4
3 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/101.txt 2 0.7815231689893246 0 0.14798271520316794 9 0.023582384458063092 8 0.022251052243582908 1 0.022138209217973336 4 0.0011804626661380394 5 4.0343527385745457E-4 3 3.7102343418895774E-4 7 3.1416667687862693E-4 6 2.533818368250992E-
4 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/102.txt 6 0.6448245189567259 4 0.18612146979166502 3 0.16624873439661025 9 0.0012233726722317548 0 3.4467218590717303E-4 5 2.850788252495599E-4 8 2.8261550915084904E-4 2 2.446611421432842E-4 7 2.2199909869250053E-4 1 2.028774216237081E-
5 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/103.txt 8 0.7531586740033047 5 0.17839539108961253 0 0.06512376460651902 9 0.001282794040111701 4 8.746645156304241E-4 3 2.749100345664577E-4 2 2.5654476523149865E-4 7 2.327819863700214E-4 1 2.1273153572848481E-4 6 1.8774342292520802E-4
6 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/104.txt 7 0.9489502365148181 8 0.030091466847852504 4 0.017936457663121977 9 0.0013482824985091328 0 3.7986419553884905E-4 5 3.141861834124008E-4 3 2.889445824352445E-4 2 2.6964174000656E-4 1 2.2359178288566958E-4 6 1.9732799141958482E-4
7 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/105.txt 8 0.7339694064061175 7 0.1237041841318045 9 0.11889696041555338 0 0.02005288536233353 4 0.0014026751618923005 5 4.793786828705149E-4 3 4.408655780020889E-4 2 4.1141370625324785E-4 1 3.411516484151411E-4 6 3.0107890675777946E-4
8 file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/106.txt 5 0.37064909999661005 9 0.3613559917055785 0 0.14857567731040344 6 0.09545466082502917 4 0.022300625744661403 8 3.8725629469313333E-4 3 3.592484711785775E-4 2 3.3524900189121E-4 7 3.041961449432886E-4 1 2.779945050112539E-4
```

The output is a bit tricky to understand on its own so I did a bit of post processing using pandas and then ran the results of that through matplotlib to see the distribution of documents for different topics sizes with different stop words. You can see the script here.
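As a rough sketch of the kind of post processing involved (the real script is linked above; the column layout is assumed from the `head` output, and the rows here are made up), you can read the composition file into pandas and count how many documents fall into each top topic:

```python
# Sketch: parse a mallet composition file (doc id, file name, then
# alternating topic/proportion pairs) and count documents per top topic.
import io
import pandas as pd

# Two fake rows in the format shown by `head` above
data = io.StringIO(
    "0\tfile:/himym/1.txt\t0\t0.71\t9\t0.13\n"
    "1\tfile:/himym/10.txt\t2\t0.98\t9\t0.02\n"
)

df = pd.read_csv(data, sep="\t", header=None)
# mallet writes the pairs sorted by proportion (as in the output above),
# so column 2 holds each document's top topic
top_topic_counts = df[2].value_counts()
print(top_topic_counts.to_dict())
```

A uniform distribution of documents over topics would show up here as roughly equal counts; one dominant topic shows up as one very large count.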

I ended up with the following chart:

On the left hand side we’re using more stop words and on the right just the main ones. For most of the variations there are one or two topics which most documents belong to but interestingly the most uniform distribution seems to be when we have few topics.

These are the main words for the most popular topics on the left hand side:

15 topics

`8 0.50732 back give call made put find found part move show side ten mine top abby front fire full fianc`

20 topics

`12 0.61545 back give call made put find show found part move side mine top front ten full cry fire fianc`

30 topics

`22 0.713 back call give made put find show part found side move front ten full top mine fire cry bottom`

All contain more or less the same words which at first glance seem like quite generic words so I’m surprised they weren’t excluded.

On the right hand side we haven’t removed many words so we’d expect common words in the English language to dominate. Let’s see if they do:

10 topics

`1 3.79451 don yeah ll hey ve back time guys good gonna love god night wait uh thing guy great make`

15 topics

```
5 2.81543 good time love ll great man guy ve night make girl day back wait god life yeah years thing

10 1.52295 don yeah hey gonna uh guys didn back ve ll um kids give wow doesn thing totally god fine
```

20 topics

```
1 3.06732 good time love wait great man make day back ve god life years thought big give apartment people work

13 1.68795 don yeah hey gonna ll uh guys night didn back ve girl um kids wow guy kind thing baby
```

30 topics

```
14 1.42509 don yeah hey gonna uh guys didn back ve um thing ll kids wow time doesn totally kind wasn

24 2.19053 guy love man girl wait god ll back great yeah day call night people guys years home room phone

29 1.84685 good make ve ll stop time made nice put feel love friends big long talk baby thought things happy
```

Again we have similar words across each run and as expected they are all quite generic words.

My take away from this exploration is that I should vary the stop word percentages as well and see if that leads to an improved distribution.

Taking out very common words like we do with the left hand side charts seems to make sense although I need to work out why there’s a single outlier in each group.

The authors suggest that having the majority of our texts in a small number of topics means we need to create more of them so I will investigate that too.

The code is all on github along with the transcripts so give it a try and let me know what you think.

Written by Mark Needham

March 24th, 2015 at 10:33 pm

Posted in Machine Learning,Python

Tagged with

Python: scikit-learn/lda: Extracting topics from QCon talk abstracts

with one comment

Following on from Rik van Bruggen’s blog post on a QCon graph he’s created ahead of this week’s conference, I was curious whether we could extract any interesting relationships between talks based on their abstracts.

Talks are already grouped by their hosting track but there’s likely to be some overlap in topics even for talks on different tracks.
I therefore wanted to extract topics and connect each talk to the topic that describes it best.

My first attempt was to follow an example which uses Non-Negative Matrix Factorization. It worked very well for extracting topics but didn’t seem to provide an obvious way to link those topics to individual talks.

Instead I ended up looking at the lda library which uses Latent Dirichlet Allocation and allowed me to achieve both goals.

I already had some code to run TF/IDF over each of the talks so I thought I’d be able to feed the matrix output from that into the LDA function. This is what I started with:

```python
import csv

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from collections import defaultdict
from bs4 import BeautifulSoup, NavigableString
from soupselect import select

def uri_to_file_name(uri):
    return uri.replace("/", "-")

sessions = {}
with open("data/sessions.csv", "r") as sessions_file:
    reader = csv.reader(sessions_file, delimiter = ",")
    reader.next() # header
    for row in reader:
        session_id = int(row[0])
        filename = "data/sessions/" + uri_to_file_name(row[4])
        page = open(filename).read()
        soup = BeautifulSoup(page)
        abstract = select(soup, "div.brenham-main-content p")
        if abstract:
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3]}
        else:
            abstract = select(soup, "div.pane-content p")
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3]}

corpus = []
titles = []
for id, session in sorted(sessions.iteritems(), key=lambda t: int(t[0])):
    corpus.append(session["abstract"])
    titles.append(session["title"])

n_topics = 15
n_top_words = 50
n_features = 6000

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')
matrix = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()

import lda
import numpy as np

vocab = feature_names

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(matrix)
topic_word = model.topic_word_
n_top_words = 20

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
```

And if we run it?

```
Topic 0: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 1: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 2: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 3: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 4: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 5: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 6: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 7: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 8: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 9: 10 faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 10: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 11: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 12: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 13: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 14: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 15: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 16: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 17: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 18: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 19: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
```

As you can see, every topic has the same set of words which isn’t what we want. Let’s switch out our TF/IDF vectorizer for a simpler count based one:

`vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')`

The rest of the code stays the same and these are the topics that get extracted:

```
Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck
Topic 1: process development delivery platform developer continuous testing rapidly deployment implementing release demonstrate paas advice hard light predictable radically introduce
Topic 2: way open space kind people change meetings ll lead powerful practice times everyday simple qconlondon organization unconference track extraordinary
Topic 3: apache apis processing open spark distributed leading making environments solr cases brooklyn components existing ingestion contributing data target evolved
Topic 4: management million effective cost halo gameplay player billion ad catastrophic store microsoft final music influence information launch research purchased
Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently
Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 8: ve development teams use things world like time learned lessons think methods multiple story say customer developer experiences organisations
Topic 9: software building docker built challenges monitoring gilt application discuss solution decision talk download source center critical decisions bintray customers
Topic 10: years infrastructure tools language different service lot devops talk adoption scala popular clojure advantages introduced effectively looking wasn includes
Topic 11: high does latency session requirements functional performance real world questions problem second engineering patterns gravity explain discuss expected time
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going
Topic 13: need design systems large driven scale software applications slow protocol change needs approach gets new contracts solutions complicated distributed
Topic 14: architecture service micro architectures increasing talk microservices order market value values new present presents services scalable trading practices today
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 16: data services using traditional create ways support uk large user person complex systems production impact art organizations accessing mirage
Topic 17: agile team experience don work doing processes based key reach extra defined pressure machines nightmare practices learn goals guidance
Topic 18: internet new devices programming things iot big number deliver day connected performing growing got state thing provided times automated
Topic 19: cloud including deploy session api government security culture software type attack techniques environment digital secure microservice better creation interaction
```

Some of the groupings seem to make sense, e.g. Topic 11 contains words related to high performance code with low latency and Topic 15 covers Java, the JVM and other related words, but others are more difficult to decipher. For example, both Topic 14 and Topic 19 talk about micro services but the latter mentions ‘government’ and ‘security’, so perhaps the talks linked to that topic come at micro services from a different angle altogether.

Next let’s see which topics a talk is most likely to be about. We’ll look at the first ten:

```python
doc_topic = model.doc_topic_
for i in range(0, 10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
    print(doc_topic[i].argsort()[::-1][:3])
```

```
To the Moon (top topic: 8)
[ 8 0 11]
Evolutionary Architecture and Micro-Services - A Match Enabled by Continuous Delivery (top topic: 14)
[14 19 16]
How SoundCloud uses Cassandra (top topic: 0)
[0 6 5]
DevOps and the Need for Speed (top topic: 18)
[18 5 16]
Neuro-diversity and agile (top topic: 7)
[17 7 2]
Java 8 in Anger (top topic: 7)
[ 7 15 12]
APIs that Change Lifestyles (top topic: 9)
[ 9 6 19]
Elasticsearch powers the Citizen Advice Bureau (CAB) to monitor trends in society before they become issues (top topic: 16)
[16 12 19]
Architecture Open Space (top topic: 2)
[ 2 19 18]
Don’t let Data Gravity crush your infrastructure (top topic: 11)
[11 16 3]
```

So our third talk on the list ‘How SoundCloud uses Cassandra’ does end up being tagged with topic 0 which mentions SoundCloud so that’s good!

`Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck`

Its next two topics are 5 & 6, which contain the following words…

```
Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently
```

…which are not as intuitive. What about Java 8 in Anger? It’s been tagged with topics 7, 15 and 12:

```
Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going
```

15 makes sense since it mentions Java, and perhaps 12 and 7 do too, as both mention developers.

So while the topics pulled out are not horrendous I don’t think they’re particularly useful yet either. These are some of the areas I need to do some more research around:

• How do you measure the success of topic modelling? I’ve been eyeballing the output of the algorithm but I imagine there’s an automated way to do that.
• How do you determine the right number of topics? I found an article written by Christophe Grainger which explains a way of doing that which I need to look at in more detail.
• It feels like I would be able to pull out better topics if I had an ontology of computer science/software words and then ran the words through that to derive topics.
• Another approach suggested by Michael is to find the most popular words using the CountVectorizer and tag talks with those instead.

If you have any suggestions let me know. The full code is on github if you want to play around with it.

Written by Mark Needham

March 5th, 2015 at 8:52 am

Posted in Machine Learning,Python

Tagged with

Kaggle Titanic: Python pandas attempt

with one comment

Nathan and I have been looking at Kaggle’s Titanic problem and while working through the Python tutorial Nathan pointed out that we could greatly simplify the code if we used pandas instead.

The problem we had with numpy is that you use integers to reference columns. We spent a lot of time being thoroughly confused as to why something wasn’t working only to realise we were using the wrong column.

The algorithm scores an accuracy of 77.99% and the way it works is that we build a ‘survival table’ which works out an average survival rate of people based on 3 features:

• Passenger Class
• Passenger Fare (grouped into those who paid 0-9, 10-19, 20-29, 30+)
• Gender

It looks like this:

And the code that creates that is:

```python
import pandas as pd

def addrow(df, row):
    return df.append(pd.DataFrame(row), ignore_index=True)

def fare_in_bucket(fare, fare_bracket_size, bucket):
    return (fare > bucket * fare_bracket_size) & (fare <= ((bucket+1) * fare_bracket_size))

def build_survival_table(training_file):
    fare_ceiling = 40
    train_df = pd.read_csv(training_file)
    # cap high fares into the top bracket - note we assign to the 'Fare'
    # column only, not the whole row
    train_df.loc[train_df['Fare'] >= 39.0, 'Fare'] = 39.0
    fare_bracket_size = 10
    number_of_price_brackets = fare_ceiling / fare_bracket_size
    number_of_classes = 3 # There were 1st, 2nd and 3rd classes on board

    survival_table = pd.DataFrame(columns=['Sex', 'Pclass', 'PriceDist', 'Survived', 'NumberOfPeople'])

    for pclass in range(1, number_of_classes + 1): # add 1 to handle 0 start
        for bucket in range(0, number_of_price_brackets):
            for sex in ['female', 'male']:
                survival = train_df[(train_df['Sex'] == sex)
                                    & (train_df['Pclass'] == pclass)
                                    & fare_in_bucket(train_df["Fare"], fare_bracket_size, bucket)]

                row = [dict(Pclass=pclass, Sex=sex, PriceDist=bucket,
                            Survived=round(survival['Survived'].mean()),
                            NumberOfPeople=survival.count()[0])]
                survival_table = addrow(survival_table, row)

    return survival_table.fillna(0)

survival_table = build_survival_table("train.csv")
```

where ‘train.csv’ is structured like so:

```
$ head -n5 train.csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
```
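As an aside, the same sort of survival table can be sketched much more compactly with a groupby. This is a rough equivalent rather than the code we submitted, and the rows here are made up:

```python
# Sketch: derive the survival table with a single groupby instead of
# nested loops. Bucketing and rounding mirror the code above.
import pandas as pd

train_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 3],
    "Fare": [71.28, 53.1, 7.25, 8.05],
    "Survived": [1, 1, 0, 1],
})

# cap fares at 39 and bucket into 10-unit brackets (0-9 -> 0, 10-19 -> 1, ...)
train_df["PriceDist"] = (train_df["Fare"].clip(upper=39) // 10).astype(int)
survival_table = (train_df
                  .groupby(["Sex", "Pclass", "PriceDist"])["Survived"]
                  .mean().round().reset_index())
print(survival_table)
```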

After we’ve built that we iterate through the test data set and look up each person in the table and find their survival rate.

```python
def select_bucket(fare):
    if (fare >= 0 and fare < 10):
        return 0
    elif (fare >= 10 and fare < 20):
        return 1
    elif (fare >= 20 and fare < 30):
        return 2
    else:
        return 3

def calculate_survival(survival_table, row):
    survival_row = survival_table[(survival_table["Sex"] == row["Sex"])
                                  & (survival_table["Pclass"] == row["Pclass"])
                                  & (survival_table["PriceDist"] == select_bucket(row["Fare"]))]
    return int(survival_row["Survived"].iat[0])

test_df = pd.read_csv('test.csv')
test_df["Survived"] = test_df.apply(lambda row: calculate_survival(survival_table, row), axis=1)
```

I wrote up the difficulties we had working out how to append the ‘Survived’ column if you want more detail.

‘test.csv’ looks like this:

```
$ head -n5 test.csv
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
```

We then write out the survival value for each customer along with their ID:

`test_df.to_csv("result.csv", cols=['PassengerId', 'Survived'], index=False)`
```
$ head -n5 result.csv
PassengerId,Survived
892,0
893,1
894,0
895,0
```

I’ve pasted the code as a gist for those who want to see it all as one.

Next step: introduce some real machine learning, probably using scikit-learn unless there’s something else we should be using?

Written by Mark Needham

October 30th, 2013 at 7:26 am

Posted in Machine Learning

Tagged with ,

Feature Extraction/Selection – What I’ve learnt so far

with one comment

A couple of weeks ago I wrote about some feature extraction work that I’d done on the Kaggle Digit Recognizer data set and having realised that I had no idea what I was doing I thought I should probably learn a bit more.

I came across Dunja Mladenic’s ‘Dimensionality Reduction by Feature Selection in Machine Learning‘ presentation in which she sweeps across the landscape of feature selection and explains how everything fits together.

The talk starts off by going through some reasons that we’d want to use dimensionality reduction/feature selection:

• Improve the prediction performance
• Improve learning efficiency
• Provide faster predictors possibly requesting less information on the original data
• Reduce complexity of the learned results, enable better understanding of the underlying process

Mladenic suggests that there are a few ways we can go about reducing the dimensionality of data:

• Selecting a subset of the original features
• Constructing features to replace the original features
• Using background knowledge to construct new features to be used in addition to the original features

The talk focuses on the first of these and a lot of it focuses on how we can go about using feature selection as a pre-processing step on our data sets.

The approach seems to involve either starting with all the features and removing them one at a time and seeing how the outcome is affected or starting with none of the features and adding them one at a time.
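The second variant (start with nothing and add features one at a time) can be sketched as a greedy loop. This is an illustration of the idea rather than anything from the talk; the classifier, data set and scoring are placeholders:

```python
# Sketch: greedy forward feature selection. Start with no features and
# repeatedly add whichever remaining feature improves the CV score most,
# stopping when nothing helps.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected = []
remaining = list(range(n_features))
best_score = 0.0

while remaining:
    scores = {
        f: cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:
        break  # no remaining feature improves the score
    selected.append(f)
    remaining.remove(f)
    best_score = score

print(selected, round(best_score, 3))
```

Backward elimination is the mirror image: start with every feature and repeatedly drop the one whose removal hurts the score least.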

However, about half way through the talk Mladenic points out that some algorithms actually have feature selection built into them so there’s no need to have the pre-processing step.

I think this is the case with random forests of decision trees, because the decision trees are constructed by taking into account which features give the greatest information gain, so low-impact features are less likely to be used.

I previously wrote a blog post describing how I removed all the features with zero variance from the data set and after submitting a random forest trained on the new data set we saw no change in accuracy which proved the point.
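We can see this directly in scikit-learn: a random forest exposes per-feature importances, and a zero-variance feature gets an importance of zero, which is consistent with dropping it having no effect on accuracy. A small sketch with synthetic data:

```python
# Sketch: a constant (zero variance) feature gets zero importance in a
# random forest, since a split on it can never reduce impurity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
X[:, 2] = 1.0                      # zero-variance feature
y = (X[:, 0] > 0.5).astype(int)    # label depends on feature 0 only

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_.round(3))
```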

I also came across an interesting paper by Isabelle Guyon & Andre Elisseeff titled ‘An Introduction to Variable and Feature Selection‘ which has a flow chart-ish set of questions to help you work out where to start.

One of the things I picked up from reading this paper is that if you have domain knowledge then you might be able to construct a better set of features by making use of this knowledge.

Another suggestion is to come up with a variable ranking for each feature i.e. how much that feature contributes to the outcome/prediction. This is something also suggested in the Coursera Data Analysis course and in R we can use the glm function to help work this out.

The authors also point out that we should separate the problem of model selection (i.e. working out which features to use) from the problem of testing our classifier.

To test the classifier we’d most likely keep a test set aside but we shouldn’t use this data for testing feature selection, rather we should use the training data. Cross validation probably works best here.

There’s obviously more covered in the presentation & paper than what I’ve covered here but I’ve found that in general the material I’ve come across tends to drift towards being quite abstract/theoretical and therefore quite difficult for me to follow.

If anyone has come across any articles/books which explain how to go about feature selection using an example I’d love to read it/them!

Written by Mark Needham

February 10th, 2013 at 3:42 pm

Posted in Machine Learning

Tagged with

Kaggle Digit Recognizer: A feature extraction #fail

I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction in this context means that we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try and persuade me that we should try it out but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

5. The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I’ve highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non-zero pixels, the number of 255-valued pixels, the mean pixel value, and the mean of the middle pixels of each digit.

```
initial <- read.csv("train.csv", header = TRUE)

# drop the label column so it doesn't leak into the derived features
pixels <- initial[, -1]

initial$nonZeros <- apply(pixels, 1, function(entries) sum(entries != 0))
initial$fullHouses <- apply(pixels, 1, function(entries) sum(entries == 255))
initial$meanPixels <- rowMeans(pixels)
initial$middlePixels <- rowMeans(pixels[, 200:500])
```

I then wrote those features out into a CSV file like so:

```
newFeatures <- subset(initial, select = c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(file = "feature-extraction.txt", newFeatures, row.names = FALSE, sep = ",")
```

I then created a 100 tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.

Unfortunately the accuracy on the cross validation set (10% of the training data) was only 24%, which is pretty useless, so it’s back to the drawing board!
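For reference, the same evaluation setup (a 100-tree random forest with 10% of the training data held out) can be sketched in Python with scikit-learn rather than Mahout. This uses made-up stand-in data, since it's only meant to show the shape of the experiment:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in for the four derived features (nonZeros, fullHouses, meanPixels, middlePixels)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 10, size=1000)  # ten digit classes

# hold out 10% for cross-checking, mirroring the setup in the post
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
accuracy = forest.score(X_val, y_val)
print(round(accuracy, 3))
```

With features this weak the forest can't do much better than chance, which is roughly what the 24% result was telling us about the derived features.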

Our next task is to work out whether we can derive features which have a stronger correlation with the label values, or whether combining the new features with the existing pixel values has any impact.
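As a quick sanity check along those lines, we could compare each candidate feature's correlation with the label before bothering to train anything. A hedged numpy sketch on made-up data (treating the digit labels as numbers, which is only a rough heuristic for a nominal class):

```python
import numpy as np

rng = np.random.default_rng(7)
label = rng.integers(0, 10, size=500).astype(float)

# hypothetical stand-ins: one feature that tracks the label, one that is pure noise
useful = label + rng.normal(scale=2.0, size=500)
useless = rng.normal(size=500)

# absolute Pearson correlation of each candidate feature with the label
correlations = {name: abs(np.corrcoef(feature, label)[0, 1])
                for name, feature in [("useful", useful), ("useless", useless)]}
print(correlations)
```

A feature whose correlation is indistinguishable from zero probably isn't worth feeding to the classifier on its own.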

As you can probably tell, I don’t really understand how you should go about extracting features, so if anybody has ideas or papers/articles I can read to learn more, please let me know in the comments!

Written by Mark Needham

January 31st, 2013 at 11:24 pm

Posted in Machine Learning
