## Archive for the ‘scikit-learn’ tag

## scikit-learn: Random forests – Feature Importance

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:

### Prerequisites

```python
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# We'll use this library to make the display pretty
from tabulate import tabulate
```

### Load Data

```python
train = pd.read_csv('train.csv')

# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()
```

### Split train/test sets

```python
y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
```

### Train model

```python
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
```
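As an aside, one quick way to sanity-check the overfitting theory from earlier is to compare the model’s R² score on the training set against the held-out test set. This is a sketch on synthetic data (we don’t have `train.csv` here), not code from the original experiment:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# fake numeric features and a noisy target standing in for the real dataset
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = X[:, 0] * 10 + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

clf = RandomForestRegressor(n_estimators=100, random_state=42)
model = clf.fit(X_train, y_train)

# a big gap between these two R^2 scores is a sign of overfitting
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("train R^2: %.3f, test R^2: %.3f" % (train_score, test_score))
```

With the noisy target the forest fits the training data much better than the test data, which is the same pattern as a good local score and a poor leaderboard score.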

### Feature Importance

```python
headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
```

```
name                 score
OverallQual       0.553829
GrLivArea         0.131
BsmtFinSF1        0.0374779
TotalBsmtSF       0.0372076
1stFlrSF          0.0321814
GarageCars        0.0226189
GarageArea        0.0215719
LotArea           0.0214979
YearBuilt         0.0184556
2ndFlrSF          0.0127248
YearRemodAdd      0.0126581
WoodDeckSF        0.0108077
OpenPorchSF       0.00945239
LotFrontage       0.00873811
TotRmsAbvGrd      0.00803121
GarageYrBlt       0.00760442
BsmtUnfSF         0.00715158
MasVnrArea        0.00680341
ScreenPorch       0.00618797
Fireplaces        0.00521741
OverallCond       0.00487722
MoSold            0.00461165
MSSubClass        0.00458496
BedroomAbvGr      0.00253031
FullBath          0.0024245
YrSold            0.00211638
HalfBath          0.0014954
KitchenAbvGr      0.00140786
BsmtFullBath      0.00137335
BsmtFinSF2        0.00107147
EnclosedPorch     0.000951266
3SsnPorch         0.000501238
PoolArea          0.000261668
LowQualFinSF      0.000241304
BsmtHalfBath      0.000179506
MiscVal           0.000154799
```

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.

These are the next things I’m going to explore:

- Make the categorical variables numeric (perhaps by using one hot encoding for some of them)
- Remove the least predictive features and build a model that only uses the remaining ones
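For the first idea, pandas already has a helper for one hot encoding. A minimal sketch (the column and values below are made up for illustration, not taken from the competition data):

```python
import pandas as pd

# hypothetical mini version of the dataset
df = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
                   "SalePrice": [208500, 181500, 223500]})

# get_dummies replaces each categorical column with one 0/1 column per level
encoded = pd.get_dummies(df, columns=["Neighborhood"])
print(encoded.columns.tolist())
```

Each level of the categorical variable becomes its own numeric column, which the random forest can then consume alongside the existing numeric features.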

## scikit-learn: First steps with log_loss

Over the last week I’ve spent a little bit of time playing around with the data in the Kaggle TalkingData Mobile User Demographics competition, and came across a notebook written by dune_dweller showing how to run a logistic regression algorithm on the dataset.

The metric used to evaluate the output in this competition is multi class logarithmic loss, which is implemented by the log_loss function in the scikit-learn library.

I’ve not used it before so I created a small example to get to grips with it.

Let’s say we have 3 rows to predict and we happen to know that they should be labelled ‘bam’, ‘ham’, and ‘spam’ respectively:

```python
>>> actual_labels = ["bam", "ham", "spam"]
```

To work out the log loss score we need to make a prediction for what we think each label actually is. We do this by passing an array containing a probability between 0 and 1 for each label.

e.g. if we think the first label is definitely ‘bam’ then we’d pass [1, 0, 0], whereas if we thought it had a 50-50 chance of being ‘bam’ or ‘spam’ then we might pass [0.5, 0, 0.5]. As far as I can tell the values get sorted into (alphabetical) order so we need to provide our predictions in the same order.
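We can check that ordering assumption by passing the `labels` parameter to `log_loss` explicitly, which should give the same score as letting it sort the labels itself:

```python
from sklearn.metrics import log_loss

actual_labels = ["bam", "ham", "spam"]

# by default the label order is the sorted unique labels: bam, ham, spam
implicit = log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# passing labels= makes the assumed ordering explicit
explicit = log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                    labels=["bam", "ham", "spam"])
print(implicit, explicit)
```

Being explicit about the ordering is a useful habit once the label set gets bigger than a toy example.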

Let’s give it a try. First we’ll import the function:

```python
>>> from sklearn.metrics import log_loss
```

Now let’s see what score we get if we make a perfect prediction:

```python
>>> log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
2.1094237467877998e-15
```

What about if we make a completely wrong prediction?

```python
>>> log_loss(actual_labels, [[0, 0, 1], [1, 0, 0], [0, 1, 0]])
34.538776394910684
```

We can reverse engineer this score to work out the probability that we’ve predicted the correct class.

The average log loss exceeds 1 when log(p_ij) < -1, where i is the true class. That means the predicted probability for the correct class is less than exp(-1), or around 0.368. So a log loss greater than 1 is to be expected whenever your model assigns less than ~37% probability to the correct class.
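We can verify that threshold directly: a per-row log loss of exactly 1 corresponds to predicting exp(-1) for the true class.

```python
import math

p = math.exp(-1)         # ~0.3679, the probability corresponding to a loss of 1
row_loss = -math.log(p)  # per-row log loss if we assign p to the correct class
print(p, row_loss)
```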

This is the formula for log loss:

    logloss = -1/N * Σ_i Σ_j y_ij * log(p_ij)

in which y_ij is 1 for the correct class and 0 for the other classes, and p_ij is the probability assigned to that class.

The interesting thing about this formula is that we only care about the correct class. The y_{ij} value of 0 cancels out the wrong classes.

In our two examples so far we actually already know the probability estimate for the correct class – 100% in the first case and 0% in the second case, but we can plug in the numbers to check we end up with the same result.

First we need to work out what value would have been passed to the log function, which is easy in this case:

```python
# every prediction exactly right
>>> import math
>>> math.log(1)
0.0
>>> math.exp(0)
1.0
```

```python
# every prediction completely wrong
>>> math.log(0.000000001)
-20.72326583694641
>>> math.exp(-20.72326583694641)
1.0000000000000007e-09
```

I used a really small value instead of 0 in the second example because math.log(0) is undefined – the log function tends towards negative infinity as its argument approaches 0.
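scikit-learn’s log_loss handles this by clipping probabilities away from 0 internally. A tiny sketch of the same trick (`clipped_log` is my own helper name), which also explains the 34.54 score we saw for the completely wrong prediction – that’s -log of a probability clipped to 1e-15:

```python
import math

# clip the probability away from 0 before taking the log
def clipped_log(p, eps=1e-15):
    return math.log(max(p, eps))

print(clipped_log(0))    # a large negative number rather than a ValueError
```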

Let’s try another example where we have less certainty:

```python
>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673
```

We’ll have to do a bit more work to figure out what value was being passed to the log function this time, but not too much. This is roughly the calculation being performed:

```python
# 0.363548039673 = -1/3 * (log(0.8) + log(0.6) + log(0.7))
>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673
```

In this case, on average our probability estimate would be:

```python
# we put in the negative value since we multiplied by -1/N
>>> math.exp(-0.363548039673)
0.6952053289772744
```

We had 60%, 70%, and 80% accuracy for our 3 labels so an overall probability of 69.5% seems about right.
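Another way to see that 69.5% figure: taking exp of the mean of the logs is the same as taking the geometric mean of the per-row probabilities for the correct class.

```python
# geometric mean of the correct-class probabilities 0.8, 0.6 and 0.7
geometric_mean = (0.8 * 0.6 * 0.7) ** (1 / 3.0)
print(geometric_mean)
```

This matches the `math.exp(-0.363548039673)` value above, so log loss is effectively reporting a (geometric) average confidence in the correct answers.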

One more example. This time we’ll make one more very certain (90%) prediction for ‘spam’:

```python
>>> print log_loss(["bam", "ham", "spam", "spam"],
...                [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]])
0.299001158669
>>> math.exp(-0.299001158669)
0.741558550213609
```

74% accuracy overall, sounds about right!

## scikit-learn: Clustering and the curse of dimensionality

In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

> Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.
>
> But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”).
>
> Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def distances(a, b):
    return np.linalg.norm(a - b), cosine_similarity([a, b])[0][1]

def mixed(n_zeros, n_ones):
    return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0)

def ones(n_ones):
    return np.repeat([1], n_ones)

print distances(mixed(2, 2), ones(4))
print distances(mixed(3, 3), ones(6))
print distances(mixed(50, 50), ones(100))
print distances(mixed(300, 300), ones(600))

(1.4142135623730951, 0.70710678118654746)
(1.7320508075688772, 0.70710678118654768)
(7.0710678118654755, 0.70710678118654757)
(17.320508075688775, 0.70710678118654746)
```

The euclidean distance for the 600 item vector is 17x larger than for the one containing 4 items despite having the same similarity score.
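The pattern is just Pythagoras: for 0/1 vectors the euclidean distance is the square root of the number of mismatching elements, so it keeps growing with dimensionality even though the proportion of matching elements (and hence the cosine similarity) stays fixed.

```python
import math

# half the elements mismatch in each pair above: 2, 3, 50 and 300 mismatches
for mismatches in [2, 3, 50, 300]:
    print(mismatches, math.sqrt(mismatches))
```

These are exactly the four euclidean distances printed by the script above.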

Having convinced myself that reducing the dimensionality of the vectors could make a difference, I reduced the size of the episodes vectors using the Truncated SVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episodes vectors:

```python
from sklearn.decomposition import TruncatedSVD

n_components = 2
reducer = TruncatedSVD(n_components=n_components)
reducer.fit(all)
new_all = reducer.transform(all)
print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum()))

2: Percentage explained: 0.124579183633
```

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would be an interesting place to start. I’m not sure exactly what the output of the reducer.explained_variance_ratio_ attribute means, so I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.
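One way to explore that question is to sweep n_components and watch the cumulative explained variance ratio – it reports what fraction of the data’s variance the retained components account for, and it can only grow as components are added. A sketch on random data standing in for the real 60 x 638 episodes matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# random 0/1 matrix with the same shape as the episodes matrix
rng = np.random.RandomState(42)
episodes = rng.randint(0, 2, size=(60, 638)).astype(float)

ratios = {}
for n_components in [2, 5, 10, 20]:
    reducer = TruncatedSVD(n_components=n_components, random_state=42)
    reducer.fit(episodes)
    ratios[n_components] = reducer.explained_variance_ratio_.sum()
    print(n_components, ratios[n_components])
```

On the real matrix, plotting this curve and looking for an elbow is a common way to pick the number of dimensions.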

For now though let’s try out the clustering algorithm again and see how it gets on:

```python
from sklearn.cluster import KMeans
from sklearn import metrics

for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(new_all)
    silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000)
    print n_clusters, silhouette_avg

2 0.559681096025
3 0.498456585461
4 0.524704352941
5 0.441580592398
6 0.44703058946
7 0.447895331824
8 0.433698007009
9 0.459874485986
```

This time our silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score: roughly, 0.71–1.0 indicates strong structure, 0.51–0.70 a reasonable structure, 0.26–0.50 a weak and possibly artificial structure, and anything below 0.25 no substantial structure.

We have a couple of cluster sizes which fit in the ‘reasonable structure’ band and a few just on the edge of that category.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Any questions/thoughts/advice do let me know in the comments.

## scikit-learn: Trying to find clusters of Game of Thrones episodes

In my last post I showed how to find similar Game of Thrones episodes based on the characters that appear in different episodes. This allowed us to find similar episodes on an episode by episode basis, but I was curious whether there were **groups of similar episodes** that we could identify.

scikit-learn provides several clustering algorithms that can run over our episode vectors and hopefully find clusters of similar episodes. A clustering algorithm groups similar documents together, where similarity is based on calculating a ‘distance’ between documents. Documents separated by a small distance would be in the same cluster, whereas if there’s a large distance between episodes then they’d probably be in different clusters.

The simplest variant is K-means clustering:

> The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.

The output from the algorithm is a list of labels which correspond to the cluster assigned to each episode.

Let’s give it a try on the Game of Thrones episodes. We’ll start from the 2 dimensional array of episodes/character appearances that we created in the previous post.

```python
>>> all.shape
(60, 638)

>>> all
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
```

We have a 60 (episodes) x 638 (characters) array which we can now plug into the K-means clustering algorithm:

```python
>>> from sklearn.cluster import KMeans
>>> n_clusters = 3
>>> km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
>>> cluster_labels = km.fit_predict(all)
>>> cluster_labels
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
```

cluster_labels is an array containing a label for each episode in the all array. The spread of these labels is as follows:

```python
>>> import numpy as np
>>> np.bincount(cluster_labels)
array([19, 12, 29])
```

i.e. 19 episodes in cluster 0, 12 in cluster 1, and 29 in cluster 2.

### How do we know if the clustering is any good?

Ideally we’d have some labelled training data which we could compare our labels against, but since we don’t we can measure the effectiveness of our clustering by calculating inter-centroidal separation and intra-cluster variance.

i.e. how close are the episodes to other episodes in the same cluster vs how close are they to episodes in the closest different cluster.

scikit-learn gives us a function that we can use to calculate this score – the silhouette coefficient.

The output of this function is a score between -1 and 1.

- A score of 1 means that our clustering has worked well and a document is far away from the boundary of another cluster.
- A score of -1 means that our document should have been placed in another cluster.
- A score of 0 means that the document is very close to the decision boundary between two clusters.
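Those three cases fall out of the silhouette formula, s = (b - a) / max(a, b), where a is the mean distance from a sample to the other points in its own cluster and b is the mean distance to points in the nearest other cluster. A tiny illustration (`silhouette` here is my own toy function, not the scikit-learn one, which works on whole arrays):

```python
# s = (b - a) / max(a, b) for a single sample
def silhouette(a, b):
    return (b - a) / max(a, b)

print(silhouette(1.0, 1.0))   # equidistant -> on the boundary between clusters
print(silhouette(0.1, 2.0))   # tight, well separated cluster -> near 1
print(silhouette(2.0, 0.1))   # probably in the wrong cluster -> near -1
```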

I tried calculating this coefficient for some different values of K. This is what I found:

```python
from sklearn import metrics

for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(all)
    silhouette_avg = metrics.silhouette_score(all, cluster_labels, sample_size=1000)
    sample_silhouette_values = metrics.silhouette_samples(all, cluster_labels)
    print n_clusters, silhouette_avg

2 0.0798610142955
3 0.0648416081725
4 0.0390877994786
5 0.020165277756
6 0.030557856406
7 0.0389677156458
8 0.0590721834989
9 0.0466170527996
```

The best score we manage here is 0.07 when we set the number of clusters to 2. Even our highest score is much lower than the lowest score on the documentation page!

I tried it out with some higher values of K but only saw a score over 0.5 once I put the number of clusters to 40 which would mean 1 or 2 episodes per cluster at most.

At the moment our episode arrays contain 638 elements so they’re too long to visualise on a 2D silhouette plot. We’d need to apply a dimensionality reduction algorithm before doing that.

In summary it looks like character co-occurrence isn’t a good way to cluster episodes. I’m curious what would happen if we flip the array on its head and try and cluster the characters instead, but that’s for another day.

If anyone spots anything that I’ve missed when reading the output of the algorithm let me know in the comments. I’m just learning by experimentation at the moment.

## Python/scikit-learn: Detecting which sentences in a transcript contain a speaker

Over the past couple of months I’ve been playing around with How I met your mother transcripts and the most recent thing I’ve been working on is how to extract the speaker for a particular sentence.

This initially seemed like a really simple problem as most of the initial sentences I looked at were structured like this:

```
<speaker>: <sentence>
```

If they were all in that format then we could write a simple regular expression and move on, but unfortunately they aren’t. We could probably write a more complex regex to pull out the speaker but I thought it’d be fun to see if I could train a model to work it out instead.

The approach I’ve taken is derived from an example in the NLTK book.

The first problem with this approach was that I didn’t have any labelled data to work with so I wrote a little web application that made it easy for me to train chunks of sentences at a time:

I stored the trained words in a JSON file. Each entry looks like this:

```python
import json

with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)

>>> json_data[0]
{u'words': [{u'word': u'You', u'speaker': False}, {u'word': u'ca', u'speaker': False},
            {u'word': u"n't", u'speaker': False}, {u'word': u'be', u'speaker': False},
            {u'word': u'friends', u'speaker': False}, {u'word': u'with', u'speaker': False},
            {u'word': u'Robin', u'speaker': False}, {u'word': u'.', u'speaker': False}]}

>>> json_data[1]
{u'words': [{u'word': u'Robin', u'speaker': True}, {u'word': u':', u'speaker': False},
            {u'word': u'Well', u'speaker': False}, {u'word': u'...', u'speaker': False},
            {u'word': u'it', u'speaker': False}, {u'word': u"'s", u'speaker': False},
            {u'word': u'a', u'speaker': False}, {u'word': u'bit', u'speaker': False},
            {u'word': u'early', u'speaker': False}, {u'word': u'...', u'speaker': False},
            {u'word': u'but', u'speaker': False}, {u'word': u'...', u'speaker': False},
            {u'word': u'of', u'speaker': False}, {u'word': u'course', u'speaker': False},
            {u'word': u',', u'speaker': False}, {u'word': u'I', u'speaker': False},
            {u'word': u'might', u'speaker': False}, {u'word': u'consider', u'speaker': False},
            {u'word': u'...', u'speaker': False}, {u'word': u'I', u'speaker': False},
            {u'word': u'moved', u'speaker': False}, {u'word': u'here', u'speaker': False},
            {u'word': u',', u'speaker': False}, {u'word': u'let', u'speaker': False},
            {u'word': u'me', u'speaker': False}, {u'word': u'think', u'speaker': False},
            {u'word': u'.', u'speaker': False}]}
```

Each word in the sentence is represented by a JSON object which also indicates if that word was a speaker in the sentence.

### Feature selection

Now that I’ve got some trained data to work with I needed to choose which features I’d use to train my model.

One of the most obvious indicators that a word is the speaker in the sentence is that the next word is ‘:’ so ‘next word’ can be a feature. I also went with ‘previous word’ and the word itself for my first cut.

This is the function I wrote to convert a word in a sentence into a set of features:

```python
def pos_features(sentence, i):
    features = {}
    features["word"] = sentence[i]
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
    return features
```

Let’s try a couple of examples:

```python
>>> import nltk
>>> pos_features(nltk.word_tokenize("Robin: Hi Ted, how are you?"), 0)
{'prev-word': '<START>', 'word': 'Robin', 'next-word': ':'}
>>> pos_features(nltk.word_tokenize("Robin: Hi Ted, how are you?"), 5)
{'prev-word': ',', 'word': 'how', 'next-word': 'are'}
```

Now let’s run that function over our full set of labelled data:

```python
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)

tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))
```

Here’s a sample of the contents of featuresets:

```python
>>> featuresets[:5]
[({'prev-word': '<START>', 'word': u'You', 'next-word': u'ca'}, False),
 ({'prev-word': u'You', 'word': u'ca', 'next-word': u"n't"}, False),
 ({'prev-word': u'ca', 'word': u"n't", 'next-word': u'be'}, False),
 ({'prev-word': u"n't", 'word': u'be', 'next-word': u'friends'}, False),
 ({'prev-word': u'be', 'word': u'friends', 'next-word': u'with'}, False)]
```

It’s nearly time to train our model, but first we need to split our labelled data into training and test sets so we can see how well the model performs on data it hasn’t seen before. scikit-learn has a function that does this for us:

```python
from sklearn.cross_validation import train_test_split

train_data, test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)

>>> len(train_data)
9480
>>> len(test_data)
2370
```

Now let’s train our model. I decided to try out Naive Bayes and Decision tree models to see how they got on:

```python
>>> classifier = nltk.NaiveBayesClassifier.train(train_data)
>>> print nltk.classify.accuracy(classifier, test_data)
0.977215189873

>>> classifier = nltk.DecisionTreeClassifier.train(train_data)
>>> print nltk.classify.accuracy(classifier, test_data)
0.997046413502
```

It looks like both are doing a good job here, with the decision tree doing slightly better. One thing to keep in mind is that most of the sentences we’ve trained on are in the ‘<speaker>: <sentence>’ form, so the models may simply be learning that pattern.

If we explore the internals of the decision tree we’ll see that it’s massively overfitting which makes sense given our small training data set and the repetitiveness of the data:

```python
>>> print(classifier.pseudocode(depth=2))
if next-word == u'!': return False
if next-word == u'$': return False
...
if next-word == u"'s": return False
if next-word == u"'ve": return False
if next-word == u'(':
  if word == u'!': return False
  ...
if next-word == u'*': return False
if next-word == u'*****': return False
if next-word == u',':
  if word == u"''": return False
  ...
if next-word == u'--': return False
if next-word == u'.': return False
if next-word == u'...':
  ...
  if word == u'who': return False
  if word == u'you': return False
if next-word == u'/i': return False
if next-word == u'1': return True
...
if next-word == u':':
  if prev-word == u"'s": return True
  if prev-word == u',': return False
  if prev-word == u'...': return False
  if prev-word == u'2030': return True
  if prev-word == '<START>': return True
  if prev-word == u'?': return False
...
if next-word == u'\u266a\u266a': return False
```

One update I may make to the features is to include the part of speech of the word rather than its actual value to see if that makes the model a bit more general. Another option is to train a bunch of decision trees against a subset of the data and build an ensemble/random forest of those trees.
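A hedged sketch of the part-of-speech idea (`pos_tag_features` is my own hypothetical variant of `pos_features`, not code from this post): it takes a sentence that has already been tagged, e.g. by `nltk.pos_tag`, and builds features from the tags rather than the raw words.

```python
# build features from POS tags so the feature space generalises beyond exact words
def pos_tag_features(tagged_sentence, i):
    tags = [tag for _, tag in tagged_sentence]
    features = {"pos": tags[i]}
    features["prev-pos"] = "<START>" if i == 0 else tags[i - 1]
    features["next-pos"] = "<END>" if i == len(tagged_sentence) - 1 else tags[i + 1]
    return features

# (word, tag) pairs in the shape nltk.pos_tag would produce
tagged = [("Robin", "NNP"), (":", ":"), ("Hi", "NNP"), ("Ted", "NNP")]
features = pos_tag_features(tagged, 0)
print(features)
```

A proper-noun followed by a ‘:’ tag should then fire for any speaker name, not just the names seen in training.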

Once I’ve got a working ‘speaker detector’ I want to then go and **work out who the likely speaker is** for the sentences which don’t contain a speaker. The plan is to calculate the word distributions of the speakers from sentences I do have and then calculate the probability that they spoke the unlabelled sentences.

This might not work perfectly as there could be new characters in those episodes but hopefully we can come up with something decent.

The full code for this example is on github if you want to have a play with it.

Any suggestions for improvements are always welcome in the comments.