Mark Needham

Thoughts on Software Development

Archive for the ‘kaggle’ tag

scikit-learn: Random forests – Feature Importance

without comments

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:


import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# We'll use this library to make the display pretty
from tabulate import tabulate

Load Data

train = pd.read_csv('train.csv')
# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

Split train/test sets

y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Train model

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model =, y_train)

Feature Importance

headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them)
  • Remove the most predictive features and build a model that only uses the other features

Written by Mark Needham

June 16th, 2017 at 5:55 am

Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values

without comments

I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

import pandas as pd
train = pd.read_csv('train.csv')
>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes ‘Linear feet of street connected to property’. There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]
>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn‘s linear regression model which only works with numeric values so we need to convert our categorical variables into numeric equivalents. We can use pandas get_dummies function for this.

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)
>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig and then we need to add these new columns onto the original DataFrame. We can use pandas concat function to do this.

import numpy as np
data = pd.concat([
    ], axis=1).select_dtypes(include=[np.number])
>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
lr = linear_model.LinearRegression()
model =, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 

Hmm that didn’t work too well – an R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score
>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))

whereas if we had all of the values correct we’d get a score of 1:

>>> print(r2_score(y_test, y_test))

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of LotArea so perhaps that’s the best we can do here.

Written by Mark Needham

June 4th, 2017 at 9:22 am

Posted in Data Science,Python

Tagged with ,

Kaggle Titanic: Python pandas attempt

with one comment

Nathan and I have been looking at Kaggle’s Titanic problem and while working through the Python tutorial Nathan pointed out that we could greatly simplify the code if we used pandas instead.

The problem we had with numpy is that you use integers to reference columns. We spent a lot of time being thoroughly confused as to why something wasn’t working only to realise we were using the wrong column.

The algorithm scores an accuracy of 77.99% and the way it works is that we build a ‘survival table’ which works out an average survival rate of people based on 3 features:

  • Passenger Class
  • Passenger Fare (grouped into those who paid 0-9, 10-19, 20-29, 30+)
  • Gender

It looks like this:

2013 10 30 07 05 03

And the code that creates that is:

import pandas as pd
def addrow(df, row):
    return df.append(pd.DataFrame(row), ignore_index=True)
def fare_in_bucket(fare, fare_bracket_size, bucket):
    return (fare > bucket * fare_bracket_size) & (fare <= ((bucket+1) * fare_bracket_size))
def build_survival_table(training_file):    
    fare_ceiling = 40
    train_df = pd.read_csv(training_file)
    train_df[train_df['Fare'] >= 39.0] = 39.0
    fare_bracket_size = 10
    number_of_price_brackets = fare_ceiling / fare_bracket_size
    number_of_classes = 3 #There were 1st, 2nd and 3rd classes on board 
    survival_table = pd.DataFrame(columns=['Sex', 'Pclass', 'PriceDist', 'Survived', 'NumberOfPeople'])
    for pclass in range(1, number_of_classes + 1): # add 1 to handle 0 start
        for bucket in range(0, number_of_price_brackets):
            for sex in ['female', 'male']:
                survival = train_df[(train_df['Sex'] == sex) 
                                    & (train_df['Pclass'] == pclass) 
                                    & fare_in_bucket(train_df["Fare"], fare_bracket_size, bucket)]
                row = [dict(Pclass=pclass, Sex=sex, PriceDist = bucket, 
                            Survived = round(survival['Survived'].mean()), 
                            NumberOfPeople = survival.count()[0]) ]
                survival_table = addrow(survival_table, row)
    return survival_table.fillna(0)
survival_table = build_survival_table("train.csv")

where ‘train.csv’ is structured like so:

$ head -n5 train.csv 
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

After we’ve built that we iterate through the test data set and look up each person in the table and find their survival rate.

def select_bucket(fare):
    if (fare >= 0 and fare < 10):
        return 0
    elif (fare >= 10 and fare < 20):
        return 1
    elif (fare >= 20 and fare < 30):
        return 2
        return 3
def calculate_survival(survival_table, row):
    survival_row = survival_table[(survival_table["Sex"] == row["Sex"]) & (survival_table["Pclass"] == row["Pclass"]) & (survival_table["PriceDist"] == select_bucket(row["Fare"]))]
    return int(survival_row["Survived"].iat[0])    
test_df = pd.read_csv('test.csv')             
test_df["Survived"] = test_df.apply(lambda row: calculate_survival(survival_table, row), axis=1)

I wrote up the difficulties we had working out how to append the ‘Survived’ column if you want more detail.

‘test.csv’ looks like this:

$ head -n5 test.csv 
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S

We then write out the survival value for each customer along with their ID:

test_df.to_csv("result.csv", cols=['PassengerId', 'Survived'], index=False)
$ head -n5 result.csv 

I’ve pasted the code as a gist for those who want to see it all as one.

Next step: introduce some real machine learning, probably using scikit-learn unless there’s something else we should be using?

Written by Mark Needham

October 30th, 2013 at 7:26 am

Posted in Machine Learning

Tagged with ,

Kaggle Digit Recognizer: A feature extraction #fail

with 4 comments

I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction in this context means that we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try and persuade me that we should try it out but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

5. The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I’ve highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non zero pixels, the number of 255 pixels, the average number of pixels and the average of the middle pixels of a number.

The code reads like this:

initial <- read.csv("train.csv", header = TRUE)
initial$nonZeros <- apply(initial, 1, function(entries) length(Filter(function (x) x != 0, entries)))
initial$fullHouses <- apply(initial, 1, function(entries) length(Filter(function (x) x == 255, entries)))
initial$meanPixels <- apply(initial, 1, mean)
initial$middlePixels <- apply(initial[,200:500], 1, mean)

I then wrote those features out into a CSV file like so:

newFeatures <- subset(initial, select=c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(file="feature-extraction.txt", newFeatures, row.names=FALSE, sep=",")

I then created a 100 tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.

Unfortunately the accuracy on the cross validation set (10% of the training data) was only 24% which is pretty useless so it’s back to the drawing board!

Our next task is to try and work out whether we can derive some features which have a stronger correlation with the label values or combining the new features with the existing pixel values to see if that has any impact.

As you can probably tell I don’t really understand how you should go about extracting features so if anybody has ideas or papers/articles I can read to learn more please let me know in the comments!

Written by Mark Needham

January 31st, 2013 at 11:24 pm

Posted in Machine Learning

Tagged with ,

Kaggle Digit Recognizer: Finding pixels with no variance using R

with one comment

I’ve written previously about our attempts at the Kaggle Digit Recogniser problem and our approach so far has been to use the data provided and plug it into different algorithms and see what we end up with.

From browsing through the forums we saw others mentioning feature extraction – an approach where we transform the data into another format , the thinking being that we can train a better classifier with better data.

There was quite a nice quote from a post written by Rachel Schutt about the Columbia Data Science course which summed up the mistake we’d made:

The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One thing we’d noticed while visually scanning the data set was that a lot of the features seemed to consistently have a value of 0. We thought we’d try and find out which pixels those were by finding the features which had zero variance in their values.

I started out by loading a subset of the data set and then taking a sample of that to play around with:

initial <- read.csv("train.csv", nrows=10000, header = TRUE)
# take a sample of 1000 rows of the input 
sampleSet <- initial[sample(1:nrow(initial), 1000), ]

Just for fun I thought it’d be interesting to see how well the labels were distributed in my sample which we can do with the following code:

# get all the labels
sampleSet.labels <- as.factor(sampleSet$label)
> table(sampleSet.labels)
  0   1   2   3   4   5   6   7   8   9 
102 116 100  97  95  91  79 122 102  96

There are a few more 1’s and 7s than the other labels but they’re roughly in the same ballpark so it’s ok.

I wanted to exclude the ‘label’ field from the data set because the variance of labels isn’t interesting to us on this occasion. We can do that with the following code:

# get data set excluding label
excludingLabel <- subset( sampleSet, select = -label)

To find all the features with no variance we then do this:

# show all the features which don't have any variance - all have the same value
variances <- apply(excludingLabel, 2, var)
# get the names of the labels which have no variance
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
 [1] "pixel0"   "pixel1"   "pixel2"   "pixel3"   "pixel4"   "pixel5"   "pixel6"   "pixel7"  
  [9] "pixel8"   "pixel9"   "pixel10"  "pixel11"  "pixel12"  "pixel13"  "pixel14"  "pixel15" 
 [17] "pixel16"  "pixel17"  "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22"  "pixel23" 
 [25] "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28"  "pixel29"  "pixel30"  "pixel31" 
 [33] "pixel32"  "pixel33"  "pixel51"  "pixel52"  "pixel53"  "pixel54"  "pixel55"  "pixel56" 
 [41] "pixel57"  "pixel58"  "pixel59"  "pixel60"  "pixel82"  "pixel83"  "pixel84"  "pixel85" 
 [49] "pixel86"  "pixel88"  "pixel110" "pixel111" "pixel112" "pixel113" "pixel114" "pixel139"
 [57] "pixel140" "pixel141" "pixel142" "pixel168" "pixel169" "pixel196" "pixel252" "pixel280"
 [65] "pixel308" "pixel335" "pixel336" "pixel364" "pixel365" "pixel392" "pixel393" "pixel420"
 [73] "pixel421" "pixel448" "pixel476" "pixel504" "pixel532" "pixel559" "pixel560" "pixel587"
 [81] "pixel615" "pixel643" "pixel644" "pixel645" "pixel671" "pixel672" "pixel673" "pixel699"
 [89] "pixel700" "pixel701" "pixel727" "pixel728" "pixel729" "pixel730" "pixel731" "pixel752"
 [97] "pixel753" "pixel754" "pixel755" "pixel756" "pixel757" "pixel758" "pixel759" "pixel760"
[105] "pixel779" "pixel780" "pixel781" "pixel782" "pixel783"

We can count how many labels there are by using the length function:

# count how many labels have no variance
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 109

I then wrote those out to a file so that we could use them as the input to the code which builds up our classifier.

write(file="pointless-features.txt", pointlessFeatures)

Of course we should run the variance test against the full data set rather than just a sample and on the whole data set there are only 76 features with zero variance:

> sampleSet <- read.csv("train.csv", header = TRUE)
> sampleSet.labels <- as.factor(sampleSet$label)
> table(sampleSet.labels)
   0    1    2    3    4    5    6    7    8    9 
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188 
> excludingLabel <- subset( sampleSet, select = -label)
> variances <- apply(excludingLabel, 2, var)
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 76

We’ve built decision trees using this reduced data set but haven’t yet submitted the forest to Kaggle to see if it’s any more accurate!

I picked up the little R I know from the Computing for Data Analysis course which started last week and from the book ‘R in a Nutshell‘ which my colleague Paul Lam recommended.

Written by Mark Needham

January 8th, 2013 at 12:48 am

Posted in Machine Learning,R

Tagged with

Kaggle Digit Recognizer: Weka AdaBoost attempt

without comments

In our latest attempt at Kaggle’s Digit Recognizer Jen and I decided to try out boosting on our random forest algorithm, an approach that Jen had come across in a talk at the Clojure Conj.

We couldn’t find any documentation that it was possible to apply boosting to Mahout’s random forest algorithm but we knew it was possible with Weka so we decided to use that instead!

As I understand it the way that boosting works in the context of random forests is that each of the trees in the forest will be assigned a weight based on how accurately it’s able to classify the data set and these weights are then used in the voting stage.

There’s a more detailed explanation of the algorithm in this paper.

We had the following code to train the random forest:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) {
        FastVector attributes = attributes();
        Instances instances = new Instances("digit recognizer", attributes, 40000);
        String[] trainingDataValues = KaggleInputReader.fileAsStringArray("data/train.csv");
        for (String trainingDataValue : trainingDataValues) {
            Instance instance = createInstance(trainingDataValue);
        Classifier classifier = buildClassifier(instances);
    private static Classifier buildClassifier(Instances instances) throws Exception {
        RandomForest randomForest = new RandomForest();
        MultiBoostAB multiBoostAB = new MultiBoostAB();
        return multiBoostAB;
    private static FastVector attributes() {
        FastVector attributes = new FastVector();
        for (int i = 0; i <= 783; i++) {
            attributes.addElement(new Attribute("pixel" + i));
        return attributes;
    private static Attribute digit() {
        FastVector possibleClasses = new FastVector(10);
        return new Attribute("label", possibleClasses, 0);

The code in the KaggleInputReader is used to process the CSV file and is the same as that included in a previous post so I won’t bother including it in this post.

The Weka API is slightly different to the Mahout one in that we have to tell it the names of all the labels that a combination of features belong to whereas with Mahout it seems to work it out for you.

Wf use the RandomForest class to build up our trees and then wrap it in the MultiBoostAB class to apply the boosting. There is another class we could use to do this called AdaBoostM1 but they both seem to give similar results so we stuck with this one.

Once we’d trained the classifier up we ran it against our test data set like so:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) {
        String[] testDataValues = KaggleInputReader.fileAsStringArray("data/test.csv");
        FileWriter fileWriter = new FileWriter("weka-attempts/out-" + System.currentTimeMillis() + ".txt");
        PrintWriter out = new PrintWriter(fileWriter);
        for (String testDataValue : testDataValues) {
            Iteration iterate = iterate(testDataValue, classifier, instances);
            out.println((int) iterate.getPrediction());
            System.out.println("Actual: " + iterate.getActual() + ", Prediction: " + iterate.getPrediction());
    private static Iteration iterate(String testDataValue, Classifier classifier, Instances instances) throws Exception {
        Instance predictMe = createTestDataBasedInstanceToPredict(testDataValue, instances);
        double prediction = classifier.classifyInstance(predictMe);
        return new Iteration(new Double(testDataValue.split(",")[0]), prediction);
    private static Instance createTestDataBasedInstanceToPredict(String testDataValue, Instances instances) {
        String[] columns = testDataValue.split(",");
        Instance instance = new Instance(785);
        for (int i = 0; i < columns.length; i++) {
            instance.setValue(new Attribute("pixel" + i, i+1), new Double(columns[i]));
        return instance;

We got an accuracy of 96.529% with this code which is 0.2% higher than we managed with the Mahout Random forest without any boosting. The full code for this solution is on github as always!

We still haven’t managed to get an accuracy higher than the default solution provided by Kaggle so any suggestions about what else to try are welcome!

We’ve been playing around with neural networks using encog but they seem a bit magical and the moment and it’s difficult to work out why they don’t work when you don’t get the result you expect!

Written by Mark Needham

November 29th, 2012 at 5:09 pm

Posted in Machine Learning

Tagged with ,

Kaggle Digit Recognizer: K-means optimisation attempt

with one comment

I recently wrote a blog post explaining how Jen and I used the K-means algorithm to classify digits in Kaggle’s Digit Recognizer problem and one of the things we’d read was that with this algorithm you often end up with situations where it’s difficult to classify a new item because if falls between two labels.

We decided to have a look at the output of our classifier function to see whether or not that was the case.

As a refresher, this was our original function to determine the label for a new test data row…

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)]
    [(ffirst (sort-by second all-the-gaps)) all-the-gaps]))

…and that would return a result like this:

user>  (which-am-i (first test-data))
[0 ([0 1763.5688862988827] [1 2768.1143197890624] [2 2393.9091578180937] 
[3 2598.4629450761286] [4 2615.1233720558307] [5 2287.1791665580586] 
[6 2470.096959417967] [7 2406.0132574502527] [8 2489.3635108564304] [9 2558.0054056506265])]

We first changed the function to return another entry in the vector representing the top two picks:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))]
    [(ffirst (sort-by second all-the-gaps)) top-two all-the-gaps]))

If we run that:

user> (which-am-i (first test-data))
[2 ([2 1855.3605185002546] [0 2233.654619238101]) 
([0 2233.654619238101] [1 2661.7148686603714] [2 1855.3605185002546] 
[3 2512.429018687221] [4 2357.637631775974] [5 2457.9850052966344] 
[6 2243.724487123002] [7 2554.1158473740174] [8 2317.567588716217] [9 2520.667565741239])]

In this case the difference between the top 2 labels is about 400 but we next changed the function to output the actual difference so we could see how close the top 2 were over the whole test set:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))
        difference-between-top-two (Math/abs (apply - (map second top-two)))]
    [(ffirst (sort-by second all-the-gaps)) top-two difference-between-top-two all-the-gaps]))

We then ran it like this, taking just the first 10 differences on this occasion:

user>  (take 10 (->> test-data (map which-am-i) (map #(nth % 2))))
(378.2941007378465 523.6102802591759 73.57510749262792 3.8557350283749656 5.806672422475231 
183.90928740097775 713.1626629833258 335.38646365464047 538.6191727330108 161.68429111785372)

From visually looking at this over a larger subset we noticed that there seemed to be a significant number of top twos within a distance of 50.

We therefore changed the function to use shuffle to randomly pick between those two labels when the distance was that small.

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))
        difference-between-top-two (Math/abs (apply - (map second top-two)))
        very-close (< difference-between-top-two 50)
        best-one (if very-close (ffirst (shuffle top-two)) (ffirst top-two))]
    [best-one top-two all-the-gaps]))

Our assumption was that this might marginally improve the performance of the algorithm but it actually made it marginally worse, with a new accuracy of 79.3%

At this point we weren’t really sure what we should be doing to try and improve the algorithm’s performance so we moved onto random forests but it’s interesting that you can get a reasonable accuracy even with such a simple approach!

Written by Mark Needham

October 27th, 2012 at 12:27 pm

Posted in Machine Learning

Tagged with ,