Mark Needham

Thoughts on Software Development

Archive for the ‘kaggle’ tag

Kaggle Titanic: Python pandas attempt


Nathan and I have been looking at Kaggle’s Titanic problem and, while working through the Python tutorial, Nathan pointed out that we could greatly simplify the code if we used pandas instead.

The problem we had with numpy is that you have to reference columns by integer index. We spent a lot of time being thoroughly confused as to why something wasn’t working, only to realise we were using the wrong column.
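
To illustrate the difference, here’s a minimal sketch using a couple of made-up rows rather than the real data set: with numpy you have to remember which integer position corresponds to which column, whereas pandas lets you refer to columns by name.

import numpy as np
import pandas as pd
 
# numpy: columns are referenced by position, so it's easy to grab the wrong one
data = np.array([[0, 3, 7.25],
                 [1, 1, 71.28]])   # Survived, Pclass, Fare - you have to remember the order
fares = data[:, 2]                 # is index 2 really the fare column?
 
# pandas: columns are referenced by name, which is much harder to get wrong
df = pd.DataFrame(data, columns=["Survived", "Pclass", "Fare"])
fares = df["Fare"]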

The algorithm scores an accuracy of 77.99% and works by building a ‘survival table’ which holds the average survival rate of passengers based on three features:

  • Passenger Class
  • Passenger Fare (grouped into those who paid 0-9, 10-19, 20-29, 30+)
  • Gender

It looks like this:

[Screenshot of the survival table]

And the code that creates that is:

import pandas as pd
 
def addrow(df, row):
    # append a single row (a list containing one dict) to the data frame
    return df.append(pd.DataFrame(row), ignore_index=True)
 
def fare_in_bucket(fare, fare_bracket_size, bucket):
    # True when the fare falls in the range (bucket * size, (bucket + 1) * size]
    return (fare > bucket * fare_bracket_size) & (fare <= ((bucket+1) * fare_bracket_size))
 
def build_survival_table(training_file):    
    fare_ceiling = 40
    train_df = pd.read_csv(training_file)
    # cap fares at the ceiling so the most expensive tickets fall into the top bucket
    # (note that only the 'Fare' column is changed)
    train_df.loc[train_df['Fare'] >= fare_ceiling, 'Fare'] = fare_ceiling - 1.0
    fare_bracket_size = 10
    number_of_price_brackets = fare_ceiling // fare_bracket_size  # integer division so range() works
    number_of_classes = 3  # there were 1st, 2nd and 3rd classes on board
 
    survival_table = pd.DataFrame(columns=['Sex', 'Pclass', 'PriceDist', 'Survived', 'NumberOfPeople'])
 
    for pclass in range(1, number_of_classes + 1): # add 1 to handle 0 start
        for bucket in range(0, number_of_price_brackets):
            for sex in ['female', 'male']:
                survival = train_df[(train_df['Sex'] == sex) 
                                    & (train_df['Pclass'] == pclass) 
                                    & fare_in_bucket(train_df["Fare"], fare_bracket_size, bucket)]
 
                row = [dict(Pclass=pclass, Sex=sex, PriceDist = bucket, 
                            Survived = round(survival['Survived'].mean()), 
                            NumberOfPeople = survival.count()[0]) ]
                survival_table = addrow(survival_table, row)
 
    return survival_table.fillna(0)
 
survival_table = build_survival_table("train.csv")

where ‘train.csv’ is structured like so:

$ head -n5 train.csv 
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

After we’ve built that we iterate through the test data set, look each person up in the table and find their predicted survival value.

def select_bucket(fare):
    if (fare >= 0 and fare < 10):
        return 0
    elif (fare >= 10 and fare < 20):
        return 1
    elif (fare >= 20 and fare < 30):
        return 2
    else:
        return 3
 
def calculate_survival(survival_table, row):
    # look up the survival table entry matching this passenger's sex, class and fare bucket
    survival_row = survival_table[(survival_table["Sex"] == row["Sex"]) 
                                  & (survival_table["Pclass"] == row["Pclass"]) 
                                  & (survival_table["PriceDist"] == select_bucket(row["Fare"]))]
    return int(survival_row["Survived"].iat[0])
 
test_df = pd.read_csv('test.csv')             
test_df["Survived"] = test_df.apply(lambda row: calculate_survival(survival_table, row), axis=1)

I wrote up the difficulties we had working out how to append the ‘Survived’ column if you want more detail.

‘test.csv’ looks like this:

$ head -n5 test.csv 
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S

We then write out the survival value for each passenger along with their ID:

test_df.to_csv("result.csv", cols=['PassengerId', 'Survived'], index=False)
$ head -n5 result.csv 
PassengerId,Survived
892,0
893,1
894,0
895,0

I’ve pasted the code as a gist for those who want to see it all in one place.

Next step: introduce some real machine learning, probably using scikit-learn unless there’s something else we should be using?
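
For anyone curious what that might look like, here’s a rough sketch using scikit-learn’s RandomForestClassifier on the same three features. We haven’t actually run this, so treat the feature handling (encoding ‘Sex’ and filling the odd missing fare) as an assumption rather than a recipe:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
 
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
 
features = ["Pclass", "Fare", "Sex"]
 
# encode 'Sex' numerically and fill any missing fares with the training median
for df in [train_df, test_df]:
    df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
    df["Fare"] = df["Fare"].fillna(train_df["Fare"].median())
 
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_df[features], train_df["Survived"])
 
test_df["Survived"] = model.predict(test_df[features])
test_df[["PassengerId", "Survived"]].to_csv("result.csv", index=False)

That would keep the same ‘result.csv’ output format as the lookup-table version above.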

Written by Mark Needham

October 30th, 2013 at 7:26 am

Posted in Machine Learning


Kaggle Digit Recognizer: A feature extraction #fail


I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction in this context means that we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try to persuade me that we should try it out, but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

5. The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I’ve highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non-zero pixels, the number of pixels with the maximum value of 255, the average pixel value and the average value of the middle pixels of each digit.

The code reads like this:

initial <- read.csv("train.csv", header = TRUE)
pixels <- initial[, -1] # just the pixel columns, so the label and the newly added features don't pollute the calculations
initial$nonZeros <- apply(pixels, 1, function(entries) length(Filter(function (x) x != 0, entries)))
initial$fullHouses <- apply(pixels, 1, function(entries) length(Filter(function (x) x == 255, entries)))
initial$meanPixels <- apply(pixels, 1, mean)
initial$middlePixels <- apply(pixels[, 200:500], 1, mean)

I then wrote those features out into a CSV file like so:

newFeatures <- subset(initial, select=c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(file="feature-extraction.txt", newFeatures, row.names=FALSE, sep=",")

I then built a 100-tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.

Unfortunately the accuracy on the cross validation set (10% of the training data) was only 24%, which is pretty useless, so it’s back to the drawing board!

Our next task is to work out whether we can derive some features which have a stronger correlation with the label values, or combine the new features with the existing pixel values to see if that has any impact.

As you can probably tell, I don’t really understand how you should go about extracting features, so if anybody has ideas or papers/articles I can read to learn more, please let me know in the comments!

Written by Mark Needham

January 31st, 2013 at 11:24 pm

Posted in Machine Learning


Kaggle Digit Recognizer: Finding pixels with no variance using R


I’ve written previously about our attempts at the Kaggle Digit Recogniser problem, and our approach so far has been to take the data provided, plug it into different algorithms and see what we end up with.

From browsing through the forums we saw others mentioning feature extraction – an approach where we transform the data into another format, the thinking being that we can train a better classifier with better data.

There was quite a nice quote from a post written by Rachel Schutt about the Columbia Data Science course which summed up the mistake we’d made:

The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One thing we’d noticed while visually scanning the data set was that a lot of the features seemed to consistently have a value of 0. We thought we’d try and find out which pixels those were by finding the features which had zero variance in their values.

I started out by loading a subset of the data set and then taking a sample of that to play around with:

initial <- read.csv("train.csv", nrows=10000, header = TRUE)
 
# take a sample of 1000 rows of the input 
sampleSet <- initial[sample(1:nrow(initial), 1000), ]

Just for fun I thought it’d be interesting to see how the labels were distributed in my sample, which we can do with the following code:

# get all the labels
sampleSet.labels <- as.factor(sampleSet$label)
 
> table(sampleSet.labels)
sampleSet.labels
  0   1   2   3   4   5   6   7   8   9 
102 116 100  97  95  91  79 122 102  96

There are a few more 1s and 7s than the other labels but they’re roughly in the same ballpark, so that’s OK.

I wanted to exclude the ‘label’ field from the data set because the variance of labels isn’t interesting to us on this occasion. We can do that with the following code:

# get data set excluding label
excludingLabel <- subset( sampleSet, select = -label)

To find all the features with no variance we then do this:

# show all the features which don't have any variance - they all have the same value
variances <- apply(excludingLabel, 2, var)
 
# get the names of the features which have no variance
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> pointlessFeatures
 
 [1] "pixel0"   "pixel1"   "pixel2"   "pixel3"   "pixel4"   "pixel5"   "pixel6"   "pixel7"  
  [9] "pixel8"   "pixel9"   "pixel10"  "pixel11"  "pixel12"  "pixel13"  "pixel14"  "pixel15" 
 [17] "pixel16"  "pixel17"  "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22"  "pixel23" 
 [25] "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28"  "pixel29"  "pixel30"  "pixel31" 
 [33] "pixel32"  "pixel33"  "pixel51"  "pixel52"  "pixel53"  "pixel54"  "pixel55"  "pixel56" 
 [41] "pixel57"  "pixel58"  "pixel59"  "pixel60"  "pixel82"  "pixel83"  "pixel84"  "pixel85" 
 [49] "pixel86"  "pixel88"  "pixel110" "pixel111" "pixel112" "pixel113" "pixel114" "pixel139"
 [57] "pixel140" "pixel141" "pixel142" "pixel168" "pixel169" "pixel196" "pixel252" "pixel280"
 [65] "pixel308" "pixel335" "pixel336" "pixel364" "pixel365" "pixel392" "pixel393" "pixel420"
 [73] "pixel421" "pixel448" "pixel476" "pixel504" "pixel532" "pixel559" "pixel560" "pixel587"
 [81] "pixel615" "pixel643" "pixel644" "pixel645" "pixel671" "pixel672" "pixel673" "pixel699"
 [89] "pixel700" "pixel701" "pixel727" "pixel728" "pixel729" "pixel730" "pixel731" "pixel752"
 [97] "pixel753" "pixel754" "pixel755" "pixel756" "pixel757" "pixel758" "pixel759" "pixel760"
[105] "pixel779" "pixel780" "pixel781" "pixel782" "pixel783"

We can count how many features have no variance by using the length function:

# count how many features have no variance
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 109

I then wrote those out to a file so that we could use them as the input to the code which builds up our classifier.

write(file="pointless-features.txt", pointlessFeatures)

Of course we should run the variance test against the full data set rather than just a sample; on the whole data set there are only 76 features with zero variance:

> sampleSet <- read.csv("train.csv", header = TRUE)
> sampleSet.labels <- as.factor(sampleSet$label)
> table(sampleSet.labels)
sampleSet.labels
   0    1    2    3    4    5    6    7    8    9 
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188 
> excludingLabel <- subset( sampleSet, select = -label)
> variances <- apply(excludingLabel, 2, var)
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 76

We’ve built decision trees using this reduced data set but haven’t yet submitted the forest to Kaggle to see if it’s any more accurate!

I picked up the little R I know from the Computing for Data Analysis course, which started last week, and from the book ‘R in a Nutshell’, which my colleague Paul Lam recommended.

Written by Mark Needham

January 8th, 2013 at 12:48 am

Posted in Machine Learning, R


Kaggle Digit Recognizer: Weka AdaBoost attempt


In our latest attempt at Kaggle’s Digit Recognizer Jen and I decided to try out boosting on our random forest algorithm, an approach that Jen had come across in a talk at the Clojure Conj.

We couldn’t find any documentation that it was possible to apply boosting to Mahout’s random forest algorithm, but we knew it was possible with Weka, so we decided to use that instead!

As I understand it, boosting in the context of random forests means that each tree in the forest is assigned a weight based on how accurately it’s able to classify the data set, and those weights are then used in the voting stage.
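
To make that idea concrete, here’s a minimal sketch of weighted voting in Python. It’s an illustration of the concept rather than what Weka’s MultiBoostAB actually does internally; the weight formula is a multi-class AdaBoost-style one.

import math
from collections import defaultdict
 
def tree_weight(error_rate, num_classes=10):
    # AdaBoost-style weight: the more accurate the tree, the bigger its say in the vote
    error_rate = min(max(error_rate, 1e-10), 1 - 1e-10)
    return math.log((1 - error_rate) / error_rate) + math.log(num_classes - 1)
 
def weighted_vote(predictions_with_errors):
    # predictions_with_errors: one (predicted_label, error_rate) pair per tree
    votes = defaultdict(float)
    for label, error_rate in predictions_with_errors:
        votes[label] += tree_weight(error_rate)
    return max(votes, key=votes.get)
 
# two inaccurate trees vote '3', one accurate tree votes '8' - the accurate tree wins
print(weighted_vote([("3", 0.45), ("3", 0.48), ("8", 0.05)]))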

There’s a more detailed explanation of the algorithm in this paper.

We had the following code to train the random forest:

import weka.classifiers.Classifier;
import weka.classifiers.meta.MultiBoostAB;
import weka.classifiers.trees.RandomForest;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
 
public class WekaAdaBoostRandomForest {
    public static void main(String[] args) throws Exception {
        FastVector attributes = attributes();
 
        Instances instances = new Instances("digit recognizer", attributes, 40000);
        instances.setClassIndex(0);
 
        String[] trainingDataValues = KaggleInputReader.fileAsStringArray("data/train.csv");
 
        for (String trainingDataValue : trainingDataValues) {
            Instance instance = createInstance(trainingDataValue);
            instances.add(instance);
        }
 
        Classifier classifier = buildClassifier(instances);
    }
 
    private static Classifier buildClassifier(Instances instances) throws Exception {
        RandomForest randomForest = new RandomForest();
        randomForest.setNumTrees(200);
 
        MultiBoostAB multiBoostAB = new MultiBoostAB();
        multiBoostAB.setClassifier(randomForest);
        multiBoostAB.buildClassifier(instances);
        return multiBoostAB;
    }
 
    private static FastVector attributes() {
        FastVector attributes = new FastVector();
        attributes.addElement(digit());
 
        for (int i = 0; i <= 783; i++) {
            attributes.addElement(new Attribute("pixel" + i));
        }
 
        return attributes;
    }
 
    private static Attribute digit() {
        FastVector possibleClasses = new FastVector(10);
        possibleClasses.addElement("0");
        possibleClasses.addElement("1");
        possibleClasses.addElement("2");
        possibleClasses.addElement("3");
        possibleClasses.addElement("4");
        possibleClasses.addElement("5");
        possibleClasses.addElement("6");
        possibleClasses.addElement("7");
        possibleClasses.addElement("8");
        possibleClasses.addElement("9");
        return new Attribute("label", possibleClasses, 0);
 
    }
 
}

The code in the KaggleInputReader is used to process the CSV file and is the same as that included in a previous post so I won’t bother including it in this post.

The Weka API is slightly different from the Mahout one in that we have to tell it the names of all the labels that a combination of features can belong to, whereas Mahout seems to work that out for you.

We use the RandomForest class to build up our trees and then wrap it in the MultiBoostAB class to apply the boosting. There is another class we could use to do this called AdaBoostM1 but they both seem to give similar results so we stuck with this one.

Once we’d trained the classifier up we ran it against our test data set like so:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) throws Exception {
        ...
        String[] testDataValues = KaggleInputReader.fileAsStringArray("data/test.csv");
 
 
        FileWriter fileWriter = new FileWriter("weka-attempts/out-" + System.currentTimeMillis() + ".txt");
        PrintWriter out = new PrintWriter(fileWriter);
        for (String testDataValue : testDataValues) {
            Iteration iterate = iterate(testDataValue, classifier, instances);
            out.println((int) iterate.getPrediction());
            System.out.println("Actual: " + iterate.getActual() + ", Prediction: " + iterate.getPrediction());
        }
        out.close();
    }
 
    private static Iteration iterate(String testDataValue, Classifier classifier, Instances instances) throws Exception {
        Instance predictMe = createTestDataBasedInstanceToPredict(testDataValue, instances);
        double prediction = classifier.classifyInstance(predictMe);
 
        return new Iteration(new Double(testDataValue.split(",")[0]), prediction);
    }
 
    private static Instance createTestDataBasedInstanceToPredict(String testDataValue, Instances instances) {
        String[] columns = testDataValue.split(",");
        Instance instance = new Instance(785);
 
        for (int i = 0; i < columns.length; i++) {
            instance.setValue(new Attribute("pixel" + i, i+1), new Double(columns[i]));
        }
 
        instance.setDataset(instances);
        return instance;
    }
}

We got an accuracy of 96.529% with this code, which is 0.2% higher than we managed with the Mahout random forest without any boosting. The full code for this solution is on GitHub as always!

We still haven’t managed to get an accuracy higher than the default solution provided by Kaggle so any suggestions about what else to try are welcome!

We’ve been playing around with neural networks using Encog but they seem a bit magical at the moment and it’s difficult to work out why things aren’t working when you don’t get the result you expect!

Written by Mark Needham

November 29th, 2012 at 5:09 pm

Posted in Machine Learning


Kaggle Digit Recognizer: K-means optimisation attempt


I recently wrote a blog post explaining how Jen and I used the K-means algorithm to classify digits in Kaggle’s Digit Recognizer problem, and one of the things we’d read was that with this algorithm you often end up with situations where it’s difficult to classify a new item because it falls between two labels.

We decided to have a look at the output of our classifier function to see whether or not that was the case.

As a refresher, this was our original function to determine the label for a new test data row…

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)]
    [(ffirst (sort-by second all-the-gaps)) all-the-gaps]))

…and that would return a result like this:

user>  (which-am-i (first test-data))
[0 ([0 1763.5688862988827] [1 2768.1143197890624] [2 2393.9091578180937] 
[3 2598.4629450761286] [4 2615.1233720558307] [5 2287.1791665580586] 
[6 2470.096959417967] [7 2406.0132574502527] [8 2489.3635108564304] [9 2558.0054056506265])]

We first changed the function to return another entry in the vector representing the top two picks:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))]
    [(ffirst (sort-by second all-the-gaps)) top-two all-the-gaps]))

If we run that:

user> (which-am-i (first test-data))
[2 ([2 1855.3605185002546] [0 2233.654619238101]) 
([0 2233.654619238101] [1 2661.7148686603714] [2 1855.3605185002546] 
[3 2512.429018687221] [4 2357.637631775974] [5 2457.9850052966344] 
[6 2243.724487123002] [7 2554.1158473740174] [8 2317.567588716217] [9 2520.667565741239])]

In this case the difference between the top two labels is about 400, but we next changed the function to output the actual difference so we could see how close the top two were across the whole test set:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))
        difference-between-top-two (Math/abs (apply - (map second top-two)))]
    [(ffirst (sort-by second all-the-gaps)) top-two difference-between-top-two all-the-gaps]))

We then ran it like this, taking just the first 10 differences on this occasion:

user>  (take 10 (->> test-data (map which-am-i) (map #(nth % 2))))
(378.2941007378465 523.6102802591759 73.57510749262792 3.8557350283749656 5.806672422475231 
183.90928740097775 713.1626629833258 335.38646365464047 538.6191727330108 161.68429111785372)

From visually looking at this over a larger subset we noticed that there seemed to be a significant number of top twos within a distance of 50.

We therefore changed the function to use shuffle to randomly pick between those two labels when the distance was that small:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))
        difference-between-top-two (Math/abs (apply - (map second top-two)))
        very-close (< difference-between-top-two 50)
        best-one (if very-close (ffirst (shuffle top-two)) (ffirst top-two))]
    [best-one top-two all-the-gaps]))

Our assumption was that this might marginally improve the performance of the algorithm, but it actually made it marginally worse, with a new accuracy of 79.3%.

At this point we weren’t really sure what we should be doing to try and improve the algorithm’s performance, so we moved on to random forests, but it’s interesting that you can get a reasonable accuracy even with such a simple approach!

Written by Mark Needham

October 27th, 2012 at 12:27 pm

Posted in Machine Learning
