Mark Needham

Thoughts on Software Development

Archive for the ‘Machine Learning’ Category

Kaggle Digit Recognizer: Finding pixels with no variance using R

with one comment

I’ve written previously about our attempts at the Kaggle Digit Recogniser problem and our approach so far has been to use the data provided and plug it into different algorithms and see what we end up with.

From browsing through the forums we saw others mentioning feature extraction – an approach where we transform the data into another format, the thinking being that we can train a better classifier with better data.

There was quite a nice quote from a post written by Rachel Schutt about the Columbia Data Science course which summed up the mistake we’d made:

The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One thing we’d noticed while visually scanning the data set was that a lot of the features seemed to consistently have a value of 0. We thought we’d try and find out which pixels those were by finding the features which had zero variance in their values.

I started out by loading a subset of the data set and then taking a sample of that to play around with:

initial <- read.csv("train.csv", nrows=10000, header = TRUE)
 
# take a sample of 1000 rows of the input 
sampleSet <- initial[sample(1:nrow(initial), 1000), ]

Just for fun I thought it’d be interesting to see how well the labels were distributed in my sample which we can do with the following code:

# get all the labels
sampleSet.labels <- as.factor(sampleSet$label)
 
> table(sampleSet.labels)
sampleSet.labels
  0   1   2   3   4   5   6   7   8   9 
102 116 100  97  95  91  79 122 102  96

There are a few more 1s and 7s than the other labels but they’re roughly in the same ballpark so it’s ok.

I wanted to exclude the ‘label’ field from the data set because the variance of labels isn’t interesting to us on this occasion. We can do that with the following code:

# get data set excluding label
excludingLabel <- subset( sampleSet, select = -label)

To find all the features with no variance we then do this:

# show all the features which don't have any variance - all have the same value
variances <- apply(excludingLabel, 2, var)
 
# get the names of the features which have no variance
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> pointlessFeatures
 [1] "pixel0"   "pixel1"   "pixel2"   "pixel3"   "pixel4"   "pixel5"   "pixel6"   "pixel7"  
  [9] "pixel8"   "pixel9"   "pixel10"  "pixel11"  "pixel12"  "pixel13"  "pixel14"  "pixel15" 
 [17] "pixel16"  "pixel17"  "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22"  "pixel23" 
 [25] "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28"  "pixel29"  "pixel30"  "pixel31" 
 [33] "pixel32"  "pixel33"  "pixel51"  "pixel52"  "pixel53"  "pixel54"  "pixel55"  "pixel56" 
 [41] "pixel57"  "pixel58"  "pixel59"  "pixel60"  "pixel82"  "pixel83"  "pixel84"  "pixel85" 
 [49] "pixel86"  "pixel88"  "pixel110" "pixel111" "pixel112" "pixel113" "pixel114" "pixel139"
 [57] "pixel140" "pixel141" "pixel142" "pixel168" "pixel169" "pixel196" "pixel252" "pixel280"
 [65] "pixel308" "pixel335" "pixel336" "pixel364" "pixel365" "pixel392" "pixel393" "pixel420"
 [73] "pixel421" "pixel448" "pixel476" "pixel504" "pixel532" "pixel559" "pixel560" "pixel587"
 [81] "pixel615" "pixel643" "pixel644" "pixel645" "pixel671" "pixel672" "pixel673" "pixel699"
 [89] "pixel700" "pixel701" "pixel727" "pixel728" "pixel729" "pixel730" "pixel731" "pixel752"
 [97] "pixel753" "pixel754" "pixel755" "pixel756" "pixel757" "pixel758" "pixel759" "pixel760"
[105] "pixel779" "pixel780" "pixel781" "pixel782" "pixel783"

We can count how many features that is by using the length function:

# count how many features have no variance
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 109

I then wrote those out to a file so that we could use them as the input to the code which builds up our classifier.

write(file="pointless-features.txt", pointlessFeatures)

Of course we should run the variance test against the full data set rather than just a sample, and on the whole data set there are only 76 features with zero variance:

> sampleSet <- read.csv("train.csv", header = TRUE)
> sampleSet.labels <- as.factor(sampleSet$label)
> table(sampleSet.labels)
sampleSet.labels
   0    1    2    3    4    5    6    7    8    9 
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188 
> excludingLabel <- subset( sampleSet, select = -label)
> variances <- apply(excludingLabel, 2, var)
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 76

We’ve built decision trees using this reduced data set but haven’t yet submitted the forest to Kaggle to see if it’s any more accurate!
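
As a side note, here’s a rough sketch of the sort of filtering the classifier-building code needs to do with that file. The Java below is a hypothetical standalone example – the class name is made up for illustration – and it assumes pointless-features.txt contains one feature name per line, as written above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
 
public class PointlessFeatureFilter {
    public static void main(String[] args) throws Exception {
        // read the zero variance feature names written out by the R script
        Set<String> pointless = new HashSet<String>();
        BufferedReader featuresReader = new BufferedReader(new FileReader("pointless-features.txt"));
        String featureName;
        while ((featureName = featuresReader.readLine()) != null) {
            pointless.add(featureName.trim());
        }
        featuresReader.close();
 
        // drop those columns from each training row
        BufferedReader trainReader = new BufferedReader(new FileReader("data/train.csv"));
        String[] header = trainReader.readLine().split(",");
 
        String row;
        while ((row = trainReader.readLine()) != null) {
            String[] columns = row.split(",");
            List<String> kept = new ArrayList<String>();
            for (int i = 0; i < columns.length; i++) {
                if (!pointless.contains(header[i])) { // keeps the label and the varying pixels
                    kept.add(columns[i]);
                }
            }
            // 'kept' now holds the reduced row, ready to feed into the forest-building code
        }
        trainReader.close();
    }
}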

I picked up the little R I know from the Computing for Data Analysis course which started last week and from the book ‘R in a Nutshell’ which my colleague Paul Lam recommended.

Written by Mark Needham

January 8th, 2013 at 12:48 am

Posted in Machine Learning, R

Mahout: Parallelising the creation of DecisionTrees

with 5 comments

A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser Problem and after seeing how long it took to create forests with 500+ trees I wanted to see if this could be sped up by parallelising the process.

From looking at the DecisionTree it seemed like it should be possible to create lots of small forests and then combine them.

After unsuccessfully trying to achieve this by using DecisionForest directly I decided to just copy all the code from that class into my own version, which allowed me to do so.

The code to build up the forest ends up looking like this:

List<Node> trees = new ArrayList<Node>();
 
MultiDecisionForest savedForest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
trees.addAll(savedForest.getTrees()); // repeat for each saved forest whose trees we want to combine
 
MultiDecisionForest forest = new MultiDecisionForest(trees);

We can then use forest to classify values in a test data set and it seems to work reasonably well.
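
As a rough sketch (and assuming MultiDecisionForest keeps the same classify signature as Mahout’s DecisionForest), the classification loop is essentially the same as in our earlier Mahout random forest post – data and testDataValues here are assumed to have been loaded in the same way as before:

// illustrative sketch only: classify each test row with the combined forest
Data test = DataLoader.loadData(data.getDataset(), testDataValues);
Random rng = RandomUtils.getRandom();
 
for (int i = 0; i < test.size(); i++) {
    Instance oneSample = test.get(i);
    double classify = forest.classify(test.getDataset(), rng, oneSample);
    int label = data.getDataset().valueOf(0, String.valueOf((int) classify));
    System.out.println("Label: " + label);
}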

I wanted to try and avoid putting any threading code in so I made use of GNU parallel, which is available on Mac OS X via brew install parallel and on Ubuntu by adding the following repository to /etc/apt/sources.list:

deb http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main 
deb-src http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main

…followed by an apt-get update and apt-get install parallel.

I then wrote a script to parallelise the creation of the forests:

parallelise-forests.sh

#!/bin/bash 
 
start=`date`
startTime=`date '+%s'`
numberOfRuns=$1
 
seq 1 ${numberOfRuns} | parallel -P 8 "./build-forest.sh"
 
end=`date`
endTime=`date '+%s'`
 
echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: " $(expr $endTime - $startTime)

build-forest.sh

#!/bin/bash
 
java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar main.java.MahoutPlaybox

It should be possible to achieve this by using the parallel option in xargs but unfortunately I wasn’t able to get the same results with that command.

I hadn’t come across the seq command until today but it works quite well here for allowing us to specify how many times we want to call the script.

I was probably able to achieve about a 30% speed increase when running this on my Air. There was a greater increase running on a high-CPU AWS instance, although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.

Sadly, even with a new classifier with a massive number of trees, I didn’t see an improvement over the Weka random forest using AdaBoost which I wrote about a month ago. We had an accuracy of 96.282% here compared to 96.529% with the Weka version.

Written by Mark Needham

December 27th, 2012 at 12:08 am

Posted in Machine Learning

Weka: Saving and loading classifiers

with 3 comments

In our continued machine learning travels Jen and I have been building some classifiers using Weka and one thing we wanted to do was save the classifier and then reuse it later.

There is documentation for how to do this from the command line but we’re doing everything programmatically and wanted to be able to save our classifiers from Java code.

As it turns out it’s not too tricky when you know which classes to call and saving a classifier to a file is as simple as this:

MultilayerPerceptron classifier = new MultilayerPerceptron();
classifier.buildClassifier(instances); // instances gets passed in from elsewhere
 
Debug.saveToFile("/path/to/weka-neural-network", classifier);

If we want to load that classifier up we can make use of the SerializedClassifier class like so:

SerializedClassifier classifier = new SerializedClassifier();
classifier.setModelFile(new File("/path/to/weka-neural-network"));

Simples!
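
For completeness, a minimal sketch of using the loaded model – newInstance here is a hypothetical Instance that’s been built against the same Instances structure the classifier was trained with:

// hypothetical usage sketch: newInstance is assumed to be constructed elsewhere
double prediction = classifier.classifyInstance(newInstance);
System.out.println("Predicted class index: " + (int) prediction);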

Written by Mark Needham

December 12th, 2012 at 12:04 am

Posted in Machine Learning

Kaggle Digit Recognizer: Weka AdaBoost attempt

without comments

In our latest attempt at Kaggle’s Digit Recognizer Jen and I decided to try out boosting on our random forest algorithm, an approach that Jen had come across in a talk at the Clojure Conj.

We couldn’t find any documentation that it was possible to apply boosting to Mahout’s random forest algorithm but we knew it was possible with Weka so we decided to use that instead!

As I understand it, the way that boosting works in the context of random forests is that each of the trees in the forest is assigned a weight based on how accurately it’s able to classify the data set, and these weights are then used in the voting stage.
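
That’s just my mental model rather than a description of Weka’s internals, but as a toy sketch the weighted voting stage looks something like this (the class and the weights below are made up purely for illustration):

import java.util.HashMap;
import java.util.Map;
 
// Toy illustration of weighted voting across trees - not Weka's actual implementation
public class WeightedVote {
    public static int vote(int[] treePredictions, double[] treeWeights) {
        // add up the weight behind each predicted label
        Map<Integer, Double> totals = new HashMap<Integer, Double>();
        for (int i = 0; i < treePredictions.length; i++) {
            double soFar = totals.containsKey(treePredictions[i]) ? totals.get(treePredictions[i]) : 0.0;
            totals.put(treePredictions[i], soFar + treeWeights[i]);
        }
 
        // pick the label with the highest total weight
        int bestLabel = -1;
        double bestTotal = -1.0;
        for (Map.Entry<Integer, Double> entry : totals.entrySet()) {
            if (entry.getValue() > bestTotal) {
                bestTotal = entry.getValue();
                bestLabel = entry.getKey();
            }
        }
        return bestLabel;
    }
 
    public static void main(String[] args) {
        // two weak trees say '7' but a more accurate tree says '1', so '1' wins the vote
        System.out.println(vote(new int[]{7, 7, 1}, new double[]{0.2, 0.3, 0.9})); // prints 1
    }
}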

There’s a more detailed explanation of the algorithm in this paper.

We had the following code to train the random forest:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) throws Exception {
        FastVector attributes = attributes();
 
        Instances instances = new Instances("digit recognizer", attributes, 40000);
        instances.setClassIndex(0);
 
        String[] trainingDataValues = KaggleInputReader.fileAsStringArray("data/train.csv");
 
        for (String trainingDataValue : trainingDataValues) {
            Instance instance = createInstance(trainingDataValue);
            instances.add(instance);
        }
 
        Classifier classifier = buildClassifier(instances);
    }
 
    private static Classifier buildClassifier(Instances instances) throws Exception {
        RandomForest randomForest = new RandomForest();
        randomForest.setNumTrees(200);
 
        MultiBoostAB multiBoostAB = new MultiBoostAB();
        multiBoostAB.setClassifier(randomForest);
        multiBoostAB.buildClassifier(instances);
        return multiBoostAB;
    }
 
    private static FastVector attributes() {
        FastVector attributes = new FastVector();
        attributes.addElement(digit());
 
        for (int i = 0; i <= 783; i++) {
            attributes.addElement(new Attribute("pixel" + i));
        }
 
        return attributes;
    }
 
    private static Attribute digit() {
        FastVector possibleClasses = new FastVector(10);
        possibleClasses.addElement("0");
        possibleClasses.addElement("1");
        possibleClasses.addElement("2");
        possibleClasses.addElement("3");
        possibleClasses.addElement("4");
        possibleClasses.addElement("5");
        possibleClasses.addElement("6");
        possibleClasses.addElement("7");
        possibleClasses.addElement("8");
        possibleClasses.addElement("9");
        return new Attribute("label", possibleClasses, 0);
 
    }
 
}

The code in the KaggleInputReader is used to process the CSV file and is the same as that included in a previous post so I won’t bother including it in this post.

The Weka API is slightly different to the Mahout one in that we have to tell it the names of all the labels that a combination of features can belong to, whereas with Mahout it seems to work that out for you.

We use the RandomForest class to build up our trees and then wrap it in the MultiBoostAB class to apply the boosting. There is another class we could use to do this called AdaBoostM1 but they both seem to give similar results so we stuck with this one.

Once we’d trained the classifier up we ran it against our test data set like so:

public class WekaAdaBoostRandomForest {
    public static void main(String[] args) throws Exception {
        ...
        String[] testDataValues = KaggleInputReader.fileAsStringArray("data/test.csv");
 
 
        FileWriter fileWriter = new FileWriter("weka-attempts/out-" + System.currentTimeMillis() + ".txt");
        PrintWriter out = new PrintWriter(fileWriter);
        for (String testDataValue : testDataValues) {
            Iteration iterate = iterate(testDataValue, classifier, instances);
            out.println((int) iterate.getPrediction());
            System.out.println("Actual: " + iterate.getActual() + ", Prediction: " + iterate.getPrediction());
        }
        out.close();
    }
 
    private static Iteration iterate(String testDataValue, Classifier classifier, Instances instances) throws Exception {
        Instance predictMe = createTestDataBasedInstanceToPredict(testDataValue, instances);
        double prediction = classifier.classifyInstance(predictMe);
 
        return new Iteration(new Double(testDataValue.split(",")[0]), prediction);
    }
 
    private static Instance createTestDataBasedInstanceToPredict(String testDataValue, Instances instances) {
        String[] columns = testDataValue.split(",");
        Instance instance = new Instance(785);
 
        for (int i = 0; i < columns.length; i++) {
            instance.setValue(new Attribute("pixel" + i, i+1), new Double(columns[i]));
        }
 
        instance.setDataset(instances);
        return instance;
    }
}

We got an accuracy of 96.529% with this code which is 0.2% higher than we managed with the Mahout Random forest without any boosting. The full code for this solution is on github as always!

We still haven’t managed to get an accuracy higher than the default solution provided by Kaggle so any suggestions about what else to try are welcome!

We’ve been playing around with neural networks using encog but they seem a bit magical at the moment and it’s difficult to work out why things aren’t working when you don’t get the result you expect!

Written by Mark Needham

November 29th, 2012 at 5:09 pm

Posted in Machine Learning

A first failed attempt at Natural Language Processing

without comments

One of the things I find fascinating about dating websites is that the profiles of people are almost identical so I thought it would be an interesting exercise to grab some of the free text that people write about themselves and prove the similarity.

I’d been talking to Matt Biddulph about some Natural Language Processing (NLP) stuff he’d been working on and he wrote up a bunch of libraries, articles and books that he’d found useful.

I started out by plugging the text into one of the many NLP libraries that Matt listed with the vague idea that it would come back with something useful.

I’m not sure exactly what I was expecting the result to be but after 5 or 6 hours of playing around with different libraries I’d got nowhere and parked the problem, not really knowing where I’d gone wrong.

Last week I came across a paper titled “That’s What She Said: Double Entendre Identification” whose authors wanted to work out when a sentence could legitimately be followed by the phrase “that’s what she said”.

While the subject matter is a bit risque I found that reading about the way the authors went about solving their problem was very interesting and it allowed me to see some mistakes I’d made.

Vague problem statement

Unfortunately I didn’t do a good job of working out exactly what problem I wanted to solve – my problem statement was too general.

In the paper the authors narrowed down their problem space by focusing on a specific set of words which are typically used as double entendres and then worked out the sentence structure that the targeted sentences were likely to have.

Instead of defining my problem more specifically I plugged the text into Mallet, morpha-stemmer and Stanford Core NLP and tried to cluster the most popular words.

That didn’t really work because people use slightly different words to describe the same thing so I ended up looking at Yawni – a wrapper around WordNet which groups sets of words into cognitive synonyms.

In hindsight a more successful approach might have been to find the common words that people tend to use in these types of profiles and then work from there.

No Theory

I recently wrote about how I’ve been learning about neural networks by switching between theory and practice but with NLP I didn’t bother reading any of the theory and thought I could get away with plugging some data into one of the libraries.

I now realise that was a mistake as I didn’t know what to do when the libraries didn’t work as I’d hoped because I wasn’t sure what they were supposed to be doing in the first place!

My next step should probably be to understand how text gets converted into vectors, then move onto tf-idf and see if I have a better idea of how to solve my problem.

Written by Mark Needham

November 24th, 2012 at 7:43 pm

Posted in Machine Learning

Learning: Switching between theory and practice

with one comment

In one of my first ever blog posts I wrote about the differences I’d experienced in learning the theory about a topic and then seeing it in practice.

The way I remember learning at school and university was that you learn all the theory first and then put it into practice but I typically don’t find myself doing this whenever I learn something new.

I spent a bit of time over the weekend learning more about neural networks as my colleague Jen Smith suggested this might be a more effective technique for getting a higher accuracy score on the Kaggle Digit Recogniser problem.

I first came across neural networks during Machine Learning Class about a year ago but I didn’t put any of that knowledge into practice and as a result it’s mostly been forgotten so my first step was to go back and watch the videos again.

Having got a high level understanding of how they work I thought I’d try and find a Neural Networks implementation in Mahout since Jen and I have been hacking with that so I have a reasonable level of familiarity with it.

I could only find people talking about writing an implementation rather than any suggestion that there was one so I turned to Google and came across netz – a Clojure implementation of neural networks.

On its project page there were links to several ‘production ready’ Java frameworks for building neural networks including neuroph, encog and FANN.

I spent a few hours playing around with some of the encog examples and trying to see whether or not we’d be able to plug the Digit Recogniser problem into it.

To refresh, the digit recogniser problem is a multi-class classification problem where we train a classifier with a series of 784 pixel brightness values for which we already know which digit they refer to.

We should then be able to feed it any new set of 784 pixel brightness values and it will tell us which digit that is most likely to be.

I realised that the OCR encog example wouldn’t quite work because it assumed that you’d only have one training example for each class!

SOMClusterCopyTraining.java

* For now, this trainer will only work if you have equal or fewer training elements
* to the number of output neurons.

I was pretty sure that I didn’t want to have 40,000 output neurons but I thought I’d better switch back to theory and make sure I understood how neural networks were supposed to work by reading the slides from an introductory talk.

Now that I’ve read those I’m ready to go back into the practical side again and try and build up a network a bit more manually than I imagined the previous time by using the BasicNetwork class.
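
As a rough sketch of what I have in mind (this is the Encog 3 style of API and the hidden layer size is just a guess rather than anything we’ve tried):

// Hypothetical sketch of building a network manually with BasicNetwork:
// 784 input neurons (one per pixel), one hidden layer, 10 output neurons (one per digit)
BasicNetwork network = new BasicNetwork();
network.addLayer(new BasicLayer(null, true, 784));
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 100));
network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 10));
network.getStructure().finalizeStructure();
network.reset();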

I’m sure as I do that I’ll have to switch back to theory again and read a bit more, then code a bit more and so the cycle goes on!

Written by Mark Needham

November 19th, 2012 at 1:31 pm

Posted in Learning, Machine Learning

Mahout: Using a saved Random Forest/DecisionTree

with one comment

One of the things that I wanted to do while playing around with random forests using Mahout was to save the random forest and then use it again later, which is something Mahout does cater for.

It was actually much easier to do this than I’d expected and assuming that we already have a DecisionForest built we’d just need the following code to save it to disc:

int numberOfTrees = 1;
Data data = loadData(...);
DecisionForest forest = buildForest(numberOfTrees, data);
 
String path = "saved-trees/" + numberOfTrees + "-trees.txt";
DataOutputStream dos = new DataOutputStream(new FileOutputStream(path));
 
forest.write(dos);

When I was looking through the API for how to load that file back into memory again it seemed like all the public methods required you to be using Hadoop in some way which I thought was going to be a problem as I’m not using it.

For example the signature for DecisionForest.load reads like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
 
public static DecisionForest load(Configuration conf, Path forestPath) throws IOException { }

As it turns out though you can just pass an empty configuration and a normal file system path and the forest shall be loaded:

int numberOfTrees = 1;
 
Configuration config = new Configuration();
Path path = new Path("saved-trees/" + numberOfTrees + "-trees.txt");
DecisionForest forest = DecisionForest.load(config, path);

Much easier than expected!

Written by Mark Needham

October 27th, 2012 at 10:03 pm

Posted in Machine Learning

Kaggle Digit Recognizer: Mahout Random Forest attempt

with 10 comments

I’ve written previously about the K-means approach that Jen and I took when trying to solve Kaggle’s Digit Recognizer and having stalled at about 80% accuracy we decided to try one of the algorithms suggested in the tutorials section – the random forest!

We initially used a Clojure random forests library but struggled to build the random forest from the training set data in a reasonable amount of time so we switched to Mahout’s version which is based on Leo Breiman’s random forests paper.

There’s a really good example explaining how ensembles work on the Factual blog which we found quite useful in helping us understand how random forests are supposed to work.

One of the most powerful Machine Learning techniques we turn to is ensembling. Ensemble methods build surprisingly strong models out of a collection of weak models called base learners, and typically require far less tuning when compared to models like Support Vector Machines.

Most ensemble methods use decision trees as base learners and many ensembling techniques, like Random Forests and Adaboost, are specific to tree ensembles.

We were able to adapt the BreimanExample included in the examples section of the Mahout repository to do what we wanted.

To start with we wrote the following code to build the random forest:

public class MahoutKaggleDigitRecognizer {
  public static void main(String[] args) throws Exception {
    String descriptor = "L N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N ";
    String[] trainDataValues = fileAsStringArray("data/train.csv");
 
    Data data = DataLoader.loadData(DataLoader.generateDataset(descriptor, false, trainDataValues), trainDataValues);
 
    int numberOfTrees = 100;
    DecisionForest forest = buildForest(numberOfTrees, data);
  }
 
  private static DecisionForest buildForest(int numberOfTrees, Data data) {
    int m = (int) Math.floor(Maths.log(2, data.getDataset().nbAttributes()) + 1);
 
    DefaultTreeBuilder treeBuilder = new DefaultTreeBuilder();
    treeBuilder.setM(m);
 
    return new SequentialBuilder(RandomUtils.getRandom(), treeBuilder, data.clone()).build(numberOfTrees);
  }
 
  private static String[] fileAsStringArray(String file) throws Exception {
    ArrayList<String> list = new ArrayList<String>();
 
    DataInputStream in = new DataInputStream(new FileInputStream(file));
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
 
    String strLine;
    br.readLine(); // discard top one (header)
    while ((strLine = br.readLine()) != null) {
      list.add(strLine);
    }
 
    in.close();
    return list.toArray(new String[list.size()]);
  }
}

The training data file looks a bit like this:

label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8...,pixel783
1,0,0,0,0,0,0,...,0
0,0,0,0,0,0,0,...,0

So in this case the label is in the first column which is represented as an L in the descriptor and the next 784 columns are the numerical value of the pixels in the image (hence the 784 N‘s in the descriptor).

We’re telling it to create a random forest which contains 100 trees and since we have a finite number of categories that an entry can be classified as we pass false as the 2nd argument (regression) of DataLoader.generateDataset.

The m value determines how many attributes (pixel values in this case) are considered at each split when constructing a tree and supposedly log2(number_of_attributes) + 1 is the optimal value for that – with our 784 pixel attributes that works out to 10!

We then wrote the following code to predict the labels of the test data set:

public class MahoutKaggleDigitRecognizer {
  public static void main(String[] args) throws Exception {
    ...
    String[] testDataValues = testFileAsStringArray("data/test.csv");
    Data test = DataLoader.loadData(data.getDataset(), testDataValues);
    Random rng = RandomUtils.getRandom();
 
    for (int i = 0; i < test.size(); i++) {
      Instance oneSample = test.get(i);
 
      double classify = forest.classify(test.getDataset(), rng, oneSample);
      int label = data.getDataset().valueOf(0, String.valueOf((int) classify));
 
      System.out.println("Label: " + label);
    }
  }
 
  private static String[] testFileAsStringArray(String file) throws Exception {
    ArrayList<String> list = new ArrayList<String>();
 
    DataInputStream in = new DataInputStream(new FileInputStream(file));
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
 
    String strLine;
    br.readLine(); // discard top one (header)
    while ((strLine = br.readLine()) != null) {
      list.add("-," + strLine);
    }
 
    in.close();
    return list.toArray(new String[list.size()]);
  }
}

There were a couple of things that we found confusing when working out how to do this:

  1. The format of the test data needs to be identical to that of the training data which consisted of a label followed by 784 numerical values. Obviously with the test data we don’t have a label so Mahout expects us to pass a ‘-‘ where the label would go, otherwise it will throw an exception, which explains the ‘-‘ on the list.add line.
  2. We initially thought the value returned by forest.classify was the prediction but in actual fact it’s an index which we then need to look up on the data set.

When we ran this algorithm against the test data set with 10 trees we got an accuracy of 83.8%, with 50 trees we got 84.4%, with 100 trees we got 96.28% and with 200 trees we got 96.33% which is where we’ve currently peaked.

The amount of time it’s taking to build the forests as we increase the number of trees is also starting to become a problem so our next step is either to look at a way to parallelise the creation of the forest or do some sort of feature extraction to try and improve the accuracy.

The code is on github if you’re interested in playing with it or have any suggestions on how to improve it.

Written by Mark Needham

October 27th, 2012 at 8:24 pm

Posted in Machine Learning

Kaggle Digit Recognizer: K-means optimisation attempt

with one comment

I recently wrote a blog post explaining how Jen and I used the K-means algorithm to classify digits in Kaggle’s Digit Recognizer problem and one of the things we’d read was that with this algorithm you often end up with situations where it’s difficult to classify a new item because it falls between two labels.

We decided to have a look at the output of our classifier function to see whether or not that was the case.

As a refresher, this was our original function to determine the label for a new test data row…

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)]
    [(ffirst (sort-by second all-the-gaps)) all-the-gaps]))

…and that would return a result like this:

user>  (which-am-i (first test-data))
[0 ([0 1763.5688862988827] [1 2768.1143197890624] [2 2393.9091578180937] 
[3 2598.4629450761286] [4 2615.1233720558307] [5 2287.1791665580586] 
[6 2470.096959417967] [7 2406.0132574502527] [8 2489.3635108564304] [9 2558.0054056506265])]

We first changed the function to return another entry in the vector representing the top two picks:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))]
    [(ffirst (sort-by second all-the-gaps)) top-two all-the-gaps]))

If we run that:

user> (which-am-i (first test-data))
[2 ([2 1855.3605185002546] [0 2233.654619238101]) 
([0 2233.654619238101] [1 2661.7148686603714] [2 1855.3605185002546] 
[3 2512.429018687221] [4 2357.637631775974] [5 2457.9850052966344] 
[6 2243.724487123002] [7 2554.1158473740174] [8 2317.567588716217] [9 2520.667565741239])]

In this case the difference between the top 2 labels is about 400 but we next changed the function to output the actual difference so we could see how close the top 2 were over the whole test set:

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))
        difference-between-top-two (Math/abs (apply - (map second top-two)))]
    [(ffirst (sort-by second all-the-gaps)) top-two difference-between-top-two all-the-gaps]))

We then ran it like this, taking just the first 10 differences on this occasion:

user>  (take 10 (->> test-data (map which-am-i) (map #(nth % 2))))
(378.2941007378465 523.6102802591759 73.57510749262792 3.8557350283749656 5.806672422475231 
183.90928740097775 713.1626629833258 335.38646365464047 538.6191727330108 161.68429111785372)

From visually looking at this over a larger subset we noticed that there seemed to be a significant number of top twos within a distance of 50.

We therefore changed the function to use shuffle to randomly pick between those two labels when the distance was that small.

(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)
        top-two (take 2 (sort-by second all-the-gaps))
        difference-between-top-two (Math/abs (apply - (map second top-two)))
        very-close (< difference-between-top-two 50)
        best-one (if very-close (ffirst (shuffle top-two)) (ffirst top-two))]
    [best-one top-two all-the-gaps]))

Our assumption was that this might marginally improve the performance of the algorithm but it actually made it marginally worse, with a new accuracy of 79.3%.

At this point we weren’t really sure what we should be doing to try and improve the algorithm’s performance so we moved onto random forests but it’s interesting that you can get a reasonable accuracy even with such a simple approach!

Written by Mark Needham

October 27th, 2012 at 12:27 pm

Posted in Machine Learning

Kaggle Digit Recognizer: A K-means attempt

with 2 comments

Over the past couple of months Jen and I have been playing around with the Kaggle Digit Recognizer problem – a ‘competition’ created to introduce people to Machine Learning.

The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

You are given an input file which contains multiple rows each containing 784 pixel values representing a 28×28 pixel image as well as a label indicating which number that image actually represents.

One of the algorithms that we tried out for this problem was a variation on the k-means clustering one whereby we took the values at each pixel location for each of the labels and came up with an average value for each pixel.

So we’d end up with something like this:

Label 0: Pixel 1: 214, Pixel 2: 12, Pixel 3: 10...Pixel 784: 23
Label 1: Pixel 1: 234, Pixel 2: 0, Pixel 3: 25...Pixel 784: 0
Label 2: Pixel 1: 50, Pixel 2: 23, Pixel 3: 20...Pixel 784: 29
...
Label 9: Pixel 1: 0, Pixel 2: 2, Pixel 3: 10...Pixel 784: 1

When we needed to classify a new image we calculated the distance between each pixel of the new image against the equivalent pixel of each of the 10 pixel averaged labels and then worked out which label our new image was closest to.

We started off with some code to load the training set data into memory so we could play around with it:

(require '[clojure.string :as string])    
(use 'clojure.java.io)
 
(defn parse [reader]
  (drop 1 (map #(string/split % #",") (line-seq reader))))
 
(defn get-pixels [pix] (map #( Integer/parseInt %) pix))
 
(defn create-tuple [[ head & rem]] {:pixels (get-pixels rem) :label head})
 
(defn parse-train-set [reader] (map create-tuple (parse reader)))
 
(defn read-train-set [n]
  (with-open [train-set-rd (reader "data/train.csv")]
    (vec (take n (parse-train-set train-set-rd)))))

One thing we learnt was that it’s helpful to just be able to take a small subset of the data set into memory rather than loading the whole thing in straight away. I ended up crashing my terminal a few times by evaluating a 40,000 line file into the Slime buffer – not a good idea!

To get the first row we’d do this:

user> (first (read-train-set 1))
{:pixels (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...), :label "0"}

We wrote the following code to work out the average pixel values for each of the labels:

(def all-the-data (read-train-set 1000))
 
(defn find-me-all-the [number] (filter #(= (str number) (:label %)) all-the-data))
 
(defn mean [& v]
  (float
   (/ (apply + v) (count v) )))
 
(defn averages [rows] (apply map mean (map :pixels rows)) )
 
(def all-the-averages
  (map vector (range 0 10) (map #(averages (find-me-all-the %)) (range 0 10))))

It’s mostly self explanatory although we had to use float in the mean calculation so that we’d get a decimal value rather than a fraction.

Jen also came up with a neat way of using apply in the averages function to map the mean function across each individual pixel.

I found it easier to follow when we ran the function with a smaller data set:

user> (averages [ {:pixels [1 2 3]} {:pixels [4 5 6]}])
(2.5 3.5 4.5)

That expands out to this:

user> (apply map mean [[1 2 3] [4 5 6]])

Which is conceptually the same as doing this:

user> (map mean [1 2 3] [4 5 6])

We can get the averages for the label ‘0’ like so:

user> (first all-the-averages)
[0 (1.317757 3.3551402 6.196262 7.373831155...74767 171.61682 147.51402 96.943924 48.728973 22.299065 3.037383 )]

To work out what label an untrained collection of pixels is closest to we wrote the following functions:

(defn distance-between [fo1 fo2]
  (Math/sqrt (apply + (map #(* % %) (map - fo1 fo2)))))
 
(defn find-gap [averages unranked-value]
  (vector (first averages) (distance-between (second averages) unranked-value)))
 
(defn which-am-i [unranked-value]
  (let [all-the-gaps (map #(find-gap %1 unranked-value) all-the-averages)]
    [(ffirst (sort-by second all-the-gaps)) all-the-gaps]))

distance-between finds the Euclidean distance between the pixel values, find-gap then uses this to find the distance from each trained label’s set of pixels to the test data set, and we can then call which-am-i to find out which label a new set of pixels should be classified as:

user> (first test-data)
(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...)  
 
user>  (which-am-i (first test-data))
[0 ([0 1763.5688862988827] [1 2768.1143197890624] [2 2393.9091578180937] 
[3 2598.4629450761286] [4 2615.1233720558307] [5 2287.1791665580586] 
[6 2470.096959417967] [7 2406.0132574502527] [8 2489.3635108564304] [9 2558.0054056506265])]

The which-am-i function first returns its prediction and then also includes the distance from the test data set to each of the trained labels so that we can tell how close it was to being classified as something else.

We got an accuracy of 80.657% when classifying new values with this algorithm which isn’t great but doesn’t seem too bad given how simple it is and that we were able to get it up and running in a couple of hours.

The code is on Jen’s github if you’re interested in seeing more.

Written by Mark Needham

October 23rd, 2012 at 7:04 pm

Posted in Machine Learning
