# Mark Needham

Thoughts on Software Development

## Kaggle Digit Recognizer: A feature extraction #fail

I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction in this context means that we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try and persuade me that we should try it out but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

> 5. The Space between the Data Set and the Algorithm
>
> Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.
>
> One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I’ve highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non-zero pixels, the number of pixels with a value of 255, the mean pixel value and the mean of the middle pixels of each digit.

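```r
# Load the Kaggle training data
initial <- read.csv("train.csv", header = TRUE)

# Count the non-zero entries in each row
initial$nonZeros <- apply(initial, 1, function(entries) length(Filter(function (x) x != 0, entries)))

# Count the entries equal to 255 in each row
initial$fullHouses <- apply(initial, 1, function(entries) length(Filter(function (x) x == 255, entries)))

# Mean of every value in the row
initial$meanPixels <- apply(initial, 1, mean)

# Mean of columns 200 to 500, roughly the middle of the image
initial$middlePixels <- apply(initial[,200:500], 1, mean)
```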
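One thing to watch in the snippet above: `apply(initial, 1, ...)` runs over every column of the data frame, so the `label` column and any feature columns added earlier are included in the later counts and means. A minimal sketch of a variant that only looks at the pixel columns follows. It assumes the columns are named `label`, `pixel0` ... `pixel783`, and it uses a small synthetic data frame in place of `train.csv` so that it runs on its own; `pixelCols` and `pixelMatrix` are names introduced here for illustration.

```r
# Synthetic stand-in for train.csv: a label column plus 784 pixel columns.
# (Assumption: the real file has the same layout.)
set.seed(42)
pixels <- matrix(sample(c(0, 128, 255), 5 * 784, replace = TRUE,
                        prob = c(0.8, 0.1, 0.1)), nrow = 5)
colnames(pixels) <- paste0("pixel", 0:783)
initial <- data.frame(label = c(3, 1, 4, 1, 5), pixels)

# Pick out the pixel columns once, before any features are added
pixelCols <- grep("^pixel", names(initial), value = TRUE)
pixelMatrix <- as.matrix(initial[, pixelCols])

# Row-wise features computed from the pixel values only
initial$nonZeros <- rowSums(pixelMatrix != 0)
initial$fullHouses <- rowSums(pixelMatrix == 255)
initial$meanPixels <- rowMeans(pixelMatrix)

# Same 200:500 range as before, but now relative to the pixel columns
initial$middlePixels <- rowMeans(pixelMatrix[, 200:500])
```

`rowSums` and `rowMeans` are vectorised, so on the full training set this should also be noticeably quicker than calling `Filter` once per row.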

I then wrote those features out into a CSV file like so:

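```r
# Keep the label and the four derived features
newFeatures <- subset(initial, select = c(label, nonZeros, meanPixels, fullHouses, middlePixels))

# Write them out as comma-separated values with a header row
write.table(file = "feature-extraction.txt", newFeatures, row.names = FALSE, sep = ",")
```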

I then created a 100-tree random forest using Mahout to see whether we could get any sort of accuracy using these features.

Unfortunately the accuracy on the cross validation set (10% of the training data) was only 24% which is pretty useless so it’s back to the drawing board!
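For reference, holding back 10% of the data and scoring predictions against it can be sketched in a few lines of base R. This is not the Mahout setup used above, only an illustration: the `features` data frame is synthetic, and `accuracy` and `baseline` are hypothetical names. The baseline (always predicting the most common label) is a useful yardstick, since with ten roughly balanced digit classes it should land at around 10%.

```r
set.seed(1)

# Synthetic stand-in for the extracted features (assumption, not the real data)
n <- 1000
features <- data.frame(label = sample(0:9, n, replace = TRUE),
                       nonZeros = sample(50:300, n, replace = TRUE))

# Hold back 10% of the rows as a validation set
holdoutIdx <- sample(seq_len(n), size = round(0.1 * n))
validation <- features[holdoutIdx, ]
training <- features[-holdoutIdx, ]

# Fraction of predictions that match the true labels
accuracy <- function(predicted, actual) mean(predicted == actual)

# Baseline: always predict the most frequent label in the training rows
majority <- as.integer(names(which.max(table(training$label))))
baseline <- accuracy(rep(majority, nrow(validation)), validation$label)
```

Any classifier trained on `training` can then be scored with `accuracy(predictions, validation$label)` and compared against `baseline`.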

Our next task is to work out whether we can derive some features which correlate more strongly with the label values, or to combine the new features with the existing pixel values to see if that has any impact.
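One rough way to check whether a feature tells the digits apart is to compare its average value per label. The digit labels are categories rather than quantities, so a plain correlation coefficient against `label` can mislead. The sketch below is a crude heuristic rather than a proper feature selection method: it uses synthetic data, and the column names `informative` and `noise` and the `spread` score are made up for illustration.

```r
set.seed(7)

# Synthetic features: one that depends on the label, one that does not
labels <- rep(0:9, each = 20)
newFeatures <- data.frame(
  label = labels,
  informative = labels * 10 + rnorm(200, sd = 2),
  noise = rnorm(200, mean = 100, sd = 20)
)

# Mean of every feature for each digit
perLabel <- aggregate(. ~ label, data = newFeatures, FUN = mean)

# Spread of the per-label means relative to the overall spread:
# values near zero suggest the feature looks the same for every digit
featureNames <- setdiff(names(newFeatures), "label")
spread <- sapply(featureNames, function(f) {
  sd(perLabel[[f]]) / sd(newFeatures[[f]])
})
```

Printing `perLabel` shows the averages side by side, and `sort(spread, decreasing = TRUE)` ranks the features by how much their per-digit averages differ.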

As you can probably tell I don’t really understand how you should go about extracting features so if anybody has ideas or papers/articles I can read to learn more please let me know in the comments!


Written by Mark Needham

January 31st, 2013 at 11:24 pm

Posted in Machine Learning


• Cwharland

“Many people go straight from a data set to applying an algorithm.”

This is a big problem in data science right now in that most of the people hired into the roles are coders with little to know deep knowledge of mathematics or physics (problem solving). Many of the consultants and managers I talk to lament this but admit that the hiring practices right now are likely to continue the trend since it is perceived as risky to hire a scientist that has potential to code. I hope the industry can strike a balance in the near future that sees scientist and skilled coders working together to create incredible problem solving teams. It looks like your kaggle team is a good environment. Hope you are having fun with it.

• Cwharland

*shake fist ipad auto-correct*  know = no

• Yeh, it would be cool to work alongside someone who knows the maths/stats inside out. Jen & I have been reading up on it but the texts tend to be massively academic/theoretical and therefore very difficult for us to follow! Hopefully there will be more accessible books as the field becomes more popular.