Mark Needham

Thoughts on Software Development

Mahout: Parallelising the creation of DecisionTrees

A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser problem. After seeing how long it took to create forests with 500+ trees, I wanted to see if this could be sped up by parallelising the process.

From looking at the DecisionTree it seemed like it should be possible to create lots of small forests and then combine them together.

After unsuccessfully trying to achieve this by using DecisionForest directly, I decided to just copy all the code from that class into my own version (MultiDecisionForest in the code below), which allowed me to do it.

The code to build up the forest ends up looking like this:

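List<Node> trees = new ArrayList<Node>();

// Load one of the small forests from disk and collect its trees
MultiDecisionForest smallForest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
trees.addAll(smallForest.getTrees());

// Combine all the collected trees into one large forest
MultiDecisionForest forest = new MultiDecisionForest(trees);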

We can then use forest to classify values in a test data set and it seems to work reasonably well.

I wanted to avoid putting any threading code in, so I made use of GNU parallel. It is available on Mac OS X with brew install parallel, and on Ubuntu by adding the following repository to /etc/apt/sources.list:

deb http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main 
deb-src http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main

…followed by an apt-get update and apt-get install parallel.

I then wrote a script to parallelise the creation of the forests:

parallelise-forests.sh

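#!/bin/bash

# Record the start time, both human-readable and as epoch seconds
start=$(date)
startTime=$(date '+%s')
numberOfRuns=$1

# Run build-forest.sh once per number from seq, with at most 8 jobs at a time
seq 1 "${numberOfRuns}" | parallel -P 8 "./build-forest.sh"

end=$(date)
endTime=$(date '+%s')

echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: $((endTime - startTime)) seconds"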

build-forest.sh

#!/bin/bash
 
java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar main.java.MahoutPlaybox

It should be possible to achieve this with the -P (max-procs) option of xargs, but unfortunately I wasn’t able to get it working with that command.
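For reference, a minimal sketch of what an xargs-based version might look like. This is an untested-against-Mahout illustration, not the script used for the results in this post: the echo is a stand-in for ./build-forest.sh, and the value of numberOfRuns is chosen only for the example.

```shell
#!/bin/bash
# Sketch: run a command once per number from seq, at most 8 at a time.
# The echo is a stand-in (assumption) for ./build-forest.sh so this is safe to run.
numberOfRuns=4

# -P 8: run up to 8 processes in parallel
# -n 1: pass one input line (one run number) to each invocation
# The "_" fills $0 of the inner shell, so the run number arrives as $1.
results=$(seq 1 "${numberOfRuns}" | xargs -P 8 -n 1 sh -c 'echo "built forest $1"' _)

# Lines may appear in any order because the jobs run concurrently
echo "${results}"
```

Swapping the sh -c stand-in for ./build-forest.sh would pass the run number to the script as its first argument.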

I hadn’t come across the seq command until today, but it works quite well here because it lets us specify how many times we want to call the script.
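A quick illustration of why seq fits here: it prints one number per line, so anything reading from the pipe sees exactly one input line per run. The value 3 is just an example.

```shell
#!/bin/bash
# seq prints the numbers from the first argument to the second, one per line
numberOfRuns=3
lines=$(seq 1 "${numberOfRuns}")

# Each line becomes one job when piped into a tool such as parallel
echo "${lines}"
```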

I was probably able to achieve about a 30% speed increase when running this on my Air. There was a greater increase running on a high CPU AWS instance although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.
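One way to start narrowing down the killed jobs might be to record how each run ended. The sketch below is an assumption about how that could be done, not something used for this post: run_and_report is a hypothetical helper, and the self-killing subshell stands in for a ./build-forest.sh run. It relies on the shell convention that an exit status above 128 means the process died from signal (status - 128). It only shows that a job was killed by a signal, not why.

```shell
#!/bin/bash
# Hypothetical helper: run a command and report how it ended.
run_and_report() {
  "$@"
  local status=$?
  if [ "${status}" -gt 128 ]; then
    # Exit statuses above 128 mean the process was terminated by a signal
    echo "killed by signal $((status - 128))"
  elif [ "${status}" -ne 0 ]; then
    echo "failed with exit status ${status}"
  else
    echo "ok"
  fi
  return "${status}"
}

# Stand-in for a build-forest.sh run: a shell that sends itself SIGKILL (9)
report=$(run_and_report bash -c 'kill -9 $$' 2>/dev/null || true)
echo "${report}"
```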

Sadly even with a new classifier with a massive number of trees I didn’t see an improvement over the Weka random forest using AdaBoost which I wrote about a month ago. We had an accuracy of 96.282% here compared to 96.529% with the Weka version.

Written by Mark Needham

December 27th, 2012 at 12:08 am

Posted in Machine Learning

  • http://www.HistorySquared.com HistorySquared

    pretty awesome. how much of a speed up did you get?

  • http://www.markhneedham.com/blog Mark Needham

    @HistorySquared It took 506 seconds to create 5 x 100 trees vs 168 seconds for each one individually. That was using an input with 784 features, so the time is reduced with fewer features. I ran that benchmark on a Mac Air, which has 2 cores, so it should be even faster on a machine with more cores.
    I tried it on an AWS High CPU instance with smaller forests of 10 trees each. Normally 1 of those forests takes 24 seconds and I was able to get 10 in 90 seconds. The problem was that the process got killed for 4 of them, so really it was only 6 in 90 seconds. I couldn’t figure out what was going on and it doesn’t happen locally, so I’m assuming it’s a problem that only manifests when there are more CPUs, which is annoying!
