Mark Needham

Thoughts on Software Development

Archive for the ‘mahout’ tag

Mahout/Hadoop: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

with one comment

I’ve been working my way through Dragan Milcevski’s mini tutorial on using Mahout to do content-based filtering on documents, and reached the final step, where I needed to read in the generated item-similarity files.

I got the example compiling by using the following Maven dependency:

<dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.9</version>
</dependency>

Unfortunately, when I ran the code I hit a version incompatibility problem:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1113)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
	at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:124)
	at com.markhneedham.mahout.Similarity.getDocIndex(Similarity.java:86)
	at com.markhneedham.mahout.Similarity.main(Similarity.java:25)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Version 0.9 of mahout-core was published in early 2014, so I expect it was built against an earlier version of Hadoop than the one I’m using (2.7.2).

I tried updating the Hadoop dependencies whose classes appear in the stack trace, but to no avail:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
</dependency>
 
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version>
</dependency>

When stepping through the code in the stack trace I noticed that my program was still pulling in an old version of hadoop-core, so with one last throw of the dice I decided to try explicitly excluding it:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
 
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>

And amazingly it worked. Now, finally, I can see how similar my documents are!
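The reading code itself is just standard SequenceFile iteration. Here’s a minimal sketch of the kind of thing — the path is a placeholder and the (IntWritable, VectorWritable) key/value types are an assumption about what the similarity job writes, rather than the tutorial’s exact code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;
 
public class ReadSimilarities {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
 
        // placeholder path - point this at one of the part files written by the similarity job
        Path path = new Path("/path/to/item-similarity/part-r-00000");
 
        // iterate over the (key, value) pairs stored in the SequenceFile
        SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
        try {
            IntWritable key = new IntWritable();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                System.out.println(key.get() + " -> " + value.get());
            }
        } finally {
            reader.close();
        }
    }
}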

Written by Mark Needham

July 22nd, 2016 at 1:55 pm

Posted in Hadoop

Tagged with mahout

Mahout: Exception in thread “main” java.lang.IllegalArgumentException: Wrong FS: file:/… expected: hdfs://

without comments

I’ve been playing around with Mahout over the last couple of days to see how well it works for content-based filtering.

I started following a mini tutorial from Stack Overflow but ran into trouble on the first step:

bin/mahout seqdirectory \
--input file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo \
--output file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out \
-c UTF-8 \
-chunk 64 \
-prefix mah
16/07/21 21:19:20 INFO AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo], --keyPrefix=[mah], --method=[mapreduce], --output=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out], --startPhase=[0], --tempDir=[temp]}
16/07/21 21:19:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/21 21:19:20 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/07/21 21:19:20 INFO deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
16/07/21 21:19:20 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo, expected: hdfs://localhost:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:646)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
	at org.apache.mahout.text.SequenceFilesFromDirectory.runMapReduce(SequenceFilesFromDirectory.java:156)
	at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:90)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:64)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

I was trying to run the command against the local file system on my laptop, which should have been possible according to the instructions. I couldn’t find any flag I could pass to Mahout to tell it not to use HDFS, but I eventually stumbled on someone else experiencing a similar problem.

It turns out that the last time I was playing around with Hadoop, in late 2015, I’d configured it to use HDFS as the default file system and had completely forgotten. I needed to comment out the following config:

/usr/local/Cellar/hadoop/2.7.1/libexec/etc/hadoop/core-site.xml

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
</property>

I commented that property out and all was happy with the (Hadoop) world again.
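For what it’s worth, the reason that property matters is that it decides which FileSystem implementation Hadoop hands back when a tool calls FileSystem.get, and the HDFS one is what rejects file:// paths with the ‘Wrong FS’ error. A rough sketch of the behaviour (just an illustration, not code from Mahout or Hadoop):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
 
public class DefaultFsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
 
        // with the local default, file:// paths resolve against the local file system
        conf.set("fs.defaultFS", "file:///");   // fs.default.name is the older name for this property
        System.out.println(FileSystem.get(conf).getClass().getSimpleName());  // LocalFileSystem
 
        // pointed at HDFS, the same call hands back a DistributedFileSystem,
        // which is what complains about file:// paths with the "Wrong FS" error above
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        System.out.println(FileSystem.get(conf).getClass().getSimpleName());  // DistributedFileSystem
    }
}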

Written by Mark Needham

July 21st, 2016 at 5:57 pm

Posted in Hadoop

Tagged with mahout

Mahout: Parallelising the creation of DecisionTrees

with 5 comments

A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser problem, and after seeing how long it took to create forests with 500+ trees I wanted to see whether the process could be sped up by parallelising it.

From looking at the DecisionTree code it seemed like it should be possible to create lots of small forests and then combine them together.

After unsuccessfully trying to do this directly with DecisionForest, I decided to just copy all the code from that class into my own version, which did allow me to achieve it.

The code to build up the forest ends up looking like this:

List<Node> trees = new ArrayList<Node>();
 
// load a previously saved forest and pull out its trees
MultiDecisionForest savedForest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
trees.addAll(savedForest.getTrees());
 
// build a new forest from the combined collection of trees
MultiDecisionForest forest = new MultiDecisionForest(trees);
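
To combine lots of small forests you’d just repeat that load step for each saved file before constructing the combined forest — something like the following sketch, where the file names are made up:

List<Node> trees = new ArrayList<Node>();
 
// hypothetical file names - one saved forest per run
String[] savedForests = {"saved-trees/forest-1.txt", "saved-trees/forest-2.txt"};
 
for (String savedForestPath : savedForests) {
    MultiDecisionForest savedForest = MultiDecisionForest.load(new Configuration(), new Path(savedForestPath));
    trees.addAll(savedForest.getTrees());
}
 
MultiDecisionForest forest = new MultiDecisionForest(trees);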

We can then use forest to classify values in a test data set and it seems to work reasonably well.

I wanted to avoid putting any threading code in, so I made use of GNU parallel, which is available on Mac OS X via brew install parallel and on Ubuntu by adding the following repository to /etc/apt/sources.list:

deb http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main 
deb-src http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main

…followed by an apt-get update and apt-get install parallel.

I then wrote a script to parallelise the creation of the forests:

parallelise-forests.sh

#!/bin/bash 
 
start=`date`
startTime=`date '+%s'`
numberOfRuns=$1
 
seq 1 ${numberOfRuns} | parallel -P 8 "./build-forest.sh"
 
end=`date`
endTime=`date '+%s'`
 
echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: " $(expr $endTime - $startTime)

build-forest.sh

#!/bin/bash
 
java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar main.java.MahoutPlaybox

It should also be possible to achieve this using the -P (parallel) option of xargs, but unfortunately I wasn’t able to achieve the same success with that command.

I hadn’t come across the seq command until today, but it works quite well here for specifying how many times we want to call the script.

I was probably able to achieve about a 30% speed increase when running this on my MacBook Air. There was a greater increase running on a high-CPU AWS instance, although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.

Sadly, even with a massive number of trees, the new classifier didn’t improve on the Weka random forest with AdaBoost that I wrote about a month ago: an accuracy of 96.282% here compared to 96.529% with the Weka version.

Written by Mark Needham

December 27th, 2012 at 12:08 am

Posted in Machine Learning

Tagged with mahout

Clojure: Mahout’s ‘entropy’ function

with 3 comments

As I mentioned in a couple of previous posts, Jen and I have been playing around with Mahout random forests, and last week we spent a few hours looking through the code to see how it works.

In particular we came across an entropy function which is used to determine how good a particular ‘split’ point in a decision tree is going to be.

I quite like the following definition:

The level of certainty of a particular decision can be measured as a number from 1 (completely uncertain) to 0 (completely certain).

Information theory (developed by Claude Shannon in 1948) defines this value of uncertainty as entropy, a probability-based measure used to calculate the amount of uncertainty.

For example, if an event has a 50/50 probability, the entropy is 1. If the probability is 25/75 then the entropy is a little lower.

The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications.
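Working those two examples through the entropy formula (with base-2 logarithms) shows the drop:

H(p_1, \ldots, p_n) = -\sum_i p_i \log_2 p_i
 
H(0.5, 0.5)   = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5)     = 1.0
H(0.25, 0.75) = -(0.25 \log_2 0.25 + 0.75 \log_2 0.75) \approx 0.811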

The function reads like this (LOG2 is a constant defined elsewhere in the class as Math.log(2.0)):

  private static double entropy(int[] counts, int dataSize) {
    if (dataSize == 0) {
      return 0.0;
    }
 
    double entropy = 0.0;
    double invDataSize = 1.0 / dataSize;
 
    for (int count : counts) {
      if (count == 0) {
        continue; // otherwise we get a NaN
      }
      double p = count * invDataSize;
      entropy += -p * Math.log(p) / LOG2;
    }
 
    return entropy;
  }

We decided to see what the function would look like if it was written in Clojure, and it was clear from the way the entropy variable is mutated that we’d need to do a reduce over a collection to get our final result.

In my first attempt at writing this function I started with the call to reduce and then worked out from there:

(defn individual-entropy [x data-size]
  (let [p (float (/ x data-size))]
    (/ (* (* -1 p) (Math/log p)) (Math/log 2.0))))
 
(defn calculate-entropy [counts data-size]
  (if (= 0 data-size)
    0.0
    (reduce
     (fn [entropy x] (+ entropy (individual-entropy x data-size)))
     0
     (remove (fn [count] (= 0 count)) counts))))

Jen was pretty much looking on in horror the whole time I was writing this function, but I was keen to see how our approaches differed, so I insisted she allow me to finish!

We then moved on to Jen’s version where, instead of writing the code all in one go like I did, we would try to reduce the problem to the point where we wouldn’t need to pass a custom anonymous function to reduce but could instead just pass +.

This meant we’d need to run a map over the counts collection to get the individual entropy values first and then add them all together.

(defn calculate-entropy [counts data-size]
  (->>  counts
       (remove #(= 0 %))
       (map #(individual-entropy % data-size))
       (reduce +)))

Here we’re using the thread-last macro (->>) to make the code a bit easier to read, rather than nesting function calls as I had done.

Jen also showed me a neat way of rewriting the line with the remove function to use a set instead:

(defn calculate-entropy [counts data-size]
  (->>  counts
       (remove #{0})
       (map #(individual-entropy % data-size))
       (reduce +)))

I hadn’t seen this before although Jay Fields has a post showing a bunch of examples of using sets and maps as functions.

In this case if the set is applied to 0 the value will be returned:

user> (#{0} 0)
0

But if the set is applied to a non-zero value we’ll get nil back:

user> (#{0} 2)
nil

So if we apply that to a collection of values we’d see the 0s removed:

user> (remove #{0} [1 2 3 4 5 6 0])
(1 2 3 4 5 6)

I wrote a similar post earlier in the year where another colleague showed me his way of breaking down a problem, but clearly I still haven’t quite got into that mindset, so I thought this one was worth writing up too.

Written by Mark Needham

October 30th, 2012 at 10:46 pm

Posted in Clojure

Tagged with mahout

Mahout: Using a saved Random Forest/DecisionTree

with one comment

One of the things I wanted to do while playing around with random forests in Mahout was to save the random forest and then use it again, which is something Mahout does cater for.

It was actually much easier to do than I’d expected. Assuming we already have a DecisionForest built, we just need the following code to save it to disk:

int numberOfTrees = 1;
Data data = loadData(...);
DecisionForest forest = buildForest(numberOfTrees, data);
 
String path = "saved-trees/" + numberOfTrees + "-trees.txt";
DataOutputStream dos = new DataOutputStream(new FileOutputStream(path));
 
forest.write(dos);
dos.close(); // close the stream so everything is flushed to the file

When I was looking through the API for how to load that file back into memory, it seemed like all the public methods required you to be using Hadoop in some way, which I thought was going to be a problem as I’m not using it.

For example the signature for DecisionForest.load reads like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
 
public static DecisionForest load(Configuration conf, Path forestPath) throws IOException { }

As it turns out, though, you can just pass an empty configuration and a normal file system path, and the forest will be loaded:

int numberOfTrees = 1;
 
Configuration config = new Configuration();
Path path = new Path("saved-trees/" + numberOfTrees + "-trees.txt");
DecisionForest forest = DecisionForest.load(config, path);

Much easier than expected!

Written by Mark Needham

October 27th, 2012 at 10:03 pm

Posted in Machine Learning

Tagged with mahout