Mark Needham

Thoughts on Software Development

Clojure: Mahout’s ‘entropy’ function

with 3 comments

As I mentioned in a couple of previous posts, Jen and I have been playing around with Mahout random forests, and last week we spent a few hours looking through the code to see how it worked.

In particular we came across an entropy function which is used to determine how good a particular ‘split’ point in a decision tree is going to be.

I quite like the following definition:

The level of certainty of a particular decision can be measured as a number from 1 (completely uncertain) to 0 (completely certain).

Information theory (developed by Claude Shannon in 1948) defines this value of uncertainty as entropy, a probability-based measure of the amount of uncertainty.

For example, if an event has a 50/50 probability, the entropy is 1. If the probability is 25/75 then the entropy is a little lower.

The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications.
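Working through those two probabilities by hand is a useful sanity check. Here's a quick sketch in Java (the class and method names are mine, just for illustration):

```java
public class EntropyExample {
    // Shannon entropy of a two-outcome event: -(p*log2(p) + q*log2(q))
    static double entropy(double p) {
        double q = 1.0 - p;
        return -(p * log2(p) + q * log2(q));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        System.out.println(entropy(0.5));  // 50/50 -> 1.0, completely uncertain
        System.out.println(entropy(0.25)); // 25/75 -> ~0.811, a little lower
    }
}
```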

The function reads like this:

  // LOG2 is a constant defined elsewhere in the class as Math.log(2.0)
  private static double entropy(int[] counts, int dataSize) {
    if (dataSize == 0) {
      return 0.0;
    }
    double entropy = 0.0;
    double invDataSize = 1.0 / dataSize;
    for (int count : counts) {
      if (count == 0) {
        continue; // otherwise we get a NaN
      }
      double p = count * invDataSize;
      entropy += -p * Math.log(p) / LOG2;
    }
    return entropy;
  }
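To see it in action, here it is wrapped in a minimal runnable class. Note that LOG2 is defined elsewhere in the Mahout class; I'm assuming it to be Math.log(2.0), and I've dropped the private modifier so the function can be called directly:

```java
public class MahoutEntropy {
    private static final double LOG2 = Math.log(2.0);

    // Same shape as the Mahout function, made package-private for easy calling
    static double entropy(int[] counts, int dataSize) {
        if (dataSize == 0) {
            return 0.0;
        }
        double entropy = 0.0;
        double invDataSize = 1.0 / dataSize;
        for (int count : counts) {
            if (count == 0) {
                continue; // otherwise we get a NaN
            }
            double p = count * invDataSize;
            entropy += -p * Math.log(p) / LOG2;
        }
        return entropy;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new int[]{5, 5}, 10)); // even split -> 1.0
        System.out.println(entropy(new int[]{9, 1}, 10)); // skewed split -> ~0.469
    }
}
```

An even split gives the maximum entropy of 1.0, while a 9/1 split (a much more certain classification) gives a noticeably lower value.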

We decided to see what the function would look like if it was written in Clojure. It was clear from the way the entropy variable is mutated that we'd need to reduce over a collection to get our final result.

In my first attempt at writing this function I started with the call to reduce and then worked out from there:

(defn individual-entropy [x data-size]
  (let [p (float (/ x data-size))]
    (/ (* (* -1 p) (Math/log p)) (Math/log 2.0))))

(defn calculate-entropy [counts data-size]
  (if (= 0 data-size)
    0.0
    (reduce (fn [entropy x] (+ entropy (individual-entropy x data-size)))
            0.0
            (remove (fn [count] (= 0 count)) counts))))

Jen watched on in horror the whole time I was writing this function, but I was keen to see how our approaches differed so I insisted she let me finish!

We then moved on to Jen's version. Instead of writing the code all in one go as I did, we tried to break the problem down to the point where we wouldn't need to pass a custom anonymous function to reduce, but could instead just pass +.

This meant we’d need to run a map over the counts collection to get the individual entropy values first and then add them all together.

(defn calculate-entropy [counts data-size]
  (->> counts
       (remove #(= 0 %))
       (map #(individual-entropy % data-size))
       (reduce +)))
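For comparison, the same remove/map/reduce pipeline could be sketched with Java streams (my rough translation, not anything from Mahout):

```java
import java.util.stream.IntStream;

public class StreamEntropy {
    static double individualEntropy(int count, int dataSize) {
        double p = (double) count / dataSize;
        return -p * Math.log(p) / Math.log(2.0);
    }

    static double calculateEntropy(int[] counts, int dataSize) {
        return IntStream.of(counts)
                .filter(count -> count != 0)                      // (remove #(= 0 %))
                .mapToDouble(c -> individualEntropy(c, dataSize)) // (map #(individual-entropy % data-size))
                .sum();                                           // (reduce +)
    }

    public static void main(String[] args) {
        System.out.println(calculateEntropy(new int[]{5, 5}, 10)); // -> 1.0
    }
}
```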

Here we're using the thread-last macro (->>) to make the code a bit easier to read, rather than nesting function calls as I had done.

Jen also showed me a neat way of rewriting the line with the remove function to use a set instead:

(defn calculate-entropy [counts data-size]
  (->> counts
       (remove #{0})
       (map #(individual-entropy % data-size))
       (reduce +)))

I hadn’t seen this before although Jay Fields has a post showing a bunch of examples of using sets and maps as functions.

In this case if the set is applied to 0 the value will be returned:

user> (#{0} 0)
0

But if the set is applied to a non 0 value we’ll get a nil back:

user> (#{0} 2)
nil

So if we apply that to a collection of values we’d see the 0s removed:

user> (remove #{0} [1 2 3 4 5 6 0])
(1 2 3 4 5 6)
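There's no direct Java equivalent of calling a set as a function, but the closest idiom is probably filtering with Set#contains. A rough sketch of the same remove-zeros idea:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RemoveZeros {
    static List<Integer> removeZeros(List<Integer> values) {
        Set<Integer> zeros = Set.of(0);
        // keep everything the set does NOT contain, like (remove #{0} ...)
        return values.stream()
                     .filter(v -> !zeros.contains(v))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeZeros(Arrays.asList(1, 2, 3, 4, 5, 6, 0)));
        // [1, 2, 3, 4, 5, 6]
    }
}
```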

I wrote a similar post earlier in the year where another colleague showed me his way of breaking down a problem but clearly I still haven’t quite got into the mindset so I thought it was worth writing up.


Written by Mark Needham

October 30th, 2012 at 10:46 pm

Posted in Clojure


  • Nice to see how other folks are constructing – and analyzing – solutions in Clojure!

    I would probably have used (remove zero?) instead of (remove #{0}) but the set lookup is a neat trick. I’d probably swap the arguments on individual-entropy so I could (map (partial individual-entropy data-size)) to avoid the raw anonymous function (and the varying argument is the last one, which I like better).

    I would probably simplify the math calculation, taking advantage of the one-argument versions of - and /:

    (* (- p) (Math/log p) (/ (Math/log 2.0)))

    which may or may not be more intuitive? Overall, I like Jen’s reduce-over-map style better than the single reduce with more functions.

  • @twitter-808096:disqus yep, me too; Jen’s version is much better! The difference in the two ways we approached the problem is described pretty well here I think ->

    I was thinking about the ‘partial’ idea – that would work much better in Haskell, where you could just do ‘map (individual-entropy data-size) counts’, but the Clojure way isn’t bad either. In actual fact the ‘data-size’ variable is unnecessary because it’s just the sum of the values in ‘counts’, so it could be removed.

    Neat idea with the shortened Math calculation, didn’t know you could do that!

  • Sean Corfield

    Interesting re: intuitive vs reasoning. I recognize bits of both in myself, although I’m more in the REPL camp overall 🙂