Archive for the ‘Clojure’ Category
Clojure/Java Interop: The doto macro
I recently wrote about some code I’ve been playing with to import neo4j spatial data and while looking to simplify the code I came across the doto macro.
The doto macro evaluates an initial object, calls each of the following methods on it in turn and then returns that same object, e.g.
(doto (new java.util.HashMap) (.put "a" 1) (.put "b" 2)) -> {a=1, b=2}
In our case this comes in quite useful in the function used to create a stadium node which initially reads like this:
(defn create-stadium-node [db line]
  (let [stadium-node (.. db createNode)]
    (.. stadium-node (setProperty "wkt" (format "POINT(%s %s)" (:long line) (:lat line))))
    (.. stadium-node (setProperty "name" (:stadium line)))
    stadium-node))
Here we first create a node, set a couple of properties on the node and then return it.
Using the macro it would read like this:
(defn create-stadium-node [db line]
  (doto (.. db createNode)
    (.setProperty "wkt" (format "POINT(%s %s)" (:long line) (:lat line)))
    (.setProperty "name" (:stadium line))))
We can also use it to close the transaction at the end of our function although we don't actually have a need for the transaction object which gets returned:
;; the end of our main function
(.. tx success)
(.. tx finish)
...becomes…
(doto tx (.success) (.finish))
As far as I can tell this is pretty similar in functionality to the Object#tap function in Ruby:
{}.tap { |x| x[:a] = 1; x[:b] = 2 } => {:a=>1, :b=>2}
Either way it's a pretty neat way of simplifying code.
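One thing worth spelling out is that doto hands back the object itself rather than whatever the last method call returned. The following is a minimal sketch of that using a plain java.util.HashMap; the var names are just for illustration:

```clojure
(def m (java.util.HashMap.))

;; Calling .put directly returns the previous value for the key (nil here),
;; not the map.
(def put-result (.put m "a" 1))

;; doto evaluates each call for its side effect and returns the object itself,
;; so we get the very same map back.
(def doto-result (doto m
                   (.put "b" 2)
                   (.put "c" 3)))
```

Here put-result is nil whereas doto-result is the map we started with, now holding all three entries.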
Clojure: Reading and writing a reasonably sized file
In a post a couple of days ago I described some code I’d written in R to find out all the features with zero variance in the Kaggle Digit Recognizer data set and yesterday I started working on some code to remove those features.
Jen and I had previously written some code to parse the training data in Clojure so I thought I’d try and adapt that to write out a new file without the unwanted pixels.
In the first version we’d encapsulated the reading of the file and parsing of it into a more useful data structure like so:
(defn get-pixels [pix]
  (map #(Integer/parseInt %) pix))

(defn create-tuple [[head & rem]]
  {:pixels (get-pixels rem) :label head})

(defn tuples [rows]
  (map create-tuple rows))

(defn parse-row [row]
  (map #(clojure.string/split % #",") row))

(defn read-raw [path n]
  (with-open [reader (clojure.java.io/reader path)]
    (vec (take n (rest (line-seq reader))))))

(def read-train-set-raw (partial read-raw "data/train.csv"))

(def parsed-rows (tuples (parse-row (read-train-set-raw 42000))))
So parsed-rows gives us an in-memory representation of the rows, where each row is a map with the label and pixels stored under separate keys.
We wanted to remove any pixels which had a variance of 0 across the data set which in this case means that they always have a value of 0:
(def dead-to-us-pixels
  [0 1 2 3 4 5 6 7 8 9 10 11 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
   52 53 54 55 56 57 82 83 84 85 111 112 139 140 141 168 196 392 420 421 448
   476 532 560 644 645 671 672 673 699 700 701 727 728 729 730 731 754 755 756
   757 758 759 760 780 781 782 783])

(defn in?
  "true if seq contains elm"
  [seq elm]
  (some #(= elm %) seq))

(defn dead-to-us? [pixel-with-index]
  (in? dead-to-us-pixels (first pixel-with-index)))

(defn remove-unwanted-pixels [row]
  (let [new-pixels (->> row
                        :pixels
                        (map-indexed vector)
                        (remove dead-to-us?)
                        (map second))]
    {:pixels new-pixels :label (:label row)}))

(defn -main []
  (with-open [wrt (clojure.java.io/writer "/tmp/attempt-1.txt")]
    (doseq [line parsed-rows]
      (let [line-without-pixels (to-file-format (remove-unwanted-pixels line))]
        (.write wrt (str line-without-pixels "\n"))))))
We then ran the main method using ‘lein run’ which wrote out the new file.
A print screen of the heap space usage while this function was running looks like this:
While I was writing this version of the function I made a mistake somewhere and ended up passing the wrong data structure to one of the functions which resulted in all the intermediate steps that the data structure goes through getting stored in memory and caused an OutOfMemory exception.
A heap dump showed the following:
When I reduced the size of the erroneous collection by using a ‘take 10’ I got an exception indicating that the function couldn’t process the data structure which allowed me to sort it out.
I initially thought that the problem was to do with the loading of the file into memory at all but since the above seems to work I don’t think it is.
When I was working along that theory Jen suggested it might make more sense to do the reading and writing of the files within a ‘with-open’ which tallies with a suggestion I came across in a StackOverflow post.
I ended up with the following code:
(defn split-on-comma [line]
  (clojure.string/split line #","))

(defn clean-train-file []
  (with-open [rdr (clojure.java.io/reader "data/train.csv")
              wrt (clojure.java.io/writer "/tmp/attempt-2.csv")]
    (doseq [line (drop 1 (line-seq rdr))]
      (let [line-with-removed-pixels ((comp to-file-format remove-unwanted-pixels create-tuple split-on-comma) line)]
        (.write wrt (str line-with-removed-pixels "\n"))))))
Which got called in the main method like this:
(defn -main [] (clean-train-file))
This version had the following heap usage:
Its peaks are slightly lower than the first one and it seems like it buffers a bunch of lines, writes them out to the file (and therefore out of memory) and repeats.
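For anyone wanting to try the read-and-write-inside-one-with-open approach in isolation, here is a self-contained sketch of the same idea. The drop-columns and clean-csv names are made up for this example, and unlike the code above it indexes columns across the whole row (label included) rather than just the pixels:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Hypothetical helper: keep only the fields whose index is not in dead-columns.
(defn drop-columns [dead-columns fields]
  (keep-indexed (fn [idx field]
                  (when-not (contains? dead-columns idx) field))
                fields))

;; Reads in-path one line at a time, skips the header row and writes each
;; cleaned row straight to out-path, so only the current line needs to be
;; held in memory.
(defn clean-csv [in-path out-path dead-columns]
  (with-open [rdr (io/reader in-path)
              wrt (io/writer out-path)]
    (doseq [line (rest (line-seq rdr))]
      (let [fields (string/split line #",")]
        (.write wrt (str (string/join "," (drop-columns dead-columns fields)) "\n"))))))
```

Because the doseq sits inside the with-open, the lazy line-seq is fully consumed before the reader gets closed.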
Any thoughts on this approach are as always very welcome!
Clojure: Thread last (->>) vs Thread first (->)
In many of the Clojure examples that I’ve come across the thread last (->>) macro is used to make it easier (for people from a non lispy background!) to see the transformations that the initial data structure is going through.
In one of my recent posts I showed how Jen & I had rewritten Mahout’s entropy function in Clojure:
(defn calculate-entropy [counts data-size]
  (->> counts
       (remove #{0})
       (map (partial individual-entropy data-size))
       (reduce +)))
Here we are using the thread last operator to first pass counts as the last argument of the remove function on the next line, then to pass the result of that to the map function on the next line and so on.
The function expands out like this:
(remove #{0} counts)
(map (partial individual-entropy data-size) (remove #{0} counts))
(reduce + (map (partial individual-entropy data-size) (remove #{0} counts)))
We can also use clojure.walk/macroexpand-all to see the expanded form of this function:
user> (use 'clojure.walk)
user> (macroexpand-all '(->> counts (remove #{0}) (map (partial individual-entropy data-size)) (reduce +)))
(reduce + (map (partial individual-entropy data-size) (remove #{0} counts)))
I recently came across the thread first (->) macro while reading one of Jay Fields’ blog posts and thought I’d have a play around with it.
The thread first (->) macro is similar but it passes its first argument as the first argument to the next form, then passes the result of that as the first argument to the next form and so on.
It’s pointless to convert this function to use -> because all the functions take the previous result as their last argument, but if we wanted to, the equivalent function would look like this:
(defn calculate-entropy [counts data-size]
  (-> counts
      (->> (remove #{0}))
      (->> (map (partial individual-entropy data-size)))
      (->> (reduce +))))
As you can see we end up using ->> to pass counts as the last argument to remove, then map and then reduce.
The function would expand out like this:
(->> counts (remove #{0}))
(->> (->> counts (remove #{0})) (map (partial individual-entropy data-size)))
(->> (->> (->> counts (remove #{0})) (map (partial individual-entropy data-size))) (reduce +))
If we then evaluate the ->> macro we end up with the nested form:
(->> (->> (remove #{0} counts) (map (partial individual-entropy data-size))) (reduce +))
(->> (map (partial individual-entropy data-size) (remove #{0} counts)) (reduce +))
(reduce + (map (partial individual-entropy data-size) (remove #{0} counts)))
I haven’t written enough Clojure to come across a real use for the thread first macro but Jay has an example on his blog showing how he refactored some code which was initially using the thread last macro to use thread first instead.
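In the meantime, here is a small sketch of my own (not Jay's example) where thread first reads naturally, because the clojure.string functions take the string as their first argument. The var names are only for illustration:

```clojure
(require '[clojure.string :as string]
         '[clojure.walk :as walk])

;; -> puts the previous result in the FIRST argument position.
(def shouted
  (-> "  hello world  "
      string/trim
      string/upper-case
      (string/replace "WORLD" "CLOJURE")))

;; ->> puts the previous result in the LAST argument position, which suits
;; the sequence functions.
(def total
  (->> [1 2 0 3]
       (remove zero?)
       (map inc)
       (reduce +)))

;; The expansion of -> on some made-up symbols shows the difference directly.
(def expanded (walk/macroexpand-all '(-> a (f b) (g c))))
```

shouted comes out as "HELLO CLOJURE", total as 9, and expanded as (g (f a b) c).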
Emacs/Clojure: Starting out with paredit
I’ve been complaining recently to Jen and Bruce about the lack of a beginner’s guide to Emacs paredit mode, which seems to be the de facto approach for people working with Clojure, and both pointed me to the paredit cheat sheet.
While it’s very comprehensive, I found that it’s a little overwhelming for a complete newbie like myself.
I therefore thought it’d be useful to write a bit about a couple of things that I’ve picked up from pairing with Jen on little bits of Clojure over the last couple of months.
Let’s say we start with a simple function to add two numbers together:

And say for example we decide that we want to add 5 to the result so the function adds the two numbers together and then adds 5.
Jen showed me that the best way to do this is to go beyond the furthest bracket to the left and start typing there:

The brackets are now a bit misaligned. We need the ‘)’ where the cursor currently is to go to the end of the line.
One way to do this is to move the cursor in front of the ‘(‘ of the second ‘+’ on the line and press ‘Ctrl + K’ which in emacs means ‘kill line to end’ but in this case kills to the end of the expression that we’re at the beginning of:

We then move the cursor back to just after the ‘5’ and press ‘Ctrl + Y’ which in emacs means re-insert the last text that was killed:
This works but it’s a little bit long winded and Jen showed me a quicker way.
If we go back to the position where we had just inserted the ‘+ 5’ and place our cursor just in front of the ‘)’:

We can then press ‘Ctrl + Shift + Right Arrow’ to push the right bracket to the end of the line:
From what I can tell, this can also be achieved by pressing ‘Meta + X’ followed by ‘paredit-forward-slurp-sexp’ or ‘Meta + Shift + )’.
We have to be a little bit careful about where we position the cursor because if we put it after the bracket then we can end up bringing another function into our one by mistake!
For example say just below our ‘add’ function we have a subtract one:

And we put our cursor just after the ‘)’ of the ‘+ 5’ bit of code and press ‘Ctrl + Shift + Right Arrow’:

We now have a bizarre looking function ‘add’ which has the ‘subtract’ function contained inside it!
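Since the screenshots aren't reproduced here, the following is my reconstruction of the steps in text form. The exact function bodies are an assumption (only the names ‘add’ and ‘subtract’ come from the description above), and | marks the cursor:

```clojure
;; Assumed starting point:
;;   (defn add [x y] (+ x y))
;;
;; After typing "(+ 5" in front of the existing expression, paredit has
;; already inserted the matching bracket for us:
;;   (defn add [x y] (+ 5|) (+ x y))
;;
;; With the cursor just in front of that ')', paredit-forward-slurp-sexp
;; pulls the next expression inside the brackets, giving what we wanted:
(defn add [x y] (+ 5 (+ x y)))

(defn subtract [x y] (- x y))

;; With the cursor just AFTER the ')' we are inside the defn form instead,
;; so the slurp swallows the next top-level form:
;;   (defn add [x y] (+ 5)| (+ x y) (defn subtract [x y] (- x y)))
```

As far as I know, stock paredit binds paredit-forward-slurp-sexp to ‘C-)’ and ‘C-<right>’, so the exact key combination may differ depending on your setup.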
At the moment this is the main paredit shortcut I know and it seems to work reasonably well. I also find myself using ‘Ctrl + Shift + -’ which allows me to undo any mistakes I make!
Now to learn my next command! Any suggestions?
Clojure: Mahout’s ‘entropy’ function
As I mentioned in a couple of previous posts Jen and I have been playing around with Mahout random forests and for a few hours last week we spent some time looking through the code to see how it worked.
In particular we came across an entropy function which is used to determine how good a particular ‘split’ point in a decision tree is going to be.
I quite like the following definition:
The level of certainty of a particular decision can be measured as a number from 1 (completely uncertain) to 0 (completely certain).
Information Theory (developed by Claude Shannon 1948) defines this value of uncertainty as entropy, a probability-based measure used to calculate the amount of uncertainty.
For example, if an event has a 50/50 probability, the entropy is 1. If the probability is 25/75 then the entropy is a little lower.
The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications.
The function reads like this:
private static double entropy(int[] counts, int dataSize) {
  if (dataSize == 0) {
    return 0.0;
  }
  double entropy = 0.0;
  double invDataSize = 1.0 / dataSize;
  for (int count : counts) {
    if (count == 0) {
      continue; // otherwise we get a NaN
    }
    double p = count * invDataSize;
    entropy += -p * Math.log(p) / LOG2;
  }
  return entropy;
}
We decided to see what the function would look like if it were written in Clojure and it was clear from looking at how the entropy variable is being mutated that we’d need to do a reduce over a collection to get our final result.
In my first attempt at writing this function I started with the call to reduce and then worked out from there:
(defn individual-entropy [x data-size]
  (let [p (float (/ x data-size))]
    (/ (* (* -1 p) (Math/log p)) (Math/log 2.0))))

(defn calculate-entropy [counts data-size]
  (if (= 0 data-size)
    0.0
    (reduce (fn [entropy x] (+ entropy (individual-entropy x data-size)))
            0
            (remove (fn [count] (= 0 count)) counts))))
Jen was pretty much watching on with horror the whole time I wrote this function but I was keen to see how our approaches differed so I insisted she allow me to finish!
We then moved onto Jen’s version where instead of writing the code all in one go like I did, we would try to reduce the problem to the point where we wouldn’t need to pass a custom anonymous function to reduce but could instead pass a +.
This meant we’d need to run a map over the counts collection to get the individual entropy values first and then add them all together.
(defn calculate-entropy [counts data-size]
  (->> counts
       (remove #(= 0 %))
       (map #(individual-entropy % data-size))
       (reduce +)))
Here we’re using the threading operator to make the code a bit easier to read rather than nesting functions as I had done.
Jen also showed me a neat way of rewriting the line with the remove function to use a set instead:
(defn calculate-entropy [counts data-size]
  (->> counts
       (remove #{0})
       (map #(individual-entropy % data-size))
       (reduce +)))
I hadn’t seen this before although Jay Fields has a post showing a bunch of examples of using sets and maps as functions.
In this case if the set is applied to 0 the value will be returned:
user> (#{0} 0)
0
But if the set is applied to a non 0 value we’ll get a nil back:
user> (#{0} 2)
nil
So if we apply that to a collection of values we’d see the 0s removed:
user> (remove #{0} [1 2 3 4 5 6 0])
(1 2 3 4 5 6)
I wrote a similar post earlier in the year where another colleague showed me his way of breaking down a problem but clearly I still haven’t quite got into the mindset so I thought it was worth writing up.
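As a quick sanity check, the function can be run against the 50/50 and 25/75 cases from the definition quoted earlier. This sketch repeats the definitions so it can be pasted into a REPL on its own; I've used double rather than float and put back the zero data-size guard from the Java version, which are my choices rather than what we wrote at the time:

```clojure
(defn individual-entropy [x data-size]
  (let [p (/ (double x) data-size)]
    (/ (* -1 p (Math/log p)) (Math/log 2.0))))

(defn calculate-entropy [counts data-size]
  (if (zero? data-size)
    0.0
    (->> counts
         (remove #{0})
         (map #(individual-entropy % data-size))
         (reduce +))))

;; An even split is as uncertain as a two-way decision can be.
(def even-split (calculate-entropy [50 50] 100))

;; A 25/75 split is a little more certain, so the entropy is lower.
(def uneven-split (calculate-entropy [25 75] 100))
```

even-split comes out as 1.0 and uneven-split at roughly 0.81, which matches the description.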
Clojure: Casting to a Java class…or not!
I have a bit of Java code for working out the final destination of a URL assuming that there might be one redirect which looks like this:
private String resolveUrl(String url) {
  try {
    HttpURLConnection con = (HttpURLConnection) (new URL(url).openConnection());
    con.setInstanceFollowRedirects(false);
    con.connect();
    int responseCode = con.getResponseCode();
    if (String.valueOf(responseCode).startsWith("3")) {
      return con.getHeaderField("Location");
    }
  } catch (IOException e) {
    return url;
  }
  return url;
}
I need to cast to HttpURLConnection on the first line so that I can make the call to setInstanceFollowRedirects which isn’t available on URLConnection.
I wanted to write some similar code in Clojure and my first thought was that I needed to work out how to do the cast, which I didn’t know how to do.
I then remembered that Clojure is actually dynamically typed so there isn’t any need – as long as the object has the method that we want to call on it everything will be fine.
In this case we end up with the following code:
(import 'java.net.URL)

(defn resolve-url [url]
  (let [con (.. (new URL url) openConnection)]
    (do (.setInstanceFollowRedirects con false)
        (.connect con))
    (if (.startsWith (str (.getResponseCode con)) "3")
      (.getHeaderField con "Location")
      url)))
Which can be simplified to this:
(defn resolve-url [url]
  (let [con (doto (.. (new URL url) openConnection)
              (.setInstanceFollowRedirects false)
              (.connect))]
    (if (.startsWith (str (.getResponseCode con)) "3")
      (.getHeaderField con "Location")
      url)))
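For what it's worth, the closest thing Clojure has to the Java cast is a type hint. It doesn't convert anything; it just tells the compiler which class to resolve the method against so that it doesn't have to fall back to reflection at runtime. A minimal sketch, where open-without-redirects is a made-up helper and not part of the code above:

```clojure
(import '(java.net URL HttpURLConnection))

;; The ^HttpURLConnection hint plays the role of the Java cast: the call to
;; .setInstanceFollowRedirects is resolved at compile time instead of by
;; reflection. openConnection only creates the connection object; nothing is
;; sent over the network until .connect (or similar) is called.
(defn open-without-redirects [url]
  (let [^HttpURLConnection con (.openConnection (URL. url))]
    (.setInstanceFollowRedirects con false)
    con))
```

Calling (set! *warn-on-reflection* true) in the REPL is a handy way to see which interop calls would benefit from a hint like this.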
Leiningen: Using goose via a local Maven repository
I’ve been playing around a little bit with goose – an HTML content/article extractor – originally in Java but later in Clojure where I needed to work out how to include goose and all its dependencies via Leiningen.
goose isn’t included in a Maven repository so I needed to create a local repository, something which I’ve got stuck on in the past.
Luckily Paul Gross has written a cool blog post explaining how his team got past this problem.
Following the instructions from Paul’s post this is how I got goose playing nicely with clojure:
Inside my clojure project:
/Users/mneedham/github/android/text-extraction $ mkdir maven_repository
I then ran the following command from where I had goose checked out on my machine:
mvn install:install-file -Dfile=target/goose-2.1.6.jar -DartifactId=goose -Dversion=2.1.6 -DgroupId=goose -Dpackaging=jar -DlocalRepositoryPath=/Users/mneedham/github/android/text-extraction/maven_repository -DpomFile=pom.xml
I added the repository and goose dependency to my project.clj file which now looks like this:
(defproject textextraction "0.1.0"
  :description "Extract text from urls"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]
                 [ring/ring-jetty-adapter "0.3.11"]
                 [compojure "0.6.4"]
                 [goose "2.1.6"]]
  :dev-dependencies [[swank-clojure "1.2.1"]]
  :repositories {"local" ~(str (.toURI (java.io.File. "maven_repository")))}
  :main textextraction.main)
I then run:
/Users/mneedham/github/android/text-extraction $ lein run
And goose and all its dependencies are included in the ‘lib’ directory.
Clojure: Getting caught out by lazy collections
Most of the work that I’ve done with Clojure has involved running a bunch of functions directly in the REPL or through Leiningen’s run target which led to me getting caught out when I created a JAR and tried to run that.
As I mentioned a few weeks ago I’ve been rewriting part of our system in Clojure to see how the design would differ and a couple of levels down the Clojure version consists of applying a map function over a collection of documents.
The code in question originally looked like this:
(ns aim.main
  (:gen-class))

(defn import-zip-file [zipFile working-dir]
  (let [xml-files (filter xml-file? (unzip zipFile working-dir))]
    (map import-document xml-files)))

(defn -main [& args]
  (import-zip-file "our/file.zip", "/tmp/unzip/to/here"))
Which led to absolutely nothing happening when run like this!
$ lein uberjar && java -jar my-project-0.1.0-standalone.jar
I originally assumed that I had something wrong in the code but my colleague Uday reminded me that sequences such as the one returned by map are lazily evaluated in Clojure and there was nothing in the code that would force the evaluation of ours.
In this situation we had to wrap the map with a doall in order to force evaluation of the collection:
(ns aim.main
  (:gen-class))

(defn import-zip-file [zip-file working-dir]
  (let [xml-files (filter xml-file? (unzip zip-file working-dir))]
    (doall (map import-document xml-files))))

(defn -main [& args]
  (import-zip-file "our/file.zip", "/tmp/unzip/to/here"))
When we run the code in the REPL the result gets printed, which forces the lazy sequence to be realised. As far as I understand it something similar happens through ‘lein run’, which is why we see different behaviour from when we run the JAR on its own.
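The behaviour is easy to reproduce without any zip files. In this sketch import-document is a stand-in which just records what it was called with in an atom, so everything here is illustrative rather than the real import code:

```clojure
(def imported (atom []))

;; Stand-in for the real function: remembers each document it is given.
(defn import-document [doc]
  (swap! imported conj doc)
  doc)

;; map only builds a lazy sequence; no document has been imported yet.
(def pending (map import-document ["a.xml" "b.xml"]))
(def imported-before-forcing (count @imported))

;; dorun walks the sequence purely for its side effects
;; (doall does the same but also keeps hold of the results).
(dorun pending)
(def imported-after-forcing (count @imported))

;; doseq is never lazy, so it is often the clearer choice when we only care
;; about the side effects.
(doseq [doc ["c.xml"]]
  (import-document doc))
```

imported-before-forcing is 0 and imported-after-forcing is 2, which is exactly the difference between our JAR doing nothing and doing the import.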
I also got caught out on another occasion where I tried to pass around a collection of input streams which I’d retrieved from a zip file only to realise that when the code which used the input stream got evaluated the ZIP file was no longer around!
Clojure: Creating XML document with namespaces
As I mentioned in an earlier post we’ve been parsing XML documents with the Clojure zip-filter API and the next thing we needed to do was create a new XML document containing elements which needed to be inside a namespace.
We wanted to end up with a document which looked something like this:
<root> <mynamespace:foo xmlns:mynamespace="http://www.magicalurlfornamespace.com"> <mynamespace:bar>baz</mynamespace:bar> </mynamespace:foo> </root>
We can make use of lazy-xml/emit to output an XML string from a nested map of :tag, :attrs and :content entries by wrapping it inside with-out-str like so:
(require '[clojure.contrib.lazy-xml :as lxml])

(defn xml-string [xml-zip]
  (with-out-str (lxml/emit xml-zip)))
I was initially confused about how we’d be able to create a map representing name spaced elements to pass to xml-string but it turned out to be reasonably simple.
To create a non namespaced XML string we might pass xml-string the following map:
(xml-string {:tag :root
             :content [{:tag :foo
                        :content [{:tag :bar :content ["baz"]}]}]})
Which gives us this:
"<?xml version=\"1.0\" encoding=\"UTF-8\"?> <root> <foo> <bar>baz</bar> </foo> </root>"
Ideally I wanted to prefix :foo and :bar with ‘:mynamespace’ but I thought that wouldn’t work since that type of syntax would be invalid in Ruby and I thought it’d be the same in Clojure.
mneedham@Administrators-MacBook-Pro-5.local ~$ irb
>> { :mynamespace:foo "bar" }
SyntaxError: compile error
(irb):1: odd number list for Hash
{ :mynamespace:foo "bar" }
^
(irb):1: syntax error, unexpected ':', expecting '}'
{ :mynamespace:foo "bar" }
^
(irb):1: syntax error, unexpected '}', expecting $end
from (irb):1
>>
In fact it isn’t, so we can just do this:
(xml-string {:tag :root
             :content [{:tag :mynamespace:foo
                        :attrs {:xmlns:mynamespace "http://www.magicalurlfornamespace.com"}
                        :content [{:tag :mynamespace:bar :content ["baz"]}]}]})
"<?xml version=\"1.0\" encoding=\"UTF-8\"?> <root> <mynamespace:foo xmlns:mynamespace=\"http://www.magicalurlfornamespace.com\"> <mynamespace:bar>baz</mynamespace:bar> </mynamespace:foo> </root>"
As a refactoring step, since I had to prepend the namespace to a lot of tags, I was able to make use of the keyword function to do so:
(defn tag [name value] {:tag (keyword (str "mynamespace" name)) :content [value]})
> (tag :Foo "hello")
{:tag :mynamespace:Foo, :content ["hello"]}
Clojure: Extracting child elements from an XML document with zip-filter
I’ve been following Nurullah Akkaya’s blog post about navigating XML documents using the Clojure zip-filter API and I came across an interesting problem in a document I’m parsing which goes beyond what’s covered in his post.
Nurullah provides a neat zip-str function which we can use to convert an XML string into a zipper object:
(require '[clojure.zip :as zip]
         '[clojure.xml :as xml])
(use '[clojure.contrib.zip-filter.xml])

(defn zip-str [s]
  (zip/xml-zip (xml/parse (java.io.ByteArrayInputStream. (.getBytes s)))))
The fragment of the document I’m parsing looks like this:
(def test-doc
  (zip-str "<?xml version='1.0' encoding='UTF-8'?>
<root>
  <Person>
    <FirstName>Charles</FirstName>
    <LastName>Kubicek</LastName>
  </Person>
  <Person>
    <FirstName>Mark</FirstName>
    <MiddleName>H</MiddleName>
    <LastName>Needham</LastName>
  </Person>
</root>"))
I wanted to be able to get the full names of each of the people such that I’d have a collection which looked like this:
("Charles Kubicek" "Mark H Needham")
My initial thinking was to get all the child elements of the Person element and operate on those:
(require '[clojure.contrib.zip-filter :as zf])
(xml-> test-doc :Person zf/children text)
Unfortunately that gives back all the names in one collection like so:
("Charles" "Kubicek" "Mark" "H" "Needham")
Since it’s not mandatory to have a MiddleName element it’s not possible to work out which names go with which person!
A bit of googling led me to stackoverflow where Timothy Pratley suggests that we need to get up to the Person element and then pick each of the child elements individually.
We can do that by mapping over the collection with a function which creates a vector for each Person containing all their names.
In pseudo-code this is what we want to do:
> (map magic-function (xml-> test-doc :Person))
(["Charles" "Kubicek"] ["Mark" "H" "Needham"])
Timothy suggests the juxt function which is defined like so:
juxt
Takes a set of functions and returns a fn that is the juxtaposition of those fns. The returned fn takes a variable number of args, and returns a vector containing the result of applying each fn to the args (left-to-right).
A simple use of juxt could be to create some values containing my name:
((juxt #(str % " loves Clojure") #(str % " loves Scala")) "Mark")
Which returns:
["Mark loves Clojure" "Mark loves Scala"]
We can use juxt to build the collection of names and then use clojure.string/join to separate them with a space.
The code to do this ends up looking like this:
(require '[clojure.string :as str])

(defn get-names [doc]
  (->> (xml-> doc :Person)
       (map (juxt #(xml1-> % :FirstName text)
                  #(xml1-> % :MiddleName text)
                  #(xml1-> % :LastName text)))
       (map (partial filter seq))
       (map (partial str/join " "))))
We use a filter on the second last line to get rid of any nil values in the vector (e.g. no middle name) and then combine the names on the last line.
We can then call the function:
> (get-names test-doc)
("Charles Kubicek" "Mark H Needham")
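The same juxt, filter and join pattern works on any data where some of the values might be missing. Here's a sketch which runs without the contrib libraries by using plain maps in place of the zipper locations; the people collection and the full-name function are made up for the example:

```clojure
(require '[clojure.string :as string])

;; Plain maps standing in for the Person elements; :middle is optional.
(def people
  [{:first "Charles" :last "Kubicek"}
   {:first "Mark" :middle "H" :last "Needham"}])

;; Keywords are functions of maps, so juxt can take them directly. A missing
;; key gives us nil, which we remove before joining the rest with a space.
(defn full-name [person]
  (->> ((juxt :first :middle :last) person)
       (remove nil?)
       (string/join " ")))

(def full-names (map full-name people))
```

full-names ends up as ("Charles Kubicek" "Mark H Needham"), the same result as the zip-filter version.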