Mark Needham

Thoughts on Software Development

Archive for July, 2011

Clojure: Getting caught out by lazy collections

Most of the work that I’ve done with Clojure has involved running a bunch of functions directly in the REPL or through Leiningen’s run target which led to me getting caught out when I created a JAR and tried to run that.

As I mentioned a few weeks ago, I’ve been rewriting part of our system in Clojure to see how the design would differ, and a couple of levels down the Clojure version consists of applying a map function over a collection of documents.

The code in question originally looked like this:

(ns aim.main (:gen-class))
 
(defn import-zip-file [zip-file working-dir]
  (let [xml-files (filter xml-file? (unzip zip-file working-dir))]
    (map import-document xml-files)))
 
(defn -main [& args]
  (import-zip-file "our/file.zip", "/tmp/unzip/to/here"))

Which led to absolutely nothing happening when we ran it like this:

$ lein uberjar && java -jar my-project-0.1.0-standalone.jar

I originally assumed that I had something wrong in the code but my colleague Uday reminded me that the sequences returned by map and filter in Clojure are lazily evaluated and there was nothing in the code that would force the evaluation of ours.

In this situation we had to wrap the map with a doall in order to force evaluation of the collection:

(ns aim.main (:gen-class))
 
(defn import-zip-file [zip-file working-dir]
  (let [xml-files (filter xml-file? (unzip zip-file working-dir))]
    (doall (map import-document xml-files))))
 
(defn -main [& args]
  (import-zip-file "our/file.zip", "/tmp/unzip/to/here"))

When we run the code in the REPL or through ‘lein run’ the collection seems to get evaluated eagerly, as far as I understand it, which is why we see different behaviour than when we run the standalone JAR on its own.

I also got caught out on another occasion where I tried to pass around a collection of input streams which I’d retrieved from a zip file, only to realise that by the time the code which used the input streams was evaluated the ZIP file was no longer around!

Written by Mark Needham

July 31st, 2011 at 9:40 pm

Posted in Clojure

Performance tuning our data import: Gather precise data

One of the interesting problems that we have to solve on my current project is working out how to import a few million XML documents into our database in a reasonable amount of time.

The stages of the import process are as follows:

  1. Extract a bunch of ZIP files to disk
  2. Filter down to just the XML documents
  3. Load each XML document and determine whether it is valid to import
  4. Add some meta data to the document for database indexing
  5. Import the document into the database

We’ve been working on this quite a bit recently and one of the main things we’ve learnt is the value of gathering detailed information about what’s actually happening in the code.

When we started we only gathered the end to end time for the whole job to run against a certain number of documents.

The problem with doing this was that we couldn’t see where the constraint in the process was, so we went and parallelised the process using Akka, which gave some improvement but not as much as we expected.

Having realised that we didn’t really know where the bottleneck was we added much more logging to our code to try and identify where the most time was being taken – a rough sketch of the idea is shown after the list below.

For each document there are effectively 3 main things that we’re doing:

  • Loading the XML file
  • Applying the XPath expressions against the file
  • Importing the document into the database
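
The logging itself is nothing clever. A rough sketch of the idea – with println standing in for our actual logger and made-up function names for the three stages – looks something like this:

def timed[T](description: String)(block: => T): T = {
  val start  = System.currentTimeMillis()
  val result = block
  println(description + " in: " + (System.currentTimeMillis() - start))
  result
}

// hypothetical usage, wrapping each of the three stages
// val document = timed("Loaded XML file " + path)(loadXmlFile(path))
// val enriched = timed("Applied XPath expressions")(applyXPaths(document))
// timed("Imported document " + path)(importIntoDatabase(enriched))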

We ran our import process a few times and recorded how long was being spent on each stage.

It was then much easier to see where we needed to focus our attention if we wanted to see big improvements.

We gathered this data for our local environment and QA environment and noticed that there was a big difference in the loading of the XML files – it was 6 or 7 times quicker on the QA environment.

By chance I ended up running the import on a laptop on the train and noticed that it aborted because it couldn’t access an external DTD which was referenced in each XML file.

The QA machine is sitting inside a data centre with a high speed connection which means that the downloading of the DTD files is significantly faster than we can achieve locally.

We realised that we could solve this problem by forcing the parser to load the DTDs locally and immediately saw a huge decrease in the overall time.
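
For the record, a minimal sketch of one way to do that with a JAXP EntityResolver – assuming local copies of the DTDs sit on the classpath under /dtds, which is an illustration rather than exactly what our code does – looks like this:

import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.{EntityResolver, InputSource}

val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
builder.setEntityResolver(new EntityResolver {
  // serve DTD requests from a local copy instead of going out over the network
  def resolveEntity(publicId: String, systemId: String): InputSource =
    if (systemId.endsWith(".dtd"))
      new InputSource(getClass.getResourceAsStream("/dtds/" + systemId.split("/").last))
    else
      null // fall back to the default resolution behaviour
})
val document = builder.parse("/path/to/document.xml")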

Without collecting the data and seeing so clearly where the constraint was it would have taken us much longer to realise where we needed to make improvements.

We still have many more improvements to make but measuring the performance instead of speculating seems to be the way to go.

Written by Mark Needham

July 29th, 2011 at 1:34 am

Unix: Summing the total time from a log file

As I mentioned in my last post we’ve been doing some profiling of a data ingestion job and as a result have been putting some logging into our code to try and work out where we need to focus.

We end up with a log file peppered with different statements which looks a bit like the following:

18:50:08.086 [akka:event-driven:dispatcher:global-5] DEBUG - Imported document. /Users/mneedham/foo.xml in: 1298
18:50:09.064 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo2.xml in: 798
18:50:09.712 [akka:event-driven:dispatcher:global-4] DEBUG - Imported document. /Users/mneedham/foo3.xml in: 298
18:50:10.336 [akka:event-driven:dispatcher:global-3] DEBUG - Imported document. /Users/mneedham/foo4.xml in: 898
18:50:10.982 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo5.xml in: 12298

I can never quite tell which column I need to get so I end up doing some exploration with awk like this to find out:

$ cat foo.log | awk ' { print $9 }'
1298
798
298
898
12298

Once we’ve worked out the column we can add the values together like this:

$ cat foo.log | awk ' { total+=$9 } END { print total }'
15590

I think that’s much better than trying to determine the total run time in the application and printing it out to the log file.

We can also calculate other stats, such as the average and minimum times, if we record a log entry for each document:

$ cat foo.log | awk ' { total+=$9; number+=1 } END { print total/number }'
3118
$ cat foo.log | awk 'min=="" || $9 < min {min=$9; minline=$0}; END{ print min}' 
298

Written by Mark Needham

July 27th, 2011 at 11:02 pm

Posted in Shell Scripting

A crude way of telling if a remote machine is a VM

We were doing a bit of profiling of a data importing process we’ve been running across various environments and wanted to check whether or not one of the environments was a physical machine or a VM.

A bit of googling first led me to a site where you can fill in a MAC address and it will tell you which vendor it belongs to.

macvendorlookup.com is even better though because it’s more easily scriptable!

If I wanted to find the vendor of my MAC address on the ethernet port I could try the following:

ifconfig | grep -A1 en1 | grep ether | cut -d" " -f2 | xargs -I {} curl -s http://www.macvendorlookup.com/getoui.php?mac={} -o - | sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

Which gives:

Vendor: Apple Inc

Sed magic was shamelessly stolen from sed one liners.

As it turns out the machine we wanted to learn about was a VM hosted on VMware!

Written by Mark Needham

July 27th, 2011 at 10:31 pm

Scala: Prettifying test builders with package object

We have several different test builders in our code base which look roughly like this:

case class FooBuilder(bar : String, baz : String) {
	def build = new Foo(bar, baz)
}

In our tests we originally used them like this:

class FooPageTest extends Specs with ShouldMatchers {
	it("should let us load a foo") {
		when(databaseHas(FooBuilder(bar = "Bar", baz = "Bazz")))
		// and so on...
	}
}

This works well but we wanted our tests to only contain domain language and no implementation details.

We therefore started pulling out methods like so:

class FooPageTest extends Specs with ShouldMatchers {
	it("should let us load a foo") {
		when(databaseHas(aFooWithBarAndBaz("Bar", "Bazza")))
		// and so on...
	}
 
	def aFooWithBarAndBaz(bar: String, baz: String) = FooBuilder(bar = bar, baz = baz)
}

This was fine to start with but we eventually ended up with 10-12 different variations of how Foo could be constructed, negating the value that the builder pattern provides.

Instead what we can do is create an alias for FooBuilder to achieve something equally readable:

package object TestSugar {
	val aFooWith = FooBuilder
}

We can then use aFooWith like so:

import TestSugar._
 
class FooPageTest extends Specs with ShouldMatchers {
	it("should let us load a foo") {
		when(databaseHas(aFooWith(bar = "Bar", baz = "Bazza")))
		// and so on...
	}
}

We could also achieve that by renaming FooBuilder to aFooWith but that makes it much less discoverable whereas this solution lets us achieve both goals.

The package object approach isn’t strictly needed – we could easily put those vals into an object or class – but they don’t really seem to belong to either, which is why we’ve gone for this approach.

Written by Mark Needham

July 26th, 2011 at 10:31 pm

Posted in Scala

Retrospectives: The 4 L’s Retrospective

I facilitated my team’s latest retrospective last week and decided to try the 4 L’s technique, which I’d come across while browsing the ‘retrospectives’ tag on del.icio.us.

We had 4 posters around the room representing each of the L’s:

  • Liked
  • Learned
  • Lacked
  • Longed for

I’m not really a fan of a retrospective being dominated by full group discussion as many people aren’t comfortable giving their opinions in front of that many people and therefore end up not participating at all.

I’ve seen much more participation if the facilitator tries to encourage less vocal people to give their opinions and if the first part of the retrospective is done in smaller groups.

We therefore started in groups of three where people discussed the previous iteration and came up with ideas which they stuck under each section. That lasted for around 15 minutes.

After that we split into groups of about 5 – one for each of the L’s – and each group spent 6 or 7 minutes grouping together the stickies and looking for any trends.

One member of each group then presented a summary of their section to the rest of the group and suggested what they thought the most important thing to discuss was.

Having gone around all of the groups we now had 30 minutes to discuss the 4 topics we’d identified. In fact two of them were the same so we only had 3!

My observation of this style of retrospective is that it seemed to achieve the goal of getting more people to participate. At least 2 or 3 people who have never spoken in one of our retrospectives before were giving their opinions to the whole group.

I was curious to see whether we’d cover all the topics that people wanted to discuss since I’d cut the whole group voting exercise which I’ve seen used in most retrospectives I’ve attended.

After we’d finished discussing the 3 main topics a couple of other points were raised which had both been on the ‘longed for’ wall.

We ended up just quickly agreeing to give these things a try for an iteration instead of having a prolonged discussion about the advantages/disadvantages of the idea.

Facilitation-wise I think I could have been clearer with my instructions as people were a bit confused at times about what exactly they were supposed to be doing.

I think it’s vital to get everyone in the group involved early on or they just zone out and their insight is lost.

I’d be interested in hearing other types of retrospectives people have run which allow you to do that.

Written by Mark Needham

July 25th, 2011 at 9:00 pm

Posted in Agile

Scala: Making it easier to abstract code

A couple of months ago I attended Michael Feathers’ ‘Brutal Refactoring’ workshop at XP 2011 where he opined that developers generally do the easiest thing when it comes to code bases.

More often than not this means adding to an existing method or existing class rather than finding the correct place to put the behaviour that they want to add.

Something interesting that I’ve noticed on the project I’m working on is that so far we haven’t been seeing the same trend.

Our code at the moment is made up of lots of little classes with small amounts of logic in them and I’m inclined to believe that Scala as a language has had a reasonable influence on that.

The following quote from ‘Why programming languages?‘ sums it up quite well:

Sometimes the growing complexity of existing programming languages prompts language designers to design new languages that lie within the same programming paradigm, but with the explicit goal of minimising complexity and maximising consistency, regularity and uniformity (in short, conceptual integrity).

It’s incredibly easy to pull out a new class in Scala – the amount of code required to do so is minimal – which seems to be contributing to our willingness to do it.

At the moment nearly all the methods in our code base are one line long and the ones which aren’t do stand out, which I think psychologically makes you want to find a way to keep to the one line method pattern.

Traits

As I’ve mentioned previously we’ve been pulling out a lot of traits as well and the only problem we’ve had there is ensuring that we don’t end up testing their behaviour multiple times in the objects which mix in the trait.

I tend to pull traits out when it seems like there might be an opportunity to reuse that bit of code rather than waiting for the need to arise.

That’s generally not a good idea but it seems to be a bit of a trade-off between making potentially reusable code discoverable and abstracting out the wrong bit of code because we did it too early.
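
To give a flavour of what that looks like – with made-up domain names rather than anything from our actual code base – a trait pulled out like this costs almost nothing:

// a small piece of behaviour pulled out into a trait...
trait HasReference {
  def id: String
  def reference: String = "REF-" + id
}

// ...and mixed into the case classes which need it
case class Document(id: String, title: String) extends HasReference
case class Correction(id: String, documentId: String) extends HasReference

The testing problem mentioned above then becomes making sure that reference only gets tested once, against the trait, rather than again in the tests for every class which mixes it in.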

Companion Objects

The fact that we have companion objects in the language also seems to help us push logic into the right place rather than putting it into an existing class.

We often have companion objects which take in an XML node, extract the appropriate parts of the document and then instantiate a case class object.

In Summary

There’s no reason you couldn’t achieve the same things in C# or Java but I haven’t seen code bases in those languages evolve in the same way.

It will be interesting to see if my observations remain the same as the code base increases in size.

Written by Mark Needham

July 23rd, 2011 at 12:05 pm

Posted in Scala

Scala: Companion Objects

One of the language features available to us in Scala which I think is having a big impact in helping us to make our code base easier to follow is the companion object.

We’ve been using companion objects quite liberally in our code base to define factory methods for our classes.

As I mentioned in a previous post a lot of our objects are acting as wrappers around XML documents and we’ve been pushing some of the data extraction from the XML into companion objects so that our classes can take in non-XML values.

This means we can test the data extraction against the companion object and then create simpler tests against any other logic in the object because we don’t have to create XML documents in each of our tests.

The following is an example of a Foo object being constructed with data from an XML document:

import scala.xml.Node

object Foo {
  def apply(element: Node) = {
    val bar = element.attribute("bar").get.head.text
    val baz = (element \\ "baz").text
    new Foo(bar, baz)
  }
}

There is also some other logic around how a collection of Foos should be ordered and, because the companion object now owns the XML parsing, we can create Foos with appropriate bar and baz values directly to test that.

case class Foo(bar: String, baz: String) extends Ordered[Foo] {
  def compare(that: Foo) = {
    // logic to compare Foos
  }
}
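
To make the testing point concrete – the values below are made up and the ordering is assumed, since the compare logic is elided above – the tests can end up looking something like this:

// ordering can be exercised without building any XML...
val first  = Foo(bar = "abc", baz = "2011-07-01")
val second = Foo(bar = "def", baz = "2011-07-20")
assert(List(second, first).sorted == List(first, second)) // assuming compare puts first before second

// ...and the XML extraction is tested separately against the companion object
val fromXml = Foo(<foo bar="abc"><baz>2011-07-01</baz></foo>)
assert(fromXml == first)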

Before we had the companion object we were putting the logic to create a Foo inside the object which created it, which increased the complexity of that object and made it more difficult for people to read.

We’ve also been using this approach to build up page objects representing sub-sections of a page in our WebDriver tests and it seems to work quite nicely there as well.
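
A hedged sketch of that idea – the page structure and class names here are invented for illustration – might look like:

import org.openqa.selenium.{By, WebElement}

case class SearchResult(title: String, price: String)

object SearchResult {
  // build the page object from the section of the page it represents
  def apply(element: WebElement) =
    new SearchResult(
      element.findElement(By.className("title")).getText,
      element.findElement(By.className("price")).getText)
}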

Written by Mark Needham

July 23rd, 2011 at 11:57 am

Posted in Scala

Clojure: Creating XML document with namespaces

As I mentioned in an earlier post we’ve been parsing XML documents with the Clojure zip-filter API and the next thing we needed to do was create a new XML document containing elements which needed to be inside a namespace.

We wanted to end up with a document which looked something like this:

<root>
<mynamespace:foo xmlns:mynamespace="http://www.magicalurlfornamespace.com">
	<mynamespace:bar>baz</mynamespace:bar>
</mynamespace:foo>
</root>

We can make use of lazy-xml/emit to output an XML string from a map describing the document by wrapping it inside with-out-str like so:

(require '[clojure.contrib.lazy-xml :as lxml])
(defn xml-string [xml-zip] (with-out-str (lxml/emit xml-zip)))

I was initially confused about how we’d be able to create a map representing namespaced elements to pass to xml-string but it turned out to be reasonably simple.

To create a non-namespaced XML string we might pass xml-string the following map:

(xml-string {:tag :root :content [{:tag :foo :content [{:tag :bar :content ["baz"]}]}]})

Which gives us this:

"<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<root>
	<foo>
		<bar>baz</bar>
	</foo>
</root>"

Ideally I wanted to prefix :foo and :bar with ‘mynamespace:’ but I thought that wouldn’t work since that type of syntax would be invalid in Ruby and I assumed it’d be the same in Clojure.

mneedham@Administrators-MacBook-Pro-5.local ~$ irb
>> { :mynamespace:foo "bar" }
SyntaxError: compile error
(irb):1: odd number list for Hash
{ :mynamespace:foo "bar" }
               ^
(irb):1: syntax error, unexpected ':', expecting '}'
{ :mynamespace:foo "bar" }
               ^
(irb):1: syntax error, unexpected '}', expecting $end
	from (irb):1
>>

In fact it isn’t the same in Clojure, so we can just do this:

(xml-string {:tag :root 
  :content [{:tag :mynamespace:foo :attrs {:xmlns:mynamespace "http://www.magicalurlfornamespace.com"} 
              :content [{:tag :mynamespace:bar :content ["baz"]}]}]})
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<root>
<mynamespace:foo xmlns:mynamespace=\"http://www.magicalurlfornamespace.com\">
	<mynamespace:bar>baz</mynamespace:bar>
</mynamespace:foo>
</root>"

As a refactoring step, since I had to add the namespace prefix to a lot of tags, I was able to make use of the keyword function to do so:

(defn tag [name value] {:tag (keyword (str "mynamespace" name)) :content [value]})
> (tag :Foo "hello")  
{:tag :mynamespace:Foo, :content ["hello"]}

Written by Mark Needham

July 20th, 2011 at 8:28 pm

Posted in Clojure

Scala: Rolling with implicit

We’ve been coding in Scala on my project for around 6 weeks now and are getting to the stage where we’re probably becoming a bit dangerous with our desire to try out some of the language features.

One that we’re trying out at the moment is the implicit keyword, which allows you to pass arguments to objects and methods without explicitly supplying them at every call site.

The website we’re working on needs to be accessible in multiple languages and therefore we need to be able to translate some words before they get displayed on the page.

Most of the time it’s just static labels which need to be internationalised but there are a few words which are retrieved from the database and aren’t as easy to deal with.

We introduced the idea of a LanguageAwareString, which acts as a wrapper around a String and has its own toString method that delegates to a Language class containing a dictionary of translations.

It’s defined like this:

case class LanguageAwareString(ignorantValue: String)(implicit val language: Language) {
  override def toString = language.translate(ignorantValue)
}

We didn’t want to have to pass Language to the LanguageAwareString factory method every time since we’re going to be calling it in quite a few places.

We therefore create an implicit definition at the entry point of our application, in the Scalatra controller code:

class Controllers extends ScalatraFilter with ScalateSupport {
	...
	implicit def currentLanguage : Language = // work out the current language
}

As I understand it, whenever the Scala compiler encounters a method or constructor which takes an implicit parameter it looks in the enclosing scope for a value defined as implicit with the expected type.

As long as there’s only one such value in scope it makes use of that value, but if there’s more than one we’d see a compilation error since it wouldn’t know which one to use.

We therefore needed to define Language as an implicit parameter on all the classes/methods along the path the code follows on its way down to LanguageAwareString.
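
For example – with made-up class and method names rather than our actual code – each step on the way down ends up looking something like this:

class ProductPresenter {
  // the implicit has to be re-declared here so that it's in scope when the factory method is called
  def titleFor(title: String)(implicit language: Language): LanguageAwareString =
    LanguageAwareString(title)
}

// further up, e.g. in the controller, the implicit is defined once:
// implicit def currentLanguage: Language = ...
// new ProductPresenter().titleFor("A product") // no Language passed explicitly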

The problem we’ve had is that it’s not immediately obvious what’s going on to someone who hasn’t come across implicit before and we therefore end up having to go over the above each time!

We’ve decided that to ease that transition we’d explicitly pass Language down through the first few classes so that it’s more obvious what’s going on.

We therefore have code like this in a few places:

new ObjectThatTakesLanguageImplicitly(someArg)(currentLanguage)

Maybe we can phase that out as people get used to implicit or maybe we’ll just get rid of implicit and decide it’s not worth the hassle!

Written by Mark Needham

July 19th, 2011 at 6:39 am

Posted in Scala
