Mark Needham

Thoughts on Software Development

Archive for December, 2009

F#: Word Count – A somewhat failed attempt

with 2 comments

I came across Zach Cox’s word count problem via Sam Aaron and Ola Bini’s twitter streams and I thought it’d be interesting to try it out in F# to see what the solution would be like.

The solution needs to count word frequencies from a selection of newsgroup articles.

I wanted to see if it was possible to write it in F# without using a map to keep track of how many of each word had been found.

My thinking was that I would need to keep all of the words found and then calculate the totals at the end.

After a bit of fiddling this is the version I ended up with:


open System
open System.IO
open System.Text.RegularExpressions
let (|File|Directory|) path = if(Directory.Exists path) then Directory(path) else File(path)
let getFileSystemEntries path = Directory.GetFileSystemEntries path |> Array.to_list
let files path = 
    let rec inner fileSystemEntries files =
        match fileSystemEntries with
            | [] -> files
            | File path :: rest -> inner rest (path :: files)  
            | Directory path :: rest -> inner (List.append rest (getFileSystemEntries path)) files  
    inner (getFileSystemEntries path) []
let downloadFile path = 
    use streamReader = new StreamReader(File.OpenRead path)
let words input= Regex.Matches(input, "\w+") |> Seq.cast |> (fun (x:Match) -> x.Value.ToLower())
let wordCount = files >> 
       downloadFile >>
       words >>
                List.fold (fun acc x -> Seq.append acc x) Seq.empty >> 
                Seq.groupBy (fun x -> x) >> 
       (fun (value, sequence) -> (value, Seq.length sequence))
let writeTo (path:string) (values:seq<string * int>) = 
    use writer = new StreamWriter(path)
    values |> Seq.iter (fun (value,count) -> writer.WriteLine(value + " " + count.ToString()))  
let startTime = DateTime.Now
let count = wordCount "Z:\\20_newsgroups"
printfn "Writing counts in alphabetical order"
count |> Seq.sort |> writeTo "C:\\results\\counts-alphabetical-fsharp.txt"
printfn "Writing counts in descending order"
count |> Seq.sortBy (fun (_, count) -> count * -1) |> writeTo "C:\\results\\counts-descending-fsharp.txt"
let endTime = DateTime.Now
printfn "Finished in: %d seconds" (endTime - startTime).Seconds

The problem is that this version results in a StackOverFlow exception when I try to execute it with all the newsgroup articles although it does work correctly if I select just one of the folders.

From what I can tell the exception happens on line 24 when I get the text out of each of the files and store it in the list.

I tried changing this bit of code so that instead of doing that I combined the ‘words’ and ‘downloadFile’ functions so that the whole string wouldn’t be saved but this doesn’t seem to help that much. The exception just ended up happening a bit further down.

I’m not sure if it’s possible to make this work by making use of lazy collections – I’m not that familiar with those yet so I”m not sure how to do it – or if this approach is just doomed!

Despite the fact it doesn’t quite work there were some interesting things I noticed while playing with this problem:

  • At one stage I was trying to deal with a list of sequences whereby I had a list of sequences of words for each of the newsgroup articles. I found this really difficult to reason about as I was writing a ‘’ and then a ‘’ inside that.

    I originally had the ‘List.fold (fun acc x -> Seq.append acc x) Seq.empty’ line happening later on in that composition of functions such that I grouped all the words and then counted how many there were before folding down into a single sequence.

    I realised this didn’t make much sense and it would be much easier to just go to the sequence earlier on and make the code easier to follow.

  • I’ve previously written about wrapping .NET library calls and I was doing this quite a lot when I started writing this code.

    For example I had written a function called ‘isDirectory’ which wrapped ‘Directory.Exists’ which I wrote more out of habit than anything else but it doesn’t really add much value. I think when we’re talking about wrapping static methods this is probably always the case. It’s when we want to call a method on a C# object that the wrapping approach can be helpful.

  • I quite like the Ruby way of writing to a file…
    open("counts-descreasing-ruby", "w") do |out|
      counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }

    …so I thought I’d see what it would look like to change my ‘writeTo’ function to be more like that:

    let writeTo (path:string) (f: StreamWriter -> Unit) = 
        use writer = new StreamWriter(path)
        f writer
    writeTo "C:\\results\\counts-alphabetical-fsharp.txt" (fun out -> 
        count |> Seq.sort |> Seq.iter (fun (v,c) -> out.WriteLine(v + " " + c.ToString())))

    I’m not sure it reads as well as the original version – the writing to file seems to becomes more prominent in this version than the data being written to it.

If anyone has any ideas about how I can get this not to blow up that would be cool!

These are some of the other solutions that I’ve come across:

Written by Mark Needham

December 18th, 2009 at 2:58 am

Posted in F#

Tagged with

Coding: Naming

with 4 comments

Sarah Taraporewalla recently wrote an interesting post about the importance of words with respect to the way that we use them in our code and it reminded me of some conversations I’ve had with Dave Cameron about the importance of creating a shared understanding of the different types/objects in the systems that we build.

On a few projects that I’ve worked on where we didn’t have a common understanding of what different concepts in the domain should be I noticed that there was a reluctance to make changes to class names.

Dave pointed out that this was probably because people have different maps in their head about what different domain concepts actually mean and changing class names ruins that map and therefore people’s mechanism for remembering what different classes do.

So how can we identify the real names?

If we can come up with a good enough way to name the types that exist in our system such that everyone has the same understanding then the mapping problem disappears.

Domain Driven Design suggests the ubiquitous language as a way to solve this problem, and while getting the language of the business into our code is the best way that I’ve come across for doing this sometimes it’s difficult to get this sort of access to domain experts to make it possible.

In that situation I think just listening to how we talk about the code is enough to identify problem areas.

Typically if a concept isn’t described by the name that it’s been given then the person will describe the actual concept in the words they use to explain what they’re actually referring to.

The neat thing about a ubiquitous language is that if we name things in such a way that everyone does have the same understanding then if we decide later that we need a new name by definition the new name will do a better name of describing an object than the old one.

Names are there to make it easier for us to talk and reason about code – if they’re not doing that then we need to change them until they do.

Written by Mark Needham

December 16th, 2009 at 10:08 pm

Posted in Coding

Tagged with

The Computer Scientist as Toolsmith – Fred Brooks

with 2 comments

I’ve come across a couple of posts recently talking about the gender specificness of the term ‘Software Craftsman’ and Victoria suggests that the term ‘Codesmith’ would be a more appropriate name to use.

I’m not that bothered what the name is but I was reading the transcript of Fred Brooks’ acceptance speech for winning the ACM Allen Newell Award in 1994 titled ‘The Computer Scientist as Toolsmith‘ which has some interesting ideas about what our role should be.

These were some of the parts that I found interesting in the talk:

  • I quite liked the following quote and it seems to cover the same type of ground that we try to cover with the agile approach to software development with respect to writing software that actually provides value to our users:

    If we perceive our role aright, we then see more clearly the proper criterion for success: a toolmaker succeeds as, and only as, the users of his tool succeed with his aid.

    The article goes on to say the following:

    …we tend to forget our users and their real problems, climbing into our ivory towers to dissect tractable abstractions of these problems , abstractions that may have left behind the essence of the real problem.

    I also came across an interesting blog post by Rob Bowley where he discusses some of the limitations he’s noticed in the agile approach with respect to addressing the needs of our customers which seems to cover similar ground.

  • He also makes some interesting points around interdisciplinary collaboration:

    There are real costs associated with any professional collaboration, and interdisciplinary collaborations have some unique costs. I find that our teams spend about a quarter of our professional effort on routine work that supports our collaborators

    I’m not sure whether that last statistic stands true for collaboration between software development teams and the business but a fairly common objection from the business is that they don’t have the time to interact with the software guys and that we should just get on build what they want.

    Rob’s post seems to suggest that we’re not collaborating in a particularly effective way and Brooks goes on to suggest that we need to do some preparation before interacting with these guys:

    Our Ph.D. students often take introductory courses in the using disciplines, and they always take reading courses from our collaborators to prepare them for their dissertation work. One need not become an expert in the partner’s field, of course, but one does need to learn the basic principles, the vocabulary, and the partner’s research objectives.

    I wonder if this is where we sometimes go wrong – we’re focused on the software solution rather than stepping back and working out our customers’ real problems and helping them solve those.

    This part of the article also reminded me of comments made by Eric Evans in his QCon talk about not wasting domain experts time.

  • In one part of the article Brooks talks about artificial intelligence, suggesting:

    …intelligence amplifying systems can, at any given level of available systems technology, beat AI systems. That is, a machine and a mind can beat a mind-imitating machine working by itself

    This seems quite similar to an idea that I read in Taaichi Ohno’s Workplace Management whereby we look to automate processes but not just for the sake of automation. We should automate so that the human can do their job more effectively.

    On the projects I’ve worked on we often make use of automation to provide us with code metrics but a human would then analyse those and work out whether we need to make any changes to the way that we do things based on those metrics.

    Google seem to be going with an even more automated approach with respect to understanding which tests are useful as Marcus Striebeck described in a talk at XP Day and since it seems to be working well for them, perhaps we haven’t yet worked out where the usefulness of a machine ends and a human is required.

Written by Mark Needham

December 16th, 2009 at 6:15 am

Coding: The little details all add to our understanding

with 2 comments

I’ve been watching an interesting presentation by Scott Hanselmann titled ‘Information Overload and Managing the Flow‘ from OreDev where he covers various strategies to allow us to be more productive in the face of the huge amounts of information constantly threatening to overwhelm us.

One interesting suggestion he has around 37 minutes in is that when learning a new language it might be a good idea to contact someone who’s an expert in that language and get some framing knowledge on the type of stuff that’s worth learning and what we might not bother with.

While this seems like a pretty good idea I think it’s also useful to note that a lot of the knowledge that people have in a subject area comes from the little things that may seem insignificant as part of the big picture.

Donald Knuth points out something similar in his interview in Coders at Work:


It seems a lot of the people I’ve talked to had direct access to a machine when they were starting out. Yet Dijkstra has a paper I’m sure you’re familiar with, where he basically says we shouldn’t let computer science students touch a machine for the first few years of their training; they should spend all their time manipulating symbols.


But that’s not the way he learned either. He said a lot of really great things and inspirational things, but he’s not always right. Neither am I, but my take on it is this: Take a scientist in any field. The scientist gets older and says, “Oh, yes, some of the things that I’ve been doing have a really great payoff and other things, I’m not using anymore. I’m not going to have my students waste time on the stuff that doesn’t make giant steps. I’m not going to talk about low-level stuff at all. These theoretical concepts are really so powerful—that’s the whole story. Forget about how I got to this point.”

I think that’s a fundamental error made by scientists in every field. They don’t realize that when you’re learning something you’ve got to see something at all levels. You’ve got to see the floor before you build the ceiling. That all goes into the brain and gets shoved down to the point where the older people forget that they needed it.

Over the last year or so I’ve played around with F#, Scala and Clojure and I’ve found with all three of them that I only start to understand how things work when I take an example from the book and then play around with it in the REPL and work out what combinations of the syntax lead to a compilation error and which don’t.

It’s a pretty inefficient way of learning if you look at it from a high level but I find it’s a necessary step for me. When I don’t fight with the compiler like this then I don’t really understand what I’ve read in a book or blog post well enough to write some code myself or explain it to someone else.

I don’t find that anything is truly a waste of time – even if it’s something indirectly related such as reading a functional programming paper or following some examples of how lazy evaluation works in another language it all contributes to our understanding.

It’s certainly useful to get advice and help from people who know a language really well but we need to be careful not to under estimate the value of making mistakes and learning from them.

Written by Mark Needham

December 15th, 2009 at 8:09 am

Posted in Coding

Tagged with

TDD: Only mock types you own

with 10 comments

Liz recently posted about mock objects and the original ‘mock roles, not objects‘ paper and one thing that stood out for me is the idea that we should only mock types that we own.

I think this is quite an important guideline to follow otherwise we can end up in a world of pain.

One area which seems particularly vulnerable to this type of thing is when it comes to testing code which interacts with Hibernate.

A common pattern that I’ve noticed is to create a mock for the ‘EntityManager‘ and then verify that certain methods on it were called when we persist or load an object for example.

There are a couple of reasons why doing this isn’t a great idea:

  1. We have no idea what the correct method calls are in the first place so we’re just guessing based on looking through the Hibernate code and selecting the methods that we think make it work correctly.
  2. If the library code gets changed then our tests break even though functionally the code might still work

The suggestion in the paper when confronted with this situation is to put a wrapper around the library and then presumably test that the correct methods were called on the wrapper.

Programmers should not write mocks for fixed types, such as those defined by the runtime or external libraries. Instead they should write thin wrappers to implement the application abstractions in terms of the underlying infrastructure. Those wrappers will have been defined as part of a need-driven test.

I’ve never actually used that approach but I’ve found that with Hibernate in particular it makes much more sense to write functional tests which verify the expected behaviour of using the library.

With other libraries which perhaps don’t have side effects like Hibernate does those tests would be closer to unit tests but the goal is still to test the result that we get from using the library rather than being concerned with the way that the library achieves that result.

Written by Mark Needham

December 13th, 2009 at 9:47 pm

Posted in Testing

Tagged with ,

Clojure: My first attempt at a macro

with one comment

I’m up to the chapter on using macros in Stuart Halloway’s ‘Programming Clojure‘ book and since I’ve never used a language which has macros in before I thought it’d be cool to write one.

In reality there’s no reason to create a macro to do what I want to do but I wanted to keep the example simple so I could try and understand exactly how macros work.

I want to create a macro which takes in one argument and then prints hello and the person’s name.

In the book Halloway suggests that we should start with the expression that we want to end up with, so this is what I want:

(println "Hello" person)

My first attempt to do that was:

(defmacro say-hello [person]
  println "Hello" person)

I made the mistake of forgetting to include the brackets around the ‘println’ expression so it doesn’t actually pass ‘”Hello”‘ and ‘person’ to ‘println’. Instead each symbol is evaluated individually.

When we evaluate this in the REPL we therefore don’t quite get what we want:

user=> (say-hello "mark")          

Expanding the macro results in:

user=> (macroexpand-1 '(say-hello "Mark"))

Which is the equivalent of doing this:

user=> (eval (do println "hello" "Mark")) 

As I wrote previously this is because ‘do’ evaluates each argument in order and then returns the last one which in this case is “Mark”.

I fixed that mistake and got the following:

(defmacro say-hello [person]
  (println "Hello" person))

Which returns the right result…

user=> (say-hello "Mark")
Hello Mark

…but actually evaluated the expression rather than expanding it because I didn’t escape it correctly:

user=> (macroexpand-1 '(say-hello "Mark"))
Hello Mark

After these failures I decided to try and change one of the examples from the book instead of my trial and error approach.

One approach used is to build a list of Clojure symbols inside the macro definition:

(defmacro say-hello [person]
  (list println "hello" person))
user=> (macroexpand-1 '(say-hello "Mark"))
(#<core$println__5440 clojure.core$println__5440@681ff4> "hello" "Mark")

This is pretty much what we want and although the ‘println’ symbol has been evaluated at macro expansion time it doesn’t actually make any difference to the way the macro works.

We can fix that by escaping ‘println’ so that it won’t be evaluated until evaluation time:

(defmacro say-hello [person]
  (list 'println "hello" person))
user=> (macroexpand-1 '(say-hello "Mark"))
(println "hello" "Mark")

I thought it should also be possible to quote(‘) the whole expression instead of building up the list:

(defmacro say-hello [person] 
  '(println "hello" person))

This expands correctly but when we try to use it this happens:

user=> (say-hello "Mark")
java.lang.Exception: Unable to resolve symbol: person in this context

The problem is that when we use quote there is no evaluation of any of the symbols in the expression so the symbol ‘person’ is only evaluated at runtime and since it hasn’t been bound to any value we end up with the above error.

If we want to use the approach of non evaluation then we need to make use of the backquote(`) which stops evaluation of anything unless it’s preceded by a ~.

(defmacro a [person]
  `(println "hello" ~person))

This allows us to evaluate ‘person’ at expand time and replace it with the appropriate value.

In hindsight the approach I took to write this macro was pretty ineffective although it’s been quite interesting to see all the different ways that I’ve found to mess up the writing of one!

Thanks to A. J. Lopez, Patrick Logan and fogus for helping me to understand all this a bit better than I did to start with!

Written by Mark Needham

December 12th, 2009 at 3:53 am

Posted in Clojure

Tagged with

Clojure: Forgetting the brackets

without comments

I’ve been playing around with macros over the last few days and while writing a simple one forgot to include the brackets to make it evaluate correctly:

(defmacro say-hello [person]
  println "Hello" person)

This macro doesn’t even expand like I thought it would:

user=> (macroexpand-1 '(say-hello blah))

That seemed a bit strange to me but I eventually realised that I’d missed off the brackets around ‘println’ and the arguments following it which would have resulted in ‘println’ being evaluated with those arguments.

I was a bit curious as to why that happened so I tried the following expression without any brackets to see what would happen:

user=> println "hello" "mark"
#<core$println__5440 clojure.core$println__5440@681ff4>

It seems to just evaluate each thing individually and when we put this type of expression into a function definition the function will do the same thing but also return the last thing evaluated:

(defn say-hello [] println "hello" "mark")
user=> (say-hello)

A. J. Lopez pointed out that this is quite like progn in other LISPs and is the same as doing the following:

user=> (do println "hello" "mark")

do is defined as follows:

(do exprs*)
Evaluates the expressions in order and returns the value of the last. If no expressions are supplied, returns nil.

The way to write a function which passes those two arguments to ‘println’ is of course to put brackets around the statement:

(defn say-hello [] (println "hello" "mark"))
user=> (say-hello) 
hello mark

Written by Mark Needham

December 12th, 2009 at 3:51 am

Posted in Clojure

Tagged with

TDD: Big leaps and small steps

with 6 comments

About a month ago or so Gary Bernhardt wrote a post showing how to get started with TDD and while the post is quite interesting, several comments on the post pointed out that he had jumped from iteratively solving the problem straight to the solution with his final step.

Something which I’ve noticed while solving algorithmic problems in couple of different functional programming languages is that the test driven approach doesn’t work so well for these types of problems.

Dan North points out something similar in an OreDev presentation where he talks about writing a BDD framework in Clojure.

To paraphrase:

If you can’t explain to me where this approach breaks down then you don’t know it well enough. You’re trying to sell a silver bullet.

The classic failure mode for iterative development is the big algorithm case. That’s about dancing with the code and massaging it until all the test cases pass.

Uncle Bob also points this out while referring to the way we develop code around the UI:

There is a lot of coding that goes into a Velocity template. But to use TDD for those templates would be absurd. The problem is that I’m not at all sure what I want a page to look like. I need the freedom to fiddle around with the formatting and the structure until everything is just the way I want it. Trying to do that fiddling with TDD is futile. Once I have the page the way I like it, then I’ll write some tests that make sure the templates work as written.

I think the common theme is that TDD works pretty well when we have a rough idea of where we intend to go with the code but we just don’t know the exact path yet. We can take small steps and incrementally work out exactly how we’re going to get there.

When we don’t really know how to solve the problem – which more often than not seems to be the case with algorithmic type problems – then at some stage we will take a big leap from being nowhere near a working solution to the working solution.

In those cases I think it still makes sense to have some automated tests both to act as regression to ensure we don’t break the code and to tell us when we’ve written the algorithm correctly.

An example of a problem where TDD doesn’t work that well is solving the traveling salesman problem.

In this case the solution to the problem is the implementation of an algorithm and it’s pretty difficult to get there unless you actually know the algorithm.

During that dojo Julio actually spent some time working on the problem a different way – by implementing the algorithm directly – and he managed to get much further than we did.

It seems to me that perhaps this explains why although TDD is a useful design technique it’s not the only one that we should look to use.

When we have worked out where we are driving a design then TDD can be quite a useful tool for working incrementally towards that but it’s no substitute for taking the time to think about what exactly we’re trying to solve.

Written by Mark Needham

December 10th, 2009 at 10:14 pm

Posted in Testing

Tagged with

Haskell vs F#: Function composition

with 11 comments

I’m reading through John Hughes’ ‘Why functional programming matters‘ paper and one thing I’ve come across which is a bit counter intuitive to me is the Haskell function composition operator.

I’ve written previously about F#’s function composition operator which is defined as follows:

let inline (>>) f g x = g(f x)

To write a function which doubled all the values in a list and then returned the odd values we’d do this:

let doubleThenOdd = (fun x -> x*2) >> List.filter (fun x -> x % 2 <> 0)

Of course it’s not possible for there to be any values!

doubleThenOdd [1..5];;
val it : int list = []

Based on that understanding I would expect the Haskell function composition operator (‘.’) to work in the same way:

let doubleThenOdd = map (\ x -> x*2) . filter (\ x -> (mod x 2) /= 0)

But it doesn’t!

Prelude> doubleThenOdd [1..5]

In Haskell the functions are applied from right to left rather than left to right as I had expected.

The definition of ‘.’ is therefore:

(f . g) x = f (g x)

So to get what I wanted we’d need to switch around ‘map’ and ‘filter’:

let doubleThenOdd = filter (\ x -> (mod x 2) /= 0) . map (\ x -> x*2)
Prelude> doubleThenOdd [1..5]

It’s not too difficult to follow once I worked out that it was different to what I was used to but I was very confused for a while!

Is there a reason why they implement this operator differently?

Written by Mark Needham

December 9th, 2009 at 10:10 pm

Posted in F#

Tagged with ,

Clojure: when-let macro

with 6 comments

In my continued playing around with Clojure I came across the ‘when-let‘ macro.

‘when-let’ is used when we want to bind an expression to a symbol and only execute the body provided as the second argument to the macro if that symbol evaluates to true.

As I wrote previously, a value of ‘false’ or ‘nil’ would result in the second argument not being evaluated.

A simple example of using ‘when-let’ would be:

(when-let [a 2] (println "The value of a is:" a))

This is the definition:

(defmacro when-let
  "bindings => binding-form test
  When test is true, evaluates body with binding-form bound to the value of test"
  [bindings & body]
  (assert-args when-let
     (vector? bindings) "a vector for its binding"
     (= 2 (count bindings)) "exactly 2 forms in binding vector")
   (let [form (bindings 0) tst (bindings 1)]
    `(let [temp# ~tst]
       (when temp#
         (let [~form temp#]

The ‘assert-args’ call at the beginning of the macro is quite interesting.

Two assertions are stated:

  • The first argument should be a vector
  • That vector should contain exactly two forms

I’ve not used dynamic languages very much before but it seems like this is one way for a dynamic language to fail fast by checking that the arguments are as expected. In a static language that would be a compile time check.

Line 9 is quite interesting as we know that ‘bindings’ will be a vector so we can take the ‘0th’ and ‘1st’ elements from it and bind them to ‘form’ and ‘tst’ respectively. I didn’t quite pick up on the first few times I read it.

On line 10 it makes use of ‘auto-gensym’ to create a unique name which begins with ‘temp’ and is bound to the value of ‘tst’ which in the simple example provided would be the value ‘2’. As I understand it the name would be something like ‘temp__304’ or something similarly random!

‘when 2’ evaluates to true which means that we execute the body provided as the second argument.

user=> (when-let [a 2] (println "The value of a is:" a))
The value of a is: 2

This is a bit of a contrived example of using the construct and it seems to be properly used when we’re getting a value out of a list and want to check whether or not we’ve reached the end of that list or not. If we have then eventually we’ll have a value of ‘nil’ bound by the ‘let’ and then we’ll know we’re finished.

An example of where the body wouldn’t be evaluated is:

user=> (when-let [a nil] (println "This won't get printed"))

I don’t really understand why we need to bind ‘form’ to ‘temp’ on the second last line as it doesn’t seem like the value is used? I’m sure there’s probably something I’m missing there so if anyone could point it out that’d be cool!

As I understand it, the ‘~@body’ on the last line is called the ‘splicing unquote’ and it allows the individual values in ‘body’ to be put into the template started at ‘`(let [temp# ~tst]’ individually rather than just being put in there as a list.

Written by Mark Needham

December 9th, 2009 at 2:41 am

Posted in Clojure

Tagged with