Mark Needham

Thoughts on Software Development

Archive for October, 2011

Working with external identifiers

with one comment

As part of the ingestion process for our application we import XML documents and corresponding PDFs into a database and onto the file system respectively.


Since the user needs to be able to search for documents by the userFacingId we reference it by that identifier in the database and the web application.

Each document also has an external identifier and we use this to identify the PDFs on the file system.

We can’t use the raw userFacingId to do this because there are some documents which have the same ID when we import them.

Most of the time we only need to care about the userFacingId in the web application but when the user wants to download a PDF we need to map from the userFacingId to the externalId so we can locate the file on the file system.

The first implementation of this code involved some mapping code in the web application from which we constructed an externalId from a given userFacingId.

Unfortunately this logic drifted into a few different places and it started to become really difficult to tell whether we were dealing with a userFacingId or an externalId.

We wanted to try and isolate the translation logic into one place on the edge of the system but Pat pointed out that it would actually be simpler if we never had to care about the externalId in our code.

We changed the ingestion process to add the externalId to each document so that we’d be able to get hold of it when we needed to.


We had to change the design of the code so that whenever the user wants to download a PDF (for example) we make a call to the database by the userFacingId to look up the externalId.

The disadvantage of the approach is that we’re making an extra (internal) network call to look up the ID but it’s the type of code that should be easily cacheable if it becomes a performance problem so it should be fine.

I think this approach is much better than having potentially flawed translation logic in the application.

Written by Mark Needham

October 31st, 2011 at 10:58 pm

Canonical Identifiers

without comments

Duncan and I had an interesting problem recently where we had to make it possible to search within an ‘item’ to find possible sub items that exist inside it.

The URI for the item was something like this:


Let’s say Item 234 contains the following sub items:

  • Mark
  • duncan

We have a search box on the page which allows us to type in the name of a sub item and go to the sub item’s page if it exists or see an error message if it doesn’t.

If the user types in the sub item name exactly right then there’s no problem:


redirects to:


It becomes more interesting if the user gets the case of the sub item wrong e.g. they type ‘mark’ instead of ‘Mark’.

It’s not very friendly in terms of user experience to give the user an error message if they do that so I suggested that we just make the look up of the sub item case insensitive


would therefore find us the ‘Mark’ sub item.

Duncan pointed out that we’d now have more than 1 URI for the same document which isn’t particularly great since theoretically there should be a one to one mapping between a URI and a given document.

He pointed out that we could do a look up to find the ‘canonical identifier’ before we did the redirect such that if you typed in ‘mark’:


would redirect to:


The logic for checking the existence of a sub item would be the bit that’s case insensitive and makes it more user friendly.
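A minimal sketch of that lookup in shell, assuming the sub item names are available one per line in a file. The `canonical_name` helper and the sample names file are hypothetical and stand in for whatever the application actually queries:

```shell
#!/bin/sh
# Hypothetical helper: print the stored (canonical) spelling of a sub item,
# matching the user's input case-insensitively. Prints nothing if absent.
canonical_name() {
  # -i: ignore case, -x: match the whole line, -F: treat the input literally
  grep -i -x -F -e "$1" "$2" | head -n 1
}

# Sample data: the two sub items from the example above.
names=$(mktemp)
printf 'Mark\nduncan\n' > "$names"

found=$(canonical_name "mark" "$names")   # user typed the wrong case
missing=$(canonical_name "bob" "$names")  # no such sub item

echo "redirect to sub item: ${found:-<error page>}"
echo "lookup for bob: ${missing:-<error page>}"

rm -f "$names"
```

Only the existence check ignores case. The value used to build the redirect is always the stored spelling, so there is still a single URI per sub item.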

Written by Mark Needham

October 30th, 2011 at 10:32 pm

Gaming the system: Some project examples

with 2 comments

Earlier this year Liz Keogh gave a talk at QCon London titled ‘Learning and Perverse Incentives: The Evil Hat’ where she eventually encouraged people to try and game the systems that they take part in.

Over the last month or so we’ve had two different metrics visibly on show, which makes them prime targets for being gamed.

The first metric is one we included on our build radiator which shows how many commits to the git repository each person has for that day.

We originally created the metric to try and see which people were embracing git and committing locally and which were still treating it like Subversion and only committing when they had something to push to the central repository.

The other thing we wanted to encourage was lots of small commits, which make it easier for someone browsing ‘git log’ to see what’s happened over time just from glancing at the commit messages.

Bigger commits tend to mean that changes have been made in multiple places and perhaps not all those changes are related to each other.

Since we made that metric visible the number of commits has visibly increased and it’s mostly been positive because people tend to push to the central repository quite frequently.

There have, however, been a couple of occasions where people have made 10/15 commits locally over the day and then pushed them all at the end of the day and gone straight to the top of the leader board.

The disadvantage of this approach is that it means other people on the team aren’t integrating with your changes until right at the end of the day which can lead to merge hell for them.

There have also been some times when people’s count has artificially increased because they’ve checked in, broken the build and then checked in again to fix it.

We’re going to try and find a way to combine local commits with remote pushes in a combined metric as our next trick.

Another metric which we’ve recently made visible is the number of points that we’ve completed so far in the iteration.

Previously we’ve had this data available in our Project Manager’s head and in Mingle but since a big part of how the team is judged is based on the number of points ‘achieved’ the team asked for the score to be made visible.

Since that happened from my observation we’ve ‘achieved’ or got very close to the planned velocity every week whereas before that it was a bit hit and miss.

I think subconsciously the estimates made on stories have started to veer towards the cautious side whereas previously they were probably more optimistic.

Another change in behaviour I’ve noticed is that people tend to postpone any technical tasks they have to do when we’re near the end of an iteration and instead keep focus on the story to ensure it gets completed in time.

We’ve also seen a couple of occasions where people stayed 2/3 hours longer on the last day of the iteration to ensure that stories got signed off so the points could be counted.

It’s been quite interesting to observe how behaviour can change based on increasing the visibility of metrics even when in the first case it’s actually irrelevant to the perception of the team.

Written by Mark Needham

October 26th, 2011 at 11:55 pm

Posted in Systems Thinking

Scala: Adding logging around a repository

with one comment

We wanted to add some logging around one of our repositories to track how many times users were trying to do various things on the application and came across a cool blog post explaining how we might be able to do this.

We ended up with the following code:

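// Bar is assumed to be a simple case class, as the REPL output below suggests
case class Bar(name: String)
class BarRepository {
  def all: Seq[Bar] = Seq()
  def find(barId: String): Bar = Bar("myBar")
}
// barRepository is a val so that the implicit conversion below can reach it
class TrackService(val barRepository: BarRepository) {
  def all: Seq[Bar] = {
    val bars = barRepository.all
    println("tracking all bars")
    bars
  }
}
// any method missing from TrackService is looked up on BarRepository instead
implicit def trackServiceToBarRepository(t: TrackService): BarRepository = t.barRepository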

We can then use it like this:

scala> val service = new TrackService(new BarRepository())
service: TrackService = TrackService@4e5394c
scala> service.all
tracking all bars
res6: Seq[Bar] = List()

If a method doesn’t exist on TrackService then the implicit conversion ensures that the appropriate method will be called on BarRepository directly:

scala> service.find("mark")
res7: Bar = Bar(myBar)

I came across another way to achieve the same results by making use of traits although we’d need to change our design a little bit to achieve this pattern:

trait IProvideBars {
  def all: Seq[Bar]
  def find(barId: String): Bar
}
class BarRepository extends IProvideBars {
  def all: Seq[Bar] = Seq()
  def find(barId: String): Bar = Bar("myBar")
}
// stackable trait: 'abstract override' lets super.all resolve to whatever class it is mixed into
trait Tracking extends IProvideBars {
  abstract override def all: Seq[Bar] = {
    val bars = super.all
    println("tracking all bars")
    bars
  }
}

scala> val b = new BarRepository() with Tracking
b: BarRepository with Tracking = $anon$1@ddc652f
scala> b.all
tracking all bars
res8: Seq[Bar] = List()

Written by Mark Needham

October 25th, 2011 at 9:19 pm

Posted in Scala

Scala: Creating an Xml element with an optional attribute

with one comment

We have a lot of Xml in our application and one of the things that we need to do reasonably frequently in our test code is create elements which have optional attributes on them.

Our simple first approach looked like this:

def createElement(attribute: Option[String]) = if(attribute.isDefined) <p bar={attribute.get} /> else <p />

That works but it always seemed like we should be able to do it in a simpler way.

Our first attempt was this:

def createElement(attribute: Option[String]) = <p bar={attribute} />

But that ends up in a compilation error:

error: overloaded method constructor UnprefixedAttribute with alternatives:
  (key: String,value: Option[Seq[scala.xml.Node]],next: scala.xml.MetaData)scala.xml.UnprefixedAttribute <and>
  (key: String,value: String,next: scala.xml.MetaData)scala.xml.UnprefixedAttribute <and>
  (key: String,value: Seq[scala.xml.Node],next1: scala.xml.MetaData)scala.xml.UnprefixedAttribute
 cannot be applied to (java.lang.String, Option[String], scala.xml.MetaData)
       def createElement1(attribute: Option[String]) = <p bar={attribute} />

We really need to extract the string value from the option if there is one and not do anything if there isn’t one but with the above approach we try to shove an option in as the attribute value. Unfortunately there isn’t an overload of the constructor which lets us do that.

Eventually one of my colleagues suggested we try passing null in as the attribute value if we had a None option:

def createElement(attribute: Option[String]) = <p bar={attribute.getOrElse(null)} />

Which works pretty well:

scala> createElement(Some("mark"))
res0: scala.xml.Elem = <p bar="mark"></p>
scala> createElement(None)
res1: scala.xml.Elem = <p ></p>

Written by Mark Needham

October 25th, 2011 at 8:38 pm

Posted in Scala

Retrospective: The 5 whys

without comments

Last week my colleague Pat Fornasier ran our team’s fortnightly retrospective and one of the exercises we did was ‘the 5 whys’.

I’ve always wanted to see how the 5 whys would pan out but could never see how you could fit it into a normal retrospective.

Pat was able to do this by using the data gathered by an earlier timeline exercise where the team had to plot the main events that had happened over the last 6 months.

We ended up with 5 key areas and split into groups to explore those topics further.

Wikipedia describes the 5 whys like so:

The 5 Whys is a questions-asking method used to explore the cause/effect relationships underlying a particular problem. Ultimately, the goal of applying the 5 Whys method is to determine a root cause of a defect or problem.

My group had to investigate the topic ‘Why are we so obsessed with points?’.

These were some of my observations from the exercise:

  • It’s very easy to lose focus on the exercise and start talking about solutions or ideas when only a couple of whys have been followed.

    Pat suggested that this problem could be solved by having a facilitator who helps keep the discussion on track.

  • We went down a dead-end a few times where our 5th why ended up being something quite broad which we couldn’t do anything about.

    We ended up going back up the chain of whys to see whether we could branch off a different way on any of them, and it was actually reasonably easy to think of other whys the further up you went.

  • By going beyond surface reasons for things you actually end up with much more interesting conversations although I think it does also become a little bit more uncomfortable for people.

    For example we ended up discussing what ‘minimum viable product’ actually means for us and a couple of the group had a much different opinion to the product owner. It would have been interesting if we’d been able to continue the discussion for longer.

  • For our particular topic we ended up discussing why the deadline we have was set when it was and couldn’t really come up with any reason for why it couldn’t be changed other than we’d been told it couldn’t.

    It would have been more interesting to have the people external to the team who set the deadline so that we could understand if there was more to it.

I tried looking for a video to see a real life example of a 5 whys discussion being facilitated but I wasn’t able to find one.

Perryn pointed me to a chat log on the cucumber wiki where Aslak asks the 5 whys to someone trying to articulate why they want to have a login feature in their application but I’d be interested in seeing more examples if anyone knows any.

Written by Mark Needham

October 24th, 2011 at 10:53 pm

Posted in Agile

Learning Unix find: Searching in/Excluding certain folders

with 5 comments

I love playing around with commands on the Unix shell but one of the ones that I’ve found the most difficult to learn beyond the very basics is find.

I think this is partially because I find the find man page quite difficult to read and partially because it’s usually quicker to work out how to solve my problem with a command I already know than to learn another one.

However, I recently came across Greg’s wiki which seems to do a pretty good job of explaining it.

Reasonably frequently I want to get a list of files to scan but want to exclude files in the .git directory since the results tend to become overwhelming with those included:

$ find . ! -path  "*.git*" -type f -print

Here we’re saying find items which don’t have git in their path and which are of type ‘f’ (file) and then print them out.

If we don’t include the -type flag then the results will also include directories which isn’t what we want in this case. The -print is optional in this case since by default what we select will be printed.
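To see the effect of both points on a throwaway tree, here is a small sketch. The directory and file names are made up for illustration:

```shell
#!/bin/sh
# Build a scratch tree containing a fake .git directory.
root=$(mktemp -d)
mkdir -p "$root/.git/objects" "$root/src"
touch "$root/.git/config" "$root/.git/objects/abc" \
      "$root/src/Foo.scala" "$root/README"
cd "$root" || exit 1

# Regular files only, skipping anything whose path contains ".git".
# -print is left off because it is the default action.
files=$(find . ! -path "*.git*" -type f | LC_ALL=C sort)
echo "$files"

# Without -type f the directories "." and "./src" are listed as well.
all=$(find . ! -path "*.git*" | LC_ALL=C sort)
echo "$all"
```

One thing to watch: the `*.git*` pattern also matches names such as `.gitignore`, so those files would be dropped from the listing too.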

Sometimes we want to exclude more than one directory which can be done with the following command:

$ find . \( ! -path "*target*" -a ! -path "*tools*" -a ! -path "*.git*" -print \)

Here we’re excluding the ‘target’, ‘tools’ and ‘git’ directories from the listing of files that we return.

The -a flag stands for ‘and’ so the above command reads ‘find all files/directories which do not have target in their path and do not have tools in their path and do not have .git in their path’.

We can always make that command a bit more specific if any of those words legitimately appear in a path.
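As a rough sketch of what 'more specific' could look like, the pattern below only excludes paths that pass through a directory named exactly ‘target’. The tree and its names are invented for the example:

```shell
#!/bin/sh
# Scratch tree: a build output directory plus a source package whose
# name happens to contain the word "target".
root=$(mktemp -d)
mkdir -p "$root/target/classes" "$root/src/targeting"
touch "$root/target/classes/Foo.class" "$root/src/targeting/Foo.scala"
cd "$root" || exit 1

# Broad pattern: also drops src/targeting because "target" appears in its path.
broad=$(find . ! -path "*target*" -type f | LC_ALL=C sort)

# Narrower pattern: only excludes files underneath a directory called "target".
narrow=$(find . ! -path "*/target/*" -type f | LC_ALL=C sort)

echo "broad:  ${broad:-<nothing>}"
echo "narrow: $narrow"
```

The narrower form relies on `*` in a `-path` pattern matching across `/` characters, which is how find treats it.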

As well as the -print flag there is also a -prune flag which we can use to stop find from descending into a folder.

The first command could therefore be written like this:

$ find . -path "*.git*" -prune -o -type f -print

This reads ‘don’t go any further into a folder which has git in the path but print any other files which don’t have git in their path’.
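A small sketch of how the two halves of that expression behave, again on a made-up tree. It also shows why the explicit -print matters once -prune is involved:

```shell
#!/bin/sh
root=$(mktemp -d)
mkdir -p "$root/.git/objects" "$root/src"
touch "$root/.git/config" "$root/src/Foo.scala"
cd "$root" || exit 1

# Left of -o: if the path matches, prune it (do not descend) and stop there.
# Right of -o: anything else that is a regular file gets printed.
pruned=$(find . -path "*.git*" -prune -o -type f -print | LC_ALL=C sort)
echo "$pruned"

# With no explicit -print, find applies an implicit -print to the whole
# expression, so the pruned directory itself shows up in the output.
implicit=$(find . -path "*.git*" -prune -o -type f | LC_ALL=C sort)
echo "$implicit"
```

The first command lists only `./src/Foo.scala`, whereas the second also lists `./.git`, because -prune evaluates to true for the directory it skips.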

I’m still finding -prune a bit confusing to understand and as the wiki points out:

The most confusing property of -prune is that it is an ACTION, and thus no further filters are processed after it.

To use it, you have to combine it with -o to actually process the non-skipped files, like so:

A couple of months ago I was playing around with our git repository trying to get a list of all the scala files in the ‘src/main’ directory and I went with this command:

$ find . -type f -regex ".*src/main.*\.scala$"

Using the above flags it could instead be written in either of these two ways:

$ find . -path "*src/main*" -type f -iname "*\.scala*"


$ find . -type f -path "*src/main/*\.scala"

Interestingly those latter two versions seem to be a bit slower than the one that uses the -regex flag.

I’m not entirely sure why that is – presumably by supplying two flags on the latter two solutions find has to do more operations per line than it does with the -regex option or something like that?

Written by Mark Needham

October 21st, 2011 at 9:25 pm

Posted in Shell Scripting


Getting stuck and agile software teams

without comments

I came across an interesting set of posts by Jeff Wofford where he talks about programmers getting stuck and it made me think that, despite its faults, agile software development does have some useful practices for stopping us getting stuck for too long.

Many of the examples that Jeff describes sound like yak shaving to me which is part of what makes programming fun but doesn’t always correlate to adding value to the product that you’re building.

Although I wrote about some of the disadvantages of pair programming a while ago it is actually a very useful practice for ensuring that we don’t get stuck.

We’re much less likely to go off down a rabbit hole trying to solve some interesting but unrelated problem if we have to try and convince someone else to come along on that journey.

On most teams that I’ve worked on at least a reasonable percentage of the team is co-located so there’s almost certainly going to be someone sitting nearby who will be able to help.

If that isn’t enough, we tend to have a very visible story wall of what everyone’s working on right next to the work space and it becomes pretty obvious when something has been stuck in one of the columns for a long time.

Another team member is bound to point that out and if they don’t then the standup at the beginning of the day provides a good opportunity to see if anyone else on the team has a way around the problem you’re working on.

It also provides an opportunity to find out whether the problem you’re trying to solve is actually worth solving or not by talking to the product owner/one of the business analysts.

For the types of problems that I work on, more often than not it isn’t vital to solve a lot of the problems that we think we need to, and the product owner would much rather we just parked them and worked on something else that is valuable to them.

Jeff goes on to describe some other more general ways of getting unstuck but the above are some which might not be available to us with a less collaborative approach.

Written by Mark Needham

October 20th, 2011 at 10:09 pm

Posted in Coding

git: Only pushing some changes from local repository

with 4 comments

Something that we want to do reasonably frequently on my current project is to push some changes which have been committed to our local repository to master but not all of them.

For example we might end up with 3 changes we haven’t pushed:

>> ~/github/local$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 3 commits.
nothing to commit (working directory clean)
>> ~/github/local$ git hist
* bb7b139 Thu, 20 Oct 2011 07:37:11 +0100 | mark: one last time (HEAD, master) [Mark Needham]
* 1cef99a Thu, 20 Oct 2011 07:36:35 +0100 | mark:another new line [Mark Needham]
* 850e105 Thu, 20 Oct 2011 07:36:01 +0100 | mark: new line [Mark Needham]
* 2b25622 Thu, 20 Oct 2011 07:32:43 +0100 | mark: adding file for first time (origin/master) [Mark Needham]

And we only want to push the commit with hash 850e105 for example.

The approach which my colleague Uday showed us is to first take a temporary branch of the current state.

>> ~/github/local$ git checkout -b temp-branch
Switched to a new branch 'temp-branch'

Then immediately switch back to master and ‘get rid’ of the last two changes from there:

>> ~/github/local$ git checkout master
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 3 commits.
>> ~/github/local$ git reset HEAD~2 --hard
HEAD is now at 850e105 mark: new line

We can then push just that change:

>> ~/github/local$ git push
Counting objects: 5, done.
Writing objects: 100% (3/3), 257 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
To /Users/mneedham/github/remote
   2b25622..850e105  master -> master

And merge the temporary branch back in again so we’re back where we were before:

>> ~/github/local$ git merge temp-branch
Updating 850e105..bb7b139
 foo.txt |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
>> ~/github/local$ git hist
* bb7b139 Thu, 20 Oct 2011 07:37:11 +0100 | mark: one last time (HEAD, temp-branch, master) [Mark Needham]
* 1cef99a Thu, 20 Oct 2011 07:36:35 +0100 | mark:another new line [Mark Needham]
* 850e105 Thu, 20 Oct 2011 07:36:01 +0100 | mark: new line (origin/master) [Mark Needham]
* 2b25622 Thu, 20 Oct 2011 07:32:43 +0100 | mark: adding file for first time [Mark Needham]
>> ~/github/local$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 2 commits.
nothing to commit (working directory clean)

And finally we delete the temporary branch:

>> ~/github/local$ git branch -d temp-branch
Deleted branch temp-branch (was bb7b139).

We can achieve the same thing without creating the branch and just cherry picking the commits back again after we’ve pushed our changes but this approach seems quicker.
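Another option, which is not the approach described above, is to give git push an explicit refspec of the form `<commit>:<remote branch>`. The sketch below recreates a similar situation in scratch repositories; the paths, the `commit` helper and the user details are all made up for the example:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example"
git config user.email "example@example.com"
git remote add origin "$work/remote"

# Hypothetical helper: append a line to foo.txt and commit it.
commit() { echo "$1" >> foo.txt; git add foo.txt; git commit -q -m "$1"; }

commit "adding file for first time"
git push -q origin master
commit "new line"; first=$(git rev-parse HEAD)
commit "another new line"
commit "one last time"

# Push only up to the first unpushed commit; local master is left untouched.
git push -q origin "$first:refs/heads/master"

remote_head=$(git --git-dir="$work/remote" rev-parse master)
ahead=$(git rev-list --count "$first..master")
echo "remote master is at $remote_head, local master is $ahead commits ahead"
```

Like the reset approach, this pushes the chosen commit together with all of its ancestors, so it only helps when the commits we want to publish are the oldest unpushed ones.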

Written by Mark Needham

October 20th, 2011 at 6:50 am

Posted in Version Control

Unix: Some useful tools

with one comment

On my current project we regularly use a few Unix tools which aren’t on the standard installation so I thought I’d collate them here so I don’t forget about them in the future.

ghex


We suspected we’d ended up with some rogue characters in a file that we weren’t able to detect in our normal text editor recently and wanted to view the byte by byte representation of the file to check it out.

We came across ghex which seems to be a pretty decent tool for allowing us to do this.

sudo port install ghex
ghex2 ourFile.jade

axel


axel is a download accelerator and lets us send multiple partial/range requests to download parts of a file before putting it back together at the end.

We found this quite useful when I was working in India to download files from the US over VPN. scp was painfully slow so we used to set up a simple HTTP server on the US server and then use axel to grab the file.

Some servers don’t support range requests but a reasonable number of them seem to.

sudo port install axel
axel -a

ack


The man page claims the following:

Ack is designed as a replacement for 99% of the uses of grep.

It worked reasonably well for replacing the following grep command:

grep -iR "searchTerm" .

One of the cool things is that by default it doesn’t search in binary files whereas grep does. I have noticed that it sometimes doesn’t pick up search terms in files which grep would match and I’m not entirely sure why.

sudo port install p5-app-ack
ack "something"

I’m sure there are plenty of other cool tools about so if you know of any let me know!

Written by Mark Needham

October 17th, 2011 at 10:58 pm

Posted in Software Development
