Mark Needham

Thoughts on Software Development

Archive for August, 2011

Coding: The value in finding the generic abstraction

without comments

I recently worked on adding the meta data section for each of the different document types that our application serves, which involved showing 15-20 pieces of data for each document type.

There are around 4-5 document types and although the meta data for each document type is similar it’s not exactly the same!

When we got to the second document type it wasn’t obvious where the abstraction was, so we went for the copy/paste approach to see whether putting the two templates side by side would make the commonality easier to see.

We saw some duplication in the way that we were building up each individual piece of meta data but couldn’t see any higher abstraction.

We eventually got through all the document types and hadn’t really found a clean solution to the problem.

I wanted to spend some time playing around with the code to see if I could find one, but Duncan pointed out that it was important to consider that refactoring in the bigger context of the application.

Even if we did find a really nice design it probably wouldn’t give us any benefit since we’ve covered most of the document types and there will probably be just one more that we have to add the meta data section for.

The return on investment for finding a clean generic abstraction won’t be very high in this case.

In another part of our application we need to make it possible for the user to do faceted search, but it hasn’t been decided what the final list of facets to search on will be.

It therefore needs to be very easy to add the ability to search on a new facet and to include details about that facet in all search results.

We spent a couple of days about 5 or 6 weeks ago working out how to model that bit of code so that it would be really easy to add a new facet, since we knew that there would be more coming in future.

When that time eventually came last week it took just 2 or 3 lines of code to get the new facet up and running.
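The shape of that design was roughly along these lines — a simplified sketch with invented names (Facet, Document, Facets), not our actual code. A facet is just a name plus a way of extracting its value, and the list of facets is the only thing that changes when a new one is added:

```scala
// Sketch only -- these names are invented for illustration
case class Document(fields: Map[String, String])

case class Facet(name: String, valueOf: Document => String)

object Facets {
  // adding a new facet to search on is just one more entry in this list
  val all = Seq(
    Facet("author", doc => doc.fields.getOrElse("author", "")),
    Facet("documentType", doc => doc.fields.getOrElse("documentType", ""))
  )

  // every search result can then include details for each registered facet
  def detailsFor(doc: Document): Map[String, String] =
    all.map(facet => facet.name -> facet.valueOf(doc)).toMap
}
```

With something like this in place, "adding a facet" really is just adding one line to the list.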

In this case spending the time to find the generic abstraction had a good return on investment.

I sometimes find it difficult to know exactly which bits of code we should invest a lot of time in because there are always loads of places where improvements can be made.

Analysing whether there’s going to be a future return on investment from cleaning it up/finding the abstraction seems to be a useful thing to do.

Of course the return on investment I’m talking about here relates to the speed at which we can add future functionality.

I guess another return on investment could be reducing the time it takes to understand a piece of code if it’s likely to be read frequently.

Written by Mark Needham

August 31st, 2011 at 6:49 am

Posted in Coding


The read-only database

without comments

The last couple of applications I’ve worked on have had almost completely read-only databases, where we had to populate the database in an offline process and then provide various ways for users to access the data.

This creates an interesting situation with respect to how we should set up our development environment.

Our normal setup would probably have an individual version of that database on every development machine and we would populate and then truncate the database during various test scenarios.

Test data

This actually means that our tests are interacting with the database in a different way than we would see during the running of the application.

It also means that we have more infrastructure to take care of and more software updates to do although using tools like Chef or Puppet can reduce the pain this causes once the initial setup of those scripts has been done.

On the project I worked on last year we started off with the individual database approach but eventually moved to having a shared database used by all the developers.

We only made the move once we had the real production data and the script which would populate that data into our database ready.

Test data shared

The disadvantage of having this shared database is that our tests become more indirect.

We wrote our tests against data which we knew would be in our production data set which meant if anything failed you had a bit more investigation to do since the data setup was done elsewhere.

On the other hand no one had to worry about getting it set up on their machine, which had proved to be tricky to totally automate.

We have a similar situation on the application I’m currently working on and have noticed that adding data to the database in each test leads to problems that don’t usually exist.

For example in one test the database takes a bit of time to sort out its indexes which means that some tests intermittently fail.

We found a bit of a hacky way around this by forcing the database to reindex in the test and waiting until it has done so but we’ve now solved a problem which doesn’t actually exist in production.

This approach wouldn’t work as well if we had a read/write database since we’d end up with tests failing because another developer’s machine had mutated the data they relied on.

With a read only database it seems to be ok though.

Written by Mark Needham

August 29th, 2011 at 11:32 pm

Pain Driven Development

with 6 comments

My colleague Pat Fornasier has been using an interesting spin on the idea of making decisions at the last responsible moment by encouraging our team to ‘feel the pain’ before introducing any constraint in our application.

These are some of the decisions which we’ve been delaying/are still delaying:

Dependency Injection

Everyone in our team comes from a Java/C# background and one of the first technical decisions that gets made on applications in those languages is which dependency injection container to use.

We decided to just create a trait where we wired up the dependencies ourself and then inject that trait into the entry point of our application. Effectively it acts as the ApplicationContext that a framework like Spring would provide.

I was fairly sure that we’d need to introduce a container quite quickly, but it’s been 10 weeks and we still haven’t felt the need to do that; our application is simpler as a result.
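As an illustration, the hand-rolled wiring looks something like this — a sketch with made-up component names, not our actual dependencies:

```scala
// Sketch only -- the component names here are invented for illustration
class DocumentRepository {
  def find(id: String): String = "document " + id
}

class SearchService(repository: DocumentRepository) {
  def search(id: String): String = repository.find(id)
}

// plays the role a Spring ApplicationContext would: all the wiring in one place
trait ComponentRegistry {
  lazy val documentRepository = new DocumentRepository
  lazy val searchService = new SearchService(documentRepository)
}

// the trait is mixed into the entry point of the application
object Application extends ComponentRegistry {
  def main(args: Array[String]) {
    println(searchService.search("42"))
  }
}
```

There's no container to configure, and swapping a dependency for a test double is just a matter of overriding the relevant val in a test registry.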

Data Ingestion

As I mentioned in an earlier post we have to import around 5 million documents into our database by the time the application goes live.

Our initial attempt at writing this code was single threaded and it was clear that there were many places where performance optimisations could be made.

Since we were only ingesting a few thousand documents at that stage it still ran pretty quickly so Pat encouraged us to wait until we felt the pain before making any changes.

That duly happened once the number of documents increased and it started taking 3 or 4 hours to run the job in our QA environment.

We then spent a couple of days working out how to make it possible to process the documents more quickly.

Complex markup in documents

As I mentioned a couple of months ago the application we’re working on is mainly about taking data from a database and applying some transformations on it before showing it to the user.

We decided to incrementally add different types of documents into the database.

This meant that initially all our transformations involved just getting a text representation of XML nodes even though we knew that eventually we’d need to do more processing on the data depending on which tags appeared.
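Those early transformations were roughly of this shape — a minimal sketch using scala.xml, with invented tag names:

```scala
import scala.xml.XML

// Sketch only -- the tag names are invented examples; initially we just
// pulled out the text of a node rather than processing its markup
object SimpleTransforms {
  def titleOf(document: String): String =
    (XML.loadString(document) \ "title").text
}
```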

These data transformations actually turned out to be more complicated than we’d imagined so we might have delayed the pain here a little bit too long.

On the other hand we were able to show early progress to our business stakeholders which probably wouldn’t have been the case if we’d tried to take on the complex markup all at once.


One thing to note with this approach is that we need to make sure there is a feedback mechanism to recognise when we are feeling pain otherwise we’ll end up going beyond the last responsible moment more frequently.

There will probably also be more complaints about things not being done ‘properly’ since we’re waiting longer before we actually do them.

We have a code review that the whole team attends for an hour each week which acts as the feedback mechanism, and we recently started using Fabio’s effort/pain wall to work out which things were causing us most pain.

Written by Mark Needham

August 21st, 2011 at 5:33 pm

node.js: Building a graph of build times using the Go API

with 3 comments

I’ve been playing around with node.js again and one thing that I wanted to do was take a CSV file generated by the Go API and extract the build times so that we could display it on a graph.

Since I don’t have a Go instance on my machine I created a URL in my node application which would mimic the API and return a CSV file.

I’m using the express web framework to take care of some of the plumbing:


var express = require('express');
var fs = require('fs');

var app = express.createServer();

app.get('/fake-go', function(req, res) {
  fs.readFile('go.txt', function(err, data) {
    res.end(data, 'UTF-8');
  });
});

go.txt is just in my home directory and looks like this:


I wanted to create an end point which I could call and get back a JSON representation of all the different builds.

var http = require('http');

app.get('/go/show', function(req, res) {
  var site = http.createClient(3000, "localhost");
  var request = site.request("GET", "/fake-go", {'host' : "localhost"});

  request.on('response', function(response) {
    var data = "";
    response.on('data', function(chunk) {
      data += chunk;
    });
    response.on('end', function() {
      var lines = data.split("\n"), buildTimes = [];
      lines.forEach(function(line, index) {
        var columns = line.split(",");
        // skip the header row and any builds missing a start/end time
        if(index != 0 && nonEmpty(columns[9]) && nonEmpty(columns[11]) && columns[3] == "Passed") {
          buildTimes.push({ start : columns[9], end : columns[11] });
        }
      });
      res.send(JSON.stringify(buildTimes));
    });
  });
  request.end();
});

function nonEmpty(column) {
  return column !== "" && column !== undefined;
}

I should probably use underscore.js for some of that code but I didn’t want to shave that yak just yet!

I have a default route setup so that I can just go to localhost:3000 and see the graphs:

app.get('/', function(req, res) {
  res.render('index.jade', { title: 'Dashboard' });
});

On the client side we can then create a graph using the RGraph API:


h2(align="center") Project Dashboard
script
  function drawGoGraph(buildTimes) {
    var go = new RGraph.Line('go', _(buildTimes).map(function(buildTime) { return (new Date(buildTime.end) - new Date(buildTime.start)) / 1000 }).filter(function(diff) { return diff > 0; }));
    go.Set('chart.title', 'Build Times');
    go.Set('chart.gutter.bottom', 125);
    go.Set('chart.gutter.left', 50);
    go.Set('chart.text.angle', 90);
    go.Set('chart.shadow', true);
    go.Set('chart.linewidth', 1);
    go.Draw();
  }

  $(document).ready(function() {
    $.getJSON('/go/show', function(data) {
      drawGoGraph(data);
    });
  });
canvas(id="go", width="500", height="400")
  | [Please wait...]

We just do some simple subtraction between the start and end build times and then filter out any results which have an end time before the start time. I’m not entirely sure why we end up with entries like that but having those in the graph totally ruins it!

We include all the .js files in the layout.jade file.


!!! 5
html
  head
    title Project Dashboard
    script(src="jquery-1.6.2.min.js")
  body!= body

Et voila:

Build graph

Written by Mark Needham

August 13th, 2011 at 2:52 pm

Posted in Javascript


Scala: Do modifiers on functions really matter?

with 8 comments

A couple of colleagues and I were having an interesting discussion this afternoon about the visibility of functions which are mixed into an object from a trait.

The trait in question looks like this:

trait Formatting {
  def formatBytes(bytes: Long): Long = {
    math.round(bytes.toDouble / 1024)
  }
}

And is mixed into various objects which need to display the size of a file in kB like this:

class SomeObject extends Formatting {
  // ...
}

By mixing that function into SomeObject any of the clients of SomeObject would now be able to call that function and transform a bytes value of their own!

The public API of SomeObject is now cluttered with this extra method although it can’t actually do any damage to the state of SomeObject because it’s a pure function whose output depends only on the input given to it.

There are a couple of ways I can think of to solve the modifier ‘problem’:

  • Make formatBytes a private method on SomeObject
  • Put formatBytes on a singleton object and call it from SomeObject
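The second option might look something like this — a sketch where describe is an invented example caller:

```scala
// the pure function lives on a singleton object instead of being mixed in
object Formatting {
  def formatBytes(bytes: Long): Long =
    math.round(bytes.toDouble / 1024)
}

// callers depend on the singleton directly, so formatBytes no longer
// appears on their public API (describe is an invented example method)
class SomeObject {
  def describe(sizeInBytes: Long): String =
    Formatting.formatBytes(sizeInBytes) + " kB"
}
```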

The problem with the first approach is that it means we have to test the formatBytes function within the context of SomeObject, which makes our test much more difficult than if we could test it on its own.

It also makes the discoverability of that function more difficult for someone else who has the same problem to solve elsewhere.

With the second approach we’ll have a dependency on that singleton object in our object which we wouldn’t be able to replace in a test context even if we wanted to.

While thinking about this afterwards I realised that it was quite similar to something that I used to notice when I was learning F# – the modifiers on functions don’t seem to matter if the data they operate on is immutable.

I often used to go back over bits of code I’d written and make all the helper functions private before realising that it made more sense to keep them public but group them with similar functions in a module.

I’m moving towards the opinion that if the data is immutable then it doesn’t actually matter that much who it’s accessible to because they can’t change the original version of that data.

private only seems to make sense if it’s a function mutating a specific bit of data in an object, but I’d be interested in hearing where else my opinion doesn’t make sense.

Written by Mark Needham

August 13th, 2011 at 2:10 am

Posted in Scala


Scala, WebDriver and the Page Object Pattern

with 6 comments

We’re using WebDriver on my project to automate our functional tests and as a result are using the Page Object pattern to encapsulate each page of the application in our tests.

We’ve been trying to work out how to effectively reuse code since some of the pages have parts of them which work exactly the same as another page.

For example we had a test similar to this…

class FooPageTests extends Spec with ShouldMatchers with FooPageSteps {
  it("is my dummy test") {
    // ...
  }
}

…where FooPageSteps extends CommonSteps which contains the common assertions:

trait FooPageSteps extends CommonSteps {
  override val page = new FooPage(driver)
}

trait CommonSteps {
  val page: FooPage
  val driver: HtmlUnitDriver

  def iShouldNotSeeAnyCommonLinks() {
    page.allCommonLinks.isEmpty should equal(true)
  }
}

FooPage looks like this:

class FooPage(override val driver: WebDriver) extends Page(driver) with CommonSection

abstract class Page(val driver: WebDriver) {
  def title(): String = driver.getTitle
}

trait CommonSection {
  val driver: WebDriver

  // relies on the implicit conversion of the Java list returned by findElements
  import scala.collection.JavaConversions._
  def allCommonLinks: Seq[String] = driver.findElements(By.cssSelector(".common-links li")).map(_.getText)
}

We wanted to reuse CommonSteps for another page like so:

trait BarPageSteps extends CommonSteps {
  override val page = new BarPage(driver)
}

class BarPage(override val driver: WebDriver) extends Page(driver) with CommonSection

But that means that we need to change the type of page in CommonSteps to make it a bit more generic so it will work for BarPageSteps too.

Making it of type Page is not enough since we still need to be able to call allCommonLinks, which is mixed into FooPage by CommonSection.

We therefore end up with the following:

trait CommonSteps {
  val page: Page with CommonSection
  val driver: HtmlUnitDriver

  def iShouldNotSeeAnyCommonLinks() {
    page.allCommonLinks.isEmpty should equal(true)
  }
}

We’re able to mix in CommonSection just for this instance of Page which works pretty well for allowing us to achieve code reuse in this case!

Written by Mark Needham

August 9th, 2011 at 12:54 am

Posted in Scala
