Mark Needham

Thoughts on Software Development

Archive for the ‘Software Development’ tag

hdiutil: could not access / create failed – Operation canceled

Earlier in the year I wrote a blog post showing how to build a Mac OS X DMG file for a Java application. I recently revisited the script to update it for a new version and ran into a frustrating error message.

I tried to run the following command to create a new DMG file from a source folder…

$ hdiutil create -volname "DemoBench" -size 100m -srcfolder dmg/ -ov -format UDZO pack.temp.dmg

…but was met with the following error message:

...could not access /Volumes/DemoBench/ - Operation canceled
hdiutil: create failed - Operation canceled

I was initially a bit stumped and thought maybe the flags to hdiutil had changed but a quick look at the man page suggested that wasn’t the issue.

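I decided to go back to my pre-command-line approach for creating a DMG, Disk Utility, and see if I could create it that way. This helped reveal the actual problem, which was that the 100 MB volume was too small for the contents of the source folder:

[Screenshot: the error shown by Disk Utility]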

I increased the volume size to 150 MB…

$ hdiutil create -volname "DemoBench" -size 150m -srcfolder dmg/ -ov -format UDZO pack.temp.dmg

and all was well:

created: /Users/markneedham/projects/neo-technology/quality-tasks/park-bench/database-agent-desktop/target/pack.temp.dmg
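
One way to avoid guessing the size next time would be to derive the -size value from the source folder itself. The following is only a minimal Java sketch and not part of my build script: the class name DmgSize, its method names and the 50% headroom factor are all assumptions made for illustration, and the real space needed also depends on file system overhead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DmgSize {
    // Sums the sizes of all regular files below the given folder, in bytes.
    public static long folderBytes(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // Turns a byte count into a value for hdiutil's -size flag, e.g. "150m".
    // headroom is an assumed safety factor (1.5 means 50% extra space).
    public static String sizeFlag(long bytes, double headroom) {
        long megabytes = (long) Math.ceil(bytes * headroom / (1024.0 * 1024.0));
        return Math.max(megabytes, 1) + "m";
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the dmg folder used in the command above.
        Path source = Paths.get(args.length > 0 ? args[0] : "dmg");
        System.out.println(sizeFlag(folderBytes(source), 1.5));
    }
}
```

The printed value could then be passed to hdiutil in place of the hard coded size.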

And this post will serve as documentation to stop it catching me out next time!

Written by Mark Needham

October 31st, 2014 at 9:45 am

Data Modelling: The Thin Model

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users’.

LDS stands for ‘Logical Data Structure’, which is a diagram depicting what kinds of data some person or group wants to remember. In other words, it’s a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the metadata or attributes almost as an afterthought.

[Image]

The nice thing about the graph model is that it encourages an iterative approach so I was quickly able to bring the model to life and the domain experts back onside.

We can see a simple example of adding flesh to a model with a subset of the movies graph.

We might start out with the model on the right hand side which just describes the structure of the graph but doesn’t give us very much information about the entities.

I tend to sketch out the structure of all the data before adding any attributes but I think some people find it easier to follow if you add at least some flesh before moving on to the next part of the model.

In our next iteration of the movie graph we can add attributes to the actor and movie:

[Image: movie graph with attributes added to the actor and movie]

We can then go on to evolve the model further, but the lesson for me is to value the attributes more: it’s not all about the structure.

Written by Mark Needham

October 27th, 2014 at 6:55 am

Neo4j: LOAD CSV – The sneaky null character

I spent some time earlier in the week trying to import a CSV file extracted from Hadoop into Neo4j using Cypher’s LOAD CSV command and initially struggled due to some rogue characters.

The CSV file looked like this:

$ cat foo.csv

I wrote the following LOAD CSV query to extract some of the fields and compare others:

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2"
==> +--------------------------------------+
==> | line.foo | line.bar | line.bar = "2" |
==> +--------------------------------------+
==> | <null>   | "2"     | false          |
==> +--------------------------------------+
==> 1 row

I had expected to see a “1” in the first column and a ‘true’ in the third column, neither of which happened.

I initially didn’t have a text editor with a hex mode available so I tried checking the length of the entry in the ‘bar’ field:

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | <null>   | "2"     | false          | 2                |
==> +---------------------------------------------------------+
==> 1 row

The length of that value is 2 when we’d expect it to be 1 given it’s a single character.

I tried trimming the field to see if that made any difference…

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, trim(line.bar), trim(line.bar) = "2", length(line.bar)
==> +---------------------------------------------------------------------+
==> | line.foo | trim(line.bar) | trim(line.bar) = "2" | length(line.bar) |
==> +---------------------------------------------------------------------+
==> | <null>   | "2"            | true                 | 2                |
==> +---------------------------------------------------------------------+
==> 1 row

…and it did! I thought there was probably a trailing whitespace character after the “2” which trim had removed and that the ‘foo’ column in the header row had the same issue.

I was able to see that this was the case by extracting the JSON dump of the query via the Neo4j browser:


It turns out there were null characters scattered around the file so I needed to pre-process the file to get rid of them:

$ tr < foo.csv -d '\000' > bar.csv
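
The same check and clean up can be sketched in Java, which also hints at why trimming appeared to fix things: Java’s String.trim removes every leading and trailing character up to and including U+0020, and that range includes the null character. Whether Cypher’s trim relies on that behaviour is my assumption. The class and method names below are hypothetical.

```java
public class NullCharCheck {
    // Counts the null (U+0000) characters hiding in a piece of text.
    public static long countNulls(String text) {
        return text.chars().filter(c -> c == 0).count();
    }

    // Removes every null character, like tr -d '\000' does for the whole file.
    public static String stripNulls(String text) {
        return text.replace("\u0000", "");
    }

    public static void main(String[] args) {
        String value = "2\u0000"; // looks like "2" but has a trailing null

        System.out.println(value.length());            // length including the null
        System.out.println(countNulls(value));         // how many nulls there are
        System.out.println(value.trim().equals("2"));  // trim also drops the null
        System.out.println(stripNulls(value).length());
    }
}
```

Counting the nulls first is a cheap way to inspect a file before deciding whether it needs cleaning at all.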

Now if we process bar.csv it’s a much smoother process:

load csv with headers from "file:/Users/markneedham/Downloads/bar.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | "1"      | "2"      | true           | 1                |
==> +---------------------------------------------------------+
==> 1 row

Note to self: don’t expect data to be clean, inspect it first!

Written by Mark Needham

October 18th, 2014 at 10:49 am

PostgreSQL: ERROR: column does not exist

I’ve been playing around with PostgreSQL recently and in particular the Northwind dataset typically used as an introductory data set for relational databases.

Having imported the data I wanted to take a quick look at the employees table:

postgres=# SELECT * FROM employees LIMIT 1;
 EmployeeID | LastName | FirstName |        Title         | TitleOfCourtesy | BirthDate  |  HireDate  |           Address           |  City   | Region | PostalCode | Country |   HomePhone    | Extension | Photo |                                                                                      Notes                                                                                      | ReportsTo |              PhotoPath               
          1 | Davolio  | Nancy     | Sales Representative | Ms.             | 1948-12-08 | 1992-05-01 | 507 - 20th Ave. E.\nApt. 2A | Seattle | WA     | 98122      | USA     | (206) 555-9857 | 5467      | \x    | Education includes a BA in psychology from Colorado State University in 1970.  She also completed "The Art of the Cold Call."  Nancy is a member of Toastmasters International. |         2 | http://accweb/emmployees/davolio.bmp
(1 row)

That works fine but what if I only want to return the ‘EmployeeID’ field?

postgres=# SELECT EmployeeID FROM employees LIMIT 1;
ERROR:  column "employeeid" does not exist
LINE 1: SELECT EmployeeID FROM employees LIMIT 1;

I hadn’t realised (or had forgotten) that PostgreSQL folds unquoted identifiers to lower case, so we need to double quote the name if it’s been stored in mixed case:

postgres=# SELECT "EmployeeID" FROM employees LIMIT 1;
(1 row)

From my reading the suggestion seems to be to have your field names lower cased to avoid this problem but since it’s just a dummy data set I guess I’ll just put up with the quoting overhead for now.
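
To make the rule concrete, here is a small Java sketch that mimics how a name typed in a query gets resolved. It is an illustration only and not PostgreSQL’s actual code; the class and method names are hypothetical.

```java
import java.util.Locale;

public class IdentifierFolding {
    // Mimics the lookup name for an identifier as typed in a query:
    // double quoted names keep their case, unquoted names are lower cased.
    public static String resolve(String identifier) {
        boolean quoted = identifier.length() >= 2
                && identifier.startsWith("\"")
                && identifier.endsWith("\"");
        if (quoted) {
            // Drop the surrounding quotes and un-double any embedded ones.
            return identifier.substring(1, identifier.length() - 1)
                             .replace("\"\"", "\"");
        }
        return identifier.toLowerCase(Locale.ROOT);
    }

    // Wraps a name in double quotes so that its case survives.
    public static String quote(String name) {
        return "\"" + name.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(resolve("EmployeeID"));        // unquoted form
        System.out.println(resolve(quote("EmployeeID"))); // quoted form
    }
}
```

The unquoted form resolves to a name that doesn’t match the mixed case column, which is what the error message above is telling us.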

Written by Mark Needham

September 29th, 2014 at 10:40 pm

4 types of user

I’ve been working with Neo4j full time for slightly more than a year now and from interacting with the community I’ve noticed that while using different features of the product people fall into 4 categories.

These are as follows:


On one axis we have ‘loudness’, i.e. how vocal somebody is on Twitter, Stack Overflow or by email, and on the other we have ‘success’, which is how well a product feature is working for them.

The people in the top half of the diagram will get the most attention because they’re the most visible.

Of those people we’ll tend to spend more time on the people who are unhappy and vocal to try and help them solve the problems they’re having.

When working with the people in the top left it’s difficult to understand how representative they are of the whole user base.

It could be the case that they aren’t representative at all and actually there is a quiet majority who the product is working for and are just getting on with it with no fuss.

However, it could equally be the case that they are absolutely representative and there are a lot of users quietly suffering / giving up using the product.

I haven’t come up with a good way to come across the less vocal users but in my experience they’ll often be passive users of the user group or Stack Overflow i.e. they’ll read existing issues but not post anything themselves.

Given this uncertainty I think it makes sense to assume that the silent majority suffer the same problems as the more vocal minority.

Another interesting thing I’ve noticed about this quadrant is that the people in the top right are often the best people in the community to help those who are struggling.

It’d be interesting to know whether anyone has noticed a similar thing with the products they’ve worked on and, if so, what approach they take to unveiling the silent majority.

Written by Mark Needham

July 29th, 2014 at 7:07 pm

install4j and AppleScript: Creating a Mac OS X Application Bundle for a Java application

We have a few internal applications at Neo which can be launched using ‘java -jar’ and I always forget where the jars are, so I thought I’d wrap a Mac OS X application bundle around one of them to make life easier.

My favourite installation pattern is the one where when you double click the dmg it shows you a window where you can drag the application into the ‘Applications’ folder, like this:

[Screenshot: DMG window with the application next to the ‘Applications’ folder]

I’m not a fan of the installation wizards and the installation process here is so simple that a wizard seems overkill.

I started out learning about the structure of an application bundle, which is well described in Apple’s Bundle Programming Guide. I then worked my way through a video which walks you through bundling a JAR file in a Mac application.

I figured that bundling a JAR was probably a solved problem and had a look at App Bundler, JAR Bundler and Iceberg before settling on install4j, which we used for Neo4j desktop.

I started out by creating an installer using install4j and then manually copying the launcher it created into an application bundle template, but it was incredibly fiddly and I ended up with a variety of indecipherable messages in the system error log.

Eventually I realised that I didn’t need to create an installer and that what I actually wanted was a Mac OS X single bundle archive media file.

After I’d got install4j creating that for me I just needed to figure out how to create the background image telling the user to drag the application into their ‘Applications’ folder.

Luckily I came across this Stack Overflow post which provided some AppleScript to do just that and with a bit of tweaking I ended up with the following shell script which seems to do the job:

# Assumes title, applicationName, backgroundPictureName and finalDMGName are already set.
rm -f target/DBench_macos_1_0_0.tgz
/Applications/install4j\ 5/bin/install4jc TestBench.install4j
rm -rf target/dmg && mkdir -p target/dmg
tar -C target/dmg -xvf target/DBench_macos_1_0_0.tgz
cp -r src/packaging/.background target/dmg
ln -s /Applications target/dmg
cd target
rm -f "${finalDMGName}"
umount -f "/Volumes/${title}"
hdiutil create -volname "${title}" -size 100m -srcfolder dmg/ -ov -format UDRW pack.temp.dmg
device=$(hdiutil attach -readwrite -noverify -noautoopen "pack.temp.dmg" | egrep '^/dev/' | sed 1q | awk '{print $1}')
sleep 5
echo '
   tell application "Finder"
     tell disk "'${title}'"
           set current view of container window to icon view
           set toolbar visible of container window to false
           set statusbar visible of container window to false
           set the bounds of container window to {400, 100, 885, 430}
           set theViewOptions to the icon view options of container window
           set arrangement of theViewOptions to not arranged
           set icon size of theViewOptions to 72
           set background picture of theViewOptions to file ".background:'${backgroundPictureName}'"
           set position of item "'${applicationName}'" of container window to {100, 100}
           set position of item "Applications" of container window to {375, 100}
           update without registering applications
           delay 5
     end tell
   end tell
' | osascript
hdiutil detach ${device}
hdiutil convert "pack.temp.dmg" -format UDZO -imagekey zlib-level=9 -o "${finalDMGName}"
rm -f pack.temp.dmg
cd ..

To summarise, this script creates a symlink to ‘Applications’, puts a background image in a directory titled ‘.background’, sets that as the background of the window and positions the symlink and application appropriately.

Et voilà:

[Screenshot: the finished DMG window]

The Firefox guys wrote a couple of blog posts detailing their experiences writing an installer which were quite an interesting read as well.

Written by Mark Needham

April 7th, 2014 at 12:04 am

Automating Skype’s ‘This message has been removed’

One of the stranger features of Skype is that it allows you to delete the contents of a message that you’ve already sent to someone – something I haven’t seen on any other messaging system I’ve used.

For example if I wrote a message in Skype and wanted to edit it I would press the ‘up’ arrow:

[Screenshot: editing a sent message in Skype]

Once I’ve deleted the message I’d see this in the space where the message used to be:

[Screenshot: ‘This message has been removed’]

I almost certainly am too obsessed with this but I find it quite amusing when I see people posting and retracting messages so I wanted to see if it could be automated.

Alistair showed me Automator, a built-in tool on the Mac for automating workflows.

Automator allows you to execute AppleScript so we wrote the following code which selects the current chat in Skype, writes a message and then deletes it one character at a time:

on run {input, parameters}
	tell application "Skype"
		activate -- bring Skype to the front so that it receives the keystrokes
	end tell
	tell application "System Events"
		set message to "now you see me, now you don't"
		keystroke message
		keystroke return
		keystroke (ASCII character 30) --up arrow
		repeat length of message times
			keystroke (ASCII character 8) --backspace
		end repeat
		keystroke return
	end tell
	return input
end run

We wired up the AppleScript via the Utilities > Run AppleScript menu option in Automator:

[Screenshot: the Run AppleScript action in Automator]

We can then go further and wire that up to a keyboard shortcut if we want by saving the workflow as a service in Automator but for my messing around purposes clicking the ‘Run’ button from Automator didn’t seem too much of a hardship!

Written by Mark Needham

February 20th, 2014 at 11:16 pm

Learning about bitmaps

A few weeks ago Alistair and I were working on the code used to model the labels that a node has attached to it in a Neo4j database.

The way this works is that chunks of 32 node ids are represented as a 32-bit bitmap for each label, where a 1 for a bit means that a node has the label and a 0 means that it doesn’t.

For example, let’s say we have node ids 0-31 where 0 is the highest bit and 31 is the lowest bit. If only node 0 has the label then that’d be represented as the following value:

java> int bitmap = 1 << 31;
int bitmap = -2147483648

If we imagine the 32 bits positioned next to each other it would look like this:

[Image: the 32 bits with only the highest bit set]
java> 0X80000000;
Integer res16 = -2147483648

The next thing we want to do is work out whether a node has a label applied or not. We can do this by using a bitwise AND.

For example to check whether the highest bit is set we would write the following code:

java> bitmap & (1 << 31);
Integer res10 = -2147483648

That is set as we would imagine. Now let’s check a few bits that we know aren’t set:

java> bitmap & (1 << 0);
Integer res11 = 0
java> bitmap & (1 << 1);
Integer res12 = 0
java> bitmap & (1 << 30);
Integer res13 = 0

Another operation we might want to do is set another bit on our existing bitmap for which we can use a bitwise inclusive OR.

A bitwise inclusive OR means that a bit will be set if either value has the bit set or if both have it set.

Let’s set the second highest bit and visualise that calculation:

[Image: bitwise OR calculation]

If we evaluate that we’d expect the two highest bits to be set:

java> bitmap |= (1 << 30);
Integer res14 = -1073741824

Now if we visualise the bitmap we’ll see that is indeed the case:

[Image: the 32 bits with the two highest bits set]
java> 0XC0000000;
Integer res15 = -1073741824

The next operation we want to do is to unset a bit that we’ve already set, for which we can use a bitwise exclusive OR.

An exclusive OR means that a bit will only be set in the result if there’s a combination of (0 and 1) or (1 and 0) in the calculation. If there are two 1s or two 0s then it’ll be unset. This means that XOR flips the bit, so it only unsets a bit that is currently set.

Let’s unset the 2nd highest bit so that we’re left with just the top bit being set.

If we visualise that we have the following calculation:

[Image: bitwise XOR calculation]

And if we evaluate that we’re back to our original bitmap:

java> bitmap ^= (1 << 30);
Integer res2 = -2147483648
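
The operations above can be gathered into a few helper methods. This is a minimal sketch rather than the Neo4j implementation: the class and method names are made up, and it assumes the mapping described earlier where node 0 is the highest bit and node 31 the lowest. It also includes an AND with the inverted mask, which clears a bit whether or not it was set, whereas XOR would set a bit that was previously unset.

```java
public class LabelBitmap {
    // Node 0 maps to the highest bit (31) and node 31 to the lowest bit (0).
    public static int mask(int node) {
        return 1 << (31 - node);
    }

    // Bitwise AND: non-zero only if the node's bit is set.
    public static boolean hasLabel(int bitmap, int node) {
        return (bitmap & mask(node)) != 0;
    }

    // Bitwise inclusive OR: sets the node's bit, leaving the others alone.
    public static int set(int bitmap, int node) {
        return bitmap | mask(node);
    }

    // Bitwise exclusive OR: flips the node's bit.
    public static int toggle(int bitmap, int node) {
        return bitmap ^ mask(node);
    }

    // AND with the inverted mask: clears the bit regardless of its state.
    public static int clear(int bitmap, int node) {
        return bitmap & ~mask(node);
    }

    public static void main(String[] args) {
        int bitmap = set(0, 0);       // only node 0 has the label
        bitmap = set(bitmap, 1);      // now nodes 0 and 1 have it
        System.out.println(Integer.toBinaryString(bitmap));
        bitmap = toggle(bitmap, 1);   // back to just node 0
        System.out.println(hasLabel(bitmap, 0) + " " + hasLabel(bitmap, 1));
    }
}
```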

I used the Java REPL to evaluate the code samples in this post and this article explains bitshift operators very clearly.

The Neo4j version of the bitmap described in this post is in the BitmapFormat class on github.

Written by Mark Needham

January 12th, 2014 at 5:44 pm

Supporting production code: Start with the simple things

A few months ago I wrote about my experiences supporting production code while working at uSwitch.

Since then I’ve been working on support for Neo4j customers and I’ve realised that there are a couple of other things to keep in mind while debugging production problems that I missed from the initial list.

Keep a clear head / Hold back your assumptions

The first is that it’s very helpful to completely clear your head of any assumptions when looking at a problem.

I’ve got into the habit of pattern matching different error messages that I come across with root causes and while that’s sometimes useful, often there are subtle differences which mean the root cause is different.

Although I still sometimes fall into the assumptions trap I’ve found that it helps to ask exactly what someone is trying to do rather than immediately trying to solve the problem.

Look for the simple things

Along with making assumptions, another mistake I make is to imagine the most complicated version of events that could lead to a problem manifesting.

Sometimes this is the case but more frequently a configuration setting may have been misunderstood or a query poorly designed and the problem can be resolved more easily.

To stop myself making this mistake I have a rough flow chart in my head working down from simpler causes to more complicated ones for different problem areas.

As I said, I still do make assumptions and look for complicated reasons for problems but by keeping these two things in mind I think/hope I’m doing it less often than I used to!

Written by Mark Needham

December 20th, 2013 at 6:07 pm

Glicko Rating System: A simple example using Clojure

A couple of weeks ago I wrote about the Elo Rating system and when reading more about it I learnt that one of its weaknesses is that it doesn’t take into account the reliability of a player’s rating.

For example, a player may not have played for a long time. When they next play a match we shouldn’t assume that the accuracy of that rating is the same as for another player with the same rating but who plays regularly.

Mark Glickman wrote the Glicko Rating System to take the uncertainty into account by introducing a ‘ratings deviation’ (RD). A low RD indicates that a player competes frequently and a higher RD indicates that they don’t.

One other difference between Glicko and Elo is the following:

It is interesting to note that, in the Glicko system, rating changes are not balanced as they usually are in the Elo system.

If one player’s rating increases by x, the opponent’s rating does not usually decrease by x as in the Elo system.

In fact, in the Glicko system, the amount by which the opponent’s rating decreases is governed by both players’ RD’s.

The RD value effectively tells us the range in which the player’s actual rating probably lies, i.e. an approximate 95% confidence interval.

e.g. if a player has a rating of 1850 and an RD of 50 then the interval is 1750 – 1950, or (Rating – 2*RD, Rating + 2*RD).

The algorithm has 2 steps:

  1. Determine a rating and RD for each player at the onset of the rating period. If the player is unrated use a value of 1500 and RD of 350. If they do have a rating we’ll calculate the new RD from the old RD using this formula:
    [Image: formula for the new RD]

    • t is the number of rating periods since last competition (e.g., if the player
      competed in the most recent rating period, t = 1)
    • c is a constant that governs the increase in uncertainty over time.
  2. Update each player’s rating and RD separately using the following formula:


    • r is the player’s pre-period rating
    • RD is the player’s pre-period ratings deviation
    • r1, r2,…,rm are the pre-period ratings of their opponents
    • RD1, RD2,…,RDm are the pre-period ratings deviations of their opponents
    • s1, s2,…,sm are the scores against the opponents. 1 is a win, 1/2 is a draw, 0 is a defeat.
    • r’ is the player’s post-period rating
    • RD’ is the player’s post-period ratings deviation
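
For reference, these are the formulas behind the two steps, written out in LaTeX as I understand them from Glickman’s paper. RD_old is my own label for the RD carried over from the previous rating period, and the sums run over the m opponents.

```latex
% Step 1: RD at the onset of the rating period
RD = \min\left(\sqrt{RD_{old}^2 + c^2 t},\ 350\right)

% Step 2: post-period rating and ratings deviation
r' = r + \frac{q}{\frac{1}{RD^2} + \frac{1}{d^2}}
         \sum_{j=1}^{m} g(RD_j)\,\bigl(s_j - E(s \mid r, r_j, RD_j)\bigr)

RD' = \sqrt{\left(\frac{1}{RD^2} + \frac{1}{d^2}\right)^{-1}}

% Supporting definitions
q = \frac{\ln 10}{400}

g(RD) = \frac{1}{\sqrt{1 + 3 q^2 RD^2 / \pi^2}}

E(s \mid r, r_j, RD_j) = \frac{1}{1 + 10^{-g(RD_j)(r - r_j)/400}}

d^2 = \left(q^2 \sum_{j=1}^{m} g(RD_j)^2\,
      E(s \mid r, r_j, RD_j)\,\bigl(1 - E(s \mid r, r_j, RD_j)\bigr)\right)^{-1}
```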

The paper provides an example to follow and includes the intermediate workings which made it easier to build the algorithm one function at a time.

The q function was the simplest to implement so I created that and the g function at the same time:

(ns ranking-algorithms.glicko
  (:require [clojure.math.numeric-tower :as math]))
(def q
  (/ (java.lang.Math/log 10) 400))
(defn g [rd]
  (/ 1
     (java.lang.Math/sqrt (+ 1
                             (/ (* 3 (math/expt q 2) (math/expt rd 2))
                                (math/expt Math/PI 2))))))

We can use the following table to check we get the right results when we call it:

[Image: table of values from the paper’s worked example]

> (g 30)
> (g 100)
> (g 300)

The next easiest function to write was the E function:

(defn e [rating opponent-rating opponent-rd]
  (/ 1
     (+ 1
        (math/expt 10 (/ (* (- (g opponent-rd))
                            (- rating opponent-rating))
                         400)))))

And if we test that assuming that we have a rating of 1500 with a RD of 200:

> (e 1500 1400 30)
> (e 1500 1550 100)
> (e 1500 1700 300)

Finally we need to write the d2 supporting function:

;; defined first because d2 refers to it
(defn process-opponent [total opponent]
  (let [{:keys [g e]} opponent]
    (+ total (* (math/expt g 2) e (- 1 e)))))
(defn d2 [opponents]
  (/ 1  (* (math/expt q 2)
           (reduce process-opponent 0 opponents))))

In this function we need to sum a combination of the g and e values we calculated earlier for each opponent so we can use a reduce over a collection of those values for each opponent to do that:

> (d2 [{:g (g 30) :e (e 1500 1400 30)} 
       {:g (g 100) :e (e 1500 1550 100)} 
       {:g (g 300) :e (e 1500 1700 300)}])

I get a slightly different value for this function which I think is because I didn’t round the intermediate values to 2 decimal places as the example does.

Now we can introduce the r’ function which returns our ranking after taking the matches against these opponents into account:

(defn update-ranking [ranking-delta opponent]
  (let [{:keys [ranking opponent-ranking opponent-ranking-rd score]} opponent]
    (+ ranking-delta
       (* (g opponent-ranking-rd)
          (- score (e ranking opponent-ranking opponent-ranking-rd))))))
(defn g-and-e
  [ranking {o-rd :opponent-ranking-rd o-ranking :opponent-ranking}]
  {:g (g o-rd) :e (e ranking o-ranking o-rd)})
(defn ranking-after-round
  [{ ranking :ranking rd :ranking-rd opponents :opponents}]  
  (+ ranking
     (* (/ q
           (+ (/ 1 (math/expt rd 2))
              (/ 1 (d2 (map (partial g-and-e ranking) opponents)))))
        (reduce update-ranking 0 (map #(assoc-in % [:ranking] ranking) opponents)))))

One thing I wasn’t sure about here was the use of partial which is a bit of a Haskell idiom. I’m not sure what the favoured approach is in Clojure land yet.

If we execute that function we get the expected result:

> (ranking-after-round { :ranking 1500 
                         :ranking-rd 200 
                         :opponents[{:opponent-ranking 1400 :opponent-ranking-rd 30 :score 1} 
                                    {:opponent-ranking 1550 :opponent-ranking-rd 100 :score 0} 
                                    {:opponent-ranking 1700 :opponent-ranking-rd 300 :score 0}]})

The only function missing now is RD’ which returns our RD after taking these matches into account:

(defn rd-after-round
  [{ ranking :ranking rd :ranking-rd opponents :opponents}]
  (java.lang.Math/sqrt (/ 1 (+ (/ 1 (math/expt rd 2))
                               (/ 1 (d2 (map (partial g-and-e ranking) opponents)))))))

If we execute that function we get the expected result and we’re done!

> (rd-after-round { :ranking 1500 
                    :ranking-rd 200 
                    :opponents[{:opponent-ranking 1400 :opponent-ranking-rd 30 :score 1} 
                               {:opponent-ranking 1550 :opponent-ranking-rd 100 :score 0} 
                               {:opponent-ranking 1700 :opponent-ranking-rd 300 :score 0}]})
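
As a cross-check it can help to have the same calculation in a second language. Below is a minimal Java sketch of the functions above. It isn’t a translation of the code on github: the class name, method names and the use of parallel arrays for the opponents are my own choices for illustration. The worked example in Glickman’s paper reports a new rating of 1464 and a new RD of 151.4 for this player, so the output should land close to those figures, allowing for the rounding of intermediate values in the paper.

```java
public class GlickoSketch {
    // q = ln(10) / 400
    public static final double Q = Math.log(10) / 400.0;

    // g(RD): scales a result down as the opponent's RD goes up.
    public static double g(double rd) {
        return 1.0 / Math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / (Math.PI * Math.PI));
    }

    // E: expected score against an opponent with the given rating and RD.
    public static double e(double rating, double oppRating, double oppRd) {
        return 1.0 / (1.0 + Math.pow(10.0, -g(oppRd) * (rating - oppRating) / 400.0));
    }

    // d^2: inverse of q^2 times the sum of g^2 * E * (1 - E) over the opponents.
    public static double d2(double rating, double[] oppRatings, double[] oppRds) {
        double sum = 0.0;
        for (int j = 0; j < oppRatings.length; j++) {
            double gj = g(oppRds[j]);
            double ej = e(rating, oppRatings[j], oppRds[j]);
            sum += gj * gj * ej * (1.0 - ej);
        }
        return 1.0 / (Q * Q * sum);
    }

    // r': the rating after the rating period.
    public static double newRating(double rating, double rd, double[] oppRatings,
                                   double[] oppRds, double[] scores) {
        double sum = 0.0;
        for (int j = 0; j < oppRatings.length; j++) {
            sum += g(oppRds[j]) * (scores[j] - e(rating, oppRatings[j], oppRds[j]));
        }
        double denominator = 1.0 / (rd * rd) + 1.0 / d2(rating, oppRatings, oppRds);
        return rating + (Q / denominator) * sum;
    }

    // RD': the ratings deviation after the rating period.
    public static double newRd(double rating, double rd, double[] oppRatings,
                               double[] oppRds) {
        return Math.sqrt(1.0 / (1.0 / (rd * rd) + 1.0 / d2(rating, oppRatings, oppRds)));
    }

    public static void main(String[] args) {
        // Same player and opponents as in the calls above.
        double[] ratings = {1400, 1550, 1700};
        double[] rds = {30, 100, 300};
        double[] scores = {1, 0, 0};
        System.out.printf("r' = %.2f%n", newRating(1500, 200, ratings, rds, scores));
        System.out.printf("RD' = %.2f%n", newRd(1500, 200, ratings, rds));
    }
}
```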

The next step is to run this algorithm against the football data and see if its results differ from the ones I got with the Elo algorithm.

I’m still not quite sure what I should set the rating period to. My initial thinking was that the rating period could be a season but that would mean that a team’s rating only really makes sense after a few seasons of matches.

The code is on github if you want to play with it and if you have any suggestions on how to make the code more idiomatic I’d love to hear them.

Written by Mark Needham

September 14th, 2013 at 9:02 pm