Mark Needham

Thoughts on Software Development

Haskell: Writing a custom equality operator

with 3 comments

In the comments on my post about generating random numbers to test a function David Turner suggested that this was exactly the use case for which QuickCheck was intended for so I’ve been learning a bit more about that this week.

I started with a simple property to check that the brute force (bf) and divide and conquer (dc) versions of the algorithm returned the same result, assuming that there were enough values in the list to have a closest pair:

prop_closest_pairs xs = length xs >= 2 ==> dcClosest xs == (fromJust $ bfClosest xs)

I could then run that as follows:

> import QuickCheck.Test
> quickCheck(prop_closest_pairs :: [(Double, Double)] -> Property)

It failed pretty quickly because the bf and dc versions of the algorithm sometimes return the pairs in a different order.

e.g. bf may say the closest pair is (2.0, 0.0), (2.1, 1.1) while dc says it’s (2.1, 1.1), (2.0, 0.0) which will lead to the quick check property failing because those values aren’t equal:

> ((2.0, 0.0), (2.1, 1.1))  == ((2.1, 1.1), (2.0, 0.0))
False

The best way I could think of to get around this problem was to create a type to represent a pair of points and then write a custom equality operator.

I initially ended up with the following:

type Point a = (a, a)
data Pair a = P (Point a) (Point a)
instance Eq (Pair a) where
	P a b == P c d = a == c && b == d || a == d && b == c

Which didn’t actually compile:

qc_test.hs:41:58:
    No instance for (Eq a)
      arising from a use of `=='
    In the second argument of `(&&)', namely `b == c'
    In the second argument of `(||)', namely `a == d && b == c'
    In the expression: a == c && b == d || a == d && b == c

The problem is that while we’ve made Pair an instance of the Equality type class there’s no guarantee that the value contained inside it is an instance of the Equality type class which means we might not be able to compare its values.

We need to add a class constraint to make sure that the value inside P is a part of Eq:

instance (Eq a) => Eq (Pair a) where
	P a b == P c d = a == c && b == d || a == d && b == c

Now we’re saying that we want to make Pair an instance of the Equality type class but only when the value that Pair contains is already an instance of the Equality type class.

In this case we’re just storing pairs of doubles inside the Pair so it will work fine.

Now if we compare the two points from above we’ll see that they’re equal:

> P (2.0, 0.0) (2.1, 1.1)  == P (2.1, 1.1) (2.0, 0.0)
True

I had to go and change the existing code to make use of this new but it didn’t take more than 5-10 minutes to do that.

Written by Mark Needham

May 16th, 2012 at 1:16 pm

Posted in Haskell

Tagged with

Haskell: Removing if statements

with 2 comments

When I was looking over my solution to the closest pairs algorithm which I wrote last week I realised there there were quite a few if statements, something I haven’t seen in other Haskell code I’ve read.

This is the initial version that I wrote:

dcClosest :: (Ord a, Floating a) => [Point a] -> (Point a, Point a)
dcClosest pairs
  if length pairs <= 3 then = fromJust $ bfClosest pairs    
  else 
    foldl (\closest (p1:p2:_) -> if distance (p1, p2) < distance closest then (p1, p2) else closest) 
          closestPair 
          (windowed 2 pairsWithinMinimumDelta)
  where sortedByX = sortBy compare pairs	      
        (leftByX:rightByX:_) = chunk (length sortedByX `div` 2) sortedByX
        closestPair = if distance closestLeftPair < distance closestRightPair then closestLeftPair else closestRightPair  
          where closestLeftPair =  dcClosest leftByX
                closestRightPair = dcClosest rightByX
        pairsWithinMinimumDelta = sortBy (compare `on` snd) $ filter withinMinimumDelta sortedByX
          where withinMinimumDelta (x, _) = abs (xMidPoint - x) <= distance closestPair   
                  where (xMidPoint, _) = last leftByX

We can remove the first if statement which checks the length of the list and replace it with pattern matching code like so:

dcClosest :: (Ord a, Floating a) => [Point a] -> (Point a, Point a)
dcClosest pairs
  | length pairs <= 3 = fromJust $ bfClosest pairs    
  | otherwise = foldl (\closest (p1:p2:_) -> if distance (p1, p2) < distance closest then (p1, p2) else closest)
                      closestPair 
                      (windowed 2 pairsWithinMinimumDelta)
    ...

We can also get rid of the if statement inside the first argument passed to ‘foldl’ and replace it with a call to ‘minimumBy’:

dcClosest :: (Ord a, Floating a) => [Point a] -> (Point a, Point a)
dcClosest pairs
  | length pairs <= 3 = fromJust $ bfClosest pairs    
  | otherwise = foldl (\closest (p1:p2:_) -> minimumBy (compare `on` distance) [closest, (p1, p2)])
                      closestPair 
                      (windowed 2 pairsWithinMinimumDelta)
    ...

We can do the same to replace the if statement where we work out the closestPair which results in this final version of the code:

dcClosest :: (Ord a, Floating a) => [Point a] -> (Point a, Point a)
dcClosest pairs
  | length pairs <= 3 = fromJust $ bfClosest pairs    
  | otherwise = foldl (\closest (p1:p2:_) -> minimumBy (compare `on` distance) [closest, (p1, p2)])
                      closestPair 
                      (windowed 2 pairsWithinMinimumDelta)
  where sortedByX = sortBy compare pairs	      
        (leftByX:rightByX:_) = chunk (length sortedByX `div` 2) sortedByX    
        closestPair = minimumBy (compare `on` distance) [closestLeftPair, closestRightPair]
          where closestLeftPair =  dcClosest leftByX
                closestRightPair = dcClosest rightByX     
        pairsWithinMinimumDelta = sortBy (compare `on` snd) $ filter withinMinimumDelta sortedByX
          where withinMinimumDelta (x, _) = abs (xMidPoint - x) <= distance closestPair
                  where (xMidPoint, _) = last leftByX

It takes up marginally less space and I think the change to use pattern matching on the length of ‘pairs’ makes the biggest difference as the code is now lined up at the same level of indentation.

The other changes would have more of an impact if there were more than 2 things being compared – right now I think either of the versions of the code are equally readable.

Written by Mark Needham

May 12th, 2012 at 3:46 pm

Posted in Haskell

Tagged with

neo4j/Cypher: Finding the shortest path between two nodes while applying predicates

without comments

As I mentioned in a blog post about a week ago I decided to restructure the ThoughtWorks graph I’ve modelled in neo4j so that I could explicitly model projects and clients.

As a result I had to update a traversal I’d written for finding the shortest path between two people in the graph.

The original traversal query I had was really simple because I had a direct connection between the people nodes:

neo = Neography::Rest.new
paths = neo.get_paths(start_node,
                      destination_node,
                      { "type" => "colleagues" },
                      depth = 3,
                      algorithm = "shortestPath")
           .map { |x| x["nodes"] }
           .uniq
paths.map { |p| p.map { |node| neo.get_node_properties(node, "name")["name"] } }

I changed the way the graph was modelled so that you needed to follow a ‘worked_on’ relationship to a project in order to go between people:

V2

In the first version I’d written some pre processing code in Ruby to check whether or not people worked on the project at the same time before creating the relationship between the nodes.

It wasn’t possible to do that with the new structure since I was working out if there was a colleagues relationship dynamically.

I therefore added a ‘start_date’ and ‘end_date’ property to the ‘worked_on’ relationship between a person node and project node so that I’d be able to take it into account when traversing the graph.]

I initially thought it would be possible to do this using cypher and wrote the following query:

start_node_id = neo.send(:get_id, start_node)
destination_node_id = neo.send(:get_id, destination_node)
 
query =  " START a=node(#{start_node_id}), x=node(#{destination_node_id})" 
query << " MATCH p = allShortestPaths( a-[:worked_on*]-x )" 
query << " RETURN p, extract(person in nodes(p) : person.name)"
paths = neo.execute_query(query)

I wasn’t sure how to do the filtering on ‘start_date’ and ‘end_date’ and Andres pointed out that it’s not actually currently possible to take relationship properties into account when traversing a graph with cypher so we need to do the filtering on ‘start_date’ and ‘end_date’ in code.

My first attempt to do that looked like this:

paths = neo.get_paths(node1, node2, { "type" => "worked_on", "direction" => "all" }, 
                      depth = 5, algorithm = "shortestPath").uniq
matching = paths.select do |row|
  relationshipPairs = row["relationships"].each_slice(2).to_a
  relationshipPairs.all? do |pair|
    r1 = neo.get_relationship(pair[0])["data"]
    r2 = neo.get_relationship(pair[1])["data"]
    r1["start_date"] <= r2["end_date"] && r1["end_date"] >= r2["start_date"]
  end
end

The problem with this approach is that it’s really slow due to the fact that I’m pulling back every relationship back in order to check the start and end dates.

Michael Hunger suggested an alternative approach where I still used a cypher query but returned the relationships instead of just the nodes:

start_node_id = neo.send(:get_id, start_node)
destination_node_id = neo.send(:get_id, destination_node)
 
query =  " START a=node(#{start_node_id}), x=node(#{destination_node_id})" 
query << " MATCH p = allShortestPaths( a-[:worked_on*]-x )" 
query << " RETURN p, rels(p), extract(person in nodes(p) : person.name)"
paths = neo.execute_query(query)
matching = paths["data"].select do |row|
  relationship_pairs = row[1].each_slice(2).to_a
  relationship_pairs.all? do |pair|
    r1 = pair[0]["data"]
    r2 = pair[1]["data"]
    r1["start_date"] <= r2["end_date"] && r1["end_date"] >= r2["start_date"]
  end
end
 
matching.map { |x| x[2].each_with_index.select { |x,idx| idx.even? }.map(&:first) }.uniq

This approach mostly works although it runs into problems in the scenario where two people have worked on the same project but not at the same time.

In that scenario the above code will return the relationship between them but it will then be filtered out by the start date/end date logic which means we won’t see a shortest path between those nodes.

I’m not sure how to solve that problem so for the moment I’m going to take the Josh Adell’s suggestion to keep the ‘colleagues’ relationship between nodes and use that relationship for the shortest path traversal.

The ‘worked_on’ relationship is still useful for other things that I want to do but not this particular one.

Written by Mark Needham

May 12th, 2012 at 2:55 pm

Posted in neo4j

Tagged with

Haskell: Explicit type declarations in GHCI

without comments

On a few occasions I’ve wanted to be able to explicitly define the type of something when trying things out in the Haskell REPL (GHCI) but I didn’t actually realise this was possible until a couple of days ago.

For example say we want to use the read function to parse an input string into an integer.

We could do this:

> read "1" :: Int
1

But if we just evaluate the function alone and try and assign the result without casting to a type we get an exception:

> let x  = read "1"
 
<interactive>:1:10:
    Ambiguous type variable `a0' in the constraint:
      (Read a0) arising from a use of `read'
    Probable fix: add a type signature that fixes these type variable(s)
    In the expression: read "1"
    In an equation for `x': x = read "1"

sepp2k shows how we can provide a type declaration in GHCI in his Stack Over Flow answer:

> let x::Int; x = read "1"
> x
1

We can also use it when creating a list of integers to ensure they are of type ‘Int’ rather than ‘Integer’ for example:

> let y = [1,2,3]
> :t y
y :: [Integer]
 
> let y::[Int]; y = [1,2,3]
> :t y
y :: [Int]

It’s a pretty simple thing but I didn’t know it was even possible!

Written by Mark Needham

May 10th, 2012 at 7:11 am

Posted in Haskell

Tagged with

Haskell: Closest Pairs Algorithm

without comments

As I mentioned in a post a couple of days ago I’ve been writing the closest pairs algorithm in Haskell and while the brute force version works for small numbers of pairs it starts to fall apart as the number of pairs increases:

time ./closest_pairs 100 bf 
./closest_pairs 100 bf  0.01s user 0.00s system 87% cpu 0.016 total
 
time ./closest_pairs 1000 bf
./closest_pairs 1000 bf  3.59s user 0.01s system 99% cpu 3.597 total
 
time ./closest_pairs 5000 bf
./closest_pairs 5000  554.09s user 0.36s system 99% cpu 9:14.46 total

Luckily there’s a divide and conquer algorithm we can use which brings down the running time from O(n2

That algorithm is defined like so:

  1. Sort points along the x-coordinate
  2. Split the set of points into two equal-sized subsets by a vertical line x = xmid
  3. Solve the problem recursively in the left and right subsets. This will give the left-side and right-side minimal distances dLmin and dRmin respectively.
  4. Find the minimal distance dLRmin among the pair of points in which one point lies on the left of the dividing vertical and the second point lies to the right (a split pair).
  5. The final answer is the minimum among dLmin, dRmin, and dLRmin.

By step 4 in which we’ll look for a closest pair where the x and y values are in opposite subsets we don’t need to consider the whole list of points again since we already know that the closest pair of points is no further apart than dist = min(dLmin, dRmin).

We can therefore filter out a lot of the values by only keeping values which are within dist of the middle x value.

If a point is further away than that then we already know that the distance between it and any point on the other side will be greater than dist so there’s no point in considering it.

I used the C# version from Rosetta Code as a template and this is what I ended up with:

import Data.Maybe
import Data.List
import Data.List.Split
import Data.Function
 
dcClosest :: (Ord a, Floating a) => [Point a] -> (Point a, Point a)
dcClosest pairs
  if length pairs <= 3 then = fromJust $ bfClosest pairs    
  else 
    foldl (\closest (p1:p2:_) -> if distance (p1, p2) < distance closest then (p1, p2) else closest) 
          closestPair 
          (windowed 2 pairsWithinMinimumDelta)
  where sortedByX = sortBy compare pairs	      
        (leftByX:rightByX:_) = chunk (length sortedByX `div` 2) sortedByX
        closestPair = if distance closestLeftPair < distance closestRightPair then closestLeftPair else closestRightPair  
          where closestLeftPair =  dcClosest leftByX
                closestRightPair = dcClosest rightByX
        pairsWithinMinimumDelta = sortBy (compare `on` snd) $ filter withinMinimumDelta sortedByX
          where withinMinimumDelta (x, _) = abs (xMidPoint - x) <= distance closestPair   
                  where (xMidPoint, _) = last leftByX

If the number of pairs is 3 or less then we can just use the brute force algorithm since the whole splitting the list in half thing doesn’t work so well with a list of less than 4 items.

The main function is a fold which goes over all the pairs of points where we’ve determined that there might be a closest pair with one point on the left hand side of our ‘sorted by x list’ and the other on the right hand side.

We use the ‘windowed‘ function to pair up the points which in this case does something like this:

> windowed 2 [(0,0), (1,1), (2,2), (1.1, 1.2)]
[[(0.0,0.0),(1.0,1.0)],[(1.0,1.0),(2.0,2.0)],[(2.0,2.0),(1.1,1.2)]]

We pass along the closest pair that we’ve found from pairs of points on the left and right hand sides of the list as our seed value and check whether any of the split pairs are closer together than that one. By the time the fold is finished we will have the closest pair in the list.

The code in the where section is used to split the list into two halves, sort it by x and then work out which side of the list has the closest pair. That’s all done recursively and then the last bit of code works out which values we need to consider for the split pairs part of the algorithm.

I learnt about the ‘on‘ function while writing this algorithm which makes it really easy to pick a function to sort a collection with e.g. ‘compare `on` snd’ lets us sort a list in ascending order based on the second value in a tuple.

One problem with the way I’ve written this algorithm is that it places a lot of focus on the split pair bit of the code when actually a big part of why it works is that we’re sorting it into an order that makes it easier to work with and then dividing the problem by 2 each time.

My current thinking is that perhaps I should have had that code in the main body of the function rather than hiding it away in the where section.

It isn’t actually necessary to execute the foldl all the way through since we’ll often know there’s not going to be a split pair before we go through all the pairs. I thought it reads pretty nicely as it is thought and it runs pretty quickly as it is anyway!

Using the same data set as before (and getting the same answers!):

> time ./closest_pairs 1000 dc
./closest_pairs 1000 dc  0.02s user 0.00s system 91% cpu 0.024 total
 
> time ./closest_pairs 5000 dc
./closest_pairs 5000 dc  0.11s user 0.01s system 97% cpu 0.118 total

I described the code for generating random numbers in an earlier blog post but I’ll include it again to show how the algorithm is wired up:

import Control.Monad.State (State, evalState, get, put)
import System.Random (StdGen, mkStdGen, random)
import System
 
type R a = State StdGen a
rand :: R Double
rand = do
  gen <- get
  let (r, gen') = random gen
  put gen'
  return r
 
randPair :: R (Double, Double)
randPair = do
  x <- rand
  y <- rand
  return (x,y)
 
runRandom :: R a -> Int -> a
runRandom action seed = evalState action $ mkStdGen seed
 
normals :: R [(Double, Double)]
normals = mapM (\_ -> randPair) $ repeat ()
 
main = do 
	args <- getArgs
	let numberOfPairs = read (head args) :: Int
	if length args > 1 && args !! 1 == "bf" then 
		putStrLn $ show ( (bfClosest $ take numberOfPairs $ runRandom normals 42))
	else 
		putStrLn $ show ( (dcClosest $ take numberOfPairs $ runRandom normals 42))

The full code is on hpaste if anyone has any suggestions for how to improve it.

Written by Mark Needham

May 9th, 2012 at 12:05 am

Posted in Haskell

Tagged with

Haskell: Generating random numbers

with one comment

As I mentioned in my last post I’ve been coding the closest pairs algorithm in Haskell and needed to create some pairs of coordinates to test it against.

I’ve tried to work out how to create lists of random numbers in Haskell before and always ended up giving up because it seemed way more difficult than it should be but this time I came across a really good explanation of how to do it by jrockway on Stack Overflow.

I thought it’d be interesting to work up to his answer by first looking at the functions available to help us generate random numbers.

mkStdGen is a function which creates a random number generator for a given seed value and seems like a good place to start.

mkStdGen :: Int -> StdGen

The function mkStdGen provides an alternative way of producing an initial generator, by mapping an Int into a generator. Again, distinct arguments should be likely to produce distinct generators.

We can combine that with ‘randomR’ to give us a random value in a range that we choose or with ‘random’ to get a random value in a default range.

randomR :: RandomGen g => (a, a) -> g -> (a, g)

Takes a range (lo,hi) and a random number generator g, and returns a random value uniformly distributed in the closed interval [lo,hi], together with a new generator. It is unspecified what happens if lo>hi. For continuous types there is no requirement that the values lo and hi are ever produced, but they may be, depending on the implementation and the interval.

> fst $ randomR (1,10) (mkStdGen 66)
8
> let randomNumber :: Int; x = fst $ random (mkStdGen 66)
> randomNumber
-4755267682593970340

For now I’m ignoring the new random generator that is provided as the second part of the tuple that these two functions return.

If we want to create an infinite list of random numbers then something like this will do the trick:

let values :: [Int]; values = map fst $ scanl (\(r, gen) _ -> random gen) (random (mkStdGen 1)) $ repeat ()

We use the ‘repeat’ function to create an infinite sequence of nothing then run a scanl across that infinite sequence, called ‘random’ with a new instance of the random number generator each time:

> take 10 values
[7917908265643496962,-1017158127812413512,-1196564839808993555,128524678767915218,3573078269730797745,-2026762924287790163,-5402471397730582264,-8620480566299071809,5987841344909700899,1198540087579679591]
it :: [Int]

In this version we’re passing the next generator along as part of the accumulator of the ‘scanl’ function but another way to do that would be to encapsulate it inside a State monad:

Using the State monad our random number function would read like this:

1
2
3
4
5
6
myRand :: State StdGen Int
myRand = do
	gen <- get
	let (r, nextGen) = random gen
	put nextGen
	return r

The state monad contains some state (the random number generator) and creates values (the random numbers).

Line 3 gets the current generator.
Line 4 creates a new number and generator which it assigns to ‘r’ and ‘nextGen’
Line 5 updates the State monad with the new generator
Line 6 returns the value that was created.

We change ‘values’ to read like this:

let values = mapM (\_ -> myRand) $ repeat ()

‘mapM (\_ -> myRand)’ is the same as doing ‘sequence $ map (\_ -> myRand)’ and we need to do that because we want to return a State Monad with an array of Ints rather than an array of State Monads each with an Int.

In order to evaluate those values we call ‘evalState’ which returns the final result from the State monad i.e. the random value created:

> evalState values $ mkStdGen 1
[7917908265643496962,-1017158127812413512,-1196564839808993555,128524678767915218,3573078269730797745,-2026762924287790163,-5402471397730582264,-8620480566299071809,5987841344909700899,1198540087579679591,5081506216124746781,-2413363174299075397,-6804412472718217891,4559850124463334118,-362777938400029309,-23100237439495333,-2426472460098089322,...]

In our case we want to get an infinite sequence of pairs to represent the coordinates on a plane so we need to have a random function that creates those:

myRandPair :: State StdGen (Int, Int)
myRandPair = do
	x <- myRand
	y <- myRand
	return (x, y)

We then change ‘values’ accordingly:

> let values = mapM (\_ -> myRandPair) $ repeat ()
> evalState values $ mkStdGen 1
[(7917908265643496962,-1017158127812413512),(-1196564839808993555,128524678767915218),(3573078269730797745,-2026762924287790163),(-5402471397730582264,-8620480566299071809),(5987841344909700899,1198540087579679591),(5081506216124746781,-2413363174299075397),(-6804412472718217891,4559850124463334118),(-362777938400029309,-23100237439495333)...]

If we want to get back some different pairs then we just need to change the initial seed value that we pass to ‘mkStdGen’.

jrockway explains this in more detail in his post in which he also wraps the State monad inside an alias to make the code easier to understand.

Written by Mark Needham

May 8th, 2012 at 10:09 pm

Posted in Haskell

Tagged with

Haskell: Maximum Int value

without comments

One of the algorithms covered in Algo Class was the closest pairs algorithm – an algorithm used to determine which pair of points on a plane are closest to each other based on their Euclidean distance.

My real interest lies in writing the divide and conquer version of the algorithm but I started with the brute force version so that I’d be able to compare my answers.

This is the algorithm:

minDist = infinity
for each p in P:
  for each q in P:
    if p ≠ q and dist(p, q) < minDist:
      minDist = dist(p, q)
      closestPair = (p, q)
return closestPair

‘infinity’ in this case could be the maximum value that an Int could hold which on a 64 bit architecture would be 263 so I hardcoded that into my implementation:
o

bfClosest :: (Ord a, Floating a) => [(a, a)] -> Maybe ((a, a), (a, a))
bfClosest pairs = 
  snd $ foldl (\ acc@(min, soFar) (p1, p2) -> 
                if distance p1 p2 < min then (distance p1 p2, Just(p1, p2)) else acc) 
              (2^63, Nothing) 
              [(pairs !! i, pairs !! j) | i <- [0..length pairs - 1], j <- [0..length pairs-1 ], i /= j]
  where distance (x1, y1) (x2, y2) =  sqrt $ ((x1 - x2) ^ 2) + ((y1 - y2) ^ 2)

We’re comparing each point with all the others in the list by folding over a collection of all the combinations and then passing the pair with the smallest distance between points as part of our accumulator.

More by chance than anything else I was reading the Learn You a Haskell chapter on types and type classes and came across the maxBound function which does exactly what I want:

> 2 ^ 63
9223372036854775808
 
> maxBound :: Int
9223372036854775807

We can’t plug that straight into the function as is because the fold inside ‘bfClosest’ expects a float and had been automatically coercing 263 into the appropriate type.

We therefore use ‘fromIntegral’ to help us out:

bfClosest :: (Ord a, Floating a) => [(a, a)] -> Maybe ((a, a), (a, a))
bfClosest pairs = 
  snd $ foldl (\ acc@(min, soFar) (p1, p2) -> 
                if distance p1 p2 < min then (distance p1 p2, Just(p1, p2)) else acc) 
              (fromIntegral (maxBound :: Int), Nothing) 
              [(pairs !! i, pairs !! j) | i <- [0..length pairs - 1], j <- [0..length pairs-1 ], i /= j]
  where distance (x1, y1) (x2, y2) =  sqrt $ ((x1 - x2) ^ 2) + ((y1 - y2) ^ 2)

Written by Mark Needham

May 7th, 2012 at 9:18 am

Posted in Haskell

Tagged with

neo4j: What question do you want to answer?

with 11 comments

Over the past few weeks I’ve been modelling ThoughtWorks project data in neo4j and I realised that the way that I’ve been doing this is by considering what question I want to answer and then building a graph to answer it.

When I first started doing this the main question I wanted to answer was ‘how connected are people to each other’ which led to me modelling the data like this:

Initial

The ‘colleagues with’ relationship stored information about the project the two people had worked on together and how long they’d worked together.

This design was fine while that was the only question I wanted to answer but after I showed it to a few people it became clear that there were other questions we could ask which would be difficult to answer with it designed this way.

e.g.

  • Which people on project X have I never worked with?
  • Which person has worked for client X for the longest?
  • Which people worked together on the same client if not the same project?

I therefore need to make ‘client’ and ‘project’ first class entities in the graph rather than just being there implicitly which favours a design more along these lines:

Initial

It makes it a little more difficult to answer the initial question about connections between people but opens up the answers to other questions such as the ones detailed above.

I’m still getting used to this way of modelling data but it feels like you’re driven towards designing your data in a way that’s useful to you as opposed to the relational approach where you tend to model relations and then work out what you want to do with the data.

Written by Mark Needham

May 5th, 2012 at 1:20 pm

Posted in neo4j

Tagged with

gephi: Centring a graph around an individual node

without comments

I spent some time recently playing around with gephi – an open source platform for creating visualisations of graphs – to get a bit more insight into the ThoughtWorks graph which I’ve created in neo4j.

I followed Max De Marxi’s blog post to create a GEFX (Graph Exchange XML Format) file to use in gephi although I later learned that you can import directly from neo4j into gephi which I haven’t tried yet.

I loaded it into gephi and this is what it looks like:

Full graph
Filter menu top

There are massive numbers of connections between almost every node so it’s pretty hard to see what’s going on.

I wanted to be able to centre the graph around an individual person and see just the links related to them.

To do that we need to make use of an ‘ego filter‘ so the first step is to make the filters menu visible by clicking ‘Window > Filters’.

We can then choose the ego filter which sits under ‘Library > Topology’ and fill in the ID of the node that we want to centre the graph around as well as choose the depth of the traversal.

Filter menu bottom

In this case any traversal above 1 will end up traversing the majority of the graph since the average distance between nodes in this graph is just above 2.5.

I run the ‘Force Atlas’ layout algorithm over it and this is what we end up with:

My graph

My graph shows some interesting clustering of nodes where the ones on the right are people I worked with in India, top left are people I worked with in Australia, bottom left people in the UK and the ones towards the lower middle are people I went to ThoughtWorks University with.

There are a load of other filters to choose from but the ego filter is pretty cool!

Written by Mark Needham

April 30th, 2012 at 10:20 pm

Posted in neo4j,Software Development

Tagged with , ,

Performance: Caching per request

without comments

A couple of years ago I wrote a post describing an approach my then colleague Christian Blunden used to help improve the performance of an application where you try to do expensive things less or find another way to do them.

On the application I’m currently working on we load reference data from an Oracle database into memory based on configurations provided by the user.

There are multiple configurations and then multiple ways that those configurations can be priced so we have two nested for loops in which we load data and then perform calculations on it.

When profiling the application we realised that some of the database calls being made with the same parameters and were therefore loading back the same reference data that we’d already loaded.

The most obvious way to solve this problem would be to move the code out of the loop and make less calls to the database that way but logically the domain is expressed more clearly when it’s inside the loop.

Alex therefore came up with an alternative approach where we wrap the database calling code in a caching decorator.

The caching decorator is a request scoped object so we’re only caching those results for a short amount of time which means that we don’t have to worry about cache invalidation because it’s thrown away when the request has been processed.

I’ve previously seen this problem solving by using a Hibernate second level cache which would cache results across requests.

In our application there are more likely to be calls to the database with the same parameters within the same request rather than across requests.

The load on the system is likely to come from complex requests where many prices needed to be calculated rather than from a huge frequency of requests

If that changes then we always have the option of caching at both levels but at the moment our current approach seems to work pretty well.

Written by Mark Needham

April 30th, 2012 at 9:45 pm