Mark Needham

Thoughts on Software Development

Archive for June, 2012

Powerpoint saving movies as images

without comments

I’ve been working on a presentation for the ThoughtWorks Europe away day over the last few days and I created some screencasts using Camtasia which I wanted to include.

It’s reasonably easy to insert movies into Powerpoint but I was finding that when I saved the file and then reloaded it the movies had been converted into images which wasn’t what I wanted at all!

Eventually I came across a blog post which explained that I’d been saving the file as the wrong format.

I’d been saving it as a ‘ppt’ but what I actually needed was to save it as a ‘ppsx':

[screenshot of the PowerPoint save dialog]

I spent about an hour trying to work out what I was doing wrong, an hour I hopefully won’t have to spend next time.

Written by Mark Needham

June 30th, 2012 at 10:05 am

neo4j: Handling optional relationships

without comments

On my ThoughtWorks neo4j graph there are now two different types of relationships between people nodes – they can either be colleagues or one can be the sponsor of the other.

The graph looks like this:

[diagram: people connected by 'sponsor_of' and 'colleagues' relationships]

I wanted to get a list of all the sponsor pairs but also have some indicator of whether the two people have worked together.

I started off by getting all of the sponsor pairs:

START n = node(*) 
MATCH n-[r:sponsor_of]->n2
RETURN n.name, n2.name

I managed to narrow that down to the people who sponsored someone that they’d worked with like so:

START n = node(*) 
MATCH n-[r:sponsor_of]->n2, n-[r2:colleagues]->c
WHERE c = n2
RETURN n.name, n2.name

But it wasn’t quite what I wanted since I’d now lost all the sponsor pairs who didn’t work together.

My next attempt was to remove the WHERE clause and try the following which isn’t even a valid cypher query:

START n = node(*) 
MATCH n-[r:sponsor_of]->n2, n-[r2:colleagues]->c
RETURN n.name, n2.name, n2 IN [c]

I was struggling so I decided to draw out the above diagram and then work backwards from the type of output which I expected if I had the correct query.

The output I wanted was like this:

PersonA | PersonB | Sponsor Relationship | Colleague Relationship
PersonA | PersonC | Sponsor Relationship | -

Once I had written it out on paper it became clear that what I needed to do was find all the sponsor pairs and then optionally look for a colleagues relationship between the pair:

START n = node(*)  
MATCH n-[r:sponsor_of]->n2-[r2?:colleagues]->n 
RETURN n.name, n2.name, r, r2

The '?' before the ':' in the colleagues relationship indicates that it's optional, meaning the traversal will still be returned even if that relationship doesn't exist.

If we run that query in the console it does exactly what we want:

==> +--------------------------------------------------------------------------------------+
==> | n.name            | n2.name            | r                      | r2                 |
==> +--------------------------------------------------------------------------------------+
==> | "PersonA"         | "PersonB"          | :sponsor_of[261255] {} | :colleagues[217292]|
==> | "PersonA"         | "PersonC"          | :sponsor_of[261252] {} | <null>             |
==> +--------------------------------------------------------------------------------------+

Written by Mark Needham

June 24th, 2012 at 11:32 pm

Posted in neo4j


Why you shouldn’t use name as a key a.k.a. I am an idiot

with 5 comments

I think one of the first things that I learnt about dealing with users in a data store is that you should never use name as a primary key because there might be two people with the same name.

Despite knowing that, I foolishly chose to ignore it when building my neo4j graph and used name as the key for the Lucene index.

I thought I’d got away with it but NO!

Earlier today I was trying to work out who the most connected person at ThoughtWorks is and the graph was suggesting that ‘Rahul Singh’ was the most connected, having worked with 540 people.

I mentioned this to Jen who felt something was probably wrong since he’d only started working at ThoughtWorks a couple of years ago.

Amusingly Jen found an email from 18 months ago sent by Rahul #1 explaining that there were in fact two people with exactly the same name and he was getting emails intended for the other one and vice versa.

I now have first hand knowledge of what can happen if you ignore one of the most basic rules of software development!

My gamble that there probably wouldn’t be two people with the same name in such a small dataset has totally failed and from now on I’ll be sure to use a unique key!
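For what it's worth, the fix is straightforward with neography: index on whatever unique identifier the staffing application provides rather than the name. A minimal sketch, where 'employee_id' and the values are my own placeholders for whatever unique identifier is actually available:

require 'rubygems'
require 'neography'

neo = Neography::Rest.new(:port => 7474)                    # adjust to point at your server

person = { :name => "Rahul Singh", :employee_id => 1234 }   # illustrative values only

# keep the name as a property but index on the unique id instead
node = neo.create_node("name" => person[:name], "employee_id" => person[:employee_id])
neo.add_node_to_index("people", "employee_id", person[:employee_id], node)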

Written by Mark Needham

June 24th, 2012 at 10:55 pm

Brightbox Repository: GPG error: The following signatures couldn’t be verified because the public key is not available

with one comment

We’re using the Brightbox Ruby repository to get the versions of Ruby which we install on our machines and although we eventually put the configuration for this repository into Puppet we initially tested it out on a local VM.

To start with you need to add the repository to /etc/apt/sources.list:

deb http://ppa.launchpad.net/brightbox/ruby-ng/ubuntu lucid main

To get that picked up we run the following:

apt-get update

That initially threw this error because it's a GPG-signed repository and we hadn't added the key:

W: GPG error: http://ppa.launchpad.net lucid Release: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY F5DA5F09C3173AA6

I recently realised that the instructions explaining how to add the repository's signing key are hidden away in an overlay on the page but the opsview wiki also explains what to do.

To add the key we need to run the following command and then paste in the public key block:

sudo apt-key add -
 
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: SKS 1.0.10
 
mI0ETKTCMQEEAMX3ttL4YFO5AQ7Z6L5gaGw57CJBQl6jCv6lka0p8DaGNkeX0Rs9DhINa8qR
hxJCPK6ijeoNss69G/ni+sMSRViJBFWXzitEE1ew5YM2sw7wLE3guToDu60kaDwIn5mR3GTx
cgqDrQeCuGZJgz3e2lgmGYw2rAhMe78rRgkR5GFvABEBAAG0G0xhdW5jaHBhZCBQUEEgZm9y
IEJyaWdodGJveIi4BBMBAgAiBQJMpMIxAhsDBgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAK
CRD12l8Jwxc6pl2BA/4p5DFEpGVvkgLj7/YLYCtYmZDw8i/drGbkWfIQiOgPWIf8QgpJXVME
1tkH8N1ssjbJlUKl/HubNBKZ6HDyQsQASFug+eI6KhSFMScDBf/oMX3zVCTTvUkgJtOWYc5d
77zJacEUGoSEx63QUJVvp/LAnqkZbt17JJL6HOou/CNicw==
=G8vE
-----END PGP PUBLIC KEY BLOCK-----

Then press Ctrl-D to exit the command.

The public key comes from the PPA's Launchpad page, where it's referenced under the 'Signing key' section.
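Alternatively, rather than pasting the key in by hand, it should also be possible to fetch it from the Ubuntu keyserver using the key id from the error message above:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F5DA5F09C3173AA6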

Written by Mark Needham

June 24th, 2012 at 12:58 am

Creating a Samba share between Ubuntu and Mac OS X

without comments

On the project I’m currently working on we have our development environment set up on a bare-bones Ubuntu instance which we run via VMware.

We wanted to be able to edit files on the VM from the host O/S so my colleague Phil suggested that we set up a Samba server on the VM and then connect to it from the Mac.

We first needed to install a couple of packages on the VM:

  • apt-get install samba
  • apt-get install libpam-smbpass

The first package is self-explanatory and the second allows us to keep the Samba username/password in sync with the unix user on the VM.

Installing the samba package will automatically start up the Samba daemon ‘smbd’.

$ ps aux | grep smbd
mneedham 10915  0.0  0.0   7624   928 pts/14   S+   17:37   0:00 grep --color=auto smbd
root     32610  0.0  0.1  95372  5408 ?        S    Jun22   0:50 smbd -F

We then need to edit /etc/samba/smb.conf:

First we uncomment this line:

security = user

Then add a share, probably at the bottom of the file but anywhere is fine:

[mneedham]
comment = Mark's vm
read only = no
path = /home/mneedham
guest ok = no
browseable = yes
create mask = 0644
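After saving smb.conf it's worth checking the configuration and restarting the daemon so that the new share gets picked up. Something along these lines should do it, although the exact restart command depends on the Ubuntu version:

testparm                    # checks /etc/samba/smb.conf for syntax errors
sudo service smbd restart   # or: sudo /etc/init.d/smbd restart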

From the Mac we need to mount the share:

  • Go to finder
  • Connect to server (Cmd – K)
  • Type in 'smb://ip.of-vm'
  • Select the name of the share

The share should now be accessible from the host O/S at /Volumes/name.of.share
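If you'd rather mount it from the terminal than through Finder, something like this should also work (the username, IP and mount point are placeholders):

mkdir -p ~/vm-share
mount_smbfs //mneedham@ip.of-vm/mneedham ~/vm-share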

Looking back I’m sure there’s a way to configure VMware to share files from the guest O/S but at least I now know another way to do it as well!

Written by Mark Needham

June 24th, 2012 at 12:40 am

Visualising a neo4j graph using gephi

without comments

At ThoughtWorks we don’t have line managers but people can choose to have a sponsor – typically someone who has worked in the company for longer/has more experience in the industry than them – who can help them navigate the organisation better.

From hearing people talk about sponsors over the last 6 years it seemed like a relatively small group of people sponsored the majority of others and there were probably a few people who didn’t have a sponsor at all.

It seemed like a pretty good problem to visualise in a graph so I got access to the data, spent a few hours tidying it up so all the names matched the names we have in our staffing application and then loaded it into neo4j.

I initially tried to visualise the data in sigma.js but that didn’t work that well here – I think it’s much better when we actually want to browse around a graph whereas here I’m just interested in an overall snapshot.

I therefore decided to load the data into gephi and find a way of visualising it using that.

The relationships on the graph are like this:

[diagram: people linked to an office by 'member_of' and to each other by 'sponsor_of']

I created this using the following graphviz definition:

graph effectgraph {
	size="8,8"; 
	rankdir=LR;
 
	person1[label="Person 1"];
	person2[label="Person 2"];	
	person3[label="Person 3"];	
	officeA[label="Office A"];
 
	officeA -- person1 [label="member_of"];
	officeA -- person2 [label="member_of"];
	officeA -- person3 [label="member_of"];
	person1 -- person2 [label="sponsor_of"];
	person2 -- person3 [label="sponsor_of"];	
}
I rendered that to a png like so:

dot -Tpng v3.dot >> sponsors.png

I wrote a script based on Max de Marzi’s blog post to get the data into gexf format so that I could load it into gephi:

First I get a collection of all the people who are sponsors and how many sponsees they have:

def load_sponsors
 query =  " START n = node(*)" 
 query << " MATCH n-[r:sponsor_of]->n2" 
 query << " RETURN ID(n), count(r) AS sponsees ORDER BY sponsees DESC"
 
 sponsors = {}
 @neo.execute_query(query)["data"].each do |id, sponsees|
 	sponsors[id] = sponsees
 end
 sponsors
end

That creates a hash of sponsors with a count of how many sponsees they have, which I used in the following function to create a collection of nodes:

def nodes
  query =  " START n = node(*)"
  query << " MATCH n-[r:member_of]->o" 
  query << " WHERE o.name IN ['London', 'Manchester', 'Hamburg'] AND not(has(r.end_date))"
  query << " RETURN DISTINCT(n.name), ID(n)"
 
  sponsors_sponsee_count = load_sponsors
 
  nodes = Set.new
  @neo.execute_query(query)["data"].each do |n| 
  	nodes << { "id" => n[1], "name" => n[0], "size" => 5 + ((sponsors_sponsee_count[n[1]] || 0) * 5) }
  end
 
  nodes
end

I have nodes representing people in the whole organisation so I need to filter down to people who work for ThoughtWorks Europe since that’s the part of the organisation I have sponsor data for. I add a size property here so that people who have more sponsees will be more prominent on the graph.

We then have the following function to describe the ‘sponsor_of’ relationships:

def edges
  query =  " START n = node(*)"
  query << " MATCH n-[r:sponsor_of]->n2"
  query << " RETURN ID(r), ID(n), ID(n2)"
 
  @neo.execute_query(query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} }
end

I use the following code to generate the XML format I need:

xml = Builder::XmlMarkup.new(:target=>STDOUT, :indent=>2)
xml.instruct! :xml
xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz"  do
  xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do
    xml.nodes :count => nodes.size do
      nodes.each do |n|
        xml.node :id => n["id"],   :label => n["name"] do
          xml.tag!("viz:size",     :value => n["size"])
          xml.tag!("viz:color",    :b => 255, :g => 255, :r => 255)
          xml.tag!("viz:position", :x => rand(100), :y => rand(100))
        end
      end
    end
    xml.edges :count => edges.size do
      edges.each do |e|
        xml.edge :id => e["id"], :source => e["source"], :target => e["target"]
      end
    end
  end
end

We end up with something like the following:

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gephi.org/gexf" xmlns:viz="http://www.gephi.org/gexf/viz">
  <graph defaultedgetype="directed" idtype="string" type="static">
    <nodes count="274">
      <node id="1331" label="Person 1">
        <viz:size value="5"/>
        <viz:color b="255" g="255" r="255"/>
        <viz:position x="69" y="31"/>
      </node>
    ....
    </nodes>
    <edges count="187">
      <edge id="7975" source="56" target="1374"/>
    </edges>
  </graph>
</gexf>

I set the positions of the nodes to be randomised because the gephi algorithms seem to work much better that way.

I can then create the gexf file like so:

ruby gephi_me.rb >> sponsors.gexf
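For completeness, the functions above assume a bit of setup at the top of gephi_me.rb, roughly the following, although the port and any authentication will depend on your neo4j installation:

require 'rubygems'
require 'set'
require 'builder'
require 'neography'

@neo = Neography::Rest.new(:port => 7474)   # adjust to point at your neo4j server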

I loaded it into gephi and ran the Force Atlas & ‘Noverlap’ algorithms over the graph to make it a bit easier to visualise the data:

[gephi visualisation of the sponsor graph]

The top 4 sponsors on the graph are sponsors to 28 people between them and the next 7 cover a further 35 people.

Interestingly there’s a big group of orphans in the middle who don’t have a sponsor – initially I thought it was a bit strange that there are so many but people who have moved to the UK from another country and still have a sponsor in that country would also fall into this category.

After noticing that on the visualisation I wrote the following query to help me find out who the orphans were:

  query =  " START n = node(*)"
  query << " MATCH n-[r:member_of]->o, n<-[r2?:sponsor_of]-n2" 
  query << " WHERE r2 is null and o.name IN ['London', 'Manchester', 'Hamburg'] AND not(has(r.end_date))"
  query << " RETURN DISTINCT(n.name), ID(n)"

I wanted to annotate the image to point out who specific people were for internal use and a few people on twitter pointed me towards skitch which made my life amazingly easy so I’d highly recommend that.

Written by Mark Needham

June 21st, 2012 at 5:02 am

Posted in neo4j


Haskell: Mixed type lists

with 3 comments

I’ve been continuing to work through the exercises in The Little Schemer and came across a problem which required me to write a function to take a mixed list of Integers and Strings and filter out the Integers.

As I mentioned in my previous post I’ve been doing the exercises in Haskell but I thought I might struggle with that approach here because Haskell collections are homogeneous i.e. all the elements need to be of the same type.

I read about existentially quantified types but they seemed a bit complicated and instead I decided to use the Dynamic interface.

Using Dynamic we can define a function to strip out the numbers like this:

import Data.Dynamic
import Data.Maybe
 
noNums :: [Dynamic] -> [Dynamic]
noNums lat = cond [(null lat, []), 
                   (isNumber (head lat), noNums (tail lat)),
                   (otherwise, head lat : noNums (tail lat))]
 
justInt :: Dynamic -> Maybe Int
justInt dyn = fromDynamic dyn :: Maybe Int
 
isNumber :: Dynamic -> Bool
isNumber x = isJust $ justInt x

We can then call the function like this:

> map toString $ noNums [toDyn (5 :: Int), toDyn "pears", toDyn (6 :: Int), toDyn "prunes", toDyn (9 :: Int), toDyn "dates"]
[Just "pears",Just "prunes",Just "dates"]

where toString is a small helper for casting back to a String:

toString :: Dynamic -> Maybe String
toString dyn = fromDynamic dyn

fromDynamic eventually makes a call to unsafeCoerce#:

The function unsafeCoerce# allows you to side-step the typechecker entirely. That is, it allows you to coerce any type into any other type. If you use this function, you had better get it right, otherwise segmentation faults await. It is generally used when you want to write a program that you know is well-typed, but where Haskell’s type system is not expressive enough to prove that it is well typed.

I wanted to try and make the ‘isNumber’ function handle any numeric type rather than just Ints but I haven’t quite worked out how to do that.
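One crude workaround, rather than a proper solution, would be to just try casting to each numeric type in turn, e.g. covering Int and Double:

isNumber :: Dynamic -> Bool
isNumber x = isJust (fromDynamic x :: Maybe Int)
          || isJust (fromDynamic x :: Maybe Double)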

Obviously I’m only using Dynamic here because the exercise requires it but I’m not sure what real life situation would require its use.

If anyone has used it before or knows a use case I’d be interested to know what it is!

Written by Mark Needham

June 19th, 2012 at 11:09 pm

Posted in Haskell


The Little Schemer: Attempt #2

without comments

A few weeks ago I asked the twittersphere for some advice on how I could get better at writing recursive functions and one of the pieces of advice was to work through The Little Schemer.

I first heard about The Little Schemer a couple of years ago and after going through the first few pages I got bored and gave up.

I still found the first few pages a bit trivial this time around as well but my colleague Jen Smith encouraged me to keep going and once I’d got about 20 pages in it became clearer to me why the first few pages had been written the way they had.

I’m just under half way through at the moment and so far the main thing the authors are trying to do is get you to think through problems in a way that makes them easily solvable using recursion.

In the first few pages that approach seems ridiculously over the top but it gets you thinking in the right way and means that when you reach the exercises where you need to write your own functions it’s not too much of a step up.

Since I’ve been learning Haskell for the last few months I thought it’d be interesting to try and use that language even though Clojure would be a more natural fit.

I initially wrote the solutions to the exercises using idiomatic Haskell before realising I was probably missing the point by doing that:

multiInsertL new old [] = []
multiInsertL new old (x:xs) | x == old = new:old:multiInsertL new old xs
                            | otherwise = x:multiInsertL new old xs

I came across the cond package which lets you simulate a Scheme/Clojure ‘cond’ and so far all the functions in the exercises have made use of that.

multiInsertL2 new old lat = cond [(null lat, []),
                                  (head lat == old, new:old:multiInsertL2 new old (tail lat)),
                                  (otherwise, (head lat):multiInsertL2 new old (tail lat))]

I wouldn’t write Haskell like this normally but I think it’s helpful for me while I’m getting used to the different patterns you can have in recursive functions.

As we go through the exercises the authors describe ‘commandments’ that you should follow when writing recursive functions which are really useful for me as a novice.

My favourite one so far is the 4th commandment:

Always change at least one argument while recursing. It must be changed to be closer to termination. The changing argument must be tested in the termination condition:

  • when using cdr, test termination with null? and
  • when using sub1, test termination with zero?

So many times when I’ve tried to write recursive functions I’ve ended up putting them into an infinite loop but following this rule should help avoid that!
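As a small example of the commandment in action, the book's 'rember' function, which removes the first occurrence of an element from a list, shrinks its list argument on every recursive call and tests for emptiness to terminate. In the guard style I used earlier it might look like this:

rember :: Eq a => a -> [a] -> [a]
rember _ []                 = []
rember x (y:ys) | x == y    = ys
                | otherwise = y : rember x ys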

The cool thing about this book is that you can work through a few problems, put it down for a few days and still pick up from where you left off quite easily.

Highly recommended so far!

Written by Mark Needham

June 19th, 2012 at 12:21 am

neo4j/Cypher: Finding the most connected node on the graph

with 6 comments

As I mentioned in another post about a month ago I’ve been playing around with a neo4j graph in which people nodes are connected to each other by a ‘colleagues’ relationship.

One thing I wanted to do was work out which node is the most connected on the graph, which would tell me who’s worked with the most people.

I started off with the following cypher query:

query =  " START n = node(*)"
query << " MATCH n-[r:colleagues]->c"
query << " WHERE n.type? = 'person' and has(n.name)"
query << " RETURN n.name, count(r) AS connections"
query << " ORDER BY connections DESC"

I can then execute that via the neo4j console or through irb using the neography gem like so:

> require 'rubygems'
> require 'neography'
> neo = Neography::Rest.new(:port => 7476)
> neo.execute_query query
 
# cut for brevity
{"data"=>[["Carlos Villela", 283], ["Mark Needham", 221]], "columns"=>["n.name", "connections"]}

That shows me each person and the number of people they’ve worked with but I wanted to be able to see the most connected person in each office.

Each person is assigned to an office while they’re working out of that office but people tend to move around so they’ll have links to multiple offices:

[diagram: a person with 'member_of' relationships to multiple offices]

I put ‘start_date’ and ‘end_date’ properties on the ‘member_of’ relationship and we can work out a person’s current office by finding the ‘member_of’ relationship which doesn’t have an end date defined:

query =  " START n = node(*)"
query << " MATCH n-[r:colleagues]->c, n-[r2:member_of]->office"
query << " WHERE n.type? = 'person' and has(n.name) and not(has(r2.end_date))"
query << " RETURN n.name, count(r) AS connections, office.name"
query << " ORDER BY connections DESC"

And now our results look more like this:

{"data"=>[["Carlos Villela", 283, "Porto Alegre - Brazil"], ["Mark Needham", 221, "London - UK South"]], 
"columns"=>["n.name", "connections"]}

If we want to restrict that to just return the people for a specific office we can do that as well:

query =  " START n = node(*)"
query << " MATCH n-[r:colleagues]->c, n-[r2:member_of]->office"
query << " WHERE n.type? = 'person' and has(n.name) and (not(has(r2.end_date))) and office.name = 'London - UK South'"
query << " RETURN n.name, count(r) AS connections, office.name"
query << " ORDER BY connections DESC"
{"data"=>[["Mark Needham", 221, "London - UK South"]], "columns"=>["n.name", "connections"]}

In the current version of cypher we need to put brackets around the not expression otherwise it will apply the not to the rest of the where clause. Another way to get around that would be to put the not part of the where clause at the end of the line.

While these queries let me work out who has worked with the most people, I’m not sure that actually tells you who the best connected person is because it’s heavily biased towards people who have worked on big teams.

Some ways to try and account for this are to bias the connectivity in favour of people you have worked with for longer and to give less weight to big teams, since you’re less likely to have a strong connection with everyone on a big team than you would in a smaller one.

I haven’t got onto that yet though!
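As a very rough sketch of the first idea, assuming a hypothetical 'days' property on the 'colleagues' relationship recording how long two people overlapped, the count could be swapped for a sum:

query =  " START n = node(*)"
query << " MATCH n-[r:colleagues]->c"
query << " WHERE n.type? = 'person' and has(n.name) and has(r.days)"
query << " RETURN n.name, sum(r.days) AS weighted_connections"
query << " ORDER BY weighted_connections DESC"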

Written by Mark Needham

June 16th, 2012 at 10:41 am

Posted in neo4j


Functional Thinking: Separating concerns

without comments

Over the weekend I was trying to port some of the neo4j import code for the ThoughtWorks graph I’ve been working on to make use of the REST Batch API and I came across an interesting example of imperative vs functional thinking.

I’m using the neography gem to populate the graph and to start with I was just creating a person node and then creating an index entry for it:

people_to_load = Set.new
people_to_load << { :name => "Mark Needham", :id => 1 }
people_to_load << { :name => "Jenn Smith", :id => 2 }
people_to_load << { :name => "Chris Ford", :id => 3 } 
 
command_index = 0
people_commands = people_to_load.inject([]) do |acc, person| 
  acc << [:create_node, {:id => person[:id], :name => person[:name]}]
  acc << [:add_node_to_index, "people", "name", person[:name], "{#{command_index}}"]
  command_index += 2
  acc
end
 
Neography::Rest.new.batch(*people_commands)

people_commands ends up containing the following arrays in the above example:

 [
  [:create_node, {:id=>"1", :name=>"Mark Needham"}], 
  [:add_node_to_index, "people", "name", "Mark Needham", "{0}"], 
  [:create_node, {:id=>"2", :name=>"Jenn Smith"}], 
  [:add_node_to_index, "people", "name", "Jenn Smith", "{2}"], 
  [:create_node, {:id=>"3", :name=>"Chris Ford"}, 
  [:add_node_to_index, "people", "name", "Chris Ford", "{4}"]
 ]

We can refer to previously executed batch commands by referencing their ‘job id’, which in this case is their 0-indexed position in the list of commands. For example, the second command, which indexes me, refers to the node created by ‘job id’ 0, i.e. the first command in this batch.

In the neo4j REST API we’d be able to define an arbitrary id for a command and then reference that later on but it’s not implemented that way in neography.

I thought having the ‘command_index += 2’ line was a bit rubbish because it’s nothing to do with the problem I’m trying to solve so I posted to twitter to see if there was a more functional way to do this.

My colleague Chris Ford came up with a neat approach which involved using ‘each_with_index’ to work out the index positions rather than having a counter. His final version looked like this:

insert_commands = people_to_load.map do |person|
  [:create_node, {:id => person[:id], :name => person[:name]}]
end
 
index_commands = people_to_load.each_with_index.map do |person, index|
  [:add_node_to_index, "people", "name", person[:name], "{#{index}}"]
end
 
people_commands = insert_commands + index_commands
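For comparison with the earlier listing, people_commands now ends up with all the create commands first and the index commands referring back to them by position. Written out by hand it would look like this:

 [
  [:create_node, {:id=>1, :name=>"Mark Needham"}],
  [:create_node, {:id=>2, :name=>"Jenn Smith"}],
  [:create_node, {:id=>3, :name=>"Chris Ford"}],
  [:add_node_to_index, "people", "name", "Mark Needham", "{0}"],
  [:add_node_to_index, "people", "name", "Jenn Smith", "{1}"],
  [:add_node_to_index, "people", "name", "Chris Ford", "{2}"]
 ]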

The neat thing about this solution is that Chris has separated the two concerns – creating the node and indexing it.

When I was thinking about this problem imperatively they seemed to belong together but there’s actually no reason for that to be the case and we can write simpler code by separating them.

We do iterate through the set twice but since it’s not really that big it doesn’t make too much difference to the performance.

Written by Mark Needham

June 12th, 2012 at 11:50 pm