Mark Needham

Thoughts on Software Development

Visualising a neo4j graph using gephi

without comments

At ThoughtWorks we don’t have line managers but people can choose to have a sponsor – typically someone who has worked in the company for longer/has more experience in the industry than them – who can help them navigate the organisation better.

From hearing people talk about sponsors over the last 6 years it seemed like quite a few people sponsored the majority and there were probably a few people who didn’t have a sponsor.

It seemed like a pretty good problem to visualise in a graph so I got access to the data, spent a few hours tidying it up so all the names matched the names we have in our staffing application and then loaded it into neo4j.

I initially tried to visualise the data in sigma.js but that didn’t work that well here – I think it’s much better when we actually want to browse around a graph whereas here I’m just interested in an overall snapshot.

I therefore decided to load the data into gephi and find a way of visualising it using that.

The relationships on the graph are like this:

Sponsors graphviz

I created this using the following graphviz definition:

graph effectgraph {
	size="8,8"; 
	rankdir=LR;
 
	person1[label="Person 1"];
	person2[label="Person 2"];	
	person3[label="Person 3"];	
	officeA[label="Office A"];
 
	officeA -- person1 [label="member_of"];
	officeA -- person2 [label="member_of"];
	officeA -- person3 [label="member_of"];
	person1 -- person2 [label="sponsor_of"];
	person2 -- person3 [label="sponsor_of"];	
}
dot -Tpng v3.dot >> sponsors.png

I wrote a script based on Max de Marzi’s blog post to get the data into gexf format so that I could load it into gephi:

First I get a collection of all the people who are sponsors and how many sponsees they have:

def load_sponsors
 query =  " START n = node(*)" 
 query << " MATCH n-[r:sponsor_of]->n2" 
 query << " RETURN ID(n), count(r) AS sponsees ORDER BY sponsees DESC"
 
 sponsors = {}
 @neo.execute_query(query)["data"].each do |id, sponsees|
 	sponsors[id] = sponsees
 end
 sponsors
end

That creates a hash of sponsors with a count of how many sponsees they which I used in the following function to creates a collection of nodes:

def nodes
  query =  " START n = node(*)"
  query << " MATCH n-[r:member_of]->o" 
  query << " WHERE o.name IN ['London', 'Manchester', 'Hamburg'] AND not(has(r.end_date))"
  query << " RETURN DISTINCT(n.name), ID(n)"
 
  sponsors_sponsee_count = load_sponsors
 
  nodes = Set.new
  @neo.execute_query(query)["data"].each do |n| 
  	nodes << { "id" => n[1], "name" => n[0], "size" => 5 + ((sponsors_sponsee_count[n[1]] || 0) * 5) }
  end
 
  nodes
end

I have nodes representing people in the whole organisation so I need to filter to only find people who work for ThoughtWorks Europe since that’s where I have the sponsor data for. I add a size property here so that people who have more sponsees will be more prominent on the graph.

We then have the following function to describe the ‘sponsor_of’ relationships:

def edges
  query =  " START n = node(*)"
  query << " MATCH n-[r:sponsor_of]->n2"
  query << " RETURN ID(r), ID(n), ID(n2)"
 
  @neo.execute_query(query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} }
end

I use the following code to generate the XML format I need:

xml = Builder::XmlMarkup.new(:target=>STDOUT, :indent=>2)
xml.instruct! :xml
xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz"  do
  xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do
    xml.nodes :count => nodes.size do
      nodes.each do |n|
        xml.node :id => n["id"],   :label => n["name"] do
          xml.tag!("viz:size",     :value => n["size"])
          xml.tag!("viz:color",    :b => 255, :g => 255, :r => 255)
          xml.tag!("viz:position", :x => rand(100), :y => rand(100))
       end
      end
    end
    xml.edges :count => edges.size do
      edges.each do |e|
        xml.edge:id => e["id"], :source => e["source"], :target => e["target"]
      end
    end
  end
end

We end up with something like the following:

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gephi.org/gexf" xmlns:viz="http://www.gephi.org/gexf/viz">
  <graph defaultedgetype="directed" idtype="string" type="static">
    <nodes count="274">
      <node id="1331" label="Person 1">
        <viz:size value="5"/>
        <viz:color b="255" g="255" r="255"/>
        <viz:position x="69" y="31"/>
      </node>
    ....
    </nodes>
    <edges count="187">
      <edge id="7975" source="56" target="1374"/>
    </edges>
  </graph>
</gexf>

I set the positions of the nodes to be randomised because the gephi algorithms seem to work much better that way.

I can then create the gexf file like so:

ruby gephi_me.rb >> sponsors.gexf

I loaded it into gephi and ran the Force Atlas & ‘Noverlap’ algorithms over the graph to make it a bit easier to visualise the data:

Sponsors

The top 4 sponsors on the graph are sponsors to 28 people between them and the next 7 cover a further 35 people.

Interestingly there’s a big group of orphans in the middle who don’t have a sponsor – initially I thought it was a bit strange that there are so many but people who have moved to the UK from another country and have a sponsor from that country would also come in this category.

I wrote the following query to help me find out who the orphans were after noticing that on the visualisation:

  query =  " START n = node(*)"
  query << " MATCH n-[r:member_of]->o, n<-[r2?:sponsor_of]-n2" 
  query << " WHERE r2 is null and o.name IN ['London', 'Manchester', 'Hamburg'] AND not(has(r.end_date))"
  query << " RETURN DISTINCT(n.name), ID(n)"

I wanted to annotate the image to point out who specific people were for internal use and a few people on twitter pointed me towards skitch which made my life amazingly easy so I’d highly recommend that.

Be Sociable, Share!

Written by Mark Needham

June 21st, 2012 at 5:02 am

Posted in neo4j

Tagged with , , ,