# Mark Needham

Thoughts on Software Development

## Graph Processing: Betweeness Centrality – neo4j’s cypher vs graphstream

Last week I wrote about the betweenness centrality algorithm and my attempts to understand it using graphstream and while reading the source I realised that I might be able to put something together using neo4j’s all shortest paths algorithm.

To recap, the betweenness centrality algorithm is used to determine the load and importance of a node in a graph.

While talking about this with Jen she pointed out that calculating the betweenness centrality of nodes across the whole graph often doesn’t make sense. However, it can be useful to know which node is the most important in a smaller sub graph that you’re interested in.

In this case I’m interested in working out the betweenness centrality of nodes in a very small directed graph:

Let’s briefly recap the algorithm:

[Betweenness centrality] is equal to the number of shortest paths from all vertices to all others that pass through that node.

This means that we exclude any paths which go directly between two nodes without passing through any others, something which I didn’t initially grasp.

If we work out the applicable paths by hand we end up with the following:

```A -> B: Direct Path Exists A -> C: B A -> D: E A -> E: Direct Path Exists B -> A: No Path Exists B -> C: Direct Path Exists B -> D: E or C B -> E: Direct Path Exists C -> A: No Path Exists C -> B: No Path Exists C -> D: Direct Path Exists C -> E: No Path Exists D -> A: No Path Exists D -> B: No Path Exists D -> C: No Path Exists D -> E: No Path Exists E -> A: No Path Exists E -> B: No Path Exists E -> C: No Path Exists E -> D: Direct Path Exists```

Which gives the following betweenness centrality values:

```A: 0 B: 1 C: 0.5 D: 0 E: 1.5```

We can write a test against the latest version of graphstream (which takes direction into account) to confirm our manual algorithm:

``` @Test public void calculateBetweennessCentralityOfMySimpleGraph() { Graph graph = new SingleGraph("Tutorial 1");   Node A = graph.addNode("A"); Node B = graph.addNode("B"); Node E = graph.addNode("E"); Node C = graph.addNode("C"); Node D = graph.addNode("D");   graph.addEdge("AB", A, B, true); graph.addEdge("BE", B, E, true); graph.addEdge("BC", B, C, true); graph.addEdge("ED", E, D, true); graph.addEdge("CD", C, D, true); graph.addEdge("AE", A, E, true);   BetweennessCentrality bcb = new BetweennessCentrality(); bcb.computeEdgeCentrality(false); bcb.betweennessCentrality(graph);   System.out.println("A="+ A.getAttribute("Cb")); System.out.println("B="+ B.getAttribute("Cb")); System.out.println("C="+ C.getAttribute("Cb")); System.out.println("D="+ D.getAttribute("Cb")); System.out.println("E="+ E.getAttribute("Cb")); }```

The output is as expected:

```A=0.0 B=1.0 C=0.5 D=0.0 E=1.5```

I wanted to see if I could do the same thing using neo4j so I created the graph in a blank database using the following cypher statements:

```CREATE (A {name: "A"}) CREATE (B {name: "B"}) CREATE (C {name: "C"}) CREATE (D {name: "D"}) CREATE (E {name: "E"})   CREATE A-[:TO]->E CREATE A-[:TO]->B CREATE B-[:TO]->C CREATE B-[:TO]->E CREATE C-[:TO]->D CREATE E-[:TO]->D```

I then wrote a query which found the shortest path between all sets of nodes in the graph:

```MATCH p = allShortestPaths(source-[r:TO*]->destination) WHERE source <> destination RETURN NODES(p)```

If we run that it returns the following:

```==> +---------------------------------------------------------+ ==> | NODES(p) | ==> +---------------------------------------------------------+ ==> | [Node[1]{name:"A"},Node[2]{name:"B"}] | ==> | [Node[1]{name:"A"},Node[2]{name:"B"},Node[3]{name:"C"}] | ==> | [Node[1]{name:"A"},Node[5]{name:"E"},Node[4]{name:"D"}] | ==> | [Node[1]{name:"A"},Node[5]{name:"E"}] | ==> | [Node[2]{name:"B"},Node[3]{name:"C"}] | ==> | [Node[2]{name:"B"},Node[3]{name:"C"},Node[4]{name:"D"}] | ==> | [Node[2]{name:"B"},Node[5]{name:"E"},Node[4]{name:"D"}] | ==> | [Node[2]{name:"B"},Node[5]{name:"E"}] | ==> | [Node[3]{name:"C"},Node[4]{name:"D"}] | ==> | [Node[5]{name:"E"},Node[4]{name:"D"}] | ==> +---------------------------------------------------------+ ==> 10 rows```

We’re still returning the direct links between nodes but that’s reasonably easy to correct by filtering the results based on the number of nodes in the path:

```MATCH p = allShortestPaths(source-[r:TO*]->destination) WHERE source <> destination AND LENGTH(NODES(p)) > 2 RETURN EXTRACT(n IN NODES(p): n.name)```
```==> +--------------------------------+ ==> | EXTRACT(n IN NODES(p): n.name) | ==> +--------------------------------+ ==> | ["A","B","C"] | ==> | ["A","E","D"] | ==> | ["B","C","D"] | ==> | ["B","E","D"] | ==> +--------------------------------+ ==> 4 rows```

If we tweak the cypher query a bit we can get a collection of the shortest paths for each source/destination:

```MATCH p = allShortestPaths(source-[r:TO*]->destination) WHERE source <> destination AND LENGTH(NODES(p)) > 2 WITH EXTRACT(n IN NODES(p): n.name) AS nodes RETURN HEAD(nodes) AS source, HEAD(TAIL(TAIL(nodes))) AS destination, COLLECT(nodes) AS paths```
```==> +------------------------------------------------------+ ==> | source | destination | paths | ==> +------------------------------------------------------+ ==> | "A" | "D" | [["A","E","D"]] | ==> | "A" | "C" | [["A","B","C"]] | ==> | "B" | "D" | [["B","C","D"],["B","E","D"]] | ==> +------------------------------------------------------+ ==> 3 rows```

When we have a way to slice collections using cypher it wouldn’t be too difficult to get from here to a betweenness centrality score for the nodes but for now it’s much easier to use a general programming language.

In this case I used Ruby and came up with the following code:

```require 'neography' neo = Neography::Rest.new   query = " MATCH p = allShortestPaths(source-[r:TO*]->destination)" query << " WHERE source <> destination AND LENGTH(NODES(p)) > 2" query << " WITH EXTRACT(n IN NODES(p): n.name) AS nodes" query << " RETURN HEAD(nodes) AS source, HEAD(TAIL(TAIL(nodes))) AS destination, COLLECT(nodes) AS paths"   betweenness_centrality = { "A" => 0, "B" => 0, "C" => 0, "D" => 0, "E" => 0 }   neo.execute_query(query)["data"].map { |row| row[2].map { |path| path[1..-2] } }.each do |potential_central_nodes| number_of_central_nodes = potential_central_nodes.size potential_central_nodes.each do |nodes| nodes.each { |node| betweenness_centrality[node] += (1.0 / number_of_central_nodes) } end end   p betweenness_centrality```

which outputs the following:

```\$ bundle exec ruby centrality.rb {"A"=>0, "B"=>1.0, "C"=>0.5, "D"=>0, "E"=>1.5}```

It seems to do the job but I’m sure there are some corner cases it doesn’t handle which a mature library would take care of. As an experiment to see what’s possible I think it’s not too bad though!

The graph is on the neo4j console in case anyone is interested in playing around with it.

Be Sociable, Share!

Written by Mark Needham

July 27th, 2013 at 11:21 am

Posted in Graph Processing,neo4j

Tagged with

• Michael Hunger

This should calculate it all in Cypher:

``` START n=node(*) WHERE HAS (n.name) WITH collect(n.name) AS all_nodes // now we have the names of all nodes in a collection START source=node(*), destination=node(*) MATCH p = allShortestPaths(source-[r:TO*]->destination) WHERE source destination AND LENGTH(p)> 1 WITH EXTRACT(n IN NODES(p): n.name) AS nodes,all_nodes WITH COLLECT(nodes) AS paths,all_nodes // aggregated all suitable paths into a single large collection to calculate on RETURN reduce(res=[], x IN all_nodes : // for each node in the graph, find all the paths that contain it neither as first nor last and // compute the number length(paths) == number of ocurrences // and create a collection of pairs of name and count res +[x, length(filter(p IN paths : x IN tail(p) AND x last(p))) ]) ```

Results:
``` [A, 0, B, 1, C, 1, D, 0, E, 2] ```

```MATCH (n), pthroughn = allShortestPaths((a)-[*]->(b)) WHERE n in nodes(pthroughn) AND n a AND n b AND a b WITH n,a,b,count(pthroughn) as sumn MATCH p = allShortestPaths((a)-[*]->(b)) WITH n, a, b, tofloat(sumn) / tofloat(count(p)) AS fraction RETURN n, sum(fraction)```

?

• Your definition is not quite correct. The bc of x is the sum of the fractions of all shortest paths from any node a to any node b through node x and all shortest paths between a and b, whereby x != a and x != b and a != b

Still you seem to do the correct thing.

Tweaked my query from yesterday a bit, so that it only needs to call allShortestPaths once:

`MATCH (n), p=allShortestPaths(a-[*]->b)`

``` WITH n, a, b, tofloat(length(filter(p in collect(p) WHERE n IN nodes(p) AND n a AND n b))) / tofloat(length(collect(p))) AS fraction ```

`RETURN n, sum(fraction) AS betweenness ORDER BY betweenness DESC`

• The first cypher does not return the correct result of B:1 , C: 0.5 and E: 1.5, because for the example the A–>B–>E path is included, although A and E are neighbours in the original example.

WHERE a b AND length(p)> 1

should be

WHERE a b AND length(p)> 1 AND NOT (a)–>(b)

• Emmanuel Leroy

Any details on performance? Do you have any benchmark for this measure?