Archive for the ‘graphs’ tag
Graph Processing: Calculating betweenness centrality for an undirected graph using graphstream
Since I now spend most of my time surrounded by graphs I thought it’d be interesting to learn a bit more about graph processing, a topic my colleague Jim wrote about a couple of years ago.
I like to think of the types of queries you’d do with a graph processing engine as being similar in style to graph global queries, where you take most of the nodes in a graph into account and do some sort of calculation.
One of the interesting graph global algorithms that I’ve come across recently is ‘betweenness centrality’ which allows us to work out the centrality of each node with respect to the rest of the network/graph.
[Betweenness centrality] is equal to the number of shortest paths from all vertices to all others that pass through that node. Betweenness centrality is a more useful measure (than just connectivity) of both the load and importance of a node. The former is more global to the network, whereas the latter is only a local effect.
I wanted to find a library that I could play around with to get a better understanding of how this algorithm works and I came across graphstream which seems to do the job.
I was able to get up and running by creating a Maven pom file with the following dependencies:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <packaging>jar</packaging>
    <artifactId>mygsproject</artifactId>
    <groupId>org.graphstream</groupId>
    <version>1.0-SNAPSHOT</version>
    <name>mygsproject</name>
    <description/>

    <dependencies>
        <dependency>
            <groupId>org.graphstream</groupId>
            <artifactId>gs-core</artifactId>
            <version>1.0</version>
        </dependency>
        <dependency>
            <groupId>org.graphstream</groupId>
            <artifactId>gs-algo</artifactId>
            <version>1.0</version>
        </dependency>
    </dependencies>
</project>
I wanted to find a worked example which I could use to understand how betweenness centrality is calculated and I found one on slide 14 of these lecture notes from The University of Nevada.
The example graph looks like this:
And to work out the betweenness centrality we need to take every pair of nodes and see which (if any) nodes a path between the two must pass through.
A > B: None
A > C: B
A > D: B, C
A > E: B, C, D
B > C: None
B > D: C
B > E: C, D
C > D: None
C > E: D
D > E: None
So for this example we end up with the following centrality values for each node:
A: 0
B: 3
C: 4
D: 3
E: 0
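The hand calculation can be checked with a brute-force sketch: for every pair of nodes, find the shortest path (unique in this chain-shaped graph) and give a point to each intermediate node it passes through. The edge list and BFS helper here are my own assumptions, not part of the original example:

```python
from itertools import combinations

# The worked example graph: a directed chain A -> B -> C -> D -> E
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
nodes = ["A", "B", "C", "D", "E"]

adj = {n: set() for n in nodes}
for a, b in edges:
    adj[a].add(b)  # directed, matching the hand calculation above

def shortest_path(start, end):
    # Simple BFS returning one shortest path (unique in a chain)
    queue = [[start]]
    while queue:
        path = queue.pop(0)
        if path[-1] == end:
            return path
        queue.extend(path + [n] for n in adj[path[-1]] if n not in path)
    return None

centrality = {n: 0 for n in nodes}
for a, b in combinations(nodes, 2):
    path = shortest_path(a, b)
    if path:
        for intermediate in path[1:-1]:
            centrality[intermediate] += 1

print(centrality)  # {'A': 0, 'B': 3, 'C': 4, 'D': 3, 'E': 0}
```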
If we run this through graphstream we'll see values double these because it ignores the direction of a relationship (e.g. it's possible to go from B>A even though that relationship actually only goes from A>B):
import org.graphstream.algorithm.BetweennessCentrality;
import org.graphstream.graph.Graph;
import org.graphstream.graph.Node;
import org.graphstream.graph.implementations.SingleGraph;

public class Spike {
    public static void main(String[] args) {
        Graph graph = new SingleGraph("Tutorial 1");

        Node A = graph.addNode("A");
        Node B = graph.addNode("B");
        Node E = graph.addNode("E");
        Node C = graph.addNode("C");
        Node D = graph.addNode("D");

        graph.addEdge("AB", "A", "B");
        graph.addEdge("BC", "B", "C");
        graph.addEdge("CD", "C", "D");
        graph.addEdge("DE", "D", "E");

        BetweennessCentrality bcb = new BetweennessCentrality();
        bcb.init(graph);
        bcb.compute();

        System.out.println("A=" + A.getAttribute("Cb"));
        System.out.println("B=" + B.getAttribute("Cb"));
        System.out.println("C=" + C.getAttribute("Cb"));
        System.out.println("D=" + D.getAttribute("Cb"));
        System.out.println("E=" + E.getAttribute("Cb"));
    }
}
A=0.0
B=6.0
C=8.0
D=6.0
E=0.0
What we can learn from this calculation is that in this graph node 'C' is the most influential one because most paths between other nodes have to pass through it. There's not much in it though as nodes 'B' and 'D' are close behind.
Now that I had a better understanding of how to manually execute the algorithm I thought I should try and work out what the example from the documentation would return.
The example graph looks like this:
And the paths between the nodes would be as follows:
Since I know graphstream treats the graph as being undirected, I don't respect the direction of relationships in this calculation.
A > B: Direct Path Exists
A > C: B
A > D: E
A > E: Direct Path Exists
B > A: Direct Path Exists
B > C: Direct Path Exists
B > D: E or C
B > E: Direct Path Exists
C > A: B
C > B: Direct Path Exists
C > D: Direct Path Exists
C > E: D or B
D > A: E
D > B: C or E
D > C: Direct Path Exists
D > E: Direct Path Exists
E > A: Direct Path Exists
E > B: Direct Path Exists
E > C: D or B
E > D: Direct Path Exists
For some of these there were two potential paths, so we give 1/2 a point to each node, which leads to these totals:
A: 0
B: 3
C: 1
D: 1
E: 3
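The fractional scoring can also be sketched in code: for each ordered pair of nodes, collect every shortest path and credit each intermediate node with 1 divided by the number of shortest paths for that pair. The edge list and helper below are my own sketch, assuming the undirected graph from the documentation example:

```python
from itertools import permutations

# The documentation example graph, treated as undirected
edges = [("A", "B"), ("B", "E"), ("B", "C"), ("E", "D"), ("C", "D"), ("A", "E")]
nodes = ["A", "B", "C", "D", "E"]

adj = {n: set() for n in nodes}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)  # undirected: add both directions

def all_shortest_paths(start, end):
    # BFS that collects every shortest path between start and end
    paths, queue, best = [], [[start]], None
    while queue:
        path = queue.pop(0)
        if best is not None and len(path) > best:
            break
        if path[-1] == end:
            best = len(path)
            paths.append(path)
        else:
            queue.extend(path + [n] for n in adj[path[-1]] if n not in path)
    return paths

centrality = {n: 0.0 for n in nodes}
for s, t in permutations(nodes, 2):  # ordered pairs, as in the list above
    paths = all_shortest_paths(s, t)
    for path in paths:
        for v in path[1:-1]:
            centrality[v] += 1.0 / len(paths)

print(centrality)  # {'A': 0.0, 'B': 3.0, 'C': 1.0, 'D': 1.0, 'E': 3.0}
```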
If we run that through graphstream we'd expect to see the same values:
import org.graphstream.algorithm.BetweennessCentrality;
import org.graphstream.graph.Graph;
import org.graphstream.graph.Node;
import org.graphstream.graph.implementations.SingleGraph;

public class Spike {
    public static void main(String[] args) {
        Graph graph = new SingleGraph("Tutorial 1");

        Node A = graph.addNode("A");
        Node B = graph.addNode("B");
        Node E = graph.addNode("E");
        Node C = graph.addNode("C");
        Node D = graph.addNode("D");

        graph.addEdge("AB", "A", "B");
        graph.addEdge("BE", "B", "E");
        graph.addEdge("BC", "B", "C");
        graph.addEdge("ED", "E", "D");
        graph.addEdge("CD", "C", "D");
        graph.addEdge("AE", "A", "E");

        BetweennessCentrality bcb = new BetweennessCentrality();
        bcb.init(graph);
        bcb.compute();

        System.out.println("A=" + A.getAttribute("Cb"));
        System.out.println("B=" + B.getAttribute("Cb"));
        System.out.println("C=" + C.getAttribute("Cb"));
        System.out.println("D=" + D.getAttribute("Cb"));
        System.out.println("E=" + E.getAttribute("Cb"));
    }
}
A=0.0
B=3.0
C=1.0
D=1.0
E=3.0
This library does the job for some definition of betweenness centrality but ideally I'd like to have the direction of relationships taken into account so I'm going to give it a try with one of the other libraries that I've come across.
So far the other graph processing libraries I know of are graphchi, JUNG, Green-Marl and giraph but if you know of any others that I should try out please let me know.
** Update ** (27th July 2013)
While writing another post about betweenness centrality I realised I'd made a mistake in the calculations on this post so I've corrected those now.
Bellman-Ford algorithm in Python
The latest problem of the Algorithms 2 class required us to write an algorithm to calculate the shortest path between two nodes on a graph and one algorithm which allows us to do this is Bellman-Ford.
Bellman-Ford computes the single source shortest path which means that if we have a 5 vertex graph we’d need to run it 5 times to find the shortest path for each vertex and then find the shortest paths of those shortest paths.
One nice thing about Bellman-Ford compared to Dijkstra is that it’s able to handle negative edge weights.
The pseudocode of the algorithm is as follows:
Let A = 2D array of size n * n where n is the number of vertices
Initialise A[0,s] = 0; A[0,v] = infinity for all v ≠ s where s is the node we’re finding the shortest path for

for i = 1,2,3,…,n-1
  for each v ∈ V
    A[i,v] = min { A[i-1,v],
                   min (w,v) ∈ E { A[i-1,w] + Cost_wv } }
    where (w,v) are the incoming edges of vertex v

check for negative cycles by running one more time and checking A[n] = A[n-1]
If they are equal then return A[n-1] otherwise report a negative cycle.
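To make the pseudocode concrete, here is a minimal self-contained sketch on a tiny hypothetical graph (the edge list and weights are made up for illustration):

```python
# Edges are (tail, head, weight) triples; a small made-up example
edges = [(0, 1, 2), (0, 2, 4), (1, 2, -3), (2, 3, 2)]
n = 4      # number of vertices
s = 0      # source vertex

INF = float("inf")
# A[i][v] = cost of the shortest path from s to v using at most i edges
A = [[INF] * n for _ in range(n)]
A[0][s] = 0

for i in range(1, n):
    for v in range(n):
        # Either keep the previous best, or come in via an incoming edge
        incoming = [A[i - 1][w] + cost for (w, head, cost) in edges if head == v]
        A[i][v] = min([A[i - 1][v]] + incoming)

# One more relaxation pass: any further improvement means a negative cycle
has_negative_cycle = any(
    A[n - 1][w] + cost < A[n - 1][v] for (w, v, cost) in edges
)
print(A[n - 1], has_negative_cycle)  # → [0, 2, -1, 1] False
```

Note how the path 0 → 1 → 2 (cost -1) beats the direct edge 0 → 2 (cost 4) thanks to the negative edge weight.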
My first version looked like this:
import os

file = open(os.path.dirname(os.path.realpath(__file__)) + "/g_small.txt")
vertices, edges = map(lambda x: int(x), file.readline().replace("\n", "").split(" "))

adjacency_list = [[] for k in xrange(vertices)]
for line in file.readlines():
    tail, head, weight = line.split(" ")
    adjacency_list[int(head) - 1].append({"from": int(tail), "weight": int(weight)})

s = 0
cache = [[0 for k in xrange(vertices + 1)] for j in xrange(vertices + 1)]
cache[0][s] = 0
for v in range(0, vertices):
    if v != s:
        cache[0][v] = float("inf")

for i in range(1, vertices):
    for v in range(0, vertices):
        least_adjacent_cost = calculate_least_adjacent_cost(adjacency_list, i, v, cache[i - 1])
        cache[i][v] = min(cache[i - 1][v], least_adjacent_cost)

# detecting negative cycles
for v in range(0, vertices):
    least_adjacent_cost = calculate_least_adjacent_cost(adjacency_list, i, v, cache[vertices - 1])
    cache[vertices][v] = min(cache[vertices - 1][v], least_adjacent_cost)

if not cache[vertices] == cache[vertices - 1]:
    raise Exception("negative cycle detected")

shortest_path = min(cache[vertices - 1])
print("Shortest Path: " + str(shortest_path))
where calculate_least_adjacent_cost is defined like so:
def calculate_least_adjacent_cost(adjacency_list, i, v, cache):
    adjacent_nodes = adjacency_list[v]

    least_adjacent_cost = float("inf")
    for node in adjacent_nodes:
        adjacent_cost = cache[node["from"] - 1] + node["weight"]
        if adjacent_cost < least_adjacent_cost:
            least_adjacent_cost = adjacent_cost
    return least_adjacent_cost
As you can see we’re using an adjacency list as our graph representation, mapping from each node to its corresponding edges. We then initialise the cache as per the algorithm and then have two nested for loops which we use to work out the shortest path from s to each vertex.
The calculate_least_adjacent_cost function is used to work out which of the incoming edges to a vertex has the lowest cost, taking into account previous shortest path calculations that we’ve done up to the source vertices of those incoming edges.
We then have an extra call at the end to check for negative cycles. If there is no change in the values calculated from s to each vertex then we know we don’t have any negative cycles because otherwise one of them would have come into effect and given us a different result.
This algorithm works but it’s inefficient in its use of space – our cache has size n*n when we only ever care about 2 of the rows – the previous and current ones – so we can just use a list instead.
If we do this the code now looks like this:
import copy

s = 0
cache = [0 for j in xrange(vertices + 1)]
cache[s] = 0
for v in range(0, vertices):
    if v != s:
        cache[v] = float("inf")

for i in range(1, vertices):
    previous_cache = copy.deepcopy(cache)
    for v in range(0, vertices):
        least_adjacent_cost = calculate_least_adjacent_cost(adjacency_list, i, v, previous_cache)
        cache[v] = min(previous_cache[v], least_adjacent_cost)

# detecting negative cycles
previous_cache = copy.deepcopy(cache)
for v in range(0, vertices):
    least_adjacent_cost = calculate_least_adjacent_cost(adjacency_list, i, v, previous_cache)
    cache[v] = min(previous_cache[v], least_adjacent_cost)

if not cache == previous_cache:
    raise Exception("negative cycle detected")
By doing this we lose the history of the algorithm over the run, which means in its current state we wouldn’t be able to work out what the shortest path was; we just know its cost.
For a 1000 node, 47978 edge graph it takes 27 seconds to go over the whole graph and detect a negative cycle if there is one.
The code for this algorithm is on github as always.
gephi: Centring a graph around an individual node
I spent some time recently playing around with gephi – an open source platform for creating visualisations of graphs – to get a bit more insight into the ThoughtWorks graph which I’ve created in neo4j.
I followed Max De Marzi’s blog post to create a GEXF (Graph Exchange XML Format) file to use in gephi although I later learned that you can import directly from neo4j into gephi which I haven’t tried yet.
I loaded it into gephi and this is what it looks like:
There are massive numbers of connections between almost every node so it’s pretty hard to see what’s going on.
I wanted to be able to centre the graph around an individual person and see just the links related to them.
To do that we need to make use of an ‘ego filter’ so the first step is to make the filters menu visible by clicking ‘Window > Filters’.
We can then choose the ego filter which sits under ‘Library > Topology’ and fill in the ID of the node that we want to centre the graph around as well as choose the depth of the traversal.
In this case any traversal above 1 will end up traversing the majority of the graph since the average distance between nodes in this graph is just above 2.5.
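Conceptually, the ego filter at depth 1 keeps the centre node plus its immediate neighbours, at depth 2 it also keeps their neighbours, and so on. A minimal sketch of that selection, assuming a made-up adjacency dict (the names are invented):

```python
# Hypothetical toy graph as an adjacency dict
adj = {
    "mark": {"alice", "bob"},
    "alice": {"mark", "carol"},
    "bob": {"mark"},
    "carol": {"alice"},
}

def ego_network(adj, centre, depth):
    # Keep the centre node plus everything within `depth` hops of it
    selected = {centre}
    frontier = {centre}
    for _ in range(depth):
        frontier = {n for v in frontier for n in adj[v]} - selected
        selected |= frontier
    return selected

print(sorted(ego_network(adj, "mark", 1)))  # → ['alice', 'bob', 'mark']
```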
I run the ‘Force Atlas’ layout algorithm over it and this is what we end up with:
My graph shows some interesting clustering of nodes where the ones on the right are people I worked with in India, top left are people I worked with in Australia, bottom left people in the UK and the ones towards the lower middle are people I went to ThoughtWorks University with.
There are a load of other filters to choose from but the ego filter is pretty cool!