Mark Needham

Thoughts on Software Development


Java/JBLAS: Calculating eigenvector centrality of an adjacency matrix


I recently came across a very interesting post by Kieran Healy where he runs through a bunch of graph algorithms to see whether he can detect the most influential people behind the American Revolution based on their membership of various organisations.

The first algorithm he looked at was betweenness centrality, which I’ve looked at previously and which is used to determine the load and importance of a node in a graph.

This algorithm would assign a high score to nodes which have a lot of nodes connected to them even if those nodes aren’t necessarily influential nodes in the graph.

If we want to take the influence of the other nodes into account then we can use an algorithm called eigenvector centrality.

Eigenvector centrality is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
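
In equation form (my own addition, using the standard definition rather than anything from Healy’s post) a node’s score is proportional to the sum of the scores of the nodes connected to it, which is exactly the eigenvector equation for the adjacency matrix A:

x_v = (1/λ) * Σ_u A[v][u] * x_u    (equivalently, A·x = λ·x)

That’s why we end up looking for the eigenvector that goes with the largest eigenvalue later in this post.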

Google’s PageRank is a variant of the Eigenvector centrality measure.

Both PageRank and Eigenvector centrality give us a probability which describes how often we’d end up visiting each node on a random walk around the graph.

As far as I can tell there are a couple of differences between PageRank and Eigenvector centrality (but I’m happy to be corrected as I’m still learning this stuff):

  1. PageRank introduces a ‘damping factor’ to simulate the idea that some percentage of the time we might decide not to follow any of a node’s relationships but instead pick a random node in the graph.
  2. PageRank makes sure that the elements in each column of the adjacency matrix add up to one. Therefore, if our node had a relationship to every other one in the graph then each relationship would only contribute a value of 1/n rather than 1 (there’s a sketch of this after the list).
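
To make that second difference concrete, here’s a minimal sketch (my own, not from the original post) of column-normalising an adjacency matrix with JBLAS so that each column adds up to one; the normaliseColumns helper is a made-up name for illustration:

import org.jblas.DoubleMatrix;

// Divide every element by its column's sum so that each column adds up to 1.
// A node with relationships to n other nodes then contributes 1/n per
// relationship instead of 1.
private static DoubleMatrix normaliseColumns(DoubleMatrix adjacency) {
    DoubleMatrix normalised = DoubleMatrix.zeros(adjacency.rows, adjacency.columns);
    for (int col = 0; col < adjacency.columns; col++) {
        double columnSum = 0;
        for (int row = 0; row < adjacency.rows; row++) {
            columnSum += adjacency.get(row, col);
        }
        for (int row = 0; row < adjacency.rows; row++) {
            if (columnSum != 0) {
                normalised.put(row, col, adjacency.get(row, col) / columnSum);
            }
        }
    }
    return normalised;
}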

In this instance, since Healy wanted to analyse the influence of people rather than web pages, eigenvector centrality makes more sense.

Over the past few days I’ve been trying to understand this topic area a bit better and found a few resources useful along the way.

I calculated a few made-up matrices by hand but found it became too difficult after a 3×3 matrix, so I wanted to find a Java-based library which I could use instead.

[Image: adjacency matrix]

These were the ones that I came across:

  • JBLAS – Linear Algebra for Java
  • JAMA – A Java Matrix Package
  • Colt – Advanced Computing for Science
  • Commons Math – The Apache Commons Mathematics Library
  • la4j – Linear Algebra for Java
  • MTJ – Matrix Toolkits Java

I’d heard of JBLAS before so I thought I’d give that a try on one of the adjacency matrices described in Murphy Waggoner’s post about the Gould Index and see if I got the same eigenvector centrality values.

The first step was to define the matrix which can be represented as an array of arrays:

DoubleMatrix matrix = new DoubleMatrix(new double[][] {
        {1,1,0,0,1,0,0},
        {1,1,0,0,1,0,0},
        {0,0,1,1,1,0,0},
        {0,0,1,1,1,0,0},
        {1,1,1,1,1,1,1},
        {0,0,0,0,1,1,1},
        {0,0,0,0,1,1,1},
});

Our next step is to work out the eigenvalues, which we can do using the Eigen#eigenvalues function:

ComplexDoubleMatrix eigenvalues = Eigen.eigenvalues(matrix);
for (ComplexDouble eigenvalue : eigenvalues.toArray()) {
    System.out.print(String.format("%.2f ", eigenvalue.abs()));
}
4.00 2.00 0.00 1.00 2.00 0.00 0.00

We want to get the corresponding eigenvector for the eigenvalue of 4. As far as I can tell the Eigen#eigenvectors function returns its values in the same order as the Eigen#eigenvalues function, so I wrote the following code to work out the principal eigenvector:

List<Double> principalEigenvector = getPrincipalEigenvector(matrix);
System.out.println("principalEigenvector = " + principalEigenvector);
 
private static List<Double> getPrincipalEigenvector(DoubleMatrix matrix) {
    int maxIndex = getMaxIndex(matrix);
    ComplexDoubleMatrix eigenVectors = Eigen.eigenvectors(matrix)[0];
    return getEigenVector(eigenVectors, maxIndex);
}
 
// Finds the index of the largest eigenvalue so that we can look up the
// corresponding column in the matrix of eigenvectors.
private static int getMaxIndex(DoubleMatrix matrix) {
    ComplexDouble[] eigenvalues = Eigen.eigenvalues(matrix).toArray();
    int maxIndex = 0;
    for (int i = 0; i < eigenvalues.length; i++) {
        if (eigenvalues[i].abs() > eigenvalues[maxIndex].abs()) {
            maxIndex = i;
        }
    }
    return maxIndex;
}
 
// Extracts a single column of the eigenvector matrix as a list of magnitudes.
private static List<Double> getEigenVector(ComplexDoubleMatrix eigenVectors, int columnId) {
    ComplexDoubleMatrix column = eigenVectors.getColumn(columnId);
 
    List<Double> values = new ArrayList<Double>();
    for (ComplexDouble value : column.toArray()) {
        values.add(value.abs());
    }
    return values;
}

In getMaxIndex we work out the index of the largest eigenvalue so that we can look up the corresponding eigenvector in the array we get back from Eigen#eigenvectors. According to the documentation the eigenvectors are stored in the first matrix we get back, which is why we choose that on the second line of getPrincipalEigenvector.

This is the output we get from running that:

principalEigenvector = [0.3162277660168381, 0.3162277660168376, 0.316227766016838, 0.316227766016838, 0.6324555320336759, 0.316227766016838, 0.316227766016838]
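
As a quick sanity check (my own addition, using rounded values from the output above) we can confirm this really is an eigenvector by multiplying the matrix by it and checking that we get back the same vector scaled by the eigenvalue of 4:

// Reuses the 'matrix' variable defined above. For a genuine eigenvector,
// matrix.mmul(v) should equal v scaled by the eigenvalue, i.e. 4 * v.
DoubleMatrix v = new DoubleMatrix(new double[] {0.3162, 0.3162, 0.3162, 0.3162, 0.6325, 0.3162, 0.3162});
System.out.println(matrix.mmul(v)); // roughly [1.265, 1.265, 1.265, 1.265, 2.530, 1.265, 1.265]
System.out.println(v.mul(4.0));     // the same values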

Finally we normalise the values so that they all add up to 1, which means our result will tell us the percentage of time that a random walk around the graph would take us to each node:

System.out.println("normalisedPrincipalEigenvector = " + normalised(principalEigenvector));
 
private static List<Double> normalised(List<Double> principalEigenvector) {
    double total = sum(principalEigenvector);
    List<Double> normalisedValues = new ArrayList<Double>();
    for (Double aDouble : principalEigenvector) {
        normalisedValues.add(aDouble / total);
    }
    return normalisedValues;
}
 
private static double sum(List<Double> principalEigenvector) {
    double total = 0;
    for (Double aDouble : principalEigenvector) {
        total += aDouble;
    }
    return total;
}
normalisedPrincipalEigenvector = [0.12500000000000006, 0.12499999999999988, 0.12500000000000003, 0.12500000000000003, 0.25, 0.12500000000000003, 0.12500000000000003]

We get the same answers as Murphy does so I guess the library is working correctly!

Next I think I should do some experimentation with PageRank on this graph to see how its measure of centrality differs.
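
As a rough sketch of what that experiment might look like (the pageRank helper, the choice of 0.85 as the damping factor and the iteration count are all my own assumptions, and it reuses the normaliseColumns helper sketched earlier), we could run PageRank as a power iteration over the column-normalised matrix:

// Start with an equal probability of being at any node and repeatedly take one
// step of the random walk: with probability dampingFactor we follow one of the
// current node's relationships, otherwise we jump to a random node.
private static DoubleMatrix pageRank(DoubleMatrix adjacency, double dampingFactor, int iterations) {
    int n = adjacency.columns;
    DoubleMatrix transition = normaliseColumns(adjacency);
    DoubleMatrix rank = DoubleMatrix.ones(n).div(n);
    for (int i = 0; i < iterations; i++) {
        rank = transition.mmul(rank).mul(dampingFactor)
                         .add(DoubleMatrix.ones(n).mul((1 - dampingFactor) / n));
    }
    return rank;
}

Calling something like pageRank(matrix, 0.85, 50) on the adjacency matrix above would then give us values to compare against the normalised eigenvector centrality scores.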

Written by Mark Needham

August 5th, 2013 at 10:12 pm

Posted in Graph Processing


Graph Processing: Betweenness Centrality – neo4j’s cypher vs graphstream


Last week I wrote about the betweenness centrality algorithm and my attempts to understand it using graphstream. While reading the source I realised that I might be able to put something together using neo4j’s all shortest paths algorithm.

To recap, the betweenness centrality algorithm is used to determine the load and importance of a node in a graph.

While talking about this with Jen, she pointed out that calculating the betweenness centrality of nodes across the whole graph often doesn’t make sense. However, it can be useful to know which node is the most important in a smaller subgraph that you’re interested in.

In this case I’m interested in working out the betweenness centrality of nodes in a very small directed graph with the relationships A -> B, A -> E, B -> C, B -> E, C -> D and E -> D.

Let’s briefly recap the algorithm:

[Betweenness centrality] is equal to the number of shortest paths from all vertices to all others that pass through that node.

This means that we exclude any paths which go directly between two nodes without passing through any others, something which I didn’t initially grasp.

If we work out the applicable paths by hand we end up with the following:

A -> B: Direct Path Exists
A -> C: B
A -> D: E
A -> E: Direct Path Exists
B -> A: No Path Exists
B -> C: Direct Path Exists
B -> D: E or C
B -> E: Direct Path Exists
C -> A: No Path Exists
C -> B: No Path Exists
C -> D: Direct Path Exists
C -> E: No Path Exists
D -> A: No Path Exists
D -> B: No Path Exists
D -> C: No Path Exists
D -> E: No Path Exists
E -> A: No Path Exists 
E -> B: No Path Exists
E -> C: No Path Exists
E -> D: Direct Path Exists

Each shortest path contributes 1 divided by the number of shortest paths between that pair of nodes to every node it passes through, so B -> D’s two paths give C and E 0.5 each, while A -> C and A -> D give B and E a full point each. That leaves us with the following betweenness centrality values:

A: 0
B: 1
C: 0.5
D: 0
E: 1.5

We can write a test against the latest version of graphstream (which takes direction into account) to confirm our manual calculation:

    @Test
    public void calculateBetweennessCentralityOfMySimpleGraph() {
        Graph graph = new SingleGraph("Tutorial 1");
 
        Node A = graph.addNode("A");
        Node B = graph.addNode("B");
        Node E = graph.addNode("E");
        Node C = graph.addNode("C");
        Node D = graph.addNode("D");
 
        graph.addEdge("AB", A, B, true);
        graph.addEdge("BE", B, E, true);
        graph.addEdge("BC", B, C, true);
        graph.addEdge("ED", E, D, true);
        graph.addEdge("CD", C, D, true);
        graph.addEdge("AE", A, E, true);
 
        BetweennessCentrality bcb = new BetweennessCentrality();
        bcb.computeEdgeCentrality(false);
        bcb.betweennessCentrality(graph);
 
        System.out.println("A="+ A.getAttribute("Cb"));
        System.out.println("B="+ B.getAttribute("Cb"));
        System.out.println("C="+ C.getAttribute("Cb"));
        System.out.println("D="+ D.getAttribute("Cb"));
        System.out.println("E="+ E.getAttribute("Cb"));
    }

The output is as expected:

A=0.0
B=1.0
C=0.5
D=0.0
E=1.5

I wanted to see if I could do the same thing using neo4j so I created the graph in a blank database using the following cypher statements:

CREATE (A {name: "A"})
CREATE (B {name: "B"})
CREATE (C {name: "C"})
CREATE (D {name: "D"})
CREATE (E {name: "E"})
 
CREATE A-[:TO]->E
CREATE A-[:TO]->B
CREATE B-[:TO]->C
CREATE B-[:TO]->E
CREATE C-[:TO]->D
CREATE E-[:TO]->D

I then wrote a query which found the shortest paths between all pairs of nodes in the graph:

MATCH p = allShortestPaths(source-[r:TO*]->destination) 
WHERE source <> destination
RETURN NODES(p)

If we run that it returns the following:

==> +---------------------------------------------------------+
==> | NODES(p)                                                |
==> +---------------------------------------------------------+
==> | [Node[1]{name:"A"},Node[2]{name:"B"}]                   |
==> | [Node[1]{name:"A"},Node[2]{name:"B"},Node[3]{name:"C"}] |
==> | [Node[1]{name:"A"},Node[5]{name:"E"},Node[4]{name:"D"}] |
==> | [Node[1]{name:"A"},Node[5]{name:"E"}]                   |
==> | [Node[2]{name:"B"},Node[3]{name:"C"}]                   |
==> | [Node[2]{name:"B"},Node[3]{name:"C"},Node[4]{name:"D"}] |
==> | [Node[2]{name:"B"},Node[5]{name:"E"},Node[4]{name:"D"}] |
==> | [Node[2]{name:"B"},Node[5]{name:"E"}]                   |
==> | [Node[3]{name:"C"},Node[4]{name:"D"}]                   |
==> | [Node[5]{name:"E"},Node[4]{name:"D"}]                   |
==> +---------------------------------------------------------+
==> 10 rows

We’re still returning the direct links between nodes but that’s reasonably easy to correct by filtering the results based on the number of nodes in the path:

MATCH p = allShortestPaths(source-[r:TO*]->destination) 
WHERE source <> destination  AND LENGTH(NODES(p)) > 2
RETURN EXTRACT(n IN NODES(p): n.name)
==> +--------------------------------+
==> | EXTRACT(n IN NODES(p): n.name) |
==> +--------------------------------+
==> | ["A","B","C"]                  |
==> | ["A","E","D"]                  |
==> | ["B","C","D"]                  |
==> | ["B","E","D"]                  |
==> +--------------------------------+
==> 4 rows

If we tweak the cypher query a bit we can get a collection of the shortest paths for each source/destination:

MATCH p = allShortestPaths(source-[r:TO*]->destination) 
WHERE source <> destination  AND LENGTH(NODES(p)) > 2
WITH EXTRACT(n IN NODES(p): n.name) AS nodes
RETURN HEAD(nodes) AS source, 
       HEAD(TAIL(TAIL(nodes))) AS destination, 
       COLLECT(nodes) AS paths
==> +------------------------------------------------------+
==> | source | destination | paths                         |
==> +------------------------------------------------------+
==> | "A"    | "D"         | [["A","E","D"]]               |
==> | "A"    | "C"         | [["A","B","C"]]               |
==> | "B"    | "D"         | [["B","C","D"],["B","E","D"]] |
==> +------------------------------------------------------+
==> 3 rows

When we have a way to slice collections using cypher it wouldn’t be too difficult to get from here to a betweenness centrality score for the nodes, but for now it’s much easier to use a general programming language.

In this case I used Ruby and came up with the following code:

require 'neography'
neo = Neography::Rest.new
 
query =  " MATCH p = allShortestPaths(source-[r:TO*]->destination)"
query << " WHERE source <> destination  AND LENGTH(NODES(p)) > 2"
query << " WITH EXTRACT(n IN NODES(p): n.name) AS nodes"
query << " RETURN HEAD(nodes) AS source, HEAD(TAIL(TAIL(nodes))) AS destination, COLLECT(nodes) AS paths"
 
betweenness_centrality = { "A" => 0, "B" => 0, "C" => 0, "D" => 0, "E" => 0 }
 
# For each source/destination pair, strip the first and last node from every
# path so we're left with the intermediate nodes, then give each of those
# nodes a score of 1 divided by the number of shortest paths for that pair.
neo.execute_query(query)["data"].map { |row| row[2].map { |path| path[1..-2] } }.each do |potential_central_nodes|
  number_of_shortest_paths = potential_central_nodes.size
  potential_central_nodes.each do |nodes|
    nodes.each { |node| betweenness_centrality[node] += (1.0 / number_of_shortest_paths) }
  end
end
 
p betweenness_centrality

which outputs the following:

$ bundle exec ruby centrality.rb 
{"A"=>0, "B"=>1.0, "C"=>0.5, "D"=>0, "E"=>1.5}

It seems to do the job but I’m sure there are some corner cases it doesn’t handle which a mature library would take care of. As an experiment to see what’s possible I think it’s not too bad though!

The graph is on the neo4j console in case anyone is interested in playing around with it.

Written by Mark Needham

July 27th, 2013 at 11:21 am

Posted in Graph Processing, neo4j
