Mark Needham

Thoughts on Software Development

neo4j: Extracting a subgraph as an adjacency matrix and calculating eigenvector centrality with JBLAS

with 4 comments

Earlier in the week I wrote a blog post showing how to calculate the eigenvector centrality of an adjacency matrix using JBLAS and the next step was to work out the eigenvector centrality of a neo4j sub graph.

There were 3 steps involved in doing this:

  1. Export the neo4j sub graph as an adjacency matrix
  2. Run JBLAS over it to get eigenvector centrality scores for each node
  3. Write those scores back into neo4j

I decided to make use of the Paul Revere data set from Kieran Healy’s blog post which consists of people and groups that they had membership of.

The script to import the data is on my fork of the revere repository.

Having imported the data the next step was to write a cypher query which would give me the people in an adjacency matrix with the number in each column/row intersection showing how many common groups that pair of people had.

I thought it’d be easier to build this query incrementally so I started out writing a query which would return one row of the adjacency matrix:

MATCH p1:Person, p2:Person
WHERE p1.name = "Paul Revere"
WITH p1, p2
MATCH p = p1-[?:MEMBER_OF]->()<-[?:MEMBER_OF]-p2
 
WITH p1.name AS p1, p2.name AS p2, COUNT(p) AS links
ORDER BY p2
RETURN p1, COLLECT(links) AS row

Here we start with Paul Revere and then find the relationships between him and every other person by way of a common group membership.

We use an optional relationship since we need to include a value in each column/row of our adjacency matrix we need to return a 0 value for anyone he doesn’t intersect with.

If we run that query we get back the following:

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p1            | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "Paul Revere" | [2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,1,1,1,1,1,1,3,3,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,3,2,1,1,2,1,2,1,1,1,1,1,0,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,2,1,3,1,3,2,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,0,1,0,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,2,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,3,1,1,2,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,2,1,1,1,1,1,1,1,1,3,1,1,1,1,3,1,1,1,1,0,1,2,1,1,1,1,1,1,1] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

As it turns outs we’ve only got to remove the WHERE clause and order everybody and we’ve get the adjacency matrix for everyone:

MATCH p1:Person, p2:Person
WITH p1, p2
MATCH p = p1-[?:MEMBER_OF]->()<-[?:MEMBER_OF]-p2
 
WITH p1.name AS p1, p2.name AS p2, COUNT(p) AS links
ORDER BY p2
RETURN p1, COLLECT(links) AS row
ORDER BY p1
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p1                      | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "Abiel Ruddock"         | [0,1,1,1,0,1,0,1,0,0,1,1,1,0,1,2,0,1,0,1,1,1,2,2,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,1,1,0,0,2,2,0,0,1,1,2,1,1,1,0,1,0,1,1,0,0,2,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,0,2,1,2,1,0,0,0,0,1,1,0,1,0,0,1,0,2,0,0,1,0,0,0,1,0,0,2,0,1,0,1,1,1,0,0,1,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,2,0,0,1,1,0,0,2,0,1,2,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,2,1,0,1,1,1,1,1,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,2,1,0,0,1,1,1,1,0,1,0,0,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,0,1,0,2,1,1,0,0,2,0,1,0,0,0,0,1,0,1,0,1,0,1,0] |
| "Abraham Hunt"          | [1,0,1,1,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,1,0,1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,1,1,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0] |
...
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
254 rows
9897 ms

The next step was to wire up the query results with the JBLAS code that I wrote in the previous post. I ended up with the following:

public class Neo4jAdjacencyMatrixSpike {
    public static void main(String[] args) throws SQLException {
        ClientResponse response = client()
                .resource("http://localhost:7474/db/data/cypher")
                .entity(queryAsJson(), MediaType.APPLICATION_JSON)
                .accept(MediaType.APPLICATION_JSON)
                .post(ClientResponse.class);
 
        JsonNode result = response.getEntity(JsonNode.class);
        ArrayNode rows = (ArrayNode) result.get("data");
 
        List<Double> principalEigenvector = JBLASSpike.getPrincipalEigenvector(new DoubleMatrix(asMatrix(rows)));
 
        List<Person> people = asPeople(rows);
        updatePeopleWithEigenvector(people, principalEigenvector);
 
        System.out.println(sort(people).take(10));
    }
 
    private static double[][] asMatrix(ArrayNode rows) {
        double[][] matrix = new double[rows.size()][254];
        int rowCount = 0;
 
        for (JsonNode row : rows) {
            ArrayNode matrixRow = (ArrayNode) row.get(2);
 
            double[] rowInMatrix = new double[254];
            matrix[rowCount] = rowInMatrix;
            int columnCount = 0;
            for (JsonNode jsonNode : matrixRow) {
                matrix[rowCount][columnCount] = jsonNode.asInt();
                columnCount++;
            }
 
            rowCount++;
        }
        return matrix;
    }
 
    // rest cut for brevity
}

Here we are taking the query and then converting it into an array of arrays before passing it to our JBLAS code to calculate the principal eigenvector. We then return the top 10 people:

Person{name='William Cooper', eigenvector=0.172604992239612, nodeId=68},
Person{name='Nathaniel Barber', eigenvector=0.17260499223961198, nodeId=18},
Person{name='John Hoffins', eigenvector=0.17260499223961195, nodeId=118},
Person{name='Paul Revere', eigenvector=0.17171142003936804, nodeId=207},
Person{name='Caleb Davis', eigenvector=0.16383970722169897, nodeId=71},
Person{name='Caleb Hopkins', eigenvector=0.16383970722169897, nodeId=121},
Person{name='Henry Bass', eigenvector=0.16383970722169897, nodeId=21},
Person{name='Thomas Chase', eigenvector=0.16383970722169897, nodeId=54},
Person{name='William Greenleaf', eigenvector=0.16383970722169897, nodeId=104},
Person{name='Edward Proctor', eigenvector=0.15600043886738055, nodeId=201}

I get back the same 10 people as Kieran Healy although they have different eigenvector values. As far as I understand the absolute value doesn’t matter, what’s more important is the relative score to other people so I think we’re ok.

The final step was to write these eigenvector values back into neo4j which we can do with the following code:

    private static void updateNeo4jWithEigenvectors(List<Person> people) {
        for (Person person : people) {
            ObjectNode request = JsonNodeFactory.instance.objectNode();
            request.put("query", "START p = node({nodeId}) SET p.eigenvectorCentrality={value}");
 
            ObjectNode params = JsonNodeFactory.instance.objectNode();
            params.put("nodeId", person.nodeId);
            params.put("value", person.eigenvector);
 
            request.put("params", params);
 
            client()
                    .resource("http://localhost:7474/db/data/cypher")
                    .entity(request, MediaType.APPLICATION_JSON)
                    .accept(MediaType.APPLICATION_JSON)
                    .post(ClientResponse.class);
        }
    }

Now we might use that eigenvector centrality value in other queries, such as one to show who the most central/potentially influential people are in each group:

MATCH g:Group<-[:MEMBER_OF]-p
 
WITH g.name AS group, p.name AS personName, p.eigenvectorCentrality as eigen
ORDER BY eigen DESC
 
WITH group, COLLECT(personName) AS people
RETURN group, HEAD(people) + [HEAD(TAIL(people))] + [HEAD(TAIL(TAIL(people)))] AS mostCentral
+--------------------------------------------------------------------------+
| group             | mostCentral                                          |
+--------------------------------------------------------------------------+
| "StAndrewsLodge"  | ["Paul Revere","Joseph Warren","Thomas Urann"]       |
| "BostonCommittee" | ["William Cooper","Nathaniel Barber","John Hoffins"] |
| "LoyalNine"       | ["Caleb Hopkins","William Greenleaf","Caleb Davis"]  |
| "LondonEnemies"   | ["William Cooper","Nathaniel Barber","John Hoffins"] |
| "LongRoomClub"    | ["Paul Revere","John Hancock","Benjamin Clarke"]     |
| "NorthCaucus"     | ["William Cooper","Nathaniel Barber","John Hoffins"] |
| "TeaParty"        | ["William Cooper","Nathaniel Barber","John Hoffins"] |
+--------------------------------------------------------------------------+
7 rows
280 ms

Our top ten feature frequently although it’s interesting that only one of them is in the ‘LongRoomClub’ group which perhaps indicates that people in that group are less likely to be members of the other ones.

I’d be interested if anyone can think of other potential uses for eigenvector centrality once we’ve got it back in the graph.

All the code described in this post is on github if you want to take it for a spin.

Be Sociable, Share!

Written by Mark Needham

August 11th, 2013 at 7:23 am

Posted in neo4j

Tagged with , , ,

  • Pingback: Ranking Systems: What I’ve learnt so far at Mark Needham()

  • Pingback: Neo4j 2.0: Creating adjacency matrices at Mark Needham()

  • Vojta

    Thanks! Found this pretty helpful 😉

  • Moreno Bonaventura

    Hi Mark, interesting but:

    does it scale up? not really.
    I’m sure you know you don’t need to store all those zeros of the adj matrix.
    A simple and light edge-list is much more efficient.
    You can also (you really should!!) take advantages of BLAST algo optimized for sparse matrix.

    I’m working on very large-scale neo4j db and I have to put back on nodes’ properties some meaningful network metrics such as eigenvector centrality and many more.
    The only way we found this working (without burning the server) is to get the edgelist with:
    MATCH n-[r]-m RETURN n.id,m.id
    and everything else is performed by optimized C codes ( using LAPCAK or BLAST of course).

    I would very happy to understand if the are any other scalable solution to the computation of network metrics.