Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Neo4j: Cypher – Creating relationships between a collection of nodes / Invalid input ‘[‘:

without comments

When working with graphs we’ll frequently find ourselves wanting to create relationships between collections of nodes.

A common example of this would be creating a linked list of days so that we can quickly traverse across a time tree. Let’s say we start with just 3 days:

MERGE (day1:Day {day:1 })
MERGE (day2:Day {day:2 })
MERGE (day3:Day {day:3 })
RETURN day1, day2, day3

And we want to create a ‘NEXT’ relationship between adjacent days:

(day1)-[:NEXT]->(day2)-[:NEXT]->(day3)

The most obvious way to do this would be to collect the days into an ordered collection and iterate over them using FOREACH, creating a relationship between adjacent nodes:

MATCH (day:Day)
WITH day
ORDER BY day.day
WITH COLLECT(day) AS days
FOREACH(i in RANGE(0, length(days)-2) | 
  CREATE UNIQUE (days[i])-[:NEXT]->(days[i+1]))

Unfortunately this isn’t valid syntax:

Invalid input '[': expected an identifier character, node labels, a property map, whitespace, ')' or a relationship pattern (line 6, column 32)
"            CREATE UNIQUE (days[i])-[:NEXT]->(days[i+1]))"
                                ^

It doesn’t seem to like us using array indices where we specify the node identifier.

However, we can work around that by putting days[i] and days[i+1] into single item arrays and using nested FOREACH loops on those, something Michael Hunger showed me last year and I forgot all about!

MATCH (day:Day)
WITH day
ORDER BY day.day
WITH COLLECT(day) AS days
FOREACH(i in RANGE(0, length(days)-2) | 
  FOREACH(day1 in [days[i]] | 
    FOREACH(day2 in [days[i+1]] | 
      CREATE UNIQUE (day1)-[:NEXT]->(day2))))

Now if we do a query to get back all the days we’ll see they’re connected:

2014 04 19 07 32 37

Written by Mark Needham

April 19th, 2014 at 6:33 am

Posted in neo4j

Tagged with

Neo4j 2.0.0: Query not prepared correctly / Type mismatch: expected Map

without comments

I was playing around with Neo4j’s Cypher last weekend and found myself accidentally running some queries against an earlier version of the Neo4j 2.0 series (2.0.0).

My first query started with a map and I wanted to create a person from an identifier inside the map:

WITH {person: {id: 1}} AS params
MERGE (p:Person {id: params.person.id})
RETURN p

When I ran the query I got this error:

==> SyntaxException: Type mismatch: expected Map but was Boolean, Number, String or Collection<Any> (line 1, column 62)
==> "WITH {person: {id: 1}} AS params MERGE (p:Person {id: params.person.id}) RETURN p"

If we try the same query in 2.0.1 it works as we’d expect:

==> +---------------+
==> | p             |
==> +---------------+
==> | Node[1]{id:} |
==> +---------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 1
==> Labels added: 1
==> 47 ms

My next query was the following which links topics of interest to a person:

WITH {topics: [{name: "Java"}, {name: "Neo4j"}]} AS params
MERGE (p:Person {id: 2})
FOREACH(t IN params.topics | 
  MERGE (topic:Topic {name: t.name})
  MERGE (p)-[:INTERESTED_IN]->(topic)
)
RETURN p

In 2.0.0 that query fails like so:

==> InternalException: Query not prepared correctly!

but if we try it in 2.0.1 we’ll see that it works as well:

==> +---------------+
==> | p             |
==> +---------------+
==> | Node[4]{id:2} |
==> +---------------+
==> 1 row
==> Nodes created: 1
==> Relationships created: 2
==> Properties set: 1
==> Labels added: 1
==> 53 ms

So if you’re seeing either of those errors then get yourself upgraded to 2.0.1 as well!

Written by Mark Needham

April 13th, 2014 at 5:40 pm

Posted in neo4j

Tagged with

Remote profiling Neo4j using yourkit

without comments

yourkit is my favourite JVM profiling tool and whilst it’s really easy to profile a local JVM process, sometimes I need to profile a process on a remote machine.

In that case we need to first have the remote JVM started up with a yourkit agent parameter passed as one of the args to the Java program.

Since I’m mostly working with Neo4j this means we need to add the following to conf/neo4j-wrapper.conf:

wrapper.java.additional=-agentpath:/Users/markhneedham/Downloads/YourKit_Java_Profiler_2013_build_13074.app/bin/mac/libyjpagent.jnilib=port=8888

If we run lsof with the Neo4j process ID we’ll see that there’s now a socket listening on port 8888:

java    4388 markhneedham   20u    IPv6 0x901df453b4e9a125       0t0      TCP *:8888 (LISTEN)
...

We can connect to that via the ‘Monitor Remote Applications’ section of yourkit:

2014 03 24 23 39 59

In this case I’m demonstrating how to connect to it on my laptop and am using localhost but usually we’d specify the remote machine’s host name instead.

We also need to ensure that port 8888 is open on any firewalls we have in front of the machine.

The file we refer to in the ‘agentpath’ flag is a bit different depending on the operating system we’re using. All the details are on the yourkit website.

Written by Mark Needham

March 24th, 2014 at 11:44 pm

Posted in neo4j

Tagged with ,

Neo4j 2.1.0-M01: LOAD CSV with Rik Van Bruggen’s Tube Graph

without comments

Last week we released the first milestone of Neo4j 2.1.0 and one its features is a new function in cypher – LOAD CSV – which aims to make it easier to get data into Neo4j.

I thought I’d give it a try to import the London tube graph – something that my colleague Rik wrote about a few months ago.

I’m using the same data set as Rik but I had to tweak it a bit as there were naming differences when describing the connection from Kennington to Waterloo and Kennington to Oval. My updated version of the dataset is on github.

With the help of Alistair we now have a variation on the original which takes into account the various platforms at stations and the waiting time of a train on the platform. This will also enable us to add in things like getting from the ticket hall to the various platforms more easily.

The model looks like this:

2014 03 03 16 15 58

Now we need to create a graph and the first step is to put an index on station name as we’ll be looking that up quite frequently in the queries that follow:

CREATE INDEX on :Station(stationName)

Now that’s in place we can make use of LOAD CSV. The data is very de-normalised which works out quite nicely for us and we end up with the following script:

LOAD CSV FROM "file:/Users/markhneedham/code/tube/runtimes.csv" AS csvLine
WITH csvLine[0] AS lineName, 
     csvLine[1] AS direction, 
     csvLine[2] AS startStationName,
     csvLine[3] AS destinationStationName, 
     toFloat(csvLine[4]) AS distance, 
     toFloat(csvLine[5]) AS runningTime
 
MERGE (start:Station { stationName: startStationName}) 
MERGE (destination:Station { stationName: destinationStationName}) 
MERGE (line:Line { lineName: lineName}) 
MERGE (line) - [:DIRECTION] -> (dir:Direction { direction: direction}) 
CREATE (inPlatform:InPlatform {name: "In: " + destinationStationName + " " + lineName + " " + direction})
CREATE (outPlatform:OutPlatform {name: "Out: " + startStationName + " " + lineName + " " + direction}) 
CREATE (inPlatform) - [:AT] -> (destination) 
CREATE (outPlatform) - [:AT] -> (start) 
CREATE (inPlatform) - [:ON] -> (dir) 
CREATE (outPlatform) - [:ON] -> (dir) 
CREATE (outPlatform) - [r:TRAIN {distance: distance, runningTime: runningTime}] -> (inPlatform)

This file doesn’t contain any headers so we’ll simulate them by using a WITH clause so that we don’t have index lookups all over the place. In this case we’re pointing to a file on the local file system but we could choose to point to a CSV file on the web if we wanted to.

Since stations, lines and directions appear frequently we’ll use MERGE to ensure they don’t get duplicated.

After that we have a post processing step to connect the ‘in’ and ‘out’ platforms shown in the diagram.

MATCH (station:Station) <-[:AT]- (platformIn:InPlatform), 
      (station:Station) <-[:AT]- (platformOut:OutPlatform), 
      (direction:Direction) <-[:ON]- (platformIn:InPlatform), 
      (direction:Direction) <-[:ON]- (platformOut:OutPlatform) 
CREATE (platformIn) -[:WAIT {runningTime: 0.5}]-> (platformOut)

After running a few queries on the graph I realised that it wasn’t possible to combine some journies through Kennington and Euston so I had to add some relationships in there as well:

// link the Euston stations
MATCH (euston:Station {stationName: "EUSTON"})<-[:AT]-(eustonIn:InPlatform)
MATCH (eustonCx:Station {stationName: "EUSTON (CX)"})<-[:AT]-(eustonCxIn:InPlatform)
MATCH (eustonCity:Station {stationName: "EUSTON (CITY)"})<-[:AT]-(eustonCityIn:InPlatform)
 
CREATE UNIQUE (eustonIn)-[:WAIT {runningTime: 0.0}]->(eustonCxIn)
CREATE UNIQUE (eustonIn)-[:WAIT {runningTime: 0.0}]->(eustonCityIn)
CREATE UNIQUE (eustonCxIn)-[:WAIT {runningTime: 0.0}]->(eustonCityIn)
 
// link the Kennington stations
MATCH (kenningtonCx:Station {stationName: "KENNINGTON (CX)"})<-[:AT]-(kenningtonCxIn:InPlatform)
MATCH (kenningtonCity:Station {stationName: "KENNINGTON (CITY)"})<-[:AT]-(kenningtonCityIn:InPlatform)
 
CREATE UNIQUE (kenningtonCxIn)-[:WAIT {runningTime: 0.0}]->(kenningtonCityIn)

I’ve been playing around with the A* algorithm to find the quickest route between stations based on the distances between stations.

The next step is to put a timetable graph alongside this so we can do quickest routes at certain parts of the day and the next step after that will be to take delays into account.

If you’ve got some data you want to get into the graph give LOAD CSV a try and let us know how you get on, the cypher team are keen to get feedback on this.

Written by Mark Needham

March 3rd, 2014 at 4:34 pm

Posted in neo4j

Tagged with

Neo4j: Cypher – Finding directors who acted in their own movie

without comments

I’ve been doing quite a few Intro to Neo4j sessions recently and since it contains a lot of problems for the attendees to work on I get to see how first time users of Cypher actually use it.

A couple of hours in we want to write a query to find directors who acted in their own film based on the following model.

2014 02 28 22 40 02

A common answer is the following:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
WHERE a.name = d.name
RETURN a

We’re matching an actor ‘a’, finding the movie they acted in and then finding the director of that movie. We now have pairs of actors and directors which we filter down by comparing their ‘name’ property.

I haven’t written SQL for a while but if my memory serves me correctly comparing properties or attributes in this way is quite a common way to test for equality.

In a graph we don’t need to compare properties – what we actually want to check is if ‘a’ and ‘d’ are the same node:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
WHERE a = d
RETURN a

We’ve simplifed the query a bit but we can actually go one better by binding the director to the same identifier as the actor like so:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(a)
RETURN a

So now we’re matching an actor ‘a’, finding the movie they acted in and then finding the director if they happen to be the same person as ‘a’.

The code is now much simpler and more revealing of its intent too.

Written by Mark Needham

February 28th, 2014 at 10:57 pm

Posted in neo4j

Tagged with ,

Neo4j: Cypher – Set Based Operations

without comments

I was recently reminded of a Neo4j cypher query that I wrote a couple of years ago to find the colleagues that I hadn’t worked with in the ThoughtWorks London office.

The model looked like this:

2014 02 18 17 04 01

And I created the following fake data set of the aforementioned model:

public class SetBasedOperations
{
    private static final Label PERSON = DynamicLabel.label( "Person" );
    private static final Label OFFICE = DynamicLabel.label( "Office" );
 
    private static final DynamicRelationshipType COLLEAGUES = DynamicRelationshipType.withName( "COLLEAGUES" );
    private static final DynamicRelationshipType MEMBER_OF = DynamicRelationshipType.withName( "MEMBER_OF" );
 
    public static void main( String[] args ) throws IOException
    {
        Random random = new Random();
        String path = "/tmp/set-based-operations";
        FileUtils.deleteRecursively( new File( path ) );
 
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
 
        Transaction tx = db.beginTx();
        try
        {
            Node me = db.createNode( PERSON );
            me.setProperty( "name", "me" );
 
            Node londonOffice = db.createNode( OFFICE );
            londonOffice.setProperty( "name", "London Office" );
 
            me.createRelationshipTo( londonOffice, MEMBER_OF );
 
            for ( int i = 0; i < 1000; i++ )
            {
                Node colleague = db.createNode( PERSON );
                colleague.setProperty( "name", "person" + i );
 
                colleague.createRelationshipTo( londonOffice, MEMBER_OF );
 
                if(random.nextInt( 10 ) >= 8) {
                    me.createRelationshipTo( colleague, COLLEAGUES );
                }
 
                tx.success();
            }
        }
        finally
        {
            tx.finish();
        }
 
        db.shutdown();
 
        CommunityNeoServer server = CommunityServerBuilder
                .server()
                .usingDatabaseDir( path )
                .onPort( 9001 )
                .persistent()
                .build();
 
        server.start();
 
    }
}

I’ve created a node representing me and 1,000 people who work in the London office. Out of those 1,000 people I made it so that ~150 of them have worked with me.

If I want to write a cypher query to find the exact number of people who haven’t worked with me I might start with the following:

MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)

We start by finding me, then find the London office which I was a member of, and then find the other people who are members of that office. On the second line we remove people that I’ve previously worked with and then return a count of the people who are left.

When I ran this through my Cypher query tuning tool the average time to evaluate this query was 7.46 seconds.

That is obviously a bit too slow if we want to run the query on a web page and as far as I can tell the reason for that is that for each potential colleague we are searching through my ‘COLLEAGUES’ relationships and checking whether they exist. We’re doing that 1,000 times which is a bit inefficient.

I chatted to David about this, and he suggested that a more efficient query would be to work out all my colleagues up front once and then do the filtering from that set of people instead.

The re-worked query looks like this:

MATCH (p:Person {name: "me"})-[:COLLEAGUES]->(colleague)
WITH p, COLLECT(colleague) as marksColleagues
MATCH (colleague)-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(p)
WHERE NOT (colleague IN marksColleagues)
RETURN COUNT(colleague)

When we run that through the query tuner the average time reduces to 150 milliseconds which is much better.

This type of query seems to be more about set operations than graph ones because we’re looking for what isn’t there rather than what is. When that’s the case getting the set of things that we want to compare against up front is more profitable.

Written by Mark Needham

February 20th, 2014 at 6:22 pm

Posted in neo4j

Tagged with ,

Neo4j: Creating nodes and relationships from a list of maps

without comments

Last week Alistair and I were porting some Neo4j cypher queries from 1.8 to 2.0 and one of the queries we had to change was an interesting one that created a bunch of relationships from a list/array of maps.

In the query we had a user ‘Mark’ and wanted to create ‘FRIENDS_WITH’ relationships to Peter and Michael.

2014 02 17 13 39 08

The application passed in a list of maps representing Peter and Michael as a parameter but if we remove the parameters the query looked like this:

MERGE (me:User {userId: 1} )
SET me.name = "Mark"
FOREACH (f IN [{userId: 2, name: "Michael"}, {userId: 3, name: "Peter"}] | 
    MERGE (u:User {userId: f.userId})
    SET u = f
    MERGE (me)-[:FRIENDS_WITH]->(u))

We first ensure that a user with ‘id’ of 1 exists in the database and then make sure their name is set to ‘Mark’. After we’ve done that we iterate over a list of maps containing Mark’s friends and ensure there is a ‘FRIENDS_WITH’ relationship from Mark to them.

The parameterised version of that query looks like this:

MERGE (me:User { userId: {userId} }) 
SET me.name = {name} 
FOREACH(f IN {friends} | 
    MERGE (u:User {userId: f.userId }) 
    SET u = f 
    MERGE (me)-[:FRIENDS_WITH]->(u))

We can then execute that query using Jersey:

public class ListsOfMapsCypher
{
    public static void main( String[] args )
    {
        ObjectNode request = JsonNodeFactory.instance.objectNode();
        request.put("query",
                "MERGE (me:User { userId: {userId} }) " +
                "SET me.name = {name} " +
                "FOREACH(f IN {friends} | " +
                    "MERGE (u:User {userId: f.userId }) " +
                    "SET u = f " +
                    "MERGE (me)-[:FRIENDS_WITH]->(u)) ");
 
        ObjectNode params = JsonNodeFactory.instance.objectNode();
        params.put("userId", 1);
        params.put("name", "Mark");
 
        ArrayNode friends = JsonNodeFactory.instance.arrayNode();
 
        ObjectNode friend1 = JsonNodeFactory.instance.objectNode();
        friend1.put( "userId", 2 );
        friend1.put( "name", "Michael" );
        friends.add( friend1 );
 
        ObjectNode friend2 = JsonNodeFactory.instance.objectNode();
        friend2.put( "userId", 3 );
        friend2.put( "name", "Peter" );
        friends.add( friend2 );
 
        params.put("friends", friends );
 
        request.put("params", params );
 
        ClientResponse clientResponse = client()
                .resource( "http://localhost:7474/db/data/cypher" )
                .accept( MediaType.APPLICATION_JSON )
                .entity( request, MediaType.APPLICATION_JSON_TYPE )
                .post( ClientResponse.class );
 
 
        System.out.println(clientResponse.getEntity( String.class ));
    }
 
    private static Client client()
    {
        DefaultClientConfig defaultClientConfig = new DefaultClientConfig();
        defaultClientConfig.getClasses().add(JacksonJsonProvider.class);
        return Client.create(defaultClientConfig);
    }
}

We can then write a query to check Mark and his friends were persisted:

2014 02 17 14 10 12

And that’s it!

Written by Mark Needham

February 17th, 2014 at 2:11 pm

Posted in neo4j

Tagged with ,

Neo4j: Value in relationships, but value in nodes too!

without comments

I’ve recently spent a bit of time working with people on their graph commons and a common pattern I’ve come across is that although the models have lots of relationships there are often missing nodes.

Emails

We’ll start with a model which represents the emails that people send between each other. A first cut might look like this:

2014 02 12 08 30 59

The problem with this approach is that we haven’t modelled the concept of an email – that’s been implicitly modelled via a relationship.

This means that if we want to indicate who was cc’d or bcc’d on the email we can’t do it. We might also want to track the replies on a thread but again we can’t do it.

A richer model that treated an email as a first class citizen would allow us to do both these things and would look like this:

2014 02 12 23 16 02

We could then write queries to get the chain of emails in a thread or find all the emails that a person was cc’d in – two queries that would be much more difficult to write if we don’t have the concept of an email.

Footballers and football matches

Our second example come from my football dataset and involves modelling the matches that players participated in.

My first attempt looked like this:

2014 02 12 23 30 35

This works reasonably well but I wanted to be able to model which team a player had represented in a match which was quite difficult with this model.

One approach would be to add a ‘team’ property to the ‘PLAYED_IN’ relationship but then we’d need to do some work at query time to work out which team node that property value referred to.

Instead I realised that I was missing the concept of a player’s performance in a match which would make some queries much easier to write. The new model looks like this:

2014 02 12 23 37 28

The tube

The final example is modelling the London tube although this could apply to any transport system. Our initial model of part of the Northern Line might look like this:

2014 02 12 23 59 46

This model works really well and my colleague Rik has written a blog post showing the queries you could write against it.

However, it’s missing the concept of a platform which means if we want to create a routing application which takes into account the amount of time it takes to move between different

If we introduce a node to represent the different platforms in a station we can introduce that type of information:

2014 02 13 00 04 06

In each of these examples we’ve effectively normalised our model by introducing an extra concept which means it looks more complicated.

The benefit of this approach across all three examples is that it allows us to answer more complicated questions of our data which in my experience are the really interesting questions.

As always, let me know what you think in the comments.

Written by Mark Needham

February 13th, 2014 at 12:10 am

Posted in neo4j

Tagged with

Jython/Neo4j: java.lang.ExceptionInInitializerError: java.lang.ExceptionInInitializerError

without comments

I’ve been playing around with calling Neo4j’s Java API from Python via Jython and immediately ran into the following exception when trying to create an embedded instance:

$ jython -Dpython.path /path/to/neo4j.jar
Jython 2.5.3 (2.5:c56500f08d34+, Aug 13 2012, 14:48:36)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_45
Type "help", "copyright", "credits" or "license" for more information.
 
>>> import org.neo4j.graphdb.factory
>>> org.neo4j.graphdb.factory.GraphDatabaseFactory().newEmbeddedDatabase("/tmp/foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
	at org.neo4j.graphdb.factory.GraphDatabaseFactory$1.newDatabase(GraphDatabaseFactory.java:83)
	at org.neo4j.graphdb.factory.GraphDatabaseBuilder.newGraphDatabase(GraphDatabaseBuilder.java:198)
	at org.neo4j.graphdb.factory.GraphDatabaseFactory.newEmbeddedDatabase(GraphDatabaseFactory.java:69)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
 
java.lang.ExceptionInInitializerError: java.lang.ExceptionInInitializerError

I eventually ended up at GraphDatabaseSettings so I tried to import that:

>>> import org.neo4j.graphdb.factory.GraphDatabaseSettings
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
java.lang.ExceptionInInitializerError
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.python.core.Py.loadAndInitClass(Py.java:909)
	at org.python.core.Py.findClassInternal(Py.java:844)
	at org.python.core.Py.findClassEx(Py.java:895)
	at org.python.core.packagecache.SysPackageManager.findClass(SysPackageManager.java:133)
	at org.python.core.packagecache.PackageManager.findClass(PackageManager.java:28)
	at org.python.core.packagecache.SysPackageManager.findClass(SysPackageManager.java:122)
	at org.python.core.PyJavaPackage.__findattr_ex__(PyJavaPackage.java:137)
	at org.python.core.PyObject.__findattr__(PyObject.java:863)
	at org.python.core.PyObject.impAttr(PyObject.java:1001)
	at org.python.core.imp.import_next(imp.java:720)
	at org.python.core.imp.import_logic(imp.java:782)
	at org.python.core.imp.import_module_level(imp.java:842)
	at org.python.core.imp.importName(imp.java:917)
	at org.python.core.ImportFunction.__call__(__builtin__.java:1220)
	at org.python.core.PyObject.__call__(PyObject.java:357)
	at org.python.core.__builtin__.__import__(__builtin__.java:1173)
	at org.python.core.imp.importOne(imp.java:936)
	at org.python.pycode._pyx5.f$0(<stdin>:1)
	at org.python.pycode._pyx5.call_function(<stdin>)
	at org.python.core.PyTableCode.call(PyTableCode.java:165)
	at org.python.core.PyCode.call(PyCode.java:18)
	at org.python.core.Py.runCode(Py.java:1275)
	at org.python.core.Py.exec(Py.java:1319)
	at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:215)
	at org.python.util.InteractiveInterpreter.runcode(InteractiveInterpreter.java:89)
	at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:70)
	at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:46)
	at org.python.util.InteractiveConsole.push(InteractiveConsole.java:110)
	at org.python.util.InteractiveConsole.interact(InteractiveConsole.java:90)
	at org.python.util.jython.run(jython.java:317)
	at org.python.util.jython.main(jython.java:129)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
	at org.neo4j.graphdb.factory.GraphDatabaseSettings.<clinit>(GraphDatabaseSettings.java:69)
	... 33 more
 
java.lang.ExceptionInInitializerError: java.lang.ExceptionInInitializerError

This bit of code is used to work out the default cache type. It does by first reading the available cache types from a file on the class path and then picking the first one on the list.

I could see that the file it was trying to read was in the JAR and I thought passing the JAR to ‘python.path’ would put it on the classpath. Alistair suggested checking that assumption by checking what was actually on the classpath:

>>> import java.lang
>>> java.lang.System.getProperty("java.class.path")
u'/usr/local/Cellar/jython/2.5.3/libexec/jython.jar:'

As you can see our JAR is missing from the list which explains the problem with reading the file. If we launch the jython REPL with our JAR passed explicitly to the Java classpath then it works:

$ jython -J-cp /path/to/neo4j.jar
Jython 2.5.3 (2.5:c56500f08d34+, Aug 13 2012, 14:48:36)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_45
Type "help", "copyright", "credits" or "license" for more information.
 
>>> import java.lang
>>> java.lang.System.getProperty("java.class.path")
u'/usr/local/Cellar/jython/2.5.3/libexec/jython.jar:/path/to/neo4j.jar:'

We can now create an embedded database with no problems:

>>> import org.neo4j.graphdb.factory
>>> org.neo4j.graphdb.factory.GraphDatabaseFactory().newEmbeddedDatabase("/tmp/foo")
EmbeddedGraphDatabase [/tmp/foo]

Written by Mark Needham

February 5th, 2014 at 12:21 pm

Posted in neo4j,Python

Tagged with

Neo4j 2.0.0: Optimising a football query

without comments

A couple of months ago I wrote a blog post explaining how I’d applied Wes Freeman’s Cypher optimisation patterns to a query – since then Neo4j 2.0.0 has been released and I’ve extended the model so I thought I’d try again.

The updated model looks like this:

2014 01 31 22 25 20

The query is similar to before – I want to calculate the top away goal scorers in the 2012-2013 season. I started off with this:

MATCH (game)<-[:contains_match]-(season:Season),
      (team)<-[:away_team]-(game),
      (stats)-[:in]->(game),
      (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

When I executed this query using my Python query tuning tool and the average time was 3.31 seconds.

I separated the MATCH statements into smaller individual statements just to see what would happen:

MATCH (game)<-[:contains_match]-(season:Season)
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

That reduced the time to 178 milliseconds which is quite a nice improvement for so little effort. As I understand it, this is down to the traversal matcher handling smaller patterns more effectively than it handles longer patterns.

The next step was to move the WHERE clause up so that it filtered out rows right at the beginning of the query rather than letting them hang around for another 3 MATCH statements:

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

That took the time down to 131 milliseconds. At this stage I also tried putting the ‘MATCH (team)<-[:away_team]-(game)' line first to see what would happen.

MATCH (team)<-[:away_team]-(game)
MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

I expected it to be a bit slower since we were now keeping around more games than necessary again and as expected the time rose to 172 milliseconds – slightly quicker than our first attempt at tweaking.

I reverted that change and realised that there would be a lot of ‘stats’ nodes which didn’t have any goals associated with them and would therefore have a ‘goals’ property value of 0. I tried filtering those nodes out before the ‘SUM’ part of the query:

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE stats.goals > 0
RETURN player.name,
COLLECT(DISTINCT team.name),
SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

This proved to be a very good optimisation – the time reduced to 47 milliseconds, an improvement of almost 3x on the previous optimisation and 63x quicker than the original query.

The main optimisation pattern used here was reducing the number of rows being passed through the query.

Ideally you don’t want to be passing unnecessary rows through each stage of the query – rows can be filtered out either by using more specific MATCH clauses or in this case by a WHERE clause.

Wes and I presented a webinar on Cypher query optimisation last week which so if you want to learn more about tuning queries that might be worth a watch.

Written by Mark Needham

January 31st, 2014 at 10:41 pm

Posted in neo4j

Tagged with