Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ tag

Neo4j: Cypher – Creating a time tree down to the day

with 4 comments

Michael recently wrote a blog post showing how to create a time tree representing time down to the second using Neo4j’s Cypher query language, something I built on top of for a side project I’m working on.

The domain I want to model is RSVPs to meetup invites – I want to understand how far in advance people respond and how likely they are to drop out at a later stage.

For this problem I only need to measure time down to the day so my task is a bit easier than Michael’s.

After a bit of fiddling around with leap years I believe the following query will create a time tree representing all the days from 2011 – 2014, which covers the time the London Neo4j meetup has been running:

WITH range(2011, 2014) AS years, range(1,12) as months
FOREACH(year IN years | 
  MERGE (y:Year {year: year})
  FOREACH(month IN months | 
    CREATE (m:Month {month: month})
    MERGE (y)-[:HAS_MONTH]->(m)
    FOREACH(day IN (CASE 
                      WHEN month IN [1,3,5,7,8,10,12] THEN range(1,31) 
                      WHEN month = 2 THEN 
                        CASE
                          WHEN year % 4 <> 0 THEN range(1,28)
                          WHEN year % 100 <> 0 THEN range(1,29)
                          WHEN year % 400 <> 0 THEN range(1,28)
                          ELSE range(1,29)
                        END
                      ELSE range(1,30)
                    END) |      
      CREATE (d:Day {day: day})
      MERGE (m)-[:HAS_DAY]->(d))))
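The nested CASE is easy to get wrong, so it can help to check the day counts outside Cypher. The following is a minimal Java sketch, assuming the standard Gregorian leap year rule (divisible by 4, except century years that are not divisible by 400). The class and method names are made up for illustration and are not part of any Neo4j API.

```java
import java.time.YearMonth;

final class TimeTreeDays {
    // Gregorian rule: every 4th year is a leap year, except century years
    // that are not divisible by 400.
    static boolean isLeapYear(int year) {
        return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    }

    // Number of Day nodes that a (year, month) pair should have under it.
    static int daysInMonth(int year, int month) {
        switch (month) {
            case 2:
                return isLeapYear(year) ? 29 : 28;
            case 4: case 6: case 9: case 11:
                return 30;
            default:
                return 31;
        }
    }

    // Total number of Day nodes expected for an inclusive range of years.
    static int totalDays(int fromYear, int toYear) {
        int total = 0;
        for (int year = fromYear; year <= toYear; year++) {
            for (int month = 1; month <= 12; month++) {
                total += daysInMonth(year, month);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Cross-check against the JDK's own calendar for the years in the tree.
        for (int year = 2011; year <= 2014; year++) {
            for (int month = 1; month <= 12; month++) {
                if (daysInMonth(year, month) != YearMonth.of(year, month).lengthOfMonth()) {
                    throw new AssertionError(year + "-" + month);
                }
            }
        }
        System.out.println(totalDays(2011, 2014));
    }
}
```

For 2011 to 2014 the total is 365 + 366 + 365 + 365 = 1461, so on an otherwise empty database that is the number of ‘Day’ nodes I would expect to see after the query has run.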

The next step is to link adjacent days together so that we can traverse between them without needing to go back up and down the tree. For example we should have something like this:

(jan31)-[:NEXT]->(feb1)-[:NEXT]->(feb2)

We can build this by first collecting all the ‘day’ nodes in date order like so:

MATCH (year:Year)-[:HAS_MONTH]->(month)-[:HAS_DAY]->(day)
WITH year,month,day
ORDER BY year.year, month.month, day.day
WITH collect(day) as days
RETURN days

And then iterating over adjacent nodes to create the ‘NEXT’ relationship:

MATCH (year:Year)-[:HAS_MONTH]->(month)-[:HAS_DAY]->(day)
WITH year,month,day
ORDER BY year.year, month.month, day.day
WITH collect(day) as days
FOREACH(i in RANGE(0, length(days)-2) | 
    FOREACH(day1 in [days[i]] | 
        FOREACH(day2 in [days[i+1]] | 
            CREATE UNIQUE (day1)-[:NEXT]->(day2))))
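The index arithmetic in the outer FOREACH is the part worth double checking: a collection of n days has n-1 adjacent pairs, which is why the range stops at length(days)-2. Here is a small Java sketch of the same pairing, using plain strings in place of nodes (the names are illustrative only):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AdjacentPairs {
    // Returns [items[i], items[i+1]] for every i from 0 to size-2,
    // mirroring RANGE(0, length(days)-2) in the query.
    static List<String[]> pairs(List<String> items) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i <= items.size() - 2; i++) {
            result.add(new String[] {items.get(i), items.get(i + 1)});
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : pairs(Arrays.asList("jan31", "feb1", "feb2"))) {
            System.out.println("(" + pair[0] + ")-[:NEXT]->(" + pair[1] + ")");
        }
    }
}
```

With three days this yields two pairs, and with a single day it yields none, so no relationship is created in that case.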

Now if we want to find the previous 5 days from the 1st February 2014 we could write the following query:

MATCH (y:Year {year: 2014})-[:HAS_MONTH]->(m:Month {month: 2})-[:HAS_DAY]->(:Day {day: 1})<-[:NEXT*0..5]-(day)
RETURN y,m,day
[Image: screenshot of the query result]
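Note that the variable length pattern starts at 0 hops, so the result includes the 1st February itself as well as the five days before it. A quick way to see which dates to expect is to compute them with java.time. This sketch is independent of Neo4j and the class name is hypothetical:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

final class PreviousDays {
    // Walks back from 'start' between minHops and maxHops days, which is
    // what following NEXT relationships backwards does in the graph.
    static List<LocalDate> walkBack(LocalDate start, int minHops, int maxHops) {
        List<LocalDate> days = new ArrayList<>();
        for (int hops = minHops; hops <= maxHops; hops++) {
            days.add(start.minusDays(hops));
        }
        return days;
    }

    public static void main(String[] args) {
        // Equivalent of <-[:NEXT*0..5]- starting from the 1st February 2014
        System.out.println(walkBack(LocalDate.of(2014, 2, 1), 0, 5));
    }
}
```

That gives six dates, from 2014-02-01 back to 2014-01-27. Changing the pattern to ‘*1..5’ would leave out the starting day.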

If we want to, we can create the time tree and then connect the day nodes all in one query by using ‘WITH *’ like so:

WITH range(2011, 2014) AS years, range(1,12) as months
FOREACH(year IN years | 
  MERGE (y:Year {year: year})
  FOREACH(month IN months | 
    CREATE (m:Month {month: month})
    MERGE (y)-[:HAS_MONTH]->(m)
    FOREACH(day IN (CASE 
                      WHEN month IN [1,3,5,7,8,10,12] THEN range(1,31) 
                      WHEN month = 2 THEN 
                        CASE
                          WHEN year % 4 <> 0 THEN range(1,28)
                          WHEN year % 100 <> 0 THEN range(1,29)
                          WHEN year % 400 <> 0 THEN range(1,28)
                          ELSE range(1,29)
                        END
                      ELSE range(1,30)
                    END) |      
      CREATE (d:Day {day: day})
      MERGE (m)-[:HAS_DAY]->(d))))
 
WITH *
 
MATCH (year:Year)-[:HAS_MONTH]->(month)-[:HAS_DAY]->(day)
WITH year,month,day
ORDER BY year.year, month.month, day.day
WITH collect(day) as days
FOREACH(i in RANGE(0, length(days)-2) | 
    FOREACH(day1 in [days[i]] | 
        FOREACH(day2 in [days[i+1]] | 
            CREATE UNIQUE (day1)-[:NEXT]->(day2))))

Now I need to connect the RSVP events to the tree!

Written by Mark Needham

April 19th, 2014 at 9:15 pm

Posted in neo4j


Neo4j 2.0.1: Cypher – Concatenating an empty collection / Type mismatch: expected Integer, Collection<Integer> or Collection<Collection<Integer>> but was Collection<Any>

without comments

Last weekend I was playing around with some collections using Neo4j’s Cypher query language and I wanted to concatenate two collections.

This was easy enough when both collections contained values…

$ RETURN [1,2,3,4] + [5,6,7];
==> +---------------------+
==> | [1,2,3,4] + [5,6,7] |
==> +---------------------+
==> | [1,2,3,4,5,6,7]     |
==> +---------------------+
==> 1 row

…but I ended up with the following exception when I tried to concatenate with an empty collection:

$ RETURN [1,2,3,4] + [];
==> SyntaxException: Type mismatch: expected Integer, Collection<Integer> or Collection<Collection<Integer>> but was Collection<Any> (line 1, column 20)
==> "RETURN [1,2,3,4] + []"
==>                     ^

I figured there was probably some strange type coercion going on for the empty collection and came up with the following workaround using the RANGE function:

$ RETURN [1,2,3,4] + RANGE(0,-1);
==> +-------------------------+
==> | [1,2,3,4] + RANGE(0,-1) |
==> +-------------------------+
==> | [1,2,3,4]               |
==> +-------------------------+
==> 1 row

While writing this up I decided to check if it behaved the same way in the recently released 2.0.2 and was pleasantly surprised to see that the workaround is no longer necessary:

$ RETURN [1,2,3,4] + [];
==> +----------------+
==> | [1,2,3,4] + [] |
==> +----------------+
==> | [1,2,3,4]      |
==> +----------------+
==> 1 row

So if you’re seeing the same issue get yourself upgraded!

Written by Mark Needham

April 19th, 2014 at 7:51 pm

Posted in neo4j


Neo4j: Cypher – Creating relationships between a collection of nodes / Invalid input ‘[’:

without comments

When working with graphs we’ll frequently find ourselves wanting to create relationships between collections of nodes.

A common example of this would be creating a linked list of days so that we can quickly traverse across a time tree. Let’s say we start with just 3 days:

MERGE (day1:Day {day:1 })
MERGE (day2:Day {day:2 })
MERGE (day3:Day {day:3 })
RETURN day1, day2, day3

And we want to create a ‘NEXT’ relationship between adjacent days:

(day1)-[:NEXT]->(day2)-[:NEXT]->(day3)

The most obvious way to do this would be to collect the days into an ordered collection and iterate over them using FOREACH, creating a relationship between adjacent nodes:

MATCH (day:Day)
WITH day
ORDER BY day.day
WITH COLLECT(day) AS days
FOREACH(i in RANGE(0, length(days)-2) | 
  CREATE UNIQUE (days[i])-[:NEXT]->(days[i+1]))

Unfortunately this isn’t valid syntax:

Invalid input '[': expected an identifier character, node labels, a property map, whitespace, ')' or a relationship pattern (line 6, column 32)
"            CREATE UNIQUE (days[i])-[:NEXT]->(days[i+1]))"
                                ^

It doesn’t seem to like us using array indices where we specify the node identifier.

However, we can work around that by putting days[i] and days[i+1] into single item arrays and using nested FOREACH loops on those, something Michael Hunger showed me last year and I forgot all about!

MATCH (day:Day)
WITH day
ORDER BY day.day
WITH COLLECT(day) AS days
FOREACH(i in RANGE(0, length(days)-2) | 
  FOREACH(day1 in [days[i]] | 
    FOREACH(day2 in [days[i+1]] | 
      CREATE UNIQUE (day1)-[:NEXT]->(day2))))

Now if we do a query to get back all the days we’ll see they’re connected:

[Image: screenshot showing the day nodes connected by ‘NEXT’ relationships]

Written by Mark Needham

April 19th, 2014 at 6:33 am

Posted in neo4j


Neo4j 2.0.0: Query not prepared correctly / Type mismatch: expected Map

without comments

I was playing around with Neo4j’s Cypher last weekend and found myself accidentally running some queries against an earlier version of the Neo4j 2.0 series (2.0.0).

My first query started with a map and I wanted to create a person from an identifier inside the map:

WITH {person: {id: 1}} AS params
MERGE (p:Person {id: params.person.id})
RETURN p

When I ran the query I got this error:

==> SyntaxException: Type mismatch: expected Map but was Boolean, Number, String or Collection<Any> (line 1, column 62)
==> "WITH {person: {id: 1}} AS params MERGE (p:Person {id: params.person.id}) RETURN p"

If we try the same query in 2.0.1 it works as we’d expect:

==> +---------------+
==> | p             |
==> +---------------+
==> | Node[1]{id:1} |
==> +---------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 1
==> Labels added: 1
==> 47 ms

My next query was the following which links topics of interest to a person:

WITH {topics: [{name: "Java"}, {name: "Neo4j"}]} AS params
MERGE (p:Person {id: 2})
FOREACH(t IN params.topics | 
  MERGE (topic:Topic {name: t.name})
  MERGE (p)-[:INTERESTED_IN]->(topic)
)
RETURN p

In 2.0.0 that query fails like so:

==> InternalException: Query not prepared correctly!

but if we try it in 2.0.1 we’ll see that it works as well:

==> +---------------+
==> | p             |
==> +---------------+
==> | Node[4]{id:2} |
==> +---------------+
==> 1 row
==> Nodes created: 1
==> Relationships created: 2
==> Properties set: 1
==> Labels added: 1
==> 53 ms

So if you’re seeing either of those errors then get yourself upgraded to 2.0.1 as well!

Written by Mark Needham

April 13th, 2014 at 5:40 pm

Posted in neo4j


Remote profiling Neo4j using yourkit

without comments

yourkit is my favourite JVM profiling tool and whilst it’s really easy to profile a local JVM process, sometimes I need to profile a process on a remote machine.

In that case we need to first have the remote JVM started up with a yourkit agent parameter passed as one of the args to the Java program.

Since I’m mostly working with Neo4j this means we need to add the following to conf/neo4j-wrapper.conf:

wrapper.java.additional=-agentpath:/Users/markhneedham/Downloads/YourKit_Java_Profiler_2013_build_13074.app/bin/mac/libyjpagent.jnilib=port=8888

If we run lsof with the Neo4j process ID we’ll see that there’s now a socket listening on port 8888:

java    4388 markhneedham   20u    IPv6 0x901df453b4e9a125       0t0      TCP *:8888 (LISTEN)
...

We can connect to that via the ‘Monitor Remote Applications’ section of yourkit:

[Image: screenshot of the ‘Monitor Remote Applications’ section of yourkit]

In this case I’m demonstrating how to connect to it on my laptop and am using localhost but usually we’d specify the remote machine’s host name instead.

We also need to ensure that port 8888 is open on any firewalls we have in front of the machine.

The file we refer to in the ‘agentpath’ flag is a bit different depending on the operating system we’re using. All the details are on the yourkit website.

Written by Mark Needham

March 24th, 2014 at 11:44 pm

Posted in neo4j


Neo4j 2.1.0-M01: LOAD CSV with Rik Van Bruggen’s Tube Graph

without comments

Last week we released the first milestone of Neo4j 2.1.0 and one of its features is a new clause in cypher – LOAD CSV – which aims to make it easier to get data into Neo4j.

I thought I’d give it a try to import the London tube graph – something that my colleague Rik wrote about a few months ago.

I’m using the same data set as Rik but I had to tweak it a bit as there were naming differences when describing the connection from Kennington to Waterloo and Kennington to Oval. My updated version of the dataset is on github.

With the help of Alistair we now have a variation on the original which takes into account the various platforms at stations and the waiting time of a train on the platform. This will also enable us to add in things like getting from the ticket hall to the various platforms more easily.

The model looks like this:

[Image: diagram of the tube graph model]

Now we need to create a graph and the first step is to put an index on station name as we’ll be looking that up quite frequently in the queries that follow:

CREATE INDEX on :Station(stationName)

Now that’s in place we can make use of LOAD CSV. The data is very de-normalised which works out quite nicely for us and we end up with the following script:

LOAD CSV FROM "file:/Users/markhneedham/code/tube/runtimes.csv" AS csvLine
WITH csvLine[0] AS lineName, 
     csvLine[1] AS direction, 
     csvLine[2] AS startStationName,
     csvLine[3] AS destinationStationName, 
     toFloat(csvLine[4]) AS distance, 
     toFloat(csvLine[5]) AS runningTime
 
MERGE (start:Station { stationName: startStationName}) 
MERGE (destination:Station { stationName: destinationStationName}) 
MERGE (line:Line { lineName: lineName}) 
MERGE (line) - [:DIRECTION] -> (dir:Direction { direction: direction}) 
CREATE (inPlatform:InPlatform {name: "In: " + destinationStationName + " " + lineName + " " + direction})
CREATE (outPlatform:OutPlatform {name: "Out: " + startStationName + " " + lineName + " " + direction}) 
CREATE (inPlatform) - [:AT] -> (destination) 
CREATE (outPlatform) - [:AT] -> (start) 
CREATE (inPlatform) - [:ON] -> (dir) 
CREATE (outPlatform) - [:ON] -> (dir) 
CREATE (outPlatform) - [r:TRAIN {distance: distance, runningTime: runningTime}] -> (inPlatform)

This file doesn’t contain any headers so we’ll simulate them by using a WITH clause so that we don’t have index lookups all over the place. In this case we’re pointing to a file on the local file system but we could choose to point to a CSV file on the web if we wanted to.

Since stations, lines and directions appear frequently we’ll use MERGE to ensure they don’t get duplicated.
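To make the two ideas in the script concrete (naming the CSV columns once up front, and get-or-create semantics for repeated values), here is a rough Java analogue. It is only a sketch: the sample rows are invented for illustration and are not taken from the real data set, a map stands in for MERGE, and the naive comma split assumes no quoted fields.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TubeCsvSketch {
    // One parsed row: the positional CSV columns are given names exactly once.
    static final class Row {
        final String lineName;
        final String direction;
        final String startStationName;
        final String destinationStationName;
        final double distance;
        final double runningTime;

        Row(String[] csvLine) {
            lineName = csvLine[0];
            direction = csvLine[1];
            startStationName = csvLine[2];
            destinationStationName = csvLine[3];
            distance = Double.parseDouble(csvLine[4]);    // like toFloat(csvLine[4])
            runningTime = Double.parseDouble(csvLine[5]); // like toFloat(csvLine[5])
        }
    }

    // Get-or-create, standing in for MERGE: each station name gets one id,
    // no matter how many rows mention it.
    static Map<String, Integer> mergeStations(List<String> csvLines) {
        Map<String, Integer> stations = new LinkedHashMap<>();
        for (String line : csvLines) {
            Row row = new Row(line.split(","));
            stations.putIfAbsent(row.startStationName, stations.size());
            stations.putIfAbsent(row.destinationStationName, stations.size());
        }
        return stations;
    }

    public static void main(String[] args) {
        // Invented rows: line, direction, start, destination, distance, running time
        List<String> rows = Arrays.asList(
                "Line 1,Southbound,A,B,1.2,2.0",
                "Line 1,Southbound,B,C,0.8,1.5");
        System.out.println(mergeStations(rows));
    }
}
```

Station B appears in both rows but ends up in the map only once, which is the behaviour we rely on MERGE for. The platforms, by contrast, are created fresh for every row.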

After that we have a post processing step to connect the ‘in’ and ‘out’ platforms shown in the diagram.

MATCH (station:Station) <-[:AT]- (platformIn:InPlatform), 
      (station:Station) <-[:AT]- (platformOut:OutPlatform), 
      (direction:Direction) <-[:ON]- (platformIn:InPlatform), 
      (direction:Direction) <-[:ON]- (platformOut:OutPlatform) 
CREATE (platformIn) -[:WAIT {runningTime: 0.5}]-> (platformOut)

After running a few queries on the graph I realised that it wasn’t possible to combine some journeys through Kennington and Euston so I had to add some relationships in there as well:

// link the Euston stations
MATCH (euston:Station {stationName: "EUSTON"})<-[:AT]-(eustonIn:InPlatform)
MATCH (eustonCx:Station {stationName: "EUSTON (CX)"})<-[:AT]-(eustonCxIn:InPlatform)
MATCH (eustonCity:Station {stationName: "EUSTON (CITY)"})<-[:AT]-(eustonCityIn:InPlatform)
 
CREATE UNIQUE (eustonIn)-[:WAIT {runningTime: 0.0}]->(eustonCxIn)
CREATE UNIQUE (eustonIn)-[:WAIT {runningTime: 0.0}]->(eustonCityIn)
CREATE UNIQUE (eustonCxIn)-[:WAIT {runningTime: 0.0}]->(eustonCityIn)
 
// link the Kennington stations
MATCH (kenningtonCx:Station {stationName: "KENNINGTON (CX)"})<-[:AT]-(kenningtonCxIn:InPlatform)
MATCH (kenningtonCity:Station {stationName: "KENNINGTON (CITY)"})<-[:AT]-(kenningtonCityIn:InPlatform)
 
CREATE UNIQUE (kenningtonCxIn)-[:WAIT {runningTime: 0.0}]->(kenningtonCityIn)

I’ve been playing around with the A* algorithm to find the quickest route between stations based on the distances between stations.
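For anyone who hasn’t met it before, the core of such a search is short. The sketch below is purely illustrative and is not the code run against Neo4j: it is a self-contained Java version of A* with the heuristic fixed at zero (which makes it behave like Dijkstra’s algorithm), and the class, method and station names are all invented.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class QuickestRoute {
    // One entry on the search frontier.
    private static final class Step {
        final String station;
        final double costSoFar;
        final double priority; // costSoFar + heuristic estimate to the goal

        Step(String station, double costSoFar, double priority) {
            this.station = station;
            this.costSoFar = costSoFar;
            this.priority = priority;
        }
    }

    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    // Adds a directed connection with a cost (e.g. distance or running time).
    void connect(String from, String to, double cost) {
        edges.computeIfAbsent(from, key -> new HashMap<>()).put(to, cost);
    }

    // Placeholder heuristic: zero never overestimates, but gives no guidance.
    double heuristic(String station, String goal) {
        return 0.0;
    }

    // Returns the lowest total cost from start to goal, or -1 if unreachable.
    double cost(String start, String goal) {
        Map<String, Double> best = new HashMap<>();
        PriorityQueue<Step> frontier =
                new PriorityQueue<>((a, b) -> Double.compare(a.priority, b.priority));
        best.put(start, 0.0);
        frontier.add(new Step(start, 0.0, heuristic(start, goal)));
        while (!frontier.isEmpty()) {
            Step current = frontier.poll();
            if (current.station.equals(goal)) {
                return current.costSoFar;
            }
            if (current.costSoFar > best.get(current.station)) {
                continue; // a cheaper way to this station was already found
            }
            Map<String, Double> outgoing =
                    edges.getOrDefault(current.station, Collections.<String, Double>emptyMap());
            for (Map.Entry<String, Double> edge : outgoing.entrySet()) {
                double candidate = current.costSoFar + edge.getValue();
                Double known = best.get(edge.getKey());
                if (known == null || candidate < known) {
                    best.put(edge.getKey(), candidate);
                    frontier.add(new Step(edge.getKey(), candidate,
                            candidate + heuristic(edge.getKey(), goal)));
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        QuickestRoute graph = new QuickestRoute();
        graph.connect("A", "B", 2.0);
        graph.connect("B", "C", 2.0);
        graph.connect("A", "C", 5.0);
        System.out.println(graph.cost("A", "C"));
    }
}
```

In this toy graph, going from A to C via B (cost 4.0) beats the direct connection (cost 5.0). A real heuristic, such as straight line distance between stations, would need coordinates that this sketch doesn’t have.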

The next step is to put a timetable graph alongside this so we can do quickest routes at certain parts of the day and the next step after that will be to take delays into account.

If you’ve got some data you want to get into the graph give LOAD CSV a try and let us know how you get on, the cypher team are keen to get feedback on this.

Written by Mark Needham

March 3rd, 2014 at 4:34 pm

Posted in neo4j


Neo4j: Cypher – Finding directors who acted in their own movie

without comments

I’ve been doing quite a few Intro to Neo4j sessions recently and since they contain a lot of problems for the attendees to work on I get to see how first time users of Cypher actually use it.

A couple of hours in we want to write a query to find directors who acted in their own film based on the following model.

[Image: diagram of the model]

A common answer is the following:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
WHERE a.name = d.name
RETURN a

We’re matching an actor ‘a’, finding the movie they acted in and then finding the director of that movie. We now have pairs of actors and directors which we filter down by comparing their ‘name’ property.

I haven’t written SQL for a while but if my memory serves me correctly comparing properties or attributes in this way is quite a common way to test for equality.

In a graph we don’t need to compare properties – what we actually want to check is if ‘a’ and ‘d’ are the same node:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
WHERE a = d
RETURN a

We’ve simplified the query a bit but we can actually go one better by binding the director to the same identifier as the actor like so:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(a)
RETURN a

So now we’re matching an actor ‘a’, finding the movie they acted in and then finding the director if they happen to be the same person as ‘a’.

The code is now much simpler and more revealing of its intent too.
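The difference between comparing names and comparing nodes matters as soon as two different people share a name. Here is a small Java sketch of that situation, with object identity standing in for node identity. The people are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SameNodeCheck {
    static final class Person {
        final String name;

        Person(String name) {
            this.name = name;
        }
    }

    // One result row of (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
    static final class Row {
        final Person actor;
        final Person director;

        Row(Person actor, Person director) {
            this.actor = actor;
            this.director = director;
        }
    }

    // WHERE a.name = d.name : compares a property value
    static List<Person> byName(List<Row> rows) {
        List<Person> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.actor.name.equals(row.director.name)) {
                result.add(row.actor);
            }
        }
        return result;
    }

    // WHERE a = d : compares identity, i.e. "is this the same node?"
    static List<Person> bySameNode(List<Row> rows) {
        List<Person> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.actor == row.director) {
                result.add(row.actor);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Person director = new Person("Sam");
        Person namesake = new Person("Sam"); // a different person with the same name
        List<Row> rows = Arrays.asList(new Row(director, director), new Row(namesake, director));
        System.out.println(byName(rows).size());
        System.out.println(bySameNode(rows).size());
    }
}
```

The name comparison reports both rows, including the namesake who merely acted in the film, whereas the identity check reports only the director.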

Written by Mark Needham

February 28th, 2014 at 10:57 pm

Posted in neo4j


Neo4j: Cypher – Set Based Operations

without comments

I was recently reminded of a Neo4j cypher query that I wrote a couple of years ago to find the colleagues that I hadn’t worked with in the ThoughtWorks London office.

The model looked like this:

[Image: diagram of the model]

And I created the following fake data set of the aforementioned model:

public class SetBasedOperations
{
    private static final Label PERSON = DynamicLabel.label( "Person" );
    private static final Label OFFICE = DynamicLabel.label( "Office" );
 
    private static final DynamicRelationshipType COLLEAGUES = DynamicRelationshipType.withName( "COLLEAGUES" );
    private static final DynamicRelationshipType MEMBER_OF = DynamicRelationshipType.withName( "MEMBER_OF" );
 
    public static void main( String[] args ) throws IOException
    {
        Random random = new Random();
        String path = "/tmp/set-based-operations";
        FileUtils.deleteRecursively( new File( path ) );
 
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
 
        Transaction tx = db.beginTx();
        try
        {
            Node me = db.createNode( PERSON );
            me.setProperty( "name", "me" );
 
            Node londonOffice = db.createNode( OFFICE );
            londonOffice.setProperty( "name", "London Office" );
 
            me.createRelationshipTo( londonOffice, MEMBER_OF );
 
            for ( int i = 0; i < 1000; i++ )
            {
                Node colleague = db.createNode( PERSON );
                colleague.setProperty( "name", "person" + i );
 
                colleague.createRelationshipTo( londonOffice, MEMBER_OF );
 
                if(random.nextInt( 10 ) >= 8) {
                    me.createRelationshipTo( colleague, COLLEAGUES );
                }
 
            }
 
            // Mark the transaction as successful once, after the loop has finished
            tx.success();
        }
        finally
        {
            tx.finish();
        }
 
        db.shutdown();
 
        CommunityNeoServer server = CommunityServerBuilder
                .server()
                .usingDatabaseDir( path )
                .onPort( 9001 )
                .persistent()
                .build();
 
        server.start();
 
    }
}

I’ve created a node representing me and 1,000 people who work in the London office. Each of those 1,000 people has a 20% chance of having worked with me, so ~200 of them are my colleagues.

If I want to write a cypher query to find the exact number of people who haven’t worked with me I might start with the following:

MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)

We start by finding me, then find the London office which I was a member of, and then find the other people who are members of that office. On the second line we remove people that I’ve previously worked with and then return a count of the people who are left.

When I ran this through my Cypher query tuning tool the average time to evaluate this query was 7.46 seconds.

That is obviously a bit too slow if we want to run the query on a web page and as far as I can tell the reason for that is that for each potential colleague we are searching through my ‘COLLEAGUES’ relationships and checking whether they exist. We’re doing that 1,000 times which is a bit inefficient.

I chatted to David about this, and he suggested that a more efficient query would be to work out all my colleagues up front once and then do the filtering from that set of people instead.

The re-worked query looks like this:

MATCH (p:Person {name: "me"})-[:COLLEAGUES]->(colleague)
WITH p, COLLECT(colleague) as marksColleagues
MATCH (colleague)-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(p)
WHERE NOT (colleague IN marksColleagues)
RETURN COUNT(colleague)

When we run that through the query tuner the average time reduces to 150 milliseconds which is much better.

This type of query seems to be more about set operations than graph ones because we’re looking for what isn’t there rather than what is. When that’s the case getting the set of things that we want to compare against up front is more profitable.
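The same trade-off shows up in ordinary code. As a rough analogy (not a model of how Cypher actually executes either query), the first version behaves like re-scanning my colleagues for every office member, while the second builds the set once and filters against it. A Java sketch with invented names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SetBasedFilter {
    // Re-scan the colleagues list for every office member.
    static int countByScanning(List<String> officeMembers, List<String> colleagues) {
        int count = 0;
        for (String member : officeMembers) {
            boolean workedWith = false;
            for (String colleague : colleagues) {
                if (colleague.equals(member)) {
                    workedWith = true;
                    break;
                }
            }
            if (!workedWith) {
                count++;
            }
        }
        return count;
    }

    // Build the set once up front, then do a cheap membership check per member.
    static int countWithSet(List<String> officeMembers, List<String> colleagues) {
        Set<String> known = new HashSet<>(colleagues);
        int count = 0;
        for (String member : officeMembers) {
            if (!known.contains(member)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> office = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            office.add("person" + i);
        }
        List<String> colleagues = Arrays.asList("person1", "person3");
        System.out.println(countByScanning(office, colleagues));
        System.out.println(countWithSet(office, colleagues));
    }
}
```

Both methods return the same count. The difference is only in how much work is repeated per office member.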

Written by Mark Needham

February 20th, 2014 at 6:22 pm

Posted in neo4j


Neo4j: Creating nodes and relationships from a list of maps

without comments

Last week Alistair and I were porting some Neo4j cypher queries from 1.8 to 2.0 and one of the queries we had to change was an interesting one that created a bunch of relationships from a list/array of maps.

In the query we had a user ‘Mark’ and wanted to create ‘FRIENDS_WITH’ relationships to Peter and Michael.

[Image: diagram of Mark and his ‘FRIENDS_WITH’ relationships]

The application passed in a list of maps representing Peter and Michael as a parameter but if we remove the parameters the query looked like this:

MERGE (me:User {userId: 1} )
SET me.name = "Mark"
FOREACH (f IN [{userId: 2, name: "Michael"}, {userId: 3, name: "Peter"}] | 
    MERGE (u:User {userId: f.userId})
    SET u = f
    MERGE (me)-[:FRIENDS_WITH]->(u))

We first ensure that a user with a ‘userId’ of 1 exists in the database and then make sure their name is set to ‘Mark’. After we’ve done that we iterate over a list of maps containing Mark’s friends and ensure there is a ‘FRIENDS_WITH’ relationship from Mark to them.

The parameterised version of that query looks like this:

MERGE (me:User { userId: {userId} }) 
SET me.name = {name} 
FOREACH(f IN {friends} | 
    MERGE (u:User {userId: f.userId }) 
    SET u = f 
    MERGE (me)-[:FRIENDS_WITH]->(u))

We can then execute that query using Jersey:

public class ListsOfMapsCypher
{
    public static void main( String[] args )
    {
        ObjectNode request = JsonNodeFactory.instance.objectNode();
        request.put("query",
                "MERGE (me:User { userId: {userId} }) " +
                "SET me.name = {name} " +
                "FOREACH(f IN {friends} | " +
                    "MERGE (u:User {userId: f.userId }) " +
                    "SET u = f " +
                    "MERGE (me)-[:FRIENDS_WITH]->(u)) ");
 
        ObjectNode params = JsonNodeFactory.instance.objectNode();
        params.put("userId", 1);
        params.put("name", "Mark");
 
        ArrayNode friends = JsonNodeFactory.instance.arrayNode();
 
        ObjectNode friend1 = JsonNodeFactory.instance.objectNode();
        friend1.put( "userId", 2 );
        friend1.put( "name", "Michael" );
        friends.add( friend1 );
 
        ObjectNode friend2 = JsonNodeFactory.instance.objectNode();
        friend2.put( "userId", 3 );
        friend2.put( "name", "Peter" );
        friends.add( friend2 );
 
        params.put("friends", friends );
 
        request.put("params", params );
 
        ClientResponse clientResponse = client()
                .resource( "http://localhost:7474/db/data/cypher" )
                .accept( MediaType.APPLICATION_JSON )
                .entity( request, MediaType.APPLICATION_JSON_TYPE )
                .post( ClientResponse.class );
 
 
        System.out.println(clientResponse.getEntity( String.class ));
    }
 
    private static Client client()
    {
        DefaultClientConfig defaultClientConfig = new DefaultClientConfig();
        defaultClientConfig.getClasses().add(JacksonJsonProvider.class);
        return Client.create(defaultClientConfig);
    }
}

We can then write a query to check Mark and his friends were persisted:

[Image: screenshot of the query and its result]

And that’s it!

Written by Mark Needham

February 17th, 2014 at 2:11 pm

Posted in neo4j


Neo4j: Value in relationships, but value in nodes too!

without comments

I’ve recently spent a bit of time working with people on their graph models and a common pattern I’ve come across is that although the models have lots of relationships there are often missing nodes.

Emails

We’ll start with a model which represents the emails that people send between each other. A first cut might look like this:

[Image: first cut of the email model]

The problem with this approach is that we haven’t modelled the concept of an email – that’s been implicitly modelled via a relationship.

This means that if we want to indicate who was cc’d or bcc’d on the email we can’t do it. We might also want to track the replies on a thread but again we can’t do it.

A richer model that treated an email as a first class citizen would allow us to do both these things and would look like this:

[Image: richer email model with the email as a node]

We could then write queries to get the chain of emails in a thread or find all the emails that a person was cc’d in – two queries that would be much more difficult to write if we don’t have the concept of an email.
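To see why the extra node helps, the idea can be sketched as a plain Java class: once an email is an object in its own right, it has somewhere to hold the cc list and a link to the email it replies to. The field and method names below are hypothetical and only mirror the kind of model described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class EmailThread {
    static final class Email {
        final String from;
        final List<String> to;
        final List<String> cc;
        final Email inReplyTo; // null for the first email in a thread

        Email(String from, List<String> to, List<String> cc, Email inReplyTo) {
            this.from = from;
            this.to = to;
            this.cc = cc;
            this.inReplyTo = inReplyTo;
        }
    }

    // Follows the reply links back to the start of the thread, newest first.
    static List<Email> chain(Email latest) {
        List<Email> thread = new ArrayList<>();
        for (Email email = latest; email != null; email = email.inReplyTo) {
            thread.add(email);
        }
        return thread;
    }

    // Finds every email that the given person was cc'd on.
    static List<Email> ccdOn(List<Email> emails, String person) {
        List<Email> result = new ArrayList<>();
        for (Email email : emails) {
            if (email.cc.contains(person)) {
                result.add(email);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Email first = new Email("alice", Arrays.asList("bob"),
                Arrays.asList("carol"), null);
        Email reply = new Email("bob", Arrays.asList("alice"),
                Collections.<String>emptyList(), first);
        System.out.println(chain(reply).size());
        System.out.println(ccdOn(Arrays.asList(first, reply), "carol").size());
    }
}
```

With only a direct relationship between two people there is no natural place to put either the cc list or the reply link.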

Footballers and football matches

Our second example comes from my football dataset and involves modelling the matches that players participated in.

My first attempt looked like this:

[Image: first attempt at the football model]

This works reasonably well but I wanted to be able to model which team a player had represented in a match which was quite difficult with this model.

One approach would be to add a ‘team’ property to the ‘PLAYED_IN’ relationship but then we’d need to do some work at query time to work out which team node that property value referred to.

Instead I realised that I was missing the concept of a player’s performance in a match which would make some queries much easier to write. The new model looks like this:

[Image: new football model with a player’s performance in a match]

The tube

The final example is modelling the London tube although this could apply to any transport system. Our initial model of part of the Northern Line might look like this:

[Image: initial model of part of the Northern Line]

This model works really well and my colleague Rik has written a blog post showing the queries you could write against it.

However, it’s missing the concept of a platform, which means that if we want to create a routing application which takes into account the amount of time it takes to move between different platforms, we can’t do it.

If we introduce a node to represent the different platforms in a station we can introduce that type of information:

[Image: tube model with platform nodes]

In each of these examples we’ve effectively normalised our model by introducing an extra concept which means it looks more complicated.

The benefit of this approach across all three examples is that it allows us to answer more complicated questions of our data which in my experience are the really interesting questions.

As always, let me know what you think in the comments.

Written by Mark Needham

February 13th, 2014 at 12:10 am

Posted in neo4j
