Mark Needham

Thoughts on Software Development

neo4j/cypher: Redundant relationships

Last week I was writing a query to find the top scorers in the Premier League so far this season, alongside the number of games they’ve played in. It initially read like this:

START player = node:players('name:*')
MATCH player-[:started|as_sub]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5
+------------------------------------+
| player.name        | games | goals |
+------------------------------------+
| "Luis Suárez"      | 30    | 22    |
| "Robin Van Persie" | 30    | 19    |
| "Gareth Bale"      | 27    | 17    |
| "Michu"            | 29    | 16    |
| "Demba Ba"         | 28    | 15    |
+------------------------------------+
5 rows
1 ms

I modelled whether a player started a game or came on as a substitute with separate relationship types, ‘started’ and ‘as_sub’, but in this query we’re not interested in that distinction; we just want to know whether they played.

In relational database design we tend to try to avoid redundancy, but with graphs this isn’t such a big deal, so I thought I may as well add a ‘played’ relationship whenever an ‘as_sub’ or ‘started’ one was being created.
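
For data that has already been imported, a backfill along the following lines might do the job. This is only a sketch: it assumes the relationships point from the player to the intermediate node (the queries in this post don’t show a direction), and it uses CREATE UNIQUE so that running it more than once shouldn’t create duplicate ‘played’ relationships.

```cypher
// Sketch only: assumes 'started' and 'as_sub' point from the player
// to the intermediate node, which the original queries do not show.
START player = node:players('name:*')
MATCH player-[:started|as_sub]->playedLike
// CREATE UNIQUE only creates the relationship if it is not already there
CREATE UNIQUE player-[:played]->playedLike
```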

We can then simplify the above query to read like this:

START player = node:players('name:*')
MATCH player-[:played]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5
+------------------------------------+
| player.name        | games | goals |
+------------------------------------+
| "Luis Suárez"      | 30    | 22    |
| "Robin Van Persie" | 30    | 19    |
| "Gareth Bale"      | 27    | 17    |
| "Michu"            | 29    | 16    |
| "Demba Ba"         | 28    | 15    |
+------------------------------------+
5 rows
0 ms

When I’m querying I often forget that I modelled starting and coming on as a substitute separately. I then think the data is wrong, and it’s always because I’ve forgotten to include the ‘as_sub’ relationship.

Having the ‘played’ relationship means that no longer happens, which is cool.
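
The price of the redundancy is that the two sets of relationships can drift apart if one is created without the other. A check like the one below could flag that. It is a sketch under the same assumption as before, namely that the relationships point from the player to the intermediate node; it should return no rows when every ‘started’ or ‘as_sub’ relationship has a matching ‘played’ one.

```cypher
// Sketch only: relationship direction is an assumption.
START player = node:players('name:*')
MATCH player-[:started|as_sub]->playedLike
// keep only appearances that lack the redundant 'played' relationship
WHERE NOT(player-[:played]->playedLike)
RETURN player.name, COUNT(playedLike) AS missing
ORDER BY missing DESC
```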

I have a reasonably small data set so I haven’t seen any performance problems from creating this redundancy.

Since the number of relationships going out from a player is unlikely to reach more than a few thousand, I don’t think it will become a problem either.
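
One way to keep an eye on this would be to count the relationships attached to each player. The sketch below counts them in both directions, since the direction of the relationships isn’t shown in the queries above, so it gives an upper bound on the outgoing count rather than the exact figure.

```cypher
// Sketch only: counts relationships in either direction per player.
START player = node:players('name:*')
MATCH player-[r]-()
RETURN player.name, COUNT(r) AS relationships
ORDER BY relationships DESC
LIMIT 5
```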

As always I’d be interested in thoughts from others who have come across similar problems or can see something that I’ve missed.

Written by Mark Needham

April 16th, 2013 at 9:41 pm

Posted in neo4j

  • http://andypalmer.com Andy Palmer

    Why don’t you add from/to minutes as a property of the played relationship?
    Players who played from a minute != 0 were subbed (and you could infer who they were subbed with by the corresponding played to)

  • Mark Needham

    @andypalmer:disqus yeh I was thinking about your idea of doing that when I created the relationship but then I was thinking that maybe the substitution in itself is an interesting concept in which case you’d want to have that as a node?

    So you’d have:

    player_1-[:on]-substitution-[:off]-player_2
    match-[:had_substitution]-substitution

    So then you could store the time of the substitution on the node and work out how long a player played from that?

  • http://andypalmer.com Andy Palmer

    I’m not sure that I agree that interesting concepts are necessarily nodes. If the relationships were unable to take properties, however, I’d probably default to that too.

    Anyhow, transforming a graph from well-defined relationships to well-defined nodes is relatively straightforward.

    I like to use the visual representation to see if I’m going too far astray.

    The Neo4J cypher example has Keanu Reeves -[:acts_in]- The Matrix, where the :acts_in relationship contains the character name. I think this is a nice example of a relationship that is doing the right duties. (although, if it became important to track characters that weren’t the same actor, such as Anakin Skywalker in Star Wars I, II, and VI, then we might need to promote them to nodes)

    Are you at the Neo4J User group next week? Maybe we could pair for a bit :-)

  • Mark Needham

    @andypalmer:disqus fair enough and yep I’ll be there. Will bring my machine with the graph on it and we can hack on it, good idea!