· neo4j cypher

neo4j/cypher: Redundant relationships

Last week I was writing a query to find the top scorers in the Premier League so far this season alongside the number of games they’ve played in which initially read like this:

START player = node:players('name:*')
MATCH player-[:started|as_sub]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5
+------------------------------------+
| player.name        | games | goals |
+------------------------------------+
| "Luis Suárez"      | 30    | 22    |
| "Robin Van Persie" | 30    | 19    |
| "Gareth Bale"      | 27    | 17    |
| "Michu"            | 29    | 16    |
| "Demba Ba"         | 28    | 15    |
+------------------------------------+
5 rows
1 ms

I modelled whether a player started a game or came on as a substitute with separate relationship types 'started' and 'as_sub' but in this query we’re not interested in that, we just want to know whether they played.

In the world of relational database design we tend to try and avoid redundancy but with graphs this isn’t such a big deal so I thought I may as well add a 'played' relationship whenever a 'as_sub' or 'started' one was being created.

We can then simplify the above query to read like this:

START player = node:players('name:*')
MATCH player-[:played]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5
+------------------------------------+
| player.name        | games | goals |
+------------------------------------+
| "Luis Suárez"      | 30    | 22    |
| "Robin Van Persie" | 30    | 19    |
| "Gareth Bale"      | 27    | 17    |
| "Michu"            | 29    | 16    |
| "Demba Ba"         | 28    | 15    |
+------------------------------------+
5 rows
0 ms

When I’m querying I often forget that I modelled starting/substitute separately and think the data has screwed up and it’s always because I’ve forgotten to include the 'as_sub' relationship.

Having the 'played' relationship means that no longer happens which is cool.

I have a reasonably small data set so I haven’t seen any performance problems from creating this redundancy.

However, since the maximum number of relationships going out from a player would be unlikely to be more than 1000s I don’t think it will become one either.

As always I’d be interested in thoughts from others who have come across similar problems or can see something that I’ve missed.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket