Mark Needham

Thoughts on Software Development

Neo4j 2.0.0: Optimising a football query

without comments

A couple of months ago I wrote a blog post explaining how I’d applied Wes Freeman’s Cypher optimisation patterns to a query – since then Neo4j 2.0.0 has been released and I’ve extended the model so I thought I’d try again.

The updated model looks like this:

2014 01 31 22 25 20

The query is similar to before – I want to calculate the top away goal scorers in the 2012-2013 season. I started off with this:

MATCH (game)<-[:contains_match]-(season:Season),
      (team)<-[:away_team]-(game),
      (stats)-[:in]->(game),
      (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

When I executed this query using my Python query tuning tool and the average time was 3.31 seconds.

I separated the MATCH statements into smaller individual statements just to see what would happen:

MATCH (game)<-[:contains_match]-(season:Season)
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

That reduced the time to 178 milliseconds which is quite a nice improvement for so little effort. As I understand it, this is down to the traversal matcher handling smaller patterns more effectively than it handles longer patterns.

The next step was to move the WHERE clause up so that it filtered out rows right at the beginning of the query rather than letting them hang around for another 3 MATCH statements:

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

That took the time down to 131 milliseconds. At this stage I also tried putting the ‘MATCH (team)<-[:away_team]-(game)' line first to see what would happen.

MATCH (team)<-[:away_team]-(game)
MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, 
       COLLECT(DISTINCT team.name), 
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

I expected it to be a bit slower since we were now keeping around more games than necessary again and as expected the time rose to 172 milliseconds – slightly quicker than our first attempt at tweaking.

I reverted that change and realised that there would be a lot of ‘stats’ nodes which didn’t have any goals associated with them and would therefore have a ‘goals’ property value of 0. I tried filtering those nodes out before the ‘SUM’ part of the query:

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE stats.goals > 0
RETURN player.name,
COLLECT(DISTINCT team.name),
SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

This proved to be a very good optimisation – the time reduced to 47 milliseconds, an improvement of almost 3x on the previous optimisation and 63x quicker than the original query.

The main optimisation pattern used here was reducing the number of rows being passed through the query.

Ideally you don’t want to be passing unnecessary rows through each stage of the query – rows can be filtered out either by using more specific MATCH clauses or in this case by a WHERE clause.

Wes and I presented a webinar on Cypher query optimisation last week which so if you want to learn more about tuning queries that might be worth a watch.

Be Sociable, Share!

Written by Mark Needham

January 31st, 2014 at 10:41 pm

Posted in neo4j

Tagged with