Mark Needham

Thoughts on Software Development

neo4j/cypher: Combining COUNT and COLLECT in one query

with 6 comments

In my continued playing around with football data I wanted to write a cypher query against neo4j which would show me which teams had missed the most penalties this season and who missed them.

I started off with a query that returned all the penalties that have been missed this season and the games those misses happened in:

START player = node:players('name:*')
MATCH player-[:missed_penalty_in]-game, 
      player-[:played|subbed_on]-stats-[:in]-game,
      stats-[:for]-team,
      game-[:home_team]-home,
      game-[:away_team]-away
RETURN player.name, team.name, home.name, away.name
+-------------------------------------------------------------------------------------------------+
| player.name          | team.name              | home.name              | away.name              |
+-------------------------------------------------------------------------------------------------+
| "Papiss Demba Cisse" | "Newcastle United"     | "Newcastle United"     | "Norwich City"         |
| "Wayne Rooney"       | "Manchester United"    | "Manchester United"    | "Arsenal"              |
| "Mikel Arteta"       | "Arsenal"              | "Arsenal"              | "Fulham"               |
| "David Silva"        | "Manchester City"      | "Manchester City"      | "Southampton"          |
| "Frank Lampard"      | "Chelsea"              | "Manchester City"      | "Chelsea"              |
| "Adel Taarabt"       | "Queens Park Rangers"  | "Queens Park Rangers"  | "Norwich City"         |
| "Javier Hernández"   | "Manchester United"    | "Manchester United"    | "Wigan Athletic"       |
| "Robin Van Persie"   | "Manchester United"    | "Southampton"          | "Manchester United"    |
| "Jonathan Walters"   | "Stoke City"           | "Fulham"               | "Stoke City"           |
| "Shane Long"         | "West Bromwich Albion" | "West Bromwich Albion" | "Liverpool"            |
| "Steven Gerrard"     | "Liverpool"            | "Liverpool"            | "West Bromwich Albion" |
| "Lucas Piazon"       | "Chelsea"              | "Chelsea"              | "Aston Villa"          |
+-------------------------------------------------------------------------------------------------+
12 rows

(there should actually be another penalty miss for Jonathan Walters against Chelsea but for some reason the data source has missed it off!

I then grouped the penalty misses by team so that I’d have one row for each team and a collection showing the people who’d missed.

We can use the COLLECT function to do the latter:

START player = node:players('name:*')
MATCH player-[:missed_penalty_in]-game, 
      player-[:played|subbed_on]-stats-[:in]-game,
      stats-[:for]-team
RETURN DISTINCT team.name, COLLECT(player.name) AS players
+---------------------------------------------------------------------------------+
| team.name              | players                                                |
+---------------------------------------------------------------------------------+
| "Newcastle United"     | ["Papiss Demba Cisse"]                                 |
| "Manchester United"    | ["Wayne Rooney","Javier Hernández","Robin Van Persie"] |
| "Stoke City"           | ["Jonathan Walters"]                                   |
| "West Bromwich Albion" | ["Shane Long"]                                         |
| "Chelsea"              | ["Frank Lampard","Lucas Piazon"]                       |
| "Arsenal"              | ["Mikel Arteta"]                                       |
| "Manchester City"      | ["David Silva"]                                        |
| "Liverpool"            | ["Steven Gerrard"]                                     |
| "Queens Park Rangers"  | ["Adel Taarabt"]                                       |
+---------------------------------------------------------------------------------+
9 rows

I wanted to order the teams by the number of penalties they’d missed so Manchester United would be first in the table in this case and initially tried to order the results by a count of players:

START player = node:players('name:*')
MATCH player-[:missed_penalty_in]-game, 
      player-[:played|subbed_on]-stats-[:in]-game,
      stats-[:for]-team
RETURN DISTINCT team.name, COLLECT(player.name) AS players
ORDER BY COUNT(player.name)

which doesn’t actually compile:

SyntaxException: Aggregation expressions must be listed in the RETURN clause to be used in ORDER BY

I tried a few other variations such as the following:

START player = node:players('name:*')
MATCH player-[:missed_penalty_in]-game, 
      player-[:played|subbed_on]-stats-[:in]-game,
      stats-[:for]-team
RETURN DISTINCT team.name, COUNT(player.name) AS numberOfPlayers, 
       COLLECT(player.name) AS players
ORDER BY numberOfPlayers DESC

which again doesn’t compile:

SyntaxException: Aggregation expressions must be listed in the RETURN clause to be used in ORDER BY

I eventually found a post by Andres where he explains that you need to split the query into two and make use of WITH if you want to make use of two aggregation expressions.

I ended up with the following query which does the job:

START player = node:players('name:*')
MATCH player-[:missed_penalty_in]-game, 
      player-[:played|subbed_on]-stats-[:in]-game,
      stats-[:for]-team
WITH DISTINCT team, COLLECT(player.name) AS players
 
MATCH player-[:missed_penalty_in]-game, 
      player-[:played|subbed_on]-stats-[:in]-game,
      stats-[:for]-team
WITH DISTINCT team, COUNT(player) AS numberOfPlayers, players
 
RETURN team.name, players
ORDER BY numberOfPlayers DESC
+---------------------------------------------------------------------------------+
| team.name              | players                                                |
+---------------------------------------------------------------------------------+
| "Manchester United"    | ["Wayne Rooney","Javier Hernández","Robin Van Persie"] |
| "Chelsea"              | ["Frank Lampard","Lucas Piazon"]                       |
| "Liverpool"            | ["Steven Gerrard"]                                     |
| "Manchester City"      | ["David Silva"]                                        |
| "Newcastle United"     | ["Papiss Demba Cisse"]                                 |
| "Queens Park Rangers"  | ["Adel Taarabt"]                                       |
| "Stoke City"           | ["Jonathan Walters"]                                   |
| "Arsenal"              | ["Mikel Arteta"]                                       |
| "West Bromwich Albion" | ["Shane Long"]                                         |
+---------------------------------------------------------------------------------+
9 rows
Be Sociable, Share!

Written by Mark Needham

February 24th, 2013 at 7:19 pm

Posted in neo4j

Tagged with ,

  • Pingback: Pomodoros and the To-Do list at Mark Needham

  • Craig Brett

    A good article and even better, showing Neo4J in an original context. I have a question though, why are you using the match clause twice? As opposed to just the one match clause.
    Craig

  • Mark Needham

    @google-9bb8d264b48965789b5cad6a6a103ded:disqus I couldn’t see a way that I could use two aggregate functions (COUNT & COLLECT) within the same RETURN statement and this was the only way I came up with for doing so.

    Perhaps there’s a cleaner way where you don’t have to use two MATCH clauses but I’m not sure how to do it!

  • Pingback: neo4j/cypher: Combining COUNT and COLLECT in one query « Another Word For It

  • http://www.markhneedham.com/blog Mark Needham

    Craig Brett I emailed Andres (dude who wrote cypher) to see if what I said was correct and it would seem that I am wrong and you can in fact combine these two aggregate functions in one RETURN like so:

    START player = node:players(‘name:*’)
    MATCH player-[:missed_penalty_in]-game,
    player-[:played|subbed_on]-stats-[:in]-game,
    stats-[:for]-team
    RETURN DISTINCT team.name, COUNT(player) AS numberOfPlayers,
    COLLECT(player.name) AS players
    ORDER BY numberOfPlayers DESC

    So you need to COUNT on the node rather than a property of it for it to work properly.

  • Pingback: neo4j/cypher: Getting the hang of the WITH statement at Mark Needham