Archive for the ‘neo4j’ tag
neo4j/cypher: Keep longest path when finding taxonomy
I’ve been playing around with modelling a product taxonomy and one thing that I wanted to do was find out the full path where a product sits under the tree.
I created a simple data set to show the problem:
CREATE (cat { name: "Cat" })
CREATE (subcat1 { name: "SubCat1" })
CREATE (subcat2 { name: "SubCat2" })
CREATE (subsubcat1 { name: "SubSubCat1" })
CREATE (product1 { name: "Product1" })
CREATE (cat)-[:CHILD]-subcat1-[:CHILD]-subsubcat1
CREATE (product1)-[:HAS_CATEGORY]-(subsubcat1)I wanted to write a query which would return ‘product1′ and the tree ‘Cat -> SubCat1 -> SubSubCat1′ and initially wrote the following query:
START product=node:node_auto_index(name="Product1") MATCH product-[:HAS_CATEGORY]-category, taxonomy=category<-[:CHILD*1..]-parent RETURN product, EXTRACT(n IN NODES(taxonomy): n.name)
which returns:
==> +--------------------------------------------------------------------+
==> | product | EXTRACT(n IN NODES(taxonomy): n.name) |
==> +--------------------------------------------------------------------+
==> | Node[888]{name:"Product1"} | ["SubSubCat1","SubCat1"] |
==> | Node[888]{name:"Product1"} | ["SubSubCat1","SubCat1","Cat"] |
==> +--------------------------------------------------------------------+
==> 2 rowsI didn’t want to return the first row since that isn’t the full tree and Andres suggested that looking for nodes which didn’t have any incoming children would help me do that:
START product=node:node_auto_index(name="Product1")
MATCH product-[:HAS_CATEGORY]-category,
taxonomy=category<-[:CHILD*1..]-parent
WHERE NOT parent<-[:CHILD]-()
RETURN product, EXTRACT(n IN NODES(taxonomy): n.name)==> +--------------------------------------------------------------------+
==> | product | EXTRACT(n IN NODES(taxonomy): n.name) |
==> +--------------------------------------------------------------------+
==> | Node[888]{name:"Product1"} | ["SubSubCat1","SubCat1","Cat"] |
==> +--------------------------------------------------------------------+
==> 1 rowIf we want to reverse the taxonomy so it’s in the right order we can follow Wes Freeman’s advice from the following Stack Overflow thread:
START product=node:node_auto_index(name="Product1")
MATCH product-[:HAS_CATEGORY]-category, taxonomy=category<-[:CHILD*1..]-parent
WHERE NOT parent<-[:CHILD]-()
RETURN product,
REDUCE(acc=[], cat IN EXTRACT(n IN NODES(taxonomy): n.name): cat + acc) AS taxonomy==> +-------------------------------------------------------------+
==> | product | taxonomy |
==> +-------------------------------------------------------------+
==> | Node[888]{name:"Product1"} | ["Cat","SubCat1","SubSubCat1"] |
==> +-------------------------------------------------------------+
==> 1 rowneo4j: When the web console returns nothing…use the data browser!
In my time playing around with neo4j I’ve run into a problem a few times where I executed a query using the web console (usually accessible @ http://localhost:7474/webadmin/#/console/) and have got absolutely no response.
I noticed a similar thing today when Rickard and I were having a look at why a Lucene index query wasn’t behaving as we expected.
I setup some data in a neo4j database using neography with the following code:
require 'neography' @neo = Neography::Rest.new @neo.create_node_index("Id_Index", "exact", "lucene") node1 = @neo.create_node("Hour" => 1, "name" => "Max") node2 = @neo.create_node("Hour" => 2, "name" => "Mark") node3 = @neo.create_node("Hour" => 3, "name" => "Rickard") @neo.add_node_to_index("Id_Index", "Hour", 1, node1) @neo.add_node_to_index("Id_Index", "Hour", 2, node2) @neo.add_node_to_index("Id_Index", "Hour", 3, node3)
I then ran the following query which I was expecting to return all the nodes:
start hour=node:Id_Index("Hour:[00 TO 02] or Hour:[03 TO 05]") RETURN hourInstead it returned nothing and I couldn’t see anything being logged either.
Rickard pointed out was because the exception is only returned to the API caller and that it would be better to run the query from the Data Browser which is typically accessible from http://localhost:7474/webadmin/#/data/search/
If we run the query from there then we can see what’s going wrong:
BadInputException StackTrace: org.neo4j.server.rest.repr.RepresentationExceptionHandlingIterable.exceptionOnHasNext(RepresentationExceptionHandlingIterable.java:50) org.neo4j.helpers.collection.ExceptionHandlingIterable$1.hasNext(ExceptionHandlingIterable.java:60) org.neo4j.helpers.collection.IteratorWrapper.hasNext(IteratorWrapper.java:42) org.neo4j.server.rest.repr.ListRepresentation.serialize(ListRepresentation.java:58) org.neo4j.server.rest.repr.Serializer.serialize(Serializer.java:75) org.neo4j.server.rest.repr.MappingSerializer.putList(MappingSerializer.java:61) org.neo4j.server.rest.repr.CypherResultRepresentation.serialize(CypherResultRepresentation.java:57) org.neo4j.server.rest.repr.MappingRepresentation.serialize(MappingRepresentation.java:42) org.neo4j.server.rest.repr.OutputFormat.assemble(OutputFormat.java:179) org.neo4j.server.rest.repr.OutputFormat.formatRepresentation(OutputFormat.java:131) org.neo4j.server.rest.repr.OutputFormat.response(OutputFormat.java:117) org.neo4j.server.rest.repr.OutputFormat.ok(OutputFormat.java:55) org.neo4j.server.rest.web.CypherService.cypher(CypherService.java:94) java.lang.reflect.Method.invoke(Method.java:597)
There seemed to be some strangeness going on with how Lucene handles the query when a default search field isn’t provided but we noticed that it behaved as expected if we didn’t use an OR since Lucene has an implicit OR between statements anyway.
start hour=node:Id_Index("Hour:[00 TO 02] Hour:[03 TO 05]") RETURN hourEither way, the lesson for me was if the console isn’t giving a result run the query in the data browser to work out what’s going wrong!
neo4j/cypher: Returning a row with zero count when no relationship exists
I’ve been trying to see if I can match some of the football stats that OptaJoe posts on twitter and one that I was looking at yesterday was around the number of red cards different teams have received.
1 – Sunderland have picked up their first PL red card of the season. The only team without one now are Man Utd. Angels.
To refresh this is the sub graph that we’ll need to look at to work it out:
I started off with the following query which traverses out from each match, finds the players who were sent off in the match and then groups the sendings off by the team they were playing for:
START game = node:matches('match_id:*')
MATCH game<-[:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, COUNT(game) AS redCards
ORDER BY redCards
LIMIT 5When we run this we get the following results:
+------------------------------+ | team.name | redCards | +------------------------------+ | "Sunderland" | 1 | | "West Ham United" | 1 | | "Norwich City" | 1 | | "Reading" | 1 | | "Liverpool" | 2 | +------------------------------+ 5 rows
The problem we have here is that it hasn’t returned Manchester United because they haven’t yet received any red cards and therefore none of their players match the ‘sent_off_in’ relationship.
I ran into something similar in a post I wrote about a month ago where I was working out which day of the week players scored on.
The first step towards getting Manchester United to return with a count of 0 is to make the ‘sent_off_in’ relationship optional.
However, that on its own that isn’t enough because it now returns a count of all the player performances for each team:
START game = node:matches('match_id:*')
MATCH game<-[?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, COUNT(game) AS redCards
ORDER BY redCards ASC
LIMIT 5+-----------------------------+ | team.name | redCards | +-----------------------------+ | "Chelsea" | 448 | | "Wigan Athletic" | 459 | | "Fulham" | 460 | | "Liverpool" | 466 | | "Everton" | 467 | +-----------------------------+ 5 rows
Instead what we need to do is collect up all the ‘sent_off_in’ relationships and sum them up.
We can use the COLLECT function to do that and the neat thing about COLLECT is that it doesn’t bother collecting the empty relationships so we end up with exactly what we need:
START game = node:matches('match_id:*')
MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, COLLECT(r) AS redCards
LIMIT 5+-----------------------------------------------------------------------------------------------------+
| team.name | redCards |
+-----------------------------------------------------------------------------------------------------+
| "Wigan Athletic" | [:sent_off_in[26443] {},:sent_off_in[37785] {}] |
| "Everton" | [:sent_off_in[6795] {minute:61},:sent_off_in[21735] {},:sent_off_in[34594] {}] |
| "Newcastle United" | [:sent_off_in[434] {minute:75},:sent_off_in[32389] {},:sent_off_in[34915] {}] |
| "Southampton" | [:sent_off_in[49393] {minute:70},:sent_off_in[49392] {minute:82}] |
| "West Ham United" | [:sent_off_in[21734] {minute:67}] |
+-----------------------------------------------------------------------------------------------------+
5 rowsWe then just need to call the LENGTH function to work out how many red cards there are in each collection and then we’re done:
START game = node:matches('match_id:*')
MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, LENGTH(COLLECT(r)) AS redCards
ORDER BY redCards
LIMIT 5+--------------------------------+ | team.name | redCards | +--------------------------------+ | "Manchester United" | 0 | | "West Ham United" | 1 | | "Sunderland" | 1 | | "Norwich City" | 1 | | "Reading" | 1 | +--------------------------------+ 5 rows
neo4j/cypher: Redundant relationships
Last week I was writing a query to find the top scorers in the Premier League so far this season alongside the number of games they’ve played in which initially read like this:
START player = node:players('name:*')
MATCH player-[:started|as_sub]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5+------------------------------------+ | player.name | games | goals | +------------------------------------+ | "Luis Suárez" | 30 | 22 | | "Robin Van Persie" | 30 | 19 | | "Gareth Bale" | 27 | 17 | | "Michu" | 29 | 16 | | "Demba Ba" | 28 | 15 | +------------------------------------+ 5 rows 1 ms
I modelled whether a player started a game or came on as a substitute with separate relationship types ‘started’ and ‘as_sub’ but in this query we’re not interested in that, we just want to know whether they played.
In the world of relational database design we tend to try and avoid redundancy but with graphs this isn’t such a big deal so I thought I may as well add a ‘played’ relationship whenever a ‘as_sub’ or ‘started’ one was being created.
We can then simplify the above query to read like this:
START player = node:players('name:*')
MATCH player-[:played]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5+------------------------------------+ | player.name | games | goals | +------------------------------------+ | "Luis Suárez" | 30 | 22 | | "Robin Van Persie" | 30 | 19 | | "Gareth Bale" | 27 | 17 | | "Michu" | 29 | 16 | | "Demba Ba" | 28 | 15 | +------------------------------------+ 5 rows 0 ms
When I’m querying I often forget that I modelled starting/substitute separately and think the data has screwed up and it’s always because I’ve forgotten to include the ‘as_sub’ relationship.
Having the ‘played’ relationship means that no longer happens which is cool.
I have a reasonably small data set so I haven’t seen any performance problems from creating this redundancy.
However, since the maximum number of relationships going out from a player would be unlikely to be more than 1000s I don’t think it will become one either.
As always I’d be interested in thoughts from others who have come across similar problems or can see something that I’ve missed.
neo4j/cypher: Playing around with time
I’ve done a bit of modelling with years and months in neo4j graphs that I’ve worked on previously but I haven’t ever done anything with time so I thought it’d be interesting to have a go with my football graph.
I came across this StackOverflow post on my travels which suggested that indexing nodes by time would be helpful and since I have a bunch of football matches with associated times I thought I’d try it out.
I created the key of the index by running code similar to the following:
> DateTime.now.strftime("%H%M") => "2200"
We can then write a query to show all the games at a certain time of day:
START game=node:times('time:*')
RETURN game.time, COUNT(game)
ORDER BY game.time+-------------------------+ | game.time | COUNT(game) | +-------------------------+ | 1245 | 21 | | 1330 | 21 | | 1500 | 163 | | 1600 | 29 | | 1730 | 22 | | 1945 | 21 | | 2000 | 19 | +-------------------------+ 7 rows
To be fair any index that referenced all the matches would allow us to do this. e.g.
START game=node:matches('match_id:*')
RETURN game.time, COUNT(game)
ORDER BY game.timeThe time based indexing becomes more interesting when we use Lucene’s numeric range query syntax to only select matches which happened between certain times of day:
START game=node:times('time:[1600 TO 2000]')
RETURN game.time, COUNT(game)
ORDER BY game.time+-------------------------+ | game.time | COUNT(game) | +-------------------------+ | 1600 | 29 | | 1730 | 22 | | 1945 | 21 | | 2000 | 19 | +-------------------------+ 4 rows
I couldn’t see a way to set an open ended value either side of the ‘TO’ so if we want to do that we just need to set a really high maximum value or really low minimum value.
For example if we want to find all the evening matches we could use this query:
START game=node:times('time:[1730 TO 2359]')
RETURN game.time, COUNT(game)
ORDER BY game.time+-------------------------+ | game.time | COUNT(game) | +-------------------------+ | 1730 | 22 | | 1945 | 21 | | 2000 | 19 | +-------------------------+ 3 rows
I also indexed each match by its full timestamp so we could find all the evening games this year if we wanted as well:
> Time.new(2013,1,1).to_i => 1356998400
START game=node:times('time:[1730 TO 2359] AND date: [1356998400 TO 9999999999]')
RETURN game.time, game.name, game.friendly_date+------------------------------------------------------------------------------------+ | game.time | game.name | game.friendly_date | +------------------------------------------------------------------------------------+ | 1730 | "Wigan Athletic vs Liverpool" | "2013-03-02 17:30:00 +0000" | | 2000 | "Aston Villa vs Manchester City" | "2013-03-04 20:00:00 +0000" | | 2000 | "West Ham United vs Tottenham Hotspur" | "2013-02-25 20:00:00 +0000" | | 2000 | "Liverpool vs West Bromwich Albion" | "2013-02-11 20:00:00 +0000" | | 1730 | "Fulham vs Manchester United" | "2013-02-02 17:30:00 +0000" | ... | 1945 | "Chelsea vs Southampton" | "2013-01-16 19:45:00 +0000" | +------------------------------------------------------------------------------------+ 25 rows
neo4j/cypher: CypherTypeException: Failed merging Number with Relationship
The latest thing that I added to my football graph was the matches that are shown on TV as I have the belief that players who score on televised games get more attention than players who score in other games.
I thought it’d be interesting to work out who the top scorers are on each of these game types.
I added the following relationship type to allow me to do this:
game-[:on_tv]-channel
I then wrote a query to get a list of all the players along with a collection of the games they played in and whether or not this game was televised:
START player=node:players('name:*')
MATCH player-[:played|subbed_on]-stats-[:in]-game-[t?:on_tv]-channel
RETURN player.name, COLLECT([stats.goals, t]) AS games
LIMIT 10Unfortunately when I ran this query I ended up with the following exception:
CypherTypeException: Failed merging Number with Relationship
From some previous conversations with Wes I’d noticed that this exception didn’t seem to happen with the 1.9.M05 release but I was using 1.9.M04.
Just to see what happened I tried returning the type of the relationship in the collection literal rather than the relationship and that worked:
START player=node:players('name:*')
MATCH player-[:played|subbed_on]-stats-[:in]-game-[t?:on_tv]-channel
RETURN player.name, COLLECT([stats.goals, TYPE(t)]) AS games
LIMIT 10+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | player.name | games | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | "Djibril Cissé" | [[0,<null>],[1,"on_tv"],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[1,<null>],[0,"on_tv"],[0,"on_tv"],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[1,<null>]] | | "Markus Rosenberg" | [[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,"on_tv"],[0,"on_tv"],[0,<null>],[0,"on_tv"],[0,<null>]] | | "Gabriel Agbonlahor" | [[0,"on_tv"],[1,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[0,"on_tv"],[1,<null>],[0,"on_tv"],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[1,"on_tv"],[1,<null>],[0,<null>],[0,"on_tv"],[1,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>]] | | "Shaun Derry" | [[0,<null>],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,"on_tv"]] | | "Marouane Fellaini" | [[0,<null>],[0,"on_tv"],[0,<null>],[1,<null>],[1,"on_tv"],[1,<null>],[0,<null>],[0,"on_tv"],[1,"on_tv"],[0,<null>],[0,"on_tv"],[2,<null>],[1,<null>],[1,<null>],[1,<null>],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[2,<null>],[0,"on_tv"],[0,<null>],[0,"on_tv"]] | | "Jermaine Jenas" | [[0,<null>],[1,<null>],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[1,<null>]] | | "Sean Morrison" | [[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[1,<null>],[0,"on_tv"],[1,"on_tv"],[0,<null>],[0,"on_tv"]] | | "Claudio Yacob" | [[0,"on_tv"],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,"on_tv"],[0,<null>],[0,<null>],[0,"on_tv"],[0,"on_tv"],[0,"on_tv"],[0,<null>],[0,<null>],[0,<null>],[0,<null>],[0,<null>]] | | "Michael Owen" | [[0,<null>],[0,<null>],[0,<null>],[0,<null>],[1,<null>],[0,<null>]] | | "Tony Hibbert" | [[0,"on_tv"],[0,"on_tv"],[0,<null>],[0,<null>],[0,<null>]] | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 10 rows
The final query ends up looking like this out of interest:
START player=node:players('name:*')
MATCH player-[:played|subbed_on]-stats-[:in]-game-[t?:on_tv]-channel
WITH COLLECT([stats.goals, TYPE(t)]) AS games, player
RETURN player.name,
REDUCE(goals = 0, h IN FILTER(g IN games : HEAD(TAIL(g)) IS NULL): goals + HEAD(h)) AS nonTvGoals,
REDUCE(goals = 0, h IN FILTER(g IN games : HEAD(TAIL(g)) <> NULL): goals + HEAD(h)) AS tvGoals,
REDUCE(goals = 0, h in games : goals + HEAD(h)) AS totalGoals
ORDER BY tvGoals DESC
LIMIT 10+--------------------------------------------------------+ | player.name | nonTvGoals | tvGoals | totalGoals | +--------------------------------------------------------+ | "Gareth Bale" | 4 | 12 | 16 | | "Robin Van Persie" | 8 | 11 | 19 | | "Luis Suárez" | 12 | 10 | 22 | | "Theo Walcott" | 3 | 8 | 11 | | "Demba Ba" | 7 | 8 | 15 | | "Santi Cazorla" | 4 | 7 | 11 | | "Carlos Tevez" | 3 | 6 | 9 | | "Edin Dzeko" | 6 | 6 | 12 | | "Wayne Rooney" | 6 | 6 | 12 | | "Juan Mata" | 4 | 6 | 10 | +--------------------------------------------------------+ 10 rows
So as we can see Gareth Bale pretty much only scores when TV cameras are about!
I was intrigued what had changed between 1.9.M04 and 1.9.M05 so I spent a few hours this morning browsing the cypher part of the code base and not really getting anywhere for the most part.
I thought that there had probably been a change around the way that collection literals were handled but a quick scan of git log suggested there hadn’t been any changes:
$ git log -- community/cypher/src/main/scala/org/neo4j/cypher/internal/parser/v1_9
commit 7311bbe33bc06b346e60e12a4eee2a7173cbd317
Author: Andres Taylor <andres@neotechnology.com>
Date: Tue Mar 19 06:42:49 2013 +0100
Handles single node patterns in MATCH
commit c9f580456572d3d267a15dc88b35e07fd450cf93
Author: Stefan Plantikow <stefan.plantikow@googlemail.com>
Date: Fri Jan 11 21:27:55 2013 +0100
scala 2.10 support: Kills type erasure warnings with a few casts and cleans up a bit
...1.9.M04 was released on January 22nd and 1.9.M05 on March 6th and the only commit in this part of the code base in that time didn’t touch the bit of code I’d been looking at.
Interestingly I couldn’t find anywhere in the code base which had the string ‘Failed merging’ so I thought I’d do a quick scan of the diffs to see if this had been deleted:
$ git log -S"Failed merging"
commit b6501aac03cf70419e94b4cfc160695e4950914a
Author: Andres Taylor <andres@neotechnology.com>
Date: Sat Feb 16 19:30:00 2013 +0100
Changed how Cypher merges typesSo in fact there was a commit which changed the way that the collection type was determined so that rather than throwing an exception for clashing types a parent type would be used instead.
In any case this merging problem doesn’t exist in 1.9.M05 and I’ve switched my graph to use that version of neo4j so I won’t be see in this exception anymore!
neo4j/cypher: WITH, COLLECT & EXTRACT
As I mentioned in my last post I’m trying to get the hang of the WITH statement in neo4j’s cypher query language and I found another application when trying to work out which opponents teams played on certain days.
I started out with a query which grouped the data set by day and showed the opponents that were played on that day:
START team = node:teams('name:"Manchester United"')
MATCH team-[h:home_team|away_team]-game-[:on_day]-day
RETURN DISTINCT day.name, COLLECT(TRIM(REPLACE(REPLACE(game.name, "Manchester United", ""), "vs", "")))+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | day.name | opponents | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | "Sunday" | ["Liverpool","Everton","Southampton","Liverpool","Newcastle United","Chelsea","Manchester City","Swansea City","Tottenham Hotspur"] | | "Wednesday" | ["Southampton","West Ham United","Newcastle United"] | | "Monday" | ["Everton"] | | "Saturday" | ["Reading","Fulham","Wigan Athletic","Tottenham Hotspur","Stoke City","Arsenal","Queens Park Rangers","Sunderland","West Bromwich Albion","Norwich City","Reading","Aston Villa","Norwich City","Fulham","Queens Park Rangers"] | | "Tuesday" | ["Wigan Athletic"] | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 5 rows
The way we’ve got the opponents is a bit of a hack – the name of the two teams is in the ‘name’ property of a game node and we’ve removed ‘Manchester United’ and the word ‘vs’ to get the opponent’s name.
I thought it’d be cool if we could separate the games on each day based on whether Manchester United were playing at home or away.
With a lot of help from Wes Freeman we ended up with the following query which does the job:
START team = node:teams('name:"Manchester United"')
MATCH team-[h:home_team|away_team]-game-[:on_day]-day
WITH day.name as d, game, team, h
MATCH team-[:home_team|away_team]-game-[:home_team|away_team]-opp
WITH d, COLLECT([type(h),opp.name]) AS games
RETURN d,
EXTRACT(c in FILTER(x in games: HEAD(x) = "home_team") : HEAD(TAIL(c))) AS home,
EXTRACT(c in FILTER(x in games: HEAD(x) = "away_team") : HEAD(TAIL(c))) AS awayWe use a similar approach with COLLECT as in the previous post whereby we have a collection of tuples describing whether Manchester United were at home or not and who they were playing.
A neat thing that Wes pointed out is that since there are only 2 teams per game we’re able to get the opponent node easily because it’s the only other node that can match the ‘home_team|away_team” relationship since we’ve already matched our team.
If we run the query just up to the last WITH we get the following result:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | d | games | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | "Sunday" | [["home_team","Liverpool"],["home_team","Everton"],["away_team","Southampton"],["away_team","Liverpool"],["away_team","Newcastle United"],["away_team","Chelsea"],["away_team","Manchester City"],["away_team","Swansea City"],["away_team","Tottenham Hotspur"]] | | "Wednesday" | [["home_team","Southampton"],["home_team","West Ham United"],["home_team","Newcastle United"]] | | "Monday" | [["away_team","Everton"]] | | "Saturday" | [["home_team","Reading"],["home_team","Fulham"],["home_team","Wigan Athletic"],["home_team","Tottenham Hotspur"],["home_team","Stoke City"],["home_team","Arsenal"],["home_team","Queens Park Rangers"],["home_team","Sunderland"],["home_team","West Bromwich Albion"],["home_team","Norwich City"],["away_team","Reading"],["away_team","Aston Villa"],["away_team","Norwich City"],["away_team","Fulham"],["away_team","Queens Park Rangers"]] | | "Tuesday" | [["away_team","Wigan Athletic"]] | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 5 rows
We then use the FILTER function to choose either the opponents Manchester United played at home or away and then we use the EXTRACT function to get the opponent from the tuple:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | d | home | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | "Sunday" | ["Liverpool","Everton"] | | "Wednesday" | ["Southampton","West Ham United","Newcastle United"] | | "Monday" | [] | | "Saturday" | ["Reading","Fulham","Wigan Athletic","Tottenham Hotspur","Stoke City","Arsenal","Queens Park Rangers","Sunderland","West Bromwich Albion","Norwich City"] | | "Tuesday" | [] | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 5 rows +-----------------------------------------------------------------------------------------------------------------------------+ | d | away | +-----------------------------------------------------------------------------------------------------------------------------+ | "Sunday" | ["Southampton","Liverpool","Newcastle United","Chelsea","Manchester City","Swansea City","Tottenham Hotspur"] | | "Wednesday" | [] | | "Monday" | ["Everton"] | | "Saturday" | ["Reading","Aston Villa","Norwich City","Fulham","Queens Park Rangers"] | | "Tuesday" | ["Wigan Athletic"] | +-----------------------------------------------------------------------------------------------------------------------------+ 5 rows
(I ran the query twice alternating between the last two lines so that it’s readable here. In actual fact the away teams would be in a column next to the home teams)
I thought it was quite interesting how many games Manchester United play away on a Sunday – I think all of those games were probably televised so I thought they’d be more evenly split between home and away matches. Adding televised matches is perhaps another layer to add to the graph.
It’s probably more useful to summarise how many games were played on each day at home and away rather than who they’re against and we can use the REDUCE function to do this:
START team = node:teams('name:"Manchester United"')
MATCH team-[h:home_team|away_team]-game-[:on_day]-day
WITH day.name as dayName, game, team, h
MATCH team-[:home_team|away_team]-game-[:home_team|away_team]-opp
WITH dayName, COLLECT([type(h),opp.name]) AS games
RETURN dayName,
REDUCE(homeGames=0, game in EXTRACT(c in FILTER(x in games: head(x) = "home_team") : HEAD(TAIL(c))) : homeGames + 1) as home,
REDUCE(awayGames=0, game in EXTRACT(c in FILTER(x in games: head(x) = "away_team") : HEAD(TAIL(c))) : awayGames + 1) as away,
REDUCE(totalGames=0, game in games : totalGames + 1) as total+-----------------------------------+ | dayName | home | away | total | +-----------------------------------+ | "Sunday" | 2 | 7 | 9 | | "Wednesday" | 3 | 0 | 3 | | "Monday" | 0 | 1 | 1 | | "Saturday" | 10 | 5 | 15 | | "Tuesday" | 0 | 1 | 1 | +-----------------------------------+ 5 rows
An alternative way of writing the initial query would be the following which Michael Hunger suggested on the thread:
START team = node:teams('name:"Manchester United"')
MATCH p=team-[:home_team|away_team]-game-[:home_team|away_team]-(), game-[:on_day]-day
WITH day.name as dayName, COLLECT([LAST(p), HEAD(RELS(p))]) AS opponents
WITH dayName,
EXTRACT(y in FILTER(x in opponents: TYPE(HEAD(TAIL(x))) = "home_team") : HEAD(y)) AS home,
EXTRACT(y in FILTER(x in opponents : TYPE(HEAD(TAIL(x))) = "away_team") : HEAD(y)) AS away
RETURN dayName,
EXTRACT(team in home: team.name) AS homeOpponents,
EXTRACT(team in away: team.name) AS awayOpponents
ORDER BY dayNameHere we take a slightly different approach where we make use of functions that we can apply to a matching path. We create a collection of tuples where LAST(p) matches the opponent node and HEAD(RELS(p)) matches the ‘home_team’ or ‘away_team’ relationship accordingly.
We then filter the collection to find the times that we played at home and away. This is done by taking the second value from the tuple and then calling TYPE on it which either returns ‘home_team’ or ‘away_team’. We then extract the first value from the tuple which is the opponent node.
In the last part of the query we extract the name from the opponent nodes.
neo4j/cypher: Getting the hang of the WITH statement
I wrote a post a few weeks ago showing an example of a cypher query which made use of the WITH statement but I still don’t completely understand how it works so I thought I’d write some more queries that use it.
I wanted to find out whether Luis Suárez has a better scoring record depending on which day a match is played on.
We start out by finding all the matches that he’s played in and which days those matches were on:
START player = node:players('name:"Luis Suárez"')
MATCH game-[:in]-stats-[:played]-player, game-[:on_day]-day
RETURN day.name, game.name+---------------------------------------------------+ | day.name | game.name | +---------------------------------------------------+ | "Saturday" | "Liverpool vs Southampton" | | "Saturday" | "Southampton vs Liverpool" | | "Saturday" | "Liverpool vs Reading" | | "Saturday" | "West Bromwich Albion vs Liverpool" | ... +---------------------------------------------------+ 29 rows
We can then group those matches by day to find out how many games he played in on a particular day:
START player = node:players('name:"Luis Suárez"')
MATCH game-[:in]-stats-[:played]-player, game-[:on_day]-day
RETURN day.name, COUNT(game.name)+--------------------------------+ | day.name | COUNT(game.name) | +--------------------------------+ | "Sunday" | 13 | | "Wednesday" | 4 | | "Monday" | 1 | | "Saturday" | 11 | +--------------------------------+ 4 rows
Now we want to find out which days the games that Suarez scored in were on so we start out by returning each day that Suarez played in a match and then return a collection containing information about the games he played on that day and whether he scored:
START player = node:players('name:"Luis Suárez"')
MATCH game-[:in]-stats-[:played]-player-[r?:scored_in]-game-[:on_day]-day
RETURN day, COLLECT(DISTINCT([type(r), game.name])) AS games+----------------------------------------------------------------------------------------+ | day.name | games | +----------------------------------------------------------------------------------------+ | "Sunday" | [["scored_in","Liverpool vs Manchester City"]…] | | "Wednesday" | [[<null>,"Tottenham Hotspur vs Liverpool"],[<null>,"Stoke City vs Liverpool"]...] | | "Monday" | [[<null>,"Liverpool vs West Bromwich Albion"]] | | "Saturday" | [[<null>,"Liverpool vs Southampton"]…] | +----------------------------------------------------------------------------------------+ 4 rows
This query has got a little bit more complicated than our previous ones because we wanted to return all the days that Suarez played matches on even if he didn’t score on that day.
The only interesting thing in the first couple of lines is that we match the ‘scored_in’ relationship optionally so that we can handle the situation where Suarez didn’t score while still returning a row.
On the third line we return the day and then we get a collection of tuples of the ‘scored_in’ relationship and the corresponding game.
We use DISTINCT on this line to take care of the situation where Suarez scored multiple times in the same match. We’re working out the number of games that Suarez scored in so counting multiple goals in the same match would ruin that count.
We then need to tweak that query slightly to get a count of the matches that Suarez scored in rather than just returning them. We end up with the following:
START player = node:players('name:"Luis Suárez"')
MATCH game-[:in]-stats-[:played]-player-[r?:scored_in]-game-[:on_day]-day
WITH day, COLLECT(DISTINCT([type(r), game.name])) AS games
RETURN day.name, REDUCE(totalGames = 0, game in FILTER(x in games : head(x) = "scored_in"): totalGames + 1) AS gamesScoredIn+-----------------------------+ | day.name | gamesScoredIn | +-----------------------------+ | "Wednesday" | 2 | | "Saturday" | 6 | | "Sunday" | 7 | | "Monday" | 0 | +-----------------------------+ 4 rows
We start off by filtering the games so that we only keep the ones that Suarez scored in. We then run a REDUCE over the resulting collection which just adds 1 to an accumulator for each record in the collection.
Now that we’ve got that the next step is to combine our games played and games scored in queries together so that we can see what % of games Suarez scores in on each day.
We end up with the following:
START player = node:players('name:"Luis Suárez"')
MATCH game-[:in]-stats-[:played]-player, game-[:on_day]-day
WITH player, day, COUNT(game) AS playedGames
MATCH game-[:in]-stats-[:played]-player-[r?:scored_in]-game-[:on_day]-day
WITH day, COLLECT(DISTINCT([type(r), game.name])) AS games, playedGames
WITH day, REDUCE(totalGames = 0, game in FILTER(x in games : head(x) = "scored_in"): totalGames + 1) AS scoredGames, playedGames
RETURN day.name, playedGames, scoredGames, (scoredGames*1.0/playedGames*1.0) * 100 AS percentage+-------------------------------------------------------------+ | day.name | playedGames | scoredGames | percentage | +-------------------------------------------------------------+ | "Saturday" | 11 | 6 | 54.54545454545454 | | "Monday" | 1 | 0 | 0.0 | | "Wednesday" | 4 | 2 | 50.0 | | "Sunday" | 13 | 7 | 53.84615384615385 | +-------------------------------------------------------------+ 4 rows
One thing I kept getting confused about here is that we need to pass everything that we want to eventually return in each WITH statement otherwise it won’t be available to us at the end.
If we’re going to do multiple MATCH statements we need to pass the starting node in the preceding WITH statement which in this case means that we need to pass the player variable along.
Other than that this query is the amalgamation of the previous two except we’ve added some arithmetic on the final line to work out the % of matches that Suarez scores in. I had to multiply each number by 1.0 to force float based arithmetic rather than integer based.
In answer to our initial question it doesn’t seem to matter which day a match is played on, Suarez scores in approximately every other game.
neo4j/cypher: SQL style GROUP BY WITH LIMIT query
A few weeks ago I wrote a blog post where I described how we could construct a SQL GROUP BY style query in cypher and last week I wanted to write a similar query but with what I think would be a LIMIT clause in SQL.
I wanted to find the maximum number of goals that players had scored in a match for a specific team and started off with the following query to find all the matches that players had scored in:
START team = node:teams('name:"Manchester United"')
MATCH team-[:home_team|away_team]-game-[:scored_in]-player-[:played]-stats-[:for]-team,
stats-[:in]-game
RETURN DISTINCT player.name, stats.goals, game.nameWe find all the matches where Manchester United were playing and then get a list of the players who scored for them in those games:
+--------------------------------------------------------------------------------+ | player.name | stats.goals | game.name | +--------------------------------------------------------------------------------+ | "Javier Hernández" | 1 | "Manchester United vs Wigan Athletic" | | "Robin Van Persie" | 1 | "Manchester United vs Sunderland" | | "Danny Welbeck" | 1 | "Manchester United vs Stoke City" | | "Rafael" | 1 | "Queens Park Rangers vs Manchester United" | | "Wayne Rooney" | 1 | "Manchester United vs Norwich City" | | "Shinji Kagawa" | 1 | "Manchester United vs Fulham" | | "Shinji Kagawa" | 3 | "Manchester United vs Norwich City" | ... +--------------------------------------------------------------------------------+ 50 rows
Our next step would be to only return the unique combinations of players and goals:
START team = node:teams('name:"Manchester United"')
MATCH team-[:home_team|away_team]-game-[:scored_in]-player-[:played]-stats-[:for]-team,
stats-[:in]-game
RETURN DISTINCT player.name, stats.goals+-----------------------------------+ | player.name | stats.goals | +-----------------------------------+ | "Nemanja Vidic" | 1 | | "Shinji Kagawa" | 1 | | "Danny Welbeck" | 1 | | "Darren Fletcher" | 1 | | "Wayne Rooney" | 2 | | "Javier Hernández" | 1 | | "Nani" | 1 | | "Tom Cleverley" | 1 | | "Robin Van Persie" | 2 | | "Shinji Kagawa" | 3 | ... | "Robin Van Persie" | 3 | +-----------------------------------+ 21 rows
Since we only want to return each player once along with their maximum value from the ‘stats.goals’ column all we have to do now is make use of the MAX function:
START team = node:teams('name:"Manchester United"')
MATCH team-[:home_team|away_team]-game-[:scored_in]-player-[:played]-stats-[:for]-team,
stats-[:in]-game
RETURN DISTINCT player.name, MAX(stats.goals) AS goals
ORDER BY goals DESC+-----------------------------+ | player.name | goals | +-----------------------------+ | "Robin Van Persie" | 3 | | "Shinji Kagawa" | 3 | | "Javier Hernández" | 2 | | "Wayne Rooney" | 2 | | "Paul Scholes" | 1 | | "Ryan Giggs" | 1 | | "Tom Cleverley" | 1 | ... | "Rafael" | 1 | | "Darren Fletcher" | 1 | | "Nani" | 1 | +-----------------------------+ 16 rows
clojure/Java Interop – Importing neo4j spatial data
I wrote a post about a week ago where I described how I’d added football stadiums to my football graph using neo4j spatial and after I’d done that I wanted to put it into my import script along with the rest of the data.
I thought leiningen would probably work quite well for this as you can point it at a Java class and have it be executed.
To start with I had to change the import code slightly to link stadiums to teams which have already been added to the graph:
package main.java; // imports excluded public class StadiumsImport { public static void main(String[] args) throws IOException { List<String> lines = readFile("data/stadiums.csv"); EmbeddedGraphDatabase db = new EmbeddedGraphDatabase("neo4j-community-1.9.M04/data/graph.db"); Index<Node> stadiumsIndex = createSpatialIndex(db, "stadiumsLocation"); Transaction tx = db.beginTx(); for (String stadium : lines) { String[] columns = stadium.split(","); Index<Node> teamsIndex = db.index().forNodes("teams"); String team = columns[1].replaceAll("\"",""); Node teamNode = teamsIndex.get("name", team).getSingle(); if(teamNode != null) { Node stadiumNode = db.createNode(); stadiumNode.setProperty("wkt", String.format("POINT(%s %s)", columns[4], columns[3])); stadiumNode.setProperty("name", columns[0].replaceAll("\"","")); stadiumsIndex.add(stadiumNode, "dummy", "value"); teamNode.createRelationshipTo(stadiumNode, DynamicRelationshipType.withName("play_at")); } } tx.success(); tx.finish(); } private static Index<Node> createSpatialIndex(EmbeddedGraphDatabase db, String indexName) { return db.index().forNodes(indexName, SpatialIndexProvider.SIMPLE_WKT_CONFIG); } // readFile excluded }
I’ve excluded some bits of the code for brevity but it’s on this gist if you’re interested.
The only change from last week’s version is that we’re now looking up the team that a stadium belongs to and creating a ‘play_at’ relationship from the team to the stadium.
I was then able to execute that code by calling ‘lein run’ based on the following project.clj file:
(defproject neo4jfootball "1.0.0-SNAPSHOT"
:description "neo4j football project"
:main "main.java.StadiumsImport"
:dependencies [[org.clojure/clojure "1.4.0"]
[org.neo4j/neo4j-spatial "0.11-SNAPSHOT"]
[clojure-csv/clojure-csv "2.0.0-alpha1"]]
:jvm-opts ["-Xmx2g"]
:plugins [[lein-idea "1.0.1"]]
:repositories {"local" ~(str (.toURI (java.io.File. "maven_repository")))}
:java-source-paths ["src/main/java"] )I’m using a local Maven repository to store the neo4j spatial JAR. The Maven entry was created by executed the following command from where I had the neo4j spatial project checked out:
mvn install:install-file -Dfile=target/neo4j-spatial-0.11-SNAPSHOT.jar -DartifactId=neo4j-spatial -Dversion=0.11-SNAPSHOT -DgroupId=org.neo4j -Dpackaging=jar -DlocalRepositoryPath=/path/to/neo4j-football/maven_repository -DpomFile=pom.xml
That worked reasonably well but I thought it’d be interesting to see what the above code would look like if it was written in clojure instead.
This is what I ended up with:
(ns neo4jfootball.core (:require [clojure-csv.core :as csv]) (:use clojure.java.io) (:import (org.neo4j.kernel EmbeddedGraphDatabase) (org.neo4j.gis.spatial.indexprovider SpatialIndexProvider) (org.neo4j.graphdb DynamicRelationshipType))) (defn take-csv [fname] (with-open [file (reader fname)] (csv/parse-csv (slurp file)))) (defn transform [line] {:stadium (get line 0) :team (get line 1) :lat (get line 3) :long (get line 4)}) (def not-nil? (comp not nil?)) (defn create-stadium-node [db line] (let [stadium-node (.. db createNode)] (.. stadium-node (setProperty "wkt" (format "POINT(%s %s)" (:long line) (:lat line)))) (.. stadium-node (setProperty "name" (:stadium line))) stadium-node)) (defn -main [] (do (let [db (new EmbeddedGraphDatabase "neo4j-community-1.9.M04/data/graph.db") tx (.beginTx db) stadiums-index (.. db index (forNodes "stadiumsLocation" (SpatialIndexProvider/SIMPLE_WKT_CONFIG))) teams-index (.. db index (forNodes "teams"))] (doseq [line (drop 1 (map transform (take-csv "data/stadiums.csv")))] (let [team-node (.. teams-index (get "name" (:team line)) getSingle)] (if (not-nil? team-node) (let [stadium-node (create-stadium-node db line)] (.. stadiums-index (add stadium-node "dummy" "value")) (.. team-node (createRelationshipTo stadium-node (DynamicRelationshipType/withName "play_at"))))))) (.. tx success) (.. tx finish))))
The code is simplified quite a bit by using the clojure CSV library so I could probably have achieved similar in the Java version by using an equivalent library.
It’s a bit easier to see what properties of a row in the CSV file are being used where as a result of the transform function where we convert the array into a map.
It would have taken quite a bit more code to achieve a similar thing in Java so I didn’t bother.
The Java Interop page on the clojure website was quite useful for working out how to call the various methods on the Java API.
I’m mainly using the .. macro which allows us to chain Java method calls together. In a couple of cases we could just as easily have used the . macro instead.
We can then call this code from lein like so:
lein run -m neo4jfootball.core