Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Neo4j: The foul revenge graph

without comments

Last week I was showing the foul graph to my colleague Alistair who came up with the idea of running a ‘foul revenge’ query to find out which players gained revenge for a foul with one of their own later in them match.

Queries like this are very path centric and therefore work well in a graph. To recap, this is what the foul graph looks like:

2015 05 26 07 35 33

The first thing that we need to do is connect the fouls in a linked list based on time so that we can query their order more easily.

We can do this with the following query:

MATCH (foul:Foul)-[:COMMITTED_IN_MATCH]->(match)
WITH foul,match
ORDER BY match.id, foul.sortableTime
WITH match, COLLECT(foul) AS fouls
FOREACH(i in range(0, length(fouls) -2) |
  FOREACH(foul1 in [fouls[i]] | FOREACH (foul2 in [fouls[i+1]] |
    MERGE (foul1)-[:NEXT]->(foul2)
)));

This query collects fouls grouped by match and then adds a ‘NEXT’ relationship between adjacent fouls. The graph now looks like this:

2015 05 26 07 43 28

Now let’s find the revenge foulers in the Bayern Munich vs Barcelona match. We’re looking for the following pattern:

2015 05 26 07 55 45

This translates to the following cypher query:

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2)
WHERE (foul1)-[:NEXT*]->(foul2)
RETURN player2.name AS firstFouler, player1.name AS revengeFouler, foul1.time, foul1.location, foul2.time, foul2.location

I’ve added in a few extra parts to the pattern to pull out the players involved and to find the revenge foulers in a specific match – the Bayern Munich vs Barcelona Semi Final 2nd leg.


We end up with the following revenge fouls:

2015 05 26 00 05 48

We can see here that Dani Alves actually gains revenge on Bastian Schweinsteiger twice for a foul he made in the 10th minute.

If we tweak the query to the following we can get a visual representation of the revenge fouls as well:

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2),
      (foul1)-[:NEXT*]->(foul2)
RETURN *

2015 05 23 15 23 22

At the moment I’ve restricted the revenge concept to single matches but I wonder whether it’d be more interesting to create a linked list of fouls which crosses matches between teams in the same season.

The code for all of this is on github – the README is a bit sketchy at the moment but I’ll be fixing that up soon.

Written by Mark Needham

May 26th, 2015 at 7:03 am

Posted in neo4j

Tagged with

Neo4j: Finding all shortest paths

without comments

One of the Cypher language features we show in Neo4j training courses is the shortest path function which allows you to find the shortest path in terms of number of relationships between two nodes.

Using the movie graph, which you can import via the ‘:play movies’ command in the browser, we’ll first create a ‘KNOWS’ relationship between any people that have appeared in the same movie:

MATCH (p1:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
MERGE (p1)-[:KNOWS]-(p2)

Now that we’ve got that relationship we can easily find the shortest path between two people, say Tom Cruise and Tom Hanks:

MATCH (p1:Person {name: "Tom Hanks"}), (p2:Person {name: "Tom Cruise"}),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
Graph  18

That works pretty well but what if we want to find the longest shortest path between any two people in the graph? We can calculate it like this:

MATCH (p1:Person), (p2:Person),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1
Graph  19

So that’s 6 hops which is actually the Bacon number – I expect we’d probably see a smaller maximum value if we imported all the movies.

And to round off the post what if we want to find the longest shortest path between the 10 people who acted in the most movies? We might start out with the following query which seems like it should do the job:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH p1 AS p1, p1 AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Unfortunately if we run that query we get no rows returned because ‘p1′ and ‘p2′ always refer to the same node.

Instead we can calculate the shortest path between our hardest working people by creating a cross product using COLLECT and UNWIND:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH COLLECT(p1) AS ps
UNWIND ps AS p1 UNWIND ps AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Graph  20

That’s all for now!

Written by Mark Needham

May 19th, 2015 at 10:45 pm

Posted in neo4j

Tagged with ,

Neo4j: Refactoring the BBC football live text fouls graph

without comments

Yesterday I wrote about a Neo4j graph I’ve started building which contains all the fouls committed in the Champions League game between Barcelona & Bayern Munich and surrounding meta data.

While adding other events into the graph I realised that I’d added some duplication in the model and the model could do with some refactoring to make it easier to use.

To recap, this is the model that we designed in the previous blog post:

The duplication is on the left hand side of the model – we model a foul as being committed by one player against another and then hook the foul back into the match. By doing that we’re not using the ‘appearance’ concept which links a player and a match together.

We can make the ‘COMMITTED_IN_MATCH’ relationship redundant by connecting the foul to appearance rather than to player. The match the foul was committed in can then be found by navigating through the appearance node.

This is what we want the graph to look like:

2015 05 17 10 40 44

We’ll move towards this new model in 3 steps:

  • Introduce the new structure alongside the existing one
  • Rewrite our queries to use the new structure
  • Remove the old structure

Introducing the new structure

First up let’s write a query to introduce the new structure.

match (foul:Foul)-[:COMMITTED_AGAINST]->(fouledPlayer),
      (foul)<-[:COMMITTED_FOUL]-(foulingPlayer),
      (foul)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"}),
      (foulingPlayer)-[:MADE_APPEARANCE]-(foulingPlayerApp)-[:IN_MATCH]->(match),
      (fouledPlayer)-[:MADE_APPEARANCE]-(fouledPlayerApp)-[:IN_MATCH]->(match)
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayerApp)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayerApp)

Remember we’re not going to delete the old structure yet so that’s why there aren’t any delete statements in here.

Rewriting our queries

Now we need to update our queries to work against the new graph structure:

Where do the fouls happen?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-()<-[]-(foul:Foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

Who fouls the most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

Who was fouled the most?

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[r:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)<-[:COMMITTED_AGAINST]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

Who fouled who the most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)-[:COMMITTED_AGAINST]->(fouled:Player),
      (foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"}),
      (match)<-[:IN_MATCH]-(fouledApp)<-[:COMMITTED_AGAINST]->(foul:Foul)<-[:COMMITTED_FOUL]-(foulerApp)-[:IN_MATCH]->(match),
      (fouledApp)<-[:MADE_APPEARANCE]-(fouled),
      (foulerApp)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

Which team fouled most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-()<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

Worst fouler for each team

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouler)-[:MADE_APPEARANCE]->(app)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

Most fouled against for each team

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)<-[:COMMITTED_AGAINST]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouled)-[:MADE_APPEARANCE]->(app)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls

The early queries are made more complicated by the refactoring but the latter ones are slightly simpler. I think we need to hook some more events onto the appearance node to see whether this refactoring is worthwhile or not.

Removing the old structure

Holding judgement for now, let’s look at how we’d remove the old structure – the final step in this refactoring:

match (match:Match {id: "32683310"})<-[oldRel:COMMITTED_IN_MATCH]-(foul:Foul)
DELETE oldRel
match (player:Player)<-[oldRel:COMMITTED_AGAINST]-(foul:Foul)
DELETE oldRel
match (player:Player)-[oldRel:COMMITTED_FOUL]->(foul:Foul)
DELETE oldRel

Hopefully you can see how you’d go about refactoring your own graph if you realise the model isn’t quite what you want.

Any questions/thoughts/suggestions let me know!

Written by Mark Needham

May 17th, 2015 at 11:04 am

Posted in neo4j

Tagged with

Neo4j: BBC football live text fouls graph

without comments

I recently came across the Partially Derivative podcast and in episode 17 they describe how Kirk Goldsberry scraped a bunch of data about shots in basketball matches then ran some analysis on that data.

It got me thinking that we might be able to do something similar for football matches and although event based data for football matches only comes from Opta, the BBC does expose some of them in live text feeds.

We’ll start with the Champions League match between Barcelona and Bayern Munich from last Tuesday.

2015 05 16 23 10 43

Our first task is to extract the events that happened in the match along with the players involved. After we’ve got that we’ll generate a Neo4j graph and see if we can find some interesting insights.

I find the feedback cycle with this type of work is dramatically improved if we have the source data available locally so the first step was to get the BBC web page downloaded:

$ wget http://www.bbc.co.uk/sport/0/football/32683310

Next we need to write a scraper which will extract all the events. We want to get an array containing one entry for each event, where the following is an example of an event:

2015 05 16 22 19 00

HTML-wise it looks like this:

2015 05 16 22 20 28

4709393221 bddd85c64e z

Image courtesy of William Brawley

I do most of my scraping work in Python so I used the Beautiful Soup library with the soupselect wrapper to get the data into CSV format ready to import into Neo4j.

It was mostly a straight forward job of finding the appropriate CSS tag and pulling out the values although the way fouls are described in the page is a bit strange – sometimes the person fouled comes first row and the fouler comes on the next line and sometimes vice versa.

Luckily the two parts of the foul can be joined together by matching the time which made life easier.

The full code for the scrapper is on github if you want to play with it.

This is what the resulting CSV file looks like:

$ head -n 10 data/events.csv 
matchId,foulId,freeKickId,time,foulLocation,fouledPlayer,fouledPlayerTeam,foulingPlayer,foulingPlayerTeam
32683310,3,2,90:00 +0:40,in the defensive half.,Xabi Alonso,FC Bayern München,Pedro,Barcelona
32683310,9,8,84:38,on the right wing.,Rafinha,FC Bayern München,Pedro,Barcelona
32683310,12,13,83:17,in the attacking half.,Lionel Messi,Barcelona,Sebastian Rode,FC Bayern München
32683310,15,14,82:43,in the defensive half.,Sebastian Rode,FC Bayern München,Neymar,Barcelona
32683310,17,18,80:41,in the attacking half.,Pedro,Barcelona,Xabi Alonso,FC Bayern München
32683310,22,23,76:31,in the defensive half.,Neymar,Barcelona,Rafinha,FC Bayern München
32683310,25,26,75:03,in the attacking half.,Lionel Messi,Barcelona,Xabi Alonso,FC Bayern München
32683310,31,30,69:37,in the attacking half.,Bastian Schweinsteiger,FC Bayern München,Dani Alves,Barcelona
32683310,36,35,63:27,in the attacking half.,Robert Lewandowski,FC Bayern München,Ivan Rakitic,Barcelona

Now it’s time to create a graph. We’ll aim to massage the data into this model:

2015 05 16 22 50 32

Next we need to write some Cypher code to get the CSV data into the graph. The full script is here, a sample of which is below:

// match
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Match {id: row.matchId});
 
// teams
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.foulingPlayerTeam});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.fouledPlayerTeam});
 
// players
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
ON CREATE SET player.name = row.foulingPlayer;
 
// appearances
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (match:Match {id: row.matchId})
MATCH (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
MATCH (team:Team {name: row.foulingPlayerTeam})
 
MERGE (appearance:Appearance {id: player.id + " in " + row.matchId})
MERGE (player)-[:MADE_APPEARANCE]->(appearance)
MERGE (appearance)-[:IN_MATCH]->(match)
MERGE (appearance)-[:FOR_TEAM]->(team);
 
// fouls
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
 
MATCH (foulingPlayer:Player {id:row.foulingPlayer + "_" + row.foulingPlayerTeam })
MATCH (fouledPlayer:Player {id:row.fouledPlayer + "_" + row.fouledPlayerTeam })
MATCH (match:Match {id: row.matchId})
 
MERGE (foul:Foul {eventId: row.foulId})
ON CREATE SET foul.time = row.time, foul.location = row.foulLocation
 
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayer)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayer)
MERGE (foul)-[:COMMITTED_IN_MATCH]->(match);

We’ll use neo4j-shell to execute the script:

$ ./neo4j-community-2.2.1/bin/neo4j-shell --file import.cql

Now that we’ve got the data into Neo4j we need to come up with some questions to ask of it. I came up with the following but perhaps you can think of some others!

  • Where do the fouls happen on the pitch?
  • Who made the most fouls?
  • Who was fouled the most?
  • Who fouled who the most?
  • Which team fouled the most?
  • Who’s the worst fouler in each team?
  • Who’s the most fouled in each team?

Where do the fouls happen?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC;
 
+----------------------------------+
| location                 | fouls |
+----------------------------------+
| "in the defensive half." | 12    |
| "in the attacking half." | 12    |
| "on the right wing."     | 3     |
| "on the left wing."      | 3     |
+----------------------------------+
4 rows

Who fouls the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+------------------------------+
| fouler               | fouls |
+------------------------------+
| "Rafinha"            | 4     |
| "Pedro"              | 3     |
| "Medhi Benatia"      | 3     |
| "Dani Alves"         | 3     |
| "Xabi Alonso"        | 3     |
| "Javier Mascherano"  | 2     |
| "Thiago Alcántara"   | 2     |
| "Robert Lewandowski" | 2     |
| "Sebastian Rode"     | 1     |
| "Sergio Busquets"    | 1     |
+------------------------------+
10 rows

Who was fouled the most?

// who was fouled the most
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+----------------------------------+
| fouled                   | fouls |
+----------------------------------+
| "Robert Lewandowski"     | 4     |
| "Lionel Messi"           | 4     |
| "Neymar"                 | 3     |
| "Pedro"                  | 2     |
| "Xabi Alonso"            | 2     |
| "Andrés Iniesta"         | 2     |
| "Rafinha"                | 2     |
| "Bastian Schweinsteiger" | 2     |
| "Sebastian Rode"         | 1     |
| "Sergio Busquets"        | 1     |
+----------------------------------+
10 rows

Who fouled who the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled),
      (foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+--------------------------------------------------------+
| fouler              | fouled                   | fouls |
+--------------------------------------------------------+
| "Javier Mascherano" | "Robert Lewandowski"     | 2     |
| "Dani Alves"        | "Bastian Schweinsteiger" | 2     |
| "Xabi Alonso"       | "Lionel Messi"           | 2     |
| "Rafinha"           | "Neymar"                 | 2     |
| "Rafinha"           | "Andrés Iniesta"         | 2     |
| "Dani Alves"        | "Xabi Alonso"            | 1     |
| "Thiago Alcántara"  | "Javier Mascherano"      | 1     |
| "Pedro"             | "Juan Bernat"            | 1     |
| "Medhi Benatia"     | "Pedro"                  | 1     |
| "Neymar"            | "Sebastian Rode"         | 1     |
+--------------------------------------------------------+
10 rows

Which team fouled the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+-----------------------------+
| team.name           | fouls |
+-----------------------------+
| "FC Bayern München" | 18    |
| "Barcelona"         | 12    |
+-----------------------------+
2 rows

Worst fouler for each team?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouler.fouler.name | topFouler.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Rafinha"             | 4               |
| "Barcelona"         | "Pedro"               | 3               |
+---------------------------------------------------------------+
2 rows

Most fouled against for each team

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
      (fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouled.fouled.name | topFouled.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Robert Lewandowski"  | 4               |
| "Barcelona"         | "Lionel Messi"        | 4               |
+---------------------------------------------------------------+
2 rows

So Bayern fouled a bit more than Barca, the main forwards for each team (Messi/Lewandowski) were the most fouled players on the pitch and the fouling was mostly in the middle of the pitch.

I expect this graph will become much more interesting to query with more matches and with the other event types as well but I haven’t got those scraped yet. The code is on github if you want to play around with it and perhaps get the other events into the graph.

Written by Mark Needham

May 16th, 2015 at 9:13 pm

Posted in neo4j

Tagged with

R: ggplot – Displaying multiple charts with a for loop

without comments

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
eventId = "220750415"
query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
df = cypher(graph, query, id= eventId)
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773

We’re not doing anything too clever here, just using a couple of WITH clauses to order RSVPs so we can get the earlier one for each person.

Once we’ve done that we’ll tidy up the data frame so that it contains columns containing the cohort in which the member attended their first event:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

byCohort = df %>% count(earliestDate) 
 
ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))

2015 05 13 00 30 59

This is good and gives us the insight that most of the members attending this version of intro to graphs just joined the group. The event was on 7th April and most people joined in March which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))
}

We’d call it like this for the Intro to graphs meetup:

> plotEvent("220750415")

Next I tweaked the code to look up all Into to graphs events and then loop through and output a chart for each event:

events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")
 
for(event in events$n) {
  plotEvent(as.character(event))
}

Unfortunately that doesn’t print anything at all which we can fix by storing our plots in a list and then displaying it afterwards:

library(gridExtra)
p = list()
for(i in 1:count(events)$n) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}
 
do.call(grid.arrange,p)

2015 05 14 00 57 10

This visualisation is probably better without any of the axis so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart which will require a slight tweak of the query:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
            WITH e,rsvp, person
            MATCH (person)-[:RSVPD]->(otherRSVP)
 
            WITH e,person, rsvp, otherRSVP
            ORDER BY person.id, otherRSVP.time
 
            WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
            return rsvp.time, earliestRSVP.time,  person.id, e.time"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(), 
          axis.text.x = element_blank(), 
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) + 
    labs(title = df$eventTime[1])
}
2015 05 14 01 08 54

I think this makes it a bit easier to read although I’ve made the mistake of not having all the charts representing the same scale – one to fix for next time.

We started doing the intro to graphs sessions less frequently towards the end of last year so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them but that doesn’t seem to be the case. Instead it’s very dominated by people signing up close to the event.

Written by Mark Needham

May 14th, 2015 at 12:17 am

Posted in neo4j,R

Tagged with , ,

Neo4j: LOAD CSV – java.io.InputStreamReader there’s a field starting with a quote and whereas it ends that quote there seems to be character in that field after that ending quote. That isn’t supported.

without comments

I recently came across the last.fm dataset via Ben Frederickson’s blog and thought it’d be an interesting one to load into Neo4j and explore.

I started with a simple query to parse the CSV file and count the number of rows:

LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv" 
AS row FIELDTERMINATOR  "\t"
return COUNT(*)
 
At java.io.InputStreamReader@4d307fda:6484 there's a field starting with a quote and whereas it ends that quote there seems  to be character in that field after that ending quote. That isn't supported. This is what I read: 'weird al"'

This blows up because (as the message says) we’ve got a field which uses double quotes but then has other characters either side of the quotes.

A quick search through the file reveals one of the troublesome lines:

$ grep "\"weird" lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv  | head -n 1
0015371426d2cbef354b2f680340de38d0ebd2f0	7746d775-9550-4360-b8d5-c37bd448ce01	"weird al" yankovic	4099

I ran a file containing only that line through CSV Lint to see what it thought and indeed it is invalid:

2015 05 04 10 50 43

Let’s clean up our file to use single quotes instead of double quotes and try the query again:

$ tr "\"" "'" < lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv > lastfm-dataset-360K/clean.tsv
LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/clean.tsv" as row FIELDTERMINATOR  "\t"
return COUNT(*)
 
17559530

And we’re back in business! Interestingly Python’s CSV reader chooses to strip out the double quotes rather than throw an exception:

import csv
with open("smallWeird.tsv", "r") as file:
    reader = csv.reader(file, delimiter="\t")
 
    for row in reader:
        print row
$ python explore.py
['0015371426d2cbef354b2f680340de38d0ebd2f0', '7746d775-9550-4360-b8d5-c37bd448ce01', 'weird al yankovic', '4099']

I prefer LOAD CSV’s approach but it’s an interesting trade off I hadn’t considred before.

Written by Mark Needham

May 4th, 2015 at 9:56 am

Posted in neo4j

Tagged with ,

Neo4j: The learning to cycle dependency graph

without comments

Over the past couple of weeks I’ve been reading about skill building and the break down of skills into more manageable chunks, and recently had a chance to break down the skills required to learn to cycle.

I initially sketched out the skill progression but quickly realised I had drawn a dependency graph and thought that putting it into Neo4j would simplify things.

I started out with the overall goal for cycling which was to ‘Be able to cycle through a public park':

MERGE (:Goal:Task {name: "Be able to cycle through a public park"})

This goal is easy for someone who’s already learnt to cycle but if we’re starting from scratch it’s a bit daunting so we need to break it down into a simpler skill that we can practice.

The mini algorithm that we’re going to employ for task breakdown is this:

  1. Can we do the given task now?
  2. Break the task down into something simpler and return to 1.

One of the things to keep in mind is that we won’t get the break down perfect the first time so we may need to change it. For a diagram drawn on a piece of paper this would be annoying but in Neo4j it’s just a simpler refactoring.

Going back to cycling. Since the goal isn’t yet achievable we need to break that down into something a bit easier. Let’s start with something really simple:

MERGE (task:Task {name: "Take a few steps forward while standing over the bike"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MERGE (goal)-[:DEPENDS_ON]->(task)

In the first line we create our new task and then we connect it to our goal which we created earlier.

Graph  9

After we’ve got the hang of walking with the bike we want to get comfortable with cycling forward a few rotations while sitting on the bike but to do that we need to be able to get the bike moving from a standing start. We might also have another step where we cycle forward while standing on the bike as that might be slightly easier.

Let’s update our graph:

// First let's get rid of the relationship between our initial task and the goal
MATCH (initialTask:Task {name: "Take a few steps forward while standing over the bike"})
MATCH (goal:Goal {name: "Be able to cycle through a public park"})
MATCH (goal)-[rel:DEPENDS_ON]->(initialTask)
DELETE rel
 
WITH initialTask, goal, ["Get bike moving from standing start", "Cycle forward while standing", "Cycle forward while sitting"] AS newTasks
 
// Create some nodes for our new tasks
UNWIND newTasks AS newTask
MERGE (t:Task {name: newTask})
WITH initialTask, goal, COLLECT(t) AS newTasks
WITH initialTask, goal, newTasks, newTasks[0] AS firstTask, newTasks[-1] AS lastTask
 
// Connect the last task to the goal
MERGE (goal)-[:DEPENDS_ON]->(lastTask)
 
// And the first task to our initial task
MERGE (firstTask)-[:DEPENDS_ON]->(initialTask)
 
// And all the tasks to each other
FOREACH(i in RANGE(0, length(newTasks) - 2) |
  FOREACH(t1 in [newTasks[i]] | FOREACH(t2 in [newTasks[i+1]] |
    MERGE (t2)-[:DEPENDS_ON]->(t1) 
)))
Graph  10

We don’t strictly need to learn how to cycle while standing up – we could just go straight from getting the bike moving to cycling forward while sitting. Let’s update the graph to reflect that:

MATCH (sitting:Task {name: "Cycle forward while sitting"})
MATCH (moving:Task {name: "Get bike moving from standing start"})
MERGE (sitting)-[:DEPENDS_ON]->(moving)

Graph  11

Once we’ve got the hang of those tasks let’s add in a few more to get us closer to our goal:

WITH [
  {skill: "Controlled stop using brakes/feet", dependsOn: "Cycle forward while sitting"},
  {skill: "Steer around stationary objects", dependsOn: "Controlled stop using brakes/feet"},
  {skill: "Steer around people", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a small circular circuit", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Navigate a small circular circuit"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Steer around people"},
  {skill: "Be able to cycle through a public park", dependsOn: "Navigate a loop of a section of the park"}
 
] AS newTasks
 
FOREACH(newTask in newTasks |
  MERGE (t1:Task {name: newTask.skill})   
  MERGE (t2:Task {name: newTask.dependsOn})
  MERGE (t1)-[:DEPENDS_ON]->(t2)
)

Finally let’s get rid of the relationship from our goal to ‘Cycle forward while sitting’ since we’ve replaced that with some intermediate steps:

MATCH (task:Task {name: "Cycle forward while sitting"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MERGE (goal)-[rel:DEPENDS_ON]->(task)
DELETE rel

And here’s what the final dependency graph looks like:

Graph  13

Although I put this into Neo4j in order to visualise the dependencies we can now query the data as well. For example, let’s say I know how to cycle forward while sitting on the bike. What steps are there between me and being able to cycle around a park?

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((g)-[:DEPENDS_ON*]->(t))
RETURN path

Graph  14

Or if we want a list of the tasks we need to do next we could restructure the query slightly:

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((t)<-[:DEPENDS_ON*]->(g))
WITH [n in nodes(path) | n.name] AS tasks
UNWIND tasks AS task
RETURN task
 
==> +--------------------------------------------+
==> | task                                       |
==> +--------------------------------------------+
==> | "Cycle forward while sitting"              |
==> | "Controlled stop using brakes/feet"        |
==> | "Steer around stationary objects"          |
==> | "Steer around people"                      |
==> | "Navigate a loop of a section of the park" |
==> | "Be able to cycle through a public park"   |
==> +--------------------------------------------+
==> 6 rows

That’s all for now but I think this is an interesting way of tracking how you’d learn a skill. I’m trying a similar approach for some statistics topics I’m learning about but I’ve found the order of tasks isn’t so linear there – interestingly much more a graph than a tree.

Written by Mark Needham

April 7th, 2015 at 8:59 pm

Posted in neo4j

Tagged with

Neo4j: Cypher – Building the query for a movie’s profile page

with one comment

2015 04 01 12 11 51

Yesterday I spent the day in Berlin delivering a workshop as part of the Data Science Retreat and one of the exercises we did was write a query that would pull back all the information you’d need to create the IMDB page for a movie.

Scanning the page we can see that need to get some basic meta data including the title. Next we’ll need to pull in the actors, directors, producers and finally a recommendation for some other movies the viewer might like to see.

I’d struggle to be able to write this all in one go – it’s non trivial. However, if we break it down there are actually 5 simpler queries that we probably can write. Our final step is then to work out how to glue them all together.

Let’s get started.

If you want to follow along open up your Neo4j browser and type :play movies and import the built in data set.

We’re going to create the query for The Matrix home page so the first step is to find the node representing that movie in the database:

match (movie:Movie {title: "The Matrix"})
return movie.title
 
==> +--------------+
==> | movie.title  |
==> +--------------+
==> | "The Matrix" |
==> +--------------+
==> 1 row

Easy enough. Now let’s get back the producers:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
RETURN movie.title, COLLECT(producer.name) AS producers
 
==> +--------------------------------+
==> | movie.title  | producers       |
==> +--------------------------------+
==> | "The Matrix" | ["Joel Silver"] |
==> +--------------------------------+
==> 1 row

We’ve introduced the COLLECT function here as we want to ensure that our final result only has one row regardless of how many producers there are. COLLECT applies an implicit group by ‘movie.title’ and collects the producers for each movie (in this case just The Matrix) into an array.

We’ve used OPTIONAL MATCH *LINK* because we still want to return a row for the query even if it has no producers. In the case that there are no producers we’d hope to see an empty array.

Now let’s write the same query to pull back the directors of the movie:

match (movie:Movie {title: "The Matrix"})
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------+
==> | movie.title  | directors                           |
==> +----------------------------------------------------+
==> | "The Matrix" | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------+
==> 1 row

We really want to do both of these in one query so we get back a single result with 3 columns. To do that we’re going to introduce the WITH clause which allows us combine the results of traversals together.

In this case we’ll first do a traversal to get the producers, collect those into an array and then traverse out again to get the directors and collect those. This is what the query looks like:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, producers, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------------------------+
==> | movie.title  | producers       | directors                           |
==> +----------------------------------------------------------------------+
==> | "The Matrix" | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------------------------+
==> 1 row

We can follow the same pattern to return the actors:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
RETURN movie.title, COLLECT(actor.name) AS actors, producers, directors
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie.title  | actors                                                                                | producers       | directors                           |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "The Matrix" | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

So far, so good. We’ve got everything except the other movies recommendation which is a bit trickier so we’ll write it on its own first:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
RETURN otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | otherMovie                                                                                                        | score |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | Node[348]{title:"The Matrix Revolutions",released:2003,tagline:"Everything that has a beginning has an end"}      | 4     |
==> | Node[347]{title:"The Matrix Reloaded",released:2003,tagline:"Free your mind"}                                     | 4     |
==> | Node[490]{title:"Something's Gotta Give",released:2003}                                                           | 1     |
==> | Node[349]{title:"The Devil's Advocate",released:1997,tagline:"Evil has its winning ways"}                         | 1     |
==> | Node[438]{title:"Johnny Mnemonic",released:1995,tagline:"The hottest data on earth. In the coolest head in town"} | 1     |
==> | Node[443]{title:"Cloud Atlas",released:2012,tagline:"Everything is connected"}                                    | 1     |
==> | Node[452]{title:"V for Vendetta",released:2006,tagline:"Freedom! Forever!"}                                       | 1     |
==> | Node[425]{title:"The Replacements",released:2000,tagline:"Pain heals, Chicks dig scars... Glory lasts forever"}   | 1     |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> 8 rows

Our recommendation query finds all the actors in The Matrix and then traverses out to find other movies they’ve acted in and orders those movies based on how many of our actors appeared in them. Not surprisingly the other Matrix movies come out top.

In order to plug this into the rest of the query we need a single row to be returned i.e. our other movie suggestions need to be returned as an array rather than individual rows. Let’s do that:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN COLLECT({movie: otherMovie.title, score: score}) AS otherMovies
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "The Replacements", score -> 1}] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We’ve introduced a WITH clause for two reasons:

  1. To ensure the order of the movies based on highest score
  2. Because we can’t do an aggregation within an aggregation i.e. COLLECT(COUNT(…)) would be an illegal operation in Cypher.

Now we’re ready to plug this recommendation query into our main one:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
WITH  movie, COLLECT(actor.name) AS actors, producers, directors
optional match (movie)<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH movie, actors, producers, directors, otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN movie, actors, producers, directors,  
       COLLECT({movie: otherMovie.title, score: score}) AS recommended
 
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie                                                                           | actors                                                                                | producers       | directors                           | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[338]{title:"The Matrix",released:1999,tagline:"Welcome to the Real World"} | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "The Replacements", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1}] |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

Voila! 4 different types of data gathered and just one query to do it all.

For the eagle eyed cypher specialists (Hi Michael!), you’ll have noticed a bit of duplication in how we traverse out to the actors twice, once to retrieve them and once to make the movie recommendation.

We could optimise this by collecting the actors once and then using the UNWIND clause but that’s an optimisation which I think slightly obscures the intent of the query so I’ve left it like this for now.

Written by Mark Needham

April 1st, 2015 at 11:54 am

Posted in neo4j

Tagged with

Neo4j: Generating real time recommendations with Cypher

with one comment

One of the most common uses of Neo4j is for building real time recommendation engines and a common theme is that they make use of lots of different bits of data to come up with an interesting recommendation.

For example in this video Amanda shows how dating websites build real time recommendation engines by starting with social connections and then introducing passions, location and a few other things.

Graph Aware have a neat framework that helps you to build your own recommendation engine using Java and I was curious what a Cypher version would look like.

This is the sample graph:

CREATE
    (m:Person:Male {name:'Michal', age:30}),
    (d:Person:Female {name:'Daniela', age:20}),
    (v:Person:Male {name:'Vince', age:40}),
    (a:Person:Male {name:'Adam', age:30}),
    (l:Person:Female {name:'Luanne', age:25}),
    (c:Person:Male {name:'Christophe', age:60}),
 
    (lon:City {name:'London'}),
    (mum:City {name:'Mumbai'}),
 
    (m)-[:FRIEND_OF]->(d),
    (m)-[:FRIEND_OF]->(l),
    (m)-[:FRIEND_OF]->(a),
    (m)-[:FRIEND_OF]->(v),
    (d)-[:FRIEND_OF]->(v),
    (c)-[:FRIEND_OF]->(v),
    (d)-[:LIVES_IN]->(lon),
    (v)-[:LIVES_IN]->(lon),
    (m)-[:LIVES_IN]->(lon),
    (l)-[:LIVES_IN]->(mum);

We want to recommend some potential friends to ‘Adam’ so the first layer of our query is to find his friends of friends as there are bound to be some potential friends amongst them:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
RETURN me, potentialFriend, COUNT(*) AS friendsInCommon
 
==> +--------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | friendsInCommon |
==> +--------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 1               |
==> +--------------------------------------------------------------------------------------+
==> 3 rows

This query gives us back a list of potential friends and how many friends we have in common.

Now that we’ve got some potential friends let’s start building a ranking for each of them. One indicator which could weigh in favour of a potential friend is if they live in the same location as us so let’s add that to our query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation
 
==> +-----------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation |
==> +-----------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            |
==> +-----------------------------------------------------------------------------------+
==> 3 rows

Next we’ll check whether Adams’ potential friends have the same gender as him by comparing the labels each node has. We’ve got ‘Male’ and ‘Female’ labels which indicate gender.

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
        LABELS(me) = LABELS(potentialFriend) AS gender
 
==> +--------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | gender |
==> +--------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | true   |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | false  |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | false  |
==> +--------------------------------------------------------------------------------------------+
==> 3 rows

Next up let’s calculate the age different between Adam and his potential friends:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

Now let’s do some filtering to get rid of people that Adam is already friends with – there wouldn’t be much point in recommending those people!

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

In this case we haven’t actually filtered anyone out but for some of the other people we would see a reduction in the number of potential friends.

Our final step is to come up with a score for each of the features that we’ve identified as being important for making a friend suggestion.

We’ll assign a score of 10 if the people live in the same location or have the same gender as Adam and 0 if not. For the ageDifference and friendsInCommon we’ll apply apply a log curve so that those values don’t have a disproportional effect on our final score. We’ll use the formula defined in the ParetoScoreTransfomer to do this:

    public <OUT> float transform(OUT item, float score) {
        if (score < minimumThreshold) {
            return 0;
        }
 
        double alpha = Math.log((double) 5) / eightyPercentLevel;
        double exp = Math.exp(-alpha * score);
        return new Double(maxScore * (1 - exp)).floatValue();
    }

And now for our completed recommendation query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
WITH potentialFriend,
       // 100 -> maxScore, 10 -> eightyPercentLevel, friendsInCommon -> score (from the formula above)
       100 * (1 - exp((-1.0 * (log(5.0) / 10)) * friendsInCommon)) AS friendsInCommon,
       sameLocation * 10 AS sameLocation,
       -1 * (10 * (1 - exp((-1.0 * (log(5.0) / 20)) * ageDifference))) AS ageDifference,
       CASE WHEN gender THEN 10 ELSE 0 END as sameGender
 
RETURN potentialFriend,
      {friendsInCommon: friendsInCommon,
       sameLocation: sameLocation,
       ageDifference:ageDifference,
       sameGender: sameGender} AS parts,
     friendsInCommon + sameLocation + ageDifference + sameGender AS score
ORDER BY score DESC
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | potentialFriend                   | parts                                                                                                           | score             |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1006]{name:"Vince",age:40}   | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 10} | 19.33814370292112 |
==> | Node[1008]{name:"Luanne",age:25}  | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -3.312596950235779, sameGender -> 0} | 11.55341079768576 |
==> | Node[1005]{name:"Daniela",age:20} | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 0}  | 9.33814370292112  |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The final query isn’t too bad – the only really complex bit is the log curve calculation. This is where user defined functions will come into their own in the future.

The nice thing about this approach is that we don’t have to step outside of cypher so if you’re not comfortable with Java you can still do real time recommendations! On the other hand, the different parts of the recommendation engine all get mixed up so it’s not as easy to see the whole pipeline as if you use the graph aware framework.

The next step is to apply this to the Twitter graph and come up with follower recommendations on there.

Written by Mark Needham

March 27th, 2015 at 6:59 am

Posted in neo4j

Tagged with

Neo4j: Detecting potential typos using EXPLAIN

without comments

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out of the example queries.

For example one of the queries that we do about half way finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced and we can use this to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner makes use of knowledge that it has about our schema to come up with a plan that it would run and how much of the graph it thinks that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

2015 03 17 23 39 55

The first row of the query plan describes an all nodes scan which tells us that the query will start from the ‘director’ but it’s the second row that’s interesting.

The estimated rows when expanding the ‘DIRECTED_IN’ relationship is 0 when we’d expect it to at least be a positive value if there were some instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

2015 03 17 23 43 11

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.

In summary if you find your query not returning anything when you expect it to, prefix an ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 expected rows’.

Written by Mark Needham

March 17th, 2015 at 10:46 pm

Posted in neo4j

Tagged with