Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ tag

Neo4j: Finding all shortest paths

without comments

One of the Cypher language features we show in Neo4j training courses is the shortest path function which allows you to find the shortest path in terms of number of relationships between two nodes.

Using the movie graph, which you can import via the ‘:play movies’ command in the browser, we’ll first create a ‘KNOWS’ relationship between any people that have appeared in the same movie:

MATCH (p1:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
MERGE (p1)-[:KNOWS]-(p2)

Now that we’ve got that relationship we can easily find the shortest path between two people, say Tom Cruise and Tom Hanks:

MATCH (p1:Person {name: "Tom Hanks"}), (p2:Person {name: "Tom Cruise"}),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
Graph  18

That works pretty well but what if we want to find the longest shortest path between any two people in the graph? We can calculate it like this:

MATCH (p1:Person), (p2:Person),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1
Graph  19

So that’s 6 hops which is actually the Bacon number – I expect we’d probably see a smaller maximum value if we imported all the movies.

And to round off the post what if we want to find the longest shortest path between the 10 people who acted in the most movies? We might start out with the following query which seems like it should do the job:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH p1 AS p1, p1 AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Unfortunately if we run that query we get no rows returned because ‘p1′ and ‘p2′ always refer to the same node.

Instead we can calculate the shortest path between our hardest working people by creating a cross product using COLLECT and UNWIND:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH COLLECT(p1) AS ps
UNWIND ps AS p1 UNWIND ps AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Graph  20

That’s all for now!

Written by Mark Needham

May 19th, 2015 at 10:45 pm

Posted in neo4j

Tagged with ,

Neo4j: Refactoring the BBC football live text fouls graph

without comments

Yesterday I wrote about a Neo4j graph I’ve started building which contains all the fouls committed in the Champions League game between Barcelona & Bayern Munich and surrounding meta data.

While adding other events into the graph I realised that I’d added some duplication in the model and the model could do with some refactoring to make it easier to use.

To recap, this is the model that we designed in the previous blog post:

The duplication is on the left hand side of the model – we model a foul as being committed by one player against another and then hook the foul back into the match. By doing that we’re not using the ‘appearance’ concept which links a player and a match together.

We can make the ‘COMMITTED_IN_MATCH’ relationship redundant by connecting the foul to appearance rather than to player. The match the foul was committed in can then be found by navigating through the appearance node.

This is what we want the graph to look like:

2015 05 17 10 40 44

We’ll move towards this new model in 3 steps:

  • Introduce the new structure alongside the existing one
  • Rewrite our queries to use the new structure
  • Remove the old structure

Introducing the new structure

First up let’s write a query to introduce the new structure.

match (foul:Foul)-[:COMMITTED_AGAINST]->(fouledPlayer),
      (foul)<-[:COMMITTED_FOUL]-(foulingPlayer),
      (foul)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"}),
      (foulingPlayer)-[:MADE_APPEARANCE]-(foulingPlayerApp)-[:IN_MATCH]->(match),
      (fouledPlayer)-[:MADE_APPEARANCE]-(fouledPlayerApp)-[:IN_MATCH]->(match)
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayerApp)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayerApp)

Remember we’re not going to delete the old structure yet so that’s why there aren’t any delete statements in here.

Rewriting our queries

Now we need to update our queries to work against the new graph structure:

Where do the fouls happen?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-()<-[]-(foul:Foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

Who fouls the most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

Who was fouled the most?

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[r:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)<-[:COMMITTED_AGAINST]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

Who fouled who the most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)-[:COMMITTED_AGAINST]->(fouled:Player),
      (foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"}),
      (match)<-[:IN_MATCH]-(fouledApp)<-[:COMMITTED_AGAINST]->(foul:Foul)<-[:COMMITTED_FOUL]-(foulerApp)-[:IN_MATCH]->(match),
      (fouledApp)<-[:MADE_APPEARANCE]-(fouled),
      (foulerApp)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

Which team fouled most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-()<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

Worst fouler for each team

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouler)-[:MADE_APPEARANCE]->(app)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

Most fouled against for each team

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)<-[:COMMITTED_AGAINST]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouled)-[:MADE_APPEARANCE]->(app)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls

The early queries are made more complicated by the refactoring but the latter ones are slightly simpler. I think we need to hook some more events onto the appearance node to see whether this refactoring is worthwhile or not.

Removing the old structure

Holding judgement for now, let’s look at how we’d remove the old structure – the final step in this refactoring:

match (match:Match {id: "32683310"})<-[oldRel:COMMITTED_IN_MATCH]-(foul:Foul)
DELETE oldRel
match (player:Player)<-[oldRel:COMMITTED_AGAINST]-(foul:Foul)
DELETE oldRel
match (player:Player)-[oldRel:COMMITTED_FOUL]->(foul:Foul)
DELETE oldRel

Hopefully you can see how you’d go about refactoring your own graph if you realise the model isn’t quite what you want.

Any questions/thoughts/suggestions let me know!

Written by Mark Needham

May 17th, 2015 at 11:04 am

Posted in neo4j

Tagged with

Neo4j: BBC football live text fouls graph

without comments

I recently came across the Partially Derivative podcast and in episode 17 they describe how Kirk Goldsberry scraped a bunch of data about shots in basketball matches then ran some analysis on that data.

It got me thinking that we might be able to do something similar for football matches and although event based data for football matches only comes from Opta, the BBC does expose some of them in live text feeds.

We’ll start with the Champions League match between Barcelona and Bayern Munich from last Tuesday.

2015 05 16 23 10 43

Our first task is to extract the events that happened in the match along with the players involved. After we’ve got that we’ll generate a Neo4j graph and see if we can find some interesting insights.

I find the feedback cycle with this type of work is dramatically improved if we have the source data available locally so the first step was to get the BBC web page downloaded:

$ wget http://www.bbc.co.uk/sport/0/football/32683310

Next we need to write a scraper which will extract all the events. We want to get an array containing one entry for each event, where the following is an example of an event:

2015 05 16 22 19 00

HTML-wise it looks like this:

2015 05 16 22 20 28

4709393221 bddd85c64e z

Image courtesy of William Brawley

I do most of my scraping work in Python so I used the Beautiful Soup library with the soupselect wrapper to get the data into CSV format ready to import into Neo4j.

It was mostly a straight forward job of finding the appropriate CSS tag and pulling out the values although the way fouls are described in the page is a bit strange – sometimes the person fouled comes first row and the fouler comes on the next line and sometimes vice versa.

Luckily the two parts of the foul can be joined together by matching the time which made life easier.

The full code for the scrapper is on github if you want to play with it.

This is what the resulting CSV file looks like:

$ head -n 10 data/events.csv 
matchId,foulId,freeKickId,time,foulLocation,fouledPlayer,fouledPlayerTeam,foulingPlayer,foulingPlayerTeam
32683310,3,2,90:00 +0:40,in the defensive half.,Xabi Alonso,FC Bayern München,Pedro,Barcelona
32683310,9,8,84:38,on the right wing.,Rafinha,FC Bayern München,Pedro,Barcelona
32683310,12,13,83:17,in the attacking half.,Lionel Messi,Barcelona,Sebastian Rode,FC Bayern München
32683310,15,14,82:43,in the defensive half.,Sebastian Rode,FC Bayern München,Neymar,Barcelona
32683310,17,18,80:41,in the attacking half.,Pedro,Barcelona,Xabi Alonso,FC Bayern München
32683310,22,23,76:31,in the defensive half.,Neymar,Barcelona,Rafinha,FC Bayern München
32683310,25,26,75:03,in the attacking half.,Lionel Messi,Barcelona,Xabi Alonso,FC Bayern München
32683310,31,30,69:37,in the attacking half.,Bastian Schweinsteiger,FC Bayern München,Dani Alves,Barcelona
32683310,36,35,63:27,in the attacking half.,Robert Lewandowski,FC Bayern München,Ivan Rakitic,Barcelona

Now it’s time to create a graph. We’ll aim to massage the data into this model:

2015 05 16 22 50 32

Next we need to write some Cypher code to get the CSV data into the graph. The full script is here, a sample of which is below:

// match
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Match {id: row.matchId});
 
// teams
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.foulingPlayerTeam});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.fouledPlayerTeam});
 
// players
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
ON CREATE SET player.name = row.foulingPlayer;
 
// appearances
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (match:Match {id: row.matchId})
MATCH (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
MATCH (team:Team {name: row.foulingPlayerTeam})
 
MERGE (appearance:Appearance {id: player.id + " in " + row.matchId})
MERGE (player)-[:MADE_APPEARANCE]->(appearance)
MERGE (appearance)-[:IN_MATCH]->(match)
MERGE (appearance)-[:FOR_TEAM]->(team);
 
// fouls
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
 
MATCH (foulingPlayer:Player {id:row.foulingPlayer + "_" + row.foulingPlayerTeam })
MATCH (fouledPlayer:Player {id:row.fouledPlayer + "_" + row.fouledPlayerTeam })
MATCH (match:Match {id: row.matchId})
 
MERGE (foul:Foul {eventId: row.foulId})
ON CREATE SET foul.time = row.time, foul.location = row.foulLocation
 
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayer)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayer)
MERGE (foul)-[:COMMITTED_IN_MATCH]->(match);

We’ll use neo4j-shell to execute the script:

$ ./neo4j-community-2.2.1/bin/neo4j-shell --file import.cql

Now that we’ve got the data into Neo4j we need to come up with some questions to ask of it. I came up with the following but perhaps you can think of some others!

  • Where do the fouls happen on the pitch?
  • Who made the most fouls?
  • Who was fouled the most?
  • Who fouled who the most?
  • Which team fouled the most?
  • Who’s the worst fouler in each team?
  • Who’s the most fouled in each team?

Where do the fouls happen?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC;
 
+----------------------------------+
| location                 | fouls |
+----------------------------------+
| "in the defensive half." | 12    |
| "in the attacking half." | 12    |
| "on the right wing."     | 3     |
| "on the left wing."      | 3     |
+----------------------------------+
4 rows

Who fouls the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+------------------------------+
| fouler               | fouls |
+------------------------------+
| "Rafinha"            | 4     |
| "Pedro"              | 3     |
| "Medhi Benatia"      | 3     |
| "Dani Alves"         | 3     |
| "Xabi Alonso"        | 3     |
| "Javier Mascherano"  | 2     |
| "Thiago Alcántara"   | 2     |
| "Robert Lewandowski" | 2     |
| "Sebastian Rode"     | 1     |
| "Sergio Busquets"    | 1     |
+------------------------------+
10 rows

Who was fouled the most?

// who was fouled the most
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+----------------------------------+
| fouled                   | fouls |
+----------------------------------+
| "Robert Lewandowski"     | 4     |
| "Lionel Messi"           | 4     |
| "Neymar"                 | 3     |
| "Pedro"                  | 2     |
| "Xabi Alonso"            | 2     |
| "Andrés Iniesta"         | 2     |
| "Rafinha"                | 2     |
| "Bastian Schweinsteiger" | 2     |
| "Sebastian Rode"         | 1     |
| "Sergio Busquets"        | 1     |
+----------------------------------+
10 rows

Who fouled who the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled),
      (foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+--------------------------------------------------------+
| fouler              | fouled                   | fouls |
+--------------------------------------------------------+
| "Javier Mascherano" | "Robert Lewandowski"     | 2     |
| "Dani Alves"        | "Bastian Schweinsteiger" | 2     |
| "Xabi Alonso"       | "Lionel Messi"           | 2     |
| "Rafinha"           | "Neymar"                 | 2     |
| "Rafinha"           | "Andrés Iniesta"         | 2     |
| "Dani Alves"        | "Xabi Alonso"            | 1     |
| "Thiago Alcántara"  | "Javier Mascherano"      | 1     |
| "Pedro"             | "Juan Bernat"            | 1     |
| "Medhi Benatia"     | "Pedro"                  | 1     |
| "Neymar"            | "Sebastian Rode"         | 1     |
+--------------------------------------------------------+
10 rows

Which team fouled the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+-----------------------------+
| team.name           | fouls |
+-----------------------------+
| "FC Bayern München" | 18    |
| "Barcelona"         | 12    |
+-----------------------------+
2 rows

Worst fouler for each team?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouler.fouler.name | topFouler.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Rafinha"             | 4               |
| "Barcelona"         | "Pedro"               | 3               |
+---------------------------------------------------------------+
2 rows

Most fouled against for each team

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
      (fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouled.fouled.name | topFouled.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Robert Lewandowski"  | 4               |
| "Barcelona"         | "Lionel Messi"        | 4               |
+---------------------------------------------------------------+
2 rows

So Bayern fouled a bit more than Barca, the main forwards for each team (Messi/Lewandowski) were the most fouled players on the pitch and the fouling was mostly in the middle of the pitch.

I expect this graph will become much more interesting to query with more matches and with the other event types as well but I haven’t got those scraped yet. The code is on github if you want to play around with it and perhaps get the other events into the graph.

Written by Mark Needham

May 16th, 2015 at 9:13 pm

Posted in neo4j

Tagged with

Neo4j: LOAD CSV – java.io.InputStreamReader there’s a field starting with a quote and whereas it ends that quote there seems to be character in that field after that ending quote. That isn’t supported.

without comments

I recently came across the last.fm dataset via Ben Frederickson’s blog and thought it’d be an interesting one to load into Neo4j and explore.

I started with a simple query to parse the CSV file and count the number of rows:

LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv" 
AS row FIELDTERMINATOR  "\t"
return COUNT(*)
 
At java.io.InputStreamReader@4d307fda:6484 there's a field starting with a quote and whereas it ends that quote there seems  to be character in that field after that ending quote. That isn't supported. This is what I read: 'weird al"'

This blows up because (as the message says) we’ve got a field which uses double quotes but then has other characters either side of the quotes.

A quick search through the file reveals one of the troublesome lines:

$ grep "\"weird" lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv  | head -n 1
0015371426d2cbef354b2f680340de38d0ebd2f0	7746d775-9550-4360-b8d5-c37bd448ce01	"weird al" yankovic	4099

I ran a file containing only that line through CSV Lint to see what it thought and indeed it is invalid:

2015 05 04 10 50 43

Let’s clean up our file to use single quotes instead of double quotes and try the query again:

$ tr "\"" "'" < lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv > lastfm-dataset-360K/clean.tsv
LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/clean.tsv" as row FIELDTERMINATOR  "\t"
return COUNT(*)
 
17559530

And we’re back in business! Interestingly Python’s CSV reader chooses to strip out the double quotes rather than throw an exception:

import csv
with open("smallWeird.tsv", "r") as file:
    reader = csv.reader(file, delimiter="\t")
 
    for row in reader:
        print row
$ python explore.py
['0015371426d2cbef354b2f680340de38d0ebd2f0', '7746d775-9550-4360-b8d5-c37bd448ce01', 'weird al yankovic', '4099']

I prefer LOAD CSV’s approach but it’s an interesting trade off I hadn’t considred before.

Written by Mark Needham

May 4th, 2015 at 9:56 am

Posted in neo4j

Tagged with ,

Spark: Generating CSV files to import into Neo4j

without comments

About a year ago Ian pointed me at a Chicago Crime data set which seemed like a good fit for Neo4j and after much procrastination I’ve finally got around to importing it.

The data set covers crimes committed from 2001 until now. It contains around 4 million crimes and meta data around those crimes such as the location, type of crime and year to name a few.

The contents of the file follow this structure:

$ head -n 10 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,"(41.75017626412204, -87.55494559131228)"
9460704,HX113741,01/14/2014 04:55:00 AM,091XX S JEFFERY AVE,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,false,false,0413,004,8,48,03,1191060,1844959,2014,01/18/2014 12:39:56 AM,41.729576153145636,-87.57568059471686,"(41.729576153145636, -87.57568059471686)"
9460339,HX113740,01/14/2014 04:44:00 AM,040XX W MAYPOLE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,false,true,1114,011,28,26,14,1149075,1901099,2014,01/16/2014 12:40:00 AM,41.884543798701515,-87.72803579358926,"(41.884543798701515, -87.72803579358926)"
9461467,HX114463,01/14/2014 04:43:00 AM,059XX S CICERO AVE,0820,THEFT,$500 AND UNDER,PARKING LOT/GARAGE(NON.RESID.),false,false,0813,008,13,64,06,1145661,1865031,2014,01/16/2014 12:40:00 AM,41.785633535413176,-87.74148516669783,"(41.785633535413176, -87.74148516669783)"
9460355,HX113738,01/14/2014 04:21:00 AM,070XX S PEORIA ST,0820,THEFT,$500 AND UNDER,STREET,true,false,0733,007,17,68,06,1171480,1858195,2014,01/16/2014 12:40:00 AM,41.766348042591375,-87.64702037047671,"(41.766348042591375, -87.64702037047671)"
9461140,HX113909,01/14/2014 03:17:00 AM,016XX W HUBBARD ST,0610,BURGLARY,FORCIBLE ENTRY,COMMERCIAL / BUSINESS OFFICE,false,false,1215,012,27,24,05,1165029,1903111,2014,01/16/2014 12:40:00 AM,41.889741146006095,-87.66939334853973,"(41.889741146006095, -87.66939334853973)"
9460361,HX113731,01/14/2014 03:12:00 AM,022XX S WENTWORTH AVE,0820,THEFT,$500 AND UNDER,CTA TRAIN,false,false,0914,009,25,34,06,1175363,1889525,2014,01/20/2014 12:40:05 AM,41.85223460427207,-87.63185047834335,"(41.85223460427207, -87.63185047834335)"
9461691,HX114506,01/14/2014 03:00:00 AM,087XX S COLFAX AVE,0650,BURGLARY,HOME INVASION,RESIDENCE,false,false,0423,004,7,46,05,1195052,1847362,2014,01/17/2014 12:40:17 AM,41.73607283858007,-87.56097809501115,"(41.73607283858007, -87.56097809501115)"
9461792,HX114824,01/14/2014 03:00:00 AM,012XX S CALIFORNIA BLVD,0810,THEFT,OVER $500,STREET,false,false,1023,010,28,29,06,1157929,1894034,2014,01/17/2014 12:40:17 AM,41.86498077118534,-87.69571529596696,"(41.86498077118534, -87.69571529596696)"

Since I wanted to import this into Neo4j I needed to do some massaging of the data since the neo4j-import tool expects to receive CSV files containing the nodes and relationships we want to create.

Spark logo 192x100px

I’d been looking at Spark towards the end of last year and the pre-processing of the big initial file into smaller CSV files containing nodes and relationships seemed like a good fit.

I therefore needed to create a Spark job to do this. We’ll then pass this job to a Spark executor running locally and it will spit out CSV files.

2015 04 15 00 51 42

We start by creating a Scala object with a main method that will contain our processing code. Inside that main method we’ll instantiate a Spark context:

import org.apache.spark.{SparkConf, SparkContext}
 
object GenerateCSVFiles {  
    def main(args: Array[String]) {    
        val conf = new SparkConf().setAppName("Chicago Crime Dataset")    
        val sc = new SparkContext(conf)  
    }
}

Easy enough. Next we’ll read in the CSV file. I found the easiest way to reference this was with an environment variable but perhaps there’s a more idiomatic way:

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}
 
object GenerateCSVFiles {
  def main(args: Array[String]) {
    var crimeFile = System.getenv("CSV_FILE")
 
    if(crimeFile == null || !new File(crimeFile).exists()) {
      throw new RuntimeException("Cannot find CSV file [" + crimeFile + "]")
    }
 
    println("Using %s".format(crimeFile))
 
    val conf = new SparkConf().setAppName("Chicago Crime Dataset")
 
    val sc = new SparkContext(conf)
    val crimeData = sc.textFile(crimeFile).cache()
}

The type of crimeData is RDD[String] – Spark’s way of representing the (lazily evaluated) lines of the CSV file. This also includes the header of the file so let’s write a function to get rid of that since we’ll be generating our own headers for the different files:

import org.apache.spark.rdd.RDD
 
// http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAEYYnxYuEaie518ODdn-fR7VvD39d71=CgB_Dxw_4COVXgmYYQ@mail.gmail.com%3E
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    if (idx == 0) {
      lines.drop(1)
    }
    lines
  })
}

Now we’re ready to start generating our new CSV files so we’ll write a function which parses each line and extracts the appropriate columns. I’m using Open CSV for this:

import au.com.bytecode.opencsv.CSVParser
 
def generateFile(file: String, withoutHeader: RDD[String], fn: Array[String] => Array[String], header: String , distinct:Boolean = true, separator: String = ",") = {
  FileUtil.fullyDelete(new File(file))
 
  val tmpFile = "/tmp/" + System.currentTimeMillis() + "-" + file
  val rows: RDD[String] = withoutHeader.mapPartitions(lines => {
    val parser = new CSVParser(',')
    lines.map(line => {
      val columns = parser.parseLine(line)
      fn(columns).mkString(separator)
    })
  })
 
  if (distinct) rows.distinct() saveAsTextFile tmpFile else rows.saveAsTextFile(tmpFile)
}

We then call this function like this:

generateFile("/tmp/crimes.csv", withoutHeader, columns => Array(columns(0),"Crime", columns(2), columns(6)), "id:ID(Crime),:LABEL,date,description", false)

The output into ‘tmpFile’ is actually 32 ‘part files’ but I wanted to be able to merge those together into individual CSV files that were easier to work with.

I won’t paste the the full job here but if you want to take a look it’s on github.

Now we need to submit the job to Spark. I’ve wrapped this in a script if you want to follow along but these are the contents:

./spark-1.1.0-bin-hadoop1/bin/spark-submit \
--driver-memory 5g \
--class GenerateCSVFiles \
--master local[8] \ 
target/scala-2.10/playground_2.10-1.0.jar \
$@

If we execute that we’ll see the following output…”

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Crimes_-_2001_to_present.csv
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/04/15 00:31:44 INFO SparkContext: Running Spark version 1.3.0
...
15/04/15 00:47:26 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/04/15 00:47:26 INFO DAGScheduler: Stage 8 (saveAsTextFile at GenerateCSVFiles.scala:51) finished in 2.702 s
15/04/15 00:47:26 INFO DAGScheduler: Job 4 finished: saveAsTextFile at GenerateCSVFiles.scala:51, took 8.715588 s
 
real	0m44.935s
user	4m2.259s
sys	0m14.159s

and these CSV files will be generated:

$ ls -alh /tmp/*.csv
-rwxrwxrwx  1 markneedham  wheel   3.0K 14 Apr 07:37 /tmp/beats.csv
-rwxrwxrwx  1 markneedham  wheel   217M 14 Apr 07:37 /tmp/crimes.csv
-rwxrwxrwx  1 markneedham  wheel    84M 14 Apr 07:37 /tmp/crimesBeats.csv
-rwxrwxrwx  1 markneedham  wheel   120M 14 Apr 07:37 /tmp/crimesPrimaryTypes.csv
-rwxrwxrwx  1 markneedham  wheel   912B 14 Apr 07:37 /tmp/primaryTypes.csv

Let’s have a quick check what they contain:

$ head -n 10 /tmp/beats.csv
id:ID(Beat),:LABEL
1135,Beat
1421,Beat
2312,Beat
1113,Beat
1014,Beat
2411,Beat
1333,Beat
2521,Beat
1652,Beat
$ head -n 10 /tmp/crimes.csv
id:ID(Crime),:LABEL,date,description
9464711,Crime,01/14/2014 05:00:00 AM,SIMPLE
9460704,Crime,01/14/2014 04:55:00 AM,ARMED: HANDGUN
9460339,Crime,01/14/2014 04:44:00 AM,TO PROPERTY
9461467,Crime,01/14/2014 04:43:00 AM,$500 AND UNDER
9460355,Crime,01/14/2014 04:21:00 AM,$500 AND UNDER
9461140,Crime,01/14/2014 03:17:00 AM,FORCIBLE ENTRY
9460361,Crime,01/14/2014 03:12:00 AM,$500 AND UNDER
9461691,Crime,01/14/2014 03:00:00 AM,HOME INVASION
9461792,Crime,01/14/2014 03:00:00 AM,OVER $500
$ head -n 10 /tmp/crimesBeats.csv
:START_ID(Crime),:END_ID(Beat),:TYPE
5896915,0733,ON_BEAT
9208776,2232,ON_BEAT
8237555,0111,ON_BEAT
6464775,0322,ON_BEAT
6468868,0411,ON_BEAT
4189649,0524,ON_BEAT
7620897,0421,ON_BEAT
7720402,0321,ON_BEAT
5053025,1115,ON_BEAT

Looking good. Let’s get them imported into Neo4j:

$ ./neo4j-community-2.2.0/bin/neo4j-import --into /tmp/my-neo --nodes /tmp/crimes.csv --nodes /tmp/beats.csv --nodes /tmp/primaryTypes.csv --relationships /tmp/crimesBeats.csv --relationships /tmp/crimesPrimaryTypes.csv
Nodes
[*>:45.76 MB/s----------------------------------|PROPERTIES(2)=============|NODE:3|v:118.05 MB/]  4M
Done in 5s 605ms
Prepare node index
[*RESOLVE:64.85 MB-----------------------------------------------------------------------------]  4M
Done in 4s 930ms
Calculate dense nodes
[>:42.33 MB/s-------------------|*PREPARE(7)===================================|CALCULATOR-----]  8M
Done in 5s 417ms
Relationships
[>:42.33 MB/s-------------|*PREPARE(7)==========================|RELATIONSHIP------------|v:44.]  8M
Done in 6s 62ms
Node --> Relationship
[*>:??-----------------------------------------------------------------------------------------]  4M
Done in 324ms
Relationship --> Relationship
[*LINK-----------------------------------------------------------------------------------------]  8M
Done in 1s 984ms
Node counts
[*>:??-----------------------------------------------------------------------------------------]  4M
Done in 360ms
Relationship counts
[*>:??-----------------------------------------------------------------------------------------]  8M
Done in 653ms
 
IMPORT DONE in 26s 517ms

Next I updated conf/neo4j-server.properties to point to my new database:

#***************************************************************
# Server configuration
#***************************************************************
 
# location of the database directory
#org.neo4j.server.database.location=data/graph.db
org.neo4j.server.database.location=/tmp/my-neo

Now I can start up Neo and start exploring the data:

$ ./neo4j-community-2.2.0/bin/neo4j start
MATCH (:Crime)-[r:CRIME_TYPE]->() 
RETURN r 
LIMIT 10
Graph  15

There’s lots more relationships and entities that we could pull out of this data set – what I’ve done is just a start. So if you’re up for some more Chicago crime exploration the code and instructions explaining how to run it are on github.

Written by Mark Needham

April 14th, 2015 at 10:56 pm

Posted in Spark

Tagged with ,

Neo4j: The learning to cycle dependency graph

without comments

Over the past couple of weeks I’ve been reading about skill building and the break down of skills into more manageable chunks, and recently had a chance to break down the skills required to learn to cycle.

I initially sketched out the skill progression but quickly realised I had drawn a dependency graph and thought that putting it into Neo4j would simplify things.

I started out with the overall goal for cycling which was to ‘Be able to cycle through a public park':

MERGE (:Goal:Task {name: "Be able to cycle through a public park"})

This goal is easy for someone who’s already learnt to cycle but if we’re starting from scratch it’s a bit daunting so we need to break it down into a simpler skill that we can practice.

The mini algorithm that we’re going to employ for task breakdown is this:

  1. Can we do the given task now?
  2. Break the task down into something simpler and return to 1.

One of the things to keep in mind is that we won’t get the break down perfect the first time so we may need to change it. For a diagram drawn on a piece of paper this would be annoying but in Neo4j it’s just a simpler refactoring.

Going back to cycling. Since the goal isn’t yet achievable we need to break that down into something a bit easier. Let’s start with something really simple:

MERGE (task:Task {name: "Take a few steps forward while standing over the bike"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MERGE (goal)-[:DEPENDS_ON]->(task)

In the first line we create our new task and then we connect it to our goal which we created earlier.

Graph  9

After we’ve got the hang of walking with the bike we want to get comfortable with cycling forward a few rotations while sitting on the bike but to do that we need to be able to get the bike moving from a standing start. We might also have another step where we cycle forward while standing on the bike as that might be slightly easier.

Let’s update our graph:

// First let's get rid of the relationship between our initial task and the goal
MATCH (initialTask:Task {name: "Take a few steps forward while standing over the bike"})
MATCH (goal:Goal {name: "Be able to cycle through a public park"})
MATCH (goal)-[rel:DEPENDS_ON]->(initialTask)
DELETE rel
 
WITH initialTask, goal, ["Get bike moving from standing start", "Cycle forward while standing", "Cycle forward while sitting"] AS newTasks
 
// Create some nodes for our new tasks
UNWIND newTasks AS newTask
MERGE (t:Task {name: newTask})
WITH initialTask, goal, COLLECT(t) AS newTasks
WITH initialTask, goal, newTasks, newTasks[0] AS firstTask, newTasks[-1] AS lastTask
 
// Connect the last task to the goal
MERGE (goal)-[:DEPENDS_ON]->(lastTask)
 
// And the first task to our initial task
MERGE (firstTask)-[:DEPENDS_ON]->(initialTask)
 
// And all the tasks to each other
FOREACH(i in RANGE(0, length(newTasks) - 2) |
  FOREACH(t1 in [newTasks[i]] | FOREACH(t2 in [newTasks[i+1]] |
    MERGE (t2)-[:DEPENDS_ON]->(t1) 
)))
Graph  10

We don’t strictly need to learn how to cycle while standing up – we could just go straight from getting the bike moving to cycling forward while sitting. Let’s update the graph to reflect that:

MATCH (sitting:Task {name: "Cycle forward while sitting"})
MATCH (moving:Task {name: "Get bike moving from standing start"})
MERGE (sitting)-[:DEPENDS_ON]->(moving)

Graph  11

Once we’ve got the hang of those tasks let’s add in a few more to get us closer to our goal:

WITH [
  {skill: "Controlled stop using brakes/feet", dependsOn: "Cycle forward while sitting"},
  {skill: "Steer around stationary objects", dependsOn: "Controlled stop using brakes/feet"},
  {skill: "Steer around people", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a small circular circuit", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Navigate a small circular circuit"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Steer around people"},
  {skill: "Be able to cycle through a public park", dependsOn: "Navigate a loop of a section of the park"}
 
] AS newTasks
 
FOREACH(newTask in newTasks |
  MERGE (t1:Task {name: newTask.skill})   
  MERGE (t2:Task {name: newTask.dependsOn})
  MERGE (t1)-[:DEPENDS_ON]->(t2)
)

Finally let’s get rid of the relationship from our goal to ‘Cycle forward while sitting’ since we’ve replaced that with some intermediate steps:

MATCH (task:Task {name: "Cycle forward while sitting"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MERGE (goal)-[rel:DEPENDS_ON]->(task)
DELETE rel

And here’s what the final dependency graph looks like:

Graph  13

Although I put this into Neo4j in order to visualise the dependencies we can now query the data as well. For example, let’s say I know how to cycle forward while sitting on the bike. What steps are there between me and being able to cycle around a park?

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((g)-[:DEPENDS_ON*]->(t))
RETURN path

Graph  14

Or if we want a list of the tasks we need to do next we could restructure the query slightly:

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((t)<-[:DEPENDS_ON*]->(g))
WITH [n in nodes(path) | n.name] AS tasks
UNWIND tasks AS task
RETURN task
 
==> +--------------------------------------------+
==> | task                                       |
==> +--------------------------------------------+
==> | "Cycle forward while sitting"              |
==> | "Controlled stop using brakes/feet"        |
==> | "Steer around stationary objects"          |
==> | "Steer around people"                      |
==> | "Navigate a loop of a section of the park" |
==> | "Be able to cycle through a public park"   |
==> +--------------------------------------------+
==> 6 rows

That’s all for now but I think this is an interesting way of tracking how you’d learn a skill. I’m trying a similar approach for some statistics topics I’m learning about but I’ve found the order of tasks isn’t so linear there – interestingly much more a graph than a tree.

Written by Mark Needham

April 7th, 2015 at 8:59 pm

Posted in neo4j

Tagged with

Neo4j: Cypher – Building the query for a movie’s profile page

with one comment

2015 04 01 12 11 51

Yesterday I spent the day in Berlin delivering a workshop as part of the Data Science Retreat and one of the exercises we did was write a query that would pull back all the information you’d need to create the IMDB page for a movie.

Scanning the page we can see that need to get some basic meta data including the title. Next we’ll need to pull in the actors, directors, producers and finally a recommendation for some other movies the viewer might like to see.

I’d struggle to be able to write this all in one go – it’s non trivial. However, if we break it down there are actually 5 simpler queries that we probably can write. Our final step is then to work out how to glue them all together.

Let’s get started.

If you want to follow along open up your Neo4j browser and type :play movies and import the built in data set.

We’re going to create the query for The Matrix home page so the first step is to find the node representing that movie in the database:

match (movie:Movie {title: "The Matrix"})
return movie.title
 
==> +--------------+
==> | movie.title  |
==> +--------------+
==> | "The Matrix" |
==> +--------------+
==> 1 row

Easy enough. Now let’s get back the producers:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
RETURN movie.title, COLLECT(producer.name) AS producers
 
==> +--------------------------------+
==> | movie.title  | producers       |
==> +--------------------------------+
==> | "The Matrix" | ["Joel Silver"] |
==> +--------------------------------+
==> 1 row

We’ve introduced the COLLECT function here as we want to ensure that our final result only has one row regardless of how many producers there are. COLLECT applies an implicit group by ‘movie.title’ and collects the producers for each movie (in this case just The Matrix) into an array.

We’ve used OPTIONAL MATCH *LINK* because we still want to return a row for the query even if it has no producers. In the case that there are no producers we’d hope to see an empty array.

Now let’s write the same query to pull back the directors of the movie:

match (movie:Movie {title: "The Matrix"})
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------+
==> | movie.title  | directors                           |
==> +----------------------------------------------------+
==> | "The Matrix" | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------+
==> 1 row

We really want to do both of these in one query so we get back a single result with 3 columns. To do that we’re going to introduce the WITH clause which allows us combine the results of traversals together.

In this case we’ll first do a traversal to get the producers, collect those into an array and then traverse out again to get the directors and collect those. This is what the query looks like:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, producers, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------------------------+
==> | movie.title  | producers       | directors                           |
==> +----------------------------------------------------------------------+
==> | "The Matrix" | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------------------------+
==> 1 row

We can follow the same pattern to return the actors:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
RETURN movie.title, COLLECT(actor.name) AS actors, producers, directors
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie.title  | actors                                                                                | producers       | directors                           |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "The Matrix" | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

So far, so good. We’ve got everything except the other movies recommendation which is a bit trickier so we’ll write it on its own first:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
RETURN otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | otherMovie                                                                                                        | score |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | Node[348]{title:"The Matrix Revolutions",released:2003,tagline:"Everything that has a beginning has an end"}      | 4     |
==> | Node[347]{title:"The Matrix Reloaded",released:2003,tagline:"Free your mind"}                                     | 4     |
==> | Node[490]{title:"Something's Gotta Give",released:2003}                                                           | 1     |
==> | Node[349]{title:"The Devil's Advocate",released:1997,tagline:"Evil has its winning ways"}                         | 1     |
==> | Node[438]{title:"Johnny Mnemonic",released:1995,tagline:"The hottest data on earth. In the coolest head in town"} | 1     |
==> | Node[443]{title:"Cloud Atlas",released:2012,tagline:"Everything is connected"}                                    | 1     |
==> | Node[452]{title:"V for Vendetta",released:2006,tagline:"Freedom! Forever!"}                                       | 1     |
==> | Node[425]{title:"The Replacements",released:2000,tagline:"Pain heals, Chicks dig scars... Glory lasts forever"}   | 1     |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> 8 rows

Our recommendation query finds all the actors in The Matrix and then traverses out to find other movies they’ve acted in and orders those movies based on how many of our actors appeared in them. Not surprisingly the other Matrix movies come out top.

In order to plug this into the rest of the query we need a single row to be returned i.e. our other movie suggestions need to be returned as an array rather than individual rows. Let’s do that:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN COLLECT({movie: otherMovie.title, score: score}) AS otherMovies
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "The Replacements", score -> 1}] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We’ve introduced a WITH clause for two reasons:

  1. To ensure the order of the movies based on highest score
  2. Because we can’t do an aggregation within an aggregation i.e. COLLECT(COUNT(…)) would be an illegal operation in Cypher.

Now we’re ready to plug this recommendation query into our main one:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
WITH  movie, COLLECT(actor.name) AS actors, producers, directors
optional match (movie)<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH movie, actors, producers, directors, otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN movie, actors, producers, directors,  
       COLLECT({movie: otherMovie.title, score: score}) AS recommended
 
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie                                                                           | actors                                                                                | producers       | directors                           | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[338]{title:"The Matrix",released:1999,tagline:"Welcome to the Real World"} | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "The Replacements", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1}] |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

Voila! 4 different types of data gathered and just one query to do it all.

For the eagle eyed cypher specialists (Hi Michael!), you’ll have noticed a bit of duplication in how we traverse out to the actors twice, once to retrieve them and once to make the movie recommendation.

We could optimise this by collecting the actors once and then using the UNWIND clause but that’s an optimisation which I think slightly obscures the intent of the query so I’ve left it like this for now.

Written by Mark Needham

April 1st, 2015 at 11:54 am

Posted in neo4j

Tagged with

Neo4j: Generating real time recommendations with Cypher

with one comment

One of the most common uses of Neo4j is for building real time recommendation engines and a common theme is that they make use of lots of different bits of data to come up with an interesting recommendation.

For example in this video Amanda shows how dating websites build real time recommendation engines by starting with social connections and then introducing passions, location and a few other things.

Graph Aware have a neat framework that helps you to build your own recommendation engine using Java and I was curious what a Cypher version would look like.

This is the sample graph:

CREATE
    (m:Person:Male {name:'Michal', age:30}),
    (d:Person:Female {name:'Daniela', age:20}),
    (v:Person:Male {name:'Vince', age:40}),
    (a:Person:Male {name:'Adam', age:30}),
    (l:Person:Female {name:'Luanne', age:25}),
    (c:Person:Male {name:'Christophe', age:60}),
 
    (lon:City {name:'London'}),
    (mum:City {name:'Mumbai'}),
 
    (m)-[:FRIEND_OF]->(d),
    (m)-[:FRIEND_OF]->(l),
    (m)-[:FRIEND_OF]->(a),
    (m)-[:FRIEND_OF]->(v),
    (d)-[:FRIEND_OF]->(v),
    (c)-[:FRIEND_OF]->(v),
    (d)-[:LIVES_IN]->(lon),
    (v)-[:LIVES_IN]->(lon),
    (m)-[:LIVES_IN]->(lon),
    (l)-[:LIVES_IN]->(mum);

We want to recommend some potential friends to ‘Adam’ so the first layer of our query is to find his friends of friends as there are bound to be some potential friends amongst them:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
RETURN me, potentialFriend, COUNT(*) AS friendsInCommon
 
==> +--------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | friendsInCommon |
==> +--------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 1               |
==> +--------------------------------------------------------------------------------------+
==> 3 rows

This query gives us back a list of potential friends and how many friends we have in common.

Now that we’ve got some potential friends let’s start building a ranking for each of them. One indicator which could weigh in favour of a potential friend is if they live in the same location as us so let’s add that to our query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation
 
==> +-----------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation |
==> +-----------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            |
==> +-----------------------------------------------------------------------------------+
==> 3 rows

Next we’ll check whether Adams’ potential friends have the same gender as him by comparing the labels each node has. We’ve got ‘Male’ and ‘Female’ labels which indicate gender.

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
        LABELS(me) = LABELS(potentialFriend) AS gender
 
==> +--------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | gender |
==> +--------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | true   |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | false  |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | false  |
==> +--------------------------------------------------------------------------------------------+
==> 3 rows

Next up let’s calculate the age different between Adam and his potential friends:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

Now let’s do some filtering to get rid of people that Adam is already friends with – there wouldn’t be much point in recommending those people!

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

In this case we haven’t actually filtered anyone out but for some of the other people we would see a reduction in the number of potential friends.

Our final step is to come up with a score for each of the features that we’ve identified as being important for making a friend suggestion.

We’ll assign a score of 10 if the people live in the same location or have the same gender as Adam and 0 if not. For the ageDifference and friendsInCommon we’ll apply apply a log curve so that those values don’t have a disproportional effect on our final score. We’ll use the formula defined in the ParetoScoreTransfomer to do this:

    public <OUT> float transform(OUT item, float score) {
        if (score < minimumThreshold) {
            return 0;
        }
 
        double alpha = Math.log((double) 5) / eightyPercentLevel;
        double exp = Math.exp(-alpha * score);
        return new Double(maxScore * (1 - exp)).floatValue();
    }

And now for our completed recommendation query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
WITH potentialFriend,
       // 100 -> maxScore, 10 -> eightyPercentLevel, friendsInCommon -> score (from the formula above)
       100 * (1 - exp((-1.0 * (log(5.0) / 10)) * friendsInCommon)) AS friendsInCommon,
       sameLocation * 10 AS sameLocation,
       -1 * (10 * (1 - exp((-1.0 * (log(5.0) / 20)) * ageDifference))) AS ageDifference,
       CASE WHEN gender THEN 10 ELSE 0 END as sameGender
 
RETURN potentialFriend,
      {friendsInCommon: friendsInCommon,
       sameLocation: sameLocation,
       ageDifference:ageDifference,
       sameGender: sameGender} AS parts,
     friendsInCommon + sameLocation + ageDifference + sameGender AS score
ORDER BY score DESC
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | potentialFriend                   | parts                                                                                                           | score             |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1006]{name:"Vince",age:40}   | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 10} | 19.33814370292112 |
==> | Node[1008]{name:"Luanne",age:25}  | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -3.312596950235779, sameGender -> 0} | 11.55341079768576 |
==> | Node[1005]{name:"Daniela",age:20} | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 0}  | 9.33814370292112  |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The final query isn’t too bad – the only really complex bit is the log curve calculation. This is where user defined functions will come into their own in the future.

The nice thing about this approach is that we don’t have to step outside of cypher so if you’re not comfortable with Java you can still do real time recommendations! On the other hand, the different parts of the recommendation engine all get mixed up so it’s not as easy to see the whole pipeline as if you use the graph aware framework.

The next step is to apply this to the Twitter graph and come up with follower recommendations on there.

Written by Mark Needham

March 27th, 2015 at 6:59 am

Posted in neo4j

Tagged with

Neo4j: Detecting potential typos using EXPLAIN

without comments

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out of the example queries.

For example one of the queries that we do about half way finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced and we can use this to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner makes use of knowledge that it has about our schema to come up with a plan that it would run and how much of the graph it thinks that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

2015 03 17 23 39 55

The first row of the query plan describes an all nodes scan which tells us that the query will start from the ‘director’ but it’s the second row that’s interesting.

The estimated rows when expanding the ‘DIRECTED_IN’ relationship is 0 when we’d expect it to at least be a positive value if there were some instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

2015 03 17 23 43 11

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.

In summary if you find your query not returning anything when you expect it to, prefix an ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 expected rows’.

Written by Mark Needham

March 17th, 2015 at 10:46 pm

Posted in neo4j

Tagged with

Python/Neo4j: Finding interesting computer sciency people to follow on Twitter

with 4 comments

At the beginning of this year I moved from Neo4j’s field team to dev team and since the code we write there is much lower level than I’m used to I thought I should find some people to follow on twitter whom I can learn from.

My technique for finding some of those people was to pick a person from the Neo4j kernel team who’s very good at systems programming and uses twitter which led me to Mr Chris Vest.

I thought that the people Chris interacts with on twitter are likely to be interested in this type of programming so I manually traversed out to those people and had a look who they interacted with and put them all into a list.

This approach has worked well and I’ve picked up lots of reading material from following these people. It does feel like something that could be automated though and we’ll using Python and Neo4j as our tool kit.

We need to find a library to connect to the Twitter API. There are a few to choose from but tweepy seems simple enough so I started using that.

The first thing we need to so is fill in our Twitter API auth details. Since the application is just for me I’m not going to bother with setting up OAuth – instead I’ll just create an application on apps.twitter.com and grab the appropriate tokens:

2015 03 11 01 18 47

Once we’ve done that we can navigate to that app and get a consumer key, consumer secret, access token and access token secret. They all reside on the ‘Keys and Access Tokens’ tab:

2015 03 11 01 20 17
2015 03 11 01 20 30

Now that we’ve got ourself all auth’d up let’s write some code that starts from Chris and goes out and finds his latest tweets and the people he’s interacted with in them. We’ll write the appropriate information out to a CSV file so that we can import it into Neo4j later on:

import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break

As well as printing out Chris’ tweets I’m also capturing other users who he’s had interacted and putting them in a queue that we’ll drain later on. We’re limiting the number of other users that we’ll process to 50 for now but it’s easy to change.

If we print out the first few lines of ‘tweets.csv’ this is what we’d see:

$ head -n 5 tweets.csv
userName,tweetId,contents,usersMentioned,urls
chvest,575427045167071233,@shipilev http://t.co/WxqFIsfiSF,shipilev,
chvest,575403105174552576,@AlTobey I often use http://t.co/G7Cdn9Udst for small one-off graph diagrams.,AlTobey,http://www.apcjones.com/arrows/
chvest,575337346687766528,RT @theburningmonk: this is why you need composition over inheritance... :s #CompositionOverInheritance http://t.co/aKRwUaZ0qo,theburningmonk,
chvest,575269402083459072,@chvest except…? “Each library implementation should therefore be identical with respect to the public API”,chvest,

We’re capturing the user, tweetId, the tweet itself, any users mentioned in the tweet and any URLs shared in the tweet.

Next we want to get some of the tweets of the people Chris has interacted with

# Grab the code from here too - https://gist.github.com/mneedham/3188c44b2cceb88c6de0
 
import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break
    users_processed = set([starting_user])
    while True:
        if len(users_processed) >= USERS_TO_PROCESS:
            break
        else:
            if len(users_to_process) > 0:
                next_user = users_to_process.popleft()
                print next_user
                if next_user in users_processed:
                    "-- user already processed"
                else:
                    "-- processing user"
                    users_processed.add(next_user)
                    for tweet in tweepy.Cursor(api.user_timeline, id=next_user).items(10):
                        writer.writerow(extract_tweet(tweet))
                        tweets.flush()
                        for user_mentioned in tweet.entities["user_mentions"]:
                            if not len(users_processed) > 50:
                                users_to_process.append(user_mentioned["screen_name"])
                                counter[user_mentioned["screen_name"]] += 1
                            else:
                                break
            else:
                break

Finally let’s take a quick look at the users who show up most frequently:

>>> for user_name, count in counter.most_common(20):
        print user_name, count
 
neo4j 13
devnexus 12
AlTobey 11
bitprophet 11
hazelcast 10
chvest 9
shipilev 9
AntoineGrondin 8
gvsmirnov 8
GatlingTool 8
lagergren 7
tomsontom 6
dorkitude 5
noctarius2k 5
DanHeidinga 5
chris_mahan 5
coda 4
mccv 4
gAmUssA 4
jmhodges 4

A few of the people on that list are in my list which is a good start. We can explore the data set better once it’s in Neo4j though so let’s write some Cypher import statements to create our own mini Twitter graph:

// add people
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (p:Person {userName: row.userName});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users
UNWIND users AS user
MERGE (p:Person {userName: user});
 
// add tweets
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (t:Tweet {id: row.tweetId})
ON CREATE SET t.contents = row.contents;
 
// add urls
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls
UNWIND urls AS url
MERGE (:URL {value: url});
 
// add links
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MATCH (p:Person {userName: row.userName})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:TWEETED]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users, row
UNWIND users AS user
MATCH (p:Person {userName: user})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:MENTIONED_IN]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls, row
UNWIND urls AS url
MATCH (u:URL {value: url})
MATCH (t:Tweet {id: row.tweetId})
MERGE (t)-[:CONTAINS_LINK]->(u);

We can put all those commands in a file and execute them using neo4j-shell:

$ ./neo4j-community-2.2.0-RC01/bin/neo4j-shell --file import.cql

Now let’s write some queries against the graph:

// Find the tweets where Chris mentioned himself
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(n)
RETURN path

Graph  5

// Find the most popular links shared in the network
MATCH (u:URL)<-[r:CONTAINS_LINK]->()
RETURN u.value, COUNT(*) AS times
ORDER BY times DESC
LIMIT 10
 
+-------------------------------------------------------------------------------------------------+
| u.value                                                                                 | times |
+-------------------------------------------------------------------------------------------------+
| "http://www.polyglots.dk/"                                                              | 4     |
| "http://www.java-forum-nord.de/"                                                        | 4     |
| "http://hirt.se/blog/?p=646"                                                            | 3     |
| "http://wp.me/p26jdv-Ja"                                                                | 3     |
| "https://instagram.com/p/0D4I_hH77t/"                                                   | 3     |
| "https://blogs.oracle.com/java/entry/new_java_champion_tom_chindl"                      | 3     |
| "http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html"                  | 2     |
| "https://firstlook.org/theintercept/2015/03/10/ispy-cia-campaign-steal-apples-secrets/" | 2     |
| "http://buff.ly/1GzZXlo"                                                                | 2     |
| "http://buff.ly/1BrgtQd"                                                                | 2     |
+-------------------------------------------------------------------------------------------------+
10 rows

The first link is for a programming language meetup in Copenhagen, the second for a Java conference in Hanovier and the third an announcement about the latest version of Java Mission Control. So far so good!

A next step in this area would be to run the links through Prismatic’s interest graph so we can model topics in our graph as well. For now let’s have a look at the interactions between Chris and others in the graph:

// Find the people who Chris interacts with most often
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
RETURN other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------+
| other.userName | times |
+------------------------+
| "gvsmirnov"    | 7     |
| "shipilev"     | 5     |
| "nitsanw"      | 4     |
| "DanHeidinga"  | 3     |
| "AlTobey"      | 3     |
+------------------------+
5 rows

Let’s generalise that to find interactions between any pair of people:

// Find the people who interact most often
MATCH (n:Person)-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
WHERE n <> other
RETURN n.userName, other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------------------------+
| n.userName    | other.userName   | times |
+------------------------------------------+
| "fbogsany"    | "AntoineGrondin" | 8     |
| "chvest"      | "gvsmirnov"      | 7     |
| "chris_mahan" | "bitprophet"     | 6     |
| "maxdemarzi"  | "neo4j"          | 6     |
| "chvest"      | "shipilev"       | 5     |
+------------------------------------------+
5 rows

Let’s combine a couple of these together to come up with a score for each person:

MATCH (n:Person)
 
// number of mentions
OPTIONAL MATCH (n)-[mention:MENTIONED_IN]->()
WITH n, COUNT(mention) AS mentions
 
// number of links shared by someone else
OPTIONAL MATCH (n)-[:TWEETED]->()-[:CONTAINS_LINK]->(link)<-[:CONTAINS_LINK]-()
 
WITH n, mentions, COUNT(link) AS links
RETURN n.userName, mentions + links AS score, mentions, links
ORDER BY score DESC
LIMIT 10
 
+------------------------------------------+
| n.userName    | score | mentions | links |
+------------------------------------------+
| "chvest"      | 17    | 10       | 7     |
| "hazelcast"   | 16    | 10       | 6     |
| "neo4j"       | 15    | 13       | 2     |
| "noctarius2k" | 14    | 4        | 10    |
| "devnexus"    | 12    | 12       | 0     |
| "polyglotsdk" | 11    | 2        | 9     |
| "shipilev"    | 11    | 10       | 1     |
| "AlTobey"     | 11    | 10       | 1     |
| "bitprophet"  | 10    | 9        | 1     |
| "GatlingTool" | 10    | 8        | 2     |
+------------------------------------------+
10 rows

Amusingly Chris is top of his own network but we also see three accounts which aren’t people, but rather products – neo4j, hazelcast and GatlingTool. The rest are legit though

That’s as far as I’ve got but to make this more useful I think we need to introduce follower/friend links as well as importing more data.

In the mean time I’ve got a bunch of links to go and read!

Written by Mark Needham

March 11th, 2015 at 9:13 pm

Posted in neo4j,Python

Tagged with ,