Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Northwind: Finding direct/transitive Reports in SQL and Neo4j’s Cypher

without comments

Every few months we run a relational to graph meetup at the Neo London office where we go through how to take your data from a relational database and into the graph.

We use the Northwind dataset which often comes as a demo dataset on relational databases and come up with some queries which seem graph in nature.

My favourite query is one which finds out how employees are organised and who reports to whom. I thought it’d be quite interesting to see what it would look like in Postgres SQL as well, just for fun.

We’ll start off by getting a list of employees and the person they report to:

SELECT e."EmployeeID", e."ReportsTo"
FROM employees AS e
WHERE e."ReportsTo" IS NOT NULL;
 
 EmployeeID | ReportsTo
------------+-----------
          1 |         2
          3 |         2
          4 |         2
          5 |         2
          6 |         5
          7 |         5
          8 |         2
          9 |         5
(8 ROWS)

In cypher we’d do this:

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.EmployeeID, e.EmployeeID 
 
+-------------------------------+
| sub.EmployeeID | e.EmployeeID |
+-------------------------------+
| "4"            | "2"          |
| "5"            | "2"          |
| "1"            | "2"          |
| "3"            | "2"          |
| "8"            | "2"          |
| "9"            | "5"          |
| "6"            | "5"          |
| "7"            | "5"          |
+-------------------------------+
8 rows

Next let’s find the big boss who doesn’t report to anyone. First in SQL:

SELECT e."EmployeeID" AS bigBoss
FROM employees AS e
WHERE e."ReportsTo" IS NULL
 
 bigboss
---------
       2
(1 ROW)

And now cypher:

MATCH (e:Employee)
WHERE NOT (e)-[:REPORTS_TO]->()
RETURN e.EmployeeID AS bigBoss
 
+---------+
| bigBoss |
+---------+
| "2"     |
+---------+
1 row

We still don’t need to join anything so the query isn’t that interesting yet. Let’s bring in some more properties from the manager record so we have to self join on the employees table:

SELECT e."FirstName", e."LastName", e."Title", manager."FirstName", manager."LastName", manager."Title"
FROM employees AS e
JOIN employees AS manager ON e."ReportsTo" = manager."EmployeeID"
WHERE e."ReportsTo" IS NOT NULL
 
 FirstName | LastName  |          Title           | FirstName | LastName |         Title
-----------+-----------+--------------------------+-----------+----------+-----------------------
 Nancy     | Davolio   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Janet     | Leverling | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Margaret  | Peacock   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Steven    | Buchanan  | Sales Manager            | Andrew    | Fuller   | Vice President, Sales
 Michael   | Suyama    | Sales Representative     | Steven    | Buchanan | Sales Manager
 Robert    | King      | Sales Representative     | Steven    | Buchanan | Sales Manager
 Laura     | Callahan  | Inside Sales Coordinator | Andrew    | Fuller   | Vice President, Sales
 Anne      | Dodsworth | Sales Representative     | Steven    | Buchanan | Sales Manager
(8 ROWS)
MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.FirstName, sub.LastName, sub.Title, e.FirstName, e.LastName, e.Title
 
+----------------------------------------------------------------------------------------------------------------+
| sub.FirstName | sub.LastName | sub.Title                  | e.FirstName | e.LastName | e.Title                 |
+----------------------------------------------------------------------------------------------------------------+
| "Margaret"    | "Peacock"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Steven"      | "Buchanan"   | "Sales Manager"            | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Nancy"       | "Davolio"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Janet"       | "Leverling"  | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Laura"       | "Callahan"   | "Inside Sales Coordinator" | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Anne"        | "Dodsworth"  | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Michael"     | "Suyama"     | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Robert"      | "King"       | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
+----------------------------------------------------------------------------------------------------------------+
8 rows

Now let’s see how many direct reports each manager has:

SELECT manager."EmployeeID" AS manager, COUNT(e."EmployeeID") AS reports
FROM employees AS manager
LEFT JOIN employees AS e ON e."ReportsTo" = manager."EmployeeID"
GROUP BY manager
ORDER BY reports DESC;
 
 manager | reports
---------+---------
       2 |       5
       5 |       3
       1 |       0
       3 |       0
       4 |       0
       9 |       0
       6 |       0
       7 |       0
       8 |       0
(9 ROWS)
MATCH (e:Employee)
OPTIONAL MATCH (e)<-[rel:REPORTS_TO]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 5       |
| "5"      | 3       |
| "8"      | 0       |
| "7"      | 0       |
| "1"      | 0       |
| "4"      | 0       |
| "6"      | 0       |
| "9"      | 0       |
| "3"      | 0       |
+--------------------+
9 rows

Things start to get more interesting if we find the transitive reporting relationships that exist. I’m not an expert at Postgres but one way to achieve this is by writing a recursive WITH query like so:

WITH RECURSIVE recursive_employees("EmployeeID", "ReportsTo") AS (
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e
      UNION ALL
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e, recursive_employees re
        WHERE e."EmployeeID" = re."ReportsTo"
)
SELECT re."ReportsTo", COUNT(*) AS COUNT
FROM recursive_employees AS re
WHERE re."ReportsTo" IS NOT NULL
GROUP BY re."ReportsTo";
 
 ReportsTo | COUNT
-----------+-------
         2 |     8
         5 |     3
(2 ROWS)

If there’s a simpler way let me know in the comments.

In cypher we only need to add one character, ‘*’, after the ‘REPORTS_TO’ relationship to get it to recurse as far as it can. We’ll also remove the ‘OPTIONAL MATCH’ so that we only get back people who have people reporting to them:

MATCH (e:Employee)<-[rel:REPORTS_TO*]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 8       |
| "5"      | 3       |
+--------------------+
2 rows

Now I need to find some relational datasets with more complicated queries to play around with. If you have any ideas do let me know.

Written by Mark Needham

June 15th, 2015 at 10:53 pm

Posted in neo4j

Tagged with

Neo4j: Using LOAD CSV to help explore CSV files

without comments

During the Neo4j How I met your mother hackathon that we ran last week one of the attendees noticed that one of the CSV files we were importing wasn’t creating as many records as they expected it to.

This is typically the case when there’s some odd quoting in the CSV file but we decided to look into it.

The file in question was one containing references made in HIMYM. The first 5 lines look like this:

$ head -n 5 data/import/references.csv
ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText
168,184,"Marshall will eventually hear back from the New York State Judicatory Committee in Something New, which will become a main plot point of Season 9."
168,169,Barney proclaiming to be done with Robin will be the focal point of Lobster Crawl.
58,57,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
58,63,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."

And this is how many lines the Unix ‘wc’ command sees:

$ wc -l data/import/references.csv
     782 data/import/references.csv

So we might expect that there are going to be 782 records created if we import that file into Neo4j. Let’s run a quick query in Neo4j to see what it thinks:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" 
AS row
return count(*)
 
==> +----------+
==> | count(*) |
==> +----------+
==> | 636      |
==> +----------+
==> 1 row

So we have 146 less records than we expected which means Neo4j is treating multiple lines as one CSV line in some cases.

Let’s go back to the Unix command line to try and work out which lines those are. There must be some lines which start with part of the ‘ReferenceText’ rather than a ‘ReferenceEpisodeId’ so let’s extract the first column and see what’s going on there:

$ cat data/import/references.csv | cut -d"," -f1 | grep -v  '[0-9]\+$'| head -n 10
ReferencedEpisodeId
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
Also

We’ve extracted the first column and then filter the output to only keep rows which don’t contain all numbers which will be our rogue rows.

Let’s switch back to Neo4j land to see which rows it thinks contains these fragments of text:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                     |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Interestingly it only returns two rows containing that phrase whereas we see it at least 8 times. Initially I thought this was an issue with the LOAD CSV command but if we filter the rows to only return ones that have a ‘ReferencedEpisodeId’ of ’45’ then we do see them returned:

==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "53", ReferenceText -> "The website counting down to the next slap (slapcountdown.com) that Marshall sends Barney reaches zero in Slapsgiving, when the third slap is delivered."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "55", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "57", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "63", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "92", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So the actual problem is that the regex matcher doesn’t deal with the new line in the string.

Our next step is therefore to get rid of new lines within strings. I spent ages trying to find the appropriate command before coming across the following use of awk which does the job:

$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" | wc -l
637
 
$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" > data/import/refs.csv

Let’s try the LOAD CSV command again:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/refs.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "56"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "200"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "100"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "86"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "113"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "161"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And there we go!

Update

Michael pointed out that I could have used the dotall regex flag at the beginning of the regular expression in order to search across new lines without having to remove them! In that case the query would read like this:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ "(?s).*This is the Mother.*"
RETURN row
 
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Written by Mark Needham

June 11th, 2015 at 11:15 pm

Posted in neo4j,Shell Scripting

Tagged with

Neo4j: Cypher – Step by step to creating a linked list of adjacent nodes using UNWIND

without comments

In late 2013 I wrote a post showing how to create a linked list connecting different football seasons together using Neo4j’s Cypher query language, a post I’ve frequently copy & pasted from!

Now 18 months later, and using Neo4j 2.2 rather than 2.0, we can actually solve this problem in what I believe is a more intuitive way using the UNWIND function. Credit for the idea goes to Michael, I’m just the messenger.

To recap, we had a collection of football seasons and we wanted to connect adjacent seasons to each other to allow easy querying between seasons. The following is the code we used:

CREATE (:Season {name: "2013/2014", timestamp: 1375315200})
CREATE (:Season {name: "2012/2013", timestamp: 1343779200})
CREATE (:Season {name: "2011/2012", timestamp: 1312156800})
CREATE (:Season {name: "2010/2011", timestamp: 1280620800})
CREATE (:Season {name: "2009/2010", timestamp: 1249084800})
MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
 
FOREACH(i in RANGE(0, length(seasons)-2) | 
    FOREACH(si in [seasons[i]] | 
        FOREACH(si2 in [seasons[i+1]] | 
            MERGE (si)-[:NEXT]->(si2))))

Our goal is to replace those 3 FOREACH loops with something a bit easier to understand. To start with, let’s run the first part of the query to get some intuition of what we’re trying to do:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
RETURN COLLECT(s) AS seasons
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | seasons                                                                                                                                                                                                                                                     |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[1973]{timestamp:1249084800,name:"2009/2010"},Node[1972]{timestamp:1280620800,name:"2010/2011"},Node[1971]{timestamp:1312156800,name:"2011/2012"},Node[1970]{timestamp:1343779200,name:"2012/2013"},Node[1969]{timestamp:1375315200,name:"2013/2014"}] |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So at this point we’ve got all the seasons in an array going from 2009/2010 up to 2013/2014. We want to create a ‘NEXT’ relationship between 2009/2010 -> 2010/2011, 2010/2011 -> 2011/2012 and so on.

To achieve this we need to get the adjacent seasons split into two columns, like so:

2009/2010	2010/2011
2010/2011	2011/2012
2011/2012	2012/2013
2012/2013	2013/2014

If we can get the data into that format then we can apply a MERGE between the two fields to create the ‘NEXT’ relationship. So how do we do that?

If we were in Python we’d be calling for the zip function which we could apply like this:

>>> seasons = ["2009/2010", "2010/2011", "2011/2012", "2012/2013", "2013/2014"]
 
>>> zip(seasons, seasons[1:])
[('2009/2010', '2010/2011'), ('2010/2011', '2011/2012'), ('2011/2012', '2012/2013'), ('2012/2013', '2013/2014')]

Unfortunately we don’t have an equivalent function in Cypher but we can achieve the same outcome by creating 2 columns with adjacent integer values. The RANGE and UNWIND functions are our friends here:

return RANGE(0,4)
 
==> +-------------+
==> | RANGE(0,4)  |
==> +-------------+
==> | [0,1,2,3,4] |
==> +-------------+
UNWIND RANGE(0,4) as idx 
RETURN idx, idx +1;
 
==> +--------------+
==> | idx | idx +1 |
==> +--------------+
==> | 0   | 1      |
==> | 1   | 2      |
==> | 2   | 3      |
==> | 3   | 4      |
==> | 4   | 5      |
==> +--------------+
==> 5 rows

Now all we need to do is plug this code into our original query where ‘idx’ and ‘idx + 1′ represent indexes into the array of seasons. We use a range which stops 1 element early since there isn’t anywhere to connect our last season to:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
UNWIND RANGE(0,LENGTH(seasons) - 2) as idx 
RETURN seasons[idx], seasons[idx+1]
 
==> +-------------------------------------------------------------------------------------------------------+
==> | seasons[idx]                                      | seasons[idx+1]                                    |
==> +-------------------------------------------------------------------------------------------------------+
==> | Node[1973]{timestamp:1249084800,name:"2009/2010"} | Node[1972]{timestamp:1280620800,name:"2010/2011"} |
==> | Node[1972]{timestamp:1280620800,name:"2010/2011"} | Node[1971]{timestamp:1312156800,name:"2011/2012"} |
==> | Node[1971]{timestamp:1312156800,name:"2011/2012"} | Node[1970]{timestamp:1343779200,name:"2012/2013"} |
==> | Node[1970]{timestamp:1343779200,name:"2012/2013"} | Node[1969]{timestamp:1375315200,name:"2013/2014"} |
==> +-------------------------------------------------------------------------------------------------------+
==> 4 rows

Now we’ve got all the adjacent seasons lined up we complete the query with a call to MERGE:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
UNWIND RANGE(0,LENGTH(seasons) - 2) as idx 
WITH seasons[idx] AS s1, seasons[idx+1] AS s2
MERGE (s1)-[:NEXT]->(s2)
 
==> +-------------------+
==> | No data returned. |
==> +-------------------+
==> Relationships created: 4

And we’re done. Hopefully I can remember this approach more than I did the initial one!

Written by Mark Needham

June 4th, 2015 at 10:17 pm

Posted in neo4j

Tagged with

Neo4j: The BBC Champions League graph

without comments

A couple of weekends ago I started scraping the BBC live text feed of the Bayern Munich/Barcelona match, initially starting out with just the fouls and building the foul graph.

I’ve spent a bit more time on it since then and have managed to model several other events as well including attempts, goals, cards and free kicks.

I started doing this just for the Bayern Munich/Barcelona match but realised it wasn’t particularly difficult to extend this out and graph the events for every match in the Champions League 2014/2015.

To do this we first need to download the pages for each of the matches. I downloaded this page and wrote a simple Python script to get the appropriate URIs:

from bs4 import BeautifulSoup
from soupselect import select
import bs4
 
soup = BeautifulSoup(open("data/results", "r"))
 
matches = select(soup, "a.report")
 
for match in matches:
    print "http://www.bbc.co.uk/%s" %(match.get("href"))

I then piped the output of running this script into wget:

find_all_matches.py | xargs wget -O data/raw

It was relatively simple to update the scraping and import code to handle multiple matches. The whole process from end to end looks like this:

2015 05 29 23 27 56

Most of the code is in the ‘scraping magic’ phase where I go through all the events and pull out the appropriate elements that we can link together in the graph.

e.g. a freekick and foul event are typically adjacent so we’d look to pull out the two players involved, the type of card issues, the time of the incident and the match the event occurred in.

I used Python’s Beautiful Soup library for this task but there’s no reason you couldn’t use another set of tools.

The README page shows how to create your own version of the graph but here’s an overview of what the graph looks like using Rik’s meta graph query:

Graph  21

Here’s a few of my favourite queries so far:

Which player with more than 10 shots has the best conversion rate?

match (a:Attempt)<-[:HAD_ATTEMPT]-(app)<-[:MADE_APPEARANCE]-(player),
      (app)-[:FOR_TEAM]-(team)
WITH player, COUNT(*) as times, COLLECT(a) AS attempts, team
WITH player, times, LENGTH([a in attempts WHERE a:Goal]) AS goals, team
WHERE times > 10
RETURN player.name, team.name, goals, times, (goals * 1.0 / times) AS conversionRate
ORDER BY conversionRate DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------+
==> | player.name           | team.name            | goals | times | conversionRate      |
==> +------------------------------------------------------------------------------------+
==> | "Luiz Adriano"        | "Shakhtar Donetsk"   | 9     | 14    | 0.6428571428571429  |
==> | "Yacine Brahimi"      | "FC Porto"           | 5     | 13    | 0.38461538461538464 |
==> | "Mario Mandzukic"     | "Atlético de Madrid" | 5     | 14    | 0.35714285714285715 |
==> | "Sergio Agüero"       | "Manchester City"    | 6     | 18    | 0.3333333333333333  |
==> | "Karim Benzema"       | "Real Madrid"        | 6     | 19    | 0.3157894736842105  |
==> | "Klaas-Jan Huntelaar" | "FC Schalke 04"      | 5     | 16    | 0.3125              |
==> | "Neymar"              | "Barcelona"          | 9     | 29    | 0.3103448275862069  |
==> | "Thomas Müller"       | "FC Bayern München"  | 7     | 24    | 0.2916666666666667  |
==> | "Jackson Martínez"    | "FC Porto"           | 7     | 24    | 0.2916666666666667  |
==> | "Callum McGregor"     | "Celtic"             | 3     | 11    | 0.2727272727272727  |
==> +------------------------------------------------------------------------------------+

Which players gained immediate revenge for a foul?

match (firstFoul:Foul)-[:COMMITTED_AGAINST]->(app1)<-[:MADE_APPEARANCE]-(revengeFouler),
      (app1)-[:IN_MATCH]->(match), (firstFoulerApp)-[:COMMITTED_FOUL]->(firstFoul),
      (app1)-[:COMMITTED_FOUL]->(revengeFoul)-[:COMMITTED_AGAINST]->(firstFoulerApp),
       (firstFouler)-[:MADE_APPEARANCE]->(firstFoulerApp)
WHERE (firstFoul)-[:NEXT]->(revengeFoul)
RETURN firstFouler.name AS firstFouler, revengeFouler.name AS revengeFouler, firstFoul.time, revengeFoul.time, match.home + " vs " + match.away
 
==> +---------------------------------------------------------------------------------------------------------------------------------+
==> | firstFouler         | revengeFouler               | firstFoul.time | revengeFoul.time | match.home + " vs " + match.away        |
==> +---------------------------------------------------------------------------------------------------------------------------------+
==> | "Derk Boerrigter"   | "Jean Philippe Mendy"       | "88:48"        | "89:42"          | "Celtic vs NK Maribor"                  |
==> | "Mario Suárez"      | "Pajtim Kasami"             | "27:17"        | "32:38"          | "Olympiakos vs Atlético de Madrid"      |
==> | "Aleksandr Volodko" | "Casemiro"                  | "39:27"        | "44:32"          | "FC Porto vs BATE Borisov"              |
==> | "Thomas Müller"     | "Mario Fernandes"           | "87:22"        | "88:31"          | "CSKA Moscow vs FC Bayern München"      |
==> | "Vinicius"          | "Marco Verratti"            | "56:36"        | "58:00"          | "APOEL Nicosia vs Paris Saint Germain"  |
==> | "Lasse Schöne"      | "Dani Alves"                | "84:08"        | "86:18"          | "Barcelona vs Ajax"                     |
==> | "Nick Viergever"    | "Dani Alves"                | "57:22"        | "60:37"          | "Barcelona vs Ajax"                     |
==> | "Nani"              | "Atsuto Uchida"             | "6:10"         | "8:40"           | "FC Schalke 04 vs Sporting Lisbon"      |
==> | "Andreas Samaris"   | "Yannick Ferreira-Carrasco" | "89:21"        | "90:00 +4:21"    | "Monaco vs Benfica"                     |
==> | "Simon Kroon"       | "Guillherme Siqueira"       | "84:05"        | "90:00 +0:29"    | "Atlético de Madrid vs Malmö FF"        |
==> | "Mario Suárez"      | "Isaac Thelin"              | "32:02"        | "38:47"          | "Atlético de Madrid vs Malmö FF"        |
==> | "Hakan Balta"       | "Henrikh Mkhitaryan"        | "62:09"        | "64:14"          | "Borussia Dortmund vs Galatasaray"      |
==> | "Marco Reus"        | "Selcuk Inan"               | "36:17"        | "44:03"          | "Borussia Dortmund vs Galatasaray"      |
==> | "Hakan Balta"       | "Sven Bender"               | "10:57"        | "12:51"          | "Borussia Dortmund vs Galatasaray"      |
==> | "Vinicius"          | "Edinson Cavani"            | "87:56"        | "90:00 +1:25"    | "Paris Saint Germain vs APOEL Nicosia"  |
==> | "Jackson Martínez"  | "Carlos Gurpegi"            | "64:55"        | "66:17"          | "Athletic Club vs FC Porto"             |
==> | "Nani"              | "Chinedu Obasi"             | "1:30"         | "4:47"           | "Sporting Lisbon vs FC Schalke 04"      |
==> | "Vitali Rodionov"   | "Bruno Martins Indi"        | "52:16"        | "60:08"          | "BATE Borisov vs FC Porto"              |
==> | "Raheem Sterling"   | "Behrang Safari"            | "29:00"        | "33:27"          | "Liverpool vs FC Basel"                 |
==> | "Derlis González"   | "Fábio Coentrão"            | "52:55"        | "57:59"          | "FC Basel vs Real Madrid"               |
==> | "Josip Drmic"       | "Lisandro López"            | "15:04"        | "17:35"          | "Benfica vs Bayer 04 Leverkusen"        |
==> | "Fred"              | "Bastian Schweinsteiger"    | "6:04"         | "9:28"           | "Shakhtar Donetsk vs FC Bayern München" |
==> | "Alex Sandro"       | "Derlis González"           | "4:07"         | "7:28"           | "FC Basel vs FC Porto"                  |
==> | "Luca Zuffi"        | "Ruben Neves"               | "73:49"        | "84:44"          | "FC Porto vs FC Basel"                  |
==> | "Marco Verratti"    | "Oscar"                     | "28:49"        | "34:04"          | "Chelsea vs Paris Saint Germain"        |
==> | "Cristiano Ronaldo" | "Jesús Gámez"               | "20:59"        | "25:37"          | "Real Madrid vs Atlético de Madrid"     |
==> | "Bernardo Silva"    | "Álvaro Morata"             | "49:20"        | "62:31"          | "Monaco vs Juventus"                    |
==> | "Arturo Vidal"      | "Fabinho"                   | "38:19"        | "45:00"          | "Monaco vs Juventus"                    |
==> +---------------------------------------------------------------------------------------------------------------------------------+

Which players took the longest to gain revenge for a foul?

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match)<-[:COMMITTED_IN_MATCH]-(foul2)
WHERE (foul1)-[:NEXT*]->(foul2)
WITH match, foul1, player1, player2, foul2 ORDER BY foul1.sortableTime, foul2.sortableTime
WITH match, foul1, player1, player2, COLLECT(foul2) AS revenge
WITH match, foul1,  player1,player2,  revenge[0] AS revengeFoul
RETURN player1.name, player2.name, foul1.time, revengeFoul.time, revengeFoul.sortableTime - foul1.sortableTime AS secondsWaited, match.home + " vs " + match.away AS match
ORDER BY secondsWaited DESC
LIMIT 5
 
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | player1.name      | player2.name        | foul1.time | revengeFoul.time | secondsWaited | match                           |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | "Stefan Johansen" | "Ondrej Duda"       | "1:30"     | "82:11"          | 4841          | "Legia Warsaw vs Celtic"        |
==> | "Neymar"          | "Vinicius"          | "2:35"     | "80:08"          | 4653          | "Barcelona vs APOEL Nicosia"    |
==> | "Jérémy Toulalan" | "Stefan Kießling"   | "9:19"     | "86:37"          | 4638          | "Monaco vs Bayer 04 Leverkusen" |
==> | "Nabil Dirar"     | "Domenico Criscito" | "6:32"     | "82:39"          | 4567          | "Zenit St Petersburg vs Monaco" |
==> | "Nabil Dirar"     | "Eliseu"            | "7:20"     | "81:30"          | 4450          | "Monaco vs Benfica"             |
==> +---------------------------------------------------------------------------------------------------------------------------+

Who’s had the most shots?

match (team)<-[:FOR_TEAM]-(app)<-[appRel:MADE_APPEARANCE]-(player:Player)
optional match (a:Attempt)<-[att:HAD_ATTEMPT]-(app)
WITH player, COUNT( DISTINCT appRel) AS apps, COUNT(att) as times, COLLECT(a) AS attempts, team
WITH player,apps, times, LENGTH([a in attempts WHERE a:Goal]) AS goals, team
WHERE times > 10
RETURN player.name, team.name, apps, goals, times, (goals * 1.0 / times) AS conversionRate
ORDER BY times DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------------+
==> | player.name          | team.name             | apps | goals | times | conversionRate      |
==> +-------------------------------------------------------------------------------------------+
==> | "Cristiano Ronaldo"  | "Real Madrid"         | 12   | 10    | 69    | 0.14492753623188406 |
==> | "Lionel Messi"       | "Barcelona"           | 12   | 10    | 51    | 0.19607843137254902 |
==> | "Robert Lewandowski" | "FC Bayern München"   | 12   | 6     | 43    | 0.13953488372093023 |
==> | "Carlos Tévez"       | "Juventus"            | 12   | 7     | 34    | 0.20588235294117646 |
==> | "Gareth Bale"        | "Real Madrid"         | 10   | 2     | 32    | 0.0625              |
==> | "Luis Suárez"        | "Barcelona"           | 9    | 6     | 30    | 0.2                 |
==> | "Neymar"             | "Barcelona"           | 11   | 9     | 29    | 0.3103448275862069  |
==> | "Hakan Calhanoglu"   | "Bayer 04 Leverkusen" | 8    | 2     | 29    | 0.06896551724137931 |
==> | "Edinson Cavani"     | "Paris Saint Germain" | 8    | 6     | 27    | 0.2222222222222222  |
==> | "Alexis Sánchez"     | "Arsenal"             | 9    | 4     | 25    | 0.16                |
==> +-------------------------------------------------------------------------------------------+

Maybe you can think of some cooler ones? I’d love to see them. Grab the code from github and give it a try.

Written by Mark Needham

May 30th, 2015 at 9:45 pm

Posted in neo4j

Tagged with

Neo4j: The foul revenge graph

without comments

Last week I was showing the foul graph to my colleague Alistair who came up with the idea of running a ‘foul revenge’ query to find out which players gained revenge for a foul with one of their own later in them match.

Queries like this are very path centric and therefore work well in a graph. To recap, this is what the foul graph looks like:

2015 05 26 07 35 33

The first thing that we need to do is connect the fouls in a linked list based on time so that we can query their order more easily.

We can do this with the following query:

MATCH (foul:Foul)-[:COMMITTED_IN_MATCH]->(match)
WITH foul,match
ORDER BY match.id, foul.sortableTime
WITH match, COLLECT(foul) AS fouls
FOREACH(i in range(0, length(fouls) -2) |
  FOREACH(foul1 in [fouls[i]] | FOREACH (foul2 in [fouls[i+1]] |
    MERGE (foul1)-[:NEXT]->(foul2)
)));

This query collects fouls grouped by match and then adds a ‘NEXT’ relationship between adjacent fouls. The graph now looks like this:

2015 05 26 07 43 28

Now let’s find the revenge foulers in the Bayern Munich vs Barcelona match. We’re looking for the following pattern:

2015 05 26 07 55 45

This translates to the following cypher query:

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2)
WHERE (foul1)-[:NEXT*]->(foul2)
RETURN player2.name AS firstFouler, player1.name AS revengeFouler, foul1.time, foul1.location, foul2.time, foul2.location

I’ve added in a few extra parts to the pattern to pull out the players involved and to find the revenge foulers in a specific match – the Bayern Munich vs Barcelona Semi Final 2nd leg.


We end up with the following revenge fouls:

2015 05 26 00 05 48

We can see here that Dani Alves actually gains revenge on Bastian Schweinsteiger twice for a foul he made in the 10th minute.

If we tweak the query to the following we can get a visual representation of the revenge fouls as well:

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2),
      (foul1)-[:NEXT*]->(foul2)
RETURN *

2015 05 23 15 23 22

At the moment I’ve restricted the revenge concept to single matches but I wonder whether it’d be more interesting to create a linked list of fouls which crosses matches between teams in the same season.

The code for all of this is on github – the README is a bit sketchy at the moment but I’ll be fixing that up soon.

Written by Mark Needham

May 26th, 2015 at 7:03 am

Posted in neo4j

Tagged with

Neo4j: Finding all shortest paths

without comments

One of the Cypher language features we show in Neo4j training courses is the shortest path function which allows you to find the shortest path in terms of number of relationships between two nodes.

Using the movie graph, which you can import via the ‘:play movies’ command in the browser, we’ll first create a ‘KNOWS’ relationship between any people that have appeared in the same movie:

MATCH (p1:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
MERGE (p1)-[:KNOWS]-(p2)

Now that we’ve got that relationship we can easily find the shortest path between two people, say Tom Cruise and Tom Hanks:

MATCH (p1:Person {name: "Tom Hanks"}), (p2:Person {name: "Tom Cruise"}),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
Graph  18

That works pretty well but what if we want to find the longest shortest path between any two people in the graph? We can calculate it like this:

MATCH (p1:Person), (p2:Person),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1
Graph  19

So that’s 6 hops which is actually the Bacon number – I expect we’d probably see a smaller maximum value if we imported all the movies.

And to round off the post what if we want to find the longest shortest path between the 10 people who acted in the most movies? We might start out with the following query which seems like it should do the job:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH p1 AS p1, p1 AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Unfortunately if we run that query we get no rows returned because ‘p1′ and ‘p2′ always refer to the same node.

Instead we can calculate the shortest path between our hardest working people by creating a cross product using COLLECT and UNWIND:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH COLLECT(p1) AS ps
UNWIND ps AS p1 UNWIND ps AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Graph  20

That’s all for now!

Written by Mark Needham

May 19th, 2015 at 10:45 pm

Posted in neo4j

Tagged with ,

Neo4j: Refactoring the BBC football live text fouls graph

without comments

Yesterday I wrote about a Neo4j graph I’ve started building which contains all the fouls committed in the Champions League game between Barcelona & Bayern Munich and surrounding meta data.

While adding other events into the graph I realised that I’d added some duplication in the model and the model could do with some refactoring to make it easier to use.

To recap, this is the model that we designed in the previous blog post:

The duplication is on the left hand side of the model – we model a foul as being committed by one player against another and then hook the foul back into the match. By doing that we’re not using the ‘appearance’ concept which links a player and a match together.

We can make the ‘COMMITTED_IN_MATCH’ relationship redundant by connecting the foul to appearance rather than to player. The match the foul was committed in can then be found by navigating through the appearance node.

This is what we want the graph to look like:

2015 05 17 10 40 44

We’ll move towards this new model in 3 steps:

  • Introduce the new structure alongside the existing one
  • Rewrite our queries to use the new structure
  • Remove the old structure

Introducing the new structure

First up let’s write a query to introduce the new structure.

match (foul:Foul)-[:COMMITTED_AGAINST]->(fouledPlayer),
      (foul)<-[:COMMITTED_FOUL]-(foulingPlayer),
      (foul)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"}),
      (foulingPlayer)-[:MADE_APPEARANCE]-(foulingPlayerApp)-[:IN_MATCH]->(match),
      (fouledPlayer)-[:MADE_APPEARANCE]-(fouledPlayerApp)-[:IN_MATCH]->(match)
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayerApp)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayerApp)

Remember we’re not going to delete the old structure yet so that’s why there aren’t any delete statements in here.

Rewriting our queries

Now we need to update our queries to work against the new graph structure:

Where do the fouls happen?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-()<-[]-(foul:Foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

Who fouls the most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

Who was fouled the most?

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[r:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)<-[:COMMITTED_AGAINST]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

Who fouled who the most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)-[:COMMITTED_AGAINST]->(fouled:Player),
      (foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"}),
      (match)<-[:IN_MATCH]-(fouledApp)<-[:COMMITTED_AGAINST]->(foul:Foul)<-[:COMMITTED_FOUL]-(foulerApp)-[:IN_MATCH]->(match),
      (fouledApp)<-[:MADE_APPEARANCE]-(fouled),
      (foulerApp)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

Which team fouled most?

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-()<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

Worst fouler for each team

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouler)-[:MADE_APPEARANCE]->(app)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

Most fouled against for each team

match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)<-[:COMMITTED_AGAINST]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouled)-[:MADE_APPEARANCE]->(app)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls

The early queries are made more complicated by the refactoring but the latter ones are slightly simpler. I think we need to hook some more events onto the appearance node to see whether this refactoring is worthwhile or not.

Removing the old structure

Holding judgement for now, let’s look at how we’d remove the old structure – the final step in this refactoring:

match (match:Match {id: "32683310"})<-[oldRel:COMMITTED_IN_MATCH]-(foul:Foul)
DELETE oldRel
match (player:Player)<-[oldRel:COMMITTED_AGAINST]-(foul:Foul)
DELETE oldRel
match (player:Player)-[oldRel:COMMITTED_FOUL]->(foul:Foul)
DELETE oldRel

Hopefully you can see how you’d go about refactoring your own graph if you realise the model isn’t quite what you want.

Any questions/thoughts/suggestions let me know!

Written by Mark Needham

May 17th, 2015 at 11:04 am

Posted in neo4j

Tagged with

Neo4j: BBC football live text fouls graph

without comments

I recently came across the Partially Derivative podcast and in episode 17 they describe how Kirk Goldsberry scraped a bunch of data about shots in basketball matches then ran some analysis on that data.

It got me thinking that we might be able to do something similar for football matches and although event based data for football matches only comes from Opta, the BBC does expose some of them in live text feeds.

We’ll start with the Champions League match between Barcelona and Bayern Munich from last Tuesday.

2015 05 16 23 10 43

Our first task is to extract the events that happened in the match along with the players involved. After we’ve got that we’ll generate a Neo4j graph and see if we can find some interesting insights.

I find the feedback cycle with this type of work is dramatically improved if we have the source data available locally so the first step was to get the BBC web page downloaded:

$ wget http://www.bbc.co.uk/sport/0/football/32683310

Next we need to write a scraper which will extract all the events. We want to get an array containing one entry for each event, where the following is an example of an event:

2015 05 16 22 19 00

HTML-wise it looks like this:

2015 05 16 22 20 28

4709393221 bddd85c64e z

Image courtesy of William Brawley

I do most of my scraping work in Python so I used the Beautiful Soup library with the soupselect wrapper to get the data into CSV format ready to import into Neo4j.

It was mostly a straight forward job of finding the appropriate CSS tag and pulling out the values although the way fouls are described in the page is a bit strange – sometimes the person fouled comes first row and the fouler comes on the next line and sometimes vice versa.

Luckily the two parts of the foul can be joined together by matching the time which made life easier.

The full code for the scrapper is on github if you want to play with it.

This is what the resulting CSV file looks like:

$ head -n 10 data/events.csv 
matchId,foulId,freeKickId,time,foulLocation,fouledPlayer,fouledPlayerTeam,foulingPlayer,foulingPlayerTeam
32683310,3,2,90:00 +0:40,in the defensive half.,Xabi Alonso,FC Bayern München,Pedro,Barcelona
32683310,9,8,84:38,on the right wing.,Rafinha,FC Bayern München,Pedro,Barcelona
32683310,12,13,83:17,in the attacking half.,Lionel Messi,Barcelona,Sebastian Rode,FC Bayern München
32683310,15,14,82:43,in the defensive half.,Sebastian Rode,FC Bayern München,Neymar,Barcelona
32683310,17,18,80:41,in the attacking half.,Pedro,Barcelona,Xabi Alonso,FC Bayern München
32683310,22,23,76:31,in the defensive half.,Neymar,Barcelona,Rafinha,FC Bayern München
32683310,25,26,75:03,in the attacking half.,Lionel Messi,Barcelona,Xabi Alonso,FC Bayern München
32683310,31,30,69:37,in the attacking half.,Bastian Schweinsteiger,FC Bayern München,Dani Alves,Barcelona
32683310,36,35,63:27,in the attacking half.,Robert Lewandowski,FC Bayern München,Ivan Rakitic,Barcelona

Now it’s time to create a graph. We’ll aim to massage the data into this model:

2015 05 16 22 50 32

Next we need to write some Cypher code to get the CSV data into the graph. The full script is here, a sample of which is below:

// match
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Match {id: row.matchId});
 
// teams
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.foulingPlayerTeam});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.fouledPlayerTeam});
 
// players
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
ON CREATE SET player.name = row.foulingPlayer;
 
// appearances
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (match:Match {id: row.matchId})
MATCH (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
MATCH (team:Team {name: row.foulingPlayerTeam})
 
MERGE (appearance:Appearance {id: player.id + " in " + row.matchId})
MERGE (player)-[:MADE_APPEARANCE]->(appearance)
MERGE (appearance)-[:IN_MATCH]->(match)
MERGE (appearance)-[:FOR_TEAM]->(team);
 
// fouls
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
 
MATCH (foulingPlayer:Player {id:row.foulingPlayer + "_" + row.foulingPlayerTeam })
MATCH (fouledPlayer:Player {id:row.fouledPlayer + "_" + row.fouledPlayerTeam })
MATCH (match:Match {id: row.matchId})
 
MERGE (foul:Foul {eventId: row.foulId})
ON CREATE SET foul.time = row.time, foul.location = row.foulLocation
 
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayer)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayer)
MERGE (foul)-[:COMMITTED_IN_MATCH]->(match);

We’ll use neo4j-shell to execute the script:

$ ./neo4j-community-2.2.1/bin/neo4j-shell --file import.cql

Now that we’ve got the data into Neo4j we need to come up with some questions to ask of it. I came up with the following but perhaps you can think of some others!

  • Where do the fouls happen on the pitch?
  • Who made the most fouls?
  • Who was fouled the most?
  • Who fouled who the most?
  • Which team fouled the most?
  • Who’s the worst fouler in each team?
  • Who’s the most fouled in each team?

Where do the fouls happen?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC;
 
+----------------------------------+
| location                 | fouls |
+----------------------------------+
| "in the defensive half." | 12    |
| "in the attacking half." | 12    |
| "on the right wing."     | 3     |
| "on the left wing."      | 3     |
+----------------------------------+
4 rows

Who fouls the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+------------------------------+
| fouler               | fouls |
+------------------------------+
| "Rafinha"            | 4     |
| "Pedro"              | 3     |
| "Medhi Benatia"      | 3     |
| "Dani Alves"         | 3     |
| "Xabi Alonso"        | 3     |
| "Javier Mascherano"  | 2     |
| "Thiago Alcántara"   | 2     |
| "Robert Lewandowski" | 2     |
| "Sebastian Rode"     | 1     |
| "Sergio Busquets"    | 1     |
+------------------------------+
10 rows

Who was fouled the most?

// who was fouled the most
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+----------------------------------+
| fouled                   | fouls |
+----------------------------------+
| "Robert Lewandowski"     | 4     |
| "Lionel Messi"           | 4     |
| "Neymar"                 | 3     |
| "Pedro"                  | 2     |
| "Xabi Alonso"            | 2     |
| "Andrés Iniesta"         | 2     |
| "Rafinha"                | 2     |
| "Bastian Schweinsteiger" | 2     |
| "Sebastian Rode"         | 1     |
| "Sergio Busquets"        | 1     |
+----------------------------------+
10 rows

Who fouled who the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled),
      (foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+--------------------------------------------------------+
| fouler              | fouled                   | fouls |
+--------------------------------------------------------+
| "Javier Mascherano" | "Robert Lewandowski"     | 2     |
| "Dani Alves"        | "Bastian Schweinsteiger" | 2     |
| "Xabi Alonso"       | "Lionel Messi"           | 2     |
| "Rafinha"           | "Neymar"                 | 2     |
| "Rafinha"           | "Andrés Iniesta"         | 2     |
| "Dani Alves"        | "Xabi Alonso"            | 1     |
| "Thiago Alcántara"  | "Javier Mascherano"      | 1     |
| "Pedro"             | "Juan Bernat"            | 1     |
| "Medhi Benatia"     | "Pedro"                  | 1     |
| "Neymar"            | "Sebastian Rode"         | 1     |
+--------------------------------------------------------+
10 rows

Which team fouled the most?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+-----------------------------+
| team.name           | fouls |
+-----------------------------+
| "FC Bayern München" | 18    |
| "Barcelona"         | 12    |
+-----------------------------+
2 rows

Worst fouler for each team?

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouler.fouler.name | topFouler.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Rafinha"             | 4               |
| "Barcelona"         | "Pedro"               | 3               |
+---------------------------------------------------------------+
2 rows

Most fouled against for each team

match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
      (fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouled.fouled.name | topFouled.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Robert Lewandowski"  | 4               |
| "Barcelona"         | "Lionel Messi"        | 4               |
+---------------------------------------------------------------+
2 rows

So Bayern fouled a bit more than Barca, the main forwards for each team (Messi/Lewandowski) were the most fouled players on the pitch and the fouling was mostly in the middle of the pitch.

I expect this graph will become much more interesting to query with more matches and with the other event types as well but I haven’t got those scraped yet. The code is on github if you want to play around with it and perhaps get the other events into the graph.

Written by Mark Needham

May 16th, 2015 at 9:13 pm

Posted in neo4j

Tagged with

R: ggplot – Displaying multiple charts with a for loop

without comments

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
eventId = "220750415"
query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
df = cypher(graph, query, id= eventId)
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773

We’re not doing anything too clever here, just using a couple of WITH clauses to order RSVPs so we can get the earlier one for each person.

Once we’ve done that we’ll tidy up the data frame so that it contains columns containing the cohort in which the member attended their first event:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

byCohort = df %>% count(earliestDate) 
 
ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))

2015 05 13 00 30 59

This is good and gives us the insight that most of the members attending this version of intro to graphs just joined the group. The event was on 7th April and most people joined in March which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))
}

We’d call it like this for the Intro to graphs meetup:

> plotEvent("220750415")

Next I tweaked the code to look up all Into to graphs events and then loop through and output a chart for each event:

events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")
 
for(event in events$n) {
  plotEvent(as.character(event))
}

Unfortunately that doesn’t print anything at all which we can fix by storing our plots in a list and then displaying it afterwards:

library(gridExtra)
p = list()
for(i in 1:count(events)$n) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}
 
do.call(grid.arrange,p)

2015 05 14 00 57 10

This visualisation is probably better without any of the axis so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart which will require a slight tweak of the query:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
            WITH e,rsvp, person
            MATCH (person)-[:RSVPD]->(otherRSVP)
 
            WITH e,person, rsvp, otherRSVP
            ORDER BY person.id, otherRSVP.time
 
            WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
            return rsvp.time, earliestRSVP.time,  person.id, e.time"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(), 
          axis.text.x = element_blank(), 
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) + 
    labs(title = df$eventTime[1])
}
2015 05 14 01 08 54

I think this makes it a bit easier to read although I’ve made the mistake of not having all the charts representing the same scale – one to fix for next time.

We started doing the intro to graphs sessions less frequently towards the end of last year so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them but that doesn’t seem to be the case. Instead it’s very dominated by people signing up close to the event.

Written by Mark Needham

May 14th, 2015 at 12:17 am

Posted in neo4j,R

Tagged with , ,

Neo4j: LOAD CSV – java.io.InputStreamReader there’s a field starting with a quote and whereas it ends that quote there seems to be character in that field after that ending quote. That isn’t supported.

without comments

I recently came across the last.fm dataset via Ben Frederickson’s blog and thought it’d be an interesting one to load into Neo4j and explore.

I started with a simple query to parse the CSV file and count the number of rows:

LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv" 
AS row FIELDTERMINATOR  "\t"
return COUNT(*)
 
At java.io.InputStreamReader@4d307fda:6484 there's a field starting with a quote and whereas it ends that quote there seems  to be character in that field after that ending quote. That isn't supported. This is what I read: 'weird al"'

This blows up because (as the message says) we’ve got a field which uses double quotes but then has other characters either side of the quotes.

A quick search through the file reveals one of the troublesome lines:

$ grep "\"weird" lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv  | head -n 1
0015371426d2cbef354b2f680340de38d0ebd2f0	7746d775-9550-4360-b8d5-c37bd448ce01	"weird al" yankovic	4099

I ran a file containing only that line through CSV Lint to see what it thought and indeed it is invalid:

2015 05 04 10 50 43

Let’s clean up our file to use single quotes instead of double quotes and try the query again:

$ tr "\"" "'" < lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv > lastfm-dataset-360K/clean.tsv
LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/clean.tsv" as row FIELDTERMINATOR  "\t"
return COUNT(*)
 
17559530

And we’re back in business! Interestingly Python’s CSV reader chooses to strip out the double quotes rather than throw an exception:

import csv
with open("smallWeird.tsv", "r") as file:
    reader = csv.reader(file, delimiter="\t")
 
    for row in reader:
        print row
$ python explore.py
['0015371426d2cbef354b2f680340de38d0ebd2f0', '7746d775-9550-4360-b8d5-c37bd448ce01', 'weird al yankovic', '4099']

I prefer LOAD CSV’s approach but it’s an interesting trade off I hadn’t considred before.

Written by Mark Needham

May 4th, 2015 at 9:56 am

Posted in neo4j

Tagged with ,