Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ tag

Neo4j: Summarising neo4j-shell output

I frequently find myself trying to optimise a set of cypher queries and I tend to group them together in a script that I feed to the Neo4j shell.

When tweaking the queries it’s easy to make a mistake and end up not creating the same data so I decided to write a script which will show me the aggregates of all the commands executed.

I want to see the number of constraints created, indexes added, nodes, relationships and properties created. The first 2 don’t need to match across the scripts but the latter 3 should be the same.

I put together the following script:

import re
import sys
from tabulate import tabulate

lines = sys.stdin.readlines()

def search(term, line):
    # Counter lines look like "Nodes created: 3439"
    m = re.match(term + r": (\d+)", line)
    return int(m.group(1)) if m else 0

nodes_created, relationships_created, constraints_added, indexes_added, labels_added, properties_set = 0, 0, 0, 0, 0, 0
# Stays None if the input contains no 'real' line from the time command
time = None
for line in lines:
    nodes_created = nodes_created + search("Nodes created", line)
    relationships_created = relationships_created + search("Relationships created", line)
    constraints_added = constraints_added + search("Constraints added", line)
    indexes_added = indexes_added + search("Indexes added", line)
    labels_added = labels_added + search("Labels added", line)
    properties_set = properties_set + search("Properties set", line)

    # Non-greedy .*? so that multi-digit minutes are captured in full
    time_match = re.match(r"real.*?([0-9]+m[0-9]+\.[0-9]+s)$", line)

    if time_match:
        time = time_match.group(1)

table = [
            ["Constraints added", constraints_added],
            ["Indexes added", indexes_added],
            ["Nodes created", nodes_created],
            ["Relationships created", relationships_created],
            ["Labels added", labels_added],
            ["Properties set", properties_set],
            ["Time", time]
         ]
print(tabulate(table))

Its input is the piped output of the neo4j-shell command which will contain a description of all the queries it executed.

$ cat import.sh
#!/bin/sh
 
{ ./neo4j-community-2.2.3/bin/neo4j stop; } 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
{ ./neo4j-community-2.2.3/bin/neo4j start; } 2>&1
{ time ./neo4j-community-2.2.3/bin/neo4j-shell --file $1; } 2>&1

We can use the script in two ways.

Either we can pipe the output of our shell straight into it and just get the summary e.g.

$ ./import.sh local.import.optimised.cql | python summarise.py
 
---------------------  ---------
Constraints added      5
Indexes added          1
Nodes created          13249
Relationships created  32227
Labels added           21715
Properties set         36480
Time                   0m17.595s
---------------------  ---------

…or we can use the Unix ‘tee’ command to send the output both to a file and on to the summary script, and then either tail the file in another window or inspect it afterwards to see the detailed timings, e.g.

$ ./import.sh local.import.optimised.cql | tee /tmp/output.txt |  python summarise.py
 
---------------------  ---------
Constraints added      5
Indexes added          1
Nodes created          13249
Relationships created  32227
Labels added           21715
Properties set         36480
Time                   0m11.428s
---------------------  ---------
$ tail -f /tmp/output.txt
+-------------+
| appearances |
+-------------+
| 3771        |
+-------------+
1 row
Nodes created: 3439
Properties set: 3439
Labels added: 3439
289 ms
+------------------------------------+
| appearances -> player, match, team |
+------------------------------------+
| 3771                               |
+------------------------------------+
1 row
Relationships created: 10317
1006 ms
...
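The point of the summary is to check that two versions of an import script create the same data. That comparison can be sketched as below, written for Python 3. The function names `summarise` and `differences` and the two sample outputs are made up for illustration; only the counter names come from the neo4j-shell output above.

```python
import re

# Counters that should be identical across versions of an import script.
MUST_MATCH = ["Nodes created", "Relationships created", "Properties set"]


def summarise(lines):
    """Sum each counter over the lines of neo4j-shell output."""
    totals = dict.fromkeys(MUST_MATCH, 0)
    for line in lines:
        for term in MUST_MATCH:
            m = re.match(re.escape(term) + r": (\d+)", line)
            if m:
                totals[term] += int(m.group(1))
    return totals


def differences(before, after):
    """Return {counter: (before, after)} for every counter that differs."""
    return {term: (before[term], after[term])
            for term in MUST_MATCH if before[term] != after[term]}


# Hypothetical output captured from two runs, e.g. via tee.
original = ["Nodes created: 3439", "Properties set: 3439", "289 ms",
            "Relationships created: 10317"]
optimised = ["Nodes created: 3439", "Properties set: 3400",
             "Relationships created: 10317"]

print(differences(summarise(original), summarise(optimised)))
```

An empty dictionary means the two runs agree on all three counters.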

My only dependency is the tabulate package to get the pretty table:

$ cat requirements.txt
 
tabulate==0.7.5

The cypher script I’m running creates a BBC football graph which is available as a github project. Feel free to grab it and play around – any problems let me know!

Written by Mark Needham

August 21st, 2015 at 8:59 pm

Posted in neo4j

Neo4j 2.2.3: Unmanaged extensions – Creating gzipped streamed responses with Jetty

Back in 2013 I wrote a couple of blog posts showing examples of an unmanaged extension which had a streamed and gzipped response but two years on I realised they were a bit out of date and deserved a refresh.

When writing unmanaged extensions in Neo4j a good rule of thumb is to try and reduce the number of objects you keep hanging around. In this context this means that we should stream our response to the client as quickly as possible rather than building it up in memory and sending it in one go.

The documentation has a good example showing how to stream a list of colleagues but in this blog post we’ll look at how to do something simpler – we’ll create a couple of nodes representing people and then write an unmanaged extension to return them.

We’ll first create an unmanaged extension which runs a cypher query, iterates through the rows returned and sends them to the client:

@Path("/example")
public class ExampleResource {
    private final GraphDatabaseService db;
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
 
    public ExampleResource(@Context GraphDatabaseService db) {
        this.db = db;
    }
 
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @Path("/people")
    public Response allNodes() throws IOException {
        StreamingOutput stream = streamQueryResponse("MATCH (n:Person) RETURN n.name AS name");
        return Response.ok().entity(stream).type(MediaType.APPLICATION_JSON).build();
    }
 
    private StreamingOutput streamQueryResponse(final String query) {
        return new StreamingOutput() {
                @Override
                public void write(OutputStream os) throws IOException, WebApplicationException {
                    JsonGenerator jg = OBJECT_MAPPER.getJsonFactory().createJsonGenerator(os, JsonEncoding.UTF8);
                    jg.writeStartArray();
 
                    writeQueryResultTo(query, jg);
 
                    jg.writeEndArray();
                    jg.flush();
                    jg.close();
                }
            };
    }
 
    private void writeQueryResultTo(String query, JsonGenerator jg) throws IOException {
        try (Result result = db.execute(query)) {
            while (result.hasNext()) {
                Map<String, Object> row = result.next();
 
                jg.writeStartObject();
                for (Map.Entry<String, Object> entry : row.entrySet()) {
                    jg.writeFieldName(entry.getKey());
                    jg.writeString(entry.getValue().toString());
                }
                jg.writeEndObject();
            }
        }
    }
}

There’s nothing too complicated going on here, although notice that we make much more fine-grained calls to the JSON library rather than creating a JSON object in memory and calling ObjectMapper#writeValueAsString on it.
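The same idea, writing each row as soon as we have it instead of building the whole document first, can be sketched in a few lines of Python 3. This is only an illustration of the pattern and not part of the extension: `stream_rows_as_json` and `people` are hypothetical names, and the generator stands in for iterating over a query result.

```python
import io
import json


def stream_rows_as_json(rows, out):
    """Write rows to `out` as a JSON array, one element at a time."""
    out.write("[")
    for i, row in enumerate(rows):
        if i > 0:
            out.write(",")
        # Only the current row is serialised; earlier rows are already written.
        out.write(json.dumps(row))
    out.write("]")


def people():
    # Stand-in for iterating over a query result.
    yield {"name": "Mark"}
    yield {"name": "Nicole"}


buffer = io.StringIO()
stream_rows_as_json(people(), buffer)
print(buffer.getvalue())
```

In a real server `out` would be the response stream rather than an in-memory buffer, which is what keeps memory use flat as the result grows.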

To get this to work we’d build a JAR containing this class, put that into the plugins folder and then add the following property to conf/neo4j-server.properties (or the Neo4j desktop equivalent) before restarting the server:

org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.unmanaged=/unmanaged

We can then test it out like this:

$ curl http://localhost:7474/unmanaged/example/people
[{"name":"Mark"},{"name":"Nicole"}]

I’ve put in a couple of test people nodes – full instructions are available on the github README page.

Next we want to make it possible to send that response in the gzip format. To do that we need to add a GzipFilter to the Neo4j lifecycle. This class has moved to a different namespace in Jetty 9 which Neo4j 2.2.3 depends on, but the following class does the job:

import org.eclipse.jetty.servlets.GzipFilter;
 
public class GZipInitialiser implements SPIPluginLifecycle {
    private WebServer webServer;
 
    @Override
    public Collection<Injectable<?>> start(NeoServer neoServer) {
        webServer = getWebServer(neoServer);
        GzipFilter filter = new GzipFilter();
 
        webServer.addFilter(filter, "/*");
        return Collections.emptyList();
    }
 
    private WebServer getWebServer(final NeoServer neoServer) {
        if (neoServer instanceof AbstractNeoServer) {
            return ((AbstractNeoServer) neoServer).getWebServer();
        }
        throw new IllegalArgumentException("expected AbstractNeoServer");
    }
 
    @Override
    public Collection<Injectable<?>> start(GraphDatabaseService graphDatabaseService, Configuration configuration) {
        throw new IllegalAccessError();
    }
 
    @Override
    public void stop() {
 
    }
}

I needed to include the jetty-servlets JAR in my unmanaged extension JAR in order for this to work correctly. Once we redeploy the JAR and restart Neo4j we can try making the same request as above but with a gzip header:

$ curl -v -H "Accept-Encoding:gzip,deflate" http://localhost:7474/unmanaged/example/people
��V�K�MU�R�M,�V�Ձ��2��sR�jcf(#

We can unpack that on the fly by piping it through gunzip to check we get a sensible result:

$ curl -v -H "Accept-Encoding:gzip,deflate" http://localhost:7474/unmanaged/example/people | gunzip
[{"name":"Mark"},{"name":"Nicole"}]
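What gunzip does to the response body can also be reproduced locally with Python 3's gzip module. This sketch doesn't talk to Neo4j at all; it just compresses the JSON shown above and unpacks it again, and the variable names are my own.

```python
import gzip
import json

# The JSON body we expect the extension to return.
body = json.dumps([{"name": "Mark"}, {"name": "Nicole"}]).encode("utf-8")

# Roughly what a gzip filter does to the body before it goes over the wire.
compressed = gzip.compress(body)

# gzip streams always start with the magic bytes 0x1f 0x8b.
assert compressed[:2] == b"\x1f\x8b"

# What the client (or gunzip) does on the way back.
restored = gzip.decompress(compressed)
print(restored.decode("utf-8"))
```

For a body this small the compressed version won't save much, if anything; the benefit shows up on larger responses.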

And there we have it – a gzipped streamed response. All the code is on github so give it a try and give me a shout if it doesn’t work. The fastest way to get me is probably on our new shiny neo4j-users Slack group.

Written by Mark Needham

August 10th, 2015 at 11:57 pm

Posted in neo4j

Neo4j: Cypher – Removing consecutive duplicates

When writing Cypher queries I sometimes find myself wanting to remove consecutive duplicates in collections that I’ve joined together.

e.g. we might start with the following query, where 1 and 7 each appear twice in a row:

RETURN [1,1,2,3,4,5,6,7,7,8] AS values
 
==> +-----------------------+
==> | values                |
==> +-----------------------+
==> | [1,1,2,3,4,5,6,7,7,8] |
==> +-----------------------+
==> 1 row

We want to end up with [1,2,3,4,5,6,7,8]. We can start by exploding our array and putting consecutive elements next to each other:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
RETURN idx, idx+1, values[idx], values[idx+1]
 
==> +-------------------------------------------+
==> | idx | idx+1 | values[idx] | values[idx+1] |
==> +-------------------------------------------+
==> | 0   | 1     | 1           | 1             |
==> | 1   | 2     | 1           | 2             |
==> | 2   | 3     | 2           | 3             |
==> | 3   | 4     | 3           | 4             |
==> | 4   | 5     | 4           | 5             |
==> | 5   | 6     | 5           | 6             |
==> | 6   | 7     | 6           | 7             |
==> | 7   | 8     | 7           | 7             |
==> | 8   | 9     | 7           | 8             |
==> +-------------------------------------------+
==> 9 rows

Next we can filter out rows which have the same values since that means they have consecutive duplicates:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN a,b
 
==> +-------+
==> | a | b |
==> +-------+
==> | 1 | 2 |
==> | 2 | 3 |
==> | 3 | 4 |
==> | 4 | 5 |
==> | 5 | 6 |
==> | 6 | 7 |
==> | 7 | 8 |
==> +-------+
==> 7 rows

Now we need to join the collection back together again. Most of the values we want are in field ‘b’ but we also need to grab the first value from field ‘a’:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

What about if we have more than 2 duplicates in a row?

WITH [1,1,1,2,3,4,5,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

Still happy, good times! Of course if we have a non consecutive duplicate that wouldn’t be removed:

WITH [1,1,1,2,3,4,5,5,6,7,7,8,1] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +---------------------+
==> | noDuplicates        |
==> +---------------------+
==> | [1,2,3,4,5,6,7,8,1] |
==> +---------------------+
==> 1 row
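For anyone who finds the pairwise trick easier to follow outside Cypher, here is the same logic as a Python sketch. The function name is made up, and the handling of a list with no surviving pairs (empty, single element, or all identical) is my own assumption rather than something the queries above cover.

```python
def remove_consecutive_duplicates(values):
    # Pair each element with its successor, like
    # UNWIND RANGE(0, LENGTH(values) - 2) AS idx.
    pairs = zip(values, values[1:])

    # Keep only the pairs whose two sides differ (WHERE a <> b).
    kept = [(a, b) for a, b in pairs if a != b]

    if not kept:
        # Assumption: with no surviving pairs, keep the first element if any.
        return values[:1]

    # First 'a' plus every 'b', like COLLECT(a)[0] + COLLECT(b).
    return [kept[0][0]] + [b for _, b in kept]


print(remove_consecutive_duplicates([1, 1, 2, 3, 4, 5, 6, 7, 7, 8]))
print(remove_consecutive_duplicates([1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 1]))
```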

Written by Mark Needham

July 30th, 2015 at 6:23 am

Posted in neo4j

Neo4j: MERGE’ing on super nodes

In my continued playing with the Chicago crime data set I wanted to connect the crimes committed to their position in the FBI crime type hierarchy.

These are the sub graphs that I want to connect:

[Image: the two sub graphs to be connected]

We have a ‘fbiCode’ on each ‘Crime’ node which indicates which ‘Crime Sub Category’ the crime belongs to.

I started with the following query to connect the nodes together:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
 
MATCH (subCat:SubCategory {code: crime.fbiCode})
MERGE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

I had this running inside a Python script which incremented ‘skip’ by 10,000 on each iteration as long as ‘crimesProcessed’ came back with a value > 0.
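The driving loop looked roughly like the sketch below (Python 3). To keep it runnable without a database, the query call is abstracted behind a hypothetical `run_batch(skip)` callable that returns ‘crimesProcessed’ for that batch; `fake_run_batch` and its count of 25,000 crimes are invented purely for the demonstration.

```python
def process_in_batches(run_batch, batch_size=10000):
    """Call run_batch(skip) with increasing skip until it reports 0 rows."""
    skip = 0
    total = 0
    while True:
        processed = run_batch(skip)
        if processed == 0:
            break
        total += processed
        skip += batch_size
    return total


def fake_run_batch(skip, crime_count=25000, batch_size=10000):
    # Stand-in for executing the Cypher query with the given 'skip' parameter.
    return max(0, min(batch_size, crime_count - skip))


print(process_in_batches(fake_run_batch))
```

In the real script `run_batch` would execute the query above through the driver and read ‘crimesProcessed’ from the single row it returns.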

To start with the ‘CATEGORY’ relationships were being created very quickly but it slowed down quite noticeably about 1 million nodes in.

I profiled the queries but the query plans didn’t show anything obviously wrong. My suspicion was that I had a super node problem where the Cypher runtime was iterating through all of the sub category’s relationships to check whether one of them pointed to the crime on the other side of the ‘MERGE’ statement.

I cancelled the import job and wrote a query to check how many relationships each sub category had. It varied from 1,000 to 93,000 somewhat confirming my suspicion.
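To picture why that would hurt, here is a toy model of the existence check in Python 3. It only illustrates the suspected cost and says nothing about how Neo4j's runtime is actually implemented; the adjacency lists, the crime id and the choice of sub category code are invented, and only the 93,000 figure comes from the query above.

```python
def scan(relationships, target):
    """Linear scan; returns (found, number of entries examined)."""
    for examined, other_end in enumerate(relationships, start=1):
        if other_end == target:
            return True, examined
    return False, len(relationships)


# One sub category with 93,000 crimes already attached, and a new crime
# (id 93000) that has no relationships yet.
incoming = {"06": list(range(93000))}   # sub category code -> crime ids
outgoing = {93000: []}                   # crime id -> sub category codes

# Checking from the sub category side walks its whole relationship list...
print(scan(incoming["06"], 93000))
# ...while checking from the crime side has nothing to walk.
print(scan(outgoing[93000], "06"))
```

Both checks give the same answer, but the amount of work depends entirely on which end you start from.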

Michael suggested tweaking the query to use the shortestPath function to check for the existence of the relationship and then use the ‘CREATE’ clause to create it if it didn’t exist.

The neat thing about the shortestPath function is that it will start from the side with the lowest cardinality and as soon as it finds a relationship it will stop searching. Let’s have a look at that version of the query:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
WITH crime, subCat, shortestPath((crime)-[:CATEGORY]->(subCat)) AS path
FOREACH(ignoreMe IN CASE WHEN path is NULL THEN [1] ELSE [] END |
  CREATE (crime)-[:CATEGORY]->(subCat))
RETURN COUNT(*)

This worked much better – 10,000 nodes processed in ~2.5 seconds – and the time remained constant as more relationships were added. This allowed me to create all the ‘CATEGORY’ relationships, but we can actually do even better if we use CREATE UNIQUE instead of MERGE:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
 
MATCH (subCat:SubCategory {code: crime.fbiCode})
CREATE UNIQUE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

Using this query, each batch of 10,000 nodes took between ~250 ms and ~900 ms to process, which means we can process all the nodes in 5-6 minutes – good times!

I’m not super familiar with the ‘CREATE UNIQUE’ code so I’m not sure that it’s always a good substitute for ‘MERGE’ but on this occasion it does the job.

The lesson for me here is that if a query is taking longer than you think it should, try other constructs, or a combination of them, and see whether things improve – they just might!

Written by Mark Needham

July 28th, 2015 at 9:04 pm

Posted in neo4j

Neo4j: From JSON to CSV to LOAD CSV via jq

In my last blog post I showed how to import a Chicago crime categories & sub categories JSON document using Neo4j’s cypher query language via the py2neo driver. While this is a good approach for people with a developer background, many of the users I encounter aren’t developers and favour using Cypher via the Neo4j browser.

If we’re going to do this we’ll need to transform our JSON document into a CSV file so that we can use the LOAD CSV command on it. Michael pointed me to the jq tool which comes in very handy.

To recap, this is a part of the JSON file:

{
    "categories": [
        {
            "name": "Index Crime",
            "sub_categories": [
                {
                    "code": "01A",
                    "description": "Homicide 1st & 2nd Degree"
                },
            ]
        },
        {
            "name": "Non-Index Crime",
            "sub_categories": [
                {
                    "code": "01B",
                    "description": "Involuntary Manslaughter"
                },
            ]
        },
        {
            "name": "Violent Crime",
            "sub_categories": [
                {
                    "code": "01A",
                    "description": "Homicide 1st & 2nd Degree"
                },
            ]
        }
    ]
}

We want to get one row for each sub category which contains three columns – category name, sub category code, sub category description.

First we need to pull out the categories:

$ jq ".categories[]" categories.json
 
{
  "name": "Index Crime",
  "sub_categories": [
    {
      "code": "01A",
      "description": "Homicide 1st & 2nd Degree"
    },
  ]
}
{
  "name": "Non-Index Crime",
  "sub_categories": [
    {
      "code": "01B",
      "description": "Involuntary Manslaughter"
    },
  ]
}
{
  "name": "Violent Crime",
  "sub_categories": [
    {
      "code": "01A",
      "description": "Homicide 1st & 2nd Degree"
    },
  ]
}

Next we want to create a row for each sub category with the category alongside it. We can use the pipe operator to combine the two selectors:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]}" categories.json
 
{
  "name": "Index Crime",
  "sub_category": {
    "code": "01A",
    "description": "Homicide 1st & 2nd Degree"
  }
}
...
{
  "name": "Non-Index Crime",
  "sub_category": {
    "code": "01B",
    "description": "Involuntary Manslaughter"
  }
}
...
{
  "name": "Violent Crime",
  "sub_category": {
    "code": "01A",
    "description": "Homicide 1st & 2nd Degree"
  }
}

Now we want to un-nest the sub category:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description]" categories.json
 
[
  "Index Crime",
  "01A",
  "Homicide 1st & 2nd Degree"
]
 
[
  "Non-Index Crime",
  "01B",
  "Involuntary Manslaughter"
]
 
[
  "Violent Crime",
  "01A",
  "Homicide 1st & 2nd Degree"
]

And finally let’s use the @csv filter to generate CSV lines:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description] | @csv" categories.json
"\"Index Crime\",\"01A\",\"Homicide 1st & 2nd Degree\""
"\"Index Crime\",\"02\",\"Criminal Sexual Assault\""
"\"Index Crime\",\"03\",\"Robbery\""
"\"Index Crime\",\"04A\",\"Aggravated Assault\""
"\"Index Crime\",\"04B\",\"Aggravated Battery\""
"\"Index Crime\",\"05\",\"Burglary\""
"\"Index Crime\",\"06\",\"Larceny\""
"\"Index Crime\",\"07\",\"Motor Vehicle Theft\""
"\"Index Crime\",\"09\",\"Arson\""
"\"Non-Index Crime\",\"01B\",\"Involuntary Manslaughter\""
"\"Non-Index Crime\",\"08A\",\"Simple Assault\""
"\"Non-Index Crime\",\"08B\",\"Simple Battery\""
"\"Non-Index Crime\",\"10\",\"Forgery & Counterfeiting\""
"\"Non-Index Crime\",\"11\",\"Fraud\""
"\"Non-Index Crime\",\"12\",\"Embezzlement\""
"\"Non-Index Crime\",\"13\",\"Stolen Property\""
"\"Non-Index Crime\",\"14\",\"Vandalism\""
"\"Non-Index Crime\",\"15\",\"Weapons Violation\""
"\"Non-Index Crime\",\"16\",\"Prostitution\""
"\"Non-Index Crime\",\"17\",\"Criminal Sexual Abuse\""
"\"Non-Index Crime\",\"18\",\"Drug Abuse\""
"\"Non-Index Crime\",\"19\",\"Gambling\""
"\"Non-Index Crime\",\"20\",\"Offenses Against Family\""
"\"Non-Index Crime\",\"22\",\"Liquor License\""
"\"Non-Index Crime\",\"24\",\"Disorderly Conduct\""
"\"Non-Index Crime\",\"26\",\"Misc Non-Index Offense\""
"\"Violent Crime\",\"01A\",\"Homicide 1st & 2nd Degree\""
"\"Violent Crime\",\"02\",\"Criminal Sexual Assault\""
"\"Violent Crime\",\"03\",\"Robbery\""
"\"Violent Crime\",\"04A\",\"Aggravated Assault\""
"\"Violent Crime\",\"04B\",\"Aggravated Battery\""

The only annoying thing about this output is that all the double quotes are escaped. We can sort that out by passing the ‘-r’ flag when we call jq:

$ jq -r ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description] | @csv" categories.json
"Index Crime","01A","Homicide 1st & 2nd Degree"
"Index Crime","02","Criminal Sexual Assault"
"Index Crime","03","Robbery"
"Index Crime","04A","Aggravated Assault"
"Index Crime","04B","Aggravated Battery"
"Index Crime","05","Burglary"
"Index Crime","06","Larceny"
"Index Crime","07","Motor Vehicle Theft"
"Index Crime","09","Arson"
"Non-Index Crime","01B","Involuntary Manslaughter"
"Non-Index Crime","08A","Simple Assault"
"Non-Index Crime","08B","Simple Battery"
"Non-Index Crime","10","Forgery & Counterfeiting"
"Non-Index Crime","11","Fraud"
"Non-Index Crime","12","Embezzlement"
"Non-Index Crime","13","Stolen Property"
"Non-Index Crime","14","Vandalism"
"Non-Index Crime","15","Weapons Violation"
"Non-Index Crime","16","Prostitution"
"Non-Index Crime","17","Criminal Sexual Abuse"
"Non-Index Crime","18","Drug Abuse"
"Non-Index Crime","19","Gambling"
"Non-Index Crime","20","Offenses Against Family"
"Non-Index Crime","22","Liquor License"
"Non-Index Crime","24","Disorderly Conduct"
"Non-Index Crime","26","Misc Non-Index Offense"
"Violent Crime","01A","Homicide 1st & 2nd Degree"
"Violent Crime","02","Criminal Sexual Assault"
"Violent Crime","03","Robbery"
"Violent Crime","04A","Aggravated Assault"
"Violent Crime","04B","Aggravated Battery"

Excellent. The only thing left is to write a header and then direct the output into a CSV file and get it into Neo4j:

$ echo "category,sub_category_code,sub_category_description" > categories.csv
$ jq -r ".categories[] |
         {name: .name, sub_category: .sub_categories[]} |
         [.name, .sub_category.code, .sub_category.description] |
         @csv " categories.json >> categories.csv
$ head -n10 categories.csv
category,sub_category_code,sub_category_description
"Index Crime","01A","Homicide 1st & 2nd Degree"
"Index Crime","02","Criminal Sexual Assault"
"Index Crime","03","Robbery"
"Index Crime","04A","Aggravated Assault"
"Index Crime","04B","Aggravated Battery"
"Index Crime","05","Burglary"
"Index Crime","06","Larceny"
"Index Crime","07","Motor Vehicle Theft"
"Index Crime","09","Arson"

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/categories.csv" AS row
MERGE (c:CrimeCategory {name: row.category})
MERGE (sc:SubCategory {code: row.sub_category_code})
ON CREATE SET sc.description = row.sub_category_description
MERGE (c)-[:CHILD]->(sc)

And that’s it!

[Image: screenshot of the resulting graph]
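If jq isn't to hand, the same flattening can be sketched with Python 3's standard library. This is an alternative to the jq pipeline rather than what I actually ran; `categories_to_csv` is a made-up name and the sample document is a cut-down version of the JSON file.

```python
import csv
import io


def categories_to_csv(document, out):
    """Write one CSV row per (category, sub category) pair."""
    # Header written by hand so it stays unquoted, as in the echo above.
    out.write("category,sub_category_code,sub_category_description\n")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for category in document["categories"]:
        for sub_category in category["sub_categories"]:
            writer.writerow([category["name"],
                             sub_category["code"],
                             sub_category["description"]])


sample = {
    "categories": [
        {"name": "Index Crime",
         "sub_categories": [{"code": "01A", "description": "Homicide 1st & 2nd Degree"}]},
        {"name": "Non-Index Crime",
         "sub_categories": [{"code": "01B", "description": "Involuntary Manslaughter"}]},
    ]
}

buffer = io.StringIO()
categories_to_csv(sample, buffer)
print(buffer.getvalue())
```

To produce the real file you would load categories.json with `json.load` and pass an open file handle instead of the in-memory buffer.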

Written by Mark Needham

July 25th, 2015 at 11:05 pm

Posted in neo4j

Neo4j: Loading JSON documents with Cypher

One of the questions I get asked most often is how to load JSON documents into Neo4j, and although Cypher doesn’t have a ‘LOAD JSON’ command we can still get JSON data into the graph.

Michael shows how to do this from various languages in this blog post and I recently wanted to load a JSON document that I generated from Chicago crime types.

This is a snippet of the JSON document:

{
    "categories": [
        {
            "name": "Index Crime", 
            "sub_categories": [
                {
                    "code": "01A", 
                    "description": "Homicide 1st & 2nd Degree"
                }
            ]
        }, 
        {
            "name": "Non-Index Crime", 
            "sub_categories": [
                {
                    "code": "01B", 
                    "description": "Involuntary Manslaughter"
                }
            ]
        }, 
        {
            "name": "Violent Crime", 
            "sub_categories": [
                {
                    "code": "01A", 
                    "description": "Homicide 1st & 2nd Degree"
                }
            ]
        }
    ]
}

We want to create the following graph structure from this document:

[Image: the target graph structure]

We can then connect the crimes to the appropriate sub category and write aggregation queries that drill down from the category.

To do this we’re going to have to pass the JSON document to Neo4j via its HTTP API rather than through the browser. Luckily there are drivers available for {insert your favourite language here} so we should still be good.

Python is my current go-to language so I’m going to use py2neo to load the data in.

Let’s start by writing a simple query which passes our JSON document in and gets it straight back. Note that I’ve updated my Neo4j password to be ‘foobar’ – replace that with your equivalent if you’re following along:

import json
from py2neo import Graph, authenticate
 
# replace 'foobar' with your password
authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()
 
with open('categories.json') as data_file:
    # Use a separate name so we don't shadow the json module
    document = json.load(data_file)

query = """
RETURN {json} AS document
"""

# Send Cypher query.
print(graph.cypher.execute(query, json=document))

$ python import_categories.py
   | document
---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1 | {u'categories': [{u'name': u'Index Crime', u'sub_categories': [{u'code': u'01A', u'description': u'Homicide 1st & 2nd Degree'}, {u'code': u'02', u'description': u'Criminal Sexual Assault'}, {u'code': u'03', u'description': u'Robbery'}, {u'code': u'04A', u'description': u'Aggravated Assault'}, {u'code': u'04B', u'description': u'Aggravated Battery'}, {u'code': u'05', u'description': u'Burglary'}, {u'code': u'06', u'description': u'Larceny'}, {u'code': u'07', u'description': u'Motor Vehicle Theft'}, {u'code': u'09', u'description': u'Arson'}]}, {u'name': u'Non-Index Crime', u'sub_categories': [{u'code': u'01B', u'description': u'Involuntary Manslaughter'}, {u'code': u'08A', u'description': u'Simple Assault'}, {u'code': u'08B', u'description': u'Simple Battery'}, {u'code': u'10', u'description': u'Forgery & Counterfeiting'}, {u'code': u'11', u'description': u'Fraud'}, {u'code': u'12', u'description': u'Embezzlement'}, {u'code': u'13', u'description': u'Stolen Property'}, {u'code': u'14', u'description': u'Vandalism'}, {u'code': u'15', u'description': u'Weapons Violation'}, {u'code': u'16', u'description': u'Prostitution'}, {u'code': u'17', u'description': u'Criminal Sexual Abuse'}, {u'code': u'18', u'description': u'Drug Abuse'}, {u'code': u'19', u'description': u'Gambling'}, {u'code': u'20', u'description': u'Offenses Against Family'}, {u'code': u'22', u'description': u'Liquor License'}, {u'code': u'24', u'description': u'Disorderly Conduct'}, {u'code': u'26', u'description': u'Misc Non-Index Offense'}]}, {u'name': u'Violent Crime', u'sub_categories': [{u'code': u'01A', u'description': u'Homicide 1st & 2nd Degree'}, {u'code': u'02', u'description': u'Criminal Sexual Assault'}, {u'code': u'03', u'description': u'Robbery'}, {u'code': u'04A', u'description': u'Aggravated Assault'}, {u'code': u'04B', u'description': u'Aggravated Battery'}]}]}

It’s a bit ugly but we can see that everything’s there! Our next step is to extract each category into its own row. We can do this by accessing the ‘categories’ key in our JSON document and then using the UNWIND clause, which allows us to expand a collection into a sequence of rows:

query = """
WITH {json} AS document
UNWIND document.categories AS category
RETURN category.name
"""

$ python import_categories.py
   | category.name
---+-----------------
 1 | Index Crime
 2 | Non-Index Crime
 3 | Violent Crime

Now we can create a node for each of those categories. We’ll use the MERGE command so that we can run this script multiple times without ending up with repeat categories:

query = """
WITH {json} AS document
UNWIND document.categories AS category
MERGE (:CrimeCategory {name: category.name}) 
"""

Let’s quickly check those categories were correctly imported:

match (category:CrimeCategory)
return category

[Image: the imported CrimeCategory nodes]
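The reason MERGE lets us re-run the script safely is that it behaves like "get or create". A toy model of that behaviour in Python 3, using a plain dict keyed by category name, looks like this. It is only an analogy: `merge_category` is a made-up helper and this is not how Neo4j implements MERGE.

```python
def merge_category(graph, name):
    """Return the existing node for `name`, creating it only if absent."""
    if name not in graph:
        graph[name] = {"label": "CrimeCategory", "name": name}
    return graph[name]


graph = {}

# Simulate running the import script three times over.
for _ in range(3):
    for name in ["Index Crime", "Non-Index Crime", "Violent Crime"]:
        merge_category(graph, name)

print(len(graph))
```

However many times the loop runs, there are still only three categories.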

Looking good so far – now for the sub categories. We’re going to use the UNWIND clause to help us out here as well:

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
RETURN category.name, subCategory.code, subCategory.description
"""

$ python import_categories.py
    | category.name   | subCategory.code | subCategory.description
----+-----------------+------------------+---------------------------
  1 | Index Crime     | 01A              | Homicide 1st & 2nd Degree
  2 | Index Crime     | 02               | Criminal Sexual Assault
  3 | Index Crime     | 03               | Robbery
  4 | Index Crime     | 04A              | Aggravated Assault
  5 | Index Crime     | 04B              | Aggravated Battery
  6 | Index Crime     | 05               | Burglary
  7 | Index Crime     | 06               | Larceny
  8 | Index Crime     | 07               | Motor Vehicle Theft
  9 | Index Crime     | 09               | Arson
 10 | Non-Index Crime | 01B              | Involuntary Manslaughter
 11 | Non-Index Crime | 08A              | Simple Assault
 12 | Non-Index Crime | 08B              | Simple Battery
 13 | Non-Index Crime | 10               | Forgery & Counterfeiting
 14 | Non-Index Crime | 11               | Fraud
 15 | Non-Index Crime | 12               | Embezzlement
 16 | Non-Index Crime | 13               | Stolen Property
 17 | Non-Index Crime | 14               | Vandalism
 18 | Non-Index Crime | 15               | Weapons Violation
 19 | Non-Index Crime | 16               | Prostitution
 20 | Non-Index Crime | 17               | Criminal Sexual Abuse
 21 | Non-Index Crime | 18               | Drug Abuse
 22 | Non-Index Crime | 19               | Gambling
 23 | Non-Index Crime | 20               | Offenses Against Family
 24 | Non-Index Crime | 22               | Liquor License
 25 | Non-Index Crime | 24               | Disorderly Conduct
 26 | Non-Index Crime | 26               | Misc Non-Index Offense
 27 | Violent Crime   | 01A              | Homicide 1st & 2nd Degree
 28 | Violent Crime   | 02               | Criminal Sexual Assault
 29 | Violent Crime   | 03               | Robbery
 30 | Violent Crime   | 04A              | Aggravated Assault
 31 | Violent Crime   | 04B              | Aggravated Battery
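
If the double UNWIND looks unfamiliar, it behaves much like a pair of nested loops. Here is a minimal Python sketch, assuming the JSON document has the shape the query implies (a `categories` list whose entries have a `name` and a `sub_categories` list). The sample document is a hand-made fragment for illustration, not the real file, and `unwind_sub_categories` is a name I made up:

```python
# Hypothetical fragment shaped like the document the query expects.
document = {
    "categories": [
        {"name": "Index Crime",
         "sub_categories": [
             {"code": "01A", "description": "Homicide 1st & 2nd Degree"},
             {"code": "02", "description": "Criminal Sexual Assault"},
         ]},
        {"name": "Violent Crime",
         "sub_categories": [
             {"code": "01A", "description": "Homicide 1st & 2nd Degree"},
         ]},
    ]
}


def unwind_sub_categories(document):
    """Return one flat row per (category, sub category) pair."""
    rows = []
    # Outer loop plays the role of: UNWIND document.categories AS category
    for category in document["categories"]:
        # Inner loop plays the role of: UNWIND category.sub_categories AS subCategory
        for sub_category in category["sub_categories"]:
            rows.append((category["name"],
                         sub_category["code"],
                         sub_category["description"]))
    return rows


rows = unwind_sub_categories(document)
for row in rows:
    print(" | ".join(row))
```

Each inner item is paired with its parent, which is why the category name repeats on every row of the output.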

Let’s give sub categories the MERGE treatment too:

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

And finally let’s write a query to check what we’ve imported:

match (category:CrimeCategory)-[:CHILD]->(subCategory)
return *

[Image: graph visualisation of the query results]

What I hadn’t realised before running this query is that some sub categories sit under multiple categories, so that’s quite an interesting insight. The final Python script is available on GitHub. If you have any questions, let me know.
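
If you’d rather not eyeball the graph, grouping the rows by code shows the shared sub categories directly. This is only a sketch over a handful of (category, code) pairs copied from the table output earlier in this post, not the full data set:

```python
from collections import defaultdict

# A few (category, code) pairs copied from the script output earlier in the post.
rows = [
    ("Index Crime", "01A"),
    ("Index Crime", "03"),
    ("Index Crime", "05"),
    ("Non-Index Crime", "01B"),
    ("Violent Crime", "01A"),
    ("Violent Crime", "03"),
]

# Collect the set of parent categories for each sub category code.
parents = defaultdict(set)
for category, code in rows:
    parents[code].add(category)

# Keep only the codes that sit under more than one category.
shared = {code: sorted(cats) for code, cats in parents.items() if len(cats) > 1}
for code in sorted(shared):
    print("%s: %s" % (code, ", ".join(shared[code])))
```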

Written by Mark Needham

July 23rd, 2015 at 6:15 am

Posted in neo4j

Neo4j 2.2.3: neo4j-import – Encoder StringEncoder[2] returned an illegal encoded value 0

I’ve been playing around with the Chicago crime data set again while preparing for a Neo4j webinar next week, and while running the import tool I ran into the following exception:

Importing the contents of these files into tmp/crimes.db:
Nodes:
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimes.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/beats.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/primaryTypes.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/locations.csv
Relationships:
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimesBeats.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimesPrimaryTypes.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimesLocationsCleaned.csv
 
Available memory:
  Free machine memory: 263.17 MB
  Max heap memory : 3.56 GB
 
Nodes
[*>:17.41 MB/s-------------------------|PROPERTIES(3)=|NODE:3|LABEL SCAN----|v:36.30 MB/s(2)===]  3MImport error: Panic called, so exiting
java.lang.RuntimeException: Panic called, so exiting
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:200)
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.await(AbstractStep.java:191)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep.receive(ProcessorStep.java:98)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep.sendDownstream(ProcessorStep.java:224)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep.access$400(ProcessorStep.java:42)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$Sender.send(ProcessorStep.java:250)
	at org.neo4j.unsafe.impl.batchimport.LabelScanStorePopulationStep.process(LabelScanStorePopulationStep.java:60)
	at org.neo4j.unsafe.impl.batchimport.LabelScanStorePopulationStep.process(LabelScanStorePopulationStep.java:37)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:120)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:102)
	at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:237)
Caused by: java.lang.IllegalStateException: Encoder StringEncoder[2] returned an illegal encoded value 0
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.encode(EncodingIdMapper.java:229)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.put(EncodingIdMapper.java:208)
	at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:77)
	at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:43)
	... 3 more

I narrowed the problem down to a specific file and from tracing the code learned that this exception happens when we’ve ended up with a node that doesn’t have an id.

I guessed that this might be due to there being an empty column somewhere in my CSV file so I did a bit of grepping:

$ grep -rn  "\"\"" tmp/locations.csv
tmp/locations.csv:11:"",Location
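
Grepping for `""` matches an empty quoted value in any column. If you only care about the ID column, a small script can be more precise. This is a minimal sketch, assuming the file has a header row whose ID column is marked with `:ID` (for example `id:ID(Location)`); `find_empty_ids` is my own helper name, not part of any Neo4j tooling:

```python
import csv
import io


def find_empty_ids(csv_file):
    """Return (line_number, row) pairs whose ID column is empty."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Locate the ID column by its ':ID' marker in the header row.
    id_index = next(i for i, name in enumerate(header) if ":ID" in name)
    return [(reader.line_num, row)
            for row in reader
            if row and row[id_index].strip() == ""]


# In-memory stand-in for a file such as tmp/locations.csv
sample = io.StringIO(u'id:ID(Location),:LABEL\n"NEWSSTAND",Location\n"",Location\n')
print(find_empty_ids(sample))
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample.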

We can now narrow down the import to just the one line and see if we still get the exception:

$ cat foo.csv
id:ID(Location),:LABEL
"",Location
 
$ ./neo4j-community-2.2.3/bin/neo4j-import --into tmp/foo --nodes foo.csv
Importing the contents of these files into tmp/foo:
Nodes:
  /Users/markneedham/projects/neo4j-spark-chicago/foo.csv
 
Available memory:
  Free machine memory: 2.22 GB
  Max heap memory : 3.56 GB
 
Nodes
Import error: Encoder StringEncoder[2] returned an illegal encoded value 0

Yep, same error. Now we can clean up our CSV file and try again:

$ grep -v  "\"\"" foo.csv > fooCleaned.csv
 
# I put in a few real records so we can see them import
$ cat fooCleaned.csv
id:ID(Location),:LABEL
"RAILROAD PROPERTY",Location
"NEWSSTAND",Location
"SCHOOL, PRIVATE, BUILDING",Location
 
$ ./neo4j-community-2.2.3/bin/neo4j-import --into tmp/foo --nodes fooCleaned.csv
Importing the contents of these files into tmp/foo:
Nodes:
  /Users/markneedham/projects/neo4j-spark-chicago/fooCleaned.csv
 
Available memory:
  Free machine memory: 1.23 GB
  Max heap memory : 3.56 GB
 
Nodes
[*>:??-------------------------------------------------|PROPE|NODE:7.63 MB-----------------|LA] 10k
Done in 110ms
Prepare node index
[*DETECT:7.63 MB-------------------------------------------------------------------------------]   0
Done in 60ms
Calculate dense nodes
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 10ms
Relationships
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 11ms
Node --> Relationship
[*v:??-----------------------------------------------------------------------------------------] 10k
Done in 1ms
Relationship --> Relationship
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 11ms
Node counts
[>:|*COUNT:76.29 MB----------------------------------------------------------------------------] 10k
Done in 46ms
Relationship counts
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 12ms
 
IMPORT DONE in 1s 576ms. Imported:
  3 nodes
  0 relationships
  3 properties

Sweet! We’re back in business.

Written by Mark Needham

July 21st, 2015 at 6:11 am

Posted in neo4j

Neo4j: The football transfers graph

Given we’re still in pre-season transfer madness as far as European football is concerned, I thought it’d be interesting to put together a football transfers graph to see whether there are any interesting insights to be had.

It took me a while to find an appropriate source but I eventually came across transfermarkt.co.uk which contains transfers going back at least as far as the start of the Premier League in 1992.

I wrote a quick Python script to create a CSV file of all the transfers. This is what the file looks like:

$ head -n 10 data/transfers.csv
player,from_team,from_team_id,to_team,to_team_id,fee,season
Martin Keown,Everton,29,Arsenal FC,11,"2,10 Mill. £",1992-1993
John Jensen,Bröndby IF,206,Arsenal FC,11,"1,12 Mill. £",1992-1993
Alan Miller,Birmingham,337,Arsenal FC,11,,1992-1993
Jim Will,Sheffield Utd.,350,Arsenal FC,11,,1992-1993
David Rocastle,Arsenal FC,11,Leeds,399,"1,68 Mill. £",1992-1993
Perry Groves,Arsenal FC,11,Southampton FC,180,595 Th. £,1992-1993
Ty Gooden,Arsenal FC,11,Wycombe Wand.,2805,?,1992-1993
Geraint Williams,Derby,22,Ipswich Town,677,525 Th. £,1992-1993
Jason Winters,Chelsea U21,9250,Ipswich Town,677,?,1992-1993

I’m going to create the following graph and then we’ll write some queries which explore chains of transfers involving players and clubs.

[Image: diagram of the transfers graph model]

I wrote a few import scripts using Neo4j’s LOAD CSV command, having set up the appropriate indexes first:

create index on :Team(id);
create index on :Season(name);
create index on :Transfer(description);
create index on :Player(name);

// teams
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/teams.csv" as row
merge (team:Team {id: toint(row.team_id)})
on create set team.name = row.team;
 
// seasons
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/transfers.csv" as row
merge (season:Season {name: row.season})
ON CREATE SET season.starts =  toint(split(season.name, "-")[0]);
 
// players
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/transfers.csv" as row
merge (player:Player {name: row.player});
 
// transfers
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/transfers.csv" as row
match (from:Team {id: toint(row.from_team_id)})
match (to:Team {id: toint(row.to_team_id)})
match (season:Season {name: row.season})
match (player:Player {name: row.player})
 
merge (transfer:Transfer {description: row.player + " from " + from.name + " to " + to.name})
merge (transfer)-[:FROM_TEAM]->(from)
merge (transfer)-[:TO_TEAM]->(to)
merge (transfer)-[:IN_SEASON]->(season)
merge (transfer)-[:PLAYER]->(player);
 
// connect transfers
match (season)<-[:IN_SEASON]-(transfer:Transfer)-[:PLAYER]->(player)
WITH player, season, transfer
ORDER BY player.name, season.starts
WITH player, COLLECT({s: season, t: transfer}) AS transfers
UNWIND range(0, length(transfers)-2) AS idx
WITH player, transfers[idx] AS t1, transfers[idx +1] AS t2
WITH player, t1.t AS t1, t2.t AS t2
MERGE (t1)-[:NEXT]->(t2);

All the files and scripts are on this gist if you want to play around with the data. The only thing you’ll need to change is the file path on each of the ‘LOAD CSV’ lines.

The ‘connect transfers’ query is a bit more complicated than the others. In that one we first order the transfers by player and then by season, and then create a linked list of each player’s transfers by connecting each one to the next with a NEXT relationship.
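
The same idea outside Cypher: sort the transfers by player and season, then pair each one with its successor. This is a rough Python sketch; the tuples are made-up sample data and `link_transfers` is a name I’ve invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# (player, season start year, description): hypothetical sample data
transfers = [
    ("Player A", 2001, "A from Y to Z"),
    ("Player B", 1995, "B from X to Y"),
    ("Player A", 1998, "A from X to Y"),
    ("Player A", 2005, "A from Z to X"),
]


def link_transfers(transfers):
    """Return (earlier, later) description pairs, i.e. the NEXT relationships."""
    # Equivalent of: ORDER BY player.name, season.starts
    ordered = sorted(transfers, key=itemgetter(0, 1))
    links = []
    # Equivalent of: COLLECT(...) AS transfers, grouped by player
    for _, group in groupby(ordered, key=itemgetter(0)):
        chain = [description for _, _, description in group]
        # Pair element idx with idx + 1; a single transfer yields no pairs
        links.extend(zip(chain, chain[1:]))
    return links


for earlier, later in link_transfers(transfers):
    print("%s -[:NEXT]-> %s" % (earlier, later))
```

A player with only one transfer produces no links, which matches the Cypher version where the range is empty.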

Now that we’ve got the data loaded let’s find out which player was transferred the most:

match path = (:Transfer)-[:NEXT*0..]->(transfer:Transfer)
where NOT (transfer)-[:NEXT]->()
RETURN path 
ORDER BY LENGTH(path) DESC
LIMIT 1

[Image: graph visualisation of the query results]

Which other players have moved teams frequently?

match path = (first:Transfer)-[:NEXT*0..]->(transfer:Transfer),
             (player)<-[:PLAYER]-(transfer)
where NOT ((transfer)-[:NEXT]->()) AND NOT ((first)<-[:NEXT]-())
RETURN player.name, LENGTH(path) AS numberOfTransfers 
ORDER BY numberOfTransfers DESC
LIMIT 10
 
==> +--------------------------------------+
==> | player.name      | numberOfTransfers |
==> +--------------------------------------+
==> | "Craig Bellamy"  | 7                 |
==> | "David Unsworth" | 6                 |
==> | "Andrew Cole"    | 6                 |
==> | "Peter Crouch"   | 6                 |
==> | "Les Ferdinand"  | 5                 |
==> | "Kevin Phillips" | 5                 |
==> | "Mark Hughes"    | 5                 |
==> | "Tommy Wright"   | 4                 |
==> | "Carl Tiler"     | 4                 |
==> | "Don Hutchison"  | 4                 |
==> +--------------------------------------+
==> 10 rows

What are the most frequent combinations of clubs involved in transfers?

match (from)<-[:FROM_TEAM]-(t:Transfer)-[:TO_TEAM]->(to), (t)-[:PLAYER]->(p)
RETURN from.name, to.name, COUNT(*) AS times, COLLECT(p.name) AS players
ORDER BY times DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | from.name           | to.name               | times | players                                                                                                                                                                                                    |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "West Ham United"   | "Queens Park Rangers" | 13    | ["Keith Rowland","Iain Dowie","Tim Breacker","Ludek Miklosko","Bertie Brayley","Terrell Forbes","Steve Lomas","Hogan Ephraim","Nigel Quashie","Danny Gabbidon","Kieron Dyer","Robert Green","Gary O'Neil"] |
==> | "Tottenham Hotspur" | "Portsmouth FC"       | 12    | ["Paul Walsh","Andy Turner","Rory Allen","Justin Edinburgh","Tim Sherwood","Teddy Sheringham","Noé Pamarot","Pedro Mendes","Sean Davis","Jermain Defoe","Younès Kaboul","Kevin-Prince Boateng"]            |
==> | "Liverpool FC"      | "West Ham United"     | 12    | ["Julian Dicks","David Burrows","Mike Marsh","Don Hutchison","Neil Ruddock","Titi Camara","Rob Jones","Rigobert Song","Craig Bellamy","Joe Cole","Andy Carroll","Stewart Downing"]                         |
==> | "Manchester United" | "Everton FC"          | 9     | ["Andrey Kanchelskis","John O'Kane","Jesper Blomqvist","Phil Neville","Tim Howard","Louis Saha","Darron Gibson","Sam Byrne","Tom Cleverley"]                                                               |
==> | "Newcastle United"  | "West Ham United"     | 9     | ["Paul Kitson","Shaka Hislop","Stuart Pearce","Wayne Quinn","Lee Bowyer","Kieron Dyer","Scott Parker","Nolberto Solano","Kevin Nolan"]                                                                     |
==> | "Blackburn Rovers"  | "Leicester City"      | 9     | ["Steve Agnew","Tim Flowers","Callum Davidson","John Curtis","Keith Gillespie","Craig Hignett","Nils-Eric Johansson","Bruno Berner","Paul Gallagher"]                                                      |
==> | "Chelsea FC"        | "Southampton FC"      | 8     | ["Ken Monkou","Kerry Dixon","Neil Shipperley","Mark Hughes","Paul Hughes","Graeme Le Saux","Jack Cork","Ryan Bertrand"]                                                                                    |
==> | "Birmingham City"   | "Coventry City"       | 8     | ["David Rennie","John Gayle","Liam Daish","Gary Breen","Stern John","Julian Gray","Lee Carsley","Gary McSheffrey"]                                                                                         |
==> | "Southampton FC"    | "Fulham FC"           | 8     | ["Micky Adams","Kevin Moore","Terry Hurlock","Maik Taylor","Alan Neilson","Luís Boa Morte","Antti Niemi","Chris Baird"]                                                                                    |
==> | "Portsmouth FC"     | "Stoke City"          | 8     | ["Kevin Harper","Lewis Buxton","Anthony Pulis","Vincent Péricard","Asmir Begovic","Marc Wilson","Elliot Wheeler","Alex Grant"]                                                                             |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 10 rows

Are there ever situations where players get transferred in both directions?

match (from)<-[:FROM_TEAM]-(t:Transfer)-[:TO_TEAM]->(to), (t)-[:PLAYER]->(player)
where id(from) < id(to)
WITH from, to, COUNT(*) AS times, COLLECT(player.name) AS players
match (to)<-[:FROM_TEAM]-(t:Transfer)-[:TO_TEAM]->(from), (t)-[:PLAYER]->(player)
RETURN from.name, to.name, times, COUNT(*) as otherWayTimes, players, COLLECT(player.name) AS otherWayPlayers
ORDER BY times + otherWayTimes DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | from.name           | to.name               | times | otherWayTimes | players                                                                                                                                                                                                    | otherWayPlayers                                                                                                                                                                    |
==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "Tottenham Hotspur" | "Portsmouth FC"       | 12    | 5             | ["Paul Walsh","Andy Turner","Rory Allen","Justin Edinburgh","Tim Sherwood","Teddy Sheringham","Noé Pamarot","Pedro Mendes","Sean Davis","Jermain Defoe","Younès Kaboul","Kevin-Prince Boateng"]            | ["Jermain Defoe","Niko Kranjcar","Younès Kaboul","Peter Crouch","Darren Anderton"]                                                                                                 |
==> | "West Ham United"   | "Liverpool FC"        | 4     | 12            | ["Julian Dicks","Daniel Sjölund","Yossi Benayoun","Javier Mascherano"]                                                                                                                                     | ["Stewart Downing","Andy Carroll","Joe Cole","Craig Bellamy","Rigobert Song","Titi Camara","Rob Jones","Neil Ruddock","Don Hutchison","Julian Dicks","Mike Marsh","David Burrows"] |
==> | "West Ham United"   | "Queens Park Rangers" | 13    | 2             | ["Keith Rowland","Iain Dowie","Tim Breacker","Ludek Miklosko","Bertie Brayley","Terrell Forbes","Steve Lomas","Hogan Ephraim","Nigel Quashie","Danny Gabbidon","Kieron Dyer","Robert Green","Gary O'Neil"] | ["Andy Impey","Trevor Sinclair"]                                                                                                                                                   |
==> | "West Ham United"   | "Tottenham Hotspur"   | 5     | 8             | ["Jermain Defoe","Frédéric Kanouté","Michael Carrick","Jimmy Walker","Scott Parker"]                                                                                                                       | ["Sergiy Rebrov","Mauricio Taricco","Calum Davenport","Les Ferdinand","Matthew Etherington","Bobby Zamora","Ilie Dumitrescu","Mark Robson"]                                        |
==> | "West Ham United"   | "Portsmouth FC"       | 8     | 5             | ["Martin Allen","Adrian Whitbread","Marc Keller","Svetoslav Todorov","Hayden Foxe","Shaka Hislop","Sébastien Schemmel","Hayden Mullins"]                                                                   | ["Stephen Henderson","Teddy Sheringham","Shaka Hislop","Marc Keller","Lee Chapman"]                                                                                                |
==> | "Newcastle United"  | "West Ham United"     | 9     | 3             | ["Paul Kitson","Shaka Hislop","Stuart Pearce","Wayne Quinn","Lee Bowyer","Kieron Dyer","Scott Parker","Nolberto Solano","Kevin Nolan"]                                                                     | ["Demba Ba","Lee Bowyer","David Terrier"]                                                                                                                                          |
==> | "Birmingham City"   | "Coventry City"       | 8     | 4             | ["David Rennie","John Gayle","Liam Daish","Gary Breen","Stern John","Julian Gray","Lee Carsley","Gary McSheffrey"]                                                                                         | ["Scott Dann","David Burrows","Peter Ndlovu","David Smith"]                                                                                                                        |
==> | "Manchester City"   | "Portsmouth FC"       | 8     | 4             | ["Paul Walsh","Carl Griffiths","Fitzroy Simpson","Eyal Berkovic","David James","Andrew Cole","Sylvain Distin","Tal Ben Haim"]                                                                              | ["Benjani","Gerry Creaney","Kit Symons","Paul Walsh"]                                                                                                                              |
==> | "Blackburn Rovers"  | "Southampton FC"      | 5     | 6             | ["David Speedie","Stuart Ripley","James Beattie","Kevin Davies","Zak Jones"]                                                                                                                               | ["Zak Jones","Egil Östenstad","Kevin Davies","Alan Shearer","Jeff Kenna","Tim Flowers"]                                                                                            |
==> | "AFC Bournemouth"   | "West Ham United"     | 3     | 8             | ["Keith Rowland","Paul Mitchell","Scott Mean"]                                                                                                                                                             | ["Steve Jones","Matt Holland","Mohammed Berthé","Scott Mean","Paul Mitchell","Jamie Victory","Mark Watson","Stephen Purches"]                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Are there any players who go back to a club they were at previously?

match (player:Player)<-[:PLAYER]-(t1:Transfer)-[:FROM_TEAM]->(from)<-[:TO_TEAM]-(t2:Transfer)-[:FROM_TEAM]->(to),
      (t2)-[:PLAYER]->(player), (t1)-[:TO_TEAM]->(to)
WHERE ID(to) < ID(from)
WITH player, COLLECT([ from.name, " ⥄ ", to.name]) AS teams
RETURN player.name, 
       REDUCE(acc = [], item in teams | acc  + REDUCE(acc2 = "", i in item | acc2 + i)) AS thereAndBack
ORDER BY LENGTH(thereAndBack) DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------+
==> | player.name       | thereAndBack                                                    |
==> +-------------------------------------------------------------------------------------+
==> | "Mark Stein"      | ["Stoke City ⥄ Chelsea FC","Ipswich Town ⥄ Chelsea FC"]         |
==> | "Peter Beagrie"   | ["Bradford City ⥄ Everton FC","Bradford City ⥄ Wigan Athletic"] |
==> | "Richard Dryden"  | ["Southampton FC ⥄ Stoke City","Southampton FC ⥄ Swindon Town"] |
==> | "Robbie Elliott"  | ["Bolton Wanderers ⥄ Newcastle United"]                         |
==> | "Elliot Grandin"  | ["Blackpool FC ⥄ Crystal Palace"]                               |
==> | "Robert Fleck"    | ["Chelsea FC ⥄ Norwich City"]                                   |
==> | "Paul Walsh"      | ["Portsmouth FC ⥄ Manchester City"]                             |
==> | "Rick Holden"     | ["Manchester City ⥄ Oldham Athletic"]                           |
==> | "Gary McAllister" | ["Liverpool FC ⥄ Coventry City"]                                |
==> | "Lee Bowyer"      | ["West Ham United ⥄ Newcastle United"]                          |
==> +-------------------------------------------------------------------------------------+

That’s all I’ve got for now – if you can think of any other interesting avenues to explore let me know and I’ll take a look.

Written by Mark Needham

July 16th, 2015 at 6:40 am

Posted in neo4j

Northwind: Finding direct/transitive Reports in SQL and Neo4j’s Cypher

Every few months we run a relational-to-graph meetup at the Neo London office where we go through how to take your data from a relational database and put it into a graph.

We use the Northwind dataset, which often ships as a demo dataset with relational databases, and come up with some queries which seem graph-like in nature.

My favourite query is one which finds out how employees are organised and who reports to whom. I thought it’d be quite interesting to see what it would look like in Postgres SQL as well, just for fun.

We’ll start off by getting a list of employees and the person they report to:

SELECT e."EmployeeID", e."ReportsTo"
FROM employees AS e
WHERE e."ReportsTo" IS NOT NULL;
 
 EmployeeID | ReportsTo
------------+-----------
          1 |         2
          3 |         2
          4 |         2
          5 |         2
          6 |         5
          7 |         5
          8 |         2
          9 |         5
(8 rows)

In cypher we’d do this:

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.EmployeeID, e.EmployeeID 
 
+-------------------------------+
| sub.EmployeeID | e.EmployeeID |
+-------------------------------+
| "4"            | "2"          |
| "5"            | "2"          |
| "1"            | "2"          |
| "3"            | "2"          |
| "8"            | "2"          |
| "9"            | "5"          |
| "6"            | "5"          |
| "7"            | "5"          |
+-------------------------------+
8 rows

Next let’s find the big boss who doesn’t report to anyone. First in SQL:

SELECT e."EmployeeID" AS bigBoss
FROM employees AS e
WHERE e."ReportsTo" IS NULL
 
 bigboss
---------
       2
(1 row)

And now cypher:

MATCH (e:Employee)
WHERE NOT (e)-[:REPORTS_TO]->()
RETURN e.EmployeeID AS bigBoss
 
+---------+
| bigBoss |
+---------+
| "2"     |
+---------+
1 row

We still don’t need to join anything so the query isn’t that interesting yet. Let’s bring in some more properties from the manager record so we have to self join on the employees table:

SELECT e."FirstName", e."LastName", e."Title", manager."FirstName", manager."LastName", manager."Title"
FROM employees AS e
JOIN employees AS manager ON e."ReportsTo" = manager."EmployeeID"
WHERE e."ReportsTo" IS NOT NULL
 
 FirstName | LastName  |          Title           | FirstName | LastName |         Title
-----------+-----------+--------------------------+-----------+----------+-----------------------
 Nancy     | Davolio   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Janet     | Leverling | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Margaret  | Peacock   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Steven    | Buchanan  | Sales Manager            | Andrew    | Fuller   | Vice President, Sales
 Michael   | Suyama    | Sales Representative     | Steven    | Buchanan | Sales Manager
 Robert    | King      | Sales Representative     | Steven    | Buchanan | Sales Manager
 Laura     | Callahan  | Inside Sales Coordinator | Andrew    | Fuller   | Vice President, Sales
 Anne      | Dodsworth | Sales Representative     | Steven    | Buchanan | Sales Manager
(8 rows)

And in cypher:

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.FirstName, sub.LastName, sub.Title, e.FirstName, e.LastName, e.Title
 
+----------------------------------------------------------------------------------------------------------------+
| sub.FirstName | sub.LastName | sub.Title                  | e.FirstName | e.LastName | e.Title                 |
+----------------------------------------------------------------------------------------------------------------+
| "Margaret"    | "Peacock"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Steven"      | "Buchanan"   | "Sales Manager"            | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Nancy"       | "Davolio"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Janet"       | "Leverling"  | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Laura"       | "Callahan"   | "Inside Sales Coordinator" | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Anne"        | "Dodsworth"  | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Michael"     | "Suyama"     | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Robert"      | "King"       | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
+----------------------------------------------------------------------------------------------------------------+
8 rows

Now let’s see how many direct reports each manager has:

SELECT manager."EmployeeID" AS manager, COUNT(e."EmployeeID") AS reports
FROM employees AS manager
LEFT JOIN employees AS e ON e."ReportsTo" = manager."EmployeeID"
GROUP BY manager
ORDER BY reports DESC;
 
 manager | reports
---------+---------
       2 |       5
       5 |       3
       1 |       0
       3 |       0
       4 |       0
       9 |       0
       6 |       0
       7 |       0
       8 |       0
(9 rows)

And in cypher:

MATCH (e:Employee)
OPTIONAL MATCH (e)<-[rel:REPORTS_TO]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 5       |
| "5"      | 3       |
| "8"      | 0       |
| "7"      | 0       |
| "1"      | 0       |
| "4"      | 0       |
| "6"      | 0       |
| "9"      | 0       |
| "3"      | 0       |
+--------------------+
9 rows

Things start to get more interesting if we find the transitive reporting relationships that exist. I’m not an expert at Postgres but one way to achieve this is by writing a recursive WITH query like so:

WITH RECURSIVE recursive_employees("EmployeeID", "ReportsTo") AS (
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e
      UNION ALL
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e, recursive_employees re
        WHERE e."EmployeeID" = re."ReportsTo"
)
SELECT re."ReportsTo", COUNT(*) AS count
FROM recursive_employees AS re
WHERE re."ReportsTo" IS NOT NULL
GROUP BY re."ReportsTo";
 
 ReportsTo | count
-----------+-------
         2 |     8
         5 |     3
(2 rows)

If there’s a simpler way let me know in the comments.

In cypher we only need to add one character, ‘*’, after the ‘REPORTS_TO’ relationship to get it to recurse as far as it can. We’ll also remove the ‘OPTIONAL MATCH’ so that we only get back people who have people reporting to them:

MATCH (e:Employee)<-[rel:REPORTS_TO*]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 8       |
| "5"      | 3       |
+--------------------+
2 rows
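
As a sanity check on both results, the same counts fall out of a few lines of Python. This is just a sketch over the EmployeeID/ReportsTo pairs returned by the first query in this post; it doesn’t talk to either database, and it assumes the reporting chain has no cycles:

```python
from collections import Counter

# EmployeeID -> ReportsTo, copied from the first SQL result in this post
reports_to = {1: 2, 3: 2, 4: 2, 5: 2, 6: 5, 7: 5, 8: 2, 9: 5}


def transitive_report_counts(reports_to):
    """Count, for each manager, everyone below them at any depth."""
    counts = Counter()
    for employee in reports_to:
        manager = reports_to.get(employee)
        # Walk up the chain, crediting every manager along the way.
        # Assumes there are no cycles in the reporting structure.
        while manager is not None:
            counts[manager] += 1
            manager = reports_to.get(manager)
    return counts


counts = transitive_report_counts(reports_to)
print(sorted(counts.items()))  # [(2, 8), (5, 3)]
```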

Now I need to find some relational datasets with more complicated queries to play around with. If you have any ideas do let me know.

Written by Mark Needham

June 15th, 2015 at 10:53 pm

Posted in neo4j

Neo4j: Using LOAD CSV to help explore CSV files

During the Neo4j How I Met Your Mother hackathon that we ran last week, one of the attendees noticed that one of the CSV files we were importing wasn’t creating as many records as they expected.

This is typically the case when there’s some odd quoting in the CSV file but we decided to look into it.

The file in question was one containing references made in HIMYM. The first 5 lines look like this:

$ head -n 5 data/import/references.csv
ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText
168,184,"Marshall will eventually hear back from the New York State Judicatory Committee in Something New, which will become a main plot point of Season 9."
168,169,Barney proclaiming to be done with Robin will be the focal point of Lobster Crawl.
58,57,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
58,63,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."

And this is how many lines the Unix ‘wc’ command sees:

$ wc -l data/import/references.csv
     782 data/import/references.csv

So we might expect 781 records (782 lines minus the header row) to be created if we import that file into Neo4j. Let’s run a quick query in Neo4j to see what it thinks:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" 
AS row
return count(*)
 
==> +----------+
==> | count(*) |
==> +----------+
==> | 636      |
==> +----------+
==> 1 row

So we have 145 fewer records than the 781 we’d expect from 782 lines minus the header row, which means Neo4j is treating multiple lines as one CSV record in some cases.
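
The gap between physical lines and CSV records is easy to reproduce on a tiny example. This is a minimal sketch using Python’s csv module rather than Neo4j, and the `sample` text is made up for illustration: a header plus two records, one of which has a line break inside a quoted field.

```python
import csv
import io

# Made-up sample: a header plus two records, the second of which has a
# line break inside its quoted ReferenceText field.
sample = (
    "ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText\n"
    "1,2,A single-line reference.\n"
    '3,4,"First sentence.\nSecond sentence."\n'
)

physical_lines = sample.count("\n")  # what `wc -l` counts
records = list(csv.DictReader(io.StringIO(sample)))  # header row excluded

print(physical_lines, len(records))  # 4 2
```

A CSV-aware parser keeps the quoted line break inside the field, so it reports fewer records than there are lines in the file.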

Let’s go back to the Unix command line to try to work out which lines those are. There must be some lines which start with part of the ‘ReferenceText’ rather than a ‘ReferencedEpisodeId’, so let’s extract the first column and see what’s going on there:

$ cat data/import/references.csv | cut -d"," -f1 | grep -v  '[0-9]\+$'| head -n 10
ReferencedEpisodeId
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
Also

We’ve extracted the first column and then filtered the output to keep only the rows which don’t end in a number; these are our rogue rows.

Let’s switch back to Neo4j land to see which rows it thinks contain these fragments of text:
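
For anyone who prefers to do this check in a script, here is a rough Python equivalent of the cut and grep steps. It is a sketch only: `rogue_first_columns` is a hypothetical helper and `sample_lines` is made-up data shaped like the file above. Like the shell pipeline, it splits naively on the first comma and ignores quoting.

```python
def rogue_first_columns(lines):
    """Return first-column values that do not end in a digit."""
    rogue = []
    for line in lines:
        # Equivalent of `cut -d"," -f1`: everything before the first comma.
        first = line.rstrip("\n").split(",", 1)[0]
        # Equivalent of `grep -v '[0-9]\+$'`: drop values ending in a digit.
        if not first[-1:].isdigit():
            rogue.append(first)
    return rogue


sample_lines = [
    "ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText\n",
    '45,56,"Ted eventually acquires the yellow umbrella.\n',
    'This is the Mother\'s first on-screen appearance, with more text."\n',
    "58,57,Some other reference.\n",
]

for value in rogue_first_columns(sample_lines):
    print(value)
```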

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                     |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Interestingly, it returns only two rows containing that phrase, whereas we saw it at least 8 times in the raw file. Initially I thought this was an issue with the LOAD CSV command, but if we filter the rows to only return ones that have a ‘ReferencedEpisodeId’ of ‘45’ then we do see them returned:

==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "53", ReferenceText -> "The website counting down to the next slap (slapcountdown.com) that Marshall sends Barney reaches zero in Slapsgiving, when the third slap is delivered."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "55", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "57", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "63", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "92", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So the actual problem is that the regex matcher doesn’t deal with the new line in the string: by default ‘.’ doesn’t match a line break, so the pattern can’t span text on both sides of one.
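
We can see the same behaviour outside of Neo4j. The sketch below uses Python’s re module, on the assumption that its default handling of ‘.’ is close enough to the regex engine behind Cypher’s =~ for this purpose. `re.fullmatch` is used because =~ has to match the whole string, and the two sample strings are shortened, made-up versions of the real text.

```python
import re

pattern = ".*This is the Mother's first.*"
single_line = "This is the Mother's first on-screen appearance."
multi_line = (
    "Ted acquires the umbrella.\n"
    "This is the Mother's first on-screen appearance."
)

# fullmatch mirrors the whole-string matching of Cypher's =~ operator.
print(bool(re.fullmatch(pattern, single_line)))  # True
print(bool(re.fullmatch(pattern, multi_line)))   # False: '.' stops at "\n"
print(bool(re.fullmatch(pattern, multi_line.replace("\n", " "))))  # True
```

Once the line break is replaced with a space the same pattern matches, which is why getting rid of new lines inside the strings is one way around the problem.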

Our next step is therefore to get rid of new lines within strings. I spent ages trying to find the appropriate command before coming across the following use of awk which does the job:

$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" | wc -l
637
 
$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" > data/import/refs.csv
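
The awk command splits its input on double quotes, so every second record is the inside of a quoted string, and ‘$1=$1’ makes awk rebuild those records with single spaces between the words, which swallows the line breaks. If awk isn’t to hand, something similar can be done with Python’s csv module. This is a sketch under assumptions rather than a drop-in replacement: `flatten_newlines` is a hypothetical helper, the sample text is made up, and the writer only quotes fields that need it, so the output won’t be byte-for-byte what awk produces.

```python
import csv
import io


def flatten_newlines(csv_text):
    """Rewrite CSV text so that no field contains a line break."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # " ".join(field.split()) collapses every run of whitespace,
        # including line breaks, into a single space.
        writer.writerow([" ".join(field.split()) for field in row])
    return out.getvalue()


sample = (
    "ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText\n"
    '45,56,"First part.\nSecond part, with a comma."\n'
)

print(flatten_newlines(sample), end="")
```

Because the rows go through a CSV parser first, commas and quotes inside the text are kept intact while the embedded line breaks become spaces.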

Let’s try the LOAD CSV command again:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/refs.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "56"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "200"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "100"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "86"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "113"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "161"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And there we go!

Update

Michael pointed out that I could have put the dotall regex flag, (?s), at the beginning of the regular expression so that it matches across newlines without my having to remove them! In that case the query would read like this:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ "(?s).*This is the Mother.*"
RETURN row
 
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
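The effect of the flag can be reproduced outside of Neo4j. What follows is a minimal sketch using Python's re module rather than the regex engine Cypher uses, so treat it as an illustration of the idea and not of Neo4j's behaviour. The sample text is made up for the example, and re.fullmatch stands in for the whole-string matching that =~ performs.

```python
import re

# Hypothetical sample: two sentences separated by a newline,
# similar in shape to the multi-line ReferenceText values above.
text = ("Ted eventually acquires the yellow umbrella in No Tomorrow.\n"
        "This is the Mother's first on-screen appearance.")

# Without the dotall flag '.' does not match '\n', so the leading '.*'
# cannot get past the first line and the whole-string match fails.
without_flag = re.fullmatch(r".*This is the Mother.*", text) is not None

# With the inline (?s) flag '.' also matches '\n', so the pattern can
# span both lines.
with_flag = re.fullmatch(r"(?s).*This is the Mother.*", text) is not None

print("without (?s):", without_flag)
print("with (?s):", with_flag)
```

The first pattern fails only because of the newline in the text, which is the same reason the original query needed the newlines stripped before matching.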

Written by Mark Needham

June 11th, 2015 at 11:15 pm

Posted in neo4j,Shell Scripting

Tagged with