Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother

without comments

After reading Emil’s blog post about dark data a few weeks ago I became intrigued about trying to find some structure in free text data and I thought How I met your mother’s transcripts would be a good place to start.

I found a website which has the transcripts for all the episodes and then having manually downloaded the two pages which listed all the episodes, wrote a script to grab each of the transcripts so I could use them on my machine.

I wanted to learn a bit of Python and my colleague Nigel pointed me towards the requests and BeautifulSoup libraries to help me with my task. The script to grab the transcripts looks like this:

import requests
from bs4 import BeautifulSoup
from soupselect import select
 
episodes = {}
for i in range(1,3):
    page = open("data/transcripts/page-" + str(i) + ".html", 'r')
    soup = BeautifulSoup(page.read())
 
    for row in select(soup, "td.topic-titles a"):
        parts = row.text.split(" - ")
        episodes[parts[0]] = {"title": parts[1], "link": row.get("href")}
 
for key, value in episodes.iteritems():
    parts = key.split("x")
    season = int(parts[0])
    episode = int(parts[1])
    filename = "data/transcripts/S%d-Ep%d" %(season, episode)
    print filename
 
    with open(filename, 'wb') as handle:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers = headers)
        if response.ok:
            for block in response.iter_content(1024):
                if not block:
                    break
 
                handle.write(block)

the files containing the lists of episodes are named ‘page-1′ and ‘page-2′

The code is reasonably simple – we find all the links inside the table, put them in a dictionary and then iterate through the dictionary and download the files to disk. The code to save the file is a bit of a monstrosity but there didn’t seem to be a ‘save’ method that I could use.

Having downloaded the files, I thought through all sorts of clever things I could do, including generating a bag of words model for each episode or performing sentiment analysis on each sentence which I’d learnt about from a Kaggle tutorial.

In the end I decided to start simple and extract all the words from the transcripts and count many times a word occurred in a given episode.

I ended up with the following script which created a dictionary of (episode -> words + occurrences):

import csv
import nltk
import re
 
from bs4 import BeautifulSoup
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize
 
def count_words(words):
    tally=Counter()
    for elem in words:
        tally[elem] += 1
    return tally
 
episodes_dict = {}
with open('data/import/episodes.csv', 'r') as episodes:
    reader = csv.reader(episodes, delimiter=',')
    reader.next()
 
    for row in reader:
        print row
        transcript = open("data/transcripts/S%s-Ep%s" %(row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")
 
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]
 
        text = re.sub("[^a-zA-Z]", " ", raw_text.text.strip())
        words = [w for w in nltk.word_tokenize(text) if not w.lower() in stopwords.words("english")]
 
        episodes_dict[row[0]] = count_words(words)

Next I wanted to explore the data a bit to see which words occurred across episodes or which word occurred most frequently and realised that this would be a much easier task if I stored the data somewhere.

s/somewhere/in Neo4j

Neo4j’s query language, Cypher, has a really nice ETL-esque tool called ‘LOAD CSV’ for loading in CSV files (as the name suggests!) so I added some code to save my words to disk:

with open("data/import/words.csv", "w") as words:
    writer = csv.writer(words, delimiter=",")
    writer.writerow(["EpisodeId", "Word", "Occurrences"])
    for episode_id, words in episodes_dict.iteritems():
        for word in words:
            writer.writerow([episode_id, word, words[word]])

This is what the CSV file contents look like:

$ head -n 10 data/import/words.csv
EpisodeId,Word,Occurrences
165,secondly,1
165,focus,1
165,baby,1
165,spiders,1
165,go,4
165,apartment,1
165,buddy,1
165,Exactly,1
165,young,1

Now we need to write some Cypher to get the data into Neo4j:

// words
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
// episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (episode:Episode {id: TOINT(row.EpisodeId)})
// words to episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MATCH (word:Word {value: row.Word})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (word)-[:USED_IN_EPISODE {times: TOINT(row.Occurrences) }]->(episode);

Having done that we can write some simple queries to explore the words used in How I met your mother:

MATCH (word:Word)-[r:USED_IN_EPISODE]->(episode) 
RETURN word.value, COUNT(episode) AS episodes, SUM(r.times) AS occurrences
ORDER BY occurrences DESC
LIMIT 10
 
==> +-------------------------------------+
==> | word.value | episodes | occurrences |
==> +-------------------------------------+
==> | "Ted"      | 207      | 11437       |
==> | "Barney"   | 208      | 8052        |
==> | "Marshall" | 208      | 7236        |
==> | "Robin"    | 205      | 6626        |
==> | "Lily"     | 207      | 6330        |
==> | "m"        | 208      | 4777        |
==> | "re"       | 208      | 4097        |
==> | "know"     | 208      | 3489        |
==> | "Oh"       | 197      | 3448        |
==> | "like"     | 208      | 2498        |
==> +-------------------------------------+
==> 10 rows

The main 5 characters occupy the top 5 positions which is probably what you’d expect. I’m not sure why ‘m’ and ‘re’ are in the next two position s – I expect that might be scraping gone wrong!

Our next query might focus around checking which character is referred to the post in each episode:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN episode.id, topWord.word AS word, topWord.times AS occurrences
LIMIT 10
 
==> +---------------------------------------+
==> | episode.id | word       | occurrences |
==> +---------------------------------------+
==> | 72         | "Barney"   | 75          |
==> | 143        | "Ted"      | 16          |
==> | 43         | "Lily"     | 74          |
==> | 156        | "Ted"      | 12          |
==> | 206        | "Barney"   | 23          |
==> | 50         | "Marshall" | 51          |
==> | 113        | "Ted"      | 76          |
==> | 178        | "Barney"   | 21          |
==> | 182        | "Barney"   | 22          |
==> | 67         | "Ted"      | 84          |
==> +---------------------------------------+
==> 10 rows

If we dig into it further there’s actually quite a bit of variety in the number of times the top character in each episode is mentioned which again probably says something about the data:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN MIN(topWord.times), MAX(topWord.times), AVG(topWord.times), STDEV(topWord.times)
 
==> +-------------------------------------------------------------------------------------+
==> | MIN(topWord.times) | MAX(topWord.times) | AVG(topWord.times) | STDEV(topWord.times) |
==> +-------------------------------------------------------------------------------------+
==> | 3                  | 259                | 63.90865384615385  | 42.36255207691068    |
==> +-------------------------------------------------------------------------------------+
==> 1 row

Obviously this is a very simple way of deriving structure from text, here are some of the things I want to try out next:

  • Detecting common phrases/memes/phrases used in the show (e.g. the yellow umbrella) – this should be possible by creating different length n-grams and then searching for those phrases across the corpus.
  • Pull out scenes – some of the transcripts use the keyword ‘scene’ to denote this although some of them don’t. Depending how many transcripts contain scene demarkations perhaps we could train a classifier to detect where scenes should be in the transcripts which don’t have scenes.
  • Analyse who talks to each other or who talks about each other most frequently
  • Create a graph of conversations as my colleagues Max and Michael have previously blogged about.

Written by Mark Needham

January 10th, 2015 at 1:22 am

Posted in neo4j,Python

Tagged with ,

Neo4j 2.1.6 – Cypher: FOREACH slowness

without comments

A common problem that people have when using Neo4j for social network applications is updating a person with their newly imported friends.

We’ll have an array of friends that we want to connect to a single Person node. Assuming the following schema…

$ schema
Indexes
  ON :Person(id) ONLINE
 
No constraints

…a simplified version would look like this:

WITH range (2,1002) AS friends
MERGE (p:Person {id: 1})
 
FOREACH(f IN friends |
  MERGE (friend:Person {id: f})
  MERGE (friend)-[:FRIENDS]->p);

If we execute that on an empty database we’ll see something like this:

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
19173 ms

This took much longer than we’d expect so let’s have a look at the PROFILE output:

EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager
      |
      +UpdateGraph(1)
        |
        +Extract
          |
          +Null
 
+----------------+------+---------+-------------+--------------------------------------+
|       Operator | Rows |  DbHits | Identifiers |                                Other |
+----------------+------+---------+-------------+--------------------------------------+
|    EmptyResult |    0 |       0 |             |                                      |
| UpdateGraph(0) |    1 | 3015012 |             |                              Foreach |
|          Eager |    1 |       0 |             |                                      |
| UpdateGraph(1) |    1 |       5 |        p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |       0 |             |                              friends |
|           Null |    ? |       ? |             |                                      |
+----------------+------+---------+-------------+--------------------------------------+

The DbHits value on the 2nd row seems suspiciously high suggesting that FOREACH might not be making use of the Person#id index and is instead scanning all Person nodes each time.

I’m not sure how to drill into that further but an alternative approach is to try out the same query but using UNWIND instead and checking the profile output of that:

WITH range (2,1002) AS friends
MERGE (p:Person {id: 1})
WITH p, friends
UNWIND friends AS f
MERGE (friend:Person {id: f})
MERGE (friend)-[:FRIENDS]->p;
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
343 ms
EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager(0)
      |
      +UpdateGraph(1)
        |
        +UNWIND
          |
          +Eager(1)
            |
            +UpdateGraph(2)
              |
              +Extract
                |
                +Null
 
+----------------+------+--------+-------------------------+--------------------------------------+
|       Operator | Rows | DbHits |             Identifiers |                                Other |
+----------------+------+--------+-------------------------+--------------------------------------+
|    EmptyResult |    0 |      0 |                         |                                      |
| UpdateGraph(0) | 1001 |      0 | friend, p,   UNNAMED136 |                         MergePattern |
|       Eager(0) | 1001 |      0 |                         |                                      |
| UpdateGraph(1) | 1001 |   5005 |          friend, friend |            MergeNode; f; :Person(id) |
|         UNWIND | 1001 |      0 |                         |                                      |
|       Eager(1) |    1 |      0 |                         |                                      |
| UpdateGraph(2) |    1 |      5 |                    p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |      0 |                         |                              friends |
|           Null |    ? |      ? |                         |                                      |
+----------------+------+--------+-------------------------+--------------------------------------+

That’s much quicker and doesn’t touch as many nodes as FOREACH was. I expect the index issue will be sorted out in future but until then UNWIND is our friend.

Written by Mark Needham

December 28th, 2014 at 4:28 am

Posted in neo4j

Tagged with ,

Neo4j: Cypher – Avoiding the Eager

without comments

Although I love how easy Cypher’s LOAD CSV command makes it to get data into Neo4j, it currently breaks the rule of least surprise in the way it eagerly loads in all rows for some queries even those using periodic commit.

Neverwhere

Beware of the eager pipe

This is something that my colleague Michael noted in the second of his blog posts explaining how to use LOAD CSV successfully:

The biggest issue that people ran into, even when following the advice I gave earlier, was that for large imports of more than one million rows, Cypher ran into an out-of-memory situation.

That was not related to commit sizes, so it happened even with PERIODIC COMMIT of small batches.

I recently spent a few days importing data into Neo4j on a Windows machine with 4GB RAM so I was seeing this problem even earlier than Michael suggested.

Michael explains how to work out whether your query is suffering from unexpected eager evaluation:

If you profile that query you see that there is an “Eager” step in the query plan.

That is where the “pull in all data” happens.

You can profile queries by prefixing the word ‘PROFILE’. You’ll need to run your query in the console of /webadmin in your web browser or with the Neo4j shell.

I did this for my queries and was able to identify query patterns which get evaluated eagerly and in some cases we can work around it.

We’ll use the Northwind data set to demonstrate how the Eager pipe can sneak into our queries but keep in mind that this data set is sufficiently small to not cause issues.

This is what a row in the file looks like:

$ head -n 2 data/customerDb.csv
OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry,CustomerID,CustomerCompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,EmployeeID,LastName,FirstName,Title,TitleOfCourtesy,BirthDate,HireDate,Address,City,Region,PostalCode,Country,HomePhone,Extension,Photo,Notes,ReportsTo,PhotoPath,OrderID,ProductID,UnitPrice,Quantity,Discount,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued,SupplierID,SupplierCompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,HomePage,CategoryID,CategoryName,Description,Picture
10248,VINET,5,1996-07-04,1996-08-01,1996-07-16,3,32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France,VINET,Vins et alcools Chevalier,Paul Henriot,Accounting Manager,59 rue de l'Abbaye,Reims,,51100,France,26.47.15.10,26.47.15.11,5,Buchanan,Steven,Sales Manager,Mr.,1955-03-04,1993-10-17,14 Garrett Hill,London,,SW1 8JR,UK,(71) 555-4848,3453,\x,"Steven Buchanan graduated from St. Andrews University, Scotland, with a BSC degree in 1976.  Upon joining the company as a sales representative in 1992, he spent 6 months in an orientation program at the Seattle office and then returned to his permanent post in London.  He was promoted to sales manager in March 1993.  Mr. Buchanan has completed the courses ""Successful Telemarketing"" and ""International Sales Management.""  He is fluent in French.",2,http://accweb/emmployees/buchanan.bmp,10248,11,14,12,0,11,Queso Cabrales,5,4,1 kg pkg.,21,22,30,30,0,5,Cooperativa de Quesos 'Las Cabras',Antonio del Valle Saavedra,Export Administrator,Calle del Rosal 4,Oviedo,Asturias,33007,Spain,(98) 598 76 54,,,4,Dairy Products,Cheeses,\x

MERGE, MERGE, MERGE

The first thing we want to do is create a node for each employee and each order and then create a relationship between them.

We might start with the following query:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)

This does the job but if we profile the query like so…

PROFILE LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
WITH row LIMIT 0
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)

…we’ll notice an ‘Eager’ lurking on the third line:

==> +----------------+------+--------+----------------------------------+-----------------------------------------+
==> |       Operator | Rows | DbHits |                      Identifiers |                                   Other |
==> +----------------+------+--------+----------------------------------+-----------------------------------------+
==> |    EmptyResult |    0 |      0 |                                  |                                         |
==> | UpdateGraph(0) |    0 |      0 |    employee, order,   UNNAMED216 |                            MergePattern |
==> |          Eager |    0 |      0 |                                  |                                         |
==> | UpdateGraph(1) |    0 |      0 | employee, employee, order, order | MergeNode; :Employee; MergeNode; :Order |
==> |          Slice |    0 |      0 |                                  |                            {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                              row |                                         |
==> +----------------+------+--------+----------------------------------+-----------------------------------------+

You’ll notice that when we profile each query we’re stripping off the periodic commit section and adding a ‘WITH row LIMIT 0′. This allows us to generate enough of the query plan to identify the ‘Eager’ operator without actually importing any data.

We want to split that query into two so it can be processed in a non eager manner:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
WITH row LIMIT 0
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
==> |    Operator | Rows | DbHits |                      Identifiers |                                   Other |
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
==> | EmptyResult |    0 |      0 |                                  |                                         |
==> | UpdateGraph |    0 |      0 | employee, employee, order, order | MergeNode; :Employee; MergeNode; :Order |
==> |       Slice |    0 |      0 |                                  |                            {  AUTOINT0} |
==> |     LoadCSV |    1 |      0 |                              row |                                         |
==> +-------------+------+--------+----------------------------------+-----------------------------------------+

Now that we’ve created the employees and orders we can join them together:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |       Operator | Rows | DbHits |                   Identifiers |                                                     Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                               |                                                           |
==> |    UpdateGraph |    0 |      0 | employee, order,   UNNAMED216 |                                              MergePattern |
==> |      Filter(0) |    0 |      0 |                               |          Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(0) |    0 |      0 |                  order, order |                                                    :Order |
==> |      Filter(1) |    0 |      0 |                               | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(1) |    0 |      0 |            employee, employee |                                                 :Employee |
==> |          Slice |    0 |      0 |                               |                                              {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                           row |                                                           |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+

Not an Eager in sight!

MATCH, MATCH, MATCH, MERGE, MERGE

If we fast forward a few steps we may now have refactored our import script to the point where we create our nodes in one query and the relationships in another query.

Our create query works as expected:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (product:Product {productId: row.ProductID})
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
==> |    Operator | Rows | DbHits |                                        Identifiers |                                                        Other |
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
==> | EmptyResult |    0 |      0 |                                                    |                                                              |
==> | UpdateGraph |    0 |      0 | employee, employee, order, order, product, product | MergeNode; :Employee; MergeNode; :Order; MergeNode; :Product |
==> |       Slice |    0 |      0 |                                                    |                                                 {  AUTOINT0} |
==> |     LoadCSV |    1 |      0 |                                                row |                                                              |
==> +-------------+------+--------+----------------------------------------------------+------------------------------------------------------------

We’ve now got employees, products and orders in the graph. Now let’s create relationships between the trio:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MATCH (product:Product {productId: row.ProductID})
MERGE (employee)-[:SOLD]->(order)
MERGE (order)-[:PRODUCT]->(product)

If we profile that we’ll notice Eager has sneaked in again!

==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |       Operator | Rows | DbHits |                   Identifiers |                                                     Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                               |                                                           |
==> | UpdateGraph(0) |    0 |      0 |  order, product,   UNNAMED318 |                                              MergePattern |
==> |          Eager |    0 |      0 |                               |                                                           |
==> | UpdateGraph(1) |    0 |      0 | employee, order,   UNNAMED287 |                                              MergePattern |
==> |      Filter(0) |    0 |      0 |                               |    Property(product,productId) == Property(row,ProductID) |
==> | NodeByLabel(0) |    0 |      0 |              product, product |                                                  :Product |
==> |      Filter(1) |    0 |      0 |                               |          Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(1) |    0 |      0 |                  order, order |                                                    :Order |
==> |      Filter(2) |    0 |      0 |                               | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(2) |    0 |      0 |            employee, employee |                                                 :Employee |
==> |          Slice |    0 |      0 |                               |                                              {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                           row |                                                           |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+

In this case the Eager happens on our second call to MERGE as Michael identified in his post:

The issue is that within a single Cypher statement you have to isolate changes that affect matches further on, e.g. when you CREATE nodes with a label that are suddenly matched by a later MATCH or MERGE operation.

We can work around the problem in this case by having separate queries to create the relationships:

LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |       Operator | Rows | DbHits |                   Identifiers |                                                     Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                               |                                                           |
==> |    UpdateGraph |    0 |      0 | employee, order,   UNNAMED236 |                                              MergePattern |
==> |      Filter(0) |    0 |      0 |                               |          Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(0) |    0 |      0 |                  order, order |                                                    :Order |
==> |      Filter(1) |    0 |      0 |                               | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(1) |    0 |      0 |            employee, employee |                                                 :Employee |
==> |          Slice |    0 |      0 |                               |                                              {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                           row |                                                           |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (order:Order {orderId: row.OrderID})
MATCH (product:Product {productId: row.ProductID})
MERGE (order)-[:PRODUCT]->(product)
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
==> |       Operator | Rows | DbHits |                  Identifiers |                                                  Other |
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                              |                                                        |
==> |    UpdateGraph |    0 |      0 | order, product,   UNNAMED229 |                                           MergePattern |
==> |      Filter(0) |    0 |      0 |                              | Property(product,productId) == Property(row,ProductID) |
==> | NodeByLabel(0) |    0 |      0 |             product, product |                                               :Product |
==> |      Filter(1) |    0 |      0 |                              |       Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(1) |    0 |      0 |                 order, order |                                                 :Order |
==> |          Slice |    0 |      0 |                              |                                           {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                          row |                                                        |
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+

MERGE, SET

I try to make LOAD CSV scripts as idempotent as possible so that if we add more rows or columns of data to our CSV we can rerun the query without having to recreate everything.

This can lead you towards the following pattern where we’re creating suppliers:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID})
SET supplier.companyName = row.SupplierCompanyName

We want to ensure that there’s only one Supplier with that SupplierID but we might be incrementally adding new properties and decide to just replace everything by using the ‘SET’ command. If we profile that query, the Eager lurks:

==> +----------------+------+--------+--------------------+----------------------+
==> |       Operator | Rows | DbHits |        Identifiers |                Other |
==> +----------------+------+--------+--------------------+----------------------+
==> |    EmptyResult |    0 |      0 |                    |                      |
==> | UpdateGraph(0) |    0 |      0 |                    |          PropertySet |
==> |          Eager |    0 |      0 |                    |                      |
==> | UpdateGraph(1) |    0 |      0 | supplier, supplier | MergeNode; :Supplier |
==> |          Slice |    0 |      0 |                    |         {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                row |                      |
==> +----------------+------+--------+--------------------+----------------------+

We can work around this at the cost of a bit of duplication using ‘ON CREATE SET’ and ‘ON MATCH SET':

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID})
ON CREATE SET supplier.companyName = row.SupplierCompanyName
ON MATCH SET supplier.companyName = row.SupplierCompanyName
==> +-------------+------+--------+--------------------+----------------------+
==> |    Operator | Rows | DbHits |        Identifiers |                Other |
==> +-------------+------+--------+--------------------+----------------------+
==> | EmptyResult |    0 |      0 |                    |                      |
==> | UpdateGraph |    0 |      0 | supplier, supplier | MergeNode; :Supplier |
==> |       Slice |    0 |      0 |                    |         {  AUTOINT0} |
==> |     LoadCSV |    1 |      0 |                row |                      |
==> +-------------+------+--------+--------------------+----------------------+

With the data set I’ve been working with I was able to avoid OutOfMemory exceptions in some cases and reduce the amount of time it took to run the query by a factor of 3 in others.

As time goes on I expect all of these scenarios will be addressed but as of Neo4j 2.1.5 these are the patterns that I’ve identified as being overly eager.

If you know of any others do let me know and I can add them to the post or write a second part.

Written by Mark Needham

October 23rd, 2014 at 5:56 am

Posted in neo4j

Tagged with

Neo4j: Modelling sub types

without comments

A question which sometimes comes up when discussing graph data modelling is how you go about modelling sub/super types.

In my experience there are two reasons why we might want to do this:

  • To ensure that certain properties exist on bits of data
  • To write drill down queries based on those types

At the moment the former isn’t built into Neo4j and you’d only be able to achieve it by wiring up some code in a pre commit hook of a transaction event handler so we’ll focus on the latter.

The typical example used for showing how to design sub types is the animal kingdom and I managed to find a data set from Louiseville, Kentucky’s Animal Services which we can use.

In this case the sub types are used to represent the type of animal, breed group and breed. We then also have ‘real data’ in terms of actual dogs under the care of animal services.

We effectively end up with two graphs in one – a model and a meta model:

2014 10 20 22 32 31

The cypher query to create this graph looks like this:

LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-subtypes/data/dogs.csv" AS line
MERGE (animalType:AnimalType {name: "Dog"})
MERGE (breedGroup:BreedGroup {name: line.BreedGroup})
MERGE (breed:Breed {name: line.PrimaryBreed})
MERGE (animal:Animal {id: line.TagIdentity, primaryColour: line.PrimaryColour, size: line.Size})
 
MERGE (animalType)<-[:PARENT]-(breedGroup)
MERGE (breedGroup)<-[:PARENT]-(breed)
MERGE (breed)<-[:PARENT]-(animal)

We could then write a simple query to find out how many dogs we have:

MATCH (animalType:AnimalType)<-[:PARENT*]-(animal)
RETURN animalType, COUNT(*) AS animals
ORDER BY animals DESC
==> +--------------------------------+
==> | animalType           | animals |
==> +--------------------------------+
==> | Node[89]{name:"Dog"} | 131     |
==> +--------------------------------+
==> 1 row

Or we could write a slightly more complex query to find the number of animals at each level of our type hierarchy:

MATCH path = (animalType:AnimalType)<-[:PARENT]-(breedGroup)<-[:PARENT*]-(animal)
RETURN [node IN nodes(path) | node.name][..-1] AS breed, COUNT(*) AS animals
ORDER BY animals DESC
LIMIT 5
==> +-----------------------------------------------------+
==> | breed                                     | animals |
==> +-----------------------------------------------------+
==> | ["Dog","SETTER/RETRIEVE","LABRADOR RETR"] | 15      |
==> | ["Dog","SETTER/RETRIEVE","GOLDEN RETR"]   | 13      |
==> | ["Dog","POODLE","POODLE MIN"]             | 10      |
==> | ["Dog","TERRIER","MIN PINSCHER"]          | 9       |
==> | ["Dog","SHEPHERD","WELSH CORGI CAR"]      | 6       |
==> +-----------------------------------------------------+
==> 5 rows

We might then decide to add an exercise sub graph which indicates how much exercise each type of dog requires:

MATCH (breedGroup:BreedGroup)
WHERE breedGroup.name IN ["SETTER/RETRIEVE", "POODLE"]
MERGE (exercise:Exercise {type: "2 hours hard exercise"})
MERGE (exercise)<-[:REQUIRES_EXERCISE]-(breedGroup);
MATCH (breedGroup:BreedGroup)
WHERE breedGroup.name IN ["TERRIER", "SHEPHERD"]
MERGE (exercise:Exercise {type: "1 hour gentle exercise"})
MERGE (exercise)<-[:REQUIRES_EXERCISE]-(breedGroup);

We could then query that to find out which dogs need to come out for 2 hours of hard exercise:

MATCH (exercise:Exercise {type: "2 hours hard exercise"})<-[:REQUIRES_EXERCISE]-()<-[:PARENT*]-(dog)
WHERE NOT (dog)<-[:PARENT]-()
RETURN dog
LIMIT 10
==> +-----------------------------------------------------------+
==> | dog                                                       |
==> +-----------------------------------------------------------+
==> | Node[541]{id:"664427",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[542]{id:"543787",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[543]{id:"584021",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[544]{id:"584022",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[545]{id:"664430",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[546]{id:"535176",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[567]{id:"613557",primaryColour:"WHITE",size:"SMALL"} |
==> | Node[568]{id:"531376",primaryColour:"WHITE",size:"SMALL"} |
==> | Node[569]{id:"613567",primaryColour:"WHITE",size:"SMALL"} |
==> | Node[570]{id:"531379",primaryColour:"WHITE",size:"SMALL"} |
==> +-----------------------------------------------------------+
==> 10 rows

In this query we ensured that we only returned dogs rather than breeds by checking that there was no incoming PARENT relationship. Alternatively we could have filtered on the Animal label…

MATCH (exercise:Exercise {type: "2 hours hard exercise"})<-[:REQUIRES_EXERCISE]-()<-[:PARENT*]-(dog:Animal)
RETURN dog
LIMIT 10

or if we wanted to only take the dogs out for exercise perhaps we’d have Dog label on the appropriate nodes.

People are often curious why labels don’t have super/sub types between them but I tend to use labels for simple categorisation – anything more complicated and we may as well use the built in power of the graph model!

The code is on github should you wish to play with it.

Written by Mark Needham

October 20th, 2014 at 11:08 pm

Posted in neo4j

Tagged with

Lessons from running Neo4j based ‘hackathons’

without comments

Over the last 6 months my colleagues and I have been running hands on Neo4j based sessions every few weeks and I was recently asked if I could write up the lessons we’ve learned.

So in no particular order here are some of the things that we’ve learnt:

Have a plan but don’t stick to it rigidly

Something we learnt early on is that it’s helpful to have a rough plan of how you’re going to spend the session otherwise it can feel quite chaotic for attendees.

We show people that plan at the beginning of the session so that they know what to expect and can plan their time accordingly if the second part doesn’t interest them as much.

Having said that, we’ve often gone off on a tangent and since people have been very interested in that we’ve just gone with it.

This sometimes means that you don’t cover everything you had in mind but the main thing is that people enjoy themselves so it’s nothing to worry about.

Prepare for people to be unprepared

We try to set expectations in advanced of the sessions with respect to what people should prepare or have installed on their machines but despite that you’ll have people in varying levels of readiness.

Having noticed this trend over a few months we now allot time in the schedule for getting people up and running and if we’re really struggling then we’ll ask people to pair with each other.

There will also be experience level differences so we always factor in some time to go over the basics for those who are new. We also encourage experienced people to help the others out – after all you only really know if you know something when you try to teach someone else!

Don’t try to do too much

Our first ‘hackathon’-esque event involved an attempt to build a Java application based on a British Library dataset.

I thought we’d be able to model the data set, import it and then wire up some queries to an application in a few hours.

This proved to be ever so slightly ambitious!

It took much longer than anticipated to do those first two steps and we didn’t get to build any of the application – teaching people how to model in a graph is probably a session in its own right.

Show the end state

Feedback we got from attendees to the first few versions was that they’d like to see what the end state should have looked like if they’d completed everything.

In our Clojure Hackathon Rohit got the furthest so we shared his code with everyone afterwards.

An even better approach is to have the final solution ready in advance and have it checked in on a different branch that you can point people at afterwards.

Show the intermediate states

Another thing we noticed was that if people got behind in the first part of the session then they’d never be able to catch up.

Nigel therefore came up with the idea of snapshotting intermediate states so that people could reset themselves after each part of the session. This is something that the Polymer tutorial does as well.

We worked out that we have two solid one hour slots before people start getting distracted by their journey home so we came up with two distinct types of tasks for people to do and then created a branch with the solution at the end of those tasks.

No doubt there will be more lessons to come as we run more sessions but this is where we are at the moment. If you fancy joining in our next session is Java based in a couple of weeks time.

Finally, if you want to see a really slick hands on meetup then you’ll want to head over to the London Clojure DojoBruce Durling has even written up some tips on how you run one yourself.

Written by Mark Needham

October 11th, 2014 at 10:52 am

Posted in neo4j

Tagged with

Neo4j: Generic/Vague relationship names

without comments

An approach to modelling that I often see while working with Neo4j users is creating very generic relationships (e.g. HAS, CONTAINS, IS) and filtering on a relationship property or on a property/label at the end node.

Intuitively this doesn’t seem to make best use of the graph model as it means that you have to evaluate many relationships and nodes that you’re not interested in.

However, I’ve never actually tested the performance differences between the approaches so I thought I’d try it out.

I created 4 different databases which had one node with 60,000 outgoing relationships – 10,000 which we wanted to retrieve and 50,000 that were irrelevant.

I modelled the ‘relationship’ in 4 different ways…

  • Using a specific relationship type
    (node)-[:HAS_ADDRESS]->(address)
  • Using a generic relationship type and then filtering by end node label
    (node)-[:HAS]->(address:Address)
  • Using a generic relationship type and then filtering by relationship property
    (node)-[:HAS {type: “address”}]->(address)
  • Using a generic relationship type and then filtering by end node property
    (node)-[:HAS]->(address {type: “address”})

…and then measured how long it took to retrieve the ‘has address’ relationships.

The code is on github if you want to take a look.

Although it’s obviously not as precise as a JMH micro benchmark I think it’s good enough to get a feel for the difference between the approaches.

I ran a query against each database 100 times and then took the 50th, 75th and 99th percentiles (times are in ms):

Using a generic relationship type and then filtering by end node label
50%ile: 6.0    75%ile: 6.0    99%ile: 402.60999999999825
 
Using a generic relationship type and then filtering by relationship property
50%ile: 21.0   75%ile: 22.0   99%ile: 504.85999999999785
 
Using a generic relationship type and then filtering by end node label
50%ile: 4.0    75%ile: 4.0    99%ile: 145.65999999999931
 
Using a specific relationship type
50%ile: 0.0    75%ile: 1.0    99%ile: 25.749999999999872

We can drill further into why there’s a difference in the times for each of the approaches by profiling the equivalent cypher query. We’ll start with the one which uses a specific relationship name

Using a specific relationship type

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS_ADDRESS]->() return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +SimplePatternMatcher
      |
      +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+-----------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                 Other |
+----------------------+-------+--------+-----------------------------+-----------------------+
|         ColumnFilter |     1 |      0 |                             | keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                       |
| SimplePatternMatcher | 10000 |  10000 | n,   UNNAMED53,   UNNAMED35 |                       |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |          {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+-----------------------+
 
Total database accesses: 10001

Here we can see that there were 10,002 database accesses in order to get a count of our 10,000 HAS_ADDRESS relationships. We get a database access each time we load a node, relationship or property.

By contrast the other approaches have to load in a lot more data only to then filter it out:

Using a generic relationship type and then filtering by end node label

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->(:Address) return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+----------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                            Other |
+----------------------+-------+--------+-----------------------------+----------------------------------+
|         ColumnFilter |     1 |      0 |                             |            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                  |
|               Filter | 10000 |  10000 |                             | hasLabel(  UNNAMED45:Address(0)) |
| SimplePatternMatcher | 10000 |  60000 | n,   UNNAMED45,   UNNAMED35 |                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+----------------------------------+
 
Total database accesses: 70001

Using a generic relationship type and then filtering by relationship property

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS {type: "address"}]->() return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED35,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED63,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
 
Total database accesses: 140001

Using a generic relationship type and then filtering by end node property

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->({type: "address"}) return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED45,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED45,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
 
Total database accesses: 140001

So in summary…specific relationships #ftw!

Written by Mark Needham

September 30th, 2014 at 4:47 pm

Posted in neo4j

Tagged with

Neo4j: COLLECTing multiple values (Too many parameters for function ‘collect’)

without comments

One of my favourite functions in Neo4j’s cypher query language is COLLECT which allows us to group items into an array for later consumption.

However, I’ve noticed that people sometimes have trouble working out how to collect multiple items with COLLECT and struggle to find a way to do so.

Consider the following data set:

create (p:Person {name: "Mark"})
create (e1:Event {name: "Event1", timestamp: 1234})
create (e2:Event {name: "Event2", timestamp: 4567})
 
create (p)-[:EVENT]->(e1)
create (p)-[:EVENT]->(e2)

If we wanted to return each person along with a collection of the event names they’d participated in we could write the following:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT(e.name);
+--------------------------------------------+
| p                    | COLLECT(e.name)     |
+--------------------------------------------+
| Node[0]{name:"Mark"} | ["Event1","Event2"] |
+--------------------------------------------+
1 row

That works nicely, but what about if we want to collect the event name and the timestamp but don’t want to return the entire event node?

An approach I’ve seen a few people try during workshops is the following:

MATCH (p:Person)-[:EVENT]->(e)
RETURN p, COLLECT(e.name, e.timestamp)

Unfortunately this doesn’t compile:

SyntaxException: Too many parameters for function 'collect' (line 2, column 11)
"RETURN p, COLLECT(e.name, e.timestamp)"
           ^

As the error message suggests, the COLLECT function only takes one argument so we need to find another way to solve our problem.

One way is to put the two values into a literal array which will result in an array of arrays as our return result:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT([e.name, e.timestamp]);
+----------------------------------------------------------+
| p                    | COLLECT([e.name, e.timestamp])    |
+----------------------------------------------------------+
| Node[0]{name:"Mark"} | [["Event1",1234],["Event2",4567]] |
+----------------------------------------------------------+
1 row

The annoying thing about this approach is that as you add more items you’ll forget in which position you’ve put each bit of data so I think a preferable approach is to collect a map of items instead:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT({eventName: e.name, eventTimestamp: e.timestamp});
+--------------------------------------------------------------------------------------------------------------------------+
| p                    | COLLECT({eventName: e.name, eventTimestamp: e.timestamp})                                         |
+--------------------------------------------------------------------------------------------------------------------------+
| Node[0]{name:"Mark"} | [{eventName -> "Event1", eventTimestamp -> 1234},{eventName -> "Event2", eventTimestamp -> 4567}] |
+--------------------------------------------------------------------------------------------------------------------------+
1 row

During the Clojure Neo4j Hackathon that we ran earlier this week this proved to be a particularly pleasing approach as we could easily destructure the collection of maps in our Clojure code.

Written by Mark Needham

September 26th, 2014 at 8:46 pm

Posted in neo4j

Tagged with

Neo4j: LOAD CSV – Column is null

with 3 comments

One problem I’ve seen a few people have recently when using Neo4j’s LOAD CSV function is dealing with CSV files that have dodgy hidden characters at the beginning of the header line.

For example, consider an import of this CSV file:

$ cat ~/Downloads/dodgy.csv
userId,movieId
1,2

We might start by checking which columns it has:

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line;
+----------------------------------+
| line                             |
+----------------------------------+
| {userId -> "1", movieId -> "2"} |
+----------------------------------+
1 row

Looks good so far but what about if we try to return just ‘userId’?

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.userId;
+-------------+
| line.userId |
+-------------+
| <null>      |
+-------------+
1 row

Hmmm it’s null…what about ‘movieId’?

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.movieId;
+--------------+
| line.movieId |
+--------------+
| "2"          |
+--------------+
1 row

That works fine so immediately we can suspect there are hidden characters at the beginning of the first line of the file.

The easiest way to check if this is the case is open the file using a Hex Editor – I quite like Hex Fiend for the Mac.

If we look at dodgy.csv we’ll see the following:

2014 09 24 21 20 06

Let’s delete the highlighted characters and try our cypher query again:

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.userId;
+-------------+
| line.userId |
+-------------+
| "1"         |
+-------------+
1 row

All is well again, but something to keep in mind if you see a LOAD CSV near you behaving badly.

Written by Mark Needham

September 24th, 2014 at 8:21 pm

Posted in neo4j

Tagged with

Neo4j: LOAD CSV – Handling empty columns

with one comment

A common problem that people encounter when trying to import CSV files into Neo4j using Cypher’s LOAD CSV command is how to handle empty or ‘null’ entries in said files.

For example let’s try and import the following file which has 3 columns, 1 populated, 2 empty:

$ cat /tmp/foo.csv
a,b,c
mark,,
load csv with headers from "file:/tmp/foo.csv" as row
MERGE (p:Person {a: row.a})
SET p.b = row.b, p.c = row.c
RETURN p

When we execute that query we’ll see that our Person node has properties ‘b’ and ‘c’ with no value:

==> +-----------------------------+
==> | p                           |
==> +-----------------------------+
==> | Node[5]{a:"mark",b:"",c:""} |
==> +-----------------------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 3
==> Labels added: 1
==> 26 ms

That isn’t what we want – we don’t want those properties to be set unless they have a value.

TO achieve this we need to introduce a conditional when setting the ‘b’ and ‘c’ properties. We’ll assume that ‘a’ is always present as that’s the key for our Person nodes.

The following query will do what we want:

load csv with headers from "file:/tmp/foo.csv" as row
MERGE (p:Person {a: row.a})
FOREACH(ignoreMe IN CASE WHEN trim(row.b) <> "" THEN [1] ELSE [] END | SET p.b = row.b)
FOREACH(ignoreMe IN CASE WHEN trim(row.c) <> "" THEN [1] ELSE [] END | SET p.c = row.c)
RETURN p

Since there’s no if or else statements in cypher we create our own conditional statement by using FOREACH. If there’s a value in the CSV column then we’ll loop once and set the property and if not we won’t loop at all and therefore no property will be set.

==> +-------------------+
==> | p                 |
==> +-------------------+
==> | Node[4]{a:"mark"} |
==> +-------------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 1
==> Labels added: 1

Written by Mark Needham

August 22nd, 2014 at 12:51 pm

Posted in neo4j

Tagged with

Neo4j 2.1.2: Finding where I am in a linked list

with one comment

I was recently asked how to calculate the position of a node in a linked list and realised that as the list increases in size this is one of the occasions when we should write an unmanaged extension rather than using cypher.

I wrote a quick bit of code to create a linked list with 10,000 elements in it:

public class Chains 
{
    public static void main(String[] args)
    {
        String simpleChains = "/tmp/longchains";
        populate( simpleChains, 10000 );
    }
 
    private static void populate( String path, int chainSize )
    {
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
        try(Transaction tx = db.beginTx()) {
            Node currentNode = null;
            for ( int i = 0; i < chainSize; i++ )
            {
                Node node = db.createNode();
 
                if(currentNode != null) {
                    currentNode.createRelationshipTo( node, NEXT );
                }
                currentNode = node;
            }
            tx.success();
        }
 
 
        db.shutdown();
    }
}

To find our distance from the end of the linked list we could write the following cypher query:

match n  where id(n) = {nodeId}  with n 
match path = (n)-[:NEXT*]->() 
RETURN id(n) AS nodeId, length(path) AS length 
ORDER BY length DESC 
LIMIT 1;

For simplicity we’re finding a node by it’s internal node id and then finding the ‘NEXT’ relationships going out from this node recursively. We then filter the results so that we only get the longest path back which will be our distance to the end of the list.

I noticed that this query would sometimes take 10s of seconds so I wrote a version using the Java Traversal API to see whether I could get it any quicker.

This is the Java version:

try(Transaction tx = db.beginTx()) {
    Node startNode = db.getNodeById( nodeId );
    TraversalDescription traversal = db.traversalDescription();
    Traverser traverse = traversal
            .depthFirst()
            .relationships( NEXT, Direction.OUTGOING )
            .sort( new Comparator<Path>()
            {
                @Override
                public int compare( Path o1, Path o2 )
                {
                    return Integer.valueOf( o2.length() ).compareTo( o1 .length() );
                }
            } )
            .traverse( startNode );
 
    Collection<Path> paths = IteratorUtil.asCollection( traverse );
 
    int maxLength = traverse.iterator().next().length();
    System.out.print( maxLength );
 
    tx.failure();
}

This is a bit more verbose than the cypher version but computes the same result. We’ve sorted the paths by length using a comparator to ensure we get the longest path back first.

I created a little program to warm up the caches and kick off a few iterations where I queried from different nodes and returned the length and time taken. These were the results:

--------
(Traversal API) Node:    1, Length: 9998, Time (ms):  15
       (Cypher) Node:    1, Length: 9998, Time (ms): 26225
(Traversal API) Node:  456, Length: 9543, Time (ms):  10
       (Cypher) Node:  456, Length: 9543, Time (ms): 24881
(Traversal API) Node:  761, Length: 9238, Time (ms):   9
       (Cypher) Node:  761, Length: 9238, Time (ms): 9941
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   9
       (Cypher) Node:    1, Length: 9998, Time (ms): 12537
(Traversal API) Node:  456, Length: 9543, Time (ms):   8
       (Cypher) Node:  456, Length: 9543, Time (ms): 15690
(Traversal API) Node:  761, Length: 9238, Time (ms):   7
       (Cypher) Node:  761, Length: 9238, Time (ms): 9202
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   8
       (Cypher) Node:    1, Length: 9998, Time (ms): 11905
(Traversal API) Node:  456, Length: 9543, Time (ms):   7
       (Cypher) Node:  456, Length: 9543, Time (ms): 22296
(Traversal API) Node:  761, Length: 9238, Time (ms):   8
       (Cypher) Node:  761, Length: 9238, Time (ms): 8739
--------

Interestingly when I reduced the size of the linked list to 1000 the difference wasn’t so pronounced:

--------
(Traversal API) Node:    1, Length: 998, Time (ms):   5
       (Cypher) Node:    1, Length: 998, Time (ms): 174
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  71
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  13
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   2
       (Cypher) Node:    1, Length: 998, Time (ms): 111
(Traversal API) Node:  456, Length: 543, Time (ms):   1
       (Cypher) Node:  456, Length: 543, Time (ms):  40
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   3
       (Cypher) Node:    1, Length: 998, Time (ms): 129
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  48
(Traversal API) Node:  761, Length: 238, Time (ms):   0
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------

which is good news as most linked lists that we’ll create will be in the 10s – 100s range rather than 10,000 which was what I was faced with.

I’m sure cypher will reach parity for this type of query in future which will be great as I like writing cypher much more than I do Java. For now though it’s good to know we have a backup option to call on when necessary.

The code is available as a gist if you want to play around with it further.

Written by Mark Needham

July 20th, 2014 at 3:13 pm

Posted in neo4j

Tagged with