Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Python’s pandas vs Neo4j’s cypher: Exploring popular phrases in How I met your mother transcripts


I’ve previously written about extracting TF/IDF scores for phrases in documents using scikit-learn and the final step in that post involved writing the words into a CSV file for analysis later on.

I wasn’t sure what the most appropriate tool for that analysis was, so I decided to explore the data using Python’s pandas library and also to load it into Neo4j and write some Cypher queries.

To do anything with Neo4j we need to first load the CSV file into the database. The easiest way to do that is with Cypher’s LOAD CSV command.

First we’ll load the phrases in and then we’ll connect them to the episodes which were previously loaded:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (phrase)-[:USED_IN_EPISODE {tfidfScore: TOFLOAT(row.Score)}]->(episode);

Now we’re ready to start writing some queries. To start with we’ll write a simple query to find the top 3 phrases for each episode.

In pandas this is quite easy – we just need to group by the appropriate field and then take the top 3 records in that grouping:

top_words_by_episode = df \
    .sort(["EpisodeId", "Score"], ascending = [True, False]) \
    .groupby(["EpisodeId"], sort = False) \
    .head(3)
 
>>> print(top_words_by_episode.to_string())
 
        EpisodeId              Phrase     Score
3976            1                 ted  0.262518
2912            1              olives  0.195714
2441            1            marshall  0.155515
8143            2                 ted  0.292184
5197            2              carlos  0.227454
7482            2               robin  0.195150
12551           3                 ted  0.232662
9040            3              barney  0.187255
11254           3              mcneil  0.170619
15641           4             natalie  0.562485
16763           4                 ted  0.191873
16234           4               robin  0.102671
20715           5            subtitle  0.310866
18121           5          coat check  0.181682
20861           5                 ted  0.169973
...

The cypher version looks quite similar, the main difference being that we use COLLECT to generate an array of phrases for each episode and then take the top 3:

MATCH (e:Episode)<-[rel:USED_IN_EPISODE]-(phrase)
WITH e, rel, phrase
ORDER BY e.id, rel.tfidfScore DESC
RETURN e.id, e.title, COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]
ORDER BY e.id
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | e.id | e.title                                     | COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]                                                                                                               |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | 1    | "Pilot"                                     | [{phrase -> "ted", score -> 0.2625177493269755},{phrase -> "olives", score -> 0.19571419072701732},{phrase -> "marshall", score -> 0.15551468983363487}]                  |
==> | 2    | "Purple Giraffe"                            | [{phrase -> "ted", score -> 0.292184496766088},{phrase -> "carlos", score -> 0.22745438090499026},{phrase -> "robin", score -> 0.19514993122773566}]                      |
==> | 3    | "Sweet Taste of Liberty"                    | [{phrase -> "ted", score -> 0.23266190616714866},{phrase -> "barney", score -> 0.18725456678444408},{phrase -> "officer mcneil", score -> 0.17061872221616137}]           |
==> | 4    | "Return of the Shirt"                       | [{phrase -> "natalie", score -> 0.5624848345525686},{phrase -> "ted", score -> 0.19187323894701674},{phrase -> "robin", score -> 0.10267067360622682}]                    |
==> | 5    | "Okay Awesome"                              | [{phrase -> "subtitle", score -> 0.310865508347106},{phrase -> "coat check", score -> 0.18168178787561182},{phrase -> "ted", score -> 0.16997258596683185}]               |
==> | 6    | "Slutty Pumpkin"                            | [{phrase -> "mike", score -> 0.2966610054610693},{phrase -> "ted", score -> 0.19333276951599407},{phrase -> "robin", score -> 0.1656172994411056}]                        |
==> | 7    | "Matchmaker"                                | [{phrase -> "ellen", score -> 0.4947912795578686},{phrase -> "sarah", score -> 0.24462913913669443},{phrase -> "ted", score -> 0.23728319597607636}]                      |
==> | 8    | "The Duel"                                  | [{phrase -> "ted", score -> 0.26713931416222847},{phrase -> "marshall", score -> 0.22816702335751904},{phrase -> "swords", score -> 0.17841675237702592}]                 |
==> | 9    | "Belly Full of Turkey"                      | [{phrase -> "ericksen", score -> 0.43145756691027665},{phrase -> "mrs ericksen", score -> 0.1939318283559959},{phrase -> "kendall", score -> 0.1846969793866628}]         |
==> | 10   | "The Pineapple Incident"                    | [{phrase -> "ted", score -> 0.439756993033922},{phrase -> "trudy", score -> 0.36367907631894536},{phrase -> "carl", score -> 0.16413071244131686}]                        |
==> | 11   | "The Limo"                                  | [{phrase -> "moby", score -> 0.48314164479037003},{phrase -> "party number", score -> 0.30458929780262456},{phrase -> "ranjit", score -> 0.1991061739767796}]             |
...
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In the cypher version we get one row per episode whereas with the Python version we get 3 rows. It might be possible to achieve this effect with pandas too but I wasn’t sure how to do so.
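For what it’s worth, one way of collapsing the pandas result down to one row per episode would be to group the already-sorted data frame and collect the phrases into a list. This is a rough sketch on my part rather than something from the original analysis, reusing the top_words_by_episode data frame from above:

# group the already-sorted top phrases and collect them into one list per episode,
# mirroring the COLLECT(...)[..3] behaviour of the cypher query
top_phrases_by_episode = top_words_by_episode \
    .groupby("EpisodeId") \
    .apply(lambda group: list(zip(group["Phrase"], group["Score"])))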

Next let’s find the top phrases for a single episode – the type of query that might be part of an episode page on a How I met your mother wiki:

top_words = df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(20)
 
>>> print(top_words.to_string())
 
      EpisodeId                Phrase     Score
3976          1                   ted  0.262518
2912          1                olives  0.195714
2441          1              marshall  0.155515
4732          1               yasmine  0.152279
3347          1                 robin  0.130418
209           1                barney  0.124412
2146          1                  lily  0.122925
3637          1                signal  0.103793
1366          1                goanna  0.098138
3524          1                 scene  0.095342
710           1                   cut  0.091734
2720          1              narrator  0.086462
1147          1             flashback  0.078296
1148          1        flashback date  0.070283
3224          1                ranjit  0.069393
4178          1           ted yasmine  0.058569
1149          1  flashback date robin  0.058569
525           1                  carl  0.058210
3714          1           smurf pen1s  0.054365
2048          1              lebanese  0.054365
The equivalent cypher query looks like this:

MATCH (e:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel
ORDER BY rel.tfidfScore DESC
RETURN phrase.value AS phrase, rel.tfidfScore AS score
LIMIT 20
 
==> +-----------------------------------------------+
==> | phrase                 | score                |
==> +-----------------------------------------------+
==> | "ted"                  | 0.2625177493269755   |
==> | "olives"               | 0.19571419072701732  |
==> | "marshall"             | 0.15551468983363487  |
==> | "yasmine"              | 0.15227880637176266  |
==> | "robin"                | 0.1304175242341549   |
==> | "barney"               | 0.12441175186690791  |
==> | "lily"                 | 0.12292497785945679  |
==> | "signal"               | 0.1037932464656365   |
==> | "goanna"               | 0.09813798750091524  |
==> | "scene"                | 0.09534236041231685  |
==> | "cut"                  | 0.09173366535740156  |
==> | "narrator"             | 0.08646229819848741  |
==> | "flashback"            | 0.07829592155397117  |
==> | "flashback date"       | 0.07028252601773662  |
==> | "ranjit"               | 0.06939276915589167  |
==> | "ted yasmine"          | 0.05856877168144719  |
==> | "flashback date robin" | 0.05856877168144719  |
==> | "carl"                 | 0.058210117288760355 |
==> | "smurf pen1s"          | 0.05436505297972703  |
==> | "lebanese"             | 0.05436505297972703  |
==> +-----------------------------------------------+

Our next query is a negation – find the episodes which don’t mention the phrase ‘robin’. In python we can do some simple set operations to work this out:

all_episodes = set(range(1, 209))
robin_episodes = set(df[(df["Phrase"] == "robin")]["EpisodeId"])
 
>>> print(set(all_episodes) - set(robin_episodes))
set([145, 198, 143])

In cypher land a query will suffice:

MATCH (episode:Episode), (phrase:Phrase {value: "robin"})
WHERE NOT (episode)<-[:USED_IN_EPISODE]-(phrase)
RETURN episode.id AS id, episode.season AS season, episode.number AS episode

And finally a mini recommendation engine type query – how many of the top phrases in Episode 1 were used in other episodes:

First python:

phrases_used = set(df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(10)["Phrase"])
 
phrases = df[df["Phrase"].isin(phrases_used)]
 
print (phrases[phrases["EpisodeId"] != 1] \
    .groupby(["Phrase"]) \
    .size() \
    .order(ascending = False))

Here we’ve pulled it out into a few steps – first we identify the top phrases, then we find out where they occur across the whole data set and finally we filter out the occurrences in the first episode and count the other occurrences.

Phrase
marshall    207
barney      207
ted         206
lily        206
robin       204
scene        36
signal        4
goanna        3
olives        1

In cypher we can write a query to do this as well:

MATCH (episode:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel, episode
ORDER BY rel.tfidfScore DESC
LIMIT 10
MATCH (phrase)-[:USED_IN_EPISODE]->(otherEpisode)
WHERE otherEpisode <> episode
RETURN phrase.value AS phrase, COUNT(*) AS numberOfOtherEpisodes
ORDER BY numberOfOtherEpisodes DESC
 
==> +------------------------------------+
==> | phrase     | numberOfOtherEpisodes |
==> +------------------------------------+
==> | "barney"   | 207                   |
==> | "marshall" | 207                   |
==> | "ted"      | 206                   |
==> | "lily"     | 206                   |
==> | "robin"    | 204                   |
==> | "scene"    | 36                    |
==> | "signal"   | 4                     |
==> | "goanna"   | 3                     |
==> | "olives"   | 1                     |
==> +------------------------------------+

Overall there’s not much in it – for some of the queries I found it easier in cypher and for others easier with pandas. It’s always useful to have multiple tools in the toolbox!

Written by Mark Needham

February 19th, 2015 at 12:52 am

Posted in neo4j


Neo4j: Building a topic graph with Prismatic Interest Graph API


Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes, without success, and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics, each with an associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

The first step is to head to interest-graph.getprismatic.com and get an API key which will be emailed to you.

Having done that we’re ready to make some calls to the API and get back some topics.

I’m going to use Python to call the API and I’ve found the requests library the easiest library to use for this type of work. Our call to the API looks like this:

import requests
payload = { 'title': "insert title of article here",
            'body': "insert body of text here"),
            'api-token': "insert token sent by email here"}
r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)

One thing to keep in mind is that the API is rate limited to 20 requests a second so we need to restrict our requests or we’re going to receive error response codes. Luckily I came across an excellent blog post showing how to write a decorator around a function and only allow it to execute at a certain frequency.

To rate limit our calls to the Interest Graph we need to pull the above code into a function and annotate it appropriately:

import time
 
def RateLimited(maxPerSecond):
    minInterval = 1.0 / float(maxPerSecond)
    def decorate(func):
        lastTimeCalled = [0.0]
        def rateLimitedFunction(*args,**kargs):
            elapsed = time.clock() - lastTimeCalled[0]
            leftToWait = minInterval - elapsed
            if leftToWait>0:
                time.sleep(leftToWait)
            ret = func(*args,**kargs)
            lastTimeCalled[0] = time.clock()
            return ret
        return rateLimitedFunction
    return decorate
 
@RateLimited(0.3)
def topics(title, body):
    payload = { 'title': title,
                'body': body,
                'api-token': "insert token sent by email here"}
    r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
    return r

The text I want to classify is stored in a CSV file – one sentence per line. Here’s a sample:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

We’ll also need to refer to another CSV file to get the title of each episode since it isn’t being stored with the sentence:

$ head -n 10 data/import/episodes_full.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp,Title,Director,Viewers,Writers,Rating
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400,Pilot,Pamela Fryman,10.94,"Carter Bays,Craig Thomas",68
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200,Purple Giraffe,Pamela Fryman,10.40,"Carter Bays,Craig Thomas",63
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000,Sweet Taste of Liberty,Pamela Fryman,10.44,"Phil Lord,Chris Miller",67
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800,Return of the Shirt,Pamela Fryman,9.84,Kourtney Kang,59
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600,Okay Awesome,Pamela Fryman,10.14,Chris Harris,53
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400,Slutty Pumpkin,Pamela Fryman,10.89,Brenda Hsueh,62
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600,Matchmaker,Pamela Fryman,10.55,"Sam Johnson,Chris Marcil",57
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400,The Duel,Pamela Fryman,10.35,Gloria Calderon Kellett,46
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200,Belly Full of Turkey,Pamela Fryman,10.29,"Phil Lord,Chris Miller",60

Now we need to get our episode titles and transcripts ready to pass to the topics function. Since we’ve only got ~ 200 episodes we can create a dictionary to store that data:

import csv
 
episodes = {}
with open("data/import/episodes_full.csv", "r") as episodesfile:
    episodes_reader = csv.reader(episodesfile, delimiter=",")
    episodes_reader.next()
    for episode in episodes_reader:
        episodes[int(episode[0])] = {"title": episode[6], "sentences" : [] }
 
with open("data/import/sentences.csv", "r") as sentencesfile:
     sentences_reader = csv.reader(sentencesfile, delimiter=",")
     sentences_reader.next()
     for sentence in sentences_reader:
         episodes[int(sentence[1])]["sentences"].append(sentence[4])
 
>>> episodes[1]["title"]
'Pilot'
>>> episodes[1]["sentences"][:5]
['Pilot', 'Scene One', '[Title: The Year 2030]', "Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother", 'Son: Are we being punished for something?']

Now we’re going to loop through each of the episodes, call topics and write the result into a CSV file so we can load it into Neo4j afterwards to explore the data:

import json
 
with open("data/import/topics.csv", "w") as topicsfile:
    topics_writer = csv.writer(topicsfile, delimiter=",")
    topics_writer.writerow(["EpisodeId", "TopicId", "Topic", "Score"])
 
    for episode_id, episode in episodes.iteritems():
        tmp = topics(episode["title"], "".join(episode["sentences"])).json()
        print episode_id, tmp
        for topic in tmp['topics']:
            topics_writer.writerow([episode_id, topic["id"], topic["topic"], topic["score"]])

It takes about 10 minutes to run and this is a sample of the output:

$ head -n 10 data/import/topics.csv
EpisodeId,TopicId,Topic,Score
1,1519,Fiction,0.5798245566455255
1,2015,Humour,0.565154963605359
1,24031,Laughing,0.5587120401021765
1,16693,Flirting,0.5514098189505282
1,1163,Dating and Courtship,0.5487490108554022
1,2386,Kissing,0.5476185929151934
1,31929,Puns,0.5375100569837977
2,24031,Laughing,0.5670926949850333
2,1519,Fiction,0.5396488295397263
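The ~10 minute running time is roughly what the rate limiting predicts – a back-of-the-envelope estimate on my part: at 0.3 calls per second the decorator allows one request every ~3.3 seconds, so 208 episodes * 3.3 seconds ≈ 690 seconds, or a little over 11 minutes.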

We’ll use Neo4j’s LOAD CSV command to load the data in:

// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// now link the episodes and topics
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MATCH (topic:Topic {id: TOINT(row.TopicId)})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:TOPIC {score: TOFLOAT(row.Score)}]->(topic)

We’ll assume that the episodes and seasons are already loaded – the commands to load those in are on github.

We can now write some queries against our topic graph. We’ll start simple – show me the topics for an episode:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic)
RETURN topic, r


Let’s say we liked the ‘Puns’ aspect of the Pilot episode and want to find out which other episodes had puns. The following query would let us find those:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic {value: "Puns"})<-[:TOPIC]-(other)
RETURN episode, topic, other


Or maybe we want to find the episode which has the most topics in common:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)
RETURN otherEpisode.title as episode, COUNT(r) AS topicsInCommon
ORDER BY topicsInCommon DESC
LIMIT 10
==> +------------------------------------------------+
==> | episode                       | topicsInCommon |
==> +------------------------------------------------+
==> | "Purple Giraffe"              | 6              |
==> | "Ten Sessions"                | 5              |
==> | "Farhampton"                  | 4              |
==> | "The Three Days Rule"         | 4              |
==> | "How I Met Everyone Else"     | 4              |
==> | "The Time Travelers"          | 4              |
==> | "Mary the Paralegal"          | 4              |
==> | "Lobster Crawl"               | 4              |
==> | "The Magician's Code, Part 2" | 4              |
==> | "Slutty Pumpkin"              | 4              |
==> +------------------------------------------------+
==> 10 rows

We could then tweak that query to get the names of those topics:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)-[:IN_SEASON]->(season)
RETURN otherEpisode.title as episode, season.number AS season, COUNT(r) AS topicsInCommon, COLLECT(topic.value)
ORDER BY topicsInCommon DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | episode                   | season | topicsInCommon | COLLECT(topic.value)                                                        |
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | "Purple Giraffe"          | "1"    | 6              | ["Humour","Fiction","Kissing","Dating and Courtship","Flirting","Laughing"] |
==> | "Ten Sessions"            | "3"    | 5              | ["Humour","Puns","Dating and Courtship","Flirting","Laughing"]              |
==> | "How I Met Everyone Else" | "3"    | 4              | ["Humour","Fiction","Dating and Courtship","Laughing"]                      |
==> | "Farhampton"              | "8"    | 4              | ["Humour","Fiction","Kissing","Dating and Courtship"]                       |
==> | "Bedtime Stories"         | "9"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Definitions"             | "5"    | 4              | ["Kissing","Dating and Courtship","Flirting","Laughing"]                    |
==> | "Lobster Crawl"           | "8"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> | "Little Boys"             | "3"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Wait for It"             | "3"    | 4              | ["Fiction","Puns","Flirting","Laughing"]                                    |
==> | "Mary the Paralegal"      | "1"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> +-----------------------------------------------------------------------------------------------------------------------------------+

Overall 168 (out of 208) of the other episodes have a topic in common with the first episode so perhaps just having a topic in common isn’t the best indication of similarity.

An interesting next step would be to calculate cosine or jaccard similarity between the episodes and store that value in the graph for querying later on.
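To give an idea of what that might look like, here’s a sketch (mine, not part of the original analysis) which computes Jaccard similarity between the Pilot and every other episode straight from the topics.csv file we generated earlier:

import csv
from collections import defaultdict
 
# build a set of topic ids for each episode from the CSV file we wrote out above
episode_topics = defaultdict(set)
with open("data/import/topics.csv", "r") as topicsfile:
    reader = csv.reader(topicsfile, delimiter=",")
    next(reader)  # skip the header row
    for episode_id, topic_id, topic, score in reader:
        episode_topics[int(episode_id)].add(topic_id)
 
def jaccard(first, second):
    # size of the intersection divided by the size of the union
    return float(len(first & second)) / len(first | second)
 
pilot_topics = episode_topics[1]
similarities = [(other, jaccard(pilot_topics, topics))
                for other, topics in episode_topics.items()
                if other != 1]
 
for episode_id, score in sorted(similarities, key=lambda pair: pair[1], reverse=True)[:5]:
    print(episode_id, score)

The resulting scores could then be written back into the graph as a relationship property using the same LOAD CSV approach as above.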

I’ve also calculated the most common bigrams across all the transcripts so it would be interesting to see if there are any interesting insights at the intersection of episodes, topics and phrases.

Written by Mark Needham

February 13th, 2015 at 11:38 pm

Posted in neo4j,Python


Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother


After reading Emil’s blog post about dark data a few weeks ago I became intrigued by the idea of trying to find some structure in free text data, and I thought How I met your mother’s transcripts would be a good place to start.

I found a website which has the transcripts for all the episodes and, having manually downloaded the two pages which listed all the episodes, wrote a script to grab each of the transcripts so I could use them on my machine.

I wanted to learn a bit of Python and my colleague Nigel pointed me towards the requests and BeautifulSoup libraries to help me with my task. The script to grab the transcripts looks like this:

import requests
from bs4 import BeautifulSoup
from soupselect import select
 
episodes = {}
for i in range(1,3):
    page = open("data/transcripts/page-" + str(i) + ".html", 'r')
    soup = BeautifulSoup(page.read())
 
    for row in select(soup, "td.topic-titles a"):
        parts = row.text.split(" - ")
        episodes[parts[0]] = {"title": parts[1], "link": row.get("href")}
 
for key, value in episodes.iteritems():
    parts = key.split("x")
    season = int(parts[0])
    episode = int(parts[1])
    filename = "data/transcripts/S%d-Ep%d" %(season, episode)
    print filename
 
    with open(filename, 'wb') as handle:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers = headers)
        if response.ok:
            for block in response.iter_content(1024):
                if not block:
                    break
 
                handle.write(block)

The files containing the lists of episodes are named ‘page-1’ and ‘page-2’.

The code is reasonably simple – we find all the links inside the table, put them in a dictionary and then iterate through the dictionary and download the files to disk. The code to save the file is a bit of a monstrosity but there didn’t seem to be a ‘save’ method that I could use.
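For what it’s worth, requests can do the saving a bit more succinctly – a possible simplification of the inner part of the loop above (untested against this particular site) which writes the whole response body in one go:

response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers = headers)
if response.ok:
    with open(filename, 'wb') as handle:
        # response.content holds the full body, so for pages of this size
        # there's no need to iterate over it in chunks
        handle.write(response.content)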

Having downloaded the files, I thought through all sorts of clever things I could do, including generating a bag of words model for each episode or performing sentiment analysis on each sentence which I’d learnt about from a Kaggle tutorial.

In the end I decided to start simple and extract all the words from the transcripts and count how many times each word occurred in a given episode.

I ended up with the following script which created a dictionary of (episode -> words + occurrences):

import csv
import nltk
import re
 
from bs4 import BeautifulSoup
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize
 
def count_words(words):
    tally=Counter()
    for elem in words:
        tally[elem] += 1
    return tally
 
episodes_dict = {}
with open('data/import/episodes.csv', 'r') as episodes:
    reader = csv.reader(episodes, delimiter=',')
    reader.next()
 
    for row in reader:
        print row
        transcript = open("data/transcripts/S%s-Ep%s" %(row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")
 
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]
 
        text = re.sub("[^a-zA-Z]", " ", raw_text.text.strip())
        words = [w for w in nltk.word_tokenize(text) if not w.lower() in stopwords.words("english")]
 
        episodes_dict[row[0]] = count_words(words)

Next I wanted to explore the data a bit to see which words occurred across episodes or which word occurred most frequently and realised that this would be a much easier task if I stored the data somewhere.

s/somewhere/in Neo4j

Neo4j’s query language, Cypher, has a really nice ETL-esque tool called ‘LOAD CSV’ for loading in CSV files (as the name suggests!) so I added some code to save my words to disk:

with open("data/import/words.csv", "w") as words:
    writer = csv.writer(words, delimiter=",")
    writer.writerow(["EpisodeId", "Word", "Occurrences"])
    for episode_id, words in episodes_dict.iteritems():
        for word in words:
            writer.writerow([episode_id, word, words[word]])

This is what the CSV file contents look like:

$ head -n 10 data/import/words.csv
EpisodeId,Word,Occurrences
165,secondly,1
165,focus,1
165,baby,1
165,spiders,1
165,go,4
165,apartment,1
165,buddy,1
165,Exactly,1
165,young,1

Now we need to write some Cypher to get the data into Neo4j:

// words
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
// episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (episode:Episode {id: TOINT(row.EpisodeId)})
// words to episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MATCH (word:Word {value: row.Word})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (word)-[:USED_IN_EPISODE {times: TOINT(row.Occurrences) }]->(episode);

Having done that we can write some simple queries to explore the words used in How I met your mother:

MATCH (word:Word)-[r:USED_IN_EPISODE]->(episode) 
RETURN word.value, COUNT(episode) AS episodes, SUM(r.times) AS occurrences
ORDER BY occurrences DESC
LIMIT 10
 
==> +-------------------------------------+
==> | word.value | episodes | occurrences |
==> +-------------------------------------+
==> | "Ted"      | 207      | 11437       |
==> | "Barney"   | 208      | 8052        |
==> | "Marshall" | 208      | 7236        |
==> | "Robin"    | 205      | 6626        |
==> | "Lily"     | 207      | 6330        |
==> | "m"        | 208      | 4777        |
==> | "re"       | 208      | 4097        |
==> | "know"     | 208      | 3489        |
==> | "Oh"       | 197      | 3448        |
==> | "like"     | 208      | 2498        |
==> +-------------------------------------+
==> 10 rows

The main 5 characters occupy the top 5 positions which is probably what you’d expect. I’m not sure why ‘m’ and ‘re’ are in the next two positions – I expect that might be scraping gone wrong!
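Another possible culprit – a guess on my part rather than something I’ve traced through – is the regex clean-up step in the extraction script rather than the scraping itself: replacing non-alphabetic characters with spaces splits contractions like "I'm" and "you're" into separate tokens:

import re
import nltk
 
text = re.sub("[^a-zA-Z]", " ", "I'm sure you're going to love this")
print(nltk.word_tokenize(text))
# ['I', 'm', 'sure', 'you', 're', 'going', 'to', 'love', 'this']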

Our next query might focus around checking which character is referred to the most in each episode:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN episode.id, topWord.word AS word, topWord.times AS occurrences
LIMIT 10
 
==> +---------------------------------------+
==> | episode.id | word       | occurrences |
==> +---------------------------------------+
==> | 72         | "Barney"   | 75          |
==> | 143        | "Ted"      | 16          |
==> | 43         | "Lily"     | 74          |
==> | 156        | "Ted"      | 12          |
==> | 206        | "Barney"   | 23          |
==> | 50         | "Marshall" | 51          |
==> | 113        | "Ted"      | 76          |
==> | 178        | "Barney"   | 21          |
==> | 182        | "Barney"   | 22          |
==> | 67         | "Ted"      | 84          |
==> +---------------------------------------+
==> 10 rows

If we dig into it further there’s actually quite a bit of variety in the number of times the top character in each episode is mentioned which again probably says something about the data:

WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN MIN(topWord.times), MAX(topWord.times), AVG(topWord.times), STDEV(topWord.times)
 
==> +-------------------------------------------------------------------------------------+
==> | MIN(topWord.times) | MAX(topWord.times) | AVG(topWord.times) | STDEV(topWord.times) |
==> +-------------------------------------------------------------------------------------+
==> | 3                  | 259                | 63.90865384615385  | 42.36255207691068    |
==> +-------------------------------------------------------------------------------------+
==> 1 row

Obviously this is a very simple way of deriving structure from text. Here are some of the things I want to try out next:

  • Detecting common phrases/memes used in the show (e.g. the yellow umbrella) – this should be possible by creating different length n-grams and then searching for those phrases across the corpus (see the sketch after this list).
  • Pull out scenes – some of the transcripts use the keyword ‘scene’ to denote this although some of them don’t. Depending on how many transcripts contain scene demarcations, perhaps we could train a classifier to detect where scenes should be in the transcripts which don’t have them.
  • Analyse who talks to each other or who talks about each other most frequently
  • Create a graph of conversations as my colleagues Max and Michael have previously blogged about.
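To give an idea of what the n-gram approach might look like, here’s a small sketch (not something I’ve run over the full corpus yet) using NLTK’s ngrams helper to count bigrams and trigrams in a sentence:

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
 
sentence = "Where is the yellow umbrella? I saw the yellow umbrella at the club."
tokens = [word.lower() for word in word_tokenize(sentence) if word.isalpha()]
 
# count every bigram and trigram so that recurring phrases bubble up
phrases = Counter()
for n in [2, 3]:
    phrases.update(" ".join(gram) for gram in ngrams(tokens, n))
 
print(phrases.most_common(3))
# e.g. [('yellow umbrella', 2), ('the yellow', 2), ('the yellow umbrella', 2)]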

Written by Mark Needham

January 10th, 2015 at 1:22 am

Posted in neo4j,Python


Neo4j 2.1.6 – Cypher: FOREACH slowness


A common problem that people have when using Neo4j for social network applications is updating a person with their newly imported friends.

We’ll have an array of friends that we want to connect to a single Person node. Assuming the following schema…

$ schema
Indexes
  ON :Person(id) ONLINE
 
No constraints

…a simplified version would look like this:

WITH range (2,1002) AS friends
MERGE (p:Person {id: 1})
 
FOREACH(f IN friends |
  MERGE (friend:Person {id: f})
  MERGE (friend)-[:FRIENDS]->p);

If we execute that on an empty database we’ll see something like this:

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
19173 ms

This took much longer than we’d expect so let’s have a look at the PROFILE output:

EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager
      |
      +UpdateGraph(1)
        |
        +Extract
          |
          +Null
 
+----------------+------+---------+-------------+--------------------------------------+
|       Operator | Rows |  DbHits | Identifiers |                                Other |
+----------------+------+---------+-------------+--------------------------------------+
|    EmptyResult |    0 |       0 |             |                                      |
| UpdateGraph(0) |    1 | 3015012 |             |                              Foreach |
|          Eager |    1 |       0 |             |                                      |
| UpdateGraph(1) |    1 |       5 |        p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |       0 |             |                              friends |
|           Null |    ? |       ? |             |                                      |
+----------------+------+---------+-------------+--------------------------------------+

The DbHits value on the 2nd row seems suspiciously high suggesting that FOREACH might not be making use of the Person#id index and is instead scanning all Person nodes each time.
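As a very rough sanity check – my own arithmetic, not something the profile output states – 3,015,012 db hits across 1,001 iterations works out at roughly 3,000 hits per MERGE, which is about what you’d expect if each iteration were touching every one of the ~1,000 Person nodes a few times instead of doing a single index lookup.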

I’m not sure how to drill into that further but an alternative approach is to try out the same query but using UNWIND instead and checking the profile output of that:

WITH range (2,1002) AS friends
MERGE (p:Person {id: 1})
WITH p, friends
UNWIND friends AS f
MERGE (friend:Person {id: f})
MERGE (friend)-[:FRIENDS]->p;
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
343 ms
The profile output looks like this:

EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager(0)
      |
      +UpdateGraph(1)
        |
        +UNWIND
          |
          +Eager(1)
            |
            +UpdateGraph(2)
              |
              +Extract
                |
                +Null
 
+----------------+------+--------+-------------------------+--------------------------------------+
|       Operator | Rows | DbHits |             Identifiers |                                Other |
+----------------+------+--------+-------------------------+--------------------------------------+
|    EmptyResult |    0 |      0 |                         |                                      |
| UpdateGraph(0) | 1001 |      0 | friend, p,   UNNAMED136 |                         MergePattern |
|       Eager(0) | 1001 |      0 |                         |                                      |
| UpdateGraph(1) | 1001 |   5005 |          friend, friend |            MergeNode; f; :Person(id) |
|         UNWIND | 1001 |      0 |                         |                                      |
|       Eager(1) |    1 |      0 |                         |                                      |
| UpdateGraph(2) |    1 |      5 |                    p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |      0 |                         |                              friends |
|           Null |    ? |      ? |                         |                                      |
+----------------+------+--------+-------------------------+--------------------------------------+

That’s much quicker and doesn’t touch as many nodes as the FOREACH version did. I expect the index issue will be sorted out in future but until then UNWIND is our friend.

Written by Mark Needham

December 28th, 2014 at 4:28 am

Posted in neo4j


Neo4j: Cypher – Avoiding the Eager


Although I love how easy Cypher’s LOAD CSV command makes it to get data into Neo4j, it currently breaks the rule of least surprise in the way it eagerly loads in all rows for some queries, even those using periodic commit.


Beware of the eager pipe

This is something that my colleague Michael noted in the second of his blog posts explaining how to use LOAD CSV successfully:

The biggest issue that people ran into, even when following the advice I gave earlier, was that for large imports of more than one million rows, Cypher ran into an out-of-memory situation.

That was not related to commit sizes, so it happened even with PERIODIC COMMIT of small batches.

I recently spent a few days importing data into Neo4j on a Windows machine with 4GB RAM so I was seeing this problem even earlier than Michael suggested.

Michael explains how to work out whether your query is suffering from unexpected eager evaluation:

If you profile that query you see that there is an “Eager” step in the query plan.

That is where the “pull in all data” happens.

You can profile queries by prefixing them with the keyword PROFILE. You’ll need to run your query in the console of /webadmin in your web browser or with the Neo4j shell.

I did this for my queries and was able to identify query patterns which get evaluated eagerly and in some cases we can work around it.

We’ll use the Northwind data set to demonstrate how the Eager pipe can sneak into our queries but keep in mind that this data set is sufficiently small to not cause issues.

This is what a row in the file looks like:

$ head -n 2 data/customerDb.csv
OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry,CustomerID,CustomerCompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,EmployeeID,LastName,FirstName,Title,TitleOfCourtesy,BirthDate,HireDate,Address,City,Region,PostalCode,Country,HomePhone,Extension,Photo,Notes,ReportsTo,PhotoPath,OrderID,ProductID,UnitPrice,Quantity,Discount,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued,SupplierID,SupplierCompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,HomePage,CategoryID,CategoryName,Description,Picture
10248,VINET,5,1996-07-04,1996-08-01,1996-07-16,3,32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France,VINET,Vins et alcools Chevalier,Paul Henriot,Accounting Manager,59 rue de l'Abbaye,Reims,,51100,France,26.47.15.10,26.47.15.11,5,Buchanan,Steven,Sales Manager,Mr.,1955-03-04,1993-10-17,14 Garrett Hill,London,,SW1 8JR,UK,(71) 555-4848,3453,\x,"Steven Buchanan graduated from St. Andrews University, Scotland, with a BSC degree in 1976.  Upon joining the company as a sales representative in 1992, he spent 6 months in an orientation program at the Seattle office and then returned to his permanent post in London.  He was promoted to sales manager in March 1993.  Mr. Buchanan has completed the courses ""Successful Telemarketing"" and ""International Sales Management.""  He is fluent in French.",2,http://accweb/emmployees/buchanan.bmp,10248,11,14,12,0,11,Queso Cabrales,5,4,1 kg pkg.,21,22,30,30,0,5,Cooperativa de Quesos 'Las Cabras',Antonio del Valle Saavedra,Export Administrator,Calle del Rosal 4,Oviedo,Asturias,33007,Spain,(98) 598 76 54,,,4,Dairy Products,Cheeses,\x

MERGE, MERGE, MERGE

The first thing we want to do is create a node for each employee and each order and then create a relationship between them.

We might start with the following query:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)

This does the job but if we profile the query like so…

PROFILE LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
WITH row LIMIT 0
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)

…we’ll notice an ‘Eager’ lurking on the third line:

==> +----------------+------+--------+----------------------------------+-----------------------------------------+
==> |       Operator | Rows | DbHits |                      Identifiers |                                   Other |
==> +----------------+------+--------+----------------------------------+-----------------------------------------+
==> |    EmptyResult |    0 |      0 |                                  |                                         |
==> | UpdateGraph(0) |    0 |      0 |    employee, order,   UNNAMED216 |                            MergePattern |
==> |          Eager |    0 |      0 |                                  |                                         |
==> | UpdateGraph(1) |    0 |      0 | employee, employee, order, order | MergeNode; :Employee; MergeNode; :Order |
==> |          Slice |    0 |      0 |                                  |                            {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                              row |                                         |
==> +----------------+------+--------+----------------------------------+-----------------------------------------+

You’ll notice that when we profile each query we’re stripping off the periodic commit section and adding a ‘WITH row LIMIT 0’. This allows us to generate enough of the query plan to identify the ‘Eager’ operator without actually importing any data.

We want to split that query into two so it can be processed in a non-eager manner:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
WITH row LIMIT 0
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
==> |    Operator | Rows | DbHits |                      Identifiers |                                   Other |
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
==> | EmptyResult |    0 |      0 |                                  |                                         |
==> | UpdateGraph |    0 |      0 | employee, employee, order, order | MergeNode; :Employee; MergeNode; :Order |
==> |       Slice |    0 |      0 |                                  |                            {  AUTOINT0} |
==> |     LoadCSV |    1 |      0 |                              row |                                         |
==> +-------------+------+--------+----------------------------------+-----------------------------------------+

Now that we’ve created the employees and orders we can join them together:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |       Operator | Rows | DbHits |                   Identifiers |                                                     Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                               |                                                           |
==> |    UpdateGraph |    0 |      0 | employee, order,   UNNAMED216 |                                              MergePattern |
==> |      Filter(0) |    0 |      0 |                               |          Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(0) |    0 |      0 |                  order, order |                                                    :Order |
==> |      Filter(1) |    0 |      0 |                               | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(1) |    0 |      0 |            employee, employee |                                                 :Employee |
==> |          Slice |    0 |      0 |                               |                                              {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                           row |                                                           |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+

Not an Eager in sight!

MATCH, MATCH, MATCH, MERGE, MERGE

If we fast forward a few steps we may now have refactored our import script to the point where we create our nodes in one query and the relationships in another query.

Our create query works as expected:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (product:Product {productId: row.ProductID})
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
==> |    Operator | Rows | DbHits |                                        Identifiers |                                                        Other |
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
==> | EmptyResult |    0 |      0 |                                                    |                                                              |
==> | UpdateGraph |    0 |      0 | employee, employee, order, order, product, product | MergeNode; :Employee; MergeNode; :Order; MergeNode; :Product |
==> |       Slice |    0 |      0 |                                                    |                                                 {  AUTOINT0} |
==> |     LoadCSV |    1 |      0 |                                                row |                                                              |
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+

We’ve now got employees, products and orders in the graph. Now let’s create relationships between the trio:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MATCH (product:Product {productId: row.ProductID})
MERGE (employee)-[:SOLD]->(order)
MERGE (order)-[:PRODUCT]->(product)

If we profile that we’ll notice Eager has sneaked in again!

==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |       Operator | Rows | DbHits |                   Identifiers |                                                     Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                               |                                                           |
==> | UpdateGraph(0) |    0 |      0 |  order, product,   UNNAMED318 |                                              MergePattern |
==> |          Eager |    0 |      0 |                               |                                                           |
==> | UpdateGraph(1) |    0 |      0 | employee, order,   UNNAMED287 |                                              MergePattern |
==> |      Filter(0) |    0 |      0 |                               |    Property(product,productId) == Property(row,ProductID) |
==> | NodeByLabel(0) |    0 |      0 |              product, product |                                                  :Product |
==> |      Filter(1) |    0 |      0 |                               |          Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(1) |    0 |      0 |                  order, order |                                                    :Order |
==> |      Filter(2) |    0 |      0 |                               | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(2) |    0 |      0 |            employee, employee |                                                 :Employee |
==> |          Slice |    0 |      0 |                               |                                              {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                           row |                                                           |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+

In this case the Eager happens on our second call to MERGE as Michael identified in his post:

The issue is that within a single Cypher statement you have to isolate changes that affect matches further on, e.g. when you CREATE nodes with a label that are suddenly matched by a later MATCH or MERGE operation.

We can work around the problem in this case by having separate queries to create the relationships:

LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |       Operator | Rows | DbHits |                   Identifiers |                                                     Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                               |                                                           |
==> |    UpdateGraph |    0 |      0 | employee, order,   UNNAMED236 |                                              MergePattern |
==> |      Filter(0) |    0 |      0 |                               |          Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(0) |    0 |      0 |                  order, order |                                                    :Order |
==> |      Filter(1) |    0 |      0 |                               | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(1) |    0 |      0 |            employee, employee |                                                 :Employee |
==> |          Slice |    0 |      0 |                               |                                              {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                           row |                                                           |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (order:Order {orderId: row.OrderID})
MATCH (product:Product {productId: row.ProductID})
MERGE (order)-[:PRODUCT]->(product)
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
==> |       Operator | Rows | DbHits |                  Identifiers |                                                  Other |
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
==> |    EmptyResult |    0 |      0 |                              |                                                        |
==> |    UpdateGraph |    0 |      0 | order, product,   UNNAMED229 |                                           MergePattern |
==> |      Filter(0) |    0 |      0 |                              | Property(product,productId) == Property(row,ProductID) |
==> | NodeByLabel(0) |    0 |      0 |             product, product |                                               :Product |
==> |      Filter(1) |    0 |      0 |                              |       Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(1) |    0 |      0 |                 order, order |                                                 :Order |
==> |          Slice |    0 |      0 |                              |                                           {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                          row |                                                        |
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+

MERGE, SET

I try to make LOAD CSV scripts as idempotent as possible so that if we add more rows or columns of data to our CSV we can rerun the query without having to recreate everything.

This can lead you towards the following pattern where we’re creating suppliers:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID})
SET supplier.companyName = row.SupplierCompanyName

We want to ensure that there’s only one Supplier with that SupplierID but we might be incrementally adding new properties and decide to just replace everything by using the ‘SET’ command. If we profile that query, the Eager lurks:

==> +----------------+------+--------+--------------------+----------------------+
==> |       Operator | Rows | DbHits |        Identifiers |                Other |
==> +----------------+------+--------+--------------------+----------------------+
==> |    EmptyResult |    0 |      0 |                    |                      |
==> | UpdateGraph(0) |    0 |      0 |                    |          PropertySet |
==> |          Eager |    0 |      0 |                    |                      |
==> | UpdateGraph(1) |    0 |      0 | supplier, supplier | MergeNode; :Supplier |
==> |          Slice |    0 |      0 |                    |         {  AUTOINT0} |
==> |        LoadCSV |    1 |      0 |                row |                      |
==> +----------------+------+--------+--------------------+----------------------+

We can work around this at the cost of a bit of duplication using ‘ON CREATE SET’ and ‘ON MATCH SET’:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID})
ON CREATE SET supplier.companyName = row.SupplierCompanyName
ON MATCH SET supplier.companyName = row.SupplierCompanyName
==> +-------------+------+--------+--------------------+----------------------+
==> |    Operator | Rows | DbHits |        Identifiers |                Other |
==> +-------------+------+--------+--------------------+----------------------+
==> | EmptyResult |    0 |      0 |                    |                      |
==> | UpdateGraph |    0 |      0 | supplier, supplier | MergeNode; :Supplier |
==> |       Slice |    0 |      0 |                    |         {  AUTOINT0} |
==> |     LoadCSV |    1 |      0 |                row |                      |
==> +-------------+------+--------+--------------------+----------------------+

With the data set I’ve been working with I was able to avoid OutOfMemory exceptions in some cases and reduce the amount of time it took to run the query by a factor of 3 in others.
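If ON CREATE SET / ON MATCH SET doesn’t suit – for example if only some of the properties should be overwritten when the node already exists – another pattern worth trying is to split the import into two passes: one to MERGE the nodes and one to update them. This is just a sketch rather than something I measured for this post, so profile it on your own version before trusting it to dodge the Eager:

// Pass 1: make sure every supplier node exists
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID});
 
// Pass 2: update the properties on the suppliers we just created or matched
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (supplier:Supplier {supplierId: row.SupplierID})
SET supplier.companyName = row.SupplierCompanyName;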

As time goes on I expect all of these scenarios will be addressed but as of Neo4j 2.1.5 these are the patterns that I’ve identified as being overly eager.

If you know of any others do let me know and I can add them to the post or write a second part.

Written by Mark Needham

October 23rd, 2014 at 5:56 am

Posted in neo4j


Neo4j: Modelling sub types

without comments

A question which sometimes comes up when discussing graph data modelling is how you go about modelling sub/super types.

In my experience there are two reasons why we might want to do this:

  • To ensure that certain properties exist on bits of data
  • To write drill down queries based on those types

At the moment the former isn’t built into Neo4j (you’d only be able to achieve it by wiring up some code in the pre-commit hook of a transaction event handler), so we’ll focus on the latter.

The typical example used for showing how to design sub types is the animal kingdom, and I managed to find a data set from Louisville, Kentucky’s Animal Services which we can use.

In this case the sub types are used to represent the type of animal, breed group and breed. We then also have ‘real data’ in terms of actual dogs under the care of animal services.

We effectively end up with two graphs in one – a model and a meta model:

[Diagram: the type hierarchy (meta model) and the actual animal data (model) in one graph]

The cypher query to create this graph looks like this:

LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-subtypes/data/dogs.csv" AS line
MERGE (animalType:AnimalType {name: "Dog"})
MERGE (breedGroup:BreedGroup {name: line.BreedGroup})
MERGE (breed:Breed {name: line.PrimaryBreed})
MERGE (animal:Animal {id: line.TagIdentity, primaryColour: line.PrimaryColour, size: line.Size})
 
MERGE (animalType)<-[:PARENT]-(breedGroup)
MERGE (breedGroup)<-[:PARENT]-(breed)
MERGE (breed)<-[:PARENT]-(animal)

We could then write a simple query to find out how many dogs we have:

MATCH (animalType:AnimalType)<-[:PARENT*]-(animal)
RETURN animalType, COUNT(*) AS animals
ORDER BY animals DESC
==> +--------------------------------+
==> | animalType           | animals |
==> +--------------------------------+
==> | Node[89]{name:"Dog"} | 131     |
==> +--------------------------------+
==> 1 row

Or we could write a slightly more complex query to find the number of animals at each level of our type hierarchy:

MATCH path = (animalType:AnimalType)<-[:PARENT]-(breedGroup)<-[:PARENT*]-(animal)
RETURN [node IN nodes(path) | node.name][..-1] AS breed, COUNT(*) AS animals
ORDER BY animals DESC
LIMIT 5
==> +-----------------------------------------------------+
==> | breed                                     | animals |
==> +-----------------------------------------------------+
==> | ["Dog","SETTER/RETRIEVE","LABRADOR RETR"] | 15      |
==> | ["Dog","SETTER/RETRIEVE","GOLDEN RETR"]   | 13      |
==> | ["Dog","POODLE","POODLE MIN"]             | 10      |
==> | ["Dog","TERRIER","MIN PINSCHER"]          | 9       |
==> | ["Dog","SHEPHERD","WELSH CORGI CAR"]      | 6       |
==> +-----------------------------------------------------+
==> 5 rows

We might then decide to add an exercise sub graph which indicates how much exercise each type of dog requires:

MATCH (breedGroup:BreedGroup)
WHERE breedGroup.name IN ["SETTER/RETRIEVE", "POODLE"]
MERGE (exercise:Exercise {type: "2 hours hard exercise"})
MERGE (exercise)<-[:REQUIRES_EXERCISE]-(breedGroup);
MATCH (breedGroup:BreedGroup)
WHERE breedGroup.name IN ["TERRIER", "SHEPHERD"]
MERGE (exercise:Exercise {type: "1 hour gentle exercise"})
MERGE (exercise)<-[:REQUIRES_EXERCISE]-(breedGroup);

We could then query that to find out which dogs need to come out for 2 hours of hard exercise:

MATCH (exercise:Exercise {type: "2 hours hard exercise"})<-[:REQUIRES_EXERCISE]-()<-[:PARENT*]-(dog)
WHERE NOT (dog)<-[:PARENT]-()
RETURN dog
LIMIT 10
==> +-----------------------------------------------------------+
==> | dog                                                       |
==> +-----------------------------------------------------------+
==> | Node[541]{id:"664427",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[542]{id:"543787",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[543]{id:"584021",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[544]{id:"584022",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[545]{id:"664430",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[546]{id:"535176",primaryColour:"BLACK",size:"SMALL"} |
==> | Node[567]{id:"613557",primaryColour:"WHITE",size:"SMALL"} |
==> | Node[568]{id:"531376",primaryColour:"WHITE",size:"SMALL"} |
==> | Node[569]{id:"613567",primaryColour:"WHITE",size:"SMALL"} |
==> | Node[570]{id:"531379",primaryColour:"WHITE",size:"SMALL"} |
==> +-----------------------------------------------------------+
==> 10 rows

In this query we ensured that we only returned dogs rather than breeds by checking that there was no incoming PARENT relationship. Alternatively we could have filtered on the Animal label…

MATCH (exercise:Exercise {type: "2 hours hard exercise"})<-[:REQUIRES_EXERCISE]-()<-[:PARENT*]-(dog:Animal)
RETURN dog
LIMIT 10

or if we wanted to only take the dogs out for exercise perhaps we’d put a Dog label on the appropriate nodes, as in the sketch below.
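As a sketch of what that could look like (the Dog label isn’t in the data set as it stands), we could tag every animal that sits under the ‘Dog’ animal type and then filter on that label in the exercise query:

// Hypothetical addition: label all animals under the "Dog" animal type
MATCH (:AnimalType {name: "Dog"})<-[:PARENT*]-(animal:Animal)
SET animal:Dog;
 
// The exercise query then only needs to filter on the new label
MATCH (exercise:Exercise {type: "2 hours hard exercise"})<-[:REQUIRES_EXERCISE]-()<-[:PARENT*]-(dog:Dog)
RETURN dog
LIMIT 10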

People are often curious why labels don’t have super/sub types between them, but I tend to use labels for simple categorisation – anything more complicated and we may as well use the built-in power of the graph model!

The code is on github should you wish to play with it.

Written by Mark Needham

October 20th, 2014 at 11:08 pm

Posted in neo4j


Lessons from running Neo4j based ‘hackathons’

without comments

Over the last 6 months my colleagues and I have been running hands on Neo4j based sessions every few weeks and I was recently asked if I could write up the lessons we’ve learned.

So in no particular order here are some of the things that we’ve learnt:

Have a plan but don’t stick to it rigidly

Something we learnt early on is that it’s helpful to have a rough plan of how you’re going to spend the session otherwise it can feel quite chaotic for attendees.

We show people that plan at the beginning of the session so that they know what to expect and can plan their time accordingly if the second part doesn’t interest them as much.

Having said that, we’ve often gone off on a tangent and since people have been very interested in that we’ve just gone with it.

This sometimes means that you don’t cover everything you had in mind but the main thing is that people enjoy themselves so it’s nothing to worry about.

Prepare for people to be unprepared

We try to set expectations in advance of the sessions with respect to what people should prepare or have installed on their machines, but despite that you’ll have people at varying levels of readiness.

Having noticed this trend over a few months we now allot time in the schedule for getting people up and running and if we’re really struggling then we’ll ask people to pair with each other.

There will also be experience level differences so we always factor in some time to go over the basics for those who are new. We also encourage experienced people to help the others out – after all you only really know if you know something when you try to teach someone else!

Don’t try to do too much

Our first ‘hackathon’-esque event involved an attempt to build a Java application based on a British Library dataset.

I thought we’d be able to model the data set, import it and then wire up some queries to an application in a few hours.

This proved to be ever so slightly ambitious!

It took much longer than anticipated to do those first two steps and we didn’t get to build any of the application – teaching people how to model in a graph is probably a session in its own right.

Show the end state

Feedback we got from attendees to the first few versions was that they’d like to see what the end state should have looked like if they’d completed everything.

In our Clojure Hackathon Rohit got the furthest so we shared his code with everyone afterwards.

An even better approach is to have the final solution ready in advance and have it checked in on a different branch that you can point people at afterwards.

Show the intermediate states

Another thing we noticed was that if people got behind in the first part of the session then they’d never be able to catch up.

Nigel therefore came up with the idea of snapshotting intermediate states so that people could reset themselves after each part of the session. This is something that the Polymer tutorial does as well.

We worked out that we have two solid one hour slots before people start getting distracted by their journey home so we came up with two distinct types of tasks for people to do and then created a branch with the solution at the end of those tasks.

No doubt there will be more lessons to come as we run more sessions but this is where we are at the moment. If you fancy joining in, our next session is Java based and takes place in a couple of weeks’ time.

Finally, if you want to see a really slick hands on meetup then you’ll want to head over to the London Clojure Dojo. Bruce Durling has even written up some tips on how to run one yourself.

Written by Mark Needham

October 11th, 2014 at 10:52 am

Posted in neo4j


Neo4j: Generic/Vague relationship names

without comments

An approach to modelling that I often see while working with Neo4j users is creating very generic relationships (e.g. HAS, CONTAINS, IS) and filtering on a relationship property or on a property/label at the end node.

Intuitively this doesn’t seem to make the best use of the graph model as it means that you have to evaluate many relationships and nodes that you’re not interested in.

However, I’ve never actually tested the performance differences between the approaches so I thought I’d try it out.

I created 4 different databases which had one node with 60,000 outgoing relationships – 10,000 which we wanted to retrieve and 50,000 that were irrelevant.

I modelled the ‘relationship’ in 4 different ways…

  • Using a specific relationship type
    (node)-[:HAS_ADDRESS]->(address)
  • Using a generic relationship type and then filtering by end node label
    (node)-[:HAS]->(address:Address)
  • Using a generic relationship type and then filtering by relationship property
    (node)-[:HAS {type: "address"}]->(address)
  • Using a generic relationship type and then filtering by end node property
    (node)-[:HAS]->(address {type: "address"})

…and then measured how long it took to retrieve the ‘has address’ relationships.
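To make the four variants a bit more concrete, here’s a rough sketch of what creating a single relationship looks like in each database (the Node label and the address property are just placeholders – the real data generation code is in the github project mentioned below):

// 1. Specific relationship type
CREATE (:Node)-[:HAS_ADDRESS]->(:Address {line: "1 Some Street"});
 
// 2. Generic relationship type, end node labelled
CREATE (:Node)-[:HAS]->(:Address {line: "1 Some Street"});
 
// 3. Generic relationship type, type stored on the relationship
CREATE (:Node)-[:HAS {type: "address"}]->({line: "1 Some Street"});
 
// 4. Generic relationship type, type stored on the end node
CREATE (:Node)-[:HAS]->({type: "address", line: "1 Some Street"});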

The code is on github if you want to take a look.

Although it’s obviously not as precise as a JMH micro benchmark I think it’s good enough to get a feel for the difference between the approaches.

I ran a query against each database 100 times and then took the 50th, 75th and 99th percentiles (times are in ms):

Using a generic relationship type and then filtering by end node label
50%ile: 6.0    75%ile: 6.0    99%ile: 402.60999999999825
 
Using a generic relationship type and then filtering by relationship property
50%ile: 21.0   75%ile: 22.0   99%ile: 504.85999999999785
 
Using a generic relationship type and then filtering by end node property
50%ile: 4.0    75%ile: 4.0    99%ile: 145.65999999999931
 
Using a specific relationship type
50%ile: 0.0    75%ile: 1.0    99%ile: 25.749999999999872

We can drill further into why there’s a difference in the times for each of the approaches by profiling the equivalent cypher query. We’ll start with the one which uses a specific relationship name:

Using a specific relationship type

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS_ADDRESS]->() return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +SimplePatternMatcher
      |
      +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+-----------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                 Other |
+----------------------+-------+--------+-----------------------------+-----------------------+
|         ColumnFilter |     1 |      0 |                             | keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                       |
| SimplePatternMatcher | 10000 |  10000 | n,   UNNAMED53,   UNNAMED35 |                       |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |          {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+-----------------------+
 
Total database accesses: 10001

Here we can see that there were 10,001 database accesses in order to get a count of our 10,000 HAS_ADDRESS relationships. We get a database access each time we load a node, relationship or property.

By contrast the other approaches have to load in a lot more data only to then filter it out:

Using a generic relationship type and then filtering by end node label

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->(:Address) return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+----------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                            Other |
+----------------------+-------+--------+-----------------------------+----------------------------------+
|         ColumnFilter |     1 |      0 |                             |            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                  |
|               Filter | 10000 |  10000 |                             | hasLabel(  UNNAMED45:Address(0)) |
| SimplePatternMatcher | 10000 |  60000 | n,   UNNAMED45,   UNNAMED35 |                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+----------------------------------+
 
Total database accesses: 70001

Using a generic relationship type and then filtering by relationship property

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS {type: "address"}]->() return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED35,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED63,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
 
Total database accesses: 140001

Using a generic relationship type and then filtering by end node property

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->({type: "address"}) return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED45,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED45,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
 
Total database accesses: 140001

So in summary…specific relationships #ftw!
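If you’ve already gone down the generic route and want to move to specific relationship types, a refactoring sketch (not something from this post’s benchmark) looks roughly like this – for a large graph you’d want to run it in batches rather than in one transaction:

// Replace generic HAS {type: "address"} relationships with specific HAS_ADDRESS ones
MATCH (n)-[r:HAS {type: "address"}]->(address)
CREATE (n)-[:HAS_ADDRESS]->(address)
DELETE r;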

Written by Mark Needham

September 30th, 2014 at 4:47 pm

Posted in neo4j


Neo4j: COLLECTing multiple values (Too many parameters for function ‘collect’)

without comments

One of my favourite functions in Neo4j’s cypher query language is COLLECT which allows us to group items into an array for later consumption.

However, I’ve noticed that people sometimes struggle to work out how to collect multiple items with COLLECT.

Consider the following data set:

create (p:Person {name: "Mark"})
create (e1:Event {name: "Event1", timestamp: 1234})
create (e2:Event {name: "Event2", timestamp: 4567})
 
create (p)-[:EVENT]->(e1)
create (p)-[:EVENT]->(e2)

If we wanted to return each person along with a collection of the event names they’d participated in we could write the following:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT(e.name);
+--------------------------------------------+
| p                    | COLLECT(e.name)     |
+--------------------------------------------+
| Node[0]{name:"Mark"} | ["Event1","Event2"] |
+--------------------------------------------+
1 row

That works nicely, but what about if we want to collect the event name and the timestamp but don’t want to return the entire event node?

An approach I’ve seen a few people try during workshops is the following:

MATCH (p:Person)-[:EVENT]->(e)
RETURN p, COLLECT(e.name, e.timestamp)

Unfortunately this doesn’t compile:

SyntaxException: Too many parameters for function 'collect' (line 2, column 11)
"RETURN p, COLLECT(e.name, e.timestamp)"
           ^

As the error message suggests, the COLLECT function only takes one argument so we need to find another way to solve our problem.

One way is to put the two values into a literal array which will result in an array of arrays as our return result:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT([e.name, e.timestamp]);
+----------------------------------------------------------+
| p                    | COLLECT([e.name, e.timestamp])    |
+----------------------------------------------------------+
| Node[0]{name:"Mark"} | [["Event1",1234],["Event2",4567]] |
+----------------------------------------------------------+
1 row

The annoying thing about this approach is that as you add more items you’ll forget which position you’ve put each bit of data in, so I think a preferable approach is to collect a map of items instead:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT({eventName: e.name, eventTimestamp: e.timestamp});
+--------------------------------------------------------------------------------------------------------------------------+
| p                    | COLLECT({eventName: e.name, eventTimestamp: e.timestamp})                                         |
+--------------------------------------------------------------------------------------------------------------------------+
| Node[0]{name:"Mark"} | [{eventName -> "Event1", eventTimestamp -> 1234},{eventName -> "Event2", eventTimestamp -> 4567}] |
+--------------------------------------------------------------------------------------------------------------------------+
1 row
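As an aside that wasn’t part of the workshop example, once we’ve collected the maps we can also unpack them again later in the same query with UNWIND (available from Neo4j 2.1) if a subsequent step needs individual rows – a quick sketch:

// Collect the events per person and then expand them back out into rows
MATCH (p:Person)-[:EVENT]->(e)
WITH p, COLLECT({eventName: e.name, eventTimestamp: e.timestamp}) AS events
UNWIND events AS event
RETURN p.name, event.eventName, event.eventTimestamp;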

During the Clojure Neo4j Hackathon that we ran earlier this week this proved to be a particularly pleasing approach as we could easily destructure the collection of maps in our Clojure code.

Written by Mark Needham

September 26th, 2014 at 8:46 pm

Posted in neo4j


Neo4j: LOAD CSV – Column is null

with 3 comments

One problem I’ve seen a few people have recently when using Neo4j’s LOAD CSV function is dealing with CSV files that have dodgy hidden characters at the beginning of the header line.

For example, consider an import of this CSV file:

$ cat ~/Downloads/dodgy.csv
userId,movieId
1,2

We might start by checking which columns it has:

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line;
+----------------------------------+
| line                             |
+----------------------------------+
| {userId -> "1", movieId -> "2"} |
+----------------------------------+
1 row

Looks good so far but what about if we try to return just ‘userId’?

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.userId;
+-------------+
| line.userId |
+-------------+
| <null>      |
+-------------+
1 row

Hmmm it’s null…what about ‘movieId’?

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.movieId;
+--------------+
| line.movieId |
+--------------+
| "2"          |
+--------------+
1 row

That works fine, so immediately we can suspect that there are hidden characters (often a UTF-8 byte order mark) at the beginning of the first line of the file.

The easiest way to check if this is the case is to open the file using a hex editor – I quite like Hex Fiend for the Mac.

If we look at dodgy.csv we’ll see the following:

[Screenshot: dodgy.csv opened in Hex Fiend, with the hidden bytes before the ‘userId’ header highlighted]

Let’s delete the highlighted characters and try our cypher query again:

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.userId;
+-------------+
| line.userId |
+-------------+
| "1"         |
+-------------+
1 row

All is well again, but something to keep in mind if you see a LOAD CSV near you behaving badly.
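If editing the file by hand isn’t convenient, another workaround (a sketch, not part of the original import) is to skip the header line altogether and refer to the columns by position instead – the hidden characters only live on that first line, so the data rows come through cleanly:

// Read the file without headers, skip the dodgy header row and name the columns ourselves
LOAD CSV FROM "file:/Users/markneedham/Downloads/dodgy.csv" AS line
WITH line SKIP 1
RETURN line[0] AS userId, line[1] AS movieId;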

Written by Mark Needham

September 24th, 2014 at 8:21 pm

Posted in neo4j
