Mark Needham

Thoughts on Software Development

Archive for the ‘Databases’ Category

Neo4j: Create dynamic relationship type

without comments

One of the things I’ve often found frustrating when importing data using Cypher, Neo4j’s query language, is that it’s quite difficult to create dynamic relationship types.

Say we have a CSV file structured like this:

load csv with headers from "file:///people.csv" AS row
RETURN row
╒═══════════════════════════════════════════════════════╕
│row                                                    │
╞═══════════════════════════════════════════════════════╡
│{node1: Mark, node2: Reshmee, relationship: MARRIED_TO}│
├───────────────────────────────────────────────────────┤
│{node1: Mark, node2: Alistair, relationship: FRIENDS}  │
└───────────────────────────────────────────────────────┘

We want to create nodes with the relationship type specified in the file. Unfortunately, in Cypher we can’t pass in relationship types so we have to resort to the FOREACH hack to create our relationships:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "MARRIED_TO" THEN [1] ELSE [] END |
 MERGE (p1)-[:MARRIED_TO]->(p2))
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "FRIENDS" THEN [1] ELSE [] END |
 MERGE (p1)-[:FRIENDS]->(p2))

This works, but:

  1. Looks horrendous
  2. Doesn’t scale particularly well when we have multiple relationship types to deal with

As in my last post the APOC library comes to the rescue again, this time in the form of the apoc.create.relationship procedure.

This procedure allows us to change our initial query to read like this:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.relationship, {}, p2) YIELD rel
RETURN rel

Much better!

Written by Mark Needham

October 30th, 2016 at 10:12 pm

Posted in neo4j

Tagged with

Neo4j: Dynamically add property/Set dynamic property

without comments

I’ve been playing around with a dataset which has the timetable for the national rail in the UK and they give you departure and arrival times of each train in a textual format.

For example, the node to represent a stop could be created like this:

CREATE (stop:Stop {arrival: "0802", departure: "0803H"})

That time format isn’t particular amenable to querying so I wanted to add another property which indicated the number of seconds since the start of the day.

So we want to add ‘arrivalSecondsSinceStartOfDay’ and ‘departureSecondsSinceStartOfDay’ properties to our node. I wrote the following query to calculate the values for those properties.

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
 
RETURN key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
╒═══════════════════════════════╤══════════════════════╕
│newKey                         │secondsSinceStartOfDay│
╞═══════════════════════════════╪══════════════════════╡
│arrivalSecondsSinceStartOfDay  │28920                 │
├───────────────────────────────┼──────────────────────┤
│departureSecondsSinceStartOfDay│29010                 │
└───────────────────────────────┴──────────────────────┘

Now we’re ready to set those properties on the ‘stop’ node.

MATCH (stop:Stop2)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
SET stop[newKey] = secondsSinceStartOfDay
Invalid input '[': expected an identifier character, whitespace, '{', node labels, a property map, a relationship pattern, '.', '(', '=' or "+=" (line 12, column 9 (offset: 447))
"SET stop[newKey] = secondsSinceStartOfDay"
         ^

Hmmm that didn’t work as expected! It doesn’t look like we can set dynamic properties using Cypher just yet.

Luckily my colleague Michael Hunger and the Neo4j community have been curating the APOC procedures library and it has just the procedure to help us out.

You’ll need to download the jar for your version of Neo4j and then place it in the plugins directory. I’m using Neo4j 3.1 Beta1 so this is what it looks like for me:

$ tree neo4j-enterprise-3.1.0-BETA1/plugins/
 
neo4j-enterprise-3.1.0-BETA1/plugins/
└── apoc-3.1.0.1-all.jar
 
0 directories, 1 file

After you’ve done that you’ll need to restart Neo4j so that it can pick up the new procedures that we’ve added. Once you’ve done that execute the following query to check they’ve installed correctly:

call dbms.procedures()
YIELD name 
WITH name 
WHERE name STARTS WITH "apoc"
RETURN COUNT(*)
╒════════╕
│COUNT(*)│
╞════════╡
│183     │
└────────┘

We’re now ready to dynamically set properties in the graph. The procedure that we’ll use is apoc.create.setProperty and it’s easy to update our query to use it:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)
Query cannot conclude with CALL (must be RETURN or an update clause) (line 12, column 1 (offset: 439))
"CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)"
 ^

Oops I spoke too soon! We need to yield the return column of the procedure and return it or just return a count to work around this:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay) 
YIELD node
RETURN COUNT(*)
╒════════╕
│COUNT(*)│
╞════════╡
│2       │
└────────┘

And that’s it, we can now dynamically set properties in our queries.

Written by Mark Needham

October 27th, 2016 at 5:29 am

Posted in neo4j

Tagged with

Neo4j: Detecting rogue spaces in CSV headers with LOAD CSV

without comments

Last week I was helping someone load the data from a CSV file into Neo4j and we were having trouble filtering out rows which contained a null value in one of the columns.

This is what the data looked like:

load csv with headers from "file:///foo.csv" as row
RETURN row
╒══════════════════════════════════╕
│row                               │
╞══════════════════════════════════╡
│{key1: a,  key2: (null),  key3: c}│
├──────────────────────────────────┤
│{key1: d,  key2: e,  key3: f}     │
└──────────────────────────────────┘

We’d like to filter out any rows which have ‘key2’ as null, so let’s tweak our query to do that:

load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
(no rows)

Hmmm that’s odd, it’s got rid of both rows. We’d expect to see the 2nd row since that doesn’t have a null value.

At this point we might suspect that what we’re seeing on the screen isn’t actually what the data looks like. Let’s write the following query to check our header values:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒═════╤═════════╕
│key  │SIZE(key)│
╞═════╪═════════╡
│key1 │4        │
├─────┼─────────┤
│ key2│5        │
├─────┼─────────┤
│ key3│5        │
└─────┴─────────┘

The second column tells us that there are some extra characters in the columns for ‘key2’ and ‘key3’ or rather ‘ key2’ and ‘ key3’. In this case they are spaces, but it could easily be another character:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, replace(key, " ", "_SPACE_") AS spaces
╒═════╤═══════════╕
│key  │spaces     │
╞═════╪═══════════╡
│key1 │key1       │
├─────┼───────────┤
│ key2│_SPACE_key2│
├─────┼───────────┤
│ key3│_SPACE_key3│
└─────┴───────────┘

If we clean up our CSV file and try again everything works as expected:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒════╤═════════╕
│key │SIZE(key)│
╞════╪═════════╡
│key1│4        │
├────┼─────────┤
│key2│4        │
├────┼─────────┤
│key3│4        │
└────┴─────────┘
load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
╒═══════════════════════════╕
│row                        │
╞═══════════════════════════╡
│{key1: d, key2: e, key3: f}│
└───────────────────────────┘

Written by Mark Needham

October 19th, 2016 at 5:16 am

Posted in neo4j

Tagged with

Neo4j: requirement failed

without comments

Last week during a hands on Cypher meetup, using Neo4j’s built in movie dataset, one of the attendees showed me the following query which wasn’t working as expected:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY COUNT(movies) DESC
LIMIT 10
requirement failed

We can get a full stack trace in logs/debug.log if we run the same query from the cypher-shell, which was introduced during one fo the Neo4j 3.1 milestone releases:

2016-10-03 23:25:07.529+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [UnknownError]: requirement failed, reference to. requirement failed
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:212)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.steps.sortSkipAndLimit$.apply(sortSkipAndLimit.scala:38)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:43)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:31)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:46)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:47)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:30)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.planQueries(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.plan(QueryPlanner.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.produceLogicalPlan(CostBasedExecutablePlanBuilder.scala:95)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.using(package.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.closing(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.producePlan(CostBasedExecutablePlanBuilder.scala:70)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.procs.DelegatingProcedureExecutablePlanBuilder.producePlan(DelegatingProcedureExecutablePlanBuilder.scala:99)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.FallbackBuilder$class.producePlan(SilentFallbackPlanBuilder.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.SilentFallbackPlanBuilder.producePlan(SilentFallbackPlanBuilder.scala:56)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.ExecutionPlanBuilder.build(ExecutionPlanBuilder.scala:106)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.QueryCache$$anonfun$getOrElseUpdate$1$$anonfun$apply$1.apply(CacheAccessor.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.MonitoringCacheAccessor$$anonfun$1.apply(CacheAccessor.scala:57)
        at org.neo4j.cypher.internal.compiler.v3_1.LFUCache$$anon$1.apply(LFUCache.scala:31)

The problem line seems to be this one:

require(sortItems.forall(_.expression.isInstanceOf[Variable]))

The solution is actually quite simple. We should be using the SIZE function to calculate the size of a list rather then the COUNT function:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY SIZE(movies) DESC
LIMIT 10

Written by Mark Needham

October 4th, 2016 at 10:33 pm

Posted in neo4j

Tagged with

Neo4j: Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead)

without comments

A couple of days I was trying to write a Cypher query to filter the labels in my database.

I started with the following procedure call to get the list of all the labels:

CALL db.labels
╒══════════╕
│label     │
╞══════════╡
│Airport   │
├──────────┤
│Flight    │
├──────────┤
│Airline   │
├──────────┤
│Movie     │
├──────────┤
│AirportDay│
├──────────┤
│Person    │
├──────────┤
│Engineer  │
└──────────┘

I was only interested in labels that contained the letter ‘a’ so I tweaked the query to filter the output of the procedure:

CALL db.labels
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label

Unfortunately that didn’t work as I expected:

Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead) (line 1, column 9 (offset: 8))
"CALL db.labels"
         ^

The mistake I made was calling the procedure implicitly without using parentheses. If you want to do any post processing on the output of a procedure you need to call it explicitly otherwise Cypher gets very confused.

If we add back the parentheses it’s much happier:

CALL db.labels()
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label
╒══════════╕
│label     │
╞══════════╡
│Airport   │
├──────────┤
│Airline   │
├──────────┤
│AirportDay│
└──────────┘

It stumped me for a while until I figured out what the error message meant! I think I’ll just use explicit parentheses all the time from now on to save me running into this one again.

Written by Mark Needham

October 2nd, 2016 at 10:13 am

Posted in neo4j

Tagged with ,

Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

without comments

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:

:play http://guides.neo4j.com/got

Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based character appearances. We’ll be using scitkit-learn‘s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

Episode 1 = [1, 1, 0]
Episode 2 = [0, 1, 1]

We could then calculate the cosine similarity between these two episodes like this:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> one = [1,1,0]
>>> two = [0,1,1]
 
>>> cosine_similarity([one, two])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t even mentioned at all, they are implicitly a position in the array. This means that when we use our real dataset we need to ensure that the characters are in the same order for each episode, otherwise the calculation will be meaningless!

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

from neo4j.v1 import GraphDatabase, basic_auth
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo"))
session = driver.session()
 
rows = session.run("""
    MATCH (c:Character), (e:Episode)
    OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e)
    RETURN e, c, appearance
    ORDER BY e.id, c.id""")

We can iterate through the rows to see what the output looks like:

>>> for row in rows:
        print row
 
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

episodes = {}
for row in rows:
    if episodes.get(row["e"]["id"]) is None:
        if row["appearance"] is None:
            episodes[row["e"]["id"]] = [0]
        else:
            episodes[row["e"]["id"]] = [1]
    else:
        if row["appearance"] is None:
            episodes[row["e"]["id"]].append(0)
        else:
            episodes[row["e"]["id"]].append(1)

Here’s an example of one entry in the matrix:

>>> len(episodes)
60
 
>>> len(episodes[1])
638
 
>>> episodes[1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

>>> all = episodes.values()
 
>>> cosine_similarity(all[0:1], all)[0]
array([ 1.        ,  0.69637306,  0.48196269,  0.54671752,  0.48196269,
        0.44733753,  0.31707317,  0.42340087,  0.34989921,  0.43314808,
        0.36597766,  0.18421252,  0.30961158,  0.2328101 ,  0.30616181,
        0.41905818,  0.36842504,  0.35338088,  0.18376917,  0.3569686 ,
        0.2328101 ,  0.34539847,  0.25043516,  0.31707317,  0.25329221,
        0.33342786,  0.34921515,  0.2174909 ,  0.2533473 ,  0.28429311,
        0.23026565,  0.22310537,  0.22365301,  0.23816275,  0.28242289,
        0.16070148,  0.24847093,  0.21434648,  0.03582872,  0.21189672,
        0.15460414,  0.17161693,  0.15460414,  0.17494961,  0.1234662 ,
        0.21426863,  0.21434648,  0.18748505,  0.15308091,  0.20161946,
        0.19877675,  0.30920827,  0.21058466,  0.19127301,  0.24607943,
        0.18033393,  0.17734311,  0.16296707,  0.18740851,  0.23995201])

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
0 1.0
1 0.696373059207
3 0.546717521051
2 0.481962692712
4 0.481962692712

Or we can see how similar the last episode of season 6 is compared to the others:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
59 1.0
52 0.500670191678
46 0.449085146211
43 0.448218732478
49 0.446296233312

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following query creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

>>> episode_mapping = {}
>>> for idx, episode_id in enumerate(episodes):
        episode_mapping[idx] = episode_id
 
>>> for idx, episode_id in enumerate(episodes):
        similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0]
        for other_idx, similarity_score in enumerate(similarity_matrix):
            other_episode_id = episode_mapping[other_idx]
            print episode_id, other_episode_id, similarity_score
            if episode_id != other_episode_id:
                session.run("""
                    MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}})
                    MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2)
                    ON CREATE SET similarity.score = {similarityScore}
                    """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score})
 
    session.close()

The episode_mapping dictionary is needed to map from episode ids to indices e.g. episode 1 is at index 0.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score DESC
LIMIT 10
 
╒═════╤════╤══════════════════╕
│ep1  │ep2 │score             │
╞═════╪════╪══════════════════╡
│S1E2 │S1E1│0.6963730592072543│
├─────┼────┼──────────────────┤
│S1E4 │S1E3│0.6914173051223086│
├─────┼────┼──────────────────┤
│S1E9 │S1E8│0.6869464497590777│
├─────┼────┼──────────────────┤
│S2E10│S2E8│0.6869037302955034│
├─────┼────┼──────────────────┤
│S3E7 │S3E6│0.6819943394704735│
├─────┼────┼──────────────────┤
│S2E7 │S2E6│0.6813598225089799│
├─────┼────┼──────────────────┤
│S1E10│S1E9│0.6796436827080401│
├─────┼────┼──────────────────┤
│S1E5 │S1E4│0.6698105143372364│
├─────┼────┼──────────────────┤
│S1E10│S1E8│0.6624062584864754│
├─────┼────┼──────────────────┤
│S4E5 │S4E4│0.6518358737330705│
└─────┴────┴──────────────────┘

And the least popular?

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score
LIMIT 10
 
╒════╤════╤═══════════════════╕
│ep1 │ep2 │score              │
╞════╪════╪═══════════════════╡
│S4E9│S1E5│0                  │
├────┼────┼───────────────────┤
│S4E9│S1E6│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E2│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E4│0                  │
├────┼────┼───────────────────┤
│S5E6│S4E9│0                  │
├────┼────┼───────────────────┤
│S6E8│S4E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E6│0                  │
├────┼────┼───────────────────┤
│S3E9│S2E9│0.03181423814878889│
├────┼────┼───────────────────┤
│S4E9│S1E1│0.03582871819500093│
└────┴────┴───────────────────┘

The output of this query suggests that there are no common characters between 8 pairs of episodes which at first glance sounds surprising. Let’s write a query to check that finding:

MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode)
WHERE episode1.season = 4 AND episode1.number = 9 AND episode2.season = 1 AND episode2.number = 5
return episode1, episode2
 
(no changes, no rows)

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up on 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

Written by Mark Needham

August 22nd, 2016 at 9:12 pm

Neo4j: Cypher – Detecting duplicates using relationships

without comments

I’ve been building a graph of computer science papers on and off for a couple of months and now that I’ve got a few thousand loaded in I realised that there are quite a few duplicates.

They’re not duplicates in the sense that there are multiple entries with the same identifier but rather have different identifiers but seem to be the same paper!

e.g. there are a couple of papers titled ‘Authentication in the Taos operating system’:

http://dl.acm.org/citation.cfm?id=174614

2016 07 20 11 43 00

http://dl.acm.org/citation.cfm?id=168640

2016 07 20 11 43 38

This is the same paper published in two different journals as far as I can tell.

Now in this case it’s quite easy to just do a string similarity comparison of the titles of these papers and realise that they’re identical. I’ve previously use the excellent dedupe library to do this and there’s also an excellent talk from Berlin Buzzwords 2014 where the author uses locality-sensitive hashing to achieve a similar outcome.

However, I was curious whether I could use any of the relationships these papers have to detect duplicates rather than just relying on string matching.

This is what the graph looks like:

Graph  8

We’ll start by writing a query to see how many common references the different Taos papers have:

MATCH (r:Resource {id: "168640"})-[:REFERENCES]->(other)
WITH r, COLLECT(other) as myReferences
 
UNWIND myReferences AS reference
OPTIONAL MATCH path = (other)-[:REFERENCES]->(reference)
WITH other, COUNT(path) AS otherReferences, SIZE(myReferences) AS myReferences
WITH other, 1.0 * otherReferences / myReferences AS similarity WHERE similarity > 0.5
 
RETURN other.id, other.title, similarity
ORDER BY similarity DESC
LIMIT 10
╒════════╤═══════════════════════════════════════════╤══════════╕
│other.id│other.title                                │similarity│
╞════════╪═══════════════════════════════════════════╪══════════╡
│168640  │Authentication in the Taos operating system│1         │
├────────┼───────────────────────────────────────────┼──────────┤
│174614  │Authentication in the Taos operating system│1         │
└────────┴───────────────────────────────────────────┴──────────┘

This query:

  • picks one of the Taos papers and finds its references
  • finds other papers which reference those same papers
  • calculates a similarity score based on how many common references they have
  • returns papers that have more than 50% of the same references with the most similar ones at the top

I tried it with other papers to see how it fared:

Performance of Firefly RPC

╒════════╤════════════════════════════════════════════════════════════════╤══════════════════╕
│other.id│other.title                                                     │similarity        │
╞════════╪════════════════════════════════════════════════════════════════╪══════════════════╡
│74859   │Performance of Firefly RPC                                      │1                 │
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│77653   │Performance of the Firefly RPC                                  │0.8333333333333334│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│110815  │The X-Kernel: An Architecture for Implementing Network Protocols│0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│96281   │Experiences with the Amoeba distributed operating system        │0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│74861   │Lightweight remote procedure call                               │0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│106985  │The interaction of architecture and operating system design     │0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│77650   │Lightweight remote procedure call                               │0.6666666666666666│
└────────┴────────────────────────────────────────────────────────────────┴──────────────────┘

Authentication in distributed systems: theory and practice

╒════════╤══════════════════════════════════════════════════════════╤══════════════════╕
│other.id│other.title                                               │similarity        │
╞════════╪══════════════════════════════════════════════════════════╪══════════════════╡
│121160  │Authentication in distributed systems: theory and practice│1                 │
├────────┼──────────────────────────────────────────────────────────┼──────────────────┤
│138874  │Authentication in distributed systems: theory and practice│0.9090909090909091│
└────────┴──────────────────────────────────────────────────────────┴──────────────────┘

Sadly it’s not as simple as finding 100% matches on references! I expect the later revisions of a paper added more content and therefore additional references.

What if we look for author similarity as well?

MATCH (r:Resource {id: "121160"})-[:REFERENCES]->(other)
WITH r, COLLECT(other) as myReferences
 
UNWIND myReferences AS reference
OPTIONAL MATCH path = (other)-[:REFERENCES]->(reference)
WITH r, other, authorSimilarity,  COUNT(path) AS otherReferences, SIZE(myReferences) AS myReferences
WITH r, other, authorSimilarity,  1.0 * otherReferences / myReferences AS referenceSimilarity
WHERE referenceSimilarity > 0.5
 
MATCH (r)<-[:AUTHORED]-(author)
WITH r, myReferences, COLLECT(author) AS myAuthors
 
UNWIND myAuthors AS author
OPTIONAL MATCH path = (other)<-[:AUTHORED]-(author)
WITH other, myReferences, COUNT(path) AS otherAuthors, SIZE(myAuthors) AS myAuthors
WITH other, myReferences, 1.0 * otherAuthors / myAuthors AS authorSimilarity
WHERE authorSimilarity > 0.5
 
 
 
RETURN other.id, other.title, referenceSimilarity, authorSimilarity
ORDER BY (referenceSimilarity + authorSimilarity) DESC
LIMIT 10
╒════════╤══════════════════════════════════════════════════════════╤═══════════════════╤════════════════╕
│other.id│other.title                                               │referenceSimilarity│authorSimilarity│
╞════════╪══════════════════════════════════════════════════════════╪═══════════════════╪════════════════╡
│121160  │Authentication in distributed systems: theory and practice│1                  │1               │
├────────┼──────────────────────────────────────────────────────────┼───────────────────┼────────────────┤
│138874  │Authentication in distributed systems: theory and practice│0.9090909090909091 │1               │
└────────┴──────────────────────────────────────────────────────────┴───────────────────┴────────────────┘
╒════════╤══════════════════════════════╤═══════════════════╤════════════════╕
│other.id│other.title                   │referenceSimilarity│authorSimilarity│
╞════════╪══════════════════════════════╪═══════════════════╪════════════════╡
│74859   │Performance of Firefly RPC    │1                  │1               │
├────────┼──────────────────────────────┼───────────────────┼────────────────┤
│77653   │Performance of the Firefly RPC│0.8333333333333334 │1               │
└────────┴──────────────────────────────┴───────────────────┴────────────────┘

I’m sure I could find some other papers where neither of these similarities worked well but it’s an interesting start.

I think the next step is to build up a training set of pairs of documents that are and aren’t similar to each other. We could then train a classifier to determine whether two documents are identical.

But that’s for another day!

Written by Mark Needham

July 20th, 2016 at 5:32 pm

Posted in neo4j

Tagged with ,

Neo4j 3.0 Drivers – Failed to save the server ID and the certificate received from the server

without comments

I’ve been using the Neo4j Java Driver on various local databases over the past week and ran into the following certificate problem a few times:

org.neo4j.driver.v1.exceptions.ClientException: Unable to process request: General SSLEngine problem
	at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:88)
	at org.neo4j.driver.internal.connector.socket.SocketConnection.<init>(SocketConnection.java:63)
	at org.neo4j.driver.internal.connector.socket.SocketConnector.connect(SocketConnector.java:52)
	at org.neo4j.driver.internal.pool.InternalConnectionPool.acquire(InternalConnectionPool.java:113)
	at org.neo4j.driver.internal.InternalDriver.session(InternalDriver.java:53)
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1431)
	at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
	at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
	at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
	at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.wrap(TLSSocketChannel.java:270)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runHandshake(TLSSocketChannel.java:131)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:95)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:77)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:70)
	at org.neo4j.driver.internal.connector.socket.SocketClient$ChannelFactory.create(SocketClient.java:251)
	at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:75)
	... 14 more
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1497)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:212)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runDelegatedTasks(TLSSocketChannel.java:142)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:203)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runHandshake(TLSSocketChannel.java:127)
	... 19 more
Caused by: java.security.cert.CertificateException: Unable to connect to neo4j at `localhost:10003`, because the certificate the server uses has changed. This is a security feature to protect against man-in-the-middle attacks.
If you trust the certificate the server uses now, simply remove the line that starts with `localhost:10003` in the file `/Users/markneedham/.neo4j/known_hosts`.
The old certificate saved in file is:
-----BEGIN CERTIFICATE-----
7770ee598be69c8537b0e576e62442c84400008ca0d3e3565b379b7cce9a51de
0fd4396251df2e8da50eb1628d44dcbca3fae5c8fb9c0adc29396839c25eb0c8
 
-----END CERTIFICATE-----
The New certificate received is:
-----BEGIN CERTIFICATE-----
01a422739a39625ee95a0547fa99c7e43fbb33c70ff720e5ae4a8408421aa63b
2fe4f5d6094c5fd770ed1ad214dbdc428a6811d0955ed80d48cc67d84067df2c
 
-----END CERTIFICATE-----
 
	at org.neo4j.driver.internal.connector.socket.TrustOnFirstUseTrustManager.checkServerTrusted(TrustOnFirstUseTrustManager.java:153)
	at sun.security.ssl.AbstractTrustManagerWrapper.checkServerTrusted(SSLContextImpl.java:936)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1484)
	... 28 more

I got a bit lazy and just nuked the file it mentions in the error message – /Users/markneedham/.neo4j/known_hosts – which led to this error the next time I call the driver in my application:

Failed to save the server ID and the certificate received from the server to file /Users/markneedham/.neo4j/known_hosts.
Server ID: localhost:10003
Received cert:
-----BEGIN CERTIFICATE-----
933c7ec5d6a1b876bd186dc6d05b04478ae771262f07d26a4d7d2e6b7f71054c
3e6b7c172474493b7fe93170d940b9cc3544661c7966632361649f2fda7c66be
 
-----END CERTIFICATE-----

I recreated the file with no content and tried again and it worked fine. Alternatively we can choose to turn off encryption when working with local databases and avoid the issue:

Config config = Config.build().withEncryptionLevel( Config.EncryptionLevel.NONE ).toConfig();
 
try ( Driver driver = GraphDatabase.driver( "bolt://localhost:7687", config );
      Session session = driver.session() )
{
   // use the driver
}

Written by Mark Needham

July 11th, 2016 at 5:21 am

Posted in neo4j

Tagged with

Neo4j vs Relational: Refactoring – Extracting node/table

without comments

In my previous blog post I showed how to add a new property/field to a node with a label/record in a table for a football transfers dataset that I’ve been playing with.

After introducing this ‘nationality’ property I realised that I now had some duplication in the model:

2016 05 22 10 15 15

players.nationality and clubs.country are referring to the same countries but they’ve both got them stored as strings so we can’t ensure the integrity of our countries and ensure that we’re referring to the same country.

We have the same issue in the graph model as well:

2016 05 22 10 40 40

This time Player.nationality and Club.country refer to the same countries.

We can solve our problem by introducing a countries table in the relational model and a set of nodes with a ‘Country’ label in the graph model. Let’s start with relational.

This is the model we’re driving towards:

2016 05 22 10 50 43

The first thing we need to do is create a countries table and populate it:

CREATE TABLE countries (
    "code" CHARACTER VARYING(3) NOT NULL PRIMARY KEY,
    "name" CHARACTER VARYING(50) NOT NULL
);
INSERT INTO countries VALUES('MNE', 'Montenegro');
INSERT INTO countries VALUES('SWZ', 'Swaziland');
...

Next let’s update the clubs table to reference the countries table:

ALTER TABLE clubs
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);

And let’s run a query to populate that column:

UPDATE clubs AS cl
SET country_id = c.code
FROM clubs
INNER JOIN countries AS c 
ON c.name = clubs.country
WHERE cl.id = clubs.id;

This query iterates over all the clubs, queries the country table to find the country id for that row and then stores it in the ‘country_id’ field. Finally we can remove the ‘country’ field:

ALTER TABLE clubs
DROP COLUMN country;

Now we do the same drill for the players table:

ALTER TABLE players
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);
UPDATE players AS p
SET country_id = c.code
FROM players
INNER JOIN countries AS c 
ON c.name = players.nationality
WHERE p.id = players.id;
ALTER TABLE players
DROP COLUMN nationality;

Now it’s time for the graph. This is the model we want to get to:

2016 05 22 10 51 49

First we’ll create the countries:

CREATE CONSTRAINT ON (c:Country)
ASSERT c.id IS UNIQUE
LOAD CSV WITH HEADERS FROM "file:///countries.csv"
AS row
MERGE (country:Country {id: row.countryCode})
ON CREATE SET country.name = row.country

And now let’s get clubs and players to point at those countries nodes and get rid of their respective nationality/country properties:

MATCH (club:Club)
MATCH (country:Country {name: club.country})
MERGE (club)-[:PART_OF]->(country)
REMOVE club.country
MATCH (player:Player)
MATCH (country:Country {name: player.nationality})
MERGE (player)-[:PLAYS_FOR]->(country)
REMOVE player.nationality

And that’s it, we can now write queries against our new model.

Written by Mark Needham

May 22nd, 2016 at 9:58 am

Posted in neo4j

Tagged with

Neo4j vs Relational: Refactoring – Add a new field/property

without comments

A couple of months ago I presented a webinar comparing how you’d model and evolve a data model using a Postgres SQL database and Neo4j.

This is what the two data models looked like after the initial data import and before any refactoring/migration had been done:

Relational

2016 05 22 09 49 23

Graph

2016 05 22 09 52 16

I wanted to add a ‘nationality’ property to the players table in the SQL schema and to the nodes with the ‘Player’ label in the graph.

This refactoring is quite easy in both models. In the relational database we first run a query to add the ‘nationality’ field to the table:

ALTER TABLE players 
ADD COLUMN nationality VARYING(30);

And then we need to generate UPDATE statements from our data dump to update all the existing records:

UPDATE players 
SET nationality = 'Brazil'
WHERE players.id = '/aldair/profil/spieler/4151';
 
...

In the graph we can do this in a single step by processing our data dump using the LOAD CSV command and then setting a property on each player:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///transfers.csv" AS row
MATCH (player:Player {id: row.playerUri})
SET player.nationality = row.playerNationality

If we wanted to make the nationality field non nullable we could go back and run the following queries:

ALTER TABLE players 
ALTER COLUMN nationality SET NOT NULL
CREATE CONSTRAINT ON (player:Player) 
ASSERT exists(player.nationality)

And we’re done!

Written by Mark Needham

May 22nd, 2016 at 9:09 am

Posted in neo4j

Tagged with