Mark Needham

Thoughts on Software Development

Archive for the ‘Databases’ Category

Neo4j: Find the intermediate point between two lat/longs

without comments

Yesterday I wrote a blog post showing how to find the midpoint between two lat/longs using Cypher which worked well as a first attempt at filling in missing locations, but I realised I could do better.

As I mentioned in the last post, when I find a stop that’s missing lat/long coordinates I can usually find two nearby stops that allow me to triangulate this stop’s location.

I also have train routes which indicate the number of seconds it takes to go from one stop to another, which allows me to indicate whether the location-less stop is closer to one stop than the other.

For example, consider stops a, b, and c where b doesn’t have a location. If we have these distances between the stops:

(a)-[:NEXT {time: 60}]->(b)-[:NEXT {time: 240}]->(c)

it tells us that point ‘b’ is actually 0.2 of the distance from ‘a’ to ‘c’ rather than being the midpoint.

There’s a formula we can use to work out that point:

a = sin((1−f)⋅δ) / sin δ
b = sin(f⋅δ) / sin δ
x = a ⋅ cos φ1 ⋅ cos λ1 + b ⋅ cos φ2 ⋅ cos λ2
y = a ⋅ cos φ1 ⋅ sin λ1 + b ⋅ cos φ2 ⋅ sin λ2
z = a ⋅ sin φ1 + b ⋅ sin φ2
φi = atan2(z, √x² + y²)
λi = atan2(y, x)
 
δ is the angular distance d/R between the two points.
φ = latitude
λ = longitude

Translated to Cypher (with mandatory Greek symbols) it reads like this to find the point 0.2 of the way from one point to another

with {latitude: 51.4931963543, longitude: -0.0475185810} AS p1, 
     {latitude: 51.47908, longitude: -0.05393950 } AS p2
 
WITH p1, p2, distance(point(p1), point(p2)) / 6371000 AS δ, 0.2 AS f
WITH p1, p2, δ, 
     sin((1-f) * δ) / sin(δ) AS a,
     sin(f * δ) / sin(δ) AS b
WITH radians(p1.latitude) AS φ1, radians(p1.longitude) AS λ1,
     radians(p2.latitude) AS φ2, radians(p2.longitude) AS λ2,
     a, b
WITH a * cos(φ1) * cos(λ1) + b * cos(φ2) * cos(λ2) AS x,
     a * cos(φ1) * sin(λ1) + b * cos(φ2) * sin(λ2) AS y,
     a * sin(φ1) + b * sin(φ2) AS z
RETURN degrees(atan2(z, sqrt(x^2 + y^2))) AS φi,
       degrees(atan2(y,x)) AS λi
╒═════════════════╤════════════════════╕
│φi               │λi                  │
╞═════════════════╪════════════════════╡
│51.49037311149128│-0.04880308288561931│
└─────────────────┴────────────────────┘

A quick sanity check plugging in 0.5 instead of 0.2 finds the midpoint which I was able to sanity check against yesterday’s post:

╒═════════════════╤═════════════════════╕
│φi               │λi                   │
╞═════════════════╪═════════════════════╡
│51.48613822097523│-0.050729537454086385│
└─────────────────┴─────────────────────┘

That’s all for now!

Written by Mark Needham

November 1st, 2016 at 10:10 pm

Posted in neo4j

Tagged with

Neo4j: Find the midpoint between two lat/longs

without comments

2016 10 31 06 06 00

Over the last couple of weekends I’ve been playing around with some transport data and I wanted to run the A* algorithm to find the quickest route between two stations.

The A* algorithm takes an estimateEvaluator as one of its parameters and the evaluator looks at lat/longs of nodes to work out whether a path is worth following or not. I therefore needed to add lat/longs for each station and I found it surprisingly hard to find this location date for all the points in the dataset.

Luckily I tend to have the lat/longs for two points either side of a station so I can work out the midpoint as an approximation for the missing one.

I found an article which defines a formula we can use to do this and there’s a StackOverflow post which has some Java code that implements the formula.

I wanted to find the midpoint between Surrey Quays station (51.4931963543,-0.0475185810) and a point further south on the train line (51.47908,-0.05393950). I wrote the following Cypher query to calculate this point:

WITH 51.4931963543 AS lat1, -0.0475185810 AS lon1, 
     51.47908 AS lat2 , -0.05393950 AS lon2
 
WITH radians(lat1) AS rlat1, radians(lon1) AS rlon1, 
     radians(lat2) AS rlat2, radians(lon2) AS rlon2, 
     radians(lon2 - lon1) AS dLon
 
WITH rlat1, rlon1, rlat2, rlon2, 
     cos(rlat2) * cos(dLon) AS Bx, 
     cos(rlat2) * sin(dLon) AS By
 
WITH atan2(sin(rlat1) + sin(rlat2), 
           sqrt( (cos(rlat1) + Bx) * (cos(rlat1) + Bx) + By * By )) AS lat3,
     rlon1 + atan2(By, cos(rlat1) + Bx) AS lon3
 
RETURN degrees(lat3) AS midLat, degrees(lon3) AS midLon
╒═════════════════╤═════════════════════╕
│midLat           │midLon               │
╞═════════════════╪═════════════════════╡
│51.48613822097523│-0.050729537454086385│
└─────────────────┴─────────────────────┘

The Google Maps screenshot on the right hand side shows the initial points at the top and bottom and the midpoint in between. It’s not perfect; ideally I’d like the midpoint to be on the track, but I think it’s good enough for the purposes of the algorithm.

Now I need to go and fill in the lat/longs for my location-less stations!

Written by Mark Needham

October 31st, 2016 at 7:31 pm

Posted in neo4j

Tagged with ,

Neo4j: Create dynamic relationship type

without comments

One of the things I’ve often found frustrating when importing data using Cypher, Neo4j’s query language, is that it’s quite difficult to create dynamic relationship types.

Say we have a CSV file structured like this:

load csv with headers from "file:///people.csv" AS row
RETURN row
╒═══════════════════════════════════════════════════════╕
│row                                                    │
╞═══════════════════════════════════════════════════════╡
│{node1: Mark, node2: Reshmee, relationship: MARRIED_TO}│
├───────────────────────────────────────────────────────┤
│{node1: Mark, node2: Alistair, relationship: FRIENDS}  │
└───────────────────────────────────────────────────────┘

We want to create nodes with the relationship type specified in the file. Unfortunately, in Cypher we can’t pass in relationship types so we have to resort to the FOREACH hack to create our relationships:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "MARRIED_TO" THEN [1] ELSE [] END |
 MERGE (p1)-[:MARRIED_TO]->(p2))
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "FRIENDS" THEN [1] ELSE [] END |
 MERGE (p1)-[:FRIENDS]->(p2))

This works, but:

  1. Looks horrendous
  2. Doesn’t scale particularly well when we have multiple relationship types to deal with

As in my last post the APOC library comes to the rescue again, this time in the form of the apoc.create.relationship procedure.

This procedure allows us to change our initial query to read like this:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.relationship, {}, p2) YIELD rel
RETURN rel

Much better!

Written by Mark Needham

October 30th, 2016 at 10:12 pm

Posted in neo4j

Tagged with

Neo4j: Dynamically add property/Set dynamic property

without comments

I’ve been playing around with a dataset which has the timetable for the national rail in the UK and they give you departure and arrival times of each train in a textual format.

For example, the node to represent a stop could be created like this:

CREATE (stop:Stop {arrival: "0802", departure: "0803H"})

That time format isn’t particular amenable to querying so I wanted to add another property which indicated the number of seconds since the start of the day.

So we want to add ‘arrivalSecondsSinceStartOfDay’ and ‘departureSecondsSinceStartOfDay’ properties to our node. I wrote the following query to calculate the values for those properties.

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
 
RETURN key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
╒═══════════════════════════════╤══════════════════════╕
│newKey                         │secondsSinceStartOfDay│
╞═══════════════════════════════╪══════════════════════╡
│arrivalSecondsSinceStartOfDay  │28920                 │
├───────────────────────────────┼──────────────────────┤
│departureSecondsSinceStartOfDay│29010                 │
└───────────────────────────────┴──────────────────────┘

Now we’re ready to set those properties on the ‘stop’ node.

MATCH (stop:Stop2)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
SET stop[newKey] = secondsSinceStartOfDay
Invalid input '[': expected an identifier character, whitespace, '{', node labels, a property map, a relationship pattern, '.', '(', '=' or "+=" (line 12, column 9 (offset: 447))
"SET stop[newKey] = secondsSinceStartOfDay"
         ^

Hmmm that didn’t work as expected! It doesn’t look like we can set dynamic properties using Cypher just yet.

Luckily my colleague Michael Hunger and the Neo4j community have been curating the APOC procedures library and it has just the procedure to help us out.

You’ll need to download the jar for your version of Neo4j and then place it in the plugins directory. I’m using Neo4j 3.1 Beta1 so this is what it looks like for me:

$ tree neo4j-enterprise-3.1.0-BETA1/plugins/
 
neo4j-enterprise-3.1.0-BETA1/plugins/
└── apoc-3.1.0.1-all.jar
 
0 directories, 1 file

After you’ve done that you’ll need to restart Neo4j so that it can pick up the new procedures that we’ve added. Once you’ve done that execute the following query to check they’ve installed correctly:

call dbms.procedures()
YIELD name 
WITH name 
WHERE name STARTS WITH "apoc"
RETURN COUNT(*)
╒════════╕
│COUNT(*)│
╞════════╡
│183     │
└────────┘

We’re now ready to dynamically set properties in the graph. The procedure that we’ll use is apoc.create.setProperty and it’s easy to update our query to use it:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)
Query cannot conclude with CALL (must be RETURN or an update clause) (line 12, column 1 (offset: 439))
"CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)"
 ^

Oops I spoke too soon! We need to yield the return column of the procedure and return it or just return a count to work around this:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay) 
YIELD node
RETURN COUNT(*)
╒════════╕
│COUNT(*)│
╞════════╡
│2       │
└────────┘

And that’s it, we can now dynamically set properties in our queries.

Written by Mark Needham

October 27th, 2016 at 5:29 am

Posted in neo4j

Tagged with

Neo4j: Detecting rogue spaces in CSV headers with LOAD CSV

without comments

Last week I was helping someone load the data from a CSV file into Neo4j and we were having trouble filtering out rows which contained a null value in one of the columns.

This is what the data looked like:

load csv with headers from "file:///foo.csv" as row
RETURN row
╒══════════════════════════════════╕
│row                               │
╞══════════════════════════════════╡
│{key1: a,  key2: (null),  key3: c}│
├──────────────────────────────────┤
│{key1: d,  key2: e,  key3: f}     │
└──────────────────────────────────┘

We’d like to filter out any rows which have ‘key2’ as null, so let’s tweak our query to do that:

load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
(no rows)

Hmmm that’s odd, it’s got rid of both rows. We’d expect to see the 2nd row since that doesn’t have a null value.

At this point we might suspect that what we’re seeing on the screen isn’t actually what the data looks like. Let’s write the following query to check our header values:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒═════╤═════════╕
│key  │SIZE(key)│
╞═════╪═════════╡
│key1 │4        │
├─────┼─────────┤
│ key2│5        │
├─────┼─────────┤
│ key3│5        │
└─────┴─────────┘

The second column tells us that there are some extra characters in the columns for ‘key2’ and ‘key3’ or rather ‘ key2’ and ‘ key3’. In this case they are spaces, but it could easily be another character:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, replace(key, " ", "_SPACE_") AS spaces
╒═════╤═══════════╕
│key  │spaces     │
╞═════╪═══════════╡
│key1 │key1       │
├─────┼───────────┤
│ key2│_SPACE_key2│
├─────┼───────────┤
│ key3│_SPACE_key3│
└─────┴───────────┘

If we clean up our CSV file and try again everything works as expected:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒════╤═════════╕
│key │SIZE(key)│
╞════╪═════════╡
│key1│4        │
├────┼─────────┤
│key2│4        │
├────┼─────────┤
│key3│4        │
└────┴─────────┘
load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
╒═══════════════════════════╕
│row                        │
╞═══════════════════════════╡
│{key1: d, key2: e, key3: f}│
└───────────────────────────┘

Written by Mark Needham

October 19th, 2016 at 5:16 am

Posted in neo4j

Tagged with

Neo4j: requirement failed

without comments

Last week during a hands on Cypher meetup, using Neo4j’s built in movie dataset, one of the attendees showed me the following query which wasn’t working as expected:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY COUNT(movies) DESC
LIMIT 10
requirement failed

We can get a full stack trace in logs/debug.log if we run the same query from the cypher-shell, which was introduced during one fo the Neo4j 3.1 milestone releases:

2016-10-03 23:25:07.529+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [UnknownError]: requirement failed, reference to. requirement failed
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:212)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.steps.sortSkipAndLimit$.apply(sortSkipAndLimit.scala:38)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:43)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:31)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:46)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:47)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:30)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.planQueries(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.plan(QueryPlanner.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.produceLogicalPlan(CostBasedExecutablePlanBuilder.scala:95)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.using(package.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.closing(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.producePlan(CostBasedExecutablePlanBuilder.scala:70)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.procs.DelegatingProcedureExecutablePlanBuilder.producePlan(DelegatingProcedureExecutablePlanBuilder.scala:99)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.FallbackBuilder$class.producePlan(SilentFallbackPlanBuilder.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.SilentFallbackPlanBuilder.producePlan(SilentFallbackPlanBuilder.scala:56)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.ExecutionPlanBuilder.build(ExecutionPlanBuilder.scala:106)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.QueryCache$$anonfun$getOrElseUpdate$1$$anonfun$apply$1.apply(CacheAccessor.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.MonitoringCacheAccessor$$anonfun$1.apply(CacheAccessor.scala:57)
        at org.neo4j.cypher.internal.compiler.v3_1.LFUCache$$anon$1.apply(LFUCache.scala:31)

The problem line seems to be this one:

require(sortItems.forall(_.expression.isInstanceOf[Variable]))

The solution is actually quite simple. We should be using the SIZE function to calculate the size of a list rather then the COUNT function:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY SIZE(movies) DESC
LIMIT 10

Written by Mark Needham

October 4th, 2016 at 10:33 pm

Posted in neo4j

Tagged with

Neo4j: Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead)

without comments

A couple of days I was trying to write a Cypher query to filter the labels in my database.

I started with the following procedure call to get the list of all the labels:

CALL db.labels
╒══════════╕
│label     │
╞══════════╡
│Airport   │
├──────────┤
│Flight    │
├──────────┤
│Airline   │
├──────────┤
│Movie     │
├──────────┤
│AirportDay│
├──────────┤
│Person    │
├──────────┤
│Engineer  │
└──────────┘

I was only interested in labels that contained the letter ‘a’ so I tweaked the query to filter the output of the procedure:

CALL db.labels
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label

Unfortunately that didn’t work as I expected:

Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead) (line 1, column 9 (offset: 8))
"CALL db.labels"
         ^

The mistake I made was calling the procedure implicitly without using parentheses. If you want to do any post processing on the output of a procedure you need to call it explicitly otherwise Cypher gets very confused.

If we add back the parentheses it’s much happier:

CALL db.labels()
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label
╒══════════╕
│label     │
╞══════════╡
│Airport   │
├──────────┤
│Airline   │
├──────────┤
│AirportDay│
└──────────┘

It stumped me for a while until I figured out what the error message meant! I think I’ll just use explicit parentheses all the time from now on to save me running into this one again.

Written by Mark Needham

October 2nd, 2016 at 10:13 am

Posted in neo4j

Tagged with ,

Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

without comments

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:

:play http://guides.neo4j.com/got

Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based character appearances. We’ll be using scitkit-learn‘s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

Episode 1 = [1, 1, 0]
Episode 2 = [0, 1, 1]

We could then calculate the cosine similarity between these two episodes like this:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> one = [1,1,0]
>>> two = [0,1,1]
 
>>> cosine_similarity([one, two])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t even mentioned at all, they are implicitly a position in the array. This means that when we use our real dataset we need to ensure that the characters are in the same order for each episode, otherwise the calculation will be meaningless!

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

from neo4j.v1 import GraphDatabase, basic_auth
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo"))
session = driver.session()
 
rows = session.run("""
    MATCH (c:Character), (e:Episode)
    OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e)
    RETURN e, c, appearance
    ORDER BY e.id, c.id""")

We can iterate through the rows to see what the output looks like:

>>> for row in rows:
        print row
 
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

episodes = {}
for row in rows:
    if episodes.get(row["e"]["id"]) is None:
        if row["appearance"] is None:
            episodes[row["e"]["id"]] = [0]
        else:
            episodes[row["e"]["id"]] = [1]
    else:
        if row["appearance"] is None:
            episodes[row["e"]["id"]].append(0)
        else:
            episodes[row["e"]["id"]].append(1)

Here’s an example of one entry in the matrix:

>>> len(episodes)
60
 
>>> len(episodes[1])
638
 
>>> episodes[1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

>>> all = episodes.values()
 
>>> cosine_similarity(all[0:1], all)[0]
array([ 1.        ,  0.69637306,  0.48196269,  0.54671752,  0.48196269,
        0.44733753,  0.31707317,  0.42340087,  0.34989921,  0.43314808,
        0.36597766,  0.18421252,  0.30961158,  0.2328101 ,  0.30616181,
        0.41905818,  0.36842504,  0.35338088,  0.18376917,  0.3569686 ,
        0.2328101 ,  0.34539847,  0.25043516,  0.31707317,  0.25329221,
        0.33342786,  0.34921515,  0.2174909 ,  0.2533473 ,  0.28429311,
        0.23026565,  0.22310537,  0.22365301,  0.23816275,  0.28242289,
        0.16070148,  0.24847093,  0.21434648,  0.03582872,  0.21189672,
        0.15460414,  0.17161693,  0.15460414,  0.17494961,  0.1234662 ,
        0.21426863,  0.21434648,  0.18748505,  0.15308091,  0.20161946,
        0.19877675,  0.30920827,  0.21058466,  0.19127301,  0.24607943,
        0.18033393,  0.17734311,  0.16296707,  0.18740851,  0.23995201])

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
0 1.0
1 0.696373059207
3 0.546717521051
2 0.481962692712
4 0.481962692712

Or we can see how similar the last episode of season 6 is compared to the others:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
59 1.0
52 0.500670191678
46 0.449085146211
43 0.448218732478
49 0.446296233312

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following query creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

>>> episode_mapping = {}
>>> for idx, episode_id in enumerate(episodes):
        episode_mapping[idx] = episode_id
 
>>> for idx, episode_id in enumerate(episodes):
        similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0]
        for other_idx, similarity_score in enumerate(similarity_matrix):
            other_episode_id = episode_mapping[other_idx]
            print episode_id, other_episode_id, similarity_score
            if episode_id != other_episode_id:
                session.run("""
                    MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}})
                    MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2)
                    ON CREATE SET similarity.score = {similarityScore}
                    """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score})
 
    session.close()

The episode_mapping dictionary is needed to map from episode ids to indices e.g. episode 1 is at index 0.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score DESC
LIMIT 10
 
╒═════╤════╤══════════════════╕
│ep1  │ep2 │score             │
╞═════╪════╪══════════════════╡
│S1E2 │S1E1│0.6963730592072543│
├─────┼────┼──────────────────┤
│S1E4 │S1E3│0.6914173051223086│
├─────┼────┼──────────────────┤
│S1E9 │S1E8│0.6869464497590777│
├─────┼────┼──────────────────┤
│S2E10│S2E8│0.6869037302955034│
├─────┼────┼──────────────────┤
│S3E7 │S3E6│0.6819943394704735│
├─────┼────┼──────────────────┤
│S2E7 │S2E6│0.6813598225089799│
├─────┼────┼──────────────────┤
│S1E10│S1E9│0.6796436827080401│
├─────┼────┼──────────────────┤
│S1E5 │S1E4│0.6698105143372364│
├─────┼────┼──────────────────┤
│S1E10│S1E8│0.6624062584864754│
├─────┼────┼──────────────────┤
│S4E5 │S4E4│0.6518358737330705│
└─────┴────┴──────────────────┘

And the least popular?

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score
LIMIT 10
 
╒════╤════╤═══════════════════╕
│ep1 │ep2 │score              │
╞════╪════╪═══════════════════╡
│S4E9│S1E5│0                  │
├────┼────┼───────────────────┤
│S4E9│S1E6│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E2│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E4│0                  │
├────┼────┼───────────────────┤
│S5E6│S4E9│0                  │
├────┼────┼───────────────────┤
│S6E8│S4E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E6│0                  │
├────┼────┼───────────────────┤
│S3E9│S2E9│0.03181423814878889│
├────┼────┼───────────────────┤
│S4E9│S1E1│0.03582871819500093│
└────┴────┴───────────────────┘

The output of this query suggests that there are no common characters between 8 pairs of episodes which at first glance sounds surprising. Let’s write a query to check that finding:

MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode)
WHERE episode1.season = 4 AND episode1.number = 9 AND episode2.season = 1 AND episode2.number = 5
return episode1, episode2
 
(no changes, no rows)

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up on 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

Written by Mark Needham

August 22nd, 2016 at 9:12 pm

Neo4j: Cypher – Detecting duplicates using relationships

without comments

I’ve been building a graph of computer science papers on and off for a couple of months and now that I’ve got a few thousand loaded in I realised that there are quite a few duplicates.

They’re not duplicates in the sense that there are multiple entries with the same identifier but rather have different identifiers but seem to be the same paper!

e.g. there are a couple of papers titled ‘Authentication in the Taos operating system’:

http://dl.acm.org/citation.cfm?id=174614

2016 07 20 11 43 00

http://dl.acm.org/citation.cfm?id=168640

2016 07 20 11 43 38

This is the same paper published in two different journals as far as I can tell.

Now in this case it’s quite easy to just do a string similarity comparison of the titles of these papers and realise that they’re identical. I’ve previously use the excellent dedupe library to do this and there’s also an excellent talk from Berlin Buzzwords 2014 where the author uses locality-sensitive hashing to achieve a similar outcome.

However, I was curious whether I could use any of the relationships these papers have to detect duplicates rather than just relying on string matching.

This is what the graph looks like:

Graph  8

We’ll start by writing a query to see how many common references the different Taos papers have:

MATCH (r:Resource {id: "168640"})-[:REFERENCES]->(other)
WITH r, COLLECT(other) as myReferences
 
UNWIND myReferences AS reference
OPTIONAL MATCH path = (other)-[:REFERENCES]->(reference)
WITH other, COUNT(path) AS otherReferences, SIZE(myReferences) AS myReferences
WITH other, 1.0 * otherReferences / myReferences AS similarity WHERE similarity > 0.5
 
RETURN other.id, other.title, similarity
ORDER BY similarity DESC
LIMIT 10
╒════════╤═══════════════════════════════════════════╤══════════╕
│other.id│other.title                                │similarity│
╞════════╪═══════════════════════════════════════════╪══════════╡
│168640  │Authentication in the Taos operating system│1         │
├────────┼───────────────────────────────────────────┼──────────┤
│174614  │Authentication in the Taos operating system│1         │
└────────┴───────────────────────────────────────────┴──────────┘

This query:

  • picks one of the Taos papers and finds its references
  • finds other papers which reference those same papers
  • calculates a similarity score based on how many common references they have
  • returns papers that have more than 50% of the same references with the most similar ones at the top

I tried it with other papers to see how it fared:

Performance of Firefly RPC

╒════════╤════════════════════════════════════════════════════════════════╤══════════════════╕
│other.id│other.title                                                     │similarity        │
╞════════╪════════════════════════════════════════════════════════════════╪══════════════════╡
│74859   │Performance of Firefly RPC                                      │1                 │
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│77653   │Performance of the Firefly RPC                                  │0.8333333333333334│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│110815  │The X-Kernel: An Architecture for Implementing Network Protocols│0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│96281   │Experiences with the Amoeba distributed operating system        │0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│74861   │Lightweight remote procedure call                               │0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│106985  │The interaction of architecture and operating system design     │0.6666666666666666│
├────────┼────────────────────────────────────────────────────────────────┼──────────────────┤
│77650   │Lightweight remote procedure call                               │0.6666666666666666│
└────────┴────────────────────────────────────────────────────────────────┴──────────────────┘

Authentication in distributed systems: theory and practice

╒════════╤══════════════════════════════════════════════════════════╤══════════════════╕
│other.id│other.title                                               │similarity        │
╞════════╪══════════════════════════════════════════════════════════╪══════════════════╡
│121160  │Authentication in distributed systems: theory and practice│1                 │
├────────┼──────────────────────────────────────────────────────────┼──────────────────┤
│138874  │Authentication in distributed systems: theory and practice│0.9090909090909091│
└────────┴──────────────────────────────────────────────────────────┴──────────────────┘

Sadly it’s not as simple as finding 100% matches on references! I expect the later revisions of a paper added more content and therefore additional references.

What if we look for author similarity as well?

MATCH (r:Resource {id: "121160"})-[:REFERENCES]->(other)
WITH r, COLLECT(other) as myReferences
 
UNWIND myReferences AS reference
OPTIONAL MATCH path = (other)-[:REFERENCES]->(reference)
WITH r, other, authorSimilarity,  COUNT(path) AS otherReferences, SIZE(myReferences) AS myReferences
WITH r, other, authorSimilarity,  1.0 * otherReferences / myReferences AS referenceSimilarity
WHERE referenceSimilarity > 0.5
 
MATCH (r)<-[:AUTHORED]-(author)
WITH r, myReferences, COLLECT(author) AS myAuthors
 
UNWIND myAuthors AS author
OPTIONAL MATCH path = (other)<-[:AUTHORED]-(author)
WITH other, myReferences, COUNT(path) AS otherAuthors, SIZE(myAuthors) AS myAuthors
WITH other, myReferences, 1.0 * otherAuthors / myAuthors AS authorSimilarity
WHERE authorSimilarity > 0.5
 
 
 
RETURN other.id, other.title, referenceSimilarity, authorSimilarity
ORDER BY (referenceSimilarity + authorSimilarity) DESC
LIMIT 10
╒════════╤══════════════════════════════════════════════════════════╤═══════════════════╤════════════════╕
│other.id│other.title                                               │referenceSimilarity│authorSimilarity│
╞════════╪══════════════════════════════════════════════════════════╪═══════════════════╪════════════════╡
│121160  │Authentication in distributed systems: theory and practice│1                  │1               │
├────────┼──────────────────────────────────────────────────────────┼───────────────────┼────────────────┤
│138874  │Authentication in distributed systems: theory and practice│0.9090909090909091 │1               │
└────────┴──────────────────────────────────────────────────────────┴───────────────────┴────────────────┘
╒════════╤══════════════════════════════╤═══════════════════╤════════════════╕
│other.id│other.title                   │referenceSimilarity│authorSimilarity│
╞════════╪══════════════════════════════╪═══════════════════╪════════════════╡
│74859   │Performance of Firefly RPC    │1                  │1               │
├────────┼──────────────────────────────┼───────────────────┼────────────────┤
│77653   │Performance of the Firefly RPC│0.8333333333333334 │1               │
└────────┴──────────────────────────────┴───────────────────┴────────────────┘

I’m sure I could find some other papers where neither of these similarities worked well but it’s an interesting start.

I think the next step is to build up a training set of pairs of documents that are and aren’t similar to each other. We could then train a classifier to determine whether two documents are identical.

But that’s for another day!

Written by Mark Needham

July 20th, 2016 at 5:32 pm

Posted in neo4j

Tagged with ,

Neo4j 3.0 Drivers – Failed to save the server ID and the certificate received from the server

without comments

I’ve been using the Neo4j Java Driver on various local databases over the past week and ran into the following certificate problem a few times:

org.neo4j.driver.v1.exceptions.ClientException: Unable to process request: General SSLEngine problem
	at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:88)
	at org.neo4j.driver.internal.connector.socket.SocketConnection.<init>(SocketConnection.java:63)
	at org.neo4j.driver.internal.connector.socket.SocketConnector.connect(SocketConnector.java:52)
	at org.neo4j.driver.internal.pool.InternalConnectionPool.acquire(InternalConnectionPool.java:113)
	at org.neo4j.driver.internal.InternalDriver.session(InternalDriver.java:53)
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1431)
	at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
	at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
	at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
	at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.wrap(TLSSocketChannel.java:270)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runHandshake(TLSSocketChannel.java:131)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:95)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:77)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:70)
	at org.neo4j.driver.internal.connector.socket.SocketClient$ChannelFactory.create(SocketClient.java:251)
	at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:75)
	... 14 more
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1497)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:212)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runDelegatedTasks(TLSSocketChannel.java:142)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:203)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runHandshake(TLSSocketChannel.java:127)
	... 19 more
Caused by: java.security.cert.CertificateException: Unable to connect to neo4j at `localhost:10003`, because the certificate the server uses has changed. This is a security feature to protect against man-in-the-middle attacks.
If you trust the certificate the server uses now, simply remove the line that starts with `localhost:10003` in the file `/Users/markneedham/.neo4j/known_hosts`.
The old certificate saved in file is:
-----BEGIN CERTIFICATE-----
7770ee598be69c8537b0e576e62442c84400008ca0d3e3565b379b7cce9a51de
0fd4396251df2e8da50eb1628d44dcbca3fae5c8fb9c0adc29396839c25eb0c8
 
-----END CERTIFICATE-----
The New certificate received is:
-----BEGIN CERTIFICATE-----
01a422739a39625ee95a0547fa99c7e43fbb33c70ff720e5ae4a8408421aa63b
2fe4f5d6094c5fd770ed1ad214dbdc428a6811d0955ed80d48cc67d84067df2c
 
-----END CERTIFICATE-----
 
	at org.neo4j.driver.internal.connector.socket.TrustOnFirstUseTrustManager.checkServerTrusted(TrustOnFirstUseTrustManager.java:153)
	at sun.security.ssl.AbstractTrustManagerWrapper.checkServerTrusted(SSLContextImpl.java:936)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1484)
	... 28 more

I got a bit lazy and just nuked the file it mentions in the error message – /Users/markneedham/.neo4j/known_hosts – which led to this error the next time I call the driver in my application:

Failed to save the server ID and the certificate received from the server to file /Users/markneedham/.neo4j/known_hosts.
Server ID: localhost:10003
Received cert:
-----BEGIN CERTIFICATE-----
933c7ec5d6a1b876bd186dc6d05b04478ae771262f07d26a4d7d2e6b7f71054c
3e6b7c172474493b7fe93170d940b9cc3544661c7966632361649f2fda7c66be
 
-----END CERTIFICATE-----

I recreated the file with no content and tried again and it worked fine. Alternatively we can choose to turn off encryption when working with local databases and avoid the issue:

Config config = Config.build().withEncryptionLevel( Config.EncryptionLevel.NONE ).toConfig();
 
try ( Driver driver = GraphDatabase.driver( "bolt://localhost:7687", config );
      Session session = driver.session() )
{
   // use the driver
}

Written by Mark Needham

July 11th, 2016 at 5:21 am

Posted in neo4j

Tagged with