Mark Needham

Thoughts on Software Development

Neo4j: Find the intermediate point between two lat/longs


Yesterday I wrote a blog post showing how to find the midpoint between two lat/longs using Cypher, which worked well as a first attempt at filling in missing locations, but I realised I could do better.

As I mentioned in the last post, when I find a stop that’s missing lat/long coordinates I can usually find two nearby stops that allow me to triangulate this stop’s location.

I also have train routes which indicate the number of seconds it takes to go from one stop to another, which allows me to work out whether the location-less stop is closer to one stop than the other.

For example, consider stops a, b, and c where b doesn’t have a location. If we have these distances between the stops:

(a)-[:NEXT {time: 60}]->(b)-[:NEXT {time: 240}]->(c)

it tells us that point ‘b’ is actually 0.2 of the way from ‘a’ to ‘c’ (60 / (60 + 240) = 0.2) rather than being the midpoint.

There’s a formula we can use to work out that point:

a = sin((1−f)⋅δ) / sin δ
b = sin(f⋅δ) / sin δ
x = a ⋅ cos φ1 ⋅ cos λ1 + b ⋅ cos φ2 ⋅ cos λ2
y = a ⋅ cos φ1 ⋅ sin λ1 + b ⋅ cos φ2 ⋅ sin λ2
z = a ⋅ sin φ1 + b ⋅ sin φ2
φi = atan2(z, √(x² + y²))
λi = atan2(y, x)
 
δ is the angular distance d/R between the two points.
φ = latitude
λ = longitude

Translated to Cypher (with mandatory Greek symbols) it reads like this to find the point 0.2 of the way from one point to another:

with {latitude: 51.4931963543, longitude: -0.0475185810} AS p1, 
     {latitude: 51.47908, longitude: -0.05393950 } AS p2
 
WITH p1, p2, distance(point(p1), point(p2)) / 6371000 AS δ, 0.2 AS f
WITH p1, p2, δ, 
     sin((1-f) * δ) / sin(δ) AS a,
     sin(f * δ) / sin(δ) AS b
WITH radians(p1.latitude) AS φ1, radians(p1.longitude) AS λ1,
     radians(p2.latitude) AS φ2, radians(p2.longitude) AS λ2,
     a, b
WITH a * cos(φ1) * cos(λ1) + b * cos(φ2) * cos(λ2) AS x,
     a * cos(φ1) * sin(λ1) + b * cos(φ2) * sin(λ2) AS y,
     a * sin(φ1) + b * sin(φ2) AS z
RETURN degrees(atan2(z, sqrt(x^2 + y^2))) AS φi,
       degrees(atan2(y,x)) AS λi
╒═════════════════╤════════════════════╕
│φi               │λi                  │
╞═════════════════╪════════════════════╡
│51.49037311149128│-0.04880308288561931│
└─────────────────┴────────────────────┘

A quick sanity check plugging in 0.5 instead of 0.2 finds the midpoint which I was able to sanity check against yesterday’s post:

╒═════════════════╤═════════════════════╕
│φi               │λi                   │
╞═════════════════╪═════════════════════╡
│51.48613822097523│-0.050729537454086385│
└─────────────────┴─────────────────────┘
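Eventually I’d like to derive that fraction from the graph rather than hard coding it. Here’s a minimal sketch of how that might look, assuming stops connected by NEXT relationships with a time property as above (the Stop label and latitude property are my own naming):

MATCH (a:Stop)-[r1:NEXT]->(b:Stop)-[r2:NEXT]->(c:Stop)
WHERE b.latitude IS NULL AND a.latitude IS NOT NULL AND c.latitude IS NOT NULL
RETURN a, b, c, toFloat(r1.time) / (r1.time + r2.time) AS f

The f returned there could then be plugged into the query above in place of the hard coded 0.2.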

That’s all for now!

Written by Mark Needham

November 1st, 2016 at 10:10 pm

Posted in neo4j


Neo4j: Find the midpoint between two lat/longs


[Google Maps screenshot showing the two input points at the top and bottom and the computed midpoint in between]

Over the last couple of weekends I’ve been playing around with some transport data and I wanted to run the A* algorithm to find the quickest route between two stations.

The A* algorithm takes an estimateEvaluator as one of its parameters, and that evaluator looks at the lat/longs of nodes to work out whether a path is worth following or not. I therefore needed to add lat/longs for each station, and I found it surprisingly hard to find this location data for all the points in the dataset.

Luckily I tend to have the lat/longs for two points either side of a station so I can work out the midpoint as an approximation for the missing one.

I found an article which defines a formula we can use to do this and there’s a StackOverflow post which has some Java code that implements the formula.

I wanted to find the midpoint between Surrey Quays station (51.4931963543,-0.0475185810) and a point further south on the train line (51.47908,-0.05393950). I wrote the following Cypher query to calculate this point:

WITH 51.4931963543 AS lat1, -0.0475185810 AS lon1, 
     51.47908 AS lat2 , -0.05393950 AS lon2
 
WITH radians(lat1) AS rlat1, radians(lon1) AS rlon1, 
     radians(lat2) AS rlat2, radians(lon2) AS rlon2, 
     radians(lon2 - lon1) AS dLon
 
WITH rlat1, rlon1, rlat2, rlon2, 
     cos(rlat2) * cos(dLon) AS Bx, 
     cos(rlat2) * sin(dLon) AS By
 
WITH atan2(sin(rlat1) + sin(rlat2), 
           sqrt( (cos(rlat1) + Bx) * (cos(rlat1) + Bx) + By * By )) AS lat3,
     rlon1 + atan2(By, cos(rlat1) + Bx) AS lon3
 
RETURN degrees(lat3) AS midLat, degrees(lon3) AS midLon
╒═════════════════╤═════════════════════╕
│midLat           │midLon               │
╞═════════════════╪═════════════════════╡
│51.48613822097523│-0.050729537454086385│
└─────────────────┴─────────────────────┘

The Google Maps screenshot above shows the initial points at the top and bottom and the midpoint in between. It’s not perfect; ideally I’d like the midpoint to be on the track, but I think it’s good enough for the purposes of the algorithm.

Now I need to go and fill in the lat/longs for my location-less stations!
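Something like the following sketch might do it; the Station label, NEXT relationship, and latitude/longitude property names are my own assumptions:

MATCH (a:Station)-[:NEXT]->(b:Station)-[:NEXT]->(c:Station)
WHERE b.latitude IS NULL AND a.latitude IS NOT NULL AND c.latitude IS NOT NULL
WITH b,
     radians(a.latitude) AS rlat1, radians(a.longitude) AS rlon1,
     radians(c.latitude) AS rlat2, radians(c.longitude - a.longitude) AS dLon
WITH b, rlat1, rlon1, rlat2,
     cos(rlat2) * cos(dLon) AS Bx,
     cos(rlat2) * sin(dLon) AS By
SET b.latitude  = degrees(atan2(sin(rlat1) + sin(rlat2),
                                sqrt((cos(rlat1) + Bx) * (cos(rlat1) + Bx) + By * By))),
    b.longitude = degrees(rlon1 + atan2(By, cos(rlat1) + Bx))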

Written by Mark Needham

October 31st, 2016 at 7:31 pm

Posted in neo4j


Neo4j: Create dynamic relationship type


One of the things I’ve often found frustrating when importing data using Cypher, Neo4j’s query language, is that it’s quite difficult to create dynamic relationship types.

Say we have a CSV file structured like this:

load csv with headers from "file:///people.csv" AS row
RETURN row
╒═══════════════════════════════════════════════════════╕
│row                                                    │
╞═══════════════════════════════════════════════════════╡
│{node1: Mark, node2: Reshmee, relationship: MARRIED_TO}│
├───────────────────────────────────────────────────────┤
│{node1: Mark, node2: Alistair, relationship: FRIENDS}  │
└───────────────────────────────────────────────────────┘

We want to create the nodes and connect them with the relationship type specified in the file. Unfortunately, in Cypher we can’t pass in relationship types dynamically, so we have to resort to the FOREACH hack to create our relationships:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "MARRIED_TO" THEN [1] ELSE [] END |
 MERGE (p1)-[:MARRIED_TO]->(p2))
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "FRIENDS" THEN [1] ELSE [] END |
 MERGE (p1)-[:FRIENDS]->(p2))

This works, but:

  1. Looks horrendous
  2. Doesn’t scale particularly well when we have multiple relationship types to deal with

As in my last post the APOC library comes to the rescue again, this time in the form of the apoc.create.relationship procedure.

This procedure allows us to change our initial query to read like this:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.relationship, {}, p2) YIELD rel
RETURN rel

Much better!
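We can sanity check what was created with a quick query; the exact rows will obviously depend on your data:

MATCH (p1:Person)-[r]->(p2:Person)
RETURN p1.name, type(r), p2.name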

Written by Mark Needham

October 30th, 2016 at 10:12 pm

Posted in neo4j


Neo4j: Dynamically add property/Set dynamic property


I’ve been playing around with a dataset which has the timetable for National Rail in the UK, and it gives the departure and arrival times of each train in a textual format.

For example, the node to represent a stop could be created like this:

CREATE (stop:Stop {arrival: "0802", departure: "0803H"})

That time format isn’t particularly amenable to querying, so I wanted to add another property which indicated the number of seconds since the start of the day.

So we want to add ‘arrivalSecondsSinceStartOfDay’ and ‘departureSecondsSinceStartOfDay’ properties to our node, where a trailing ‘H’ in the source value means an extra half minute (30 seconds). I wrote the following query to calculate the values for those properties:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
 
RETURN key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
╒═══════════════════════════════╤══════════════════════╕
│newKey                         │secondsSinceStartOfDay│
╞═══════════════════════════════╪══════════════════════╡
│arrivalSecondsSinceStartOfDay  │28920                 │
├───────────────────────────────┼──────────────────────┤
│departureSecondsSinceStartOfDay│29010                 │
└───────────────────────────────┴──────────────────────┘
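As a quick check: “0802” gives (8 * 60 * 60) + (2 * 60) = 28800 + 120 = 28920, and “0803H” gives (8 * 60 * 60) + (3 * 60) + 30 = 29010, which matches the output above.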

Now we’re ready to set those properties on the ‘stop’ node.

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
SET stop[newKey] = secondsSinceStartOfDay
Invalid input '[': expected an identifier character, whitespace, '{', node labels, a property map, a relationship pattern, '.', '(', '=' or "+=" (line 12, column 9 (offset: 447))
"SET stop[newKey] = secondsSinceStartOfDay"
         ^

Hmmm that didn’t work as expected! It doesn’t look like we can set dynamic properties using Cypher just yet.

Luckily my colleague Michael Hunger and the Neo4j community have been curating the APOC procedures library and it has just the procedure to help us out.

You’ll need to download the jar for your version of Neo4j and then place it in the plugins directory. I’m using Neo4j 3.1 Beta1 so this is what it looks like for me:

$ tree neo4j-enterprise-3.1.0-BETA1/plugins/
 
neo4j-enterprise-3.1.0-BETA1/plugins/
└── apoc-3.1.0.1-all.jar
 
0 directories, 1 file

After you’ve done that you’ll need to restart Neo4j so it picks up the new procedures. Then execute the following query to check they’ve installed correctly:

call dbms.procedures()
YIELD name 
WITH name 
WHERE name STARTS WITH "apoc"
RETURN COUNT(*)
╒════════╕
│COUNT(*)│
╞════════╡
│183     │
└────────┘

We’re now ready to dynamically set properties in the graph. The procedure that we’ll use is apoc.create.setProperty and it’s easy to update our query to use it:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)
Query cannot conclude with CALL (must be RETURN or an update clause) (line 12, column 1 (offset: 439))
"CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)"
 ^

Oops, I spoke too soon! We need to YIELD the return column of the procedure and then either return it or just return a count to work around this:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay) 
YIELD node
RETURN COUNT(*)
╒════════╕
│COUNT(*)│
╞════════╡
│2       │
└────────┘
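We can eyeball the new properties with something like this:

MATCH (stop:Stop)
RETURN stop.arrival, stop.arrivalSecondsSinceStartOfDay,
       stop.departure, stop.departureSecondsSinceStartOfDay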

And that’s it, we can now dynamically set properties in our queries.

Written by Mark Needham

October 27th, 2016 at 5:29 am

Posted in neo4j


Neo4j: Detecting rogue spaces in CSV headers with LOAD CSV


Last week I was helping someone load the data from a CSV file into Neo4j and we were having trouble filtering out rows which contained a null value in one of the columns.

This is what the data looked like:

load csv with headers from "file:///foo.csv" as row
RETURN row
╒══════════════════════════════════╕
│row                               │
╞══════════════════════════════════╡
│{key1: a,  key2: (null),  key3: c}│
├──────────────────────────────────┤
│{key1: d,  key2: e,  key3: f}     │
└──────────────────────────────────┘

We’d like to filter out any rows which have ‘key2’ as null, so let’s tweak our query to do that:

load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
(no rows)

Hmmm that’s odd, it’s got rid of both rows. We’d expect to see the 2nd row since that doesn’t have a null value.

At this point we might suspect that what we’re seeing on the screen isn’t actually what the data looks like. Let’s write the following query to check our header values:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒═════╤═════════╕
│key  │SIZE(key)│
╞═════╪═════════╡
│key1 │4        │
├─────┼─────────┤
│ key2│5        │
├─────┼─────────┤
│ key3│5        │
└─────┴─────────┘

The second column tells us that there are extra characters in the ‘key2’ and ‘key3’ columns: the keys are actually ‘ key2’ and ‘ key3’. In this case the extra characters are spaces, but it could easily be another character:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, replace(key, " ", "_SPACE_") AS spaces
╒═════╤═══════════╕
│key  │spaces     │
╞═════╪═══════════╡
│key1 │key1       │
├─────┼───────────┤
│ key2│_SPACE_key2│
├─────┼───────────┤
│ key3│_SPACE_key3│
└─────┴───────────┘
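As an aside, if editing the file isn’t an option we can reference a key exactly as it appears, leading space and all, using the bracket syntax. A workaround sketch:

load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row[" key2"] is null
RETURN row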

If we clean up our CSV file and try again everything works as expected:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒════╤═════════╕
│key │SIZE(key)│
╞════╪═════════╡
│key1│4        │
├────┼─────────┤
│key2│4        │
├────┼─────────┤
│key3│4        │
└────┴─────────┘
load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
╒═══════════════════════════╕
│row                        │
╞═══════════════════════════╡
│{key1: d, key2: e, key3: f}│
└───────────────────────────┘

Written by Mark Needham

October 19th, 2016 at 5:16 am

Posted in neo4j


Neo4j: requirement failed


Last week during a hands on Cypher meetup, using Neo4j’s built in movie dataset, one of the attendees showed me the following query which wasn’t working as expected:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY COUNT(movies) DESC
LIMIT 10
requirement failed

We can get a full stack trace in logs/debug.log if we run the same query from the cypher-shell, which was introduced in one of the Neo4j 3.1 milestone releases:

2016-10-03 23:25:07.529+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [UnknownError]: requirement failed, reference to. requirement failed
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:212)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.steps.sortSkipAndLimit$.apply(sortSkipAndLimit.scala:38)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:43)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:31)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:46)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:47)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:30)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.planQueries(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.plan(QueryPlanner.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.produceLogicalPlan(CostBasedExecutablePlanBuilder.scala:95)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.using(package.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.closing(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.producePlan(CostBasedExecutablePlanBuilder.scala:70)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.procs.DelegatingProcedureExecutablePlanBuilder.producePlan(DelegatingProcedureExecutablePlanBuilder.scala:99)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.FallbackBuilder$class.producePlan(SilentFallbackPlanBuilder.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.SilentFallbackPlanBuilder.producePlan(SilentFallbackPlanBuilder.scala:56)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.ExecutionPlanBuilder.build(ExecutionPlanBuilder.scala:106)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.QueryCache$$anonfun$getOrElseUpdate$1$$anonfun$apply$1.apply(CacheAccessor.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.MonitoringCacheAccessor$$anonfun$1.apply(CacheAccessor.scala:57)
        at org.neo4j.cypher.internal.compiler.v3_1.LFUCache$$anon$1.apply(LFUCache.scala:31)

The problem line seems to be this one:

require(sortItems.forall(_.expression.isInstanceOf[Variable]))

The solution is actually quite simple: we should use the SIZE function to calculate the size of a list rather than the COUNT function. By the time we ORDER BY, movies is already a collected list, so there’s nothing left to aggregate:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY SIZE(movies) DESC
LIMIT 10
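Another way around the restriction, given that the planner only wants to sort by variables, is to alias an aggregate and order by that. A variation on the same query:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies, COUNT(movie) AS appearances
ORDER BY appearances DESC
LIMIT 10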

Written by Mark Needham

October 4th, 2016 at 10:33 pm

Posted in neo4j


Neo4j: Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead)


A couple of days ago I was trying to write a Cypher query to filter the labels in my database.

I started with the following procedure call to get the list of all the labels:

CALL db.labels
╒══════════╕
│label     │
╞══════════╡
│Airport   │
├──────────┤
│Flight    │
├──────────┤
│Airline   │
├──────────┤
│Movie     │
├──────────┤
│AirportDay│
├──────────┤
│Person    │
├──────────┤
│Engineer  │
└──────────┘

I was only interested in labels that contained the letter ‘a’ so I tweaked the query to filter the output of the procedure:

CALL db.labels
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label

Unfortunately that didn’t work as I expected:

Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead) (line 1, column 9 (offset: 8))
"CALL db.labels"
         ^

The mistake I made was calling the procedure implicitly, without using parentheses. If you want to do any post processing on the output of a procedure you need to call it explicitly, otherwise Cypher gets very confused.

If we add back the parentheses it’s much happier:

CALL db.labels()
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label
╒══════════╕
│label     │
╞══════════╡
│Airport   │
├──────────┤
│Airline   │
├──────────┤
│AirportDay│
└──────────┘

It stumped me for a while until I figured out what the error message meant! I think I’ll just use explicit parentheses all the time from now on to save me running into this one again.

Written by Mark Needham

October 2nd, 2016 at 10:13 am

Posted in neo4j


scikit-learn: First steps with log_loss


Over the last week I’ve spent a little bit of time playing around with the data in the Kaggle TalkingData Mobile User Demographics competition, and came across a notebook written by dune_dweller showing how to run a logistic regression algorithm on the dataset.

The metric used to evaluate the output in this competition is multi class logarithmic loss, which is implemented by the log_loss function in the scikit-learn library.

I’ve not used it before so I created a small example to get to grips with it.

Let’s say we have 3 rows to predict and we happen to know that they should be labelled ‘bam’, ‘ham’, and ‘spam’ respectively:

>>> actual_labels = ["bam", "ham", "spam"]


To work out the log loss score we need to make a prediction for what we think each label actually is. We do this by passing an array containing a probability between 0 and 1 for each label. For example, if we think the first label is definitely ‘bam’ then we’d pass [1, 0, 0], whereas if we thought it had a 50-50 chance of being ‘bam’ or ‘spam’ then we might pass [0.5, 0, 0.5]. As far as I can tell the labels get sorted into alphabetical order, so we need to provide our predictions in that order.

Let’s give it a try. First we’ll import the function:

>>> from sklearn.metrics import log_loss

Now let’s see what score we get if we make a perfect prediction:

>>> log_loss(actual_labels,  [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
2.1094237467877998e-15

What about if we make a completely wrong prediction?

>>> log_loss(actual_labels,  [[0, 0, 1], [1, 0, 0], [0, 1, 0]])
34.538776394910684

We can reverse engineer this score to work out the probability that we’ve predicted the correct class.

If the average log loss exceeds 1, then log(pij) < -1 where i is the true class. This means the predicted probability for that class would be less than exp(-1), or around 0.368. So seeing a log loss greater than 1 means the model is, on average, giving less than a 37% probability estimate to the correct class.

This is the formula for log loss:

logloss = −(1/N) ⋅ ∑i ∑j yij ⋅ log(pij)

In which yij is 1 for the correct class and 0 for the other classes, and pij is the probability assigned to that class.

The interesting thing about this formula is that we only care about the correct class. The yij value of 0 cancels out the wrong classes.

In our two examples so far we actually already know the probability estimate for the correct class – 100% in the first case and 0% in the second case, but we can plug in the numbers to check we end up with the same result.

First we need to work out what value would have been passed to the log function, which is easy in this case: pij is 1 when every prediction is exactly right and effectively 0 when every prediction is completely wrong:

>>> import math

# every prediction exactly right
>>> math.log(1)
0.0

>>> math.exp(0)
1.0

# every prediction completely wrong
>>> math.log(0.000000001)
-20.72326583694641

>>> math.exp(-20.72326583694641)
1.0000000000000007e-09

I used a really small value instead of 0 in the second example because math.log(0) throws an error: the log function tends towards negative infinity as its argument approaches 0.

Let’s try another example where we have less certainty:

>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673

We’ll have to do a bit more work to figure out what value was being passed to the log function this time, but not too much. This is roughly the calculation being performed:

# 0.363548039673 = -1/3 * (log(0.8) + log(0.6) + log(0.7))
 

In this case, on average our probability estimate would be:

# we put in the negative value since we multiplied by -1/N
>>> math.exp(-0.363548039673)
0.6952053289772744

We assigned probabilities of 80%, 60%, and 70% to the correct classes of our 3 rows, so an overall estimate of 69.5% (the geometric mean of those values) seems about right.
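We can double check that by computing the loss by hand. A quick sketch of my own:

>>> import math
>>> probs = [0.8, 0.6, 0.7]  # probability assigned to the true class of each row
>>> print -sum(math.log(p) for p in probs) / len(probs)
0.363548039673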

One more example. This time we’ll make one more very certain (90%) prediction for ‘spam’:

>>> print log_loss(["bam", "ham", "spam", "spam"], [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]])
0.299001158669
 
>>> math.exp(-0.299001158669)
0.741558550213609

An average probability estimate of 74% for the correct classes sounds about right!

Written by Mark Needham

September 14th, 2016 at 5:33 am

Posted in Machine Learning,Python


scikit-learn: Clustering and the curse of dimensionality


In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.

But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”).

Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
def distances(a, b):
    return np.linalg.norm(a-b), cosine_similarity([a, b])[0][1]
 
def mixed(n_zeros, n_ones):
    return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0)
 
def ones(n_ones):
    return np.repeat([1], n_ones)
 
print distances(mixed(2, 2), ones(4))
print distances(mixed(3, 3), ones(6))
print distances(mixed(50, 50), ones(100))
print distances(mixed(300, 300), ones(600))
 
(1.4142135623730951, 0.70710678118654746)
(1.7320508075688772, 0.70710678118654768)
(7.0710678118654755, 0.70710678118654757)
(17.320508075688775, 0.70710678118654746)

The euclidean distance for the 600 item vector is more than 12x larger than for the one containing 4 items (17.32 vs 1.41) despite each pair having exactly the same cosine similarity score.
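As a sanity check of my own, dividing by the square root of the size ratio brings the distance back in line with the 4 item vector, which suggests the inflation is purely a function of vector length:

>>> import math
>>> print 17.320508075688775 / math.sqrt(600 / 4)
1.41421356237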

Having convinced myself that reducing the dimensionality of the vectors could make a difference, I reduced the size of the episode vectors using the Truncated SVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episode vectors:

from sklearn.decomposition import TruncatedSVD
 
n_components = 2
reducer = TruncatedSVD(n_components=n_components)
reducer.fit(all)
new_all = reducer.transform(all)
print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum()))
 
2: Percentage explained: 0.124579183633

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would be an interesting place to start. I’m also not sure exactly what the reducer.explained_variance_ratio_ attribute means, so I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.
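Rather than guessing, we could sweep over a few sizes and see how much variance each one keeps. This is a sketch of my own, reusing the episode vectors from above:

from sklearn.decomposition import TruncatedSVD

for n in [2, 3, 10, 30]:
    reducer = TruncatedSVD(n_components=n)
    reducer.fit(all)
    print("%d: Percentage explained: %s" % (n, reducer.explained_variance_ratio_.sum()))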

For now though let’s try out the clustering algorithm again and see how it gets on:

from sklearn import metrics
from sklearn.cluster import KMeans

for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(new_all)
    silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000)

    print n_clusters, silhouette_avg
 
2 0.559681096025
3 0.498456585461
4 0.524704352941
5 0.441580592398
6 0.44703058946
7 0.447895331824
8 0.433698007009
9 0.459874485986

This time our silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score:

[Table from the tutorial interpreting the silhouette coefficient: 0.71–1.00 strong structure; 0.51–0.70 reasonable structure; 0.26–0.50 weak structure that may be artificial; below 0.26 no substantial structure.]

We have a couple of cluster sizes which fit in the ‘reasonable structure’ and a few just on the edge of fitting in that category.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Any questions/thoughts/advice do let me know in the comments.

Written by Mark Needham

August 27th, 2016 at 8:32 pm

scikit-learn: Trying to find clusters of Game of Thrones episodes


In my last post I showed how to find similar Game of Thrones episodes based on the characters that appear in different episodes. This allowed us to find similar episodes on an episode by episode basis, but I was curious whether there were groups of similar episodes that we could identify.

scikit-learn provides several clustering algorithms that can run over our episode vectors and hopefully find clusters of similar episodes. A clustering algorithm groups similar documents together, where similarity is based on calculating a ‘distance’ between documents. Documents separated by a small distance would be in the same cluster, whereas if there’s a large distance between episodes then they’d probably be in different clusters.

The simplest variant is K-means clustering:

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.

The output from the algorithm is a list of labels which correspond to the cluster assigned to each episode.

Let’s give it a try on the Game of Thrones episodes. We’ll start from the 2 dimensional array of episodes/character appearances that we created in the previous post.

>>> all.shape
(60, 638)
 
>>> all
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

We have a 60 (episodes) x 638 (characters) array which we can now plug into the K-means clustering algorithm:

>>> from sklearn.cluster import KMeans
 
>>> n_clusters = 3
>>> km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
>>> cluster_labels = km.fit_predict(all)
 
>>> cluster_labels
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

cluster_labels is an array containing a label for each episode in the all array. The spread of these labels is as follows:

>>> import numpy as np
>>> np.bincount(cluster_labels)
array([19, 12, 29])

i.e. 19 episodes in cluster 0, 12 in cluster 1, and 29 in cluster 2.

How do we know if the clustering is any good?

Ideally we’d have some labelled training data which we could compare our labels against, but since we don’t we can measure the effectiveness of our clustering by calculating inter-centroidal separation and intra-cluster variance.

i.e. how close are the episodes to other episodes in the same cluster vs how close are they to episodes in the closest different cluster.

scikit-learn gives us a function that we can use to calculate this score – the silhouette coefficient.

The output of this function is a score between -1 and 1.

  • A score of 1 means that our clustering has worked well and a document is far away from the boundary of another cluster.
  • A score of -1 means that our document should have been placed in another cluster.
  • A score of 0 means that the document is very close to the decision boundary between two clusters.

I tried calculating this coefficient for some different values of K. This is what I found:

from sklearn import metrics
from sklearn.cluster import KMeans

for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(all)

    silhouette_avg = metrics.silhouette_score(all, cluster_labels, sample_size=1000)
    sample_silhouette_values = metrics.silhouette_samples(all, cluster_labels)

    print n_clusters, silhouette_avg
 
2 0.0798610142955
3 0.0648416081725
4 0.0390877994786
5 0.020165277756
6 0.030557856406
7 0.0389677156458
8 0.0590721834989
9 0.0466170527996

The best score we manage here is just under 0.08, when we set the number of clusters to 2. Even our highest score is much lower than the lowest score on the documentation page!

I tried it out with some higher values of K but only saw a score over 0.5 once I put the number of clusters to 40 which would mean 1 or 2 episodes per cluster at most.

At the moment our episode arrays contain 638 elements so they’re too long to visualise on a 2D silhouette plot. We’d need to apply a dimensionality reduction algorithm before doing that.

In summary it looks like character co-occurrence isn’t a good way to cluster episodes. I’m curious what would happen if we flip the array on its head and try and cluster the characters instead, but that’s for another day.

If anyone spots anything that I’ve missed when reading the output of the algorithm let me know in the comments. I’m just learning by experimentation at the moment.

Written by Mark Needham

August 25th, 2016 at 10:07 pm

Posted in Machine Learning,Python
