Mark Needham

Thoughts on Software Development

Archive for the ‘Databases’ Category

Neo4j: Cypher – Deleting duplicate nodes

I had a problem on a graph I was working on recently where I’d managed to create duplicate nodes because I hadn’t applied any unique constraints.

I wanted to remove the duplicates, and came across Jimmy Ruts’ excellent post which shows some ways to do this.
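
Before fixing the data, it's worth noting how to stop this happening again: a unique constraint refuses duplicate property values at write time. A minimal sketch, using the Neo4j 3.x constraint syntax:

CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE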

Let’s first create a graph with some duplicate nodes to play with:

UNWIND range(0, 100) AS id
CREATE (p1:Person {id: toInteger(rand() * id)})
MERGE (p2:Person {id: toInteger(rand() * id)})
MERGE (p3:Person {id: toInteger(rand() * id)})
MERGE (p4:Person {id: toInteger(rand() * id)})
CREATE (p1)-[:KNOWS]->(p2)
CREATE (p1)-[:KNOWS]->(p3)
CREATE (p1)-[:KNOWS]->(p4)
 
Added 173 labels, created 173 nodes, set 173 properties, created 5829 relationships, completed after 408 ms.

How do we find the duplicate nodes?

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒══════════════════════╤═════════════╕
│"ids"                 │"size(nodes)"│
╞══════════════════════╪═════════════╡
│[1,1,1,1,1,1,1,1]     │8            │
├──────────────────────┼─────────────┤
│[0,0,0,0,0,0,0,0]     │8            │
├──────────────────────┼─────────────┤
│[17,17,17,17,17,17,17]│7            │
├──────────────────────┼─────────────┤
│[4,4,4,4,4,4,4]       │7            │
├──────────────────────┼─────────────┤
│[2,2,2,2,2,2]         │6            │
├──────────────────────┼─────────────┤
│[5,5,5,5,5,5]         │6            │
├──────────────────────┼─────────────┤
│[19,19,19,19,19,19]   │6            │
├──────────────────────┼─────────────┤
│[11,11,11,11,11]      │5            │
├──────────────────────┼─────────────┤
│[25,25,25,25,25]      │5            │
├──────────────────────┼─────────────┤
│[43,43,43,43,43]      │5            │
└──────────────────────┴─────────────┘

Let’s zoom in on all the people with ‘id:1’ and work out how many relationships they have. Our plan is to keep the node that has the most connections and get rid of the others.

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
WITH nodes ORDER BY size(nodes) DESC
LIMIT 1
UNWIND nodes AS n 
RETURN n.id, id(n) AS internalId, size((n)--()) AS rels
ORDER BY rels DESC
 
╒══════╤════════════╤══════╕
│"n.id"│"internalId"│"rels"│
╞══════╪════════════╪══════╡
│1     │175         │1284  │
├──────┼────────────┼──────┤
│1     │184         │721   │
├──────┼────────────┼──────┤
│1     │180         │580   │
├──────┼────────────┼──────┤
│1     │2           │391   │
├──────┼────────────┼──────┤
│1     │195         │361   │
├──────┼────────────┼──────┤
│1     │199         │352   │
├──────┼────────────┼──────┤
│1     │302         │5     │
├──────┼────────────┼──────┤
│1     │306         │1     │
└──────┴────────────┴──────┘

So in this example we want to keep the node that has 1,284 relationships (internal id 175) and delete the rest.

To make things easy we need the node with the highest cardinality to be first or last in our list. We can ensure that’s the case by ordering the nodes before we group them.

MATCH (p:Person)
WITH p 
ORDER BY p.id, size((p)--()) DESC
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | {id: n.id,rels: size((n)--()) } ] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒══════════════════════════════════════════════════════════════════════╤═════════════╕
│"ids"                                                                 │"size(nodes)"│
╞══════════════════════════════════════════════════════════════════════╪═════════════╡
│[{"id":1,"rels":1284},{"id":1,"rels":721},{"id":1,"rels":580},{"id":1,│8            │
│"rels":391},{"id":1,"rels":361},{"id":1,"rels":352},{"id":1,"rels":5},│             │
│{"id":1,"rels":1}]                                                    │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":0,"rels":2064},{"id":0,"rels":2059},{"id":0,"rels":1297},{"id":│8            │
│0,"rels":1124},{"id":0,"rels":995},{"id":0,"rels":928},{"id":0,"rels":│             │
│730},{"id":0,"rels":702}]                                             │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":17,"rels":153},{"id":17,"rels":105},{"id":17,"rels":81},{"id":1│7            │
│7,"rels":31},{"id":17,"rels":15},{"id":17,"rels":14},{"id":17,"rels":1│             │
│}]                                                                    │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":4,"rels":394},{"id":4,"rels":320},{"id":4,"rels":250},{"id":4,"│7            │
│rels":201},{"id":4,"rels":162},{"id":4,"rels":162},{"id":4,"rels":14}]│             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":2,"rels":514},{"id":2,"rels":329},{"id":2,"rels":318},{"id":2,"│6            │
│rels":241},{"id":2,"rels":240},{"id":2,"rels":2}]                     │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":5,"rels":487},{"id":5,"rels":378},{"id":5,"rels":242},{"id":5,"│6            │
│rels":181},{"id":5,"rels":158},{"id":5,"rels":8}]                     │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":19,"rels":153},{"id":19,"rels":120},{"id":19,"rels":84},{"id":1│6            │
│9,"rels":53},{"id":19,"rels":45},{"id":19,"rels":1}]                  │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":11,"rels":222},{"id":11,"rels":192},{"id":11,"rels":172},{"id":│5            │
│11,"rels":152},{"id":11,"rels":89}]                                   │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":25,"rels":133},{"id":25,"rels":107},{"id":25,"rels":98},{"id":2│5            │
│5,"rels":15},{"id":25,"rels":2}]                                      │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":43,"rels":92},{"id":43,"rels":85},{"id":43,"rels":9},{"id":43,"│5            │
│rels":5},{"id":43,"rels":1}]                                          │             │
└──────────────────────────────────────────────────────────────────────┴─────────────┘

Now it’s time to delete the duplicates:

MATCH (p:Person)
WITH p 
ORDER BY p.id, size((p)--()) DESC
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
UNWIND nodes[1..] AS n
DETACH DELETE n
 
Deleted 143 nodes, deleted 13806 relationships, completed after 29 ms.

Now if we run our duplicate query:

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
(no changes, no records)

What about if we remove the WHERE clause?

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒═════╤═════════════╕
│"ids"│"size(nodes)"│
╞═════╪═════════════╡
│[23] │1            │
├─────┼─────────────┤
│[86] │1            │
├─────┼─────────────┤
│[77] │1            │
├─────┼─────────────┤
│[59] │1            │
├─────┼─────────────┤
│[50] │1            │
├─────┼─────────────┤
│[32] │1            │
├─────┼─────────────┤
│[41] │1            │
├─────┼─────────────┤
│[53] │1            │
├─────┼─────────────┤
│[44] │1            │
├─────┼─────────────┤
│[8]  │1            │
└─────┴─────────────┘

Hoorah, no more duplicates! Finally, let’s check that we kept the node we expected to keep. We expect it to have an ‘internalId’ of 175:

MATCH (p:Person {id: 1})
RETURN size((p)--()), id(p) AS internalId
 
╒═══════════════╤════════════╕
│"size((p)--())"│"internalId"│
╞═══════════════╪════════════╡
│242            │175         │
└───────────────┴────────────┘

Which it does! There are many fewer relationships than we had before because a lot of those relationships were to duplicate nodes that we’ve now deleted.

If we want to go a step further we could ‘merge’ the duplicate nodes’ relationships onto the nodes that we did keep, but that’s for another post!
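
If you can’t wait, APOC has a procedure aimed at exactly this problem: instead of DETACH DELETE we could have collapsed each group of duplicates into one node. A rough sketch, assuming the APOC library is installed – apoc.refactor.mergeNodes merges a list of nodes (and their relationships) into the first node in the list, which is why we order by degree first:

MATCH (p:Person)
WITH p
ORDER BY p.id, size((p)--()) DESC
WITH p.id AS id, collect(p) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes) YIELD node
RETURN node.id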

Written by Mark Needham

October 6th, 2017 at 4:13 pm

Posted in neo4j

AWS: Spinning up a Neo4j instance with APOC installed

One of the first things I do after installing Neo4j is install the APOC library, but I find it’s a bit of a manual process when spinning up a server on AWS so I wanted to simplify it a bit.

There’s already a Neo4j AMI which installs Neo4j 3.2.0 and my colleague Michael pointed out that we could download APOC into the correct folder by writing a script and sending it as UserData.

I’ve been doing some work in JavaScript over the last two weeks so I thought I’d automate all the steps using the AWS library. You can find the full script on GitHub.

This script creates a key pair and a security group, and opens up that security group on ports 22 (SSH), 7474 (HTTP), 7473 (HTTPS), and 7687 (Bolt). The server created is an m3.medium, but you can change that to something else if you prefer.

The UserData part of the script is actually very simple:

#!/bin/bash
curl -L https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/3.2.0.3/apoc-3.2.0.3-all.jar -O
sudo cp apoc-3.2.0.3-all.jar /var/lib/neo4j/plugins/

We can run it like this:

$ node neo4j-with-apoc.js 
Creating a Neo4j server
Key pair created. Save this to a file - you'll need to use it if you want to ssh into the Neo4j server
-----BEGIN RSA PRIVATE KEY-----
<Private key details>
-----END RSA PRIVATE KEY-----
Created Group Id:<Group Id>
Opened Neo4j ports
Instance Id: <Instance Id>
Your Neo4j server is now ready!
You'll need to login to the server and change the default password:
https://ec2-ip-address.compute-1.amazonaws.com:7473 or http://ec2-ip-address.compute-1.amazonaws.com:7474
User:neo4j, Password:<Instance Id>

We’ll need to wait a few seconds for Neo4j to spin up, but it’ll be accessible at the URI specified.

Once it’s accessible we can log in with the username neo4j and the Instance Id as the password. We’ll then be instructed to choose a new password.
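
If you’d rather change the password from cypher-shell or a driver instead of the browser, there’s a procedure for that too. A sketch, assuming the dbms.security.changePassword procedure that ships with Neo4j 3.2:

CALL dbms.security.changePassword('<your-new-password>')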

We can then run the following query to check that APOC has been installed:

call dbms.procedures() YIELD name
WHERE name starts with "apoc"
RETURN count(*)
 
╒══════════╕
│"count(*)"│
╞══════════╡
│214       │
└──────────┘

Cool, it worked and we can now use Neo4j and APOC to our heart’s content! If we want to SSH into the server we can do that as well by first saving the private key printed on the command line to a file and then executing the following commands:

$ cat aws-private-key.pem
-----BEGIN RSA PRIVATE KEY-----
<Private key details>
-----END RSA PRIVATE KEY-----
 
$ chmod 600 aws-private-key.pem
 
$ ssh -i aws-private-key.pem ubuntu@ec2-ip-address.compute-1.amazonaws.com
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-1013-aws x86_64)
 
 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
 
  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud
 
106 packages can be updated.
1 update is a security update.
 
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

You can start/stop neo4j by running the following command:

$ /etc/init.d/neo4j 
Usage: /etc/init.d/neo4j {start|stop|status|restart|force-reload}

The other commands you may be used to finding in the bin folder can be found here:

$ ls -lh /usr/share/neo4j/bin/
total 48K
-rwxr-xr-x 1 neo4j adm   15K May  9 09:22 neo4j
-rwxr-xr-x 1 neo4j adm  5.6K May  9 09:22 neo4j-admin
-rwxr-xr-x 1 root  root  612 May 12 00:03 neo4j-awspasswd
-rwxr-xr-x 1 neo4j adm  5.6K May  9 09:22 neo4j-import
-rwxr-xr-x 1 neo4j adm  5.6K May  9 09:22 neo4j-shell
drwxr-xr-x 2 neo4j adm  4.0K May 11 22:13 tools

Let me know if this is helpful and if you have any suggestions/improvements.

Written by Mark Needham

September 30th, 2017 at 9:23 pm

Posted in neo4j

Neo4j: Cypher – Create Cypher map with dynamic keys

I was recently trying to create a map in a Cypher query but wanted to have dynamic keys in that map. I started off with this query:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN { dynamicKey: dynamicValue } AS map
 
 
╒══════════════════╕
│"map"             │
╞══════════════════╡
│{"dynamicKey":"b"}│
└──────────────────┘

Not quite what we want! We want dynamicKey to be evaluated rather than treated as a literal. As usual, APOC comes to the rescue!

In fact APOC has several functions that will help us out here. Let’s take a look at them:

CALL dbms.functions() yield name, description
WHERE name STARTS WITH "apoc.map.from"
RETURN name, description
 
╒═════════════════════╤═════════════════════════════════════════════════════╕
│"name"               │"description"                                        │
╞═════════════════════╪═════════════════════════════════════════════════════╡
│"apoc.map.fromLists" │"apoc.map.fromLists([keys],[values])"                │
├─────────────────────┼─────────────────────────────────────────────────────┤
│"apoc.map.fromNodes" │"apoc.map.fromNodes(label, property)"                │
├─────────────────────┼─────────────────────────────────────────────────────┤
│"apoc.map.fromPairs" │"apoc.map.fromPairs([[key,value],[key2,value2],...])"│
├─────────────────────┼─────────────────────────────────────────────────────┤
│"apoc.map.fromValues"│"apoc.map.fromValues([key1,value1,key2,value2,...])" │
└─────────────────────┴─────────────────────────────────────────────────────┘

So we can generate a map like this:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN apoc.map.fromValues([dynamicKey, dynamicValue]) AS map
 
╒═════════╕
│"map"    │
╞═════════╡
│{"a":"b"}│
└─────────┘

or like this:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN apoc.map.fromLists([dynamicKey], [dynamicValue]) AS map
 
╒═════════╕
│"map"    │
╞═════════╡
│{"a":"b"}│
└─────────┘

or even like this:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN apoc.map.fromPairs([[dynamicKey, dynamicValue]]) AS map
 
╒═════════╕
│"map"    │
╞═════════╡
│{"a":"b"}│
└─────────┘
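
These functions really shine when the keys and values come from the graph itself. A hypothetical sketch – assuming Person nodes with name and age properties, neither of which exist in this example – that builds a single map keyed by name:

MATCH (p:Person)
WITH collect([p.name, p.age]) AS pairs
RETURN apoc.map.fromPairs(pairs) AS ageByName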

Written by Mark Needham

September 19th, 2017 at 7:30 pm

Posted in neo4j

Neo4j: Cypher – Rounding of floating point numbers/BigDecimals

I was doing some data cleaning a few days ago and wanted to multiply a value by 1 million. My Cypher code to do this looked like this:

with "8.37" as rawNumeric 
RETURN toFloat(rawNumeric) * 1000000 AS numeric
 
╒═════════════════╕
│"numeric"        │
╞═════════════════╡
│8369999.999999999│
└─────────────────┘

Unfortunately that suffers from the classic rounding error when working with floating point numbers. I couldn’t figure out a way to solve it using pure Cypher, but there tends to be an APOC function to solve every problem and this was no exception.

I’m using Neo4j 3.2.3 so I downloaded the corresponding APOC jar and put it in a plugins directory:

$ ls -lh plugins/
total 3664
-rw-r--r--@ 1 markneedham  staff   1.8M  9 Aug 09:14 apoc-3.2.0.4-all.jar

I’m using Docker so I needed to tell it where my plugins folder lives:

$ docker run -v $PWD/plugins:/plugins \
    -p 7474:7474 \
    -p 7687:7687 \
    -e NEO4J_AUTH="none" \
    neo4j:3.2.3

Now we’re ready to try out our new function:

with "8.37" as rawNumeric 
RETURN apoc.number.exact.mul(rawNumeric,"1000000") AS apocConversion
 
╒════════════════╕
│"apocConversion"│
╞════════════════╡
│"8370000.00"    │
└────────────────┘

That almost does what we want, but the result is a string rather than numeric value. It’s not too difficult to fix though:

with "8.37" as rawNumeric 
RETURN toFloat(apoc.number.exact.mul(rawNumeric,"1000000")) AS apocConversion
 
╒════════════════╕
│"apocConversion"│
╞════════════════╡
│8370000         │
└────────────────┘

That’s more like it!
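
apoc.number.exact.mul is one of a family of exact arithmetic functions, which the APOC docs describe as being backed by Java’s BigDecimal (hence the post title). We can see what else is available by querying dbms.functions() – a quick sketch:

CALL dbms.functions() YIELD name, description
WHERE name STARTS WITH "apoc.number.exact"
RETURN name, description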

Written by Mark Needham

August 13th, 2017 at 7:23 am

Posted in neo4j

Docker: Building custom Neo4j images on Mac OS X

I sometimes need to create custom Neo4j Docker images to try things out and wanted to share my workflow, mostly for future Mark but also in case it’s useful to someone else.

There’s already a docker-neo4j repository so we’ll just tweak the files in there to achieve what we want.

$ git clone git@github.com:neo4j/docker-neo4j.git
$ cd docker-neo4j

If we want to build a Docker image for Neo4j Enterprise Edition we can run the following build target:

$ make clean build-enterprise
Makefile:9: *** This Make does not support .RECIPEPREFIX. Please use GNU Make 4.0 or later.  Stop.

Denied at the first hurdle! What version of make have we got on this machine?

$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
 
This program built for i386-apple-darwin11.3.0

We can sort that out by installing a newer version using brew:

$ brew install make
$ gmake --version
GNU Make 4.2.1
Built for x86_64-apple-darwin15.6.0
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

That’s more like it! brew installs make with the ‘g’ prefix and since I’m not sure if anything else on my system relies on the older version of make I won’t bother changing the symlink.

Let’s retry our original command:

$ gmake clean build-enterprise
Makefile:14: *** NEO4J_VERSION is not set.  Stop.

It’s still not happy with us! Let’s set that environment variable to the latest released version as of writing:

$ export NEO4J_VERSION="3.2.2"
$ gmake clean build-enterprise
...
Successfully built c16b6f2738de
Successfully tagged test/18334:latest
Neo4j 3.2.2-enterprise available as: test/18334

We can see that image in Docker land by running the following command:

$ docker images | head -n2
REPOSITORY                                     TAG                          IMAGE ID            CREATED             SIZE
test/18334                                     latest                       c16b6f2738de        4 minutes ago       303MB

If I wanted to deploy that image to my own Docker Hub I could run the following commands:

$ docker login --username=markhneedham
$ docker tag c16b6f2738de markhneedham/neo4j:3.2.2
$ docker push markhneedham/neo4j

Putting Neo4j Enterprise 3.2.2 on my Docker Hub isn’t very interesting though – that version is already on the official Neo4j Docker Hub.

I’ve actually been building versions of Neo4j against the HEAD of the Neo4j 3.2 branch (i.e. 3.2.3-SNAPSHOT), deploying those to S3, and then building a Docker image based on those archives.

To change the destination of the Neo4j artifact we need to tweak this line in the Makefile:

$ git diff Makefile
diff --git a/Makefile b/Makefile
index c77ed1f..98e05ca 100644
--- a/Makefile
+++ b/Makefile
@@ -15,7 +15,7 @@ ifndef NEO4J_VERSION
 endif
 
 tarball = neo4j-$(1)-$(2)-unix.tar.gz
-dist_site := http://dist.neo4j.org
+dist_site := https://s3-eu-west-1.amazonaws.com/core-edge.neotechnology.com/20170726
 series := $(shell echo "$(NEO4J_VERSION)" | sed -E 's/^([0-9]+\.[0-9]+)\..*/\1/')
 
 all: out/enterprise/.sentinel out/community/.sentinel

We can then update the Neo4j version environment variable:

$ export NEO4J_VERSION="3.2.3-SNAPSHOT"

And then repeat the Docker commands above. You’ll need to sub in your own Docker Hub user and repository names.

I’m using these custom images as part of Kubernetes deployments but you can use them anywhere that accepts a Docker container.
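
Once an image is pushed, running it anywhere is just the usual Docker invocation – a sketch using the repository name from the push commands above:

$ docker run \
    -p 7474:7474 \
    -p 7687:7687 \
    markhneedham/neo4j:3.2.2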

If anything in the post didn’t make sense or you want more clarification, let me know @markhneedham.

Written by Mark Needham

July 26th, 2017 at 10:20 pm

Posted in neo4j

PostgreSQL: ERROR: argument of WHERE must not return a set

In my last post I showed how to load and query data from the Strava API in PostgreSQL, and after executing some simple queries my next task was to query a more complex part of the JSON structure.

Strava allows users to create segments – portions of road or trail where athletes can compete for time.

I wanted to write a query to find all the times that I’d run a particular segment. e.g. the Akerman Road segment covers a road running North to South in Kennington/Stockwell in South London.

This segment has the id ‘6818475’ so we’ll need to look inside segment_efforts and then compare the value segment.id against this id.

I initially wrote this query to try and find the times I’d run this segment:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs
WHERE jsonb_array_elements(data->'segment_efforts')->'segment'->>'id' = '6818475'
 
ERROR:  argument of WHERE must not return a set
LINE 3: WHERE jsonb_array_elements(data->'segment_efforts')->'segmen...

This doesn’t work because jsonb_array_elements returns a set of rows, which means the WHERE expression evaluates to a set of boolean values rather than a single boolean, as Craig Ringer points out on Stack Overflow.

Instead we can use a LATERAL subquery to achieve our goal:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs r,
LATERAL jsonb_array_elements(r.data->'segment_efforts') segment
WHERE segment ->'segment'->>'id' = '6818475'
 
    id     |       startdate        | averagespeed 
-----------+------------------------+--------------
 455461182 | "2015-12-24T11:20:26Z" | 2.841
 440088621 | "2015-11-27T06:10:42Z" | 2.975
 407930503 | "2015-10-07T05:18:34Z" | 2.985
 317170464 | "2015-06-03T04:44:59Z" | 2.842
 312629236 | "2015-05-27T04:46:33Z" | 2.857
 277786711 | "2015-04-02T05:25:59Z" | 2.408
 226351235 | "2014-12-05T07:59:15Z" | 2.803
 225073326 | "2014-12-01T06:15:21Z" | 2.929
 224287690 | "2014-11-29T09:02:46Z" | 3.087
 223964715 | "2014-11-28T06:18:29Z" | 2.844
(10 ROWS)

Perfect!
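
The same LATERAL trick works for aggregation too. For example, we could count how many times each segment has been run – a sketch that assumes each segment effort embeds a segment object with a name field, which I haven’t verified against the API docs:

SELECT segment->'segment'->>'name' AS segment_name, COUNT(*) AS times_run
FROM runs r,
LATERAL jsonb_array_elements(r.data->'segment_efforts') segment
GROUP BY segment_name
ORDER BY times_run DESC
LIMIT 5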

Written by Mark Needham

May 1st, 2017 at 8:42 pm

Posted in PostgreSQL

Loading and analysing Strava runs using PostgreSQL JSON data type

In my last post I showed how to map Strava runs using data that I’d extracted from their /activities API, but the API returns a lot of other data that I discarded because I wasn’t sure what I should keep.

The API returns a nested JSON structure so the easiest solution would be to save each run as an individual file but I’ve always wanted to try out PostgreSQL’s JSON data type and this seemed like a good opportunity.

Creating a JSON ready PostgreSQL table

First up we need to create a database in which we’ll store our Strava data. Let’s name it appropriately:

CREATE DATABASE strava;
\connect strava

Now we can now create a table with one field with the JSON data type:

CREATE TABLE runs (
  id INTEGER NOT NULL,
  data JSONB
);
 
ALTER TABLE runs ADD PRIMARY KEY(id);

Easy enough. Now we’re ready to populate the table.
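
One optional extra before we do: jsonb columns can be indexed if we know we’ll be filtering on keys inside the document. A sketch using PostgreSQL’s GIN index support – none of the queries below strictly need it:

CREATE INDEX runs_data_idx ON runs USING GIN (data);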

Importing data from the Strava API

We can partially reuse the script from the last post, except rather than saving to a CSV file we’ll save to PostgreSQL using the psycopg2 library.

The script relies on a TOKEN environment variable. If you want to try this on your own Strava account you’ll need to create an application, which will give you a key.

extract-runs.py

import requests
import os
import json
import psycopg2
 
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}
 
with psycopg2.connect("dbname=strava user=markneedham") as conn:
    with conn.cursor() as cur:
        page = 1
        while True:
            r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers = headers)
            response = r.json()
 
            if len(response) == 0:
                break
            else:
                for activity in response:
                    r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                    json_response = r.json()
                    cur.execute("INSERT INTO runs (id, data) VALUES(%s, %s)", (activity["id"], json.dumps(json_response)))
                    conn.commit()
                page += 1

Querying Strava

We can now write some queries against our newly imported data.

My quickest runs

SELECT id, data->>'start_date' AS start_date, 
       (data->>'average_speed')::FLOAT AS speed 
FROM runs 
ORDER BY speed DESC 
LIMIT 5
 
    id     |      start_date      | speed 
-----------+----------------------+-------
 649253963 | 2016-07-22T05:18:37Z | 3.736
 914796614 | 2017-03-26T08:37:56Z | 3.614
 653703601 | 2016-07-26T05:25:07Z | 3.606
 548540883 | 2016-04-17T18:18:05Z | 3.604
 665006485 | 2016-08-05T04:11:21Z | 3.604
(5 ROWS)

My longest runs

SELECT id, data->>'start_date' AS start_date, 
       (data->>'distance')::FLOAT AS distance
FROM runs
ORDER BY distance DESC
LIMIT 5
 
    id     |      start_date      | distance 
-----------+----------------------+----------
 840246999 | 2017-01-22T10:20:33Z |  10764.1
 461124609 | 2016-01-02T08:42:47Z |  10457.9
 467634177 | 2016-01-10T18:48:47Z |  10434.5
 471467618 | 2016-01-16T12:33:28Z |  10359.3
 540811705 | 2016-04-10T07:26:55Z |   9651.6
(5 ROWS)

Runs this year

SELECT COUNT(*)
FROM runs
WHERE data->>'start_date' >= '2017-01-01 00:00:00'
 
 COUNT 
-------
    62
(1 ROW)

Runs per year

SELECT EXTRACT(YEAR FROM to_date(data->>'start_date', 'YYYY-mm-dd')) AS YEAR, 
       COUNT(*) 
FROM runs 
GROUP BY YEAR 
ORDER BY YEAR
 
 YEAR | COUNT 
------+-------
 2014 |    18
 2015 |   139
 2016 |   166
 2017 |    62
(4 ROWS)
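
We can combine the same building blocks for other aggregations, e.g. total distance per year. A sketch reusing the distance field from above – the Strava API reports distance in metres, so we divide by 1,000 to get kilometres:

SELECT EXTRACT(YEAR FROM to_date(data->>'start_date', 'YYYY-mm-dd')) AS year,
       SUM((data->>'distance')::FLOAT) / 1000 AS total_km
FROM runs
GROUP BY year
ORDER BY year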

That’s all for now. Next I’m going to learn how to query segments, which are stored inside a nested array inside the JSON document. Stay tuned for that in a future post.

Written by Mark Needham

May 1st, 2017 at 7:11 pm

Posted in PostgreSQL

Neo4j: apoc.date.parse – java.lang.IllegalArgumentException: Illegal pattern character ‘T’ / java.text.ParseException: Unparseable date: “2012-11-12T08:46:15Z”

I often find myself wanting to convert date strings into Unix timestamps using Neo4j’s APOC library and unfortunately some sources don’t use the format that apoc.date.parse expects.

e.g.

return apoc.date.parse("2012-11-12T08:46:15Z",'s') 
AS ts
 
Failed to invoke function `apoc.date.parse`: 
Caused by: java.lang.IllegalArgumentException: java.text.ParseException: Unparseable date: "2012-11-12T08:46:15Z"

We need to define the format explicitly, so the SimpleDateFormat documentation comes in handy. I tried the following:

return apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-ddTHH:mm:ssZ") 
AS ts
 
Failed to invoke function `apoc.date.parse`: 
Caused by: java.lang.IllegalArgumentException: Illegal pattern character 'T'

Hmmm, we need to quote the ‘T’ character – we can’t just include it in the pattern. Let’s try again:

return  apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-dd'T'HH:mm:ssZ") 
AS ts
 
Failed to invoke function `apoc.date.parse`: 
Caused by: java.lang.IllegalArgumentException: java.text.ParseException: Unparseable date: "2012-11-12T08:46:15Z"

The problem now is that we haven’t quoted the ‘Z’ but the error doesn’t indicate that – not sure why!

We can either quote the ‘Z’:

return  apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-dd'T'HH:mm:ss'Z'") 
AS ts
 
╒══════════╕
│"ts"      │
╞══════════╡
│1352709975│
└──────────┘

Or we can match the timezone using ‘XXX’:

return  apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-dd'T'HH:mm:ssXXX") 
AS ts
 
╒══════════╕
│"ts"      │
╞══════════╡
│1352709975│
└──────────┘
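
As a sanity check we can convert that timestamp back into a date string. This assumes APOC’s apoc.date.format function, which takes the same unit and pattern arguments as apoc.date.parse:

return apoc.date.format(1352709975, 's', "yyyy-MM-dd'T'HH:mm:ss'Z'") 
AS dateString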

Written by Mark Needham

March 6th, 2017 at 8:52 pm

Posted in neo4j

Neo4j: Graphing the ‘My name is…I work’ Twitter meme

Over the last few days I’ve been watching the chain of ‘My name is…’ tweets kicked off by DHH with interest. As I understand it, the idea is to show that coding interview riddles/hard tasks on a whiteboard are ridiculous.

Other people quoted that tweet and added their own piece and yesterday Eduardo Hernacki suggested that traversing this chain of tweets seemed tailor made for Neo4j.

Michael was quickly on the scene and created a Cypher query which calls the Twitter API and creates a Neo4j graph from the resulting JSON response. The only tricky bit is creating a ‘bearer token’ but Jason Kotchoff has a helpful gist showing how to generate one from your Twitter consumer key and consumer secret.

Now that we’ve got our bearer token let’s create a parameter to store it. Type the following in the Neo4j browser:

:param bearer: '<your-bearer-token-goes-here>'

Now we’re ready to query the Twitter API. We’ll start with the search API and find all tweets which contain the text ‘”my name” “I work”‘. That will return a JSON response containing lots of tweets. We’ll then create a node for each tweet it returns, a node for the user who posted the tweet, a node for the tweet it quotes, and relationships to glue them all together.

We’re going to use the apoc.load.jsonParams procedure from the APOC library to help us import the data. If you want to follow along you can use a Neo4j sandbox instance which comes with APOC installed. For your local Neo4j installation, grab the APOC jar and put it into your plugins folder before restarting Neo4j.

This is the query in full:

WITH 'https://api.twitter.com/1.1/search/tweets.json?count=100&result_type=recent&lang=en&q=' as url, {bearer} as bearer
 
CALL apoc.load.jsonParams(url + "%22my%20name%22%20is%22%20%22I%20work%22",{Authorization:"Bearer "+bearer},null) yield value
 
UNWIND value.statuses as status
WITH status, status.user as u, status.entities as e
WHERE status.quoted_status_id is not null
 
// create a node for the original tweet
MERGE (t:Tweet {id:status.id}) 
ON CREATE SET t.text=status.text,t.created_at=status.created_at,t.retweet_count=status.retweet_count, t.favorite_count=status.favorite_count
 
// create a node for the author + a POSTED relationship from the author to the tweet
MERGE (p:User {name:u.screen_name})
MERGE (p)-[:POSTED]->(t)
 
// create a MENTIONED relationship from the tweet to any users mentioned in the tweet
FOREACH (m IN e.user_mentions | MERGE (mu:User {name:m.screen_name}) MERGE (t)-[:MENTIONED]->(mu))
 
// create a node for the quoted tweet and create a QUOTED relationship from the original tweet to the quoted one
MERGE (q:Tweet {id:status.quoted_status_id})
MERGE (t)-[:QUOTED]->(q)
 
// repeat the above steps for the quoted tweet
WITH t as t0, status.quoted_status as status WHERE status is not null
WITH t0, status, status.user as u, status.entities as e
 
MERGE (t:Tweet {id:status.id}) 
ON CREATE SET t.text=status.text,t.created_at=status.created_at,t.retweet_count=status.retweet_count, t.favorite_count=status.favorite_count
 
MERGE (t0)-[:QUOTED]->(t)
 
MERGE (p:User {name:u.screen_name})
MERGE (p)-[:POSTED]->(t)
 
FOREACH (m IN e.user_mentions | MERGE (mu:User {name:m.screen_name}) MERGE (t)-[:MENTIONED]->(mu))
 
MERGE (q:Tweet {id:status.quoted_status_id})
MERGE (t)-[:QUOTED]->(q);

The resulting graph looks like this:

MATCH p=()-[r:QUOTED]->() RETURN p LIMIT 25

[Graph visualisation of the QUOTED tweets]
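
With the data loaded we can also run simple aggregations, e.g. to find the most prolific posters in our graph – a quick sketch over the User and Tweet nodes created above:

MATCH (u:User)-[:POSTED]->(t:Tweet)
RETURN u.name AS user, count(t) AS tweets
ORDER BY tweets DESC
LIMIT 10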

A more interesting query would be to find the path from DHH to Eduardo which we can find with the following query:

match path = (dhh:Tweet {id: 834146806594433025})<-[:QUOTED*]-(eduardo:Tweet{id: 836400531983724545})
UNWIND NODES(path) AS tweet
MATCH (tweet)<-[:POSTED]-(user)
RETURN tweet, user

This query:

  • starts from DHH’s tweet
  • traverses all QUOTED relationships until it finds Eduardo’s tweet
  • collects all those tweets and then finds the author
  • returns the tweet and the author

And this is the output:

[Graph visualisation of the path from DHH’s tweet to Eduardo’s]

I ran a couple of other queries against the Twitter API to hydrate some nodes that we hadn’t set all the properties on – you can see all the queries on this gist.

For the next couple of days I also have a sandbox running at https://10-0-1-157-32898.neo4jsandbox.com/browser/. You can log in using the credentials readonly/twitter.

If you have any questions/suggestions let me know in the comments, @markhneedham on twitter, or email the Neo4j DevRel team – devrel@neo4j.com.

Written by Mark Needham

February 28th, 2017 at 3:50 pm

Posted in neo4j

Neo4j: How do null values even work?

Every now and then I find myself wanting to import a CSV file into Neo4j and I always get confused with how to handle the various null values that can lurk within.

Let’s start with an example that doesn’t have a CSV file in sight. Consider the following list and my attempt to only return null values:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value = null
RETURN value
 
(no changes, no records)

Hmm, that’s weird. I’d have expected that to at least keep the first value in the collection. What about if we do the inverse?

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value <> null
RETURN value
 
(no changes, no records)

Still nothing! Let’s try returning the output of our comparisons rather than filtering rows:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN value, value = null AS outcome
 
╒═══════╤═════════╕
│"value"│"outcome"│
╞═══════╪═════════╡
│null   │null     │
├───────┼─────────┤
│"null" │null     │
├───────┼─────────┤
│""     │null     │
├───────┼─────────┤
│"Mark" │null     │
└───────┴─────────┘

Ok so that isn’t what we expected. Everything has an ‘outcome’ of ‘null’! What about if we want to check whether the value is the string “Mark”?

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN value = "Mark" AS outcome
 
╒═══════╤═════════╕
│"value"│"outcome"│
╞═══════╪═════════╡
│null   │null     │
├───────┼─────────┤
│"null" │false    │
├───────┼─────────┤
│""     │false    │
├───────┼─────────┤
│"Mark" │true     │
└───────┴─────────┘

From executing this query we learn that if one side of a comparison is null then the return value is always going to be null.

So how do we exclude a row if it’s null?

It turns out we have to use the ‘is’ keyword rather than using the equality operator. Let’s see what that looks like:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value is null
RETURN value
 
╒═══════╕
│"value"│
╞═══════╡
│null   │
└───────┘

And the positive case:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value is not null
RETURN value
 
╒═══════╕
│"value"│
╞═══════╡
│"null" │
├───────┤
│""     │
├───────┤
│"Mark" │
└───────┘

What if we want to get rid of empty strings?

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value <> ""
RETURN value
 
╒═══════╕
│"value"│
╞═══════╡
│"null" │
├───────┤
│"Mark" │
└───────┘

Interestingly that also gets rid of the null value which I hadn’t expected. But if we look for values matching the empty string:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value = ""
RETURN value
 
╒═══════╕
│"value"│
╞═══════╡
│""     │
└───────┘

It’s not there either! Hmm what’s going on here:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN value, value = "" AS isEmpty, value <> "" AS isNotEmpty
 
╒═══════╤═════════╤════════════╕
│"value"│"isEmpty"│"isNotEmpty"│
╞═══════╪═════════╪════════════╡
│null   │null     │null        │
├───────┼─────────┼────────────┤
│"null" │false    │true        │
├───────┼─────────┼────────────┤
│""     │true     │false       │
├───────┼─────────┼────────────┤
│"Mark" │false    │true        │
└───────┴─────────┴────────────┘

null values seem to get filtered out for every type of equality match unless we explicitly check that a value ‘is null’.
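
If we want to keep those rows but swap the null for a default value, Cypher’s coalesce function – which returns its first non-null argument – does the trick:

WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN coalesce(value, "(no value)") AS value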

So how do we use this knowledge when we’re parsing CSV files using Neo4j’s LOAD CSV tool?

Let’s say we have a CSV file that looks like this:

$ cat nulls.csv
name,company
"Mark",
"Michael",""
"Will",null
"Ryan","Neo4j"

So none of the first three rows have a value for ‘company’. I don’t have any value at all, Michael has an empty string, and Will has a null value. Let’s see how LOAD CSV interprets this:

load csv with headers from "file:///nulls.csv" AS row
RETURN row
 
╒═════════════════════════════════╕
│"row"                            │
╞═════════════════════════════════╡
│{"name":"Mark","company":null}   │
├─────────────────────────────────┤
│{"name":"Michael","company":""}  │
├─────────────────────────────────┤
│{"name":"Will","company":"null"} │
├─────────────────────────────────┤
│{"name":"Ryan","company":"Neo4j"}│
└─────────────────────────────────┘

We’ve got the full sweep of all the combinations from above. We’d like to create a Person node for each row but only create a Company node and associated ‘WORKS_FOR’ relationship if an actual company is defined – we don’t want to create a null company.

So we only want to create a company node and ‘WORKS_FOR’ relationship for the Ryan row.

The following query does the trick:

load csv with headers from "file:///nulls.csv" AS row
MERGE (p:Person {name: row.name})
WITH p, row
WHERE row.company <> "" AND row.company <> "null"
MERGE (c:Company {name: row.company})
MERGE (p)-[:WORKS_FOR]->(c)
 
Added 5 labels, created 5 nodes, set 5 properties, created 1 relationship, statement completed in 117 ms.

And if we visualise what’s been created:

[Graph visualisation: four Person nodes, with Ryan connected to the Neo4j Company node by a WORKS_FOR relationship]

Perfect. Perhaps this behaviour is obvious but it always trips me up so hopefully it’ll be useful to someone else as well!

There’s also a section on the Neo4j developer pages describing even more null scenarios that’s worth checking out.

Written by Mark Needham

February 22nd, 2017 at 11:28 pm

Posted in neo4j
