Mark Needham

Thoughts on Software Development

Archive for the ‘neo4j’ Category

Neo4j: Cypher – Property values can only be of primitive types or arrays thereof.

I ran into an interesting Cypher error message earlier this week while trying to create an array property on a node which I thought I’d share.

This was the Cypher query I wrote:

CREATE (:Person {id: [1, "mark", 2.0]})

which results in this error:

Neo.ClientError.Statement.TypeError
Property values can only be of primitive types or arrays thereof.

We are in fact storing an array of primitives, but the array mixes different types, which isn’t allowed. Let’s try coercing all the values to strings:

CREATE (:Person {id: [value in [1, "mark", 2.0] | toString(value)]})
 
Added 1 label, created 1 node, set 1 property, completed after 4 ms.

Success!
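Neo4j arrays must be homogeneous, which is why coercing every value to a string fixed the query. Purely for illustration (this snippet isn’t part of the original post), the same coercion looks like this in Python:

```python
values = [1, "mark", 2.0]

# coerce every element to a string, mirroring Cypher's
# [value IN list | toString(value)] list comprehension
coerced = [str(v) for v in values]

print(coerced)  # ['1', 'mark', '2.0']
```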

Written by Mark Needham

December 1st, 2017 at 10:09 pm

Posted in neo4j

Kubernetes: Copy a dataset to a StatefulSet’s PersistentVolume

In this post we’ll learn how to copy an existing dataset to the PersistentVolumes used by a Neo4j cluster running on Kubernetes.

Neo4j Clusters on Kubernetes

This post assumes that we’re familiar with deploying Neo4j on Kubernetes. I wrote an article on the Neo4j blog explaining this in more detail.

The StatefulSet we create for our core servers requires persistent storage, achieved via the PersistentVolumeClaim (PVC) primitive. A Neo4j cluster containing 3 core servers would have the following PVCs:

$ kubectl get pvc
NAME                            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datadir-neo-helm-neo4j-core-0   Bound     pvc-043efa91-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            standard       45s
datadir-neo-helm-neo4j-core-1   Bound     pvc-1737755a-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            standard       13s
datadir-neo-helm-neo4j-core-2   Bound     pvc-18696bfd-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            standard       11s

Each of the PVCs has a corresponding PersistentVolume (PV) that satisfies it:

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                                   STORAGECLASS   REASON    AGE
pvc-043efa91-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            Delete           Bound     default/datadir-neo-helm-neo4j-core-0   standard                 41m
pvc-1737755a-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            Delete           Bound     default/datadir-neo-helm-neo4j-core-1   standard                 40m
pvc-18696bfd-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            Delete           Bound     default/datadir-neo-helm-neo4j-core-2   standard                 40m

The PVCs and PVs are usually created at the same time that we deploy our StatefulSet. We need to intervene in that lifecycle so that our dataset is already in place before the StatefulSet is deployed.

Deploying an existing dataset

We can do this by following these steps:

  • Create PVCs with the above names manually
  • Attach pods to those PVCs
  • Copy our dataset onto those pods
  • Delete the pods
  • Deploy our Neo4j cluster

We can use the following script to create the PVCs and pods:

pvs.sh

#!/usr/bin/env bash
 
set -exuo pipefail
 
for i in $(seq 0 2); do
  cat <<EOF | kubectl apply -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: datadir-neo-helm-neo4j-core-${i}
  labels:
    app: neo4j
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
 
  cat <<EOF | kubectl apply -f -
kind: Pod
apiVersion: v1
metadata:
  name: neo4j-load-data-${i}
  labels:
    app: neo4j-loader
spec:
  volumes:
    - name: datadir-neo4j-core-${i}
      persistentVolumeClaim:
        claimName: datadir-neo-helm-neo4j-core-${i}
  containers:
    - name: neo4j-load-data-${i}
      image: ubuntu
      volumeMounts:
      - name: datadir-neo4j-core-${i}
        mountPath: /data
      command: ["/bin/bash", "-ecx", "while :; do printf '.'; sleep 5 ; done"]
EOF
 
done;

Let’s run that script to create our PVCs and pods:

$ ./pvs.sh 
++ seq 0 2
+ for i in $(seq 0 2)
+ cat
+ kubectl apply -f -
persistentvolumeclaim "datadir-neo-helm-neo4j-core-0" configured
+ cat
+ kubectl apply -f -
pod "neo4j-load-data-0" configured
+ for i in $(seq 0 2)
+ cat
+ kubectl apply -f -
persistentvolumeclaim "datadir-neo-helm-neo4j-core-1" configured
+ cat
+ kubectl apply -f -
pod "neo4j-load-data-1" configured
+ for i in $(seq 0 2)
+ cat
+ kubectl apply -f -
persistentvolumeclaim "datadir-neo-helm-neo4j-core-2" configured
+ cat
+ kubectl apply -f -
pod "neo4j-load-data-2" configured

Now we can copy our database onto the pods:

for i in $(seq 0 2); do
  kubectl cp graph.db.tar.gz neo4j-load-data-${i}:/data/
  kubectl exec neo4j-load-data-${i} -- bash -c "mkdir -p /data/databases && tar -xf /data/graph.db.tar.gz -C /data/databases"
done

graph.db.tar.gz contains a backup from a local database I created:

$ tar -tvf graph.db.tar.gz 
 
drwxr-xr-x  0 markneedham staff       0 24 Jul 15:23 graph.db/
drwxr-xr-x  0 markneedham staff       0 24 Jul 15:23 graph.db/certificates/
drwxr-xr-x  0 markneedham staff       0 17 Feb  2017 graph.db/index/
drwxr-xr-x  0 markneedham staff       0 24 Jul 15:22 graph.db/logs/
-rw-r--r--  0 markneedham staff    8192 24 Jul 15:23 graph.db/neostore
-rw-r--r--  0 markneedham staff     896 24 Jul 15:23 graph.db/neostore.counts.db.a
-rw-r--r--  0 markneedham staff    1344 24 Jul 15:23 graph.db/neostore.counts.db.b
-rw-r--r--  0 markneedham staff       9 24 Jul 15:23 graph.db/neostore.id
-rw-r--r--  0 markneedham staff   65536 24 Jul 15:23 graph.db/neostore.labelscanstore.db
...
-rw-------  0 markneedham staff     1700 24 Jul 15:23 graph.db/certificates/neo4j.key

We’ll run the following command to check the databases are in place:

$ kubectl exec neo4j-load-data-0 -- ls -lh /data/databases/
total 4.0K
drwxr-xr-x 6 501 staff 4.0K Jul 24 14:23 graph.db
 
$ kubectl exec neo4j-load-data-1 -- ls -lh /data/databases/
total 4.0K
drwxr-xr-x 6 501 staff 4.0K Jul 24 14:23 graph.db
 
$ kubectl exec neo4j-load-data-2 -- ls -lh /data/databases/
total 4.0K
drwxr-xr-x 6 501 staff 4.0K Jul 24 14:23 graph.db

All good so far. The pods have done their job so we’ll tear those down:

$ kubectl delete pods -l app=neo4j-loader
pod "neo4j-load-data-0" deleted
pod "neo4j-load-data-1" deleted
pod "neo4j-load-data-2" deleted

We’re now ready to deploy our Neo4j cluster.

helm install incubator/neo4j --name neo-helm --wait --set authEnabled=false

Finally we’ll run a Cypher query to check that the Neo4j servers used the database that we uploaded:

$ kubectl exec neo-helm-neo4j-core-0 -- bin/cypher-shell "match (n) return count(*)"
count(*)
32314
 
$ kubectl exec neo-helm-neo4j-core-1 -- bin/cypher-shell "match (n) return count(*)"
count(*)
32314
 
$ kubectl exec neo-helm-neo4j-core-2 -- bin/cypher-shell "match (n) return count(*)"
count(*)
32314

Success!

We could achieve similar results by using an init container but I haven’t had a chance to try out that approach yet. If you give it a try let me know in the comments and I’ll add it to the post.

Written by Mark Needham

November 18th, 2017 at 12:44 pm

Posted in Kubernetes,neo4j

Kubernetes 1.8: Using Cronjobs to take Neo4j backups

With the release of Kubernetes 1.8, Cronjobs have graduated to beta, which means we can now more easily run Neo4j backup jobs against Kubernetes clusters.

Before we learn how to write a Cronjob, let’s first create a local Kubernetes cluster and deploy Neo4j.

Spin up Kubernetes & Helm

minikube start --memory 8192
helm init && kubectl rollout status -w deployment/tiller-deploy --namespace=kube-system

Deploy a Neo4j cluster

helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com/
helm install incubator/neo4j --name neo-helm --wait --set authEnabled=false,core.extraVars.NEO4J_dbms_backup_address=0.0.0.0:6362

Populate our cluster with data

We can run the following command to check our cluster is ready:

kubectl exec neo-helm-neo4j-core-0 \
  -- bin/cypher-shell --format verbose \
  "CALL dbms.cluster.overview() YIELD id, role RETURN id, role"
 
+-----------------------------------------------------+
| id                                     | role       |
+-----------------------------------------------------+
| "0b3bfe6c-6a68-4af5-9dd2-e96b564df6e5" | "LEADER"   |
| "09e9bee8-bdc5-4e95-926c-16ea8213e6e7" | "FOLLOWER" |
| "859b9b56-9bfc-42ae-90c3-02cedacfe720" | "FOLLOWER" |
+-----------------------------------------------------+

Now let’s create some data:

kubectl exec neo-helm-neo4j-core-0 \
  -- bin/cypher-shell --format verbose \
  "UNWIND range(0,1000) AS id CREATE (:Person {id: id})"
 
0 rows available after 653 ms, consumed after another 0 ms
Added 1001 nodes, Set 1001 properties, Added 1001 labels

Now that our Neo4j cluster is running and contains data we want to take regular backups.

Neo4j backups

The Neo4j backup tool supports full and incremental backups.

Full backup

A full backup copies the store files to the backup location, along with any transactions that happened while the backup was running. Those transactions are then applied to the backed-up copy of the database.

Incremental backup

An incremental backup is triggered if there is already a Neo4j database in the backup location. In this case there will be no copying of store files. Instead the tool will copy any new transactions from Neo4j and apply them to the backup.
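The mode selection described above can be sketched in a few lines of Python (an illustrative model of the behaviour only, not the tool’s actual implementation):

```python
import os
import tempfile

def backup_mode(backup_dir: str) -> str:
    """Model the rule described above: a non-empty backup directory
    means an incremental backup, otherwise a full backup runs."""
    if os.path.isdir(backup_dir) and os.listdir(backup_dir):
        return "incremental"
    return "full"

# demo against a scratch directory
target = tempfile.mkdtemp()
first_run = backup_mode(target)    # nothing there yet -> full
open(os.path.join(target, "neostore"), "w").close()
second_run = backup_mode(target)   # store files present -> incremental

print(first_run, second_run)  # full incremental
```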

Backup Cronjob

We will use a Cronjob to execute a backup. In the example below we attach a PersistentVolumeClaim to our Cronjob so that we can see both of the backup scenarios in action.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: backupdir-neo4j
  labels:
    app: neo4j-backup
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: neo4j-backup
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
            - name: backupdir-neo4j
              persistentVolumeClaim:
                claimName: backupdir-neo4j
          containers:
            - name: neo4j-backup
              image: neo4j:3.3.0-enterprise
              env:
                - name: NEO4J_ACCEPT_LICENSE_AGREEMENT
                  value: "yes"
              volumeMounts:
              - name: backupdir-neo4j
                mountPath: /tmp
              command: ["bin/neo4j-admin",  "backup", "--backup-dir", "/tmp", "--name", "backup", "--from", "neo-helm-neo4j-core-2.neo-helm-neo4j.default.svc.cluster.local:6362"]
          restartPolicy: OnFailure

$ kubectl apply -f cronjob.yaml 
cronjob "neo4j-backup" created
kubectl get jobs -w
NAME                      DESIRED   SUCCESSFUL   AGE
neo4j-backup-1510940940   1         1            34s

Now let’s view the logs from the job:

kubectl logs $(kubectl get pods -a --selector=job-name=neo4j-backup-1510940940 --output=jsonpath={.items..metadata.name})
 
Doing full backup...
2017-11-17 17:49:05.920+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.nodestore.db.labels
...
2017-11-17 17:49:06.038+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.labelscanstore.db 48.00 kB
2017-11-17 17:49:06.038+0000 INFO [o.n.c.s.StoreCopyClient] Done, copied 18 files
2017-11-17 17:49:06.094+0000 INFO [o.n.b.BackupService] Start recovering store
2017-11-17 17:49:07.669+0000 INFO [o.n.b.BackupService] Finish recovering store
Doing consistency check...
2017-11-17 17:49:07.716+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_2[v0.A.8] record format from store /tmp/backup
2017-11-17 17:49:07.716+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_2[v0.A.8]
2017-11-17 17:49:07.755+0000 INFO [o.n.m.MetricsExtension] Initiating metrics...
....................  10%
...
.................... 100%
Backup complete.

All good so far. Now let’s create more data:

kubectl exec neo-helm-neo4j-core-0 \
  -- bin/cypher-shell --format verbose \
  "UNWIND range(0,1000) AS id CREATE (:Person {id: id})"
 
 
0 rows available after 114 ms, consumed after another 0 ms
Added 1001 nodes, Set 1001 properties, Added 1001 labels

And wait for another backup job to run:

kubectl get jobs -w
NAME                      DESIRED   SUCCESSFUL   AGE
neo4j-backup-1510941180   1         1            2m
neo4j-backup-1510941240   1         1            1m
neo4j-backup-1510941300   1         1            17s
kubectl logs $(kubectl get pods -a --selector=job-name=neo4j-backup-1510941300 --output=jsonpath={.items..metadata.name})
 
Destination is not empty, doing incremental backup...
Doing consistency check...
2017-11-17 17:55:07.958+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_2[v0.A.8] record format from store /tmp/backup
2017-11-17 17:55:07.959+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_2[v0.A.8]
2017-11-17 17:55:07.995+0000 INFO [o.n.m.MetricsExtension] Initiating metrics...
....................  10%
...
.................... 100%
Backup complete.

If we were to extend this job further we could have it copy each backup it takes to an S3 bucket but I’ll leave that for another day.

Written by Mark Needham

November 17th, 2017 at 6:10 pm

Posted in Kubernetes,neo4j

Neo4j Browser: Expected entity id to be an integral value

I came across an interesting error while writing a Cypher query that used parameters in the Neo4j browser which I thought I should document for future me.

We’ll start with a graph that has 1,000 people:

unwind range(0,1000) AS id
create (:Person {id: id})

Now we’ll try and retrieve some of those people via a parameter lookup:

:param ids: [0]
match (p:Person) where p.id in {ids}
return p
 
╒════════╕
│"p"     │
╞════════╡
│{"id":0}│
└────────┘

All good so far. Now what about if we try and look them up by their internal node id instead:

match (p:Person) where id(p) in {ids}
return p
 
Neo.ClientError.Statement.TypeError
Expected entity id to be an integral value

Hmm, that was unexpected. It turns out that to get this query to work we need to cast each of the integer values in the {ids} array to a 64-bit integer using the toInteger function:

match (p:Person) where id(p) in [id in {ids} | toInteger(id)]
return p
 
╒════════╕
│"p"     │
╞════════╡
│{"id":0}│
└────────┘

Success!

Written by Mark Needham

November 6th, 2017 at 4:17 pm

Posted in neo4j

Neo4j: Traversal query timeout

I’ve been spending some of my spare time over the last few weeks creating an application that generates running routes from Open Roads data – transformed and imported into Neo4j of course!

I’ve created a user defined procedure which combines several shortest path queries, but I wanted to exit any of these shortest path searches if they were taking too long. My code without a timeout looks like this:

StandardExpander orderedExpander = new OrderedByTypeExpander()
    .add( RelationshipType.withName( "CONNECTS" ), Direction.BOTH );
 
PathFinder<Path> shortestPathFinder = GraphAlgoFactory.shortestPath( orderedExpander, 250 );
 
...

There are several places where we could check the time elapsed, but the expand method in the Expander seemed like an obvious one to me. I wrote my own Expander class which looks like this:

public class TimeConstrainedExpander implements PathExpander
{
    private final StandardExpander expander;
    private final long startTime;
    private final Clock clock;
    private int pathsExpanded = 0;
    private long timeLimitInMillis;
 
    public TimeConstrainedExpander( StandardExpander expander, Clock clock, long timeLimitInMillis )
    {
        this.expander = expander;
        this.clock = clock;
        this.startTime = clock.instant().toEpochMilli();
        this.timeLimitInMillis = timeLimitInMillis;
    }
 
    @Override
    public Iterable<Relationship> expand( Path path, BranchState state )
    {
        long timeSoFar = clock.instant().toEpochMilli() - startTime;
        if ( timeSoFar > timeLimitInMillis )
        {
            return Collections.emptyList();
        }
 
        return expander.expand( path, state );
    }
 
    @Override
    public PathExpander reverse()
    {
        return expander.reverse();
    }
}

The code snippet from earlier now needs to be updated to use our new class, which isn’t too tricky:

StandardExpander orderedExpander = new OrderedByTypeExpander()
    .add( RelationshipType.withName( "CONNECTS" ), Direction.BOTH );
 
TimeConstrainedExpander expander = new TimeConstrainedExpander(orderedExpander, 
    Clock.systemUTC(), 200);
 
PathFinder<Path> shortestPathFinder = GraphAlgoFactory.shortestPath( expander, 250 );
...

I’m not sure if this is the best way to achieve what I want but after failing with several other approaches at least this one actually works!

Written by Mark Needham

October 31st, 2017 at 9:43 pm

Posted in neo4j

Neo4j: Cypher – Deleting duplicate nodes

I had a problem on a graph I was working on recently where I’d managed to create duplicate nodes because I hadn’t applied any unique constraints.

I wanted to remove the duplicates, and came across Jimmy Ruts’ excellent post which shows some ways to do this.

Let’s first create a graph with some duplicate nodes to play with:

UNWIND range(0, 100) AS id
CREATE (p1:Person {id: toInteger(rand() * id)})
MERGE (p2:Person {id: toInteger(rand() * id)})
MERGE (p3:Person {id: toInteger(rand() * id)})
MERGE (p4:Person {id: toInteger(rand() * id)})
CREATE (p1)-[:KNOWS]->(p2)
CREATE (p1)-[:KNOWS]->(p3)
CREATE (p1)-[:KNOWS]->(p4)
 
Added 173 labels, created 173 nodes, set 173 properties, created 5829 relationships, completed after 408 ms.

How do we find the duplicate nodes?

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒══════════════════════╤═════════════╕
│"ids"                 │"size(nodes)"│
╞══════════════════════╪═════════════╡
│[1,1,1,1,1,1,1,1]     │8            │
├──────────────────────┼─────────────┤
│[0,0,0,0,0,0,0,0]     │8            │
├──────────────────────┼─────────────┤
│[17,17,17,17,17,17,17]│7            │
├──────────────────────┼─────────────┤
│[4,4,4,4,4,4,4]       │7            │
├──────────────────────┼─────────────┤
│[2,2,2,2,2,2]         │6            │
├──────────────────────┼─────────────┤
│[5,5,5,5,5,5]         │6            │
├──────────────────────┼─────────────┤
│[19,19,19,19,19,19]   │6            │
├──────────────────────┼─────────────┤
│[11,11,11,11,11]      │5            │
├──────────────────────┼─────────────┤
│[25,25,25,25,25]      │5            │
├──────────────────────┼─────────────┤
│[43,43,43,43,43]      │5            │
└──────────────────────┴─────────────┘

Let’s zoom in on all the people with an id of 1 and work out how many relationships they have. Our plan is to keep the node that has the most connections and get rid of the others.

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
WITH nodes ORDER BY size(nodes) DESC
LIMIT 1
UNWIND nodes AS n 
RETURN n.id, id(n) AS internalId, size((n)--()) AS rels
ORDER BY rels DESC
 
╒══════╤════════════╤══════╕
│"n.id"│"internalId"│"rels"│
╞══════╪════════════╪══════╡
│1     │175         │1284  │
├──────┼────────────┼──────┤
│1     │184         │721   │
├──────┼────────────┼──────┤
│1     │180         │580   │
├──────┼────────────┼──────┤
│1     │2           │391   │
├──────┼────────────┼──────┤
│1     │195         │361   │
├──────┼────────────┼──────┤
│1     │199         │352   │
├──────┼────────────┼──────┤
│1     │302         │5     │
├──────┼────────────┼──────┤
│1     │306         │1     │
└──────┴────────────┴──────┘

So in this example we want to keep the node with internal id 175, which has 1284 relationships, and delete the rest.

To make things easy we need the node with the most relationships to be first (or last) in our list. We can ensure that’s the case by ordering the nodes before we group them.

MATCH (p:Person)
WITH p 
ORDER BY p.id, size((p)--()) DESC
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | {id: n.id,rels: size((n)--()) } ] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒══════════════════════════════════════════════════════════════════════╤═════════════╕
│"ids"                                                                 │"size(nodes)"│
╞══════════════════════════════════════════════════════════════════════╪═════════════╡
│[{"id":1,"rels":1284},{"id":1,"rels":721},{"id":1,"rels":580},{"id":1,│8            │
│"rels":391},{"id":1,"rels":361},{"id":1,"rels":352},{"id":1,"rels":5},│             │
│{"id":1,"rels":1}]                                                    │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":0,"rels":2064},{"id":0,"rels":2059},{"id":0,"rels":1297},{"id":│8            │
│0,"rels":1124},{"id":0,"rels":995},{"id":0,"rels":928},{"id":0,"rels":│             │
│730},{"id":0,"rels":702}]                                             │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":17,"rels":153},{"id":17,"rels":105},{"id":17,"rels":81},{"id":1│7            │
│7,"rels":31},{"id":17,"rels":15},{"id":17,"rels":14},{"id":17,"rels":1│             │
│}]                                                                    │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":4,"rels":394},{"id":4,"rels":320},{"id":4,"rels":250},{"id":4,"│7            │
│rels":201},{"id":4,"rels":162},{"id":4,"rels":162},{"id":4,"rels":14}]│             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":2,"rels":514},{"id":2,"rels":329},{"id":2,"rels":318},{"id":2,"│6            │
│rels":241},{"id":2,"rels":240},{"id":2,"rels":2}]                     │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":5,"rels":487},{"id":5,"rels":378},{"id":5,"rels":242},{"id":5,"│6            │
│rels":181},{"id":5,"rels":158},{"id":5,"rels":8}]                     │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":19,"rels":153},{"id":19,"rels":120},{"id":19,"rels":84},{"id":1│6            │
│9,"rels":53},{"id":19,"rels":45},{"id":19,"rels":1}]                  │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":11,"rels":222},{"id":11,"rels":192},{"id":11,"rels":172},{"id":│5            │
│11,"rels":152},{"id":11,"rels":89}]                                   │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":25,"rels":133},{"id":25,"rels":107},{"id":25,"rels":98},{"id":2│5            │
│5,"rels":15},{"id":25,"rels":2}]                                      │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":43,"rels":92},{"id":43,"rels":85},{"id":43,"rels":9},{"id":43,"│5            │
│rels":5},{"id":43,"rels":1}]                                          │             │
└──────────────────────────────────────────────────────────────────────┴─────────────┘

Now it’s time to delete the duplicates:

MATCH (p:Person)
WITH p 
ORDER BY p.id, size((p)--()) DESC
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
UNWIND nodes[1..] AS n
DETACH DELETE n
 
Deleted 143 nodes, deleted 13806 relationships, completed after 29 ms.
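The overall strategy (group nodes by id, keep the best connected node in each group, delete the rest) can also be sketched outside Cypher. An illustrative Python version, with nodes represented as plain dicts:

```python
from collections import defaultdict

def nodes_to_delete(nodes):
    """Group nodes by their id property and return the internal ids
    of every node except the best connected one in each group."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node["id"]].append(node)

    doomed = []
    for group in groups.values():
        # most relationships first, mirroring the ORDER BY in the query
        group.sort(key=lambda n: n["rels"], reverse=True)
        doomed.extend(n["internal"] for n in group[1:])
    return doomed

nodes = [
    {"id": 1, "internal": 175, "rels": 1284},
    {"id": 1, "internal": 184, "rels": 721},
    {"id": 2, "internal": 300, "rels": 10},
]
print(nodes_to_delete(nodes))  # [184]
```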

Now if we run our duplicate query:

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
(no changes, no records)

What about if we remove the WHERE clause?

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒═════╤═════════════╕
│"ids"│"size(nodes)"│
╞═════╪═════════════╡
│[23] │1            │
├─────┼─────────────┤
│[86] │1            │
├─────┼─────────────┤
│[77] │1            │
├─────┼─────────────┤
│[59] │1            │
├─────┼─────────────┤
│[50] │1            │
├─────┼─────────────┤
│[32] │1            │
├─────┼─────────────┤
│[41] │1            │
├─────┼─────────────┤
│[53] │1            │
├─────┼─────────────┤
│[44] │1            │
├─────┼─────────────┤
│[8]  │1            │
└─────┴─────────────┘

Hoorah, no more duplicates! Finally, let’s check that we kept the node we expected to keep. We expect it to have an ‘internalId’ of 175:

MATCH (p:Person {id: 1})
RETURN size((p)--()), id(p) AS internalId
 
╒═══════════════╤════════════╕
│"size((p)--())"│"internalId"│
╞═══════════════╪════════════╡
│242            │175         │
└───────────────┴────────────┘

Which it does! There are many fewer relationships than we had before because a lot of those relationships were to duplicate nodes that we’ve now deleted.

If we want to go a step further we could ‘merge’ the duplicate node’s relationships onto the nodes that we did keep, but that’s for another post!

Written by Mark Needham

October 6th, 2017 at 4:13 pm

Posted in neo4j

AWS: Spinning up a Neo4j instance with APOC installed

One of the first things I do after installing Neo4j is install the APOC library, but I find it’s a bit of a manual process when spinning up a server on AWS so I wanted to simplify it a bit.

There’s already a Neo4j AMI which installs Neo4j 3.2.0 and my colleague Michael pointed out that we could download APOC into the correct folder by writing a script and sending it as UserData.

I’ve been doing some work in JavaScript over the last two weeks so I thought I’d automate all the steps using the AWS library. You can find the full script on GitHub.

This script creates a key pair and a security group, and opens up that security group on ports 22 (SSH), 7474 (HTTP), 7473 (HTTPS), and 7687 (Bolt). The instance created is an m3.medium, but you can change that to something else if you prefer.

The UserData part of the script is actually very simple:

#!/bin/bash
curl -L https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/3.2.0.3/apoc-3.2.0.3-all.jar -O
sudo cp apoc-3.2.0.3-all.jar /var/lib/neo4j/plugins/

We can run it like this:

$ node neo4j-with-apoc.js 
Creating a Neo4j server
Key pair created. Save this to a file - you'll need to use it if you want to ssh into the Neo4j server
-----BEGIN RSA PRIVATE KEY-----
<Private key details>
-----END RSA PRIVATE KEY-----
Created Group Id:<Group Id>
Opened Neo4j ports
Instance Id: <Instance Id>
Your Neo4j server is now ready!
You'll need to login to the server and change the default password:
https://ec2-ip-address.compute-1.amazonaws.com:7473 or http://ec2-ip-address.compute-1.amazonaws.com:7474
User:neo4j, Password:<Instance Id>

We’ll need to wait a few seconds for Neo4j to spin up, but it’ll be accessible at the URI specified.

Once it’s accessible we can log in with the username neo4j and the instance id as the password. We’ll then be instructed to choose a new password.

We can then run the following query to check that APOC has been installed:

call dbms.procedures() YIELD name
WHERE name starts with "apoc"
RETURN count(*)
 
╒══════════╕
│"count(*)"│
╞══════════╡
│214       │
└──────────┘

Cool, it worked and we can now use Neo4j and APOC to our heart’s content! If we want to SSH into the server we can do that as well by first saving the private key printed on the command line to a file and then executing the following command:

$ cat aws-private-key.pem
-----BEGIN RSA PRIVATE KEY-----
<Private key details>
-----END RSA PRIVATE KEY-----
 
$ chmod 600 aws-private-key.pem
 
$ ssh -i aws-private-key.pem ubuntu@ec2-ip-address.compute-1.amazonaws.com
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-1013-aws x86_64)
 
 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
 
  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud
 
106 packages can be updated.
1 update is a security update.
 
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

You can start/stop neo4j by running the following command:

$ /etc/init.d/neo4j 
Usage: /etc/init.d/neo4j {start|stop|status|restart|force-reload}

The other commands you may be used to finding in the bin folder can be found here:

$ ls -lh /usr/share/neo4j/bin/
total 48K
-rwxr-xr-x 1 neo4j adm   15K May  9 09:22 neo4j
-rwxr-xr-x 1 neo4j adm  5.6K May  9 09:22 neo4j-admin
-rwxr-xr-x 1 root  root  612 May 12 00:03 neo4j-awspasswd
-rwxr-xr-x 1 neo4j adm  5.6K May  9 09:22 neo4j-import
-rwxr-xr-x 1 neo4j adm  5.6K May  9 09:22 neo4j-shell
drwxr-xr-x 2 neo4j adm  4.0K May 11 22:13 tools

Let me know if this is helpful and if you have any suggestions/improvements.

Written by Mark Needham

September 30th, 2017 at 9:23 pm

Posted in neo4j

Neo4j: Cypher – Create Cypher map with dynamic keys

I was recently trying to create a map in a Cypher query but wanted to have dynamic keys in that map. I started off with this query:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN { dynamicKey: dynamicValue } AS map
 
 
╒══════════════════╕
│"map"             │
╞══════════════════╡
│{"dynamicKey":"b"}│
└──────────────────┘

Not quite what we want! We want dynamicKey to be evaluated rather than treated as a literal. As usual, APOC comes to the rescue!

In fact APOC has several functions that will help us out here. Let’s take a look at them:

CALL dbms.functions() yield name, description
WHERE name STARTS WITH "apoc.map.from"
RETURN name, description
 
╒═════════════════════╤═════════════════════════════════════════════════════╕
│"name"               │"description"                                        │
╞═════════════════════╪═════════════════════════════════════════════════════╡
│"apoc.map.fromLists" │"apoc.map.fromLists([keys],[values])"                │
├─────────────────────┼─────────────────────────────────────────────────────┤
│"apoc.map.fromNodes" │"apoc.map.fromNodes(label, property)"                │
├─────────────────────┼─────────────────────────────────────────────────────┤
│"apoc.map.fromPairs" │"apoc.map.fromPairs([[key,value],[key2,value2],...])"│
├─────────────────────┼─────────────────────────────────────────────────────┤
│"apoc.map.fromValues"│"apoc.map.fromValues([key1,value1,key2,value2,...])" │
└─────────────────────┴─────────────────────────────────────────────────────┘

So we can generate a map like this:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN apoc.map.fromValues([dynamicKey, dynamicValue]) AS map
 
╒═════════╕
│"map"    │
╞═════════╡
│{"a":"b"}│
└─────────┘

or like this:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN apoc.map.fromLists([dynamicKey], [dynamicValue]) AS map
 
╒═════════╕
│"map"    │
╞═════════╡
│{"a":"b"}│
└─────────┘

or even like this:

WITH "a" as dynamicKey, "b" as dynamicValue
RETURN apoc.map.fromPairs([[dynamicKey, dynamicValue]]) AS map
 
╒═════════╕
│"map"    │
╞═════════╡
│{"a":"b"}│
└─────────┘
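As an aside, the surprise here is specific to Cypher map literals. Many host languages evaluate key expressions by default; in Python, for contrast:

```python
dynamic_key = "a"
dynamic_value = "b"

# unlike a Cypher map literal, Python evaluates the key expression
mapping = {dynamic_key: dynamic_value}

print(mapping)  # {'a': 'b'}
```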

Written by Mark Needham

September 19th, 2017 at 7:30 pm

Posted in neo4j

Neo4j: Cypher – Rounding of floating point numbers/BigDecimals

I was doing some data cleaning a few days ago and wanted to multiply a value by 1 million. My Cypher code to do this looked like this:

WITH "8.37" as rawNumeric
RETURN toFloat(rawNumeric) * 1000000 AS numeric
 
╒═════════════════╕
│"numeric"        │
╞═════════════════╡
│8369999.999999999│
└─────────────────┘

Unfortunately that suffers from the classic rounding error when working with floating point numbers. I couldn’t figure out a way to solve it using pure Cypher, but there tends to be an APOC function to solve every problem and this was no exception.
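The same rounding error is easy to reproduce outside Neo4j. A quick Python sketch shows what's going on: 8.37 has no exact binary floating point representation, whereas decimal arithmetic (the equivalent of Java's BigDecimal, which is what the APOC exact functions work with) keeps the value exact:

```python
from decimal import Decimal

raw = "8.37"

# Binary floating point: 8.37 can't be represented exactly as a double,
# so the multiplication picks up a rounding error.
print(float(raw) * 1000000)    # 8369999.999999999

# Decimal arithmetic keeps the value exact, much like BigDecimal on the JVM.
print(Decimal(raw) * 1000000)  # 8370000.00
```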

I’m using Neo4j 3.2.3 so I downloaded the corresponding APOC jar and put it in a plugins directory:

$ ls -lh plugins/
total 3664
-rw-r--r--@ 1 markneedham  staff   1.8M  9 Aug 09:14 apoc-3.2.0.4-all.jar

I’m using Docker so I needed to tell that where my plugins folder lives:

$ docker run -v $PWD/plugins:/plugins \
    -p 7474:7474 \
    -p 7687:7687 \
    -e NEO4J_AUTH="none" \
    neo4j:3.2.3

Now we’re ready to try out our new function:

WITH "8.37" as rawNumeric
RETURN apoc.number.exact.mul(rawNumeric,"1000000") AS apocConversion
 
╒════════════════╕
│"apocConversion"│
╞════════════════╡
│"8370000.00"    │
└────────────────┘

That almost does what we want, but the result is a string rather than a numeric value. It’s not too difficult to fix though:

WITH "8.37" as rawNumeric
RETURN toFloat(apoc.number.exact.mul(rawNumeric,"1000000")) AS apocConversion
 
╒════════════════╕
│"apocConversion"│
╞════════════════╡
│8370000         │
└────────────────┘

That’s more like it!

Written by Mark Needham

August 13th, 2017 at 7:23 am

Posted in neo4j

Tagged with , ,

Docker: Building custom Neo4j images on Mac OS X

without comments

I sometimes need to create custom Neo4j Docker images to try things out and wanted to share my workflow, mostly for future Mark but also in case it’s useful to someone else.

There’s already a docker-neo4j repository so we’ll just tweak the files in there to achieve what we want.

$ git clone git@github.com:neo4j/docker-neo4j.git
$ cd docker-neo4j

If we want to build a Docker image for Neo4j Enterprise Edition we can run the following build target:

$ make clean build-enterprise
Makefile:9: *** This Make does not support .RECIPEPREFIX. Please use GNU Make 4.0 or later.  Stop.

Denied at the first hurdle! What version of make have we got on this machine?

$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
 
This program built for i386-apple-darwin11.3.0

We can sort that out by installing a newer version using brew:

$ brew install make
$ gmake --version
GNU Make 4.2.1
Built for x86_64-apple-darwin15.6.0
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

That’s more like it! brew installs make with the ‘g’ prefix, and since I’m not sure whether anything else on my system relies on the older version of make, I won’t bother changing the symlink.

Let’s retry our original command:

$ gmake clean build-enterprise
Makefile:14: *** NEO4J_VERSION is not set.  Stop.

It’s still not happy with us! Let’s set that environment variable to the latest released version as of writing:

$ export NEO4J_VERSION="3.2.2"
$ gmake clean build-enterprise
...
Successfully built c16b6f2738de
Successfully tagged test/18334:latest
Neo4j 3.2.2-enterprise available as: test/18334

We can see that image in Docker land by running the following command:

$ docker images | head -n2
REPOSITORY                                     TAG                          IMAGE ID            CREATED             SIZE
test/18334                                     latest                       c16b6f2738de        4 minutes ago       303MB

If I wanted to deploy that image to my own Docker Hub I could run the following commands:

$ docker login --username=markhneedham
$ docker tag c16b6f2738de markhneedham/neo4j:3.2.2
$ docker push markhneedham/neo4j

Putting Neo4j Enterprise 3.2.2 on my Docker Hub isn’t very interesting though – that version is already on the official Neo4j Docker Hub.

I’ve actually been building versions of Neo4j against the HEAD of the Neo4j 3.2 branch (i.e. 3.2.3-SNAPSHOT), deploying those to S3, and then building a Docker image based on those archives.

To change where the Neo4j artifact is downloaded from, we need to tweak this line in the Makefile:

$ git diff Makefile
diff --git a/Makefile b/Makefile
index c77ed1f..98e05ca 100644
--- a/Makefile
+++ b/Makefile
@@ -15,7 +15,7 @@ ifndef NEO4J_VERSION
 endif
 
 tarball = neo4j-$(1)-$(2)-unix.tar.gz
-dist_site := http://dist.neo4j.org
+dist_site := https://s3-eu-west-1.amazonaws.com/core-edge.neotechnology.com/20170726
 series := $(shell echo "$(NEO4J_VERSION)" | sed -E 's/^([0-9]+\.[0-9]+)\..*/\1/')
 
 all: out/enterprise/.sentinel out/community/.sentinel
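For reference, here’s roughly how dist_site and the tarball template combine into the URL the Makefile downloads. This is a sketch based only on the lines visible in the diff; the exact substitution of edition and version into neo4j-$(1)-$(2)-unix.tar.gz is my assumption:

```python
# Sketch of how the Makefile assembles the artifact URL from dist_site
# and the tarball template (edition/version substitution assumed).
def artifact_url(dist_site: str, edition: str, version: str) -> str:
    tarball = f"neo4j-{edition}-{version}-unix.tar.gz"
    return f"{dist_site}/{tarball}"

print(artifact_url("http://dist.neo4j.org", "enterprise", "3.2.2"))
# http://dist.neo4j.org/neo4j-enterprise-3.2.2-unix.tar.gz
```

So pointing dist_site at an S3 bucket works as long as the bucket serves tarballs under the same naming scheme.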

We can then update the Neo4j version environment variable:

$ export NEO4J_VERSION="3.2.3-SNAPSHOT"

And then repeat the Docker commands above. You’ll need to sub in your own Docker Hub user and repository names.

I’m using these custom images as part of Kubernetes deployments but you can use them anywhere that accepts a Docker container.

If anything on the post didn’t make sense or you want more clarification let me know @markhneedham.

Written by Mark Needham

July 26th, 2017 at 10:20 pm

Posted in neo4j

Tagged with ,