Mark Needham

Thoughts on Software Development

scikit-learn: Using GridSearch to tune the hyper-parameters of VotingClassifier

In my last blog post I showed how to create a multi-class classification ensemble using scikit-learn’s VotingClassifier and finished by mentioning that I didn’t know which classifiers should be part of the ensemble.

The ensemble needs to score better with each classifier included than without it; otherwise that classifier can be excluded.

We have a TF/IDF based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                              strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                              ngram_range=(1, 3), use_idf=1, smooth_idf=1, sublinear_tf=1,
                              stop_words='english')),
    ('mnb', MultinomialNB())
])
 
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
    ("tfidf", tfidf_pipe),
]
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

Now we’re ready to work out which classifiers are needed. We’ll use GridSearchCV to do this.

from sklearn.model_selection import GridSearchCV
 
 
def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]
 
 
param_grid = dict(
    voting__weights=combinations_on_off(len(classifiers))
)
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
y = train_df[Y_COLUMN].copy()
X = pd.Series(train_df[TEXT_COLUMN])
 
grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid, n_jobs=-1, verbose=10, scoring="neg_log_loss")
 
grid_search.fit(X, y)
 
cv_results = grid_search.cv_results_
 
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)
 
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Let’s run the grid search and see what it comes up with:

{'voting__weights': [0, 0, 1]} -0.60533660756
{'voting__weights': [0, 1, 0]} -0.474562462086
{'voting__weights': [0, 1, 1]} -0.508363479586
{'voting__weights': [1, 0, 0]} -0.697231760084
{'voting__weights': [1, 0, 1]} -0.456599644003
{'voting__weights': [1, 1, 0]} -0.409406571361
{'voting__weights': [1, 1, 1]} -0.439084397238
 
Best score: -0.409
Best parameters set:
	voting__weights: [1, 1, 0]

We can see from the output that we’ve tried every combination of each of the classifiers. The output suggests that we should only include the ngram_pipe and unigram_log_pipe classifiers. tfidf_pipe should not be included – our log loss score is worse when it is added.
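
Based on that result, a slimmed-down ensemble might look something like the following. This is just a minimal sketch that reuses the ngram_pipe and unigram_log_pipe pipelines, and the X and y variables, defined above:

from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline

# keep only the two classifiers the grid search favoured
final_pipe = Pipeline([
    ("voting", VotingClassifier([
        ("ngram", ngram_pipe),
        ("unigram", unigram_log_pipe),
    ], voting="soft"))
])

final_pipe.fit(X, y)

Equivalently, we could keep all three classifiers in the VotingClassifier and pass weights=[1, 1, 0] so that tfidf_pipe no longer contributes to the soft vote.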

The code is on GitHub if you want to see it all in one place.

Written by Mark Needham

December 10th, 2017 at 7:55 am

scikit-learn: Building a multi-class classification ensemble

For the Kaggle Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble and found the VotingClassifier, which does exactly that.

We need to predict the probability that a sentence is written by one of three authors, so the VotingClassifier needs to make a ‘soft’ prediction. If we only needed to know the most likely author we could have it make a ‘hard’ prediction instead.
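
To make the difference concrete, here is a rough sketch of what the two voting modes do for a single sentence, using made-up probabilities for the three authors in the competition (EAP, HPL, MWS):

import numpy as np

# hypothetical predict_proba output from two classifiers for one sentence,
# columns ordered EAP, HPL, MWS
clf1_probs = np.array([0.6, 0.3, 0.1])
clf2_probs = np.array([0.2, 0.5, 0.3])

# voting="soft": average the predicted probabilities (optionally weighted),
# which keeps the probabilities we need to submit
soft_vote = (clf1_probs + clf2_probs) / 2   # [0.4, 0.4, 0.2]

# voting="hard": each classifier votes for its most likely author and the
# majority label wins, so we only get a label back, not probabilities
hard_votes = [clf1_probs.argmax(), clf2_probs.argmax()]   # [0, 1]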

We start with two classifiers that generate different n-gram based features. The code for those is as follows:

from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])

We can combine those classifiers together like this:

classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane’s tutorial.

import pandas as pd
import numpy as np
 
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
 
def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    rskf = StratifiedKFold(n_splits=5, random_state=1)
    losses = []
    accuracies = []
    for train_index, test_index in rskf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
 
    print("{kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)

Let’s run the script:

kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849

Looks good.

I’ve actually got several other classifiers as well but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearch to work that out.

Written by Mark Needham

December 5th, 2017 at 10:19 pm

Python: Combinations of values on and off

In my continued exploration of Kaggle’s Spooky Authors competition, I wanted to run a GridSearch turning on and off different classifiers to work out the best combination.

I therefore needed to generate combinations of 1s and 0s enabling different classifiers.

e.g. if we had 3 classifiers we’d generate these combinations

0 0 1
0 1 0
1 0 0
1 1 0
1 0 1
0 1 1
1 1 1

where…

  • ‘0 0 1’ means: classifier1 is disabled, classifier2 is disabled, classifier3 is enabled
  • ‘0 1 0’ means: classifier1 is disabled, classifier2 is enabled, classifier3 is disabled
  • ‘1 1 0’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is disabled
  • ‘1 1 1’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is enabled

…and so on. In other words, we need to generate the binary representation of every value from 1 to 2^(number of classifiers) - 1.

We can write the following code fragments to calculate a 3 bit representation of different numbers:

>>> "{0:0b}".format(1).zfill(3)
'001'
>>> "{0:0b}".format(5).zfill(3)
'101'
>>> "{0:0b}".format(6).zfill(3)
'110'

We need an array of 0s and 1s rather than a string, so let’s use the list function to create our array and then cast each value to an integer:

>>> [int(x) for x in list("{0:0b}".format(1).zfill(3))]
[0, 0, 1]

Finally we can wrap that code inside a list comprehension:

def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]

And let’s check it works:

>>> for combination in combinations_on_off(3):
       print(combination)
 
[0, 0, 1]
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]

What if we have 4 classifiers?

>>> for combination in combinations_on_off(4):
       print(combination)
 
[0, 0, 0, 1]
[0, 0, 1, 0]
[0, 0, 1, 1]
[0, 1, 0, 0]
[0, 1, 0, 1]
[0, 1, 1, 0]
[0, 1, 1, 1]
[1, 0, 0, 0]
[1, 0, 0, 1]
[1, 0, 1, 0]
[1, 0, 1, 1]
[1, 1, 0, 0]
[1, 1, 0, 1]
[1, 1, 1, 0]
[1, 1, 1, 1]

Perfect! We can now use this function to help work out which combinations of classifiers are needed.
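
As an aside, the standard library can generate the same combinations. A roughly equivalent sketch using itertools.product (not the version used above) would be:

from itertools import product


def combinations_on_off(num_classifiers):
    # every on/off assignment except the all-zeros one
    return [list(combo)
            for combo in product([0, 1], repeat=num_classifiers)
            if any(combo)]

product([0, 1], repeat=n) yields the tuples in the same order as counting in binary, so the output matches the version above.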

Written by Mark Needham

December 3rd, 2017 at 5:23 pm

Posted in Python

Neo4j: Cypher – Property values can only be of primitive types or arrays thereof.

I ran into an interesting Cypher error message earlier this week while trying to create an array property on a node, and thought I’d share it.

This was the Cypher query I wrote:

CREATE (:Person {id: [1, "mark", 2.0]})

which results in this error:

Neo.ClientError.Statement.TypeError
Property values can only be of primitive types or arrays thereof.

We are actually storing an array of primitives, but we have a mix of different types, which isn’t allowed. Let’s try coercing all the values to strings:

CREATE (:Person {id: [value in [1, "mark", 2.0] | toString(value)]})
 
Added 1 label, created 1 node, set 1 property, completed after 4 ms.

Success!

Written by Mark Needham

December 1st, 2017 at 10:09 pm

Posted in neo4j

Python: Learning about defaultdict’s handling of missing keys

While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.

This is the code snippet that intrigued me:

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

Let’s quickly see how it works by adapting an example from scikit-learn:

>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__
 
>>> vocabulary["foo"]
0
>>> vocabulary.items()
dict_items([('foo', 0)])
 
>>> vocabulary["bar"]
1
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])

What seems to happen is that when we try to find a key that doesn’t exist in the dictionary, an entry gets created with a value equal to the number of items already in the dictionary.

Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:

>>> vocabulary["baz"] = "Mark
>>> vocabulary["baz"]
'Mark'
>>> vocabulary["python"]
3

Now let’s see what the dictionary contains:

>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])

All makes sense so far. If we look at the source code we can see that this is exactly what’s going on:

"""
__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
  if self.default_factory is None: raise KeyError((key,))
  self[key] = value = self.default_factory()
  return value
"""
pass

scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.
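
For example, here’s a small sketch (not scikit-learn’s actual code) showing how the trick assigns each new token the next free column index:

from collections import defaultdict

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

# unseen tokens get the next free index; repeated tokens reuse theirs
for token in ["the", "raven", "the", "cask", "raven"]:
    column = vocabulary[token]

print(dict(vocabulary))
# {'the': 0, 'raven': 1, 'cask': 2}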

All in all, very neat!

Written by Mark Needham

December 1st, 2017 at 3:26 pm

Posted in Python

scikit-learn: Creating a matrix of named entity counts

I’ve been trying to improve my score on Kaggle’s Spooky Author Identification competition, and my latest idea was building a model which used named entities extracted using the polyglot NLP library.

We’ll start by learning how to extract entities from a sentence using polyglot, which isn’t too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This text contains three entities. We’d like each entity to be a string rather than an array of values, so let’s refactor the code to do that:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']

That’s it for the polyglot part of the solution. Now let’s work out how to integrate that with scikit-learn.

I’ve been using scikit-learn’s Pipeline abstraction for the other models I’ve created so I’d like to take the same approach here. This is an example of a model that creates a matrix of unigram counts and trains a Naive Bayes model on top of that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])
 
...
# Train and Test the model
...

I was going to write a class similar to CountVectorizer but after reading its code for a couple of hours I realised that I could just pass in a custom analyzer instead. This is what I ended up with:

entities = {}
 
 
def analyze(doc):
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]
 
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=lambda doc: analyze(doc))),
    ('mnb', MultinomialNB())
])

I’m caching the results in a dictionary because the entity extraction is quite time consuming and there’s no point recalculating it each time the function is called.
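
An alternative to the hand-rolled dictionary would be functools.lru_cache, which handles the memoisation for us. A sketch of that idea (not the code I used) looks like this:

from functools import lru_cache

from polyglot.text import Text


@lru_cache(maxsize=None)
def analyze(doc):
    # entity extraction is slow, so cache the result for each document
    return tuple("_".join(entity)
                 for entity in Text(doc, hint_language_code="en").entities)

The cached function can then be passed straight in as CountVectorizer(analyzer=analyze).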

Unfortunately this model didn’t help me improve my best score. It scores a log loss of around 0.5, a bit worse than the 0.45 I’ve achieved using the unigram model 🙁

Written by Mark Needham

November 29th, 2017 at 11:01 pm

Python: polyglot – ModuleNotFoundError: No module named ‘icu’

I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter Trolls but had installation problems which I thought I’d share in case anyone else experiences the same issues.

I started by trying to install polyglot:

$ pip install polyglot
 
ImportError: No module named 'icu'

Hmmm, I’m not sure what icu is, but luckily there’s a GitHub issue covering this problem. That led me to Toby Fleming’s blog post, which suggests the following steps:

brew install icu4c
export ICU_VERSION=58
export PYICU_INCLUDES=/usr/local/Cellar/icu4c/58.2/include
export PYICU_LFLAGS=-L/usr/local/Cellar/icu4c/58.2/lib
pip install pyicu

I already had icu4c installed so I just had to make sure that I had the same version of that library as Toby did. I ran the following command to check that:

$ ls -lh /usr/local/Cellar/icu4c/
total 0
drwxr-xr-x  12 markneedham  admin   408B 28 Nov 06:12 58.2

That still wasn’t enough though! I had to install these two libraries as well:

pip install pycld2
pip install morfessor

I was then able to install polyglot, but then had to run the following commands to download the files needed for entity extraction:

polyglot download embeddings2.de
polyglot download ner2.de
polyglot download embeddings2.en
polyglot download ner2.en

Written by Mark Needham

November 28th, 2017 at 7:52 pm

Posted in Python

Python 3: TypeError: unsupported format string passed to numpy.ndarray.__format__

This post explains how to work around a change in how Python string formatting works for numpy arrays between Python 2 and Python 3.

I’ve been going through Kevin Markham’s scikit-learn Jupyter notebooks and ran into a problem on the Cross Validation one, which was throwing this error when attempting to print the KFold example:

Iteration                   Training set observations                   Testing set observations
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-007cbab507e3> in <module>()
      6 print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
      7 for iteration, data in enumerate(kf, start=1):
----> 8     print('{0:^9} {1} {2:^25}'.format(iteration, data[0], data[1]))
 
TypeError: unsupported format string passed to numpy.ndarray.__format__

We can reproduce this easily:

>>> import numpy as np
>>> "{:9}".format(np.array([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported format string passed to numpy.ndarray.__format__

What about if we use Python 2?

>>> "{:9}".format(np.array([1,2,3]))
'[1 2 3]  '

Hmmm, must be a change between the Python versions.

We can work around it by coercing our numpy array to a string:

>>> "{:9}".format(str(np.array([1,2,3])))
'[1 2 3]  '
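
Applied to the notebook’s loop, the failing line just needs the same str() coercion on the array it centres. A sketch, assuming kf is the KFold iterator built earlier in the notebook:

print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
    # only data[1] has a format spec ({2:^25}), so only it needs coercing
    print('{0:^9} {1} {2:^25}'.format(iteration, data[0], str(data[1])))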

Written by Mark Needham

November 19th, 2017 at 7:16 am

Posted in Python

Kubernetes: Copy a dataset to a StatefulSet’s PersistentVolume

In this post we’ll learn how to copy an existing dataset to the PersistentVolumes used by a Neo4j cluster running on Kubernetes.

Neo4j Clusters on Kubernetes

This post assumes that we’re familiar with deploying Neo4j on Kubernetes. I wrote an article on the Neo4j blog explaining this in more detail.

The StatefulSet we create for our core servers requires persistent storage, achieved via the PersistentVolumeClaim (PVC) primitive. A Neo4j cluster containing 3 core servers would have the following PVCs:

$ kubectl get pvc
NAME                            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datadir-neo-helm-neo4j-core-0   Bound     pvc-043efa91-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            standard       45s
datadir-neo-helm-neo4j-core-1   Bound     pvc-1737755a-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            standard       13s
datadir-neo-helm-neo4j-core-2   Bound     pvc-18696bfd-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            standard       11s

Each of the PVCs has a corresponding PersistentVolume (PV) that satisfies it:

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                                   STORAGECLASS   REASON    AGE
pvc-043efa91-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            Delete           Bound     default/datadir-neo-helm-neo4j-core-0   standard                 41m
pvc-1737755a-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            Delete           Bound     default/datadir-neo-helm-neo4j-core-1   standard                 40m
pvc-18696bfd-cc54-11e7-bfa5-080027ab9eac   10Gi       RWO            Delete           Bound     default/datadir-neo-helm-neo4j-core-2   standard                 40m

The PVCs and PVs are usually created at the same time that we deploy our StatefulSet. We need to intervene in that lifecycle so that our dataset is already in place before the StatefulSet is deployed.

Deploying an existing dataset

We can do this by following these steps:

  • Create PVCs with the above names manually
  • Attach pods to those PVCs
  • Copy our dataset onto those pods
  • Delete the pods
  • Deploy our Neo4j cluster

We can use the following script to create the PVCs and pods:

pvs.sh

#!/usr/bin/env bash
 
set -exuo pipefail
 
for i in $(seq 0 2); do
  cat <<EOF | kubectl apply -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: datadir-neo-helm-neo4j-core-${i}
  labels:
    app: neo4j
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
 
  cat <<EOF | kubectl apply -f -
kind: Pod
apiVersion: v1
metadata:
  name: neo4j-load-data-${i}
  labels:
    app: neo4j-loader
spec:
  volumes:
    - name: datadir-neo4j-core-${i}
      persistentVolumeClaim:
        claimName: datadir-neo-helm-neo4j-core-${i}
  containers:
    - name: neo4j-load-data-${i}
      image: ubuntu
      volumeMounts:
      - name: datadir-neo4j-core-${i}
        mountPath: /data
      command: ["/bin/bash", "-ecx", "while :; do printf '.'; sleep 5 ; done"]
EOF
 
done;

Let’s run that script to create our PVCs and pods:

$ ./pvs.sh 
++ seq 0 2
+ for i in $(seq 0 2)
+ cat
+ kubectl apply -f -
persistentvolumeclaim "datadir-neo-helm-neo4j-core-0" configured
+ cat
+ kubectl apply -f -
pod "neo4j-load-data-0" configured
+ for i in $(seq 0 2)
+ cat
+ kubectl apply -f -
persistentvolumeclaim "datadir-neo-helm-neo4j-core-1" configured
+ cat
+ kubectl apply -f -
pod "neo4j-load-data-1" configured
+ for i in $(seq 0 2)
+ cat
+ kubectl apply -f -
persistentvolumeclaim "datadir-neo-helm-neo4j-core-2" configured
+ cat
+ kubectl apply -f -
pod "neo4j-load-data-2" configured

Now we can copy our database onto the pods:

for i in $(seq 0 2); do
  kubectl cp graph.db.tar.gz neo4j-load-data-${i}:/data/
  kubectl exec neo4j-load-data-${i} -- bash -c "mkdir -p /data/databases && tar -xf  /data/graph.db.tar.gz -C /data/databases"
done

graph.db.tar.gz contains a backup from a local database I created:

$ tar -tvf graph.db.tar.gz 
 
drwxr-xr-x  0 markneedham staff       0 24 Jul 15:23 graph.db/
drwxr-xr-x  0 markneedham staff       0 24 Jul 15:23 graph.db/certificates/
drwxr-xr-x  0 markneedham staff       0 17 Feb  2017 graph.db/index/
drwxr-xr-x  0 markneedham staff       0 24 Jul 15:22 graph.db/logs/
-rw-r--r--  0 markneedham staff    8192 24 Jul 15:23 graph.db/neostore
-rw-r--r--  0 markneedham staff     896 24 Jul 15:23 graph.db/neostore.counts.db.a
-rw-r--r--  0 markneedham staff    1344 24 Jul 15:23 graph.db/neostore.counts.db.b
-rw-r--r--  0 markneedham staff       9 24 Jul 15:23 graph.db/neostore.id
-rw-r--r--  0 markneedham staff   65536 24 Jul 15:23 graph.db/neostore.labelscanstore.db
...
-rw-------  0 markneedham staff     1700 24 Jul 15:23 graph.db/certificates/neo4j.key

We’ll run the following command to check the databases are in place:

$ kubectl exec neo4j-load-data-0 -- ls -lh /data/databases/
total 4.0K
drwxr-xr-x 6 501 staff 4.0K Jul 24 14:23 graph.db
 
$ kubectl exec neo4j-load-data-1 -- ls -lh /data/databases/
total 4.0K
drwxr-xr-x 6 501 staff 4.0K Jul 24 14:23 graph.db
 
$ kubectl exec neo4j-load-data-2 -- ls -lh /data/databases/
total 4.0K
drwxr-xr-x 6 501 staff 4.0K Jul 24 14:23 graph.db

All good so far. The pods have done their job so we’ll tear those down:

$ kubectl delete pods -l app=neo4j-loader
pod "neo4j-load-data-0" deleted
pod "neo4j-load-data-1" deleted
pod "neo4j-load-data-2" deleted

We’re now ready to deploy our Neo4j cluster.

helm install incubator/neo4j --name neo-helm --wait --set authEnabled=false

Finally we’ll run a Cypher query to check that the Neo4j servers used the database that we uploaded:

$ kubectl exec neo-helm-neo4j-core-0 -- bin/cypher-shell "match (n) return count(*)"
count(*)
32314
 
$ kubectl exec neo-helm-neo4j-core-1 -- bin/cypher-shell "match (n) return count(*)"
count(*)
32314
 
$ kubectl exec neo-helm-neo4j-core-2 -- bin/cypher-shell "match (n) return count(*)"
count(*)
32314

Success!

We could achieve similar results by using an init container but I haven’t had a chance to try out that approach yet. If you give it a try let me know in the comments and I’ll add it to the post.

Written by Mark Needham

November 18th, 2017 at 12:44 pm

Posted in Kubernetes,neo4j

Kubernetes 1.8: Using Cronjobs to take Neo4j backups

With the release of Kubernetes 1.8, Cronjobs have graduated to beta, which means we can now more easily run Neo4j backup jobs against Kubernetes clusters.

Before we learn how to write a Cronjob let’s first create a local Kubernetes cluster and deploy Neo4j.

Spin up Kubernetes & Helm

minikube start --memory 8192
helm init && kubectl rollout status -w deployment/tiller-deploy --namespace=kube-system

Deploy a Neo4j cluster

helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com/
helm install incubator/neo4j --name neo-helm --wait --set authEnabled=false,core.extraVars.NEO4J_dbms_backup_address=0.0.0.0:6362

Populate our cluster with data

We can run the following command to check our cluster is ready:

kubectl exec neo-helm-neo4j-core-0 \
  -- bin/cypher-shell --format verbose \
  "CALL dbms.cluster.overview() YIELD id, role RETURN id, role"
 
+-----------------------------------------------------+
| id                                     | role       |
+-----------------------------------------------------+
| "0b3bfe6c-6a68-4af5-9dd2-e96b564df6e5" | "LEADER"   |
| "09e9bee8-bdc5-4e95-926c-16ea8213e6e7" | "FOLLOWER" |
| "859b9b56-9bfc-42ae-90c3-02cedacfe720" | "FOLLOWER" |
+-----------------------------------------------------+

Now let’s create some data:

kubectl exec neo-helm-neo4j-core-0 \
  -- bin/cypher-shell --format verbose \
  "UNWIND range(0,1000) AS id CREATE (:Person {id: id})"
 
0 rows available after 653 ms, consumed after another 0 ms
Added 1001 nodes, Set 1001 properties, Added 1001 labels

Now that our Neo4j cluster is running and contains data we want to take regular backups.

Neo4j backups

The Neo4j backup tool supports full and incremental backups.

Full backup

A full backup streams a copy of the store files to the backup location, along with any transactions that happened during the backup. Those transactions are then applied to the backed-up copy of the database.

Incremental backup

An incremental backup is triggered if there is already a Neo4j database in the backup location. In this case there will be no copying of store files. Instead the tool will copy any new transactions from Neo4j and apply them to the backup.

Backup Cronjob

We will use a Cronjob to execute a backup. In the example below we attach a PersistentVolumeClaim to our Cronjob so that we can see both of the backup scenarios in action.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: backupdir-neo4j
  labels:
    app: neo4j-backup
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: neo4j-backup
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
            - name: backupdir-neo4j
              persistentVolumeClaim:
                claimName: backupdir-neo4j
          containers:
            - name: neo4j-backup
              image: neo4j:3.3.0-enterprise
              env:
                - name: NEO4J_ACCEPT_LICENSE_AGREEMENT
                  value: "yes"
              volumeMounts:
              - name: backupdir-neo4j
                mountPath: /tmp
              command: ["bin/neo4j-admin",  "backup", "--backup-dir", "/tmp", "--name", "backup", "--from", "neo-helm-neo4j-core-2.neo-helm-neo4j.default.svc.cluster.local:6362"]
          restartPolicy: OnFailure

$ kubectl apply -f cronjob.yaml 
cronjob "neo4j-backup" created
kubectl get jobs -w
NAME                      DESIRED   SUCCESSFUL   AGE
neo4j-backup-1510940940   1         1            34s

Now let’s view the logs from the job:

kubectl logs $(kubectl get pods -a --selector=job-name=neo4j-backup-1510940940 --output=jsonpath={.items..metadata.name})
 
Doing full backup...
2017-11-17 17:49:05.920+0000 INFO [o.n.c.s.StoreCopyClient] Copying neostore.nodestore.db.labels
...
2017-11-17 17:49:06.038+0000 INFO [o.n.c.s.StoreCopyClient] Copied neostore.labelscanstore.db 48.00 kB
2017-11-17 17:49:06.038+0000 INFO [o.n.c.s.StoreCopyClient] Done, copied 18 files
2017-11-17 17:49:06.094+0000 INFO [o.n.b.BackupService] Start recovering store
2017-11-17 17:49:07.669+0000 INFO [o.n.b.BackupService] Finish recovering store
Doing consistency check...
2017-11-17 17:49:07.716+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_2[v0.A.8] record format from store /tmp/backup
2017-11-17 17:49:07.716+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_2[v0.A.8]
2017-11-17 17:49:07.755+0000 INFO [o.n.m.MetricsExtension] Initiating metrics...
....................  10%
...
.................... 100%
Backup complete.

All good so far. Now let’s create more data:

kubectl exec neo-helm-neo4j-core-0 \
  -- bin/cypher-shell --format verbose \
  "UNWIND range(0,1000) AS id CREATE (:Person {id: id})"
 
 
0 rows available after 114 ms, consumed after another 0 ms
Added 1001 nodes, Set 1001 properties, Added 1001 labels

And wait for another backup job to run:

kubectl get jobs -w
NAME                      DESIRED   SUCCESSFUL   AGE
neo4j-backup-1510941180   1         1            2m
neo4j-backup-1510941240   1         1            1m
neo4j-backup-1510941300   1         1            17s
kubectl logs $(kubectl get pods -a --selector=job-name=neo4j-backup-1510941300 --output=jsonpath={.items..metadata.name})
 
Destination is not empty, doing incremental backup...
Doing consistency check...
2017-11-17 17:55:07.958+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_2[v0.A.8] record format from store /tmp/backup
2017-11-17 17:55:07.959+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_2[v0.A.8]
2017-11-17 17:55:07.995+0000 INFO [o.n.m.MetricsExtension] Initiating metrics...
....................  10%
...
.................... 100%
Backup complete.

If we were to extend this job further we could have it copy each backup it takes to an S3 bucket but I’ll leave that for another day.

Written by Mark Needham

November 17th, 2017 at 6:10 pm

Posted in Kubernetes,neo4j
