# Mark Needham

Thoughts on Software Development

## Strava: Calculating the similarity of two runs

I go running several times a week and wanted to compare my runs against each other to see how similar they are.

I record my runs with the Strava app and it has an API that returns lat/long coordinates for each run in the Google encoded polyline algorithm format.

We can use the polyline library to decode these values into a list of lat/long tuples. For example:

```
>>> import polyline
>>> polyline.decode('u{~vFvyys@fS]')
[(40.63179, -8.65708), (40.62855, -8.65693)]
```

Once we’ve got each route defined as a set of coordinates, we need a way to compare them. My Googling led me to an algorithm called Dynamic Time Warping (DTW).

DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions.

The sequences are “warped” non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension.
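To make the idea concrete, here’s a minimal, naive O(n·m) DTW implementation for 1-D sequences. This is a sketch I’ve added for illustration, not code from the post; the fastdtw library used below implements a faster approximation of the same idea.

```python
def dtw_distance(s1, s2):
    """Return the DTW distance between two 1-D sequences."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning s1[:i] with s2[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s1[i - 1] - s2[j - 1])
            # extend the cheapest of: diagonal match, insertion, deletion
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

# the "warping" lets [1, 2, 3] line up perfectly with [1, 2, 2, 3]
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [10, 20, 30]))
```

The repeated 2 in the second sequence is absorbed by the diagonal/insertion step, which is exactly the non-linear time warping described above.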

The fastdtw library implements an approximation of this algorithm and returns a value indicating the distance between two sets of points.

We can see how to apply fastdtw and polyline against Strava data in the following example:

```
import os
import polyline
import requests
from fastdtw import fastdtw

token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}


def find_points(activity_id):
    r = requests.get("https://www.strava.com/api/v3/activities/{0}".format(activity_id), headers=headers)
    response = r.json()
    line = response["map"]["polyline"]
    return polyline.decode(line)
```

Now let’s try it out on two runs, 1361109741 and 1346460542:

```
from scipy.spatial.distance import euclidean

activity1_id = 1361109741
activity2_id = 1346460542

distance, path = fastdtw(find_points(activity1_id), find_points(activity2_id), dist=euclidean)

>>> print(distance)
2.91985018100644
```

These two runs are both near my house so the value is small. Let’s change the second route to be from my trip to New York:

```
activity1_id = 1361109741
activity2_id = 1246017379

distance, path = fastdtw(find_points(activity1_id), find_points(activity2_id), dist=euclidean)

>>> print(distance)
29383.492965394034
```

Much bigger!

I’m not really interested in the actual value returned, but I am interested in the relative values. I’m building a little application to generate routes that I should run, and I want it to come up with routes that are different to recent ones that I’ve run. This score can now form part of the criteria.
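As a sketch of how that criterion might look in the route-generating application: keep a candidate route only if its distance to every recent run exceeds some threshold. The function name, threshold, and toy distance function below are all hypothetical stand-ins (the real version would pass fastdtw with euclidean distance).

```python
def is_different_enough(candidate, recent_runs, distance_fn, threshold):
    """Keep a candidate route only if its distance to every recent
    run exceeds the threshold (names and values are illustrative)."""
    return all(distance_fn(candidate, run) > threshold for run in recent_runs)

# toy distance function standing in for fastdtw on real coordinates
toy_distance = lambda a, b: abs(len(a) - len(b))

print(is_different_enough([1, 2, 3], [[1, 2], [1]], toy_distance, 0))  # True
print(is_different_enough([1, 2, 3], [[1, 2], [1]], toy_distance, 2))  # False
```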

Written by Mark Needham

January 18th, 2018 at 11:35 pm

## scikit-learn: Using GridSearch to tune the hyper-parameters of VotingClassifier

In my last blog post I showed how to create a multi class classification ensemble using scikit-learn’s VotingClassifier and finished mentioning that I didn’t know which classifiers should be part of the ensemble.

We need to check that the ensemble scores better with each of the classifiers included; any classifier that doesn’t improve the score can be excluded.

We have a TF/IDF based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

```
import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

Y_COLUMN = "author"
TEXT_COLUMN = "text"

unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])

ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                              strip_accents='unicode', analyzer='word',
                              token_pattern=r'\w{1,}', ngram_range=(1, 3),
                              use_idf=1, smooth_idf=1, sublinear_tf=1,
                              stop_words='english')),
    ('mnb', MultinomialNB())
])

classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
    ("tfidf", tfidf_pipe),
]

mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
```

Now we’re ready to work out which classifiers are needed. We’ll use GridSearchCV to do this.

```
from sklearn.model_selection import GridSearchCV


def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]


param_grid = dict(
    voting__weights=combinations_on_off(len(classifiers))
)

train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
y = train_df[Y_COLUMN].copy()
X = pd.Series(train_df[TEXT_COLUMN])

grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid,
                           n_jobs=-1, verbose=10, scoring="neg_log_loss")
grid_search.fit(X, y)

cv_results = grid_search.cv_results_

for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
```

Let’s run the grid search and see what it comes up with:

```
{'voting__weights': [0, 0, 1]} -0.60533660756
{'voting__weights': [0, 1, 0]} -0.474562462086
{'voting__weights': [0, 1, 1]} -0.508363479586
{'voting__weights': [1, 0, 0]} -0.697231760084
{'voting__weights': [1, 0, 1]} -0.456599644003
{'voting__weights': [1, 1, 0]} -0.409406571361
{'voting__weights': [1, 1, 1]} -0.439084397238

Best score: -0.409
Best parameters set:
	voting__weights: [1, 1, 0]
```

We can see from the output that we’ve tried every combination of the classifiers. The output suggests that we should only include the ngram_pipe and unigram_log_pipe classifiers. tfidf_pipe should not be included – our log loss score is worse when it is added.

The code is on GitHub if you want to see it all in one place.

Written by Mark Needham

December 10th, 2017 at 7:55 am

## Python: Combinations of values on and off

In my continued exploration of Kaggle’s Spooky Authors competition, I wanted to run a GridSearch turning on and off different classifiers to work out the best combination.

I therefore needed to generate combinations of 1s and 0s enabling different classifiers.

e.g. if we had 3 classifiers we’d generate these combinations

```
0 0 1
0 1 0
1 0 0
1 1 0
1 0 1
0 1 1
1 1 1
```

where…

• ‘0 0 1’ means: classifier1 is disabled, classifier2 is disabled, classifier3 is enabled
• ‘0 1 0’ means: classifier1 is disabled, classifier2 is enabled, classifier3 is disabled
• ‘1 1 0’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is disabled
• ‘1 1 1’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is enabled

…and so on. In other words, we need to generate the binary representation for all the values from 1 to 2^(number of classifiers) - 1.

We can write the following code fragments to calculate a 3 bit representation of different numbers:

```
>>> "{0:0b}".format(1).zfill(3)
'001'
>>> "{0:0b}".format(5).zfill(3)
'101'
>>> "{0:0b}".format(6).zfill(3)
'110'
```

We need an array of 0s and 1s rather than a string, so let’s use the list function to create our array and then cast each value to an integer:

```
>>> [int(x) for x in list("{0:0b}".format(1).zfill(3))]
[0, 0, 1]
```

Finally we can wrap that code inside a list comprehension:

```
def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]
```

And let’s check it works:

```
>>> for combination in combinations_on_off(3):
...     print(combination)

[0, 0, 1]
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
```

What about if we have 4 classifiers?

```
>>> for combination in combinations_on_off(4):
...     print(combination)

[0, 0, 0, 1]
[0, 0, 1, 0]
[0, 0, 1, 1]
[0, 1, 0, 0]
[0, 1, 0, 1]
[0, 1, 1, 0]
[0, 1, 1, 1]
[1, 0, 0, 0]
[1, 0, 0, 1]
[1, 0, 1, 0]
[1, 0, 1, 1]
[1, 1, 0, 0]
[1, 1, 0, 1]
[1, 1, 1, 0]
[1, 1, 1, 1]
```

Perfect! We can now use this function to help work out which combinations of classifiers are needed.
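As an aside, the standard library’s itertools.product can generate the same combinations without any string formatting. This is a sketch of an equivalent implementation I’ve added for comparison, not code from the original post:

```python
from itertools import product

def combinations_on_off(num_classifiers):
    # every tuple of 0s and 1s of the given length, skipping the
    # all-zeros case (an ensemble with no classifiers enabled)
    return [list(bits) for bits in product([0, 1], repeat=num_classifiers)
            if any(bits)]

print(combinations_on_off(2))  # [[0, 1], [1, 0], [1, 1]]
```

product iterates in lexicographic order, so the output order matches the binary-counting version above exactly.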

Written by Mark Needham

December 3rd, 2017 at 5:23 pm

Posted in Python


## Python: Learning about defaultdict’s handling of missing keys

While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.

This is the code snippet that intrigued me:

```
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__
```

Let’s quickly see how it works by adapting an example from scikit-learn:

```
>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__

>>> vocabulary["foo"]
0
>>> vocabulary.items()
dict_items([('foo', 0)])

>>> vocabulary["bar"]
1
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])
```

What seems to happen is that when we try to find a key that doesn’t exist in the dictionary an entry gets created with a value equal to the number of items in the dictionary.

Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:

```
>>> vocabulary["baz"] = "Mark"
>>> vocabulary["baz"]
'Mark'
>>> vocabulary["python"]
3
```

Now let’s see what the dictionary contains:

```
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])
```

All makes sense so far. If we look at the source code we can see that this is exactly what’s going on:

```
"""
__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
if self.default_factory is None:
    raise KeyError((key,))
self[key] = value = self.default_factory()
return value
"""
pass
```

scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.
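To illustrate that use case, here’s a small sketch (the tokens are made up, not from scikit-learn) of assigning each distinct feature a column index the first time it’s seen:

```python
from collections import defaultdict

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

# the first lookup of each token assigns it the next free column index;
# repeated tokens just return the index they were already given
tokens = ["the", "cat", "sat", "the", "mat"]
column_indexes = [vocabulary[token] for token in tokens]

print(column_indexes)    # [0, 1, 2, 0, 3]
print(dict(vocabulary))  # {'the': 0, 'cat': 1, 'sat': 2, 'mat': 3}
```

No explicit "have I seen this token before?" check is needed, which is what makes the trick so neat.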

All in all, very neat!

Written by Mark Needham

December 1st, 2017 at 3:26 pm

Posted in Python


## Python: polyglot – ModuleNotFoundError: No module named ‘icu’

I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter Trolls but had installation problems which I thought I’d share in case anyone else experiences the same issues.

I started by trying to install polyglot:

```
$ pip install polyglot

ImportError: No module named 'icu'
```

Hmmm, I’m not sure what icu is, but luckily there’s a GitHub issue covering this problem. That led me to Toby Fleming’s blog post, which suggests the following steps:

```
brew install icu4c
export ICU_VERSION=58
export PYICU_INCLUDES=/usr/local/Cellar/icu4c/58.2/include
export PYICU_LFLAGS=-L/usr/local/Cellar/icu4c/58.2/lib
pip install pyicu
```

I already had icu4c installed so I just had to make sure that I had the same version of that library as Toby did. I ran the following command to check that:

```
$ ls -lh /usr/local/Cellar/icu4c/
total 0
drwxr-xr-x 12 markneedham admin 408B 28 Nov 06:12 58.2
```

That still wasn’t enough though! I had to install these two libraries as well:

```
pip install pycld2
pip install morfessor
```

I was then able to install polyglot, but had to then run the following commands to download the files needed for entity extraction:

```
polyglot download embeddings2.de
polyglot download ner2.de
polyglot download embeddings2.en
polyglot download ner2.en
```

Written by Mark Needham

November 28th, 2017 at 7:52 pm

Posted in Python


## Python 3: TypeError: unsupported format string passed to numpy.ndarray.__format__

This post explains how to work around a change in how Python string formatting works for numpy arrays between Python 2 and Python 3.

I’ve been going through Kevin Markham‘s scikit-learn Jupyter notebooks and ran into a problem on the Cross Validation one, which was throwing this error when attempting to print the KFold example:

```
Iteration                   Training set observations                   Testing set observations
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-007cbab507e3> in <module>()
      6 print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
      7 for iteration, data in enumerate(kf, start=1):
----> 8     print('{0:^9} {1} {2:^25}'.format(iteration, data[0], data[1]))

TypeError: unsupported format string passed to numpy.ndarray.__format__
```

We can reproduce this easily:

```
>>> import numpy as np
>>> "{:9}".format(np.array([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported format string passed to numpy.ndarray.__format__
```

What about if we use Python 2?

```
>>> "{:9}".format(np.array([1,2,3]))
'[1 2 3]  '
```

Hmmm, must be a change between the Python versions.

We can work around it by coercing our numpy array to a string:

```
>>> "{:9}".format(str(np.array([1,2,3])))
'[1 2 3]  '
```

Written by Mark Needham

November 19th, 2017 at 7:16 am

Posted in Python


## Python 3: Create sparklines using matplotlib

I recently wanted to create sparklines to show how some values were changing over time. In addition, I wanted to generate them as images on the server rather than introducing a JavaScript library.

Chris Seymour’s excellent gist which shows how to create sparklines inside a Pandas dataframe got me most of the way there, but I had to tweak his code a bit to get it to play nicely with Python 3.6.

This is what I ended up with:

```
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import base64

from io import BytesIO


def sparkline(data, figsize=(4, 0.25), **kwags):
    """
    Returns a HTML image tag containing a base64 encoded sparkline style plot
    """
    data = list(data)

    fig, ax = plt.subplots(1, 1, figsize=figsize, **kwags)
    ax.plot(data)
    for k, v in ax.spines.items():
        v.set_visible(False)
    ax.set_xticks([])
    ax.set_yticks([])

    plt.plot(len(data) - 1, data[len(data) - 1], 'r.')

    ax.fill_between(range(len(data)), data, len(data) * [min(data)], alpha=0.1)

    img = BytesIO()
    plt.savefig(img, transparent=True, bbox_inches='tight')
    img.seek(0)
    plt.close()

    return base64.b64encode(img.read()).decode("UTF-8")
```

I had to change the class used to write the image from StringIO to BytesIO, and I found I needed to decode the bytes produced if I wanted the image to display in an HTML page.

This is how you would call the above function:

```
if __name__ == "__main__":
    values = [
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [7, 10, 12, 18, 2, 8, 10, 6, 7, 12],
        [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    ]

    with open("/tmp/foo.html", "w") as file:
        for value in values:
            file.write('<div><img src="data:image/png;base64,{}"/></div>'.format(sparkline(value)))
```

And the HTML page looks like this:

Written by Mark Needham

September 23rd, 2017 at 6:51 am

## Serverless: Python – virtualenv – { “errorMessage”: “Unable to import module ‘handler'” }

I’ve been using the Serverless library to deploy and run some Python functions on AWS lambda recently and was initially confused about how to handle my dependencies.

I tend to create a new virtualenv for each of my projects, so let’s get that set up first:

### Prerequisites

```
$ npm install serverless
$ virtualenv -p python3 a
$ . a/bin/activate
```

Now let’s create our Serverless project. I’m going to install the requests library so that I can use it in my function.

### My Serverless project

serverless.yaml

```
service: python-starter-template

frameworkVersion: ">=1.2.0 <2.0.0"

provider:
  name: aws
  runtime: python3.6
  timeout: 180

functions:
  starter-function:
    name: Starter
    handler: handler.starter
```

handler.py

```
import requests


def starter(event, context):
    print("event:", event, "context:", context)
    r = requests.get("http://www.google.com")
    print(r.status_code)
```

```
$ pip install requests
```

Ok, we’re now ready to try out the function. A nice feature of Serverless is that it lets us try out functions locally before we deploy them onto one of the Cloud providers:

```
$ ./node_modules/serverless/bin/serverless invoke local --function starter-function
event: {} context: <__main__.FakeLambdaContext object at 0x10bea9a20>
200
null
```

So far so good. Next we’ll deploy our function to AWS. I’m assuming you’ve already got your credentials setup but if not you can follow the tutorial on the Serverless page.

```
$ ./node_modules/serverless/bin/serverless deploy
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (26.48 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.........
Serverless: Stack update finished...
Service Information
service: python-starter-template
stage: dev
region: us-east-1
api keys:
  None
endpoints:
  None
functions:
  starter-function: python-starter-template-dev-starter-function
```

Now let’s invoke our function:

```
$ ./node_modules/serverless/bin/serverless invoke --function starter-function
{
    "errorMessage": "Unable to import module 'handler'"
}

  Error --------------------------------------------------

  Invoked function failed

  For debugging logs, run again after setting the "SLS_DEBUG=*" environment variable.

  Get Support --------------------------------------------
     Docs:          docs.serverless.com
     Bugs:          github.com/serverless/serverless/issues
     Forums:        forum.serverless.com
     Chat:          gitter.im/serverless/serverless

  Your Environment Information -----------------------------
     OS:                 darwin
     Node Version:       6.7.0
     Serverless Version: 1.19.0
```

Hmmm, that’s odd – I wonder why it can’t import our handler module? We can call the logs function to check. The logs are usually a few seconds behind so we’ll have to be a bit patient if we don’t see them immediately.

```
$ ./node_modules/serverless/bin/serverless logs --function starter-function
START RequestId: 735efa84-7ad0-11e7-a4ef-d5baf0b46552 Version: $LATEST
Unable to import module 'handler': No module named 'requests'

END RequestId: 735efa84-7ad0-11e7-a4ef-d5baf0b46552
REPORT RequestId: 735efa84-7ad0-11e7-a4ef-d5baf0b46552	Duration: 0.42 ms	Billed Duration: 100 ms	Memory Size: 1024 MB	Max Memory Used: 22 MB
```

That explains it – the requests module wasn’t included in the package we deployed.

If we look in .serverless/python-starter-template.zip we can see that the requests module is hidden inside the a directory, so the instance of Python that runs on Lambda doesn’t know where to find it.

I’m sure there are other ways of solving this but the easiest one I found is a Serverless plugin called serverless-python-requirements.

So how does this plugin work?

A Serverless v1.x plugin to automatically bundle dependencies from requirements.txt and make them available in your PYTHONPATH.

Doesn’t sound too tricky – we can use pip freeze to get our list of requirements and write them into a file. Let’s rework serverless.yaml to make use of the plugin:

### My Serverless project using serverless-python-requirements

```
$ npm install --save serverless-python-requirements
```

```
$ pip freeze > requirements.txt
$ cat requirements.txt
certifi==2017.7.27.1
chardet==3.0.4
idna==2.5
requests==2.18.3
urllib3==1.22
```

serverless.yaml

```
service: python-starter-template

frameworkVersion: ">=1.2.0 <2.0.0"

provider:
  name: aws
  runtime: python3.6
  timeout: 180

plugins:
  - serverless-python-requirements

functions:
  starter-function:
    name: Starter
    handler: handler.starter

package:
  exclude:
    - a/** # virtualenv
```

We have two changes from before:

• We added the serverless-python-requirements plugin
• We excluded the a directory since we don’t need it

Let’s deploy again and run the function:

```
$ ./node_modules/serverless/bin/serverless deploy
Serverless: Parsing Python requirements.txt
Serverless: Installing required Python packages for runtime python3.6...
Serverless: Linking required Python packages...
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Unlinking required Python packages...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (14.39 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.........
Serverless: Stack update finished...
Service Information
service: python-starter-template
stage: dev
region: us-east-1
api keys:
  None
endpoints:
  None
functions:
  starter-function: python-starter-template-dev-starter-function
```
```
$ ./node_modules/serverless/bin/serverless invoke --function starter-function
null
```

Looks good. Let’s check the logs:

```
$ ./node_modules/serverless/bin/serverless logs --function starter-function
START RequestId: 61e8eda7-7ad4-11e7-8914-03b8a7793a24 Version: $LATEST
event: {} context: <__main__.LambdaContext object at 0x7f568b105f28>
200
END RequestId: 61e8eda7-7ad4-11e7-8914-03b8a7793a24
REPORT RequestId: 61e8eda7-7ad4-11e7-8914-03b8a7793a24	Duration: 55.55 ms	Billed Duration: 100 ms	Memory Size: 1024 MB	Max Memory Used: 29 M
```

All good here as well so we’re done!

Written by Mark Needham

August 6th, 2017 at 7:03 pm

Posted in Software Development


## PHP vs Python: Generating a HMAC

I’ve been writing a bit of code to integrate with a ClassMarker webhook, where you’re required to verify that an incoming request actually came from ClassMarker by checking the value of a base64-encoded HMAC SHA256 hash.

The example in the documentation is written in PHP which I haven’t done for about 10 years so I had to figure out how to do the same thing in Python.

This is the PHP version:

```
$ php -a
php > echo base64_encode(hash_hmac("sha256", "my data", "my_secret", true));
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=
```

The Python equivalent is a bit more code but it’s not too bad.

### Import all the libraries

```
import hmac
import hashlib
import base64
```

### Generate that hash

```
data = "my data".encode("utf-8")
digest = hmac.new(b"my_secret", data, digestmod=hashlib.sha256).digest()

print(base64.b64encode(digest).decode())
'vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE='
```

We’re getting the same value as the PHP version so it’s good times all round.
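For the actual webhook check, it’s worth comparing the computed hash against the incoming signature with hmac.compare_digest rather than ==, since the former runs in constant time and doesn’t leak information through timing differences. This is a sketch of my own, not code from the ClassMarker docs, reusing the data, secret, and hash value from above:

```python
import base64
import hashlib
import hmac


def verify_signature(payload, secret, signature):
    """Return True if the base64-encoded HMAC-SHA256 of the payload
    (bytes), keyed with secret (bytes), matches the signature string."""
    digest = hmac.new(secret, payload, digestmod=hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature)


# signature value taken from the example above
incoming_signature = "vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE="
print(verify_signature(b"my data", b"my_secret", incoming_signature))
```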

Written by Mark Needham

August 2nd, 2017 at 6:09 am

Posted in Python

Tagged with ,

## Pandas: ValueError: The truth value of a Series is ambiguous.

I’ve been playing around with Kaggle in my spare time over the last few weeks and came across an unexpected behaviour when trying to add a column to a dataframe.

First let’s get Pandas into our program scope:

### Prerequisites

`import pandas as pd`

Now we’ll create a data frame to play with for the duration of this post:

```
>>> df = pd.DataFrame({"a": [1,2,3,4,5], "b": [2,3,4,5,6]})
>>> df
   a  b
0  1  2
1  2  3
2  3  4
3  4  5
4  5  6
```

Let’s say we want to create a new column which returns True if either of the numbers is odd, and False otherwise.

We’d expect to see a column full of True values so let’s get started.

```
>>> divmod(df["a"], 2)[1] > 0
0     True
1    False
2     True
3    False
4     True
Name: a, dtype: bool

>>> divmod(df["b"], 2)[1] > 0
0    False
1     True
2    False
3     True
4    False
Name: b, dtype: bool
```

So far so good. Now let’s combine those two calculations together and create a new column in our data frame:

```
>>> df["anyOdd"] = (divmod(df["a"], 2)[1] > 0) or (divmod(df["b"], 2)[1] > 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/kaggle/house-prices/a/lib/python3.6/site-packages/pandas/core/generic.py", line 953, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

Hmmm, that was unexpected! Unfortunately Python’s or and and operators don’t work against a Pandas Series, so instead we need to use the bitwise or (|) and and (&) operators.

Let’s update our example:

```
>>> df["anyOdd"] = (divmod(df["a"], 2)[1] > 0) | (divmod(df["b"], 2)[1] > 0)
>>> df
   a  b  anyOdd
0  1  2    True
1  2  3    True
2  3  4    True
3  4  5    True
4  5  6    True
```

Much better. And what about if we wanted to check if both values are odd?

```
>>> df["bothOdd"] = (divmod(df["a"], 2)[1] > 0) & (divmod(df["b"], 2)[1] > 0)
>>> df
   a  b  anyOdd  bothOdd
0  1  2    True    False
1  2  3    True    False
2  3  4    True    False
3  4  5    True    False
4  5  6    True    False
```

Works exactly as expected, hoorah!
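As an aside, the modulo operator expresses the same odd-number check a little more directly than divmod. This is a variation I’ve added for comparison, not from the original post:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6]})

# % 2 == 1 marks the odd values directly, no divmod tuple to index into
df["anyOdd"] = (df["a"] % 2 == 1) | (df["b"] % 2 == 1)
df["bothOdd"] = (df["a"] % 2 == 1) & (df["b"] % 2 == 1)

print(df["anyOdd"].tolist())   # [True, True, True, True, True]
print(df["bothOdd"].tolist())  # [False, False, False, False, False]
```

The same bitwise | and & caveat applies here too, since each comparison still produces a boolean Series.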

Written by Mark Needham

July 26th, 2017 at 9:41 pm

Posted in Data Science
