Mark Needham

Thoughts on Software Development

Asciidoctor: Creating a macro

I’ve been writing the TWIN4j blog for almost a year now and during that time I’ve written a few different asciidoc macros to avoid repetition.

The most recent one I wrote does the formatting around the Featured Community Member of the Week. I call it like this from the asciidoc, passing in the name of the person and a link to an image:

featured::[name="Suellen Stringer-Hye"]

The code for the macro has two parts. The first is some wiring code that registers the macro with Asciidoctor:


RUBY_ENGINE == 'opal' ? (require 'featured-macro/extension') : (require_relative 'featured-macro/extension')

# Only register the macro for HTML output and when not running in secure mode.
Asciidoctor::Extensions.register do
  if (@document.basebackend? 'html') && (@document.safe < SafeMode::SECURE)
    block_macro FeaturedBlockMacro
  end
end

And this is the code for the macro itself:


require 'asciidoctor/extensions' unless RUBY_ENGINE == 'opal'
include ::Asciidoctor

class FeaturedBlockMacro < Extensions::BlockMacroProcessor
  use_dsl
  named :featured

  # target is the image URL, attrs holds the named attributes (e.g. name)
  def process parent, target, attrs
    name = attrs["name"]
    html = %(<div class="imageblock image-heading">
                <div class="content">
                    <img src="#{target}" alt="#{name} - This Week’s Featured Community Member" width="800" height="400">
                </div>
            </div>
            <p style="font-size: .8em; line-height: 1.5em;" align="center">
              <strong>#{name} - This Week's Featured Community Member</strong>
            </p>)
    # Pass the HTML through untouched (no substitutions).
    create_pass_block parent, html, attrs, subs: nil
  end
end

When we convert the asciidoc into HTML we need to tell asciidoctor about the macro, which we can do like this:

asciidoctor template.adoc \
  -r ./lib/featured-macro.rb \
  -o -

And that’s it!

Written by Mark Needham

February 19th, 2018 at 8:51 pm

Posted in Software Development

Tensorflow: Kaggle Spooky Authors Bag of Words Model

I’ve been playing around with some Tensorflow tutorials recently and wanted to see if I could create a submission for Kaggle’s Spooky Author Identification competition that I’ve written about recently.

My model is based on one from the text classification tutorial. The tutorial shows how to create custom Estimators which we can learn more about in a post on the Google Developers blog.


Let’s get started. First, our imports:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

We’ve obviously got Tensorflow, but also scikit-learn, which we’ll use to split our data into training and test sets as well as to convert the author names into numeric values.

Model building functions

Next we’ll create a function to create a bag of words model. This function calls another one that creates different EstimatorSpecs depending on the context it’s called from.

WORDS_FEATURE = 'words'  # Name of the input words feature.


def bag_of_words_model(features, labels, mode):
    # EMBEDDING_SIZE and MAX_LABEL are module-level constants;
    # n_words is set once the vocabulary has been built.
    bow_column = tf.feature_column.categorical_column_with_identity(WORDS_FEATURE, num_buckets=n_words)
    bow_embedding_column = tf.feature_column.embedding_column(bow_column, dimension=EMBEDDING_SIZE)
    bow = tf.feature_column.input_layer(features, feature_columns=[bow_embedding_column])
    logits = tf.layers.dense(bow, MAX_LABEL, activation=None)
    return create_estimator_spec(logits=logits, labels=labels, mode=mode)


def create_estimator_spec(logits, labels, mode):
    predicted_classes = tf.argmax(logits, 1)

    # Prediction: return the class ids and probabilities, no loss needed.
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={
                'class': predicted_classes,
                'prob': tf.nn.softmax(logits),
                'log_loss': tf.nn.softmax(logits),
            })

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Training: minimise the loss with Adam.
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    # Evaluation: report accuracy alongside the loss.
    eval_metric_ops = {
        'accuracy': tf.metrics.accuracy(labels=labels, predictions=predicted_classes)
    }
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

Loading data

Now we’re ready to load our data.

Y_COLUMN = "author"
TEXT_COLUMN = "text"
le = preprocessing.LabelEncoder()
train_df = pd.read_csv("train.csv")
X = pd.Series(train_df[TEXT_COLUMN])
y = le.fit_transform(train_df[Y_COLUMN].copy())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

The only interesting thing here is the LabelEncoder. We’ll keep that around as we’ll use it later as well.
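To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind a label encoder: the distinct labels are sorted and each one is mapped to its position. The helper names and the sample author codes are assumptions for illustration; this is not scikit-learn's implementation.

```python
def fit_classes(labels):
    """Return the sorted distinct labels (the role played by classes_)."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map each label to its index in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def inverse_transform(classes, ids):
    """Map integer ids back to the original labels."""
    return [classes[i] for i in ids]


# Illustrative author codes, not read from train.csv
authors = ["EAP", "MWS", "HPL", "EAP"]
classes = fit_classes(authors)
encoded = transform(classes, authors)
print(classes, encoded, inverse_transform(classes, encoded))
```

Keeping the fitted classes around matters because the same ordering is needed later to turn numeric predictions back into author names.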

Transform documents

At the moment our training and test dataframes contain text, but Tensorflow works with vectors so we need to convert our data into that format. We can use the VocabularyProcessor to do this:

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
X_transform_train = vocab_processor.fit_transform(X_train)
X_transform_test = vocab_processor.transform(X_test)
X_train = np.array(list(X_transform_train))
X_test = np.array(list(X_transform_test))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)
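As a rough illustration of what this kind of transformation does, the sketch below builds a word-to-id vocabulary from the training documents and turns every document into a fixed-length list of ids, padding short documents and truncating long ones. It only approximates the behaviour of VocabularyProcessor: the tokenizer regex, the `<UNK>` entry name and the function names are assumptions.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9_']+")  # assumed tokenizer, for illustration only


def fit_vocabulary(documents):
    """Assign ids to words in order of first appearance; id 0 is reserved."""
    vocab = {"<UNK>": 0}
    for doc in documents:
        for token in TOKEN_RE.findall(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def transform(documents, vocab, max_len):
    """Turn each document into exactly max_len ids (0 = unknown or padding)."""
    rows = []
    for doc in documents:
        ids = [vocab.get(token, 0) for token in TOKEN_RE.findall(doc)][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows


vocab = fit_vocabulary(["the raven flew", "the cat"])
print(transform(["the cat", "the dog flew away today"], vocab, max_len=4))
```

Words that were never seen during fitting fall back to id 0, which is why the vocabulary is fitted on the training split only and then reused for the test split.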

Training our model

Finally we’re ready to train our model! We’ll call the Bag of Words model we created at the beginning and build a train input function where we pass in the training arrays that we just created:

model_fn = bag_of_words_model
classifier = tf.estimator.Estimator(model_fn=model_fn)
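train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={WORDS_FEATURE: X_train},
    # Remaining arguments are assumed, following the TensorFlow tutorial.
    y=y_train,
    batch_size=len(X_train),
    num_epochs=None,
    shuffle=True)
classifier.train(input_fn=train_input_fn, steps=100)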

Evaluating our model

Let’s see how our model fares. We’ll call the evaluate function with our test data:

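test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={WORDS_FEATURE: X_test},
    # Remaining arguments are assumed, following the TensorFlow tutorial.
    y=y_test,
    num_epochs=1,
    shuffle=False)
scores = classifier.evaluate(input_fn=test_input_fn)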
print('Accuracy: {0:f}, Loss {1:f}'.format(scores['accuracy'], scores["loss"]))
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/k5/ssmkw9vd2yb3h5wnqlxnqbkw0000gn/T/tmpb6v4rrrn/model.ckpt.
INFO:tensorflow:loss = 1.0888131, step = 1
INFO:tensorflow:Saving checkpoints for 100 into /var/folders/k5/ssmkw9vd2yb3h5wnqlxnqbkw0000gn/T/tmpb6v4rrrn/model.ckpt.
INFO:tensorflow:Loss for final step: 0.18394235.
INFO:tensorflow:Starting evaluation at 2018-01-28-22:41:34
INFO:tensorflow:Restoring parameters from /var/folders/k5/ssmkw9vd2yb3h5wnqlxnqbkw0000gn/T/tmpb6v4rrrn/model.ckpt-100
INFO:tensorflow:Finished evaluation at 2018-01-28-22:41:34
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.8246673, global_step = 100, loss = 0.44942895
Accuracy: 0.824667, Loss 0.449429

Not too bad! It doesn’t beat the log loss score of ~0.36 that I got with a scikit-learn ensemble model, but it is better than some of my first attempts.
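For reference, the log loss that the competition scores on can be sketched in a few lines of plain Python: it is the average negative log of the probability assigned to the true class. The function name and the clipping constant are assumptions for illustration.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip so that a probability of exactly 0 does not produce -inf.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


uniform = [[1 / 3, 1 / 3, 1 / 3]] * 2
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
print(log_loss([0, 1], uniform))
print(log_loss([0, 1], confident))
```

A uniform guess across three authors scores ln 3, roughly 1.0986, which is close to the loss reported at step 1 of training above.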

Generating predictions

I wanted to see how it’d do against Kaggle’s test dataset so I generated a CSV file with predictions:

test_df = pd.read_csv("test.csv")
X_test = pd.Series(test_df[TEXT_COLUMN])
X_test = np.array(list(vocab_processor.transform(X_test)))
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={WORDS_FEATURE: X_test},
    # Remaining arguments are assumed; prediction needs no labels.
    num_epochs=1,
    shuffle=False)
predictions = classifier.predict(test_input_fn)
y_predicted_classes = np.array(list(p['prob'] for p in predictions))
output = pd.DataFrame(y_predicted_classes, columns=le.classes_)
output["id"] = test_df["id"]
output.to_csv("output.csv", index=False, float_format='%.6f')

Here we go:

[Image: screenshot of the Kaggle submission score, taken 2018-01-29 06:44:30]

The score is roughly the same as we saw with the test split of the training set. If you want to see all the code in one place I’ve put it on my Spooky Authors GitHub repository.

Written by Mark Needham

January 29th, 2018 at 6:51 am

Asciidoc to Asciidoc: Exploding includes

One of my favourite features in AsciiDoc is the ability to include other files, but the problem with using lots of includes is that it becomes difficult to read the whole document unless you convert it to one of the supported backends.

$ asciidoctor --help
Usage: asciidoctor [OPTION]... FILE...
Translate the AsciiDoc source FILE or FILE(s) into the backend output format (e.g., HTML 5, DocBook 4.5, etc.)
By default, the output is written to a file with the basename of the source file and the appropriate extension.
Example: asciidoctor -b html5 source.asciidoc
    -b, --backend BACKEND            set output format backend: [html5, xhtml5, docbook5, docbook45, manpage] (default: html5)
                                     additional backends are supported via extensions (e.g., pdf, latex)

I don’t want to have to convert my code to one of these formats each time – I want to convert asciidoc to asciidoc!

For example, given the following files:


= My Blog example
== Heading 1
Some awesome text
== Heading 2


Some included text

I want to generate another asciidoc file where the contents of the include file are exploded and displayed inline.

After a lot of searching I came across an excellent script written by Dan Allen and put it in a file called adoc.rb. We can then call it like this:

$ ruby adoc.rb mydoc.adoc
= My Blog example
== Heading 1
Some awesome text
== Heading 2
Some included text

Problem solved!
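To illustrate the idea only, here is a naive Python sketch that inlines local include directives. It is not Dan Allen's script and it does not use Asciidoctor: it ignores include attributes (such as line or tag selections), conditionals, escaped directives and URI targets, and the function name and depth limit are assumptions.

```python
import re
from pathlib import Path

# Matches a whole-line directive such as include::chapter.adoc[]
INCLUDE_RE = re.compile(r"^include::(?P<target>[^\[\s]+)\[(?P<attrs>[^\]]*)\]\s*$")


def explode_includes(path, depth=0, max_depth=10):
    """Return the lines of an AsciiDoc file with local includes inlined."""
    path = Path(path)
    lines = []
    for line in path.read_text(encoding="utf-8").splitlines():
        match = INCLUDE_RE.match(line)
        if match and depth < max_depth:
            # Resolve the target relative to the including file.
            target = path.parent / match.group("target")
            lines.extend(explode_includes(target, depth + 1, max_depth))
        else:
            lines.append(line)
    return lines


# Example usage (assumes mydoc.adoc exists in the current directory):
# print("\n".join(explode_includes("mydoc.adoc")))
```

For anything beyond simple local includes, the Asciidoctor-based script is the safer choice because it applies the real include rules.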

In my case I actually wanted to explode HTTP includes so I needed to pass the -a allow-uri-read flag to the script:

$ ruby adoc.rb mydoc.adoc -a allow-uri-read

And now I can generate asciidoc files to my heart’s content.

Written by Mark Needham

January 23rd, 2018 at 9:11 pm

Posted in Software Development

Strava: Calculating the similarity of two runs

I go running several times a week and wanted to compare my runs against each other to see how similar they are.

I record my runs with the Strava app and it has an API that returns lat/long coordinates for each run in the Google encoded polyline algorithm format.

We can use the polyline library to decode these values into a list of lat/long tuples. For example:

>>> import polyline
>>> polyline.decode('u{~vFvyys@fS]')
[(40.63179, -8.65708), (40.62855, -8.65693)]
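If you are curious what the decoding involves, the following is a minimal pure-Python sketch of the encoded polyline algorithm: each coordinate is stored as a delta from the previous one, zigzag-encoded and split into 5-bit chunks offset by 63. The function name is an assumption and this is not the polyline library's code.

```python
def decode_polyline(encoded, precision=5):
    """Decode an encoded polyline string into a list of (lat, lng) tuples."""
    factor = 10 ** precision
    coordinates = []
    index = 0
    lat = lng = 0
    while index < len(encoded):
        deltas = []
        for _ in range(2):  # latitude delta first, then longitude delta
            shift = result = 0
            while True:
                chunk = ord(encoded[index]) - 63
                index += 1
                result |= (chunk & 0x1F) << shift
                shift += 5
                if chunk < 0x20:  # no continuation bit: last chunk of this value
                    break
            # Undo the zigzag encoding of signed values.
            deltas.append(~(result >> 1) if result & 1 else result >> 1)
        lat += deltas[0]
        lng += deltas[1]
        coordinates.append((lat / factor, lng / factor))
    return coordinates


print(decode_polyline("u{~vFvyys@fS]"))
```

Because every point after the first is a small delta, routes with many closely spaced points compress into short strings.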

Once we’ve got the route defined as a set of coordinates we need to compare them. My Googling led me to an algorithm called Dynamic Time Warping (DTW):

DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions.

The sequences are “warped” non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension.

The fastdtw library implements an approximation of this algorithm and returns a value indicating the distance between sets of points.
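To show what is being approximated, here is a small sketch of exact DTW using the classic dynamic programming table, with Euclidean distance between 2D points. It runs in quadratic time, which is the cost fastdtw is designed to avoid; the function name is an assumption and this is not fastdtw's implementation.

```python
import math


def dtw_distance(a, b):
    """Exact DTW distance between two sequences of (x, y) points."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.hypot(a[i - 1][0] - b[j - 1][0], a[i - 1][1] - b[j - 1][1])
            # Extend the cheapest of: skip in a, skip in b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


route = [(0.0, 0.0), (1.0, 1.0)]
same_route_slower = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
print(dtw_distance(route, same_route_slower))
print(dtw_distance(route, [(5.0, 5.0), (6.0, 6.0)]))
```

The first pair follows the same path at a different pace and so aligns with no cost, while the second pair is spatially far apart and scores much higher.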

We can see how to apply fastdtw and polyline against Strava data in the following example:

import os
import polyline
import requests
from fastdtw import fastdtw
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}
def find_points(activity_id):
    r = requests.get("{0}".format(activity_id), headers=headers)
    response = r.json()
    line = response["map"]["polyline"]
    return polyline.decode(line)

Now let’s try it out on two runs, 1361109741 and 1346460542:

from scipy.spatial.distance import euclidean
activity1_id = 1361109741
activity2_id = 1346460542
distance, path = fastdtw(find_points(activity1_id), find_points(activity2_id), dist=euclidean)
>>> print(distance)

These two runs are both near my house so the value is small. Let’s change the second route to be from my trip to New York:

activity1_id = 1361109741
activity2_id = 1246017379
distance, path = fastdtw(find_points(activity1_id), find_points(activity2_id), dist=euclidean)
>>> print(distance)

Much bigger!

I’m not really interested in the actual value returned but I am interested in the relative values. I’m building a little application to generate routes that I should run and I want it to come up with routes that are different to recent ones that I’ve run. This score can now form part of the criteria.

Written by Mark Needham

January 18th, 2018 at 11:35 pm

Leaflet: Fit polyline in view

I’ve been playing with the Leaflet.js library over the Christmas holidays to visualise running routes drawn onto the map using a Polyline and I wanted to zoom the map the right amount to see all the points.

Prerequisites

We have the following HTML to define the div that will contain the map.

<div id="container">
	<div id="map" style="width: 100%; height: 100%"></div>
</div>

We also need to import the following Javascript and CSS files:

<script src=""></script>
  <script type="text/javascript" src=""></script>
  <link rel="stylesheet" href=""/>
  <link rel="stylesheet" href=""/>
  <script src=""></script>

Polyline representing part of a route

The following code creates a polyline for a Strava segment that I often run.

var map = L.map('map');
L.tileLayer('http://{s}{z}/{x}/{y}.png', {maxZoom: 18,}).addTo(map);
var rawPoints = [
  { "latitude": 51.357874010145395, "longitude": -0.198045110923591 },
  { "latitude": 51.3573858289394, "longitude": -0.19787754933584795 },
  { "latitude": 51.35632791810057, "longitude": -0.19750254941422557 },
  { "latitude": 51.35553240304241, "longitude": -0.197232163894512 },
  { "latitude": 51.35496267279901, "longitude": -0.1970247338143316 },
  { "latitude": 51.35388700570004, "longitude": -0.19666483094752069 },
  { "latitude": 51.3533898352570, "longitude": -0.1964976504847828 },
  { "latitude": 51.35358452733139, "longitude": -0.19512563906602554 },
  { "latitude": 51.354762877995036, "longitude": -0.1945622934585907 },
  { "latitude": 51.355610110109986, "longitude": -0.19468697186046677 },
  { "latitude": 51.35680377680643, "longitude": -0.19395063336295112 },
  { "latitude": 51.356861596801075, "longitude": -0.1936180154828497 },
  { "latitude": 51.358487396611125, "longitude": -0.19349660642888197 }
];
var coordinates = rawPoints.map(rawPoint => new L.LatLng(rawPoint["latitude"], rawPoint["longitude"]));
let polyline = L.polyline(
    coordinates,
    {
        color: 'blue',
        weight: 3,
        opacity: .7,
        lineJoin: 'round'
    }
);
polyline.addTo(map);

I wanted to centre the map around the polyline and initially wrote the following code to do this:

let lats = rawPoints.map(c => c.latitude).reduce((previous, current) => current += previous, 0.0);
let longs = rawPoints.map(c => c.longitude).reduce((previous, current) => current += previous, 0.0);
const position = [lats / rawPoints.length, longs / rawPoints.length];
map.setView(position, 17);

This works fine but the zoom factor was wrong when I drew longer polylines so I needed a better solution.
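To see why a fixed zoom level struggles, it helps to look at the extent of the points rather than their average. The Python sketch below computes a bounding box and the span it covers; the function names and sample routes are assumptions for illustration, and it ignores routes that cross the antimeridian.

```python
def bounding_box(points):
    """Return ((south, west), (north, east)) for a list of point dicts."""
    lats = [p["latitude"] for p in points]
    lngs = [p["longitude"] for p in points]
    return (min(lats), min(lngs)), (max(lats), max(lngs))


def extent(points):
    """Return the (latitude, longitude) span covered by the points, in degrees."""
    (south, west), (north, east) = bounding_box(points)
    return north - south, east - west


# Hypothetical routes: the second one simply extends the first.
short_route = [
    {"latitude": 51.3579, "longitude": -0.1980},
    {"latitude": 51.3534, "longitude": -0.1965},
]
long_route = short_route + [{"latitude": 51.4000, "longitude": -0.1000}]

print(extent(short_route))
print(extent(long_route))
```

The average position says nothing about how far the route spreads, whereas the bounding box grows with the route, so a zoom level derived from the box adapts to longer polylines.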

I should have RTFM because there’s a much simpler way to do this. I actually found the explanation in a GitHub issue from 2011! We can replace the previous snippet with this single line of code:

map.fitBounds(polyline.getBounds());


And this is how it looks on the screen:

[Image: screenshot of the map zoomed to fit the polyline, taken 2017-12-31 17:30:25]

Written by Mark Needham

December 31st, 2017 at 5:35 pm

Posted in Javascript

Ethereum Hello World Example using solc and web3

I’ve been trying to find an Ethereum Hello World example and came across Thomas Conté’s excellent post that shows how to compile and deploy an Ethereum smart contract with solc and web3.

In the latest version of web3 the API has changed to be based on promises so I decided to translate Thomas’ example.

Let’s get started.

Install npm libraries

We need to install these libraries before we start:

npm install web3
npm install abi-decoder
npm install ethereumjs-testrpc

What do these libraries do?

  • web3 is a client library for interacting with an Ethereum blockchain
  • abi-decoder is used to decode the hash of a smart contract so that we can work out what was in it.
  • ethereumjs-testrpc lets us spin up a local test version of Ethereum

Smart contract

We’ll still use the same smart contract as Thomas did. Token.sol is a smart contract written in the Solidity language and describes money being transferred between addresses:


pragma solidity ^0.4.0;
contract Token {
    mapping (address => uint) public balances;
    function Token() {
        balances[msg.sender] = 1000000;
    }
    function transfer(address _to, uint _amount) {
        if (balances[msg.sender] < _amount) {
            throw;
        }
        balances[msg.sender] -= _amount;
        balances[_to] += _amount;
    }
}

When the contract is deployed, the constructor puts 1,000,000 in the account of whoever deployed it. After that, a transfer moves the requested amount from the sender to the recipient, assuming there’s enough money in the sender’s account.
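As a mental model, the contract's behaviour can be mimicked with a small Python class. This is only a sketch: the account names are made up, and it assumes that a transfer with an insufficient balance aborts without changing anything.

```python
class Token:
    """Toy in-memory model of the Token contract's balances."""

    def __init__(self, deployer):
        # The constructor credits the deploying account.
        self.balances = {deployer: 1000000}

    def transfer(self, sender, to, amount):
        # Assumed behaviour: abort when the sender cannot cover the amount.
        if self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount


token = Token("account0")  # hypothetical account names
token.transfer("account0", "account1", 10)
print(token.balances)
```

On the real blockchain the sender is not passed as an argument; it comes from the signed transaction (`msg.sender`), which is why the web3 calls later specify a `from` account.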

Start local Ethereum node

Let’s start a local Ethereum node. We’ll reduce the gas price – the amount you ‘pay’ to execute a transaction – so we don’t run out.

$ ./node_modules/.bin/testrpc --gasPrice 20000
EthereumJS TestRPC v6.0.3 (ganache-core: 2.0.2)
Listening on localhost:8545

Prerequisites

We need to load a few Node.js modules:

const fs = require("fs"),
      abiDecoder = require('abi-decoder'),
      Web3 = require('web3'),
      solc = require('solc');

Compile smart contract

Next we’ll compile our smart contract:

const input = fs.readFileSync('contracts/Token.sol');
const output = solc.compile(input.toString(), 1);
const bytecode = output.contracts[':Token'].bytecode;
const abi = JSON.parse(output.contracts[':Token'].interface);

Connect to Ethereum and create contract object

Now that we’ve got the ABI (Application Binary Interface) we’ll connect to our local Ethereum node and create a contract object based on the ABI:

let provider = new Web3.providers.HttpProvider("http://localhost:8545");
const web3 = new Web3(provider);
let Voting = new web3.eth.Contract(abi);

Add ABI to decoder

Before we interact with the blockchain we’ll first add the ABI to our ABI decoder to use later:

abiDecoder.addABI(abi);


Find (dummy) Ethereum accounts

Now we’re ready to create some transactions! We’ll need some Ethereum accounts to play with and if we call web3.eth.getAccounts we can get a collection of accounts that the node controls. Since our node is a test one these are all dummy accounts.

web3.eth.getAccounts().then(accounts => {
  accounts.forEach(account => {
    console.log(account);
  });
});

Transfer money between accounts

Now that we have some accounts let’s transfer some money between them.

var allAccounts;
web3.eth.getAccounts().then(accounts => {
  allAccounts = accounts;
  Voting.deploy({data: bytecode}).send({
    from: accounts[0],
    gas: 1500000,
    gasPrice: '30000000000000'
  }).on('receipt', receipt => {
    Voting.options.address = receipt.contractAddress;
    Voting.methods.transfer(accounts[1], 10).send({from: accounts[0]}).then(transaction => {
      console.log("Transfer lodged. Transaction ID: " + transaction.transactionHash);
      let blockHash = transaction.blockHash
      return web3.eth.getBlock(blockHash, true);
    }).then(block => {
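      // Decode each transaction in the block using the ABI registered earlier
      block.transactions.forEach(transaction => {
        console.log(abiDecoder.decodeMethod(transaction.input));
      });
      // Print the balance of every dummy account
      allAccounts.forEach(account => {
          Voting.methods.balances(account).call({from: allAccounts[0]}).then(amount => {
            console.log(account + ": " + amount);
          });
      });
    });
  });
});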

Let’s run it:

Transfer lodged. Transaction ID: 0x699cbe40121d6c2da7b36a107cd5f28b35a71aff2a0d584f8e734b10f4c49de4
{ name: 'transfer',
  params:
   [ { name: '_to',
       value: '0xeb25dbd0931386eeab267981626ae3908d598404',
       type: 'address' },
     { name: '_amount', value: '10', type: 'uint256' } ] }
0x084181d6fDe8bA802Ee85396aB1d25Ddf1d7D061: 999990
0xEb25dbD0931386eEaB267981626AE3908D598404: 10
0x7deB2487E6Ac40f85fB8f5A3bC6896391bf2570F: 0
0xA15ad4371B62afECE5a7A70457F82A30530630a3: 0
0x64644f3B6B95e81A385c8114DF81663C39084C6a: 0
0xBB68FF2935080c807D5A534b1fc481Aa3fafF1C0: 0
0x38d4A3d635B451Cb006d63ce542950C067D47F58: 0
0x7878bA9138361A08522418BD1c8376Af7220a506: 0
0xf400c0e749Fe02E7073E08d713E0A207dc91FBeb: 0
0x7070d1712a25eb7FCf78A549F17705AA66B0aD47: 0

This code:

  • Deploys our smart contract to the blockchain
  • Transfers £10 from account 1 to account 2
  • Decodes that transaction and shows the output
  • Shows the balances of all the dummy accounts

The full example is available in my ethereum-nursery GitHub repository. Thomas also has a follow-up post that shows how to deploy a contract on a remote node where client side signatures become necessary.

    Written by Mark Needham

    December 28th, 2017 at 11:03 am

    Posted in Ethereum

    Morning Pages: What should I write about?

    I’ve been journalling for almost 2 years now but some days I get stuck and can’t think of anything to write about.

    I did a bit of searching to see if anybody had advice on solving this problem and found a few different articles:

    The articles talk about different approaches to journalling and since I’m not following one particular approach I thought I’d summarise all their ideas and put them in a document that I can use. I’ve also added some of my own ones.

    Here’s the list:

    • Something you’re grateful for
    • An unusual event
    • Goals or hopes
    • A past event
    • Things that are stopping you achieving your goals
    • Values that are important to you
    • Ideas or nagging thoughts
    • What did you do yesterday?

      • What did you work on?
      • Who did you talk to?
      • What book did you read?
      • What podcasts did you listen to?
      • What TV shows/movies did you watch?
      • Where did you go?
    • What are you doing today?
    • What scares you?
    • Decisions you need to make/made – small or big

    If you have any other ideas of what I can write about let me know in the comments. While researching for this post I noticed that Julia Cameron, the inventor of Morning Pages, has a new book out – The Right to Write: An Invitation and Initiation into the Writing Life – so I’m hoping to get some ideas from there as well.

    Written by Mark Needham

    December 27th, 2017 at 11:28 pm

    Posted in Software Development

    scikit-learn: Using GridSearch to tune the hyper-parameters of VotingClassifier

    In my last blog post I showed how to create a multi class classification ensemble using scikit-learn’s VotingClassifier and finished by mentioning that I didn’t know which classifiers should be part of the ensemble.

    Each classifier needs to improve the ensemble’s score; otherwise it can be excluded.

    We have a TF/IDF based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

    import pandas as pd
    from sklearn import linear_model
    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    Y_COLUMN = "author"
    TEXT_COLUMN = "text"
    unigram_log_pipe = Pipeline([
        ('cv', CountVectorizer()),
        ('logreg', linear_model.LogisticRegression())
    ])
    ngram_pipe = Pipeline([
        ('cv', CountVectorizer(ngram_range=(1, 2))),
        ('mnb', MultinomialNB())
    ])
    tfidf_pipe = Pipeline([
        ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                                  strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                                  ngram_range=(1, 3), use_idf=1, smooth_idf=1, sublinear_tf=1)),
        ('mnb', MultinomialNB())
    ])
    # The order here determines the order of the voting weights.
    classifiers = [
        ("ngram", ngram_pipe),
        ("unigram", unigram_log_pipe),
        ("tfidf", tfidf_pipe),
    ]
    mixed_pipe = Pipeline([
        ("voting", VotingClassifier(classifiers, voting="soft"))
    ])

    Now we’re ready to work out which classifiers are needed. We’ll use GridSearchCV to do this.

    from sklearn.model_selection import GridSearchCV
    def combinations_on_off(num_classifiers):
        return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
                for i in range(1, 2 ** num_classifiers)]
    param_grid = dict(
        voting__weights=combinations_on_off(len(classifiers))
    )
    train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
    y = train_df[Y_COLUMN].copy()
    X = pd.Series(train_df[TEXT_COLUMN])
    grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid, n_jobs=-1, verbose=10, scoring="neg_log_loss")
    grid_search.fit(X, y)
    cv_results = grid_search.cv_results_
    for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
        print(params, mean_score)
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

    Let’s run the grid search and see what it comes up with:

    {'voting__weights': [0, 0, 1]} -0.60533660756
    {'voting__weights': [0, 1, 0]} -0.474562462086
    {'voting__weights': [0, 1, 1]} -0.508363479586
    {'voting__weights': [1, 0, 0]} -0.697231760084
    {'voting__weights': [1, 0, 1]} -0.456599644003
    {'voting__weights': [1, 1, 0]} -0.409406571361
    {'voting__weights': [1, 1, 1]} -0.439084397238
    Best score: -0.409
    Best parameters set:
    	voting__weights: [1, 1, 0]

    We can see from the output that we’ve tried every combination of each of the classifiers. The output suggests that we should only include the ngram_pipe and unigram_log_pipe classifiers. tfidf_pipe should not be included – our log loss score is worse when it is added.
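To see why a weight of 0 removes a classifier from a soft-voting ensemble, here is a minimal sketch of weighted probability averaging in plain Python. The function name and the sample probabilities are assumptions for illustration; this is not scikit-learn's code.

```python
def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-classifier class probabilities for one sample."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Hypothetical probabilities from three classifiers for a two-class sample
probas = [[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
print(weighted_soft_vote(probas, [1, 1, 1]))
print(weighted_soft_vote(probas, [1, 1, 0]))  # third classifier switched off
```

With weights of `[1, 1, 0]` the third classifier's probabilities contribute nothing to the average, so the grid search over 0/1 weights is effectively a search over subsets of classifiers.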

    The code is on GitHub if you want to see it all in one place.

    Written by Mark Needham

    December 10th, 2017 at 7:55 am

    scikit-learn: Building a multi class classification ensemble

    For Kaggle’s Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble and found the VotingClassifier, which does exactly that.

    We need to predict the probability that a sentence is written by one of three authors so the VotingClassifier needs to make a ‘soft’ prediction. If we only needed to know the most likely author we could have it make a ‘hard’ prediction instead.
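The difference between the two modes can be sketched in plain Python: a hard vote counts each classifier's predicted label, while a soft vote averages the predicted probabilities. The function names and numbers below are assumptions for illustration only.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Return the unweighted average probability for each class."""
    n = len(probabilities)
    return [sum(p[c] for p in probabilities) / n for c in range(len(probabilities[0]))]


# Hypothetical output of three classifiers for one sentence, three authors
probas = [[0.9, 0.1, 0.0], [0.4, 0.6, 0.0], [0.45, 0.55, 0.0]]
labels = [p.index(max(p)) for p in probas]

print(hard_vote(labels))
print(soft_vote(probas))
```

In this made-up case the two modes disagree: two of the three classifiers narrowly prefer the second author, but the averaged probabilities favour the first because one classifier is very confident. Only the soft mode yields the per-author probabilities that a log loss metric needs.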

    We start with three classifiers which generate different n-gram based features. The code for those is as follows:

    from sklearn import linear_model
    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    ngram_pipe = Pipeline([
        ('cv', CountVectorizer(ngram_range=(1, 2))),
        ('mnb', MultinomialNB())
    ])
    unigram_log_pipe = Pipeline([
        ('cv', CountVectorizer()),
        ('logreg', linear_model.LogisticRegression())
    ])

    We can combine those classifiers together like this:

    classifiers = [
        ("ngram", ngram_pipe),
        ("unigram", unigram_log_pipe),
    ]
    mixed_pipe = Pipeline([
        ("voting", VotingClassifier(classifiers, voting="soft"))
    ])

    Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane‘s tutorial.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn import metrics
    Y_COLUMN = "author"
    TEXT_COLUMN = "text"
    def test_pipeline(df, nlp_pipeline):
        y = df[Y_COLUMN].copy()
        X = pd.Series(df[TEXT_COLUMN])
        rskf = StratifiedKFold(n_splits=5, random_state=1)
        losses = []
        accuracies = []
        for train_index, test_index in rskf.split(X, y):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            nlp_pipeline.fit(X_train, y_train)
            losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
            accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
        print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
            str([str(round(x, 3)) for x in sorted(losses)]),
            round(np.mean(losses), 3),
            round(np.mean(accuracies), 3)
        ))
    train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
    test_pipeline(train_df, mixed_pipe)

    Let’s run the script:

    kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393 mean accuracy: 0.849

    Looks good.

    I’ve actually got several other classifiers as well but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearch to work that out.

    Written by Mark Needham

    December 5th, 2017 at 10:19 pm

    Python: Combinations of values on and off

    In my continued exploration of Kaggle’s Spooky Authors competition, I wanted to run a GridSearch turning on and off different classifiers to work out the best combination.

    I therefore needed to generate combinations of 1s and 0s enabling different classifiers.

    e.g. if we had 3 classifiers we’d generate these combinations

    0 0 1
    0 1 0
    1 0 0
    1 1 0
    1 0 1
    0 1 1
    1 1 1


    • ‘0 0 1’ means: classifier1 is disabled, classifier2 is disabled, classifier3 is enabled
    • ‘0 1 0’ means: classifier1 is disabled, classifier2 is enabled, classifier3 is disabled
    • ‘1 1 0’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is disabled
    • ‘1 1 1’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is enabled

    …and so on. In other words, we need to generate the binary representation for all the values from 1 to 2^(number of classifiers) - 1.

    We can write the following code fragments to calculate a 3 bit representation of different numbers:

    >>> "{0:0b}".format(1).zfill(3)
    '001'
    >>> "{0:0b}".format(5).zfill(3)
    '101'
    >>> "{0:0b}".format(6).zfill(3)
    '110'

    We need an array of 0s and 1s rather than a string, so let’s use the list function to create our array and then cast each value to an integer:

    >>> [int(x) for x in list("{0:0b}".format(1).zfill(3))]
    [0, 0, 1]

    Finally we can wrap that code inside a list comprehension:

    def combinations_on_off(num_classifiers):
        return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
                for i in range(1, 2 ** num_classifiers)]

    And let’s check it works:

    >>> for combination in combinations_on_off(3):
    ...     print(combination)
    [0, 0, 1]
    [0, 1, 0]
    [0, 1, 1]
    [1, 0, 0]
    [1, 0, 1]
    [1, 1, 0]
    [1, 1, 1]

    What about if we have 4 classifiers?

    >>> for combination in combinations_on_off(4):
    ...     print(combination)
    [0, 0, 0, 1]
    [0, 0, 1, 0]
    [0, 0, 1, 1]
    [0, 1, 0, 0]
    [0, 1, 0, 1]
    [0, 1, 1, 0]
    [0, 1, 1, 1]
    [1, 0, 0, 0]
    [1, 0, 0, 1]
    [1, 0, 1, 0]
    [1, 0, 1, 1]
    [1, 1, 0, 0]
    [1, 1, 0, 1]
    [1, 1, 1, 0]
    [1, 1, 1, 1]

    Perfect! We can now use this function to help work out which combinations of classifiers are needed.
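As an aside, the same list can be produced without string formatting. The sketch below is an alternative using `itertools.product` from the standard library; the function name is made up to avoid clashing with the one above.

```python
from itertools import product


def combinations_on_off_product(num_classifiers):
    """All 0/1 combinations in binary counting order, minus the all-off case."""
    combos = [list(bits) for bits in product([0, 1], repeat=num_classifiers)]
    return combos[1:]  # the first entry is all zeros, i.e. every classifier off


for combination in combinations_on_off_product(3):
    print(combination)
```

`product` yields tuples in lexicographic order, which for the values 0 and 1 is the same as counting in binary, so dropping the first entry gives the same sequence as the format-based version.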

    Written by Mark Needham

    December 3rd, 2017 at 5:23 pm

    Posted in Python
