Mark Needham

Thoughts on Software Development

Archive for the ‘Python’ Category

scikit-learn: Using GridSearch to tune the hyper-parameters of VotingClassifier

without comments

In my last blog post I showed how to create a multi class classification ensemble using scikit-learn’s VotingClassifier and finished mentioning that I didn’t know which classifiers should be part of the ensemble.

Each classifier in the ensemble needs to improve the overall score, otherwise it can be excluded.

We have a TF/IDF based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                              strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                              ngram_range=(1, 3), use_idf=1, smooth_idf=1, sublinear_tf=1,
                              stop_words='english')),
    ('mnb', MultinomialNB())
])
 
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
    ("tfidf", tfidf_pipe),
]
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

Now we’re ready to work out which classifiers are needed. We’ll use GridSearchCV to do this.

from sklearn.model_selection import GridSearchCV
 
 
def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]
 
 
param_grid = dict(
    voting__weights=combinations_on_off(len(classifiers))
)
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
y = train_df[Y_COLUMN].copy()
X = pd.Series(train_df[TEXT_COLUMN])
 
grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid, n_jobs=-1, verbose=10, scoring="neg_log_loss")
 
grid_search.fit(X, y)
 
cv_results = grid_search.cv_results_
 
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)
 
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Let’s run the grid scan and see what it comes up with:

{'voting__weights': [0, 0, 1]} -0.60533660756
{'voting__weights': [0, 1, 0]} -0.474562462086
{'voting__weights': [0, 1, 1]} -0.508363479586
{'voting__weights': [1, 0, 0]} -0.697231760084
{'voting__weights': [1, 0, 1]} -0.456599644003
{'voting__weights': [1, 1, 0]} -0.409406571361
{'voting__weights': [1, 1, 1]} -0.439084397238
 
Best score: -0.409
Best parameters set:
	voting__weights: [1, 1, 0]

We can see from the output that we’ve tried every combination of each of the classifiers. The output suggests that we should only include the ngram_pipe and unigram_log_pipe classifiers. tfidf_pipe should not be included – our log loss score is worse when it is added.
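If we want to keep the weights the grid search picked, we can bake them straight into the ensemble. A minimal sketch on toy data (my own example, not the Spooky Author dataset) showing that a weight of 0 in a soft vote is equivalent to dropping that classifier:

```python
# Sketch (toy data, not from the post): a weight of 0 removes a
# classifier's probabilities from the soft-vote average, so fixing
# weights=[1, 1, 0] is equivalent to dropping the third estimator.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

classifiers = [
    ("logreg", LogisticRegression()),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]

weighted = VotingClassifier(classifiers, voting="soft", weights=[1, 1, 0]).fit(X, y)
dropped = VotingClassifier(classifiers[:2], voting="soft").fit(X, y)

# Both ensembles average the same two sets of probabilities
print(np.allclose(weighted.predict_proba(X), dropped.predict_proba(X)))
```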

The code is on GitHub if you want to see it all in one place.

Written by Mark Needham

December 10th, 2017 at 7:55 am

scikit-learn: Building a multi class classification ensemble


For the Kaggle Spooky Author Identification I wanted to combine multiple classifiers together into an ensemble and found the VotingClassifier that does exactly that.

We need to predict the probability that a sentence is written by one of three authors so the VotingClassifier needs to make a ‘soft’ prediction. If we only needed to know the most likely author we could have it make a ‘hard’ prediction instead.
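To illustrate the difference on a toy dataset (my own sketch, not the competition data): ‘soft’ averages the classifiers’ predicted probabilities, while ‘hard’ takes a majority vote over predicted labels and therefore doesn’t expose probabilities at all:

```python
# Sketch: soft voting averages predict_proba outputs; hard voting takes
# a majority vote of labels and provides no predict_proba.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
estimators = [("logreg", LogisticRegression()), ("nb", GaussianNB())]

soft = VotingClassifier(estimators, voting="soft").fit(X, y)
hard = VotingClassifier(estimators, voting="hard").fit(X, y)

print(soft.predict_proba(X).shape)  # one probability per class: (4, 2)
print(hard.predict(X))              # labels only
```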

We start with three classifiers which generate different n-gram based features. The code for those is as follows:

from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])

We can combine those classifiers together like this:

classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane’s tutorial.

import pandas as pd
import numpy as np
 
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
 
def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    rskf = StratifiedKFold(n_splits=5, random_state=1)
    losses = []
    accuracies = []
    for train_index, test_index in rskf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
 
    print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)

Let’s run the script:

kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849

Looks good.

I’ve actually got several other classifiers as well but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearch to work that out.

Written by Mark Needham

December 5th, 2017 at 10:19 pm

Python: Combinations of values on and off


In my continued exploration of Kaggle’s Spooky Authors competition, I wanted to run a GridSearch turning on and off different classifiers to work out the best combination.

I therefore needed to generate combinations of 1s and 0s enabling different classifiers.

e.g. if we had 3 classifiers we’d generate these combinations:

0 0 1
0 1 0
1 0 0
1 1 0
1 0 1
0 1 1
1 1 1

where…

  • ‘0 0 1’ means: classifier1 is disabled, classifier2 is disabled, classifier3 is enabled
  • ‘0 1 0’ means: classifier1 is disabled, classifier2 is enabled, classifier3 is disabled
  • ‘1 1 0’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is disabled
  • ‘1 1 1’ means: classifier1 is enabled, classifier2 is enabled, classifier3 is enabled

…and so on. In other words, we need to generate the binary representation of every value from 1 to 2^(number of classifiers) - 1.

We can write the following code fragments to calculate a 3 bit representation of different numbers:

>>> "{0:0b}".format(1).zfill(3)
'001'
>>> "{0:0b}".format(5).zfill(3)
'101'
>>> "{0:0b}".format(6).zfill(3)
'110'

We need an array of 0s and 1s rather than a string, so let’s use the list function to create our array and then cast each value to an integer:

>>> [int(x) for x in list("{0:0b}".format(1).zfill(3))]
[0, 0, 1]

Finally we can wrap that code inside a list comprehension:

def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]

And let’s check it works:

>>> for combination in combinations_on_off(3):
...     print(combination)
 
[0, 0, 1]
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]

What if we have 4 classifiers?

>>> for combination in combinations_on_off(4):
...     print(combination)
 
[0, 0, 0, 1]
[0, 0, 1, 0]
[0, 0, 1, 1]
[0, 1, 0, 0]
[0, 1, 0, 1]
[0, 1, 1, 0]
[0, 1, 1, 1]
[1, 0, 0, 0]
[1, 0, 0, 1]
[1, 0, 1, 0]
[1, 0, 1, 1]
[1, 1, 0, 0]
[1, 1, 0, 1]
[1, 1, 1, 0]
[1, 1, 1, 1]

Perfect! We can now use this function to help work out which combinations of classifiers are needed.
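For what it’s worth, itertools can generate the same list without the binary-string formatting. This alternative (my sketch, not from the post) enumerates every 0/1 tuple of the given length and drops the all-zeros combination, which would disable every classifier:

```python
# Alternative sketch using itertools.product: every 0/1 tuple of the
# given length, minus the all-zeros combination (no classifiers at all).
from itertools import product

def combinations_on_off(num_classifiers):
    return [list(bits)
            for bits in product([0, 1], repeat=num_classifiers)
            if any(bits)]

print(combinations_on_off(3))
# [[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
```

Because product counts up in the same order as the binary representation, the output matches the original function exactly.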

Written by Mark Needham

December 3rd, 2017 at 5:23 pm

Posted in Python


Python: Learning about defaultdict’s handling of missing keys


While reading the scikit-learn code I came across a bit of code that I didn’t understand for a while but in retrospect is quite neat.

This is the code snippet that intrigued me:

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

Let’s quickly see how it works by adapting an example from scikit-learn:

>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__
 
>>> vocabulary["foo"]
0
>>> vocabulary.items()
dict_items([('foo', 0)])
 
>>> vocabulary["bar"]
1
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])

What seems to happen is that when we try to find a key that doesn’t exist in the dictionary, an entry gets created whose value is the number of items already in the dictionary.

Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:

>>> vocabulary["baz"] = "Mark"
>>> vocabulary["baz"]
'Mark'
>>> vocabulary["python"]
3

Now let’s see what the dictionary contains:

>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])

All makes sense so far. If we look at the source code we can see that this is exactly what’s going on:

"""
__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
  if self.default_factory is None: raise KeyError((key,))
  self[key] = value = self.default_factory()
  return value
"""
pass

scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case for this trick.
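A minimal sketch of that use case (my own example, not scikit-learn’s actual code): each previously unseen token gets assigned the next free column index.

```python
# Sketch: each new token gets a value of len(vocabulary) at the moment
# it's first looked up, i.e. the next free column index.
from collections import defaultdict

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

for token in ["the", "cat", "sat", "the", "mat"]:
    vocabulary[token]   # first lookup allocates an index; repeats are no-ops

print(dict(vocabulary))   # {'the': 0, 'cat': 1, 'sat': 2, 'mat': 3}
```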

All in all, very neat!

Written by Mark Needham

December 1st, 2017 at 3:26 pm

Posted in Python


scikit-learn: Creating a matrix of named entity counts


I’ve been trying to improve my score on Kaggle’s Spooky Author Identification competition, and my latest idea was building a model which used named entities extracted using the polyglot NLP library.

We’ll start by learning how to extract entities from a sentence using polyglot, which isn’t too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This sentence contains three entities. We’d like each entity to be a string rather than an array of values so let’s refactor the code to do that:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']

That’s it for the polyglot part of the solution. Now let’s work out how to integrate that with scikit-learn.

I’ve been using scikit-learn’s Pipeline abstraction for the other models I’ve created so I’d like to take the same approach here. This is an example of a model that builds a matrix of unigram counts and trains a Naive Bayes model on top of that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])
 
...
# Train and Test the model
...

I was going to write a class similar to CountVectorizer but after reading its code for a couple of hours I realised that I could just pass in a custom analyzer instead. This is what I ended up with:

entities = {}
 
 
def analyze(doc):
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]
 
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=lambda doc: analyze(doc))),
    ('mnb', MultinomialNB())
])

I’m caching the results in a dictionary because the entity extraction is quite time consuming and there’s no point recalculating it each time the function is called.
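functools.lru_cache would achieve the same caching with less code. A sketch using a stand-in extraction function, since polyglot may not be installed (the stand-in just grabs capitalised words, it isn’t real entity extraction):

```python
# Sketch: lru_cache memoises the analyzer so the expensive extraction
# only runs once per distinct document. analyze's body is a stand-in
# for the polyglot call.
from functools import lru_cache

calls = []

@lru_cache(maxsize=None)
def analyze(doc):
    calls.append(doc)  # record how often the body actually runs
    return tuple(word for word in doc.split() if word.istitle())

analyze("My name is David Beckham")
analyze("My name is David Beckham")  # second call served from the cache
print(len(calls))  # 1
```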

Unfortunately this model didn’t help me improve my best score. It scores a log loss of around 0.5, a bit worse than the 0.45 I’ve achieved using the unigram model 🙁

Written by Mark Needham

November 29th, 2017 at 11:01 pm

Python: polyglot – ModuleNotFoundError: No module named ‘icu’


I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter Trolls but had installation problems which I thought I’d share in case anyone else experiences the same issues.

I started by trying to install polyglot:

$ pip install polyglot
 
ImportError: No module named 'icu'

Hmmm, I’m not sure what icu is, but luckily there’s a GitHub issue covering this problem. That led me to Toby Fleming’s blog post that suggests the following steps:

brew install icu4c
export ICU_VERSION=58
export PYICU_INCLUDES=/usr/local/Cellar/icu4c/58.2/include
export PYICU_LFLAGS=-L/usr/local/Cellar/icu4c/58.2/lib
pip install pyicu

I already had icu4c installed so I just had to make sure that I had the same version of that library as Toby did. I ran the following command to check that:

$ ls -lh /usr/local/Cellar/icu4c/
total 0
drwxr-xr-x  12 markneedham  admin   408B 28 Nov 06:12 58.2

That still wasn’t enough though! I had to install these two libraries as well:

pip install pycld2
pip install morfessor

I was then able to install polyglot, but had to then run the following commands to download the files needed for entity extraction:

polyglot download embeddings2.de
polyglot download ner2.de
polyglot download embeddings2.en
polyglot download ner2.en

Written by Mark Needham

November 28th, 2017 at 7:52 pm

Posted in Python


Python 3: TypeError: unsupported format string passed to numpy.ndarray.__format__


This post explains how to work around a change in how Python string formatting works for numpy arrays between Python 2 and Python 3.

I’ve been going through Kevin Markham’s scikit-learn Jupyter notebooks and ran into a problem on the Cross Validation one, which was throwing this error when attempting to print the KFold example:

Iteration                   Training set observations                   Testing set observations
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-007cbab507e3> in <module>()
      6 print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
      7 for iteration, data in enumerate(kf, start=1):
----> 8     print('{0:^9} {1} {2:^25}'.format(iteration, data[0], data[1]))
 
TypeError: unsupported format string passed to numpy.ndarray.__format__

We can reproduce this easily:

>>> import numpy as np
>>> "{:9}".format(np.array([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported format string passed to numpy.ndarray.__format__

What about if we use Python 2?

>>> "{:9}".format(np.array([1,2,3]))
'[1 2 3]  '

Hmmm, must be a change between the Python versions.

We can work around it by coercing our numpy array to a string:

>>> "{:9}".format(str(np.array([1,2,3])))
'[1 2 3]  '
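Another option (an assumption of mine, not from the notebook) is the !s conversion flag, which stringifies the value before the format spec is applied, so no explicit str() call is needed:

```python
# Sketch: "!s" converts the array with str() first, so the width spec
# is applied to a plain string rather than to the ndarray itself.
import numpy as np

print("{!s:9}".format(np.array([1, 2, 3])))  # '[1 2 3]  '
```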

Written by Mark Needham

November 19th, 2017 at 7:16 am

Posted in Python


Python 3: Create sparklines using matplotlib


I recently wanted to create sparklines to show how some values were changing over time. In addition, I wanted to generate them as images on the server rather than introducing a JavaScript library.

Chris Seymour’s excellent gist which shows how to create sparklines inside a Pandas dataframe got me most of the way there, but I had to tweak his code a bit to get it to play nicely with Python 3.6.

This is what I ended up with:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import base64
 
from io import BytesIO
 
def sparkline(data, figsize=(4, 0.25), **kwags):
    """
    Returns a base64 encoded sparkline style plot, ready to embed in an
    HTML image tag
    """
    data = list(data)
 
    fig, ax = plt.subplots(1, 1, figsize=figsize, **kwags)
    ax.plot(data)
    for k,v in ax.spines.items():
        v.set_visible(False)
    ax.set_xticks([])
    ax.set_yticks([])
 
    plt.plot(len(data) - 1, data[len(data) - 1], 'r.')
 
    ax.fill_between(range(len(data)), data, len(data)*[min(data)], alpha=0.1)
 
    img = BytesIO()
    plt.savefig(img, transparent=True, bbox_inches='tight')
    img.seek(0)
    plt.close()
 
    return base64.b64encode(img.read()).decode("UTF-8")

I had to change the class used to write the image from StringIO to BytesIO, and I found I needed to decode the bytes produced if I wanted to display the result in an HTML page.
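The BytesIO and base64 dance can be seen in isolation in this sketch (the write() call stands in for plt.savefig so it runs without matplotlib):

```python
# Sketch: in Python 3 savefig needs a bytes buffer, and b64encode
# returns bytes that have to be decoded before going into an HTML
# attribute. The write() below stands in for plt.savefig(img, ...).
import base64
from io import BytesIO

img = BytesIO()
img.write(b"\x89PNG fake image bytes")  # stand-in for plt.savefig(img, ...)
img.seek(0)

encoded = base64.b64encode(img.read()).decode("UTF-8")
print(type(encoded).__name__)  # str, ready to embed in an <img> tag
```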

This is how you would call the above function:

if __name__ == "__main__":
    values = [
        [1,2,3,4,5,6,7,8,9,10],
        [7,10,12,18,2,8,10,6,7,12],
        [10,9,8,7,6,5,4,3,2,1]
    ]
 
    with open("/tmp/foo.html", "w") as file:
        for value in values:
            file.write('<div><img src="data:image/png;base64,{}"/></div>'.format(sparkline(value)))

And the HTML page looks like this:

[Screenshot: the three sparklines rendered in the browser]

Written by Mark Needham

September 23rd, 2017 at 6:51 am

PHP vs Python: Generating an HMAC


I’ve been writing a bit of code to integrate with a ClassMarker webhook, and you’re required to verify that an incoming request actually came from ClassMarker by checking a base64-encoded HMAC SHA256 hash.

The example in the documentation is written in PHP which I haven’t done for about 10 years so I had to figure out how to do the same thing in Python.

This is the PHP version:

$ php -a
php > echo base64_encode(hash_hmac("sha256", "my data", "my_secret", true));
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=

The Python equivalent is a bit more code but it’s not too bad.

Import all the libraries

import hmac
import hashlib
import base64

Generate that hash

data = "my data".encode("utf-8")
digest = hmac.new(b"my_secret", data, digestmod=hashlib.sha256).digest()
 
print(base64.b64encode(digest).decode())
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=

We’re getting the same value as the PHP version so it’s good times all round.
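On the receiving side you’d compare the computed hash against the one in the request. A sketch (my addition, not from the ClassMarker docs) using hmac.compare_digest, which avoids leaking information through comparison timing:

```python
# Sketch: recompute the HMAC for the incoming payload and compare it
# to the header value with hmac.compare_digest rather than ==.
import base64
import hashlib
import hmac

def verify(payload, secret, header_value):
    digest = hmac.new(secret, payload, digestmod=hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, header_value)

print(verify(b"my data", b"my_secret",
             "vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE="))  # True
```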

Written by Mark Needham

August 2nd, 2017 at 6:09 am

Posted in Python


Pandas/scikit-learn: get_dummies test/train sets – ValueError: shapes not aligned


I’ve been using pandas’ get_dummies function to generate dummy columns for categorical variables to use with scikit-learn, but noticed that it sometimes doesn’t work as I expect.

Prerequisites

import pandas as pd
import numpy as np
from sklearn import linear_model

Let’s say we have the following training and test sets:

Training set

train = pd.DataFrame({"letter":["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
y_train = train["value"]

Test set

test = pd.DataFrame({"letter":["D", "D", "B", "E"], "value": [4, 5, 7, 19]})
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
y_test = test["value"]

Now say we want to train a linear model on our training set and then use it to predict the values in our test set:

Train the model

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model

model.score(X_test, y_test)
ValueError: shapes (4,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)

Hmmm that didn’t go to plan. If we print X_train and X_test it might help shed some light:

Checking the train/test datasets

print(X_train)
   letter_A  letter_B  letter_C  letter_D
0         1         0         0         0
1         0         1         0         0
2         0         0         1         0
3         0         0         0         1
print(X_test)
   letter_B  letter_D  letter_E
0         0         1         0
1         0         1         0
2         1         0         0
3         0         0         1

They do indeed have different shapes and some different column names because the test set contained some values that weren’t present in the training set.

We can fix this by making the ‘letter’ field categorical before we run the get_dummies method over the dataframe. At the moment the field is of type ‘object’:

Column types

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
letter    4 non-null object
value     4 non-null int64
dtypes: int64(1), object(1)
memory usage: 144.0+ bytes

We can fix this by converting the ‘letter’ field to the type ‘category’ and setting the list of allowed values to be the unique set of values in the train/test sets.

All allowed values

all_data = pd.concat((train,test))
for column in all_data.select_dtypes(include=[np.object]).columns:
    print(column, all_data[column].unique())
letter ['A' 'B' 'C' 'D' 'E']

Now let’s update the type of our ‘letter’ field in the train and test dataframes.

Type: ‘category’

all_data = pd.concat((train,test))
 
for column in all_data.select_dtypes(include=[np.object]).columns:
    train[column] = train[column].astype('category', categories = all_data[column].unique())
    test[column] = test[column].astype('category', categories = all_data[column].unique())

And now if we call get_dummies on either dataframe we’ll get the same set of columns:

get_dummies: Take 2

X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
print(X_train)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         1         0         0         0         0
1         0         1         0         0         0
2         0         0         1         0         0
3         0         0         0         1         0
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
print(X_test)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         0         0         0         1         0
1         0         0         0         1         0
2         0         1         0         0         0
3         0         0         0         0         1

Great! Now we should be able to train our model and use it against the test set:

Train the model: Take 2

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model: Take 2

model.score(X_test, y_test)
-1.0604490500863557

And we’re done!
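An alternative worth knowing about (my sketch, not from the post): reindex the test dummies against the training columns. Missing columns get zero-filled and categories the model never saw during training are dropped, which is usually what you want at prediction time:

```python
# Sketch: reindex aligns X_test's dummy columns with X_train's; letters
# unseen in training (like 'E') are dropped, missing ones zero-filled.
import pandas as pd

train = pd.DataFrame({"letter": ["A", "B", "C", "D"]})
test = pd.DataFrame({"letter": ["D", "D", "B", "E"]})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

print(list(X_test.columns))  # identical to X_train's columns
```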

Written by Mark Needham

July 5th, 2017 at 3:42 pm

Posted in Python
