Mark Needham

Thoughts on Software Development

Archive for the ‘machine learning’ tag

scikit-learn: Using GridSearch to tune the hyper-parameters of VotingClassifier


In my last blog post I showed how to create a multi-class classification ensemble using scikit-learn’s VotingClassifier, and finished by mentioning that I didn’t know which classifiers should be part of the ensemble.

Each classifier in the ensemble needs to improve the overall score; otherwise it should be excluded.

We have a TF/IDF-based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                              strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                              ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
                              stop_words='english')),
    ('mnb', MultinomialNB())
])
 
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
    ("tfidf", tfidf_pipe),
]
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
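
Because the VotingClassifier sits inside the Pipeline under the name "voting", its parameters are addressable with the voting__ prefix. One quick way to list the tunable parameter names:

print(sorted(mixed_pipe.get_params().keys()))  # includes 'voting__weights', which we'll tune below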

Now we’re ready to work out which classifiers are needed. We’ll use GridSearchCV to do this.

from sklearn.model_selection import GridSearchCV
 
 
def combinations_on_off(num_classifiers):
    # enumerate every non-empty on/off combination by counting in binary
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]
 
 
param_grid = dict(
    voting__weights=combinations_on_off(len(classifiers))
)
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
y = train_df[Y_COLUMN].copy()
X = pd.Series(train_df[TEXT_COLUMN])
 
grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid, n_jobs=-1, verbose=10, scoring="neg_log_loss")
 
grid_search.fit(X, y)
 
cv_results = grid_search.cv_results_
 
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)
 
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Let’s run the grid search and see what it comes up with:

{'voting__weights': [0, 0, 1]} -0.60533660756
{'voting__weights': [0, 1, 0]} -0.474562462086
{'voting__weights': [0, 1, 1]} -0.508363479586
{'voting__weights': [1, 0, 0]} -0.697231760084
{'voting__weights': [1, 0, 1]} -0.456599644003
{'voting__weights': [1, 1, 0]} -0.409406571361
{'voting__weights': [1, 1, 1]} -0.439084397238
 
Best score: -0.409
Best parameters set:
	voting__weights: [1, 1, 0]

We can see from the output that we’ve tried every combination of the classifiers. The output suggests that we should only include the ngram_pipe and unigram_log_pipe classifiers; tfidf_pipe should not be included, as our log loss score is worse when it’s added.
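
To see why a weight of 0 excludes a classifier: with soft voting, the ensemble takes a weighted average of each classifier’s predicted probabilities, so a classifier with weight 0 contributes nothing. Here’s a small sketch with made-up probabilities for a single sentence:

import numpy as np

# hypothetical per-class probabilities, in the order
# [ngram_pipe, unigram_log_pipe, tfidf_pipe]
probas = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
])
weights = np.array([1, 1, 0])  # the best combination found above

# weighted average of the probabilities - tfidf_pipe drops out entirely
print(np.average(probas, axis=0, weights=weights))  # [0.65 0.25 0.1 ]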

The code is on GitHub if you want to see it all in one place.

Written by Mark Needham

December 10th, 2017 at 7:55 am

scikit-learn: Building a multi class classification ensemble


For the Kaggle Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble, and found the VotingClassifier, which does exactly that.

We need to predict the probability that a sentence was written by one of three authors, so the VotingClassifier needs to make a ‘soft’ prediction. If we only needed to know the most likely author, we could have it make a ‘hard’ prediction instead.
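
As a rough illustration of the difference, here’s a minimal sketch on a synthetic dataset (not the competition data) with two stand-in classifiers:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# toy 3-class dataset standing in for the three authors
X, y = make_classification(n_samples=100, n_classes=3, n_informative=4, random_state=1)
estimators = [("logreg", LogisticRegression()), ("gnb", GaussianNB())]

# 'soft' averages the classifiers' predicted probabilities
soft = VotingClassifier(estimators, voting="soft").fit(X, y)
print(soft.predict_proba(X[:1]))  # one probability per class

# 'hard' takes a majority vote on the predicted labels instead
hard = VotingClassifier(estimators, voting="hard").fit(X, y)
print(hard.predict(X[:1]))  # just the winning label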

We start with two classifiers which generate different n-gram based features. The code for those is as follows:

from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])

We can combine those classifiers together like this:

classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])

Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane’s tutorial.

import pandas as pd
import numpy as np
 
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
 
def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    rskf = StratifiedKFold(n_splits=5)  # stratified folds keep the author balance in each split
    losses = []
    accuracies = []
    for train_index, test_index in rskf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
 
    print("{kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)

Let’s run the script:

kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849

Looks good.

I’ve actually got several other classifiers as well, but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearch to work that out.

Written by Mark Needham

December 5th, 2017 at 10:19 pm