
Python: Making scikit-learn and pandas play nice


In the last post I wrote, about Nathan's and my attempts at the Kaggle Titanic problem, I mentioned that our next step was to try out scikit-learn, so I thought I should summarise where we've got up to.

We needed to write a classification algorithm to work out whether a person on board the Titanic survived, and luckily scikit-learn has extensive documentation on each of its algorithms.

Unfortunately almost all of those examples use numpy data structures, and we'd loaded the data using pandas and didn't particularly want to switch back!

Luckily it was really easy to get the data into numpy format by calling ‘values’ on the pandas data structure, something we learnt from a reply on Stack Overflow.
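
To make that concrete, a quick check of the types shows what's going on - a minimal sketch, assuming the Kaggle train.csv file:

import pandas as pd

df = pd.read_csv("train.csv")
print(type(df["Survived"]))          # a pandas Series
print(type(df["Survived"].values))   # a numpy ndarray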

For example, if we were to wire up an ExtraTreesClassifier to predict survival based on the 'Fare' and 'Pclass' attributes, we could write the following code:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score
 
train_df = pd.read_csv("train.csv")
 
et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)
 
columns = ["Fare", "Pclass"]
 
labels = train_df["Survived"].values
features = train_df[list(columns)].values
 
et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
 
print("{0} -> ET: {1})".format(columns, et_score))

To start with we read in the CSV file, which looks like this:

$ head -n5 train.csv 
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

Next we create our classifier, which "fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use[s] averaging to improve the predictive accuracy and control over-fitting", i.e. an even more randomised variant of a random forest.
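
If we wanted to sanity check that comparison we could run the same cross validation with a plain RandomForestClassifier - a minimal sketch, assuming the features, labels, and columns defined above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

# Mirror the extra-trees parameters so the scores are comparable
rf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)
rf_score = cross_val_score(rf, features, labels, n_jobs=-1).mean()
print("{0} -> RF: {1}".format(columns, rf_score))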

Back in our original script, on the next line we describe the features we want the classifier to use, and then we convert the labels and features into numpy format so we can pass them to the classifier.

Finally we call the cross_val_score function, which splits the training data into several folds, trains the classifier on all but one fold and measures its accuracy on the remaining one, repeating this for each fold. We then take the mean of those scores.
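
If we wanted to see the individual fold scores rather than just the mean, we could ask for them explicitly - a minimal sketch, assuming the et, features, and labels defined above:

# For a classifier this version of scikit-learn defaults to 3 stratified folds
fold_scores = cross_val_score(et, features, labels, cv=3, n_jobs=-1)
print(fold_scores)         # one accuracy score per fold
print(fold_scores.mean())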

If we run the full script we'll get roughly the following output:

$ python et.py
 
['Fare', 'Pclass'] -> ET: 0.687991021324)

This is actually a worse accuracy than we’d get by saying that females survived and males didn’t.
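
That baseline is easy to check against the training set - a minimal sketch, assuming the train_df loaded above (before the 'Sex' column has been converted to numbers):

# Predict survival for every female and non-survival for every male,
# then see how often that simple rule matches the actual outcome
gender_prediction = (train_df["Sex"] == "female").astype(int)
print((gender_prediction == train_df["Survived"]).mean())  # around 0.78-0.79 on the Kaggle training data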

We can introduce ‘Sex’ into the classifier by adding it to the list of columns:

columns = ["Fare", "Pclass", "Sex"]

If we re-run the code we’ll get the following error:

$ python et.py
 
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (514, 0))
...
Traceback (most recent call last):
  File "et.py", line 14, in <module>
    et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
    self.retrieve()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 450, in retrieve
    raise exception_type(report)
sklearn.externals.joblib.my_exceptions.JoblibValueError/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/my_exceptions.py:26: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  self.message,
: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...
ValueError: could not convert string to float: male
___________________________________________________________________________

This is a slightly verbose way of telling us that we can't pass non-numeric features to the classifier – in this case 'Sex' has the values 'female' and 'male'. We'll need to replace those values with numeric equivalents, which we can do like this:

train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)

Now if we re-run the classifier we’ll get a slightly more accurate prediction:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)

The next step is to use the classifier against the test data set, so let's load the data and run the prediction:

test_df = pd.read_csv("test.csv")
 
et.fit(features, labels)
et.predict(test_df[columns].values)

Now if we run that:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
Traceback (most recent call last):
  File "et.py", line 22, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 91, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: male

This is the same problem we had earlier! We need to replace the 'male' and 'female' values in the test set too, so we'll pull out a function to do that now.

def replace_non_numeric(df):
	df["Sex"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)
	return df

Now we’ll call that function with our training and test data frames:

train_df = replace_non_numeric(pd.read_csv("train.csv"))
...
test_df = replace_non_numeric(pd.read_csv("test.csv"))

If we run the program again:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
Traceback (most recent call last):
  File "et.py", line 26, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 93, in array2d
    _assert_all_finite(X_2d)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

There are missing values in the test set so we’ll replace those with average values from our training set using an Imputer:

from sklearn.preprocessing import Imputer
 
# Learn each column's mean from the training features so we can fill gaps in the test set
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(features)
 
test_df = replace_non_numeric(pd.read_csv("test.csv"))
 
et.fit(features, labels)
print et.predict(imp.transform(test_df[columns].values))

If we run that it completes successfully:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
[0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1
 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
 0 1 1 1 1 0 0 1 0 0 0]

The final step is to add these values to our test data frame and then write that to a file so we can submit it to Kaggle.

The type of those values is ‘numpy.ndarray’ which we can convert to a pandas Series quite easily:

predictions = et.predict(imp.transform(test_df[columns].values))
test_df["Survived"] = pd.Series(predictions)

We can then write the ‘PassengerId’ and ‘Survived’ columns to a file:

test_df.to_csv("foo.csv", cols=['PassengerId', 'Survived'], index=False)

The output file looks like this:

$ head -n5 foo.csv 
PassengerId,Survived
892,0
893,1
894,0

The code we've written is on GitHub in case it's useful to anyone.


Written by Mark Needham

November 9th, 2013 at 1:58 pm

Posted in Python


  • Christian Jauvin

    Nice article! I just want to note that to map the non-numeric features to numeric in a more general way, you can also use sklearn.preprocessing.LabelEncoder, which is very handy.

  • @christianjauvin glad you liked the post & thank you for the tip! I didn't know about the LabelEncoder. Used it yesterday, very nice 🙂

  • Christian Jauvin

    I also fiddled around with the Kaggle Titanic problem for a while, and rapidly figured that it seems very hard to obtain a model doing better than ~0.80, no matter what modeling technique or feature processing I used. I also got the impression that the bunch of 0.99+ results on the leaderboard were somewhat suspicious, but then that’s just a hunch. What’s your take on that?

  • @christianjauvin yeah, we've got stuck around 0.80 as well so I'm not sure if any scores above that are gamed in some way or not. I get the feeling that without overfitting it'd be hard to get much higher than 0.80.

    I also think this dataset is probably a bit small for some of the algorithms to show their value, so we'll probably move on to another problem in a couple of weeks.

    Have you done any of the other Kaggle problems?

  • Christian Jauvin

    I also did the Stack Overflow "predict closed questions" one, which was quite interesting, and actually gave the impression that the (pretty substantial) dataset contained enough information to build a model that made sense (although it's hard to imagine what kind of features would allow one to distinguish between the subtle nuances of the closing types).

  • Dan Ofer

    This is probably a silly question, but how/where do you remove the training sets labels from the data(features) you input for training the predictor?

    labels = train_df["Survived"].values
    features = train_df[list(columns)].values

    So, features would have the training labels (0/1 = "Survived") in it as well. (And I missed where it's parsed out, I only saw the replacement of missing values and categorical values) –

    et.fit(features, labels)

    Thanks!

    (I'm trying to figure out how to gracefully remove sample names and to add 0/1 labels for training, on data I'm generating myself, putting into pandas, labelling there, then storing as a CSV that I'll read again into a pandas dataframe before training on it with sklearn) 🙂

  • Clement

    Very nice article, with a lot of explanation and definitely worthwhile for a beginner

  • Pingback: Python: scikit-learn – Training a classifier with non numeric features at Mark Needham