Mark Needham

Thoughts on Software Development

Archive for the ‘pandas’ tag

Pandas: ValueError: The truth value of a Series is ambiguous.


I’ve been playing around with Kaggle in my spare time over the last few weeks and came across an unexpected behaviour when trying to add a column to a dataframe.

First let’s import pandas into our program scope:

Prerequisites

import pandas as pd

Now we’ll create a data frame to play with for the duration of this post:

>>> df = pd.DataFrame({"a": [1,2,3,4,5], "b": [2,3,4,5,6]})
>>> df
   a  b
0  1  2
1  2  3
2  3  4
3  4  5
4  5  6

Let’s say we want to create a new column which returns True if either of the numbers is odd, and False otherwise.

We’d expect to see a column full of True values so let’s get started.

>>> divmod(df["a"], 2)[1] > 0
0     True
1    False
2     True
3    False
4     True
Name: a, dtype: bool
 
>>> divmod(df["b"], 2)[1] > 0
0    False
1     True
2    False
3     True
4    False
Name: b, dtype: bool

So far so good. Now let’s combine those two calculations together and create a new column in our data frame:

>>> df["anyOdd"] = (divmod(df["a"], 2)[1] > 0) or (divmod(df["b"], 2)[1] > 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/kaggle/house-prices/a/lib/python3.6/site-packages/pandas/core/generic.py", line 953, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Hmmm, that was unexpected! Unfortunately Python’s or and and operators don’t work element-wise against pandas Series, so instead we need to use the bitwise or (|) and and (&) operators.

Let’s update our example:

>>> df["anyOdd"] = (divmod(df["a"], 2)[1] > 0) | (divmod(df["b"], 2)[1] > 0)
>>> df
   a  b  anyOdd
0  1  2    True
1  2  3    True
2  3  4    True
3  4  5    True
4  5  6    True

Much better. And what if we wanted to check whether both values are odd?

>>> df["bothOdd"] = (divmod(df["a"], 2)[1] > 0) & (divmod(df["b"], 2)[1] > 0)
>>> df
   a  b  anyOdd  bothOdd
0  1  2    True    False
1  2  3    True    False
2  3  4    True    False
3  4  5    True    False
4  5  6    True    False

Works exactly as expected, hoorah!
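As an aside (my own addition, not part of the original workflow), the modulo operator gives the same parity checks a little more directly than divmod:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6]})

# df["a"] % 2 == 1 produces a boolean Series; | and & combine them element-wise
df["anyOdd"] = (df["a"] % 2 == 1) | (df["b"] % 2 == 1)
df["bothOdd"] = (df["a"] % 2 == 1) & (df["b"] % 2 == 1)

print(df)
```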

Written by Mark Needham

July 26th, 2017 at 9:41 pm

Posted in Data Science


Pandas/scikit-learn: get_dummies test/train sets – ValueError: shapes not aligned


I’ve been using pandas’ get_dummies function to generate dummy columns for categorical variables to use with scikit-learn, but noticed that it sometimes doesn’t work as I expect.

Prerequisites

import pandas as pd
import numpy as np
from sklearn import linear_model

Let’s say we have the following training and test sets:

Training set

train = pd.DataFrame({"letter":["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
y_train = train["value"]

Test set

test = pd.DataFrame({"letter":["D", "D", "B", "E"], "value": [4, 5, 7, 19]})
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
y_test = test["value"]

Now say we want to train a linear model on our training set and then use it to predict the values in our test set:

Train the model

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model

model.score(X_test, y_test)
ValueError: shapes (4,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)

Hmmm, that didn’t go to plan. If we print X_train and X_test it might help shed some light:

Checking the train/test datasets

print(X_train)
   letter_A  letter_B  letter_C  letter_D
0         1         0         0         0
1         0         1         0         0
2         0         0         1         0
3         0         0         0         1
print(X_test)
   letter_B  letter_D  letter_E
0         0         1         0
1         0         1         0
2         1         0         0
3         0         0         1

They do indeed have different shapes and some different column names because the test set contained some values that weren’t present in the training set.

We can fix this by making the ‘letter’ field categorical before we run the get_dummies method over the dataframe. At the moment the field is of type ‘object’:

Column types

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
letter    4 non-null object
value     4 non-null int64
dtypes: int64(1), object(1)
memory usage: 144.0+ bytes

We can do that by converting the ‘letter’ field to the type ‘category’ and setting the list of allowed values to be the unique set of values across the train and test sets.

All allowed values

all_data = pd.concat((train,test))
for column in all_data.select_dtypes(include=[np.object]).columns:
    print(column, all_data[column].unique())
letter ['A' 'B' 'C' 'D' 'E']

Now let’s update the type of our ‘letter’ field in the train and test dataframes.

Type: ‘category’

all_data = pd.concat((train,test))
 
for column in all_data.select_dtypes(include=[np.object]).columns:
    train[column] = train[column].astype('category', categories = all_data[column].unique())
    test[column] = test[column].astype('category', categories = all_data[column].unique())

And now if we call get_dummies on either dataframe we’ll get the same set of columns:

get_dummies: Take 2

X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
print(X_train)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         1         0         0         0         0
1         0         1         0         0         0
2         0         0         1         0         0
3         0         0         0         1         0
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
print(X_test)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         0         0         0         1         0
1         0         0         0         1         0
2         0         1         0         0         0
3         0         0         0         0         1

Great! Now we should be able to train our model and use it against the test set:

Train the model: Take 2

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model: Take 2

model.score(X_test, y_test)
-1.0604490500863557

And we’re done!
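An alternative sketch (my own, rather than the approach taken in the post) is to skip the category conversion entirely and align the dummy columns after the fact with reindex, filling missing dummies with 0:

```python
import pandas as pd

train = pd.DataFrame({"letter": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
test = pd.DataFrame({"letter": ["D", "D", "B", "E"], "value": [4, 5, 7, 19]})

X_train = pd.get_dummies(train.drop(["value"], axis=1))
X_test = pd.get_dummies(test.drop(["value"], axis=1))

# Take the union of the columns from both sets and reindex each frame
# against it; dummy columns absent from one frame are filled with 0
all_columns = X_train.columns.union(X_test.columns)
X_train = X_train.reindex(columns=all_columns, fill_value=0)
X_test = X_test.reindex(columns=all_columns, fill_value=0)

print(list(X_train.columns) == list(X_test.columns))
```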

Written by Mark Needham

July 5th, 2017 at 3:42 pm

Posted in Python


Pandas: Find rows where column/field is null


In my continued playing around with the Kaggle house prices dataset I wanted to find any columns/fields that contain null values.

If we want to get a count of the number of null fields by column we can use the following code, adapted from Poonam Ligade’s kernel:

Prerequisites

import pandas as pd

Count the null columns

train = pd.read_csv("train.csv")
null_columns=train.columns[train.isnull().any()]
train[null_columns].isnull().sum()
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

So there are lots of different columns containing null values. What if we want to find the solitary row which has ‘Electrical’ as null?

Single column is null

print(train[train["Electrical"].isnull()][null_columns])
      LotFrontage Alley MasVnrType  MasVnrArea BsmtQual BsmtCond BsmtExposure  \
1379         73.0   NaN       None         0.0       Gd       TA           No   
 
     BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType  GarageYrBlt  \
1379          Unf          Unf        NaN         NaN    BuiltIn       2007.0   
 
     GarageFinish GarageQual GarageCond PoolQC Fence MiscFeature  
1379          Fin         TA         TA    NaN   NaN         NaN

And what if we want to return every row that contains at least one null value? That’s not too difficult – it’s just a combination of the code in the previous two sections:

All null columns

print(train[train.isnull().any(axis=1)][null_columns].head())
   LotFrontage Alley MasVnrType  MasVnrArea BsmtQual BsmtCond BsmtExposure  \
0         65.0   NaN    BrkFace       196.0       Gd       TA           No   
1         80.0   NaN       None         0.0       Gd       TA           Gd   
2         68.0   NaN    BrkFace       162.0       Gd       TA           Mn   
3         60.0   NaN       None         0.0       TA       Gd           No   
4         84.0   NaN    BrkFace       350.0       Gd       TA           Av   
 
  BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType  GarageYrBlt  \
0          GLQ          Unf      SBrkr         NaN     Attchd       2003.0   
1          ALQ          Unf      SBrkr          TA     Attchd       1976.0   
2          GLQ          Unf      SBrkr          TA     Attchd       2001.0   
3          ALQ          Unf      SBrkr          Gd     Detchd       1998.0   
4          GLQ          Unf      SBrkr          TA     Attchd       2000.0   
 
  GarageFinish GarageQual GarageCond PoolQC Fence MiscFeature  
0          RFn         TA         TA    NaN   NaN         NaN  
1          RFn         TA         TA    NaN   NaN         NaN  
2          RFn         TA         TA    NaN   NaN         NaN  
3          Unf         TA         TA    NaN   NaN         NaN  
4          RFn         TA         TA    NaN   NaN         NaN

Hope that helps future Mark!
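Since train.csv isn’t included here, the same pattern can be demonstrated on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
    "c": [10, 20, 30],
})

# Columns containing at least one null value
null_columns = df.columns[df.isnull().any()]
print(df[null_columns].isnull().sum())

# Rows containing at least one null value
print(df[df.isnull().any(axis=1)])
```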

Written by Mark Needham

July 5th, 2017 at 2:31 pm

Posted in Python


Python: Find the highest value in a group


In my continued playing around with a How I Met Your Mother data set I needed to find the last episode of each season so that I could use it in a chart I wanted to plot.

I had this CSV file containing each of the episodes:

$ head -n 10 data/import/episodes.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200

I started out by parsing the CSV file into a dictionary of (seasons -> episode ids):

import csv
from collections import defaultdict
 
seasons = defaultdict(list)
with open("data/import/episodes.csv", "r") as episodesfile:
    reader = csv.reader(episodesfile, delimiter = ",")
    reader.next()
    for row in reader:
        seasons[int(row[3])].append(int(row[0]))
 
print seasons

which outputs the following:

$ python blog.py
defaultdict(<type 'list'>, {
  1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], 
  2: [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44], 
  3: [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64], 
  4: [65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88], 
  5: [89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112], 
  6: [113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136], 
  7: [137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160], 
  8: [161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184], 
  9: [185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208]})

It’s reasonably easy to transform that into a dictionary of (season -> max episode id) with the following couple of lines:

for season, episode_ids in seasons.iteritems():
    seasons[season] = max(episode_ids)
 
>>> print seasons
defaultdict(<type 'list'>, {1: 22, 2: 44, 3: 64, 4: 88, 5: 112, 6: 136, 7: 160, 8: 184, 9: 208})
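(The code above is Python 2; in Python 3 the same reduction is a one-line dict comprehension. A self-contained sketch, with made-up episode ids rather than the full CSV:)

```python
from collections import defaultdict

# Build a small (season -> episode ids) mapping like the one above
seasons = defaultdict(list)
for episode_id, season in [(1, 1), (2, 1), (22, 1), (23, 2), (44, 2)]:
    seasons[season].append(episode_id)

# Collapse each season's episode list down to its highest episode id
last_episodes = {season: max(ids) for season, ids in seasons.items()}
print(last_episodes)
```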

This works fine but it felt very much like a dplyr problem to me so I wanted to see whether I could write something cleaner using pandas.

I started out by capturing the seasons and episode ids in separate lists and then building up a DataFrame:

import pandas as pd
from pandas import DataFrame
 
seasons, episode_ids = [], []
with open("data/import/episodes.csv", "r") as episodesfile:
    reader = csv.reader(episodesfile, delimiter = ",")
    reader.next()
    for row in reader:
        seasons.append(int(row[3]))
        episode_ids.append(int(row[0]))
 
df = DataFrame.from_items([('Season', seasons), ('EpisodeId', episode_ids)])
 
>>> print df.groupby("Season").max()["EpisodeId"]
Season
1          22
2          44
3          64
4          88
5         112
6         136
7         160
8         184
9         208

Or we can simplify that and read the CSV file directly into a DataFrame:

df = pd.read_csv('data/import/episodes.csv', index_col=False, header=0)
 
>>> print df.groupby("Season").max()["NumberOverall"]
Season
1          22
2          44
3          64
4          88
5         112
6         136
7         160
8         184
9         208

Pretty neat. I need to get more into pandas.
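For completeness, a self-contained sketch of the same groupby, with a small inline CSV standing in for episodes.csv (the file isn’t included here, so the rows are made up):

```python
import io
import pandas as pd

csv_data = io.StringIO("""NumberOverall,NumberInSeason,Episode,Season
1,1,/wiki/Episode_1,1
22,22,/wiki/Episode_22,1
23,1,/wiki/Episode_23,2
44,22,/wiki/Episode_44,2
""")

df = pd.read_csv(csv_data)

# Highest overall episode number per season
last_per_season = df.groupby("Season")["NumberOverall"].max()
print(last_per_season)
```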

Written by Mark Needham

January 25th, 2015 at 12:47 pm

Posted in Python


pandas: Adding a column to a DataFrame (based on another DataFrame)


Nathan and I have been working on the Titanic Kaggle problem using the pandas data analysis library and one thing we wanted to do was add a column to a DataFrame indicating if someone survived.

We had the following (simplified) DataFrame containing some information about customers on board the Titanic:

def addrow(df, row):
    return df.append(pd.DataFrame(row), ignore_index=True)
 
customers = pd.DataFrame(columns=['PassengerId','Pclass','Name','Sex','Fare'])
customers = addrow(customers, [dict(PassengerId=892, Pclass=3, Name="Kelly, Mr. James", Sex="male", Fare=7.8292)])
customers = addrow(customers, [dict(PassengerId=893, Pclass=3, Name="Wilkes, Mrs. James (Ellen Needs)", Sex="female", Fare=7)])
 
>>> customers
 
     Fare                              Name  PassengerId  Pclass     Sex
0  7.8292                  Kelly, Mr. James          892       3    male
1  7.0000  Wilkes, Mrs. James (Ellen Needs)          893       3  female

We wanted to add a ‘Survived’ column to that by doing a lookup in the survival_table below to work out the appropriate value:

survival_table = pd.DataFrame(columns=['Sex', 'Pclass', 'PriceDist', 'Survived'])
 
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="female", PriceDist = 0, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="male", PriceDist = 0, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="female", PriceDist = 1, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="male", PriceDist = 1, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="female", PriceDist = 2, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="male", PriceDist = 2, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="female", PriceDist = 3, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=1, Sex="male", PriceDist = 3, Survived = 0)])
 
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="female", PriceDist = 0, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="male", PriceDist = 0, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="female", PriceDist = 1, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="male", PriceDist = 1, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="female", PriceDist = 2, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="male", PriceDist = 2, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="female", PriceDist = 3, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=2, Sex="male", PriceDist = 3, Survived = 0)])
 
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="female", PriceDist = 0, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="male", PriceDist = 0, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="female", PriceDist = 1, Survived = 1)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="male", PriceDist = 1, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="female", PriceDist = 2, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="male", PriceDist = 2, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="female", PriceDist = 3, Survived = 0)])
survival_table = addrow(survival_table, [dict(Pclass=3, Sex="male", PriceDist = 3, Survived = 0)])
 
>>> survival_table
 
    Pclass  PriceDist     Sex  Survived
0        1          0  female         0
1        1          0    male         0
2        1          1  female         0
3        1          1    male         0
4        1          2  female         1
5        1          2    male         0
6        1          3  female         1
7        1          3    male         0
8        2          0  female         0
9        2          0    male         0
10       2          1  female         1
11       2          1    male         0
12       2          2  female         1
13       2          2    male         0
14       2          3  female         1
15       2          3    male         0
16       3          0  female         1
17       3          0    male         0
18       3          1  female         1
19       3          1    male         0
20       3          2  female         0
21       3          2    male         0
22       3          3  female         0
23       3          3    male         0

To do this we can use the DataFrame#apply function which allows us to map over each row.

Our initial attempt read like this:

def select_bucket(fare):
    if (fare >= 0 and fare < 10):
        return 0
    elif (fare >= 10 and fare < 20):
        return 1
    elif (fare >= 20 and fare < 30):
        return 2
    else:
        return 3
 
def calculate_survival(survival_table, customer):
    survival_row = survival_table[(survival_table["Sex"] == customer["Sex"]) & (survival_table["Pclass"] == customer["Pclass"]) & (survival_table["PriceDist"] == select_bucket(customer["Fare"]))]
    return survival_row["Survived"]
 
>>> customers["Survived"] = customers.apply(lambda customer: calculate_survival(survival_table, customer), axis=1)

When we ran that we got the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2119, in __setitem__
    self._set_item(key, value)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2166, in _set_item
    NDFrame._set_item(self, key, value)
  File "/Library/Python/2.7/site-packages/pandas/core/generic.py", line 679, in _set_item
    self._data.set(key, value)
  File "/Library/Python/2.7/site-packages/pandas/core/internals.py", line 1781, in set
    self.insert(len(self.items), item, value)
  File "/Library/Python/2.7/site-packages/pandas/core/internals.py", line 1795, in insert
    self._add_new_block(item, value, loc=loc)
  File "/Library/Python/2.7/site-packages/pandas/core/internals.py", line 1911, in _add_new_block
    self.items, fastpath=True)
  File "/Library/Python/2.7/site-packages/pandas/core/internals.py", line 966, in make_block
    return klass(values, items, ref_items, ndim=values.ndim, fastpath=fastpath, placement=placement)
  File "/Library/Python/2.7/site-packages/pandas/core/internals.py", line 44, in __init__
    % (len(items), len(values)))
ValueError: Wrong number of items passed 1, indices imply 2

After much googling and confusion as to why we were getting this error, I tried printing out the result of calling apply rather than immediately assigning it, and realised that the output wasn’t what I expected:

>>> customers.apply(lambda customer: calculate_survival(survival_table, customer), axis=1)
   16  17
0 NaN   0
1   1 NaN

I’d expected to get one column showing the survived values but instead we’ve got a 2×2 DataFrame. Adding some logging to the calculate_survival function revealed why:

def calculate_survival(survival_table, customer):
    survival_row = survival_table[(survival_table["Sex"] == customer["Sex"]) & (survival_table["Pclass"] == customer["Pclass"]) & (survival_table["PriceDist"] == select_bucket(customer["Fare"]))]
    print(type(survival_row["Survived"]))
    return survival_row["Survived"]
 
>>> customers.apply(lambda customer: calculate_survival(survival_table, customer), axis=1)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
   16  17
0 NaN   0
1   1 NaN

Our function is actually returning a Series object rather than a single value of 0 or 1, which I found surprising. We can use the iat indexer to retrieve a scalar value from a Series:

def calculate_survival(survival_table, customer):
    survival_row = survival_table[(survival_table["Sex"] == customer["Sex"]) & (survival_table["Pclass"] == customer["Pclass"]) & (survival_table["PriceDist"] == select_bucket(customer["Fare"]))]
    return int(survival_row["Survived"].iat[0])
 
>>> customers.apply(lambda customer: calculate_survival(survival_table, customer), axis=1)
0    0
1    1
dtype: int64

Now if we assign the output of that function like before it works as expected:

>>> customers["Survived"] = customers.apply(lambda customer: calculate_survival(survival_table, customer), axis=1)
>>> customers
     Fare                              Name  PassengerId  Pclass     Sex  Survived
0  7.8292                  Kelly, Mr. James          892       3    male         0
1  7.0000  Wilkes, Mrs. James (Ellen Needs)          893       3  female         1
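A vectorised alternative (my own sketch, not what we did at the time) is to bucket the fare up front and then merge against the survival table, which replaces the row-by-row apply with a single join. The tiny survival table below covers only the rows this example needs:

```python
import pandas as pd

customers = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 3],
    "Sex": ["male", "female"],
    "Fare": [7.8292, 7.0],
})

# Cut-down survival table: third class, cheapest fare band only
survival_table = pd.DataFrame({
    "Sex": ["male", "female"],
    "Pclass": [3, 3],
    "PriceDist": [0, 0],
    "Survived": [0, 1],
})

# Bucket fares into the 0-10, 10-20, 20-30 and 30+ bands (same as select_bucket)
customers["PriceDist"] = pd.cut(
    customers["Fare"],
    bins=[0, 10, 20, 30, float("inf")],
    labels=[0, 1, 2, 3],
    right=False,
).astype(int)

# One merge on the three lookup keys replaces the apply
customers = customers.merge(survival_table, on=["Sex", "Pclass", "PriceDist"])
print(customers[["PassengerId", "Survived"]])
```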

Written by Mark Needham

October 30th, 2013 at 6:12 am

Posted in Python
