Mark Needham

Thoughts on Software Development

Archive for the ‘Data Science’ Category

scikit-learn: Random forests – Feature Importance

without comments

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally, it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:

Prerequisites

import numpy as np
import pandas as pd
 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
 
# We'll use this library to make the display pretty
from tabulate import tabulate

Load Data

train = pd.read_csv('train.csv')
 
# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

Split train/test sets

y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Train model

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)

Feature Importance

headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.
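
One way to act on that (a sketch of mine rather than code from the post, and the 95% cut-off is an arbitrary choice) would be to keep only the features that account for most of the cumulative importance:

importances = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
 
# keep the features that together account for ~95% of the total importance
to_keep = importances[importances.cumsum() <= 0.95].index
 
X_train_reduced = X_train[to_keep]
X_test_reduced = X_test[to_keep]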

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them) – there’s a rough sketch of this after the list
  • Remove the most predictive features and build a model that only uses the other features
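
As a rough sketch of the one hot encoding idea (this isn’t code from the post – encoding every non numeric column and re-aligning with reindex are my own assumptions), pandas’ get_dummies can do most of the work:

categorical = train.select_dtypes(exclude=[np.number])
 
# one column per category value; reindex keeps the rows lined up with 'data'
dummies = pd.get_dummies(categorical).reindex(data.index)
 
full = pd.concat([data, dummies], axis=1)
y = full.SalePrice
X = full.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)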

Written by Mark Needham

June 16th, 2017 at 5:55 am

Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values

without comments

I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

import pandas as pd
train = pd.read_csv('train.csv')
null_columns=train.columns[train.isnull().any()]
 
>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes ‘Linear feet of street connected to property’. There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]
 
>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn’s linear regression model, which only works with numeric values, so we need to convert our categorical variables into numeric equivalents. We can use pandas’ get_dummies function for this.

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)
 
>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig and then we need to add these new columns onto the original DataFrame. We can use pandas concat function to do this.

import numpy as np
 
data = pd.concat([
        sub_train[cols],
        pd.get_dummies(sub_train[cols].LotShape),
        pd.get_dummies(sub_train[cols].LotConfig)
    ], axis=1).select_dtypes(include=[np.number])
 
>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split the data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
 
X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage
 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
 
lr = linear_model.LinearRegression()
 
model = lr.fit(X_train, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 
 -0.84137438493

Hmm that didn’t work too well – an R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score
 
>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))
0.0

whereas if we had all of the values correct we’d get a score of 1:

>>> print(r2_score(y_test, y_test))
1.0

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of LotArea so perhaps that’s the best we can do here.
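
For completeness, here’s a sketch of that kind of fallback (my code, not Poonam’s – scaling sqrt(LotArea) by the ratio of medians is an assumption):

import numpy as np
 
known = train[train.LotFrontage.notnull()]
 
# scale sqrt(LotArea) so its median lines up with the known LotFrontage values
scale = known.LotFrontage.median() / np.sqrt(known.LotArea).median()
 
missing = train.LotFrontage.isnull()
train.loc[missing, "LotFrontage"] = np.sqrt(train.loc[missing, "LotArea"]) * scale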

Written by Mark Needham

June 4th, 2017 at 9:22 am

Posted in Data Science, Python


Exploring (potential) data entry errors in the Land Registry data set

without comments

I’ve previously written a couple of blog posts describing the mechanics of analysing the Land Registry data set and I thought it was about time I described some of the queries I’ve been running and the discoveries I’ve made.

To recap, the land registry provides a 3GB, 20 million line CSV file containing all the property sales in the UK since 1995.

We’ll be loading and querying the data in R using the data.table package:

> library(data.table)
> dt = fread("pp-complete.csv", header = FALSE)
> dt[1:5]
                                       V1     V2               V3       V4 V5
1: {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T
2: {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D
3: {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F
4: {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D
5: {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D
   V6 V7  V8 V9           V10        V11         V12
1:  N  F 106     READING ROAD   NORTHOLT    NORTHOLT
2:  N  F  58     ADAMS MEADOW  ILMINSTER   ILMINSTER
3:  N  L  58    WHELLOCK ROAD                 LONDON
4:  N  F  17         WESTGATE    MORPETH     MORPETH
5:  N  F   4    MASON GARDENS WEST WINCH KING'S LYNN
                            V13            V14 V15
1:                       EALING GREATER LONDON   A
2:               SOUTH SOMERSET       SOMERSET   A
3:                       EALING GREATER LONDON   A
4:               CASTLE MORPETH NORTHUMBERLAND   A
5: KING'S LYNN AND WEST NORFOLK        NORFOLK   A

For our first query we’re going to find the most expensive property sold for each year from 1995 – 2015.

The first thing we’ll need to do is make column ‘V2’ (price) numeric and convert column ‘V3’ (sale date) to date format so we can do date arithmetic on it:

> dt = dt[, V2:= as.numeric(V2)]
> dt = dt[, V3:= as.Date(V3)]

Now let’s write the query:

> dt[, .SD[which.max(V2)], by=year(V3)][order(year)][, .(year,V9,V8,V10,V12,V14,V4,V2)]
    year             V9               V8                   V10            V12            V14       V4       V2
 1: 1995                  THORNETS HOUSE       BUILDER GARDENS    LEATHERHEAD         SURREY KT22 7DE  5610000
 2: 1996                              24             MAIN ROAD MELTON MOWBRAY LEICESTERSHIRE LE14 3SP 17250000
 3: 1997                              42        HYDE PARK GATE         LONDON GREATER LONDON  SW7 5DU  7500000
 4: 1998                              19     NEW BRIDGE STREET         LONDON GREATER LONDON EC4V 6DB 11250000
 5: 1999                  TERMINAL HOUSE LOWER BELGRAVE STREET         LONDON GREATER LONDON SW1W 0NH 32477000
 6: 2000         UNIT 3     JUNIPER PARK            FENTON WAY       BASILDON          ESSEX SS15 6RZ 12600000
 7: 2001                              19        BABMAES STREET         LONDON GREATER LONDON SW1Y 6HD 24750000
 8: 2002                              72        VINCENT SQUARE         LONDON GREATER LONDON SW1P 2PA  8300000
 9: 2003                              81          ADDISON ROAD         LONDON GREATER LONDON  W14 8ED  9250000
10: 2004                              29   HOLLAND VILLAS ROAD         LONDON GREATER LONDON  W14 8DH  7950000
11: 2005 APARTMENT 1102              199         KNIGHTSBRIDGE         LONDON GREATER LONDON  SW7 1RH 15193950
12: 2006                               1     THORNWOOD GARDENS         LONDON GREATER LONDON   W8 7EA 12400000
13: 2007                              36         CADOGAN PLACE         LONDON GREATER LONDON SW1X 9RX 17000000
14: 2008             50                         CHESTER SQUARE         LONDON GREATER LONDON SW1W 9EA 19750000
15: 2009                       CASA SARA     HEATHERSIDE DRIVE VIRGINIA WATER         SURREY GU25 4JU 13800000
16: 2010                              10   HOLLAND VILLAS ROAD         LONDON GREATER LONDON  W14 8BP 16200000
17: 2011                WHITESTONE HOUSE       WHITESTONE LANE         LONDON GREATER LONDON  NW3 1EA 19250000
18: 2012                              20           THE BOLTONS         LONDON GREATER LONDON SW10 9SU 54959000
19: 2013   APARTMENT 7F              171         KNIGHTSBRIDGE         LONDON GREATER LONDON  SW7 1DW 39000000
20: 2014                  APARTMENT 6, 5          PRINCES GATE         LONDON GREATER LONDON  SW7 1QJ 50000000
21: 2015                              37       BURNSALL STREET         LONDON GREATER LONDON  SW3 3SR 27750000
    year             V9               V8                   V10            V12            V14       V4       V2

The results mostly make sense – the majority of the highest priced properties are around Hyde Park and often somewhere near Knightsbridge which is one of the most expensive places in the country.

There are some odd ones though, e.g. in 1996 the top priced property is in Leicestershire and sold for just over £17m. I looked it up on the Land Registry site to quickly see what it was subsequently sold for:

[Screenshot: the property’s subsequent sale prices on the Land Registry site]

Based on the subsequent prices I think we can safely assume that the initial price is incorrect and should actually have been £17,250.

We can also say the same about our 2000 winner in Juniper Park in Basildon which sold for £12.6 million. If we look at the next sale price after that it’s £172,500 in 2003 so most likely it was sold for £126,000 – only 100 times out!

I wanted to follow up on this observation and see if I could find other anomalies by comparing adjacent sale prices of properties.

First we’ll create a ‘fullAddress’ field which we’ll use as an identifier for each property. It’s not completely unique but it’s not far away:

> dt = dt[, fullAddress := paste(dt$V8, dt$V9, dt$V10, dt$V11, dt$V12, dt$V13, dt$V4, sep=", ")]
> setkey(dt, fullAddress)
 
> dt[, .(fullAddress, V2)][1:5]
                                                                                  fullAddress     V2
1:                ''NUTSHELL COTTAGE, 72, , KIRKLAND, KENDAL, KENDAL, SOUTH LAKELAND, LA9 5AP  89000
2:                         'FARRIERS', , FARRIERS CLOSE, WOODLEY, READING, WOKINGHAM, RG5 3DD 790000
3: 'HOLMCROFT', 40, , BRIDGNORTH ROAD, WOMBOURNE, WOLVERHAMPTON, SOUTH STAFFORDSHIRE, WV5 0AA 305000
4:                            (AKERS), , CHAPEL STREET, EASINGWOLD, YORK, HAMBLETON, YO61 3AE 118000
5:                                       (ANNINGS), , , FARWAY, COLYTON, EAST DEVON, EX24 6DF 150000

Next we’ll add a column to the data table which contains the previous sale price and another column which calculates the difference between the two prices:

> dt[, lag.V2:=c(NA, V2[-.N]), by = fullAddress]
> dt[, V2.diff := V2 - lag.V2]
 
> dt[!is.na(lag.V2),][1:10][, .(fullAddress, lag.V2, V2, V2.diff)]
                                                                                   fullAddress lag.V2     V2 V2.diff
 1:                                       (ANNINGS), , , FARWAY, COLYTON, EAST DEVON, EX24 6DF 150000 385000  235000
 2:                  (BARBER), , PEACOCK CORNER, MOULTON ST MARY, NORWICH, BROADLAND, NR13 3NF 115500 136000   20500
 3:                      (BELL), , BAWBURGH ROAD, MARLINGFORD, NORWICH, SOUTH NORFOLK, NR9 5AG 128000 300000  172000
 4:                      (BEVERLEY), , DAWNS LANE, ASLOCKTON, NOTTINGHAM, RUSHCLIFFE, NG13 9AD  95000 210000  115000
 5: (BLACKMORE), , GREAT STREET, NORTON SUB HAMDON, STOKE-SUB-HAMDON, SOUTH SOMERSET, TA14 6SJ  53000 118000   65000
 6:                        (BOWDERY), , HIGH STREET, MARKINGTON, HARROGATE, HARROGATE, HG3 3NR 140000 198000   58000
 7:                  (BULLOCK), , MOORLAND ROAD, INDIAN QUEENS, ST. COLUMB, RESTORMEL, TR9 6HN  50000  50000       0
 8:                                   (CAWTHRAY), , CAWOOD ROAD, WISTOW, SELBY, SELBY, YO8 3XB 130000 120000  -10000
 9:                                   (CAWTHRAY), , CAWOOD ROAD, WISTOW, SELBY, SELBY, YO8 3XB 120000 155000   35000
10:                                 (COATES), , , BARDSEA, ULVERSTON, SOUTH LAKELAND, LA12 9QT  26000  36000   10000

Let’s find the properties which have the biggest £ value difference in adjacent sales:

> dt[!is.na(V2.diff)][order(-abs(V2.diff))][, .(fullAddress, lag.V2, V2, V2.diff)][1:20]
                                                                fullAddress   lag.V2       V2   V2.diff
 1:     , 50, CHESTER SQUARE, LONDON, LONDON, CITY OF WESTMINSTER, SW1W 9EA  1135000 19750000  18615000
 2:         44, , LANSDOWNE ROAD, , LONDON, KENSINGTON AND CHELSEA, W11 2LU  3675000 22000000  18325000
 3:      24, , MAIN ROAD, ASFORDBY VALLEY, MELTON MOWBRAY, MELTON, LE14 3SP 17250000    32500 -17217500
 4:           11, , ORMONDE GATE, , LONDON, KENSINGTON AND CHELSEA, SW3 4EU   250000 16000000  15750000
 5:     2, , HOLLAND VILLAS ROAD, , LONDON, KENSINGTON AND CHELSEA, W14 8BP  8675000 24000000  15325000
 6:          1, , PEMBRIDGE PLACE, , LONDON, KENSINGTON AND CHELSEA, W2 4XB  2340250 17000000  14659750
 7:     10, , CHESTER SQUARE, LONDON, LONDON, CITY OF WESTMINSTER, SW1W 9HH   680000 15000000  14320000
 8:        12, , SOUTH EATON PLACE, , LONDON, CITY OF WESTMINSTER, SW1W 9JA  4250000 18550000  14300000
 9:     32, FLAT 1, HOLLAND PARK, , LONDON, KENSINGTON AND CHELSEA, W11 3TA   420000 14100000  13680000
10:       42, , EGERTON CRESCENT, , LONDON, KENSINGTON AND CHELSEA, SW3 2EB  1125000 14650000  13525000
11:   36, , CADOGAN PLACE, LONDON, LONDON, KENSINGTON AND CHELSEA, SW1X 9RX  3670000 17000000  13330000
12:        22, , ILCHESTER PLACE, , LONDON, KENSINGTON AND CHELSEA, W14 8AA  3350000 16250000  12900000
13:                3, , BOLNEY GATE, , LONDON, CITY OF WESTMINSTER, SW7 1QW  5650000 18250000  12600000
14:        JUNIPER PARK, UNIT 3, FENTON WAY, , BASILDON, BASILDON, SS15 6RZ 12600000   172500 -12427500
15:           10, , WALTON PLACE, , LONDON, KENSINGTON AND CHELSEA, SW3 1RJ   356000 12750000  12394000
16: 84, MAISONETTE C, EATON SQUARE, , LONDON, CITY OF WESTMINSTER, SW1W 9AG  1500000 13400000  11900000
17:          3, , CHESTERFIELD HILL, , LONDON, CITY OF WESTMINSTER, W1J 5BJ   955000 12600000  11645000
18:   39, , ENNISMORE GARDENS, LONDON, LONDON, CITY OF WESTMINSTER, SW7 1AG  3650000 15250000  11600000
19:       76, FLAT 2, EATON SQUARE, , LONDON, CITY OF WESTMINSTER, SW1W 9AW  3500000 15000000  11500000
20:                            85, , AVENUE ROAD, , LONDON, CAMDEN, NW8 6JD   519000 12000000  11481000

Most of the entries here are in Westminster or Hyde Park and don’t look particularly dodgy at first glance. We’d have to drill into the sale dates to confirm.

What you might also have noticed is that our Melton Mowbray and Juniper Park properties both show up and although they don’t have the biggest £ value difference they would probably rank top if we calculated the multiplier instead. Let’s give that a try:

> dt[, V2.multiplier := ifelse(V2 > lag.V2, V2 / lag.V2, lag.V2 / V2)]
 
> dt[!is.na(V2.multiplier)][order(-V2.multiplier)][, .(fullAddress, lag.V2, V2, V2.multiplier)][1:20]
                                                                            fullAddress   lag.V2       V2 V2.multiplier
 1:                  24, , MAIN ROAD, ASFORDBY VALLEY, MELTON MOWBRAY, MELTON, LE14 3SP 17250000    32500     530.76923
 2:                          LEA HAVEN, FLAT 1, CASTLE LANE, , TORQUAY, TORBAY, TQ1 3BE    38000  7537694     198.36037
 3:   NIGHTINGALE HOUSE, , BURLEIGH ROAD, ASCOT, ASCOT, WINDSOR AND MAIDENHEAD, SL5 7LD     9500  1100000     115.78947
 4:                    JUNIPER PARK, UNIT 3, FENTON WAY, , BASILDON, BASILDON, SS15 6RZ 12600000   172500      73.04348
 5:                           9, , ROTHSAY GARDENS, BEDFORD, BEDFORD, BEDFORD, MK40 3QA    21000  1490000      70.95238
 6:       22, GROUND FLOOR FLAT, SEA VIEW AVENUE, , PLYMOUTH, CITY OF PLYMOUTH, PL4 8RU    27950  1980000      70.84079
 7: 91A, , TINTERN AVENUE, WESTCLIFF-ON-SEA, WESTCLIFF-ON-SEA, SOUTHEND-ON-SEA, SS0 9QQ    17000  1190000      70.00000
 8:     204C, , SUTTON ROAD, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SS2 5ES    18000  1190000      66.11111
 9:            PRIORY COURT, FLAT 3, PRIORY AVENUE, TOTNES, TOTNES, SOUTH HAMS, TQ9 5HS  2226500    34000      65.48529
10:      59, , ST ANNS ROAD, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SS2 5AT    18250  1190000      65.20548
11:                                    15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ    13500   880000      65.18519
12:                       11, , ORMONDE GATE, , LONDON, KENSINGTON AND CHELSEA, SW3 4EU   250000 16000000      64.00000
13:                         WOODEND, , CANNONGATE ROAD, HYTHE, HYTHE, SHEPWAY, CT21 5PX    19261  1200000      62.30206
14:                 DODLESTON OAKS, , CHURCH ROAD, DODLESTON, CHESTER, CHESTER, CH4 9NG    10000   620000      62.00000
15:         CREEKSIDE, , CURLEW DRIVE, WEST CHARLETON, KINGSBRIDGE, SOUTH HAMS, TQ7 2AA    28000  1700000      60.71429
16:                              20, , BRANCH ROAD, BURNLEY, BURNLEY, BURNLEY, BB11 3AT     9000   540000      60.00000
17:             THE BARN, , LEE WICK LANE, ST OSYTH, CLACTON-ON-SEA, TENDRING, CO16 8ES    10000   600000      60.00000
18:                           11, , OAKWOOD GARDENS, KNAPHILL, WOKING, WOKING, GU21 2RX     6000   357000      59.50000
19:                              23, , OLDHAM ROAD, GRASSCROFT, OLDHAM, OLDHAM, OL4 4HY     8000   475000      59.37500
20:                  THE SUNDAY HOUSE, , WATER LANE, GOLANT, FOWEY, RESTORMEL, PL23 1LF     8000   475000      59.37500

This is much better! Our Melton Mowbray property comes in first by miles and Juniper Park is there in 4th. The rest of the price increases look implausible as well but let’s drill into a couple of them:

> dt[fullAddress == "15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ"][, .(fullAddress, V3, V2)]
                                        fullAddress         V3     V2
1: 15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ 1995-06-29  13500
2: 15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ 2008-03-28 880000

If we look at some other properties on the same road and at this property’s features it seems more likely that the sale price was meant to be £88,000.

I noticed a similar trend when looking at some of the others on this list, but I also realised that the data needs a bit of cleaning up as the ‘fullAddress’ column isn’t uniquely identifying properties – e.g. sometimes a property might have a Town/City of ‘London’ and a District of ‘London’ but on another transaction the District could be blank.

On top of that, my strategy of looking for subsequent prices to spot anomalies falls down when trying to explore properties which only have one sale.

So I have a couple of things to look into for now but once I’ve done those it’d be interesting to write an algorithm/program that could predict which transactions are likely to be anomalies.

I can imagine how that might work if I had a labelled training set but I’m not sure if I could do it with an unsupervised algorithm so if you have any pointers let me know.

Written by Mark Needham

October 18th, 2015 at 10:03 am

Posted in Data Science


Python: scikit-learn – Training a classifier with non numeric features

with one comment

Following on from my previous posts on training a classifier to pick out the speaker in sentences of HIMYM transcripts, the next thing to do was train a random forest of decision trees to see how that fared.

I’ve used scikit-learn for this before so I decided to use that. However, before building a random forest I wanted to check that I could build an equivalent decision tree.

I initially thought that scikit-learn’s DecisionTree classifier would take in data in the same format as nltk’s so I started out with the following code:

import json
import nltk
import collections
 
from himymutil.ml import pos_features
from sklearn import tree
from sklearn.cross_validation import train_test_split
 
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag) )
 
clf = tree.DecisionTreeClassifier()
 
train_data, test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
>>> train_data[1]
({'word': u'your', 'word-pos': 'PRP$', 'next-word-pos': 'NN', 'prev-word-pos': 'VB', 'prev-word': u'throw', 'next-word': u'body'}, False)
 
>>> clf.fit([item[0] for item in train_data], [item[1] for item in train_data])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/tree/tree.py", line 137, in fit
    X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/utils/validation.py", line 281, in check_arrays
    array = np.asarray(array, dtype=dtype)
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number

In fact, the classifier can only deal with numeric features so we need to translate our features into that format using DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
 
vec = DictVectorizer()
X = vec.fit_transform([item[0] for item in featuresets]).toarray()
 
>>> len(X)
13016
 
>>> len(X[0])
7302
 
>>> vec.get_feature_names()[10:15]
['next-word-pos=EX', 'next-word-pos=IN', 'next-word-pos=JJ', 'next-word-pos=JJR', 'next-word-pos=JJS']

We end up with one feature for every key/value combination that exists in featuresets.

I was initially confused about how to split up training and test data sets but it’s actually fairly easy – train_test_split allows us to pass in multiple lists which it splits along the same seam:

vec = DictVectorizer()
X = vec.fit_transform([item[0] for item in featuresets]).toarray()
Y = [item[1] for item in featuresets]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, train_size=0.80)

Next we want to train the classifier which is a couple of lines of code:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)

I wrote the following function to assess the classifier:

import collections
import nltk
 
def assess(text, predictions_actual):
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (prediction, actual) in enumerate(predictions_actual):
        refsets[actual].add(i)
        testsets[prediction].add(i)
    speaker_precision = nltk.metrics.precision(refsets[True], testsets[True])
    speaker_recall = nltk.metrics.recall(refsets[True], testsets[True])
    non_speaker_precision = nltk.metrics.precision(refsets[False], testsets[False])
    non_speaker_recall = nltk.metrics.recall(refsets[False], testsets[False])
    return [text, speaker_precision, speaker_recall, non_speaker_precision, non_speaker_recall]

We can call it like so:

predictions = clf.predict(X_test)
assessment = assess("Decision Tree", zip(predictions, Y_test))
 
>>> assessment
['Decision Tree', 0.9459459459459459, 0.9210526315789473, 0.9970134395221503, 0.9980069755854509]

Those values are in the same ball park as we’ve seen with the nltk classifier so I’m happy it’s all wired up correctly.
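
Since the original plan was to train a random forest, swapping one in should only be a small change. Here’s a sketch (n_estimators is just a guess, I haven’t tuned it):

from sklearn.ensemble import RandomForestClassifier
 
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(X_train, Y_train)
 
forest_predictions = forest.predict(X_test)
forest_assessment = assess("Random Forest", zip(forest_predictions, Y_test))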

The last thing I wanted to do was visualise the decision tree that had been created and the easiest way to do that is to export the classifier to DOT format and then use graphviz to create an image:

with open("/tmp/decisionTree.dot", 'w') as file:
    tree.export_graphviz(clf, out_file = file, feature_names = vec.get_feature_names())
$ dot -Tpng /tmp/decisionTree.dot -o /tmp/decisionTree.png


The decision tree is quite a few levels deep so here’s part of it:

[Image: a section of the decision tree]

The full script is on github if you want to play around with it.

Written by Mark Needham

March 2nd, 2015 at 7:48 am

Posted in Data Science, Python


Data Science: Mo’ Data Mo’ Problems

without comments

Over the last couple of years I’ve worked on several proof of concept style Neo4j projects and on a lot of them people have wanted to work with their entire data set which I don’t think makes sense so early on.

In the early parts of a project we’re trying to prove out our approach rather than prove we can handle big data – something that Ashok taught me a couple of years ago on a project we worked on together.

In a Neo4j project that means coming up with an effective way to model and query our data and if we lose track of this it’s very easy to get sucked into working on the big data problem.

This could mean optimising our import scripts to deal with huge amounts of data or working out how to handle different aspects of the data (e.g. variability in shape or encoding) that only seem to reveal themselves at scale.

These are certainly problems that we need to solve but in my experience they end up taking much more time than expected and therefore aren’t the best problem to tackle when time is limited. Early on we want to create some momentum and keep the feedback cycle fast.

We probably want to tackle the data size problem as part of the implementation/production stage of the project, to use Michael Nygard’s terminology.

At this stage we’ll have some confidence that our approach makes sense and then we can put aside the time to set things up properly.

I’m sure there are some types of projects where this approach doesn’t make sense so I’d love to hear about them in the comments so I can spot them in future.

Written by Mark Needham

June 28th, 2014 at 11:35 pm

Posted in Data Science


Data Science: Don’t build a crawler (if you can avoid it!)

with one comment

On Tuesday I spoke at the Data Science London meetup about football data and I started out by covering some lessons I’ve learnt about building data sets for personal use when open data isn’t available.

When that’s the case you often end up scraping HTML pages to extract the data that you’re interested in and then storing that in files or in a database if you want to be more fancy.

Ideally we want to spend our time playing with the data rather than gathering it so we want to keep this stage to a minimum, which we can do by following these rules.

Don’t build a crawler

One of the most tempting things to do is build a crawler which starts on the home page and then follows some/all the links it comes across, downloading those pages as it goes.

This is incredibly time consuming and yet this was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in neo4j about 18 months ago.

Ashok wanted to get the same data a few months later and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted and then built up a list of pages to download.

It took him in the order of minutes to build a script that would get the data whereas I spent many hours using the crawler based approach.

If there is no discernible URI structure or if you want to get every single page then the crawler approach might make sense but I try to avoid it as a first port of call.
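
To show what the alternative looks like, if the pages follow a predictable pattern we can just generate the list of URIs up front – a sketch (the URL pattern is made up):

with open("uris.txt", "w") as uris_file:
    for page in range(1, 61):
        uris_file.write("https://www.some-made-up-place.com/page%d.html\n" % page)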

Download the files

The second thing I learnt is that running Web Driver or nokogiri or enlive against live web pages and then only storing the parts of the page we’re interested in is suboptimal.

We pay the network cost every time we run the script and at the beginning of a data gathering exercise we won’t know exactly what data we need so we’re bound to have to run it multiple times until we get it right.

It’s much quicker to download the files to disk and work on them locally.

Use wget

Having spent a lot of time writing different tools to download the ThoughtWorks data set Ashok asked me why I wasn’t using wget instead.

I couldn’t think of a good reason so now I favour building up a list of URIs and then letting wget take care of downloading them for us. e.g.

$ head -n 5 uris.txt
https://www.some-made-up-place.com/page1.html
https://www.some-made-up-place.com/page2.html
https://www.some-made-up-place.com/page3.html
https://www.some-made-up-place.com/page4.html
https://www.some-made-up-place.com/page5.html
 
$ cat uris.txt | time xargs wget
...
Total wall clock time: 3.7s
Downloaded: 60 files, 625K in 0.7s (870 KB/s)
        3.73 real         0.03 user         0.09 sys

If we need to speed things up we can always use the ‘-P’ flag of xargs to do so:

cat uris.txt | time xargs -n1 -P10 wget
        1.65 real         0.20 user         0.21 sys

It pays to be reasonably sensible when using tools like this and of course read the terms and conditions of the site to check what they have to say about downloading copies of pages for personal use.

Given that you can get the pages using a web browser anyway it’s generally fine but it makes sense not to bombard their site with requests for every single page and instead just focus on the data you’re interested in.

Written by Mark Needham

September 19th, 2013 at 6:55 am

Posted in Data Science


Micro Services Style Data Work Flow

with 2 comments

Having worked on a few data related applications over the last ten months or so, Ashok and I were recently discussing some of the things that we’ve learnt.

One of the things he pointed out is that it’s very helpful to separate the different stages of a data work flow into their own applications/scripts.

I decided to try out this idea with some football data that I’m currently trying to model and I ended up with the following stages:

[Diagram: the data work flow – Find, Download, Extract, Import]

The stages do the following:

  • Find – Finds web pages which have the data we need and writes the URLs of those pages to a text file.
  • Download – Reads in the URLs and downloads the contents to the file system.
  • Extract – Reads in the web pages from the file system and using CSS selectors extracts appropriate data and saves JSON files to disk.
  • Import – Reads in the JSON files and creates nodes/relationships in neo4j.

It’s reasonably similar to micro services except instead of using HTTP as the protocol between each part we use text files as the interface between different scripts.

In fact it’s more like a variation of Unix pipelining as described in The Art of Unix Programming except we store the results of each stage of the pipeline instead of piping them directly into the next one.
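
To make the shape of this more concrete, here’s a minimal Python sketch – the real stages were Ruby and Java scripts, and the file names and stub logic here are made up for illustration (each stage is shown as a function rather than a separate script to keep it short):

import json
 
def find():
    # in the real work flow this finds the pages we care about
    uris = ["http://example.com/match/%d" % i for i in range(3)]
    with open("uris.txt", "w") as f:
        f.write("\n".join(uris))
 
def download():
    # in the real work flow this is just wget over uris.txt
    with open("uris.txt") as f, open("pages.txt", "w") as out:
        for uri in f:
            out.write("<html>stub page for %s</html>\n" % uri.strip())
 
def extract():
    # in the real work flow this pulls data out of the HTML with CSS selectors
    with open("pages.txt") as f, open("matches.json", "w") as out:
        json.dump([{"page": line.strip()} for line in f], out)
 
def import_into_neo4j():
    # in the real work flow this creates nodes/relationships in neo4j
    with open("matches.json") as f:
        print(json.load(f))
 
find(); download(); extract(); import_into_neo4j()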

If following the Unix way isn’t enough of a reason to split up the problem like this there are a couple of other reasons why this approach is useful:

  • We end up tweaking some parts more than others therefore it’s good if we don’t have to run all the steps each time we make a change e.g. I find that I spend much more time in the extract & import stages than in the other two stages. Once I’ve got the script for getting all the data written it doesn’t seem to change that substantially.
  • We can choose the appropriate technology to do each of the jobs. In this case I find that any data processing is much easier to do in Ruby but the data import is significantly quicker if you use the Java API.
  • We can easily make changes to the work flow if we find a better way of doing things.

That third advantage became clear to me on Saturday when I realised that waiting 3 minutes for the import stage to run each time was becoming quite frustrating.

All node/relationship creation was happening via the REST interface from a Ruby script since that was the easiest way to get started.

I was planning to plug in some Java code using the batch importer to speed things up until Ashok pointed me to a CSV driven batch importer which seemed like it might be even better.

That batch importer takes CSV files of nodes and edges as its input so I needed to add another stage to the work flow if I wanted to use it:

[Diagram: the updated data work flow with an extra ‘Extract to CSV’ stage]

I spent a few hours working on the ‘Extract to CSV’ stage and then replaced the initial ‘Import’ script with a call to the batch importer.

It now takes 1.3 seconds to go through the last two stages instead of 3 minutes for the old import stage.

Since all I added was another script that took a text file as input and created text files as output it was really easy to make this change to the work flow.

I’m not sure how well this scales if you’re dealing with massive amounts of data but you can always split the data up into multiple files if the size becomes unmanageable.

Written by Mark Needham

February 18th, 2013 at 10:16 pm

Posted in Data Science


Data Science: Don’t filter data prematurely

without comments

Last year I wrote a post describing how I’d gone about getting data for my ThoughtWorks graph and, in retrospect, one mistake in my approach is that I filtered the data too early.

My workflow looked like this:

  • Scrape internal application using web driver and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j

The problem with the first step is that I was trying to determine up front what data was useful and as a result I ended up running the scraping application multiple times when I realised I didn’t have all the data I wanted.

Since it took a couple of hours to run each time it was tremendously frustrating but it took me a while to realise how flawed my approach was.

For some reason I kept tweaking the scraper just to get a little bit more data each time!

It wasn’t until Ashok and I were doing some similar work and had to extract data from an existing database that I realised the filtering didn’t need to be done so early in the process.

We weren’t sure exactly what data we needed but on this occasion we got everything around the area we were working in and looked at how we could actually use it at a later stage.

Given that it’s relatively cheap to store the data I think this approach makes sense more often than not – we can always delete the data if we realise it’s not useful to us at a later stage.

It especially makes sense if it’s difficult to get more data either because it’s time consuming or we need someone else to give us access to it and they are time constrained.

If I could rework that work flow it’d now be split into three steps:

  • Scrape internal application using web driver and save pages as HTML documents
  • Parse HTML documents and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j

I think my experiences tie in reasonably closely with those I heard about at Strata Conf London but of course I may well be wrong so if anyone has other points of view I’d love to hear them.

Written by Mark Needham

February 17th, 2013 at 8:02 pm

Posted in Data Science


Data Science: Discovery work

with one comment

Aaron Erickson recently wrote a blog post where he talks through some of the problems he’s seen with big data initiatives where organisations end up buying a product and expecting it to magically produce results.

[…] corporate IT departments are suddenly are looking at their long running “Business Intelligence” initiatives and wondering why they are not seeing the same kinds of return on investment. They are thinking… if only we tweaked that “BI” initiative and somehow mix in some “Big Data”, maybe *we* could become the next Amazon.

He goes on to suggest that a more ‘agile’ approach might be more beneficial whereby we drive our work from a business problem with a small team in a short discovery exercise. We can then build on top of that if we’re seeing good results.

A few months ago Ashok and I were doing this type of work for one of our clients and afterwards we tried to summarise how it differed to a normal project.

Hacker Mentality

Since the code we’re writing is almost certainly going to be throwaway it doesn’t make sense to spend a lot of time making it beautiful. It just needs to work.

We didn’t spend any time setting up a continuous integration server or a centralised source control repository since there were only two of us. These things make sense when you have a bigger team and more time but for this type of work it feels like overkill.

Most of the code we wrote was in Ruby because that was the language in which we could hack together something useful in the least amount of time but I’m sure others could go just as fast in other languages. We did, however, end up moving some of the code to Java later on after realising the performance gains we’d get from doing so.

2 or 3 hour ‘iterations’

As I mentioned in a previous post we took the approach of finding questions that we wanted the answers to and then spending a few hours working on those before talking to our client again.

Since we don’t really know what the outcome of our discovery work is going to be we want to be able to quickly change direction and not go down too many rabbit holes.

1 or 2 weeks in total

We don’t have any data to prove this but it seems like you’d need a week or two to iterate through enough ideas that you’d have a reasonable chance of coming up with something useful.

It took us 4 days before we zoomed in on something that was useful to the client and allowed them to learn something that they didn’t previously know.

If we do find something worth pursuing then we’d want to bake that work into the normal project back log and then treat it the same as any other piece of work, driven by priority and so on.

Small team

You could argue that small teams are beneficial all the time but it’s especially the case here if we want to keep the feedback cycle tight and the communication overhead low.

Our thinking was that 2 or 3 people would probably be sufficient where 2 of the people would be developers and 1 might be someone with a UX background to help do any visualisation work.

If the domain was particularly complex then that 3rd person could be someone with experience in that area who could help derive useful questions to answer.

Written by Mark Needham

December 9th, 2012 at 10:36 am

Posted in Data Science


Nygard Big Data Model: The Investigation Stage

without comments

Earlier this year Michael Nygard wrote an extremely detailed post about his experiences in the world of big data projects and included in the post was the following diagram which I’ve found very useful.


Nygard’s Big Data Model (shamelessly borrowed by me because it’s awesome)

Ashok and I have been doing some work in this area helping one of our clients make sense of and visualise some of their data and we realised retrospectively that we were acting very much in the investigation stage of the model.

In particular Nygard makes the following suggestions about the way that we work when we’re in this mode:

We don’t want to invest in fully automated machine learning and feedback. That will follow once we validate a hypothesis and want to integrate it into our routine operation.

Ad-hoc analysis refers to human-based data exploration. This can be as simple as spreadsheets with line graphs…the key aspect is that most of the tools are interactive. Questions are expressed as code, but that code is usually just “one shot” and is not meant for production operations.

In our case we weren’t doing anything as complicated as machine learning – most of our work was working out the relationships between things, how best to model those and what visualisations would best describe what we were seeing.

We didn’t TDD any of the code, we copy/pasted a lot and when we had a longer running query we didn’t try and optimise it there and then, instead we ran it once, saved the results to a file and then used the file to load it onto the UI.

We were able to work in iterations of 2/3 hours during which we tried to answer a question (or more than one if we had time) and then showed our client what we’d managed to do and then decided where we wanted to go next.

To start with we did all this with a subset of the actual data set and then once we were on the right track we loaded in the rest of the data.

We can easily get distracted by the difficulties of loading large amounts of data before checking whether what we’re doing makes sense.

We iterated 4 or 5 different ideas before we got to one that allowed us to explore an area which hadn’t previously been explored.

Now that we’ve done that we’re rewriting the application from scratch, still using the same ideas as from the initial prototype, but this time making sure the queries can run on the fly and making the code a bit closer to production quality!

We’ve moved into the implementation stage of the model for this avenue although, if I understand the model correctly, it would be ok to go back into investigation mode if we want to do some discovery work with other parts of the data.

I’m probably quoting this model way too much to people that I talk to about this type of work but I think it’s really good so nice work Mr Nygard!

Written by Mark Needham

October 10th, 2012 at 12:00 am

Posted in Data Science
