Mark Needham

Thoughts on Software Development

Archive for the ‘data science’ tag

scikit-learn: Random forests – Feature Importance

without comments

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:

Prerequisites

import numpy as np
import pandas as pd
 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
 
# We'll use this library to make the display pretty
from tabulate import tabulate

Load Data

train = pd.read_csv('train.csv')
 
# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

Split train/test sets

y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Train model

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)

Feature Importance

headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them)
  • Remove the most predictive features and build a model that only uses the other features

Written by Mark Needham

June 16th, 2017 at 5:55 am

Luigi: An ExternalProgramTask example – Converting JSON to CSV

with 2 comments

I’ve been playing around with the Python library Luigi which is used to build pipelines of batch jobs and I struggled to find an example of an ExternalProgramTask so this is my attempt at filling that void.

Luigi - the Python data library for building data science pipelines

I’m building a little data pipeline to get data from the meetup.com API and put it into CSV files that can be loaded into Neo4j using the LOAD CSV command.

The first task I created calls the /groups endpoint and saves the result into a JSON file:

import luigi
import requests
import json
from collections import Counter
 
class GroupsToJSON(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def run(self):
        seed_topic = "nosql"
        uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(seed_topic, self.lat, self.lon, self.key)
 
        r = requests.get(uri)
        all_topics = [topic["urlkey"]  for result in r.json()["results"] for topic in result["topics"]]
        c = Counter(all_topics)
 
        topics = [entry[0] for entry in c.most_common(10)]
 
        groups = {}
        for topic in topics:
            uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(topic, self.lat, self.lon, self.key)
            r = requests.get(uri)
            for group in r.json()["results"]:
                groups[group["id"]] = group
 
        with self.output().open('w') as groups_file:
            json.dump(list(groups.values()), groups_file, indent=4, sort_keys=True)
 
    def output(self):
        return luigi.LocalTarget("/tmp/groups.json")

We define a few parameters at the top of the class which will be passed in when this task is executed. The most interesting lines of the run function are the last couple where we write the JSON to a file. self.output() refers to the target defined in the output function which in this case is /tmp/groups.json.

Now we need to create a task to convert that JSON file into CSV format. The jq command line tool does this job well so we’ll use that. The following task does the job:

from luigi.contrib.external_program import ExternalProgramTask
 
class GroupsToCSV(luigi.contrib.external_program.ExternalProgramTask):
    file_path = "/tmp/groups.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def program_args(self):
        return ["./groups.sh", self.input()[0].path, self.output().path]
 
    def output(self):
        return luigi.LocalTarget(self.file_path)
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

groups.sh

#!/bin/bash
 
in=${1}
out=${2}
 
echo "id,name,urlname,link,rating,created,description,organiserName,organiserMemberId" > ${out}
jq -r '.[] | [.id, .name, .urlname, .link, .rating, .created, .description, .organizer.name, .organizer.member_id] | @csv' ${in} >> ${out}

I wanted to call jq directly from the Python code but I couldn’t figure out how to do it so putting that code in a shell script is my workaround.

The last piece of the puzzle is a wrapper task that launches the others:

import os
 
class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")
 
    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")
 
        yield GroupsToCSV(key, lat, lon)

Now we’re ready to run the tasks:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
DEBUG: Checking if GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   PENDING
DEBUG: Checking if GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
INFO: Running command: ./groups.sh /tmp/groups.json /tmp/groups.csv
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   Meetup()
Running Meetup
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      Meetup()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====
 
Scheduled 3 tasks of which:
* 3 ran successfully:
    - 1 GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
    - 1 GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
    - 1 Meetup()
 
This progress looks :) because there were no failed tasks or missing external dependencies
 
===== Luigi Execution Summary =====

Looks good! Let’s quickly look at our CSV file:

$ head -n10 /tmp/groups.csv 
id,name,urlname,link,rating,created,description,organiserName,organiserMemberId
1114381,"London NoSQL, MySQL, Open Source Community","london-nosql-mysql","https://www.meetup.com/london-nosql-mysql/",4.28,1208505614000,"<p>Meet others in London interested in NoSQL, MySQL, and Open Source Databases.</p>","Sinead Lawless",185675230
1561841,"Enterprise Search London Meetup","es-london","https://www.meetup.com/es-london/",4.66,1259157419000,"<p>Enterprise Search London is a meetup for anyone interested in building search and discovery experiences — from intranet search and site search, to advanced discovery applications and beyond.</p>
<p>Disclaimer: This meetup is NOT about SEO or search engine marketing.</p>
<p><strong>What people are saying:</strong></p>
<ul>
<li><span>""Join this meetup if you have a passion for enterprise search and user experience that you would like to share with other able-minded practitioners."" — Vegard Sandvold</span></li>
<li><span>""Full marks for vision and execution. Looking forward to the next Meetup."" — Martin White</span></li>
<li><span>“Consistently excellent” — Helen Lippell</span></li>
</ul>

Sweet! And what if we run it again?

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=172768377, workers=1, host=Marks-MBP-4, username=markneedham, pid=4531) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====
 
Scheduled 1 tasks of which:
* 1 present dependencies were encountered:
    - 1 Meetup()
 
Did not run any tasks
This progress looks :) because there were no failed tasks or missing external dependencies
 
===== Luigi Execution Summary =====

As expected nothing happens since our dependencies are already satisfied and we have our first Luigi pipeline up and running.

Written by Mark Needham

March 25th, 2017 at 2:09 pm

Posted in Python

Tagged with , , ,