Mark Needham

Thoughts on Software Development

Archive for the ‘Languages’ Category

Python 3: Create sparklines using matplotlib


I recently wanted to create sparklines to show how some values were changing over time. In addition, I wanted to generate them as images on the server rather than introducing a JavaScript library.

Chris Seymour’s excellent gist, which shows how to create sparklines inside a Pandas dataframe, got me most of the way there, but I had to tweak his code a bit to get it to play nicely with Python 3.6.

This is what I ended up with:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import base64
 
from io import BytesIO
 
def sparkline(data, figsize=(4, 0.25), **kwargs):
    """
    Returns a base64 encoded sparkline style plot, ready for embedding
    in an HTML image tag
    """
    data = list(data)

    fig, ax = plt.subplots(1, 1, figsize=figsize, **kwargs)
    ax.plot(data)

    # Hide the spines and ticks so only the line remains
    for v in ax.spines.values():
        v.set_visible(False)
    ax.set_xticks([])
    ax.set_yticks([])

    # Mark the most recent value with a red dot
    plt.plot(len(data) - 1, data[-1], 'r.')

    # Lightly shade the area under the line
    ax.fill_between(range(len(data)), data, len(data) * [min(data)], alpha=0.1)

    img = BytesIO()
    plt.savefig(img, transparent=True, bbox_inches='tight')
    img.seek(0)
    plt.close()

    return base64.b64encode(img.read()).decode("UTF-8")

I had to change the class used to write the image from StringIO to BytesIO, and I found I needed to decode the resulting bytes if I wanted the image to display in an HTML page.

This is how you would call the above function:

if __name__ == "__main__":
    values = [
        [1,2,3,4,5,6,7,8,9,10],
        [7,10,12,18,2,8,10,6,7,12],
        [10,9,8,7,6,5,4,3,2,1]
    ]
 
    with open("/tmp/foo.html", "w") as file:
        for value in values:
            file.write('<div><img src="data:image/png;base64,{}"/></div>'.format(sparkline(value)))

And the HTML page looks like this:

[Screenshot: the three sparklines rendered in the browser]

Written by Mark Needham

September 23rd, 2017 at 6:51 am

PHP vs Python: Generating a HMAC


I’ve been writing a bit of code to integrate with a ClassMarker webhook, which requires you to verify that an incoming request actually came from ClassMarker by checking a base64-encoded HMAC SHA256 hash of the request body.

The example in the documentation is written in PHP, which I haven’t used for about 10 years, so I had to figure out how to do the same thing in Python.

This is the PHP version:

$ php -a
php > echo base64_encode(hash_hmac("sha256", "my data", "my_secret", true));
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=

The Python equivalent is a bit more code but it’s not too bad.

Import all the libraries

import hmac
import hashlib
import base64

Generate that hash

data = "my data".encode("utf-8")
digest = hmac.new(b"my_secret", data, digestmod=hashlib.sha256).digest()
 
print(base64.b64encode(digest).decode())
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=

We’re getting the same value as the PHP version so it’s good times all round.
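
If you’re handling the webhook itself, the natural next step is to compare that value against the signature ClassMarker sends with the request. A minimal sketch using only the standard library — the header name here is my assumption rather than something taken from the ClassMarker docs:

import hmac
import hashlib
import base64

def verify_signature(payload, received_signature, secret):
    # Recompute the base64-encoded HMAC SHA256 of the raw request body
    digest = hmac.new(secret, payload, digestmod=hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    # compare_digest avoids leaking timing information during the comparison
    return hmac.compare_digest(expected, received_signature)

# e.g. in a request handler (header name is an assumption - check the docs):
# verify_signature(request.data, request.headers["X-Classmarker-Hmac-Sha256"], b"my_secret")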

Written by Mark Needham

August 2nd, 2017 at 6:09 am

Posted in Python


Pandas/scikit-learn: get_dummies test/train sets – ValueError: shapes not aligned


I’ve been using pandas’ get_dummies function to generate dummy columns for categorical variables to use with scikit-learn, but noticed that it sometimes doesn’t work as I expect.

Prerequisites

import pandas as pd
import numpy as np
from sklearn import linear_model

Let’s say we have the following training and test sets:

Training set

train = pd.DataFrame({"letter":["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
y_train = train["value"]

Test set

test = pd.DataFrame({"letter":["D", "D", "B", "E"], "value": [4, 5, 7, 19]})
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
y_test = test["value"]

Now say we want to train a linear model on our training set and then use it to predict the values in our test set:

Train the model

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model

model.score(X_test, y_test)
ValueError: shapes (4,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)

Hmmm, that didn’t go to plan. If we print X_train and X_test it might shed some light:

Checking the train/test datasets

print(X_train)
   letter_A  letter_B  letter_C  letter_D
0         1         0         0         0
1         0         1         0         0
2         0         0         1         0
3         0         0         0         1
print(X_test)
   letter_B  letter_D  letter_E
0         0         1         0
1         0         1         0
2         1         0         0
3         0         0         1

They do indeed have different shapes and some different column names because the test set contained some values that weren’t present in the training set.

We can fix this by making the ‘letter’ field categorical before we run the get_dummies method over the dataframe. At the moment the field is of type ‘object’:

Column types

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
letter    4 non-null object
value     4 non-null int64
dtypes: int64(1), object(1)
memory usage: 144.0+ bytes

We do that by converting the ‘letter’ field to type ‘category’, with the list of allowed values set to the unique values across the train and test sets.

All allowed values

all_data = pd.concat((train,test))
for column in all_data.select_dtypes(include=[np.object]).columns:
    print(column, all_data[column].unique())
letter ['A' 'B' 'C' 'D' 'E']

Now let’s update the type of our ‘letter’ field in the train and test dataframes.

Type: ‘category’

all_data = pd.concat((train,test))
 
for column in all_data.select_dtypes(include=[np.object]).columns:
    train[column] = train[column].astype('category', categories = all_data[column].unique())
    test[column] = test[column].astype('category', categories = all_data[column].unique())

And now if we call get_dummies on either dataframe we’ll get the same set of columns:

get_dummies: Take 2

X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
print(X_train)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         1         0         0         0         0
1         0         1         0         0         0
2         0         0         1         0         0
3         0         0         0         1         0
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
print(X_test)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         0         0         0         1         0
1         0         0         0         1         0
2         0         1         0         0         0
3         0         0         0         0         1

Great! Now we should be able to train our model and use it against the test set:

Train the model: Take 2

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model: Take 2

model.score(X_test, y_test)
-1.0604490500863557

And we’re done!
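
An alternative sketch, which avoids mutating the original dataframes, is to build the dummies as before and then reindex the test columns against the training ones:

X_train = pd.get_dummies(train.drop(["value"], axis=1))
X_test = pd.get_dummies(test.drop(["value"], axis=1))

# Add training-only columns to the test set (filled with 0) and drop
# test-only columns the model has never seen (letter_E in this example)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

Note that this silently discards categories that only appear in the test set, which may or may not be what you want.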

Written by Mark Needham

July 5th, 2017 at 3:42 pm

Posted in Python


Pandas: Find rows where column/field is null


In my continued playing around with the Kaggle house prices dataset I wanted to find any columns/fields that contain null values.

If we want to get a count of the number of null fields by column we can use the following code, adapted from Poonam Ligade’s kernel:

Prerequisites

import pandas as pd

Count the null columns

train = pd.read_csv("train.csv")
null_columns=train.columns[train.isnull().any()]
train[null_columns].isnull().sum()
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

So there are lots of different columns containing null values. What if we want to find the solitary row which has ‘Electrical’ as null?

Single column is null

print(train[train["Electrical"].isnull()][null_columns])
      LotFrontage Alley MasVnrType  MasVnrArea BsmtQual BsmtCond BsmtExposure  \
1379         73.0   NaN       None         0.0       Gd       TA           No   
 
     BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType  GarageYrBlt  \
1379          Unf          Unf        NaN         NaN    BuiltIn       2007.0   
 
     GarageFinish GarageQual GarageCond PoolQC Fence MiscFeature  
1379          Fin         TA         TA    NaN   NaN         NaN

And what if we want to return every row that contains at least one null value? That’s not too difficult – it’s just a combination of the code in the previous two sections:

All null columns

print(train[train.isnull().any(axis=1)][null_columns].head())
   LotFrontage Alley MasVnrType  MasVnrArea BsmtQual BsmtCond BsmtExposure  \
0         65.0   NaN    BrkFace       196.0       Gd       TA           No   
1         80.0   NaN       None         0.0       Gd       TA           Gd   
2         68.0   NaN    BrkFace       162.0       Gd       TA           Mn   
3         60.0   NaN       None         0.0       TA       Gd           No   
4         84.0   NaN    BrkFace       350.0       Gd       TA           Av   
 
  BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType  GarageYrBlt  \
0          GLQ          Unf      SBrkr         NaN     Attchd       2003.0   
1          ALQ          Unf      SBrkr          TA     Attchd       1976.0   
2          GLQ          Unf      SBrkr          TA     Attchd       2001.0   
3          ALQ          Unf      SBrkr          Gd     Detchd       1998.0   
4          GLQ          Unf      SBrkr          TA     Attchd       2000.0   
 
  GarageFinish GarageQual GarageCond PoolQC Fence MiscFeature  
0          RFn         TA         TA    NaN   NaN         NaN  
1          RFn         TA         TA    NaN   NaN         NaN  
2          RFn         TA         TA    NaN   NaN         NaN  
3          Unf         TA         TA    NaN   NaN         NaN  
4          RFn         TA         TA    NaN   NaN         NaN
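
And if we want to know which rows are the worst offenders, we can count the nulls per row and sort. A quick sketch:

# Number of null fields in each row, largest first
null_counts = train.isnull().sum(axis=1)
print(null_counts.sort_values(ascending=False).head())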

Hope that helps future Mark!

Written by Mark Needham

July 5th, 2017 at 2:31 pm

Posted in Python


scikit-learn: Random forests – Feature Importance


As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:

Prerequisites

import numpy as np
import pandas as pd
 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
 
# We'll use this library to make the display pretty
from tabulate import tabulate

Load Data

train = pd.read_csv('train.csv')
 
# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

Split train/test sets

y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Train model

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)

Feature Importance

headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them)
  • Remove the most predictive features and build a model that only uses the other features
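
As a starting point for that pruning, the feature importances we already have can drive a simple filter. A minimal sketch — the 0.01 cut-off is an arbitrary choice, not something I’ve validated:

# Keep only the features whose importance clears the threshold
important = [name for name, score in zip(X_train.columns, model.feature_importances_)
             if score > 0.01]
X_train_reduced = X_train[important]
X_test_reduced = X_test[important]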

Written by Mark Needham

June 16th, 2017 at 5:55 am

Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values


I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

import pandas as pd
train = pd.read_csv('train.csv')
null_columns=train.columns[train.isnull().any()]
 
>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes ‘Linear feet of street connected to property’. There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]
 
>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn‘s linear regression model, which only works with numeric values, so we need to convert our categorical variables into numeric equivalents. We can use pandas’ get_dummies function for this.

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)
 
>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig, and then we need to add these new columns onto the original DataFrame. We can use pandas’ concat function to do this.

import numpy as np
 
data = pd.concat([
        sub_train[cols],
        pd.get_dummies(sub_train[cols].LotShape),
        pd.get_dummies(sub_train[cols].LotConfig)
    ], axis=1).select_dtypes(include=[np.number])
 
>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
 
X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage
 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
 
lr = linear_model.LinearRegression()
 
model = lr.fit(X_train, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 
 -0.84137438493

Hmm, that didn’t work too well – an R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score
 
>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))
0.0

whereas if we had all of the values correct we’d get a score of 1:

>>> print(r2_score(y_test, y_test))
1.0

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of LotArea, so perhaps that’s the best we can do here.
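
A minimal sketch of that fallback, based on my reading of Poonam’s approach — treat the exact formula as an assumption:

import numpy as np

# fillna aligns on the index, so each missing LotFrontage gets the
# square root of that row's LotArea
train["LotFrontage"] = train["LotFrontage"].fillna(np.sqrt(train["LotArea"]))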

Written by Mark Needham

June 4th, 2017 at 9:22 am

Posted in Data Science,Python


Leaflet: Mapping Strava runs/polylines on Open Street Map


I’m a big Strava user and spent a bit of time last weekend playing around with their API to work out how to map all my runs.

[Screenshot: one of my Strava runs plotted on a map]

Strava API and polylines

This is a two-step process:

  1. Call the /athlete/activities/ endpoint to get a list of all my activities
  2. For each of those activities, call the /activities/[activityId] endpoint to get more detailed information for each activity

That second endpoint returns a ‘polyline’ property which the documentation describes as follows:

Activity and segment API requests may include summary polylines of their respective routes. The values are string encodings of the latitude and longitude points using the Google encoded polyline algorithm format.

If we navigate to that page we get the following explanation:

Polyline encoding is a lossy compression algorithm that allows you to store a series of coordinates as a single string.

I tried out a couple of my polylines using the interactive polyline encoder utility which worked well once I realised that I needed to escape backslashes (“\”) in the polyline before pasting it into the tool.
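
If you’d rather decode them programmatically, the polyline package on PyPI implements the same algorithm — a sketch, and not something I used in this post:

import polyline  # pip install polyline

# The example string from Google's polyline documentation
coords = polyline.decode("_p~iF~ps|U_ulLnnqC_mqNvxq`@")
print(coords)
# [(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]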

Now that I’d figured out how to map one run it was time to automate the process.

Leaflet and OpenStreetMap

I’ve previously had a good experience using Leaflet, so I was keen to use it again, and luckily came across a Stack Overflow answer showing how to do what I wanted.

I created an HTML file and manually pasted in a couple of my runs (not forgetting to escape those backslashes!) to check that they worked:

blog.html

<html>
  <head>
    <title>Mapping my runs</title>
  </head>
 
  <body>
    <script src="http://cdn.leafletjs.com/leaflet-0.7/leaflet.js"></script>
    <script type="text/javascript" src="https://rawgit.com/jieter/Leaflet.encoded/master/Polyline.encoded.js"></script>
    <link rel="stylesheet" href="http://cdn.leafletjs.com/leaflet-0.7/leaflet.css" />
    <div id="map" style="width: 100%; height: 100%"></div>
 
    <script>
    var map = L.map('map').setView([55.609818, 13.003286], 13);
    L.tileLayer(
        'http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
            maxZoom: 18,
        }).addTo(map);
 
    var encodedRoutes = [
      "{zkrIm`inANPD?BDXGPKLATHNRBRFtAR~AFjAHl@D|ALtATj@HHJBL?`@EZ?NQ\\Y^MZURGJKR]RMXYh@QdAWf@[~@aAFGb@?j@YJKBU@m@FKZ[NSPKTCRJD?`@Wf@Wb@g@HCp@Qh@]z@SRMRE^EHJZnDHbBGPHb@NfBTxBN|DVbCBdA^lBFl@Lz@HbBDl@Lr@Bb@ApCAp@Ez@g@bEMl@g@`B_AvAq@l@    QF]Rs@Nq@CmAVKCK?_@Nw@h@UJIHOZa@xA]~@UfASn@U`@_@~@[d@Sn@s@rAs@dAGN?NVhAB\\Ox@@b@S|A?Tl@jBZpAt@vBJhATfGJn@b@fARp@H^Hx@ARGNSTIFWHe@AGBOTAP@^\\zBMpACjEWlEIrCKl@i@nAk@}@}@yBOWSg@kAgBUk@Mu@[mC?QLIEUAuAS_E?uCKyCA{BH{DDgF`AaEr@uAb@oA~@{AE}AKw@    g@qAU[_@w@[gAYm@]qAEa@FOXg@JGJ@j@o@bAy@NW?Qe@oCCc@SaBEOIIEQGaAe@kC_@{De@cE?KD[H[P]NcAJ_@DGd@Gh@UHI@Ua@}Bg@yBa@uDSo@i@UIICQUkCi@sCKe@]aAa@oBG{@G[CMOIKMQe@IIM@KB]Tg@Nw@^QL]NMPMn@@\\Lb@P~@XT",
      "u}krIq_inA_@y@My@Yu@OqAUsA]mAQc@CS@o@FSHSp@e@n@Wl@]ZCFEBK?OC_@Qw@?m@CSK[]]EMBeAA_@m@qEAg@UoCAaAMs@IkBMoACq@SwAGOYa@IYIyA_@kEMkC]{DEaAScC@yEHkGA_ALsCBiA@mCD{CCuAZcANOH@HDZl@Z`@RFh@\\TDT@ZVJBPMVGLM\\Mz@c@NCPMXERO|@a@^Ut@s@p@KJAJ    Bd@EHEXi@f@a@\\g@b@[HUD_B@uADg@DQLCLD~@l@`@J^TF?JANQ\\UbAyABEZIFG`@o@RAJEl@_@ZENDDIA[Ki@BURQZaARODKVs@LSdAiAz@G`BU^A^GT@PRp@zARXRn@`BlDHt@ZlAFh@^`BX|@HHHEf@i@FAHHp@bBd@v@DRAVMl@i@v@SROXm@tBILOTOLs@NON_@t@KX]h@Un@k@\\c@h@Ud@]ZGNKp@Sj@KJo@    b@W`@UPOX]XWd@UF]b@WPOAIBSf@QVi@j@_@V[b@Uj@YtAEFCCELARBn@`@lBjAzD^vB^hB?LENURkAv@[Ze@Xg@Py@p@QHONMA[HGAWE_@Em@Hg@AMCG@QHq@Cm@M[Jy@?UJIA{@Ae@KI@GFKNIX[QGAcAT[JK?OVMFK@IAIUKAYJI?QKUCGFIZCXDtAHl@@p@LjBCZS^ERAn@Fj@Br@Hn@HzAHh@RfD?j@TnCTlA    NjANb@\\z@TtARr@P`AFnAGfBG`@CFE?"
  ]
 
    for (let encoded of encodedRoutes) {
      var coordinates = L.Polyline.fromEncoded(encoded).getLatLngs();
 
      L.polyline(
          coordinates,
          {
              color: 'blue',
              weight: 2,
              opacity: .7,
              lineJoin: 'round'
          }
      ).addTo(map);
    }
    </script>
  </body>
</html>

We can spin up a Python web server over that HTML file to see how it renders:

$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

And below we can see both runs plotted on the map.

[Screenshot: both runs plotted on the Open Street Map]

Automating Strava API to Open Street Map

The final step is to automate the whole thing so that I can see all of my runs.

I wrote the following script to call the Strava API and save the polyline for every run to a CSV file:

import requests
import os
import csv

token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}

with open("runs.csv", "w") as runs_file:
    writer = csv.writer(runs_file, delimiter=",")
    writer.writerow(["id", "polyline"])

    # The activities endpoint is paginated, so keep requesting pages
    # until we get an empty one back
    page = 1
    while True:
        r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers = headers)
        response = r.json()

        if len(response) == 0:
            break

        # Fetch the detailed version of each activity to get its polyline
        for activity in response:
            r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
            polyline = r.json()["map"]["polyline"]
            writer.writerow([activity["id"], polyline])
        page += 1

I then wrote a simple script using Flask to parse the CSV file and send a JSON representation of my runs to a slightly modified version of the HTML page that I described above:

from flask import Flask
from flask import render_template
import csv
import json
 
app = Flask(__name__)
 
@app.route('/')
def my_runs():
    runs = []
    with open("runs.csv", "r") as runs_file:
        reader = csv.DictReader(runs_file)
 
        for row in reader:
            runs.append(row["polyline"])
 
    return render_template("leaflet.html", runs = json.dumps(runs))
 
if __name__ == "__main__":
    app.run(port = 5001)

I changed the following line in the HTML file:

var encodedRoutes = {{ runs|safe }};

Now we can launch our Flask web server:

$ python app.py 
 * Running on http://127.0.0.1:5001/ (Press CTRL+C to quit)

And if we navigate to http://127.0.0.1:5001/ we can see all my runs that went near Westminster:

[Screenshot: all my runs that went near Westminster]

The full code for all the files I’ve described in this post is available on GitHub. If you give it a try you’ll need to provide your Strava token in the ‘TOKEN’ environment variable before running extract_runs.py.

Hope this was helpful and if you have any questions ask me in the comments.

Written by Mark Needham

April 29th, 2017 at 3:36 pm

Posted in Javascript


Python: Flask – Generating a static HTML page


Whenever I need to quickly spin up a web application, Python’s Flask library is my go-to tool, but I recently found myself wanting to generate a static HTML page to upload to S3 and wondered if I could use Flask for that as well.

It’s actually not too tricky. If we’re in the scope of the app context then we have access to the template rendering that we’d normally use when serving the response to a web request.

The following code will generate an HTML file based on the template file templates/blog.html:

from flask import render_template
import flask
 
app = flask.Flask('my app')
 
if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html',
            title = "My Generated Page",
            people = [{"name": "Mark"}, {"name": "Michael"}])
        print(rendered)

templates/blog.html

<!doctype html>
<html>
  <head>
	<title>{{ title }}</title>
  </head>
  <body>
	<h1>{{ title }}</h1>
  <ul>
  {% for person in people %}
    <li>{{ person.name }}</li>
  {% endfor %}
  </ul>
  </body>
</html>

If we execute the Python script it will generate the following HTML:

$ python blog.py 
<!doctype html>
<html>
  <head>
	<title>My Generated Page</title>
  </head>
  <body>
	<h1>My Generated Page</h1>
  <ul>
 
    <li>Mark</li>
 
    <li>Michael</li>
 
  </ul>
 
  </body>
</html>


And we can finish off by redirecting that output into a file:

$ python blog.py  > blog.html

We could also write to the file from Python but this seems just as easy!
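
For completeness, the write-it-from-Python version only needs a couple of extra lines:

if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html',
            title = "My Generated Page",
            people = [{"name": "Mark"}, {"name": "Michael"}])
        # Write the rendered template straight to disk instead of printing it
        with open("blog.html", "w") as blog_file:
            blog_file.write(rendered)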

Written by Mark Needham

April 27th, 2017 at 8:59 pm

Posted in Python


Luigi: Defining dynamic requirements (on output files)


In my last blog post I showed how to convert a JSON document containing meetup groups into a CSV file using Luigi, the Python library for building data pipelines. As well as creating that CSV file I wanted to go back to the meetup.com API and download all the members of those groups.

This was a rough flow of what I wanted to do:

  • Take JSON document containing all groups
  • Parse that document and for each group:
    • Call the /members endpoint
    • Save each one of those files as a JSON file
  • Iterate over all those JSON files and create a members CSV file

In the previous post we created the GroupsToJSON task which calls the /groups endpoint on the meetup API and creates the file /tmp/groups.json.

Our new task has that as its initial requirement:

class MembersToCSV(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

But we also want to create a requirement on a task that will make those calls to the /members endpoint and store the result in a JSON file.

One of the patterns that Luigi imposes on us is that each task should only create one file, so actually we have a requirement on a collection of tasks rather than just one. It took me a little while to get my head around that!

We don’t know the parameters of those tasks at compile time – we can only calculate them by parsing the JSON file produced by GroupsToJSON.

In Luigi terminology what we want to create is a dynamic requirement. A dynamic requirement is defined inside the run method of a task and can rely on the output of any tasks specified in the requires method, which is exactly what we need.

This code does the delegating part of the job:

class MembersToCSV(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
 
    def run(self):
        outputs = []
        for input in self.input():
            with input.open('r') as group_file:
                groups_json = json.load(group_file)
                groups = [str(group['id']) for group in groups_json]
 
 
                for group_id in groups:
                    members = MembersToJSON(group_id, self.key)
                    outputs.append(members.output().path)
                    yield members
 
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

Inside our run method we iterate over the output of GroupsToJSON (which is our input), yield another task for each group, and collect those tasks’ output paths in the outputs list that we’ll use later.

MembersToJSON looks like this:

class MembersToJSON(luigi.Task):
    group_id = luigi.IntParameter()
    key = luigi.Parameter()
 
 
    def run(self):
        results = []
        uri = "https://api.meetup.com/2/members?&group_id={0}&key={1}".format(self.group_id, self.key)

        # The members endpoint is paginated - follow the 'next' links
        # until there aren't any more
        while uri is not None:
            r = requests.get(uri)
            response = r.json()
            for result in response["results"]:
                results.append(result)
            uri = response["meta"]["next"] if response["meta"]["next"] else None

        with self.output().open("w") as output:
            json.dump(results, output)
 
    def output(self):
        return luigi.LocalTarget("/tmp/members/{0}.json".format(self.group_id))

This task generates one file per group containing a list of all the members of that group.

We can now go back to MembersToCSV and convert those JSON files into a single CSV file:

class MembersToCSV(luigi.Task):
    out_path = "/tmp/members.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
 
    def run(self):
        outputs = []
        for input in self.input():
            with input.open('r') as group_file:
                groups_json = json.load(group_file)
                groups = [str(group['id']) for group in groups_json]
 
 
                for group_id in groups:
                    members = MembersToJSON(group_id, self.key)
                    outputs.append(members.output().path)
                    yield members
 
        with self.output().open("w") as output:
            writer = csv.writer(output, delimiter=",")
            writer.writerow(["id", "name", "joined", "topics", "groupId"])
 
            for path in outputs:
                group_id = path.split("/")[-1].replace(".json", "")
                with open(path) as json_data:
                    d = json.load(json_data)
                    for member in d:
                        topic_ids = ";".join([str(topic["id"]) for topic in member["topics"]])
                        if "name" in member:
                            writer.writerow([member["id"], member["name"], member["joined"], topic_ids, group_id])
 
    def output(self):
        return luigi.LocalTarget(self.out_path)
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

We then just need to add our new task as a requirement of the wrapper task. Based on the wrapper task from the previous post, it ends up looking something like this:
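
import os

class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")

    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")

        yield GroupsToCSV(key, lat, lon)
        yield MembersToCSV(key, lat, lon)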

And we’re ready to roll:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup --workers 3

We’ve defined the number of workers here as we can execute those calls to the /members endpoint in parallel, and there are ~600 calls to make.

All the code from both blog posts is available as a gist if you want to play around with it.

Any questions/advice let me know in the comments or I’m @markhneedham on twitter.

Written by Mark Needham

March 28th, 2017 at 5:39 am

Posted in Python


Luigi: An ExternalProgramTask example – Converting JSON to CSV


I’ve been playing around with the Python library Luigi, which is used to build pipelines of batch jobs, and I struggled to find an example of an ExternalProgramTask, so this is my attempt at filling that void.


I’m building a little data pipeline to get data from the meetup.com API and put it into CSV files that can be loaded into Neo4j using the LOAD CSV command.

The first task I created calls the /groups endpoint and saves the result into a JSON file:

import luigi
import requests
import json
from collections import Counter
 
class GroupsToJSON(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def run(self):
        seed_topic = "nosql"
        uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(seed_topic, self.lat, self.lon, self.key)

        # Find the 10 topics that appear most often on groups tagged 'nosql'
        r = requests.get(uri)
        all_topics = [topic["urlkey"] for result in r.json()["results"] for topic in result["topics"]]
        c = Counter(all_topics)
        topics = [entry[0] for entry in c.most_common(10)]

        # Collect all the groups for those topics, de-duplicating by group id
        groups = {}
        for topic in topics:
            uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(topic, self.lat, self.lon, self.key)
            r = requests.get(uri)
            for group in r.json()["results"]:
                groups[group["id"]] = group

        with self.output().open('w') as groups_file:
            json.dump(list(groups.values()), groups_file, indent=4, sort_keys=True)
 
    def output(self):
        return luigi.LocalTarget("/tmp/groups.json")

We define a few parameters at the top of the class which will be passed in when this task is executed. The most interesting lines of the run function are the last couple where we write the JSON to a file. self.output() refers to the target defined in the output function which in this case is /tmp/groups.json.

Now we need to create a task to convert that JSON file into CSV format. The jq command line tool handles this well, so we’ll use it. The following task does the job:

from luigi.contrib.external_program import ExternalProgramTask
 
class GroupsToCSV(ExternalProgramTask):
    file_path = "/tmp/groups.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()
 
    def program_args(self):
        return ["./groups.sh", self.input()[0].path, self.output().path]
 
    def output(self):
        return luigi.LocalTarget(self.file_path)
 
    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

groups.sh

#!/bin/bash
 
in=${1}
out=${2}
 
echo "id,name,urlname,link,rating,created,description,organiserName,organiserMemberId" > ${out}
jq -r '.[] | [.id, .name, .urlname, .link, .rating, .created, .description, .organizer.name, .organizer.member_id] | @csv' ${in} >> ${out}

I wanted to call jq directly from the Python code but couldn’t figure out how to do it, so putting that code in a shell script is my workaround.
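
For what it’s worth, one way to avoid the shell script entirely is to launch jq from Python with subprocess and redirect its stdout to the output file — a sketch I haven’t battle-tested, not what this post actually uses:

import subprocess

jq_filter = ('.[] | [.id, .name, .urlname, .link, .rating, .created, '
             '.description, .organizer.name, .organizer.member_id] | @csv')

with open("/tmp/groups.csv", "w") as out:
    out.write("id,name,urlname,link,rating,created,description,organiserName,organiserMemberId\n")
    # jq writes CSV rows to stdout, which we point at the output file
    subprocess.run(["jq", "-r", jq_filter, "/tmp/groups.json"], stdout=out, check=True)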

The last piece of the puzzle is a wrapper task that launches the others:

import os
 
class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")
 
    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")
 
        yield GroupsToCSV(key, lat, lon)

Now we’re ready to run the tasks:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
DEBUG: Checking if GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   PENDING
DEBUG: Checking if GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
INFO: Running command: ./groups.sh /tmp/groups.json /tmp/groups.csv
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   Meetup()
Running Meetup
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      Meetup()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====
 
Scheduled 3 tasks of which:
* 3 ran successfully:
    - 1 GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
    - 1 GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
    - 1 Meetup()
 
This progress looks :) because there were no failed tasks or missing external dependencies
 
===== Luigi Execution Summary =====

Looks good! Let’s quickly look at our CSV file:

$ head -n10 /tmp/groups.csv 
id,name,urlname,link,rating,created,description,organiserName,organiserMemberId
1114381,"London NoSQL, MySQL, Open Source Community","london-nosql-mysql","https://www.meetup.com/london-nosql-mysql/",4.28,1208505614000,"<p>Meet others in London interested in NoSQL, MySQL, and Open Source Databases.</p>","Sinead Lawless",185675230
1561841,"Enterprise Search London Meetup","es-london","https://www.meetup.com/es-london/",4.66,1259157419000,"<p>Enterprise Search London is a meetup for anyone interested in building search and discovery experiences — from intranet search and site search, to advanced discovery applications and beyond.</p>
<p>Disclaimer: This meetup is NOT about SEO or search engine marketing.</p>
<p><strong>What people are saying:</strong></p>
<ul>
<li><span>""Join this meetup if you have a passion for enterprise search and user experience that you would like to share with other able-minded practitioners."" — Vegard Sandvold</span></li>
<li><span>""Full marks for vision and execution. Looking forward to the next Meetup."" — Martin White</span></li>
<li><span>“Consistently excellent” — Helen Lippell</span></li>
</ul>

Sweet! And what if we run it again?

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=172768377, workers=1, host=Marks-MBP-4, username=markneedham, pid=4531) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====
 
Scheduled 1 tasks of which:
* 1 present dependencies were encountered:
    - 1 Meetup()
 
Did not run any tasks
This progress looks :) because there were no failed tasks or missing external dependencies
 
===== Luigi Execution Summary =====

As expected, nothing happens since our dependencies are already satisfied, and we have our first Luigi pipeline up and running.

Written by Mark Needham

March 25th, 2017 at 2:09 pm

Posted in Python
