Mark Needham

Thoughts on Software Development

Shell: Create a comma separated string


I recently needed to generate a string with comma separated values, based on iterating a range of numbers.

e.g. we should get the following output where n = 3

foo-0,foo-1,foo-2

I only had the shell available to me so I couldn’t shell out into Python or Ruby for example. That means it’s bash scripting time!

If we want to iterate a range of numbers and print them out on the screen we can write the following code:

n=3
for i in $(seq 0 $(($n > 0 ? $n-1 : 0))); do
  echo "foo-$i"
done
 
foo-0
foo-1
foo-2

Combining them into a string is a bit more tricky, but luckily I found a great blog post by Andreas Haupt which shows what to do. Andreas is solving a more complicated problem than mine, but these are the bits of code that we need from his post.

n=3
combined=""
 
for i in $(seq 0 $(($n > 0 ? $n-1 : 0))); do
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
done
echo $combined
 
foo-0,foo-1,foo-2

This won’t work if you set n<0 but that’s ok for me! I’ll let Andreas explain how it works:

  • ${combined:+,} will return either a comma (if combined exists and is set) or nothing at all.
  • In the first invocation of the loop combined is not yet set and nothing is put out.
  • In the next rounds combined is set and a comma will be put out.

We can see it in action by printing out the value of $combined after each iteration of the loop:

n=3
combined=""
 
for i in $(seq 0 $(($n > 0 ? $n-1 : 0))); do
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
  echo $combined
done
 
foo-0
foo-0,foo-1
foo-0,foo-1,foo-2

Looks good to me!

Written by Mark Needham

June 23rd, 2017 at 12:26 pm

Posted in Shell Scripting


scikit-learn: Random forests – Feature Importance


As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:

Prerequisites

import numpy as np
import pandas as pd
 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
 
# We'll use this library to make the display pretty
from tabulate import tabulate

Load Data

train = pd.read_csv('train.csv')
 
# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

Split train/test sets

y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Train model

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)

Feature Importance

headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them – see the sketch after this list)
  • Remove the most predictive features and build a model that only uses the other features
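
A rough sketch of the first idea, combining the numeric features used above with one hot encoded versions of the categorical columns – the exact column handling here is my assumption rather than code from the original analysis:

categorical = pd.get_dummies(train.select_dtypes(include=["object"]))
numeric = train.select_dtypes(include=[np.number]).interpolate().dropna()

# Line the numeric and encoded columns up on the index and drop any rows
# that only exist in one of the two frames
features = pd.concat([numeric, categorical], axis=1).dropna()

y = features.SalePrice
X = features.drop(["SalePrice", "Id"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
model = RandomForestRegressor(n_jobs=2, n_estimators=1000).fit(X_train, y_train)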

Written by Mark Needham

June 16th, 2017 at 5:55 am

Kubernetes: Which node is a pod on?


When running Kubernetes on a cloud provider, rather than locally using minikube, it’s useful to know which node a pod is running on.

The normal command to list pods doesn’t contain this information:

$ kubectl get pod
NAME           READY     STATUS    RESTARTS   AGE       
neo4j-core-0   1/1       Running   0          6m        
neo4j-core-1   1/1       Running   0          6m        
neo4j-core-2   1/1       Running   0          2m

I spent a while searching for a command that would do this before coming across Ta-Ching Chen’s blog post while looking for something else.

Ta-Ching points out that we just need to add the flag -o wide to our original command to get the information we require:

$ kubectl get pod -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP           NODE
neo4j-core-0   1/1       Running   0          6m        10.32.3.6    gke-neo4j-cluster-default-pool-ded394fa-0kpw
neo4j-core-1   1/1       Running   0          6m        10.32.3.7    gke-neo4j-cluster-default-pool-ded394fa-0kpw
neo4j-core-2   1/1       Running   0          2m        10.32.0.10   gke-neo4j-cluster-default-pool-ded394fa-kp68

Easy!
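
If you need the same information from a script, the official Python client can report it too. This is just a sketch under the assumption that the kubernetes package is installed and your kubeconfig points at the cluster – neither is covered in this post:

from kubernetes import client, config

# Load credentials from ~/.kube/config and list pods with the node they run on
# (adjust the namespace if your pods don't live in 'default')
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.spec.node_name)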

Written by Mark Needham

June 14th, 2017 at 8:49 am

Posted in Kubernetes


Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values


I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

import pandas as pd
train = pd.read_csv('train.csv')
null_columns=train.columns[train.isnull().any()]
 
>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes ‘Linear feet of street connected to property’. There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]
 
>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn‘s linear regression model which only works with numeric values so we need to convert our categorical variables into numeric equivalents. We can use pandas get_dummies function for this.

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)
 
>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig and then we need to add these new columns onto the original DataFrame. We can use pandas concat function to do this.

import numpy as np
 
data = pd.concat([
        sub_train[cols],
        pd.get_dummies(sub_train[cols].LotShape),
        pd.get_dummies(sub_train[cols].LotConfig)
    ], axis=1).select_dtypes(include=[np.number])
 
>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
 
X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage
 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
 
lr = linear_model.LinearRegression()
 
model = lr.fit(X_train, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 
 -0.84137438493

Hmm that didn’t work too well – an R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score
 
>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))
0.0

whereas if we had all of the values correct we’d get a score of 1:

>>> print(r2_score(y_test, y_test))
1.0

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of LotArea so perhaps that’s the best we can do here.
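
For what it’s worth, a minimal sketch of that square root idea – the exact form of the fit is my assumption, not Poonam’s code:

import numpy as np
from sklearn import linear_model

# Fit LotFrontage against sqrt(LotArea) on the rows where it's present...
known = train[train.LotFrontage.notnull()]
lot_model = linear_model.LinearRegression()
lot_model.fit(np.sqrt(known[["LotArea"]]), known.LotFrontage)

# ...and use that fit to fill in the missing values
missing = train.LotFrontage.isnull()
train.loc[missing, "LotFrontage"] = lot_model.predict(np.sqrt(train.loc[missing, ["LotArea"]]))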

Written by Mark Needham

June 4th, 2017 at 9:22 am

Posted in Data Science,Python


GraphQL-Europe: A trip to Berlin


Last weekend my colleagues Will, Michael, Oskar, and I went to Berlin to spend Sunday at the GraphQL Europe conference.


Neo4j sponsored the conference as we’ve been experimenting with building a GraphQL to Neo4j integration and wanted to get some feedback from the community as well as learn what’s going on in GraphQL land.

Will and Michael have written about their experience where they talk more about the hackathon we hosted so I’ll cover it more from a personal perspective.

The first thing that stood out for me was how busy it was – I knew GraphQL was pretty hipster but I wasn’t expecting there to be ~ 300 attendees.

The venue was amazing – the nHow Hotel is located right next to the Spree River so there were great views to be had during the breaks. It also helped that it was really sunny for the whole day!


I spent most of the day hanging out at the Neo4j booth, which was good fun – several people pointed out that an integration between Neo4j and GraphQL made a lot of sense, given that GraphQL talks about the application graph and Neo4j deals with graphs in general.

I managed to attend a few of the talks, including one by Brooks Swinnerton from GitHub who announced that they’d be moving to GraphQL for v4 of their API.

The most interesting part of the talk for me was when Brooks said they’d been directing requests for their REST API to the GraphQL one behind the scenes for a while to check that it could handle the load.

GitHub is moving to GraphQL for v4 of our API because it offers significantly more flexibility for our integrators. The ability to define precisely the data you want—and only the data you want—is a powerful advantage over the REST API v3 endpoints.

I think Twitter may be doing something similar, based on a tweet by Tom Ashworth.

From what I could tell, the early uptake of GraphQL seems to be on the front end of applications – several of the attendees had been at ReactEurope a couple of days earlier – but microservices were mentioned in a few of the talks and it was suggested that GraphQL works well in that world too.

It was a fun day out so thanks to the folks at Graphcool for organising!

Written by Mark Needham

May 27th, 2017 at 11:31 am

Posted in Conferences


PostgreSQL: ERROR: argument of WHERE must not return a set


In my last post I showed how to load and query data from the Strava API in PostgreSQL, and after executing some simple queries my next task was to query a more complex part of the JSON structure.


Strava allows users to create segments – portions of road or trail where athletes can compete for time.

I wanted to write a query to find all the times that I’d run a particular segment. e.g. the Akerman Road segment covers a road running North to South in Kennington/Stockwell in South London.

This segment has the id ‘6818475’ so we’ll need to look inside segment_efforts and then compare the value segment.id against this id.

I initially wrote this query to try and find the times I’d run this segment:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs
WHERE jsonb_array_elements(data->'segment_efforts')->'segment'->>'id' = '6818475'
 
ERROR:  argument of WHERE must not return a set
LINE 3: WHERE jsonb_array_elements(data->'segment_efforts')->'segmen...

This doesn’t work because jsonb_array_elements is a set-returning function, so the WHERE expression ends up returning a set of boolean values rather than a single one, as Craig Ringer points out on Stack Overflow.

Instead we can use a LATERAL subquery to achieve our goal:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs r,
LATERAL jsonb_array_elements(r.data->'segment_efforts') segment
WHERE segment ->'segment'->>'id' = '6818475'
 
    id     |       startdate        | averagespeed 
-----------+------------------------+--------------
 455461182 | "2015-12-24T11:20:26Z" | 2.841
 440088621 | "2015-11-27T06:10:42Z" | 2.975
 407930503 | "2015-10-07T05:18:34Z" | 2.985
 317170464 | "2015-06-03T04:44:59Z" | 2.842
 312629236 | "2015-05-27T04:46:33Z" | 2.857
 277786711 | "2015-04-02T05:25:59Z" | 2.408
 226351235 | "2014-12-05T07:59:15Z" | 2.803
 225073326 | "2014-12-01T06:15:21Z" | 2.929
 224287690 | "2014-11-29T09:02:46Z" | 3.087
 223964715 | "2014-11-28T06:18:29Z" | 2.844
(10 rows)

Perfect!

Written by Mark Needham

May 1st, 2017 at 8:42 pm

Posted in PostgreSQL


Loading and analysing Strava runs using PostgreSQL JSON data type


In my last post I showed how to map Strava runs using data that I’d extracted from their /activities API, but the API returns a lot of other data that I discarded because I wasn’t sure what I should keep.

The API returns a nested JSON structure, so the easiest solution would have been to save each run as an individual file, but I’ve always wanted to try out PostgreSQL’s JSON data type and this seemed like a good opportunity.

Creating a JSON ready PostgreSQL table

First up we need to create a database in which we’ll store our Strava data. Let’s name it appropriately:

CREATE DATABASE strava;
\connect strava

Now we can create a table with one field that uses the JSON data type:

CREATE TABLE runs (
  id INTEGER NOT NULL,
  data jsonb
);
 
ALTER TABLE runs ADD PRIMARY KEY(id);

Easy enough. Now we’re ready to populate the table.

Importing data from the Strava API

We can partially reuse the script from the last post except rather than saving to CSV file we’ll save to PostgreSQL using the psycopg2 library.


The script relies on a TOKEN environment variable. If you want to try this on your own Strava account you’ll need to create an application, which will give you a key.

extract-runs.py

import requests
import os
import json
import psycopg2
 
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}
 
with psycopg2.connect("dbname=strava user=markneedham") as conn:
    with conn.cursor() as cur:
        page = 1
        while True:
            r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers = headers)
            response = r.json()
 
            if len(response) == 0:
                break
            else:
                for activity in response:
                    r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                    json_response = r.json()
                    cur.execute("INSERT INTO runs (id, data) VALUES(%s, %s)", (activity["id"], json.dumps(json_response)))
                    conn.commit()
                page += 1

Querying Strava

We can now write some queries against our newly imported data.

My quickest runs

SELECT id, data->>'start_date' AS start_date, 
       (data->>'average_speed')::FLOAT AS speed 
FROM runs 
ORDER BY speed DESC 
LIMIT 5
 
    id     |      start_date      | speed 
-----------+----------------------+-------
 649253963 | 2016-07-22T05:18:37Z | 3.736
 914796614 | 2017-03-26T08:37:56Z | 3.614
 653703601 | 2016-07-26T05:25:07Z | 3.606
 548540883 | 2016-04-17T18:18:05Z | 3.604
 665006485 | 2016-08-05T04:11:21Z | 3.604
(5 rows)

My longest runs

SELECT id, data->>'start_date' AS start_date, 
       (data->>'distance')::FLOAT AS distance
FROM runs
ORDER BY distance DESC
LIMIT 5
 
    id     |      start_date      | distance 
-----------+----------------------+----------
 840246999 | 2017-01-22T10:20:33Z |  10764.1
 461124609 | 2016-01-02T08:42:47Z |  10457.9
 467634177 | 2016-01-10T18:48:47Z |  10434.5
 471467618 | 2016-01-16T12:33:28Z |  10359.3
 540811705 | 2016-04-10T07:26:55Z |   9651.6
(5 rows)

Runs this year

SELECT COUNT(*)
FROM runs
WHERE data->>'start_date' >= '2017-01-01 00:00:00'
 
 count 
-------
    62
(1 row)

Runs per year

SELECT EXTRACT(YEAR FROM to_date(data->>'start_date', 'YYYY-mm-dd')) AS YEAR, 
       COUNT(*) 
FROM runs 
GROUP BY YEAR 
ORDER BY YEAR
 
 year | count 
------+-------
 2014 |    18
 2015 |   139
 2016 |   166
 2017 |    62
(4 rows)

That’s all for now. Next I’m going to learn how to query segments, which are stored inside a nested array inside the JSON document. Stay tuned for that in a future post.

Written by Mark Needham

May 1st, 2017 at 7:11 pm

Posted in PostgreSQL


Leaflet: Mapping Strava runs/polylines on Open Street Map


I’m a big Strava user and spent a bit of time last weekend playing around with their API to work out how to map all my runs.


Strava API and polylines

This is a two step process:

  1. Call the /athlete/activities/ endpoint to get a list of all my activities
  2. For each of those activities, call the /activities/[activityId] endpoint to get more detailed information about it

That second endpoint returns a ‘polyline’ property, which the documentation describes as follows:

Activity and segment API requests may include summary polylines of their respective routes. The values are string encodings of the latitude and longitude points using the Google encoded polyline algorithm format.

If we navigate to that page we get the following explanation:

Polyline encoding is a lossy compression algorithm that allows you to store a series of coordinates as a single string.

I tried out a couple of my polylines using the interactive polyline encoder utility which worked well once I realised that I needed to escape backslashes (“\”) in the polyline before pasting it into the tool.
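
As an aside, you can also decode these strings locally – a minimal sketch using the polyline package from PyPI, which isn’t used anywhere in this post:

import polyline  # pip install polyline

# Decode an encoded polyline into a list of (latitude, longitude) pairs,
# using the example string from Google's documentation
points = polyline.decode("_p~iF~ps|U_ulLnnqC_mqNvxq`@")
print(points)  # [(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]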

Now that I’d figured out how to map one run it was time to automate the process.

Leaflet and OpenStreetMap

I’ve previously had a good experience using Leaflet so I was keen to use that and luckily came across a Stack Overflow answer showing how to do what I wanted.

I created an HTML file and manually pasted in a couple of my runs (not forgetting to escape those backslashes!) to check that they worked:

blog.html

<html>
  <head>
    <title>Mapping my runs</title>
  </head>
 
  <body>
    <script src="http://cdn.leafletjs.com/leaflet-0.7/leaflet.js"></script>
    <script type="text/javascript" src="https://rawgit.com/jieter/Leaflet.encoded/master/Polyline.encoded.js"></script>
    <link rel="stylesheet" href="http://cdn.leafletjs.com/leaflet-0.7/leaflet.css" />
    <div id="map" style="width: 100%; height: 100%"></div>
 
    <script>
    var map = L.map('map').setView([55.609818, 13.003286], 13);
    L.tileLayer(
        'http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
            maxZoom: 18,
        }).addTo(map);
 
    var encodedRoutes = [
      "{zkrIm`inANPD?BDXGPKLATHNRBRFtAR~AFjAHl@D|ALtATj@HHJBL?`@EZ?NQ\\Y^MZURGJKR]RMXYh@QdAWf@[~@aAFGb@?j@YJKBU@m@FKZ[NSPKTCRJD?`@Wf@Wb@g@HCp@Qh@]z@SRMRE^EHJZnDHbBGPHb@NfBTxBN|DVbCBdA^lBFl@Lz@HbBDl@Lr@Bb@ApCAp@Ez@g@bEMl@g@`B_AvAq@l@    QF]Rs@Nq@CmAVKCK?_@Nw@h@UJIHOZa@xA]~@UfASn@U`@_@~@[d@Sn@s@rAs@dAGN?NVhAB\\Ox@@b@S|A?Tl@jBZpAt@vBJhATfGJn@b@fARp@H^Hx@ARGNSTIFWHe@AGBOTAP@^\\zBMpACjEWlEIrCKl@i@nAk@}@}@yBOWSg@kAgBUk@Mu@[mC?QLIEUAuAS_E?uCKyCA{BH{DDgF`AaEr@uAb@oA~@{AE}AKw@    g@qAU[_@w@[gAYm@]qAEa@FOXg@JGJ@j@o@bAy@NW?Qe@oCCc@SaBEOIIEQGaAe@kC_@{De@cE?KD[H[P]NcAJ_@DGd@Gh@UHI@Ua@}Bg@yBa@uDSo@i@UIICQUkCi@sCKe@]aAa@oBG{@G[CMOIKMQe@IIM@KB]Tg@Nw@^QL]NMPMn@@\\Lb@P~@XT",
      "u}krIq_inA_@y@My@Yu@OqAUsA]mAQc@CS@o@FSHSp@e@n@Wl@]ZCFEBK?OC_@Qw@?m@CSK[]]EMBeAA_@m@qEAg@UoCAaAMs@IkBMoACq@SwAGOYa@IYIyA_@kEMkC]{DEaAScC@yEHkGA_ALsCBiA@mCD{CCuAZcANOH@HDZl@Z`@RFh@\\TDT@ZVJBPMVGLM\\Mz@c@NCPMXERO|@a@^Ut@s@p@KJAJ    Bd@EHEXi@f@a@\\g@b@[HUD_B@uADg@DQLCLD~@l@`@J^TF?JANQ\\UbAyABEZIFG`@o@RAJEl@_@ZENDDIA[Ki@BURQZaARODKVs@LSdAiAz@G`BU^A^GT@PRp@zARXRn@`BlDHt@ZlAFh@^`BX|@HHHEf@i@FAHHp@bBd@v@DRAVMl@i@v@SROXm@tBILOTOLs@NON_@t@KX]h@Un@k@\\c@h@Ud@]ZGNKp@Sj@KJo@    b@W`@UPOX]XWd@UF]b@WPOAIBSf@QVi@j@_@V[b@Uj@YtAEFCCELARBn@`@lBjAzD^vB^hB?LENURkAv@[Ze@Xg@Py@p@QHONMA[HGAWE_@Em@Hg@AMCG@QHq@Cm@M[Jy@?UJIA{@Ae@KI@GFKNIX[QGAcAT[JK?OVMFK@IAIUKAYJI?QKUCGFIZCXDtAHl@@p@LjBCZS^ERAn@Fj@Br@Hn@HzAHh@RfD?j@TnCTlA    NjANb@\\z@TtARr@P`AFnAGfBG`@CFE?"
  ]
 
    for (let encoded of encodedRoutes) {
      var coordinates = L.Polyline.fromEncoded(encoded).getLatLngs();
 
      L.polyline(
          coordinates,
          {
              color: 'blue',
              weight: 2,
              opacity: .7,
              lineJoin: 'round'
          }
      ).addTo(map);
    }
    </script>
  </body>
</html>

We can spin up a Python web server over that HTML file to see how it renders:

$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

And below we can see both runs plotted on the map.

[Screenshot: both runs plotted on the map]

Automating Strava API to Open Street Map

The final step is to automate the whole thing so that I can see all of my runs.

I wrote the following script to call the Strava API and save the polyline for every run to a CSV file:

import requests
import os
import sys
import csv
 
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}
 
with open("runs.csv", "w") as runs_file:
    writer = csv.writer(runs_file, delimiter=",")
    writer.writerow(["id", "polyline"])
 
    page = 1
    while True:
        r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers = headers)
        response = r.json()
 
        if len(response) == 0:
            break
        else:
            for activity in response:
                r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                polyline = r.json()["map"]["polyline"]
                writer.writerow([activity["id"], polyline])
            page += 1

I then wrote a simple script using Flask to parse the CSV file and send a JSON representation of my runs to a slightly modified version of the HTML page described above:

from flask import Flask
from flask import render_template
import csv
import json
 
app = Flask(__name__)
 
@app.route('/')
def my_runs():
    runs = []
    with open("runs.csv", "r") as runs_file:
        reader = csv.DictReader(runs_file)
 
        for row in reader:
            runs.append(row["polyline"])
 
    return render_template("leaflet.html", runs = json.dumps(runs))
 
if __name__ == "__main__":
    app.run(port = 5001)

I changed the following line in the HTML file:

var encodedRoutes = {{ runs|safe }};

Now we can launch our Flask web server:

$ python app.py 
 * Running on http://127.0.0.1:5001/ (Press CTRL+C to quit)

And if we navigate to http://127.0.0.1:5001/ we can see all my runs that went near Westminster:

[Screenshot: all the runs that pass near Westminster rendered on the map]

The full code for all the files I’ve described in this post is available on GitHub. If you give it a try you’ll need to provide your Strava token in the ‘TOKEN’ environment variable before running extract_runs.py.

Hope this was helpful and if you have any questions ask me in the comments.

Written by Mark Needham

April 29th, 2017 at 3:36 pm

Posted in Javascript


Python: Flask – Generating a static HTML page


Whenever I need to quickly spin up a web application, Python’s Flask library is my go-to tool, but I recently found myself wanting to generate a static HTML page to upload to S3 and wondered if I could use it for that as well.

It’s actually not too tricky. If we’re in the scope of the app context then we have access to the template rendering that we’d normally use when serving the response to a web request.

The following code will generate an HTML file based on the template file templates/blog.html:

from flask import render_template
import flask
 
app = flask.Flask('my app')
 
if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html', \
            title = "My Generated Page", \
            people = [{"name": "Mark"}, {"name": "Michael"}])
        print(rendered)

templates/blog.html

<!doctype html>
<html>
  <head>
	<title>{{ title }}</title>
  </head>
  <body>
	<h1>{{ title }}</h1>
  <ul>
  {% for person in people %}
    <li>{{ person.name }}</li>
  {% endfor %}
  </ul>
  </body>
</html>

If we execute the Python script it will generate the following HTML:

$ python blog.py 
<!doctype html>
<html>
  <head>
	<title>My Generated Page</title>
  </head>
  <body>
	<h1>My Generated Page</h1>
  <ul>
 
    <li>Mark</li>
 
    <li>Michael</li>
 
  </ul>
 
  </body>
</html>


And we can finish off by redirecting that output into a file:

$ python blog.py  > blog.html

We could also write to the file from Python but this seems just as easy!
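
If you did want to do it from Python, it’s only a small change to the bottom of the script – a minimal sketch:

if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html', \
            title = "My Generated Page", \
            people = [{"name": "Mark"}, {"name": "Michael"}])
        # Write the rendered HTML straight to a file instead of printing it
        with open("blog.html", "w") as html_file:
            html_file.write(rendered)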

Written by Mark Needham

April 27th, 2017 at 8:59 pm

Posted in Python


AWS Lambda: Programmatically scheduling a CloudWatchEvent


I recently wrote a blog post showing how to create a Python ‘Hello World’ AWS lambda function and manually invoke it, but what I really wanted to do was have it run automatically every hour.

To achieve that in AWS Lambda land we need to create a CloudWatch Event. The documentation describes them as follows:

Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams.


This is actually really easy from the Amazon web console as you just need to click the ‘Triggers’ tab and then ‘Add trigger’. It’s not obvious that there are actually three steps involved, as they’re abstracted away from you.

So what are the steps?

  1. Create rule
  2. Give permission for that rule to execute
  3. Map the rule to the function

I forgot to do step 2) initially and ended up with a rule that never triggered, which isn’t particularly useful.

The following code creates a ‘Hello World’ lambda function and schedules it to run once an hour:

import boto3
 
lambda_client = boto3.client('lambda')
events_client = boto3.client('events')
 
fn_name = "HelloWorld"
fn_role = 'arn:aws:iam::[your-aws-id]:role/lambda_basic_execution'
 
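# Create the lambda function, uploading the code from a local HelloWorld.zip file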
fn_response = lambda_client.create_function(
    FunctionName=fn_name,
    Runtime='python2.7',
    Role=fn_role,
    Handler="{0}.lambda_handler".format(fn_name),
    Code={'ZipFile': open("{0}.zip".format(fn_name), 'rb').read(), },
)
 
fn_arn = fn_response['FunctionArn']
frequency = "rate(1 hour)"
name = "{0}-Trigger".format(fn_name)
 
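# Step 1) Create a rule that fires once an hour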
rule_response = events_client.put_rule(
    Name=name,
    ScheduleExpression=frequency,
    State='ENABLED',
)
 
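# Step 2) Give the rule permission to invoke the function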
lambda_client.add_permission(
    FunctionName=fn_name,
    StatementId="{0}-Event".format(name),
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_response['RuleArn'],
)
 
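# Step 3) Map the rule to the function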
events_client.put_targets(
    Rule=name,
    Targets=[
        {
            'Id': "1",
            'Arn': fn_arn,
        },
    ]
)

We can now check if our trigger has been configured correctly:

$ aws events list-rules --query "Rules[?Name=='HelloWorld-Trigger']"
[
    {
        "State": "ENABLED", 
        "ScheduleExpression": "rate(1 hour)", 
        "Name": "HelloWorld-Trigger", 
        "Arn": "arn:aws:events:us-east-1:[your-aws-id]:rule/HelloWorld-Trigger"
    }
]
 
$ aws events list-targets-by-rule --rule HelloWorld-Trigger
{
    "Targets": [
        {
            "Id": "1", 
            "Arn": "arn:aws:lambda:us-east-1:[your-aws-id]:function:HelloWorld"
        }
    ]
}
 
$ aws lambda get-policy --function-name HelloWorld
{
    "Policy": "{\"Version\":\"2012-10-17\",\"Id\":\"default\",\"Statement\":[{\"Sid\":\"HelloWorld-Trigger-Event\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"lambda:InvokeFunction\",\"Resource\":\"arn:aws:lambda:us-east-1:[your-aws-id]:function:HelloWorld\",\"Condition\":{\"ArnLike\":{\"AWS:SourceArn\":\"arn:aws:events:us-east-1:[your-aws-id]:rule/HelloWorld-Trigger\"}}}]}"
}

All looks good so we’re done!

Written by Mark Needham

April 5th, 2017 at 11:49 pm

Posted in Software Development

Tagged with , ,