Mark Needham

Thoughts on Software Development

scikit-learn: Random forests – Feature Importance

without comments

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:


import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# We'll use this library to make the display pretty
from tabulate import tabulate

Load Data

train = pd.read_csv('train.csv')
# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

Split train/test sets

# take the target from `data` so that it stays aligned with the rows in X
y = data.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Train model

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)

Feature Importance

headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.
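One rough way to act on this is to keep only the features whose importance clears some cut-off. The sketch below is plain Python; `features_above` and the 0.01 threshold are hypothetical, chosen only for illustration, and the scores are a handful copied from the table above.

```python
def features_above(importances, threshold):
    """Return the names of features whose importance is at least `threshold`."""
    return [name for name, score in importances if score >= threshold]


# A few (name, score) pairs copied from the table above
scores = [
    ("OverallQual", 0.553829),
    ("GrLivArea", 0.131),
    ("WoodDeckSF", 0.0108077),
    ("OpenPorchSF", 0.00945239),
    ("MiscVal", 0.000154799),
]

# 0.01 is an arbitrary cut-off, not a recommended value
kept = features_above(scores, threshold=0.01)
print(kept)
```

The resulting list of names could then be used to select columns before retraining, although whether that helps on the unseen data would still need checking.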

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them)
  • Remove the most predictive features and build a model that only uses the other features

Written by Mark Needham

June 16th, 2017 at 5:55 am

Kubernetes: Which node is a pod on?

without comments

When running Kubernetes on a cloud provider, rather than locally using minikube, it’s useful to know which node a pod is running on.

The normal command to list pods doesn’t contain this information:

$ kubectl get pod
NAME           READY     STATUS    RESTARTS   AGE       
neo4j-core-0   1/1       Running   0          6m        
neo4j-core-1   1/1       Running   0          6m        
neo4j-core-2   1/1       Running   0          2m

I spent a while searching for a command that I could use before I came across Ta-Ching Chen’s blog post while looking for something else.

Ta-Ching points out that we just need to add the flag -o wide to our original command to get the information we require:

$ kubectl get pod -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP           NODE
neo4j-core-0   1/1       Running   0          6m    gke-neo4j-cluster-default-pool-ded394fa-0kpw
neo4j-core-1   1/1       Running   0          6m    gke-neo4j-cluster-default-pool-ded394fa-0kpw
neo4j-core-2   1/1       Running   0          2m   gke-neo4j-cluster-default-pool-ded394fa-kp68
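
If the mapping is needed in a script rather than on screen, kubectl can also emit JSON with -o json, where each pod carries its node in spec.nodeName. The following is a minimal sketch that parses that output; `pods_to_nodes` is a hypothetical helper and the sample document is a trimmed-down stand-in, not real cluster output.

```python
import json


def pods_to_nodes(kubectl_json):
    """Map each pod name to the node it is scheduled on."""
    pods = json.loads(kubectl_json)["items"]
    return {pod["metadata"]["name"]: pod["spec"].get("nodeName") for pod in pods}


# Trimmed-down stand-in for the output of `kubectl get pod -o json`
sample = json.dumps({
    "items": [
        {"metadata": {"name": "neo4j-core-0"},
         "spec": {"nodeName": "gke-neo4j-cluster-default-pool-ded394fa-0kpw"}},
        {"metadata": {"name": "neo4j-core-2"},
         "spec": {"nodeName": "gke-neo4j-cluster-default-pool-ded394fa-kp68"}},
    ]
})

print(pods_to_nodes(sample))
```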


Written by Mark Needham

June 14th, 2017 at 8:49 am

Posted in Kubernetes


Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values

without comments

I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

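import pandas as pd
train = pd.read_csv('train.csv')
# columns that contain at least one missing value
null_columns = train.columns[train.isnull().any()]
>>> print(train[null_columns].isnull().sum())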
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes ‘Linear feet of street connected to property’. There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]
>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn’s linear regression model, which only works with numeric values, so we need to convert our categorical variables into numeric equivalents. We can use the pandas get_dummies function for this.
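
To make the idea concrete, here is a plain-Python sketch of what this kind of one-hot encoding does: every distinct category becomes its own 0/1 column. `one_hot` is a hypothetical helper written for illustration and is not how pandas implements it.

```python
def one_hot(values):
    """One-hot encode a list of category labels as a list of {category: 0/1} rows."""
    categories = sorted(set(values))
    return [{cat: int(value == cat) for cat in categories} for value in values]


# Made-up sample of LotShape values
lot_shapes = ["Reg", "Reg", "IR1", "IR2"]
for row in one_hot(lot_shapes):
    print(row)
```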

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)
>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig and then we need to add these new columns onto the original DataFrame. We can use pandas concat function to do this.

import numpy as np
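data = pd.concat([
        sub_train[cols],
        # dtype=int keeps the dummy columns numeric on newer pandas versions
        pd.get_dummies(sub_train[cols].LotShape, dtype=int),
        pd.get_dummies(sub_train[cols].LotConfig, dtype=int)
    ], axis=1).select_dtypes(include=[np.number])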
>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 

Hmm that didn’t work too well – an R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score
>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))

whereas if we had all of the values correct we’d get a score of 1:

>>> print(r2_score(y_test, y_test))
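
Both reference points follow from the definition of R^2, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below re-implements that definition in plain Python on made-up values; `r2` is a hypothetical helper, not scikit-learn's implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y = [60.0, 80.0, 70.0]  # made-up LotFrontage values
mean_prediction = [sum(y) / len(y)] * len(y)

print(r2(y, mean_prediction))     # always predicting the mean
print(r2(y, y))                   # perfect predictions
print(r2(y, [80.0, 60.0, 70.0]))  # predictions that are worse than the mean
```

Predicting the mean makes the two sums equal, so the score is 0; perfect predictions make the residual sum 0, so the score is 1; anything that does worse than the mean goes negative.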

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of LotArea so perhaps that’s the best we can do here.
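
As a rough illustration of that fallback, the sketch below fills a missing LotFrontage with the square root of LotArea. It assumes the substitution is exactly sqrt(LotArea), which may not match Poonam's code in detail, and `fill_frontage` is a hypothetical helper working on plain dictionaries rather than a DataFrame.

```python
import math


def fill_frontage(rows):
    """Fill a missing LotFrontage (None) with the square root of LotArea."""
    filled = []
    for row in rows:
        frontage = row["LotFrontage"]
        if frontage is None:
            frontage = math.sqrt(row["LotArea"])
        filled.append({**row, "LotFrontage": frontage})
    return filled


rows = [
    {"LotFrontage": None, "LotArea": 10382},  # row 7 in the output above
    {"LotFrontage": 65.0, "LotArea": 8450},   # row 0, already populated
]
for row in fill_frontage(rows):
    print(row)
```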

Written by Mark Needham

June 4th, 2017 at 9:22 am

Posted in Data Science,Python


GraphQL-Europe: A trip to Berlin

without comments

Last weekend my colleagues Will, Michael, Oskar, and I went to Berlin to spend Sunday at the GraphQL Europe conference.


Neo4j sponsored the conference as we’ve been experimenting with building a GraphQL to Neo4j integration and wanted to get some feedback from the community as well as learn what’s going on in GraphQL land.

Will and Michael have written about their experience where they talk more about the hackathon we hosted so I’ll cover it more from a personal perspective.

The first thing that stood out for me was how busy it was – I knew GraphQL was pretty hipster but I wasn’t expecting there to be ~ 300 attendees.

The venue was amazing – the nHow Hotel is located right next to the Spree River so there were great views to be had during the breaks. It also helped that it was really sunny for the whole day!


I spent most of the day hanging out at the Neo4j booth which was good fun – several people pointed out that an integration between Neo4j and GraphQL made a lot of sense given that GraphQL talks about the application graph and Neo4j graphs in general.

I managed to attend a few of the talks, including one by Brooks Swinnerton from GitHub who announced that they’d be moving to GraphQL for v4 of their API.

The most interesting part of the talk for me was when Brooks said they’d directed requests for their REST API to the GraphQL one behind the scenes for a while now to check that it could handle the load.

GitHub is moving to GraphQL for v4 of our API because it offers significantly more flexibility for our integrators. The ability to define precisely the data you want—and only the data you want—is a powerful advantage over the REST API v3 endpoints.

I think Twitter may be doing something similar, based on a tweet by Tom Ashworth.

From what I could tell the early pick up of GraphQL seems to be from the front end of applications – several of the attendees had attended ReactEurope a couple of days earlier – but micro services were mentioned in a few of the talks and it was suggested that GraphQL works well in this world as well.

It was a fun day out so thanks to the folks at Graphcool for organising!

Written by Mark Needham

May 27th, 2017 at 11:31 am

Posted in Conferences


PostgreSQL: ERROR: argument of WHERE must not return a set

without comments

In my last post I showed how to load and query data from the Strava API in PostgreSQL, and after executing some simple queries my next task was to query a more complex part of the JSON structure.


Strava allows users to create segments, which are edited portions of road or trail where athletes can compete for time.

I wanted to write a query to find all the times that I’d run a particular segment. e.g. the Akerman Road segment covers a road running North to South in Kennington/Stockwell in South London.

This segment has the id ‘6818475’ so we’ll need to look inside segment_efforts and then compare the value against this id.

I initially wrote this query to try and find the times I’d run this segment:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs
WHERE jsonb_array_elements(data->'segment_efforts')->'segment'->>'id' = '6818475'

ERROR:  argument of WHERE must not return a set
LINE 3: WHERE jsonb_array_elements(data->'segment_efforts')->'segmen...

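This doesn’t work because jsonb_array_elements returns a set of rows, one per array element, so the comparison in the WHERE clause yields a set of boolean values rather than a single one, as Craig Ringer points out on Stack Overflow.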

Instead we can use a LATERAL subquery to achieve our goal:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs r,
LATERAL jsonb_array_elements(r.data->'segment_efforts') segment
WHERE segment ->'segment'->>'id' = '6818475'
    id     |       startdate        | averagespeed 
 455461182 | "2015-12-24T11:20:26Z" | 2.841
 440088621 | "2015-11-27T06:10:42Z" | 2.975
 407930503 | "2015-10-07T05:18:34Z" | 2.985
 317170464 | "2015-06-03T04:44:59Z" | 2.842
 312629236 | "2015-05-27T04:46:33Z" | 2.857
 277786711 | "2015-04-02T05:25:59Z" | 2.408
 226351235 | "2014-12-05T07:59:15Z" | 2.803
 225073326 | "2014-12-01T06:15:21Z" | 2.929
 224287690 | "2014-11-29T09:02:46Z" | 3.087
 223964715 | "2014-11-28T06:18:29Z" | 2.844
(10 ROWS)
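
For anyone more at home in Python than SQL, the LATERAL join behaves roughly like a nested loop: for each run, iterate over the elements of its segment_efforts array and keep the ones that match. The sketch below is only an analogy on made-up data; `runs_on_segment` is a hypothetical helper and the two sample rows are invented.

```python
def runs_on_segment(runs, segment_id):
    """Return (run id, start date, average speed) for each effort on the segment."""
    return [
        (run["id"], run["data"]["start_date"], run["data"]["average_speed"])
        for run in runs
        for effort in run["data"]["segment_efforts"]  # one row per array element
        if str(effort["segment"]["id"]) == segment_id
    ]


# Made-up rows shaped like the runs table: an id plus a JSON document
runs = [
    {"id": 1, "data": {"start_date": "2015-12-24T11:20:26Z", "average_speed": 2.841,
                       "segment_efforts": [{"segment": {"id": 6818475}},
                                           {"segment": {"id": 123}}]}},
    {"id": 2, "data": {"start_date": "2015-11-27T06:10:42Z", "average_speed": 2.975,
                       "segment_efforts": [{"segment": {"id": 123}}]}},
]

print(runs_on_segment(runs, "6818475"))
```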


Written by Mark Needham

May 1st, 2017 at 8:42 pm

Posted in PostgreSQL


Loading and analysing Strava runs using PostgreSQL JSON data type

without comments

In my last post I showed how to map Strava runs using data that I’d extracted from their /activities API, but the API returns a lot of other data that I discarded because I wasn’t sure what I should keep.

The API returns a nested JSON structure so the easiest solution would be to save each run as an individual file but I’ve always wanted to try out PostgreSQL’s JSON data type and this seemed like a good opportunity.

Creating a JSON ready PostgreSQL table

First up we need to create a database in which we’ll store our Strava data. Let’s name it appropriately:

CREATE DATABASE strava;
\connect strava

Now we can create a table with one field with the JSON data type:

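-- the id column type is an assumption; the import script only shows that runs has id and data columns
CREATE TABLE runs (id BIGINT PRIMARY KEY, data jsonb);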

Easy enough. Now we’re ready to populate the table.

Importing Strava API

We can partially reuse the script from the last post, except that rather than saving to a CSV file we’ll save to PostgreSQL using the psycopg2 library.


The script relies on a TOKEN environment variable. If you want to try this on your own Strava account you’ll need to create an application, which will give you a key.

import requests
import os
import json
import psycopg2
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}
with psycopg2.connect("dbname=strava user=markneedham") as conn:
    with conn.cursor() as cur:
        page = 1
        while True:
            r = requests.get("{0}".format(page), headers = headers)
            response = r.json()
            if len(response) == 0:
                # an empty page means we've read every activity
                break
            else:
                for activity in response:
                    r = requests.get("{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                    json_response = r.json()
                    cur.execute("INSERT INTO runs (id, data) VALUES(%s, %s)", (activity["id"], json.dumps(json_response)))
                page += 1
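
The pagination pattern in that script can be looked at in isolation: keep asking for the next page until an empty one comes back. The sketch below uses a stand-in for the API so that it runs without a token; `fetch_all` and `fake_pages` are hypothetical names introduced for illustration.

```python
def fetch_all(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... until an empty page comes back."""
    items = []
    page = 1
    while True:
        response = fetch_page(page)
        if len(response) == 0:
            break
        items.extend(response)
        page += 1
    return items


# Stand-in for the API: two pages of activities, then nothing
fake_pages = {1: [{"id": 101}, {"id": 102}], 2: [{"id": 103}]}
activities = fetch_all(lambda page: fake_pages.get(page, []))
print([activity["id"] for activity in activities])
```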

Querying Strava

We can now write some queries against our newly imported data.

My quickest runs

SELECT id, data->>'start_date' AS start_date, 
       (data->>'average_speed')::FLOAT AS speed 
FROM runs 
ORDER BY speed DESC
LIMIT 5;
    id     |      start_date      | speed 
 649253963 | 2016-07-22T05:18:37Z | 3.736
 914796614 | 2017-03-26T08:37:56Z | 3.614
 653703601 | 2016-07-26T05:25:07Z | 3.606
 548540883 | 2016-04-17T18:18:05Z | 3.604
 665006485 | 2016-08-05T04:11:21Z | 3.604
(5 ROWS)

My longest runs

SELECT id, data->>'start_date' AS start_date, 
       (data->>'distance')::FLOAT AS distance
FROM runs
ORDER BY distance DESC
LIMIT 5;
    id     |      start_date      | distance 
 840246999 | 2017-01-22T10:20:33Z |  10764.1
 461124609 | 2016-01-02T08:42:47Z |  10457.9
 467634177 | 2016-01-10T18:48:47Z |  10434.5
 471467618 | 2016-01-16T12:33:28Z |  10359.3
 540811705 | 2016-04-10T07:26:55Z |   9651.6
(5 ROWS)

Runs this year

SELECT COUNT(*)
FROM runs
WHERE data->>'start_date' >= '2017-01-01 00:00:00'
(1 ROW)

Runs per year

SELECT EXTRACT(YEAR FROM to_date(data->>'start_date', 'YYYY-mm-dd')) AS year, 
       COUNT(*) 
FROM runs 
GROUP BY year 
ORDER BY year;
 2014 |    18
 2015 |   139
 2016 |   166
 2017 |    62
(4 ROWS)
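
The same grouping can be sketched in Python, which is a handy cross-check when the data is already in memory. This is only an illustration on a few start_date values copied from the results above; `runs_per_year` is a hypothetical helper and it assumes the dates are ISO 8601 strings that begin with a four-digit year.

```python
from collections import Counter


def runs_per_year(start_dates):
    """Count runs by the year prefix of their ISO 8601 start_date."""
    return Counter(int(start_date[:4]) for start_date in start_dates)


# A few start_date values taken from the query results above
dates = ["2016-07-22T05:18:37Z", "2017-03-26T08:37:56Z",
         "2016-07-26T05:25:07Z", "2017-01-22T10:20:33Z"]
print(sorted(runs_per_year(dates).items()))
```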

That’s all for now. Next I’m going to learn how to query segments, which are stored inside a nested array inside the JSON document. Stay tuned for that in a future post.

Written by Mark Needham

May 1st, 2017 at 7:11 pm

Posted in PostgreSQL


Leaflet: Mapping Strava runs/polylines on Open Street Map

without comments

I’m a big Strava user and spent a bit of time last weekend playing around with their API to work out how to map all my runs.


Strava API and polylines

This is a two step process:

  1. Call the /athlete/activities/ endpoint to get a list of all my activities
  2. For each of those activities call /activities/[activityId] endpoint to get more detailed information for each activity

That second API returns a ‘polyline’ property which the documentation describes as follows:

Activity and segment API requests may include summary polylines of their respective routes. The values are string encodings of the latitude and longitude points using the Google encoded polyline algorithm format.

If we navigate to that page we get the following explanation:

Polyline encoding is a lossy compression algorithm that allows you to store a series of coordinates as a single string.

I tried out a couple of my polylines using the interactive polyline encoder utility which worked well once I realised that I needed to escape backslashes (“\”) in the polyline before pasting it into the tool.
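
To get a feel for what is inside one of these strings, here is a minimal Python sketch of a decoder for the encoded polyline format. It is my own illustration rather than anything used by Strava or Leaflet; `decode_polyline` is a hypothetical name, and it assumes the usual precision of five decimal places.

```python
def decode_polyline(encoded, precision=5):
    """Decode an encoded polyline string into a list of (lat, lng) tuples."""
    coordinates = []
    index = lat = lng = 0
    factor = 10 ** precision
    while index < len(encoded):
        deltas = []
        for _ in range(2):  # each point is a latitude delta then a longitude delta
            result = shift = 0
            while True:
                byte = ord(encoded[index]) - 63
                index += 1
                result |= (byte & 0x1F) << shift  # low 5 bits carry data
                shift += 5
                if byte < 0x20:  # no continuation bit: last chunk of this value
                    break
            # the lowest bit stores the sign
            deltas.append(~(result >> 1) if result & 1 else result >> 1)
        lat += deltas[0]
        lng += deltas[1]
        coordinates.append((lat / factor, lng / factor))
    return coordinates


# Sample string from Google's description of the format
print(decode_polyline("_p~iF~ps|U_ulLnnqC_mqNvxq`@"))
```

Each point is stored as a difference from the previous one and rounded to a fixed number of decimal places, which is why the format is described as lossy.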

Now that I’d figured out how to map one run it was time to automate the process.

Leaflet and OpenStreetMap

I’ve previously had a good experience using Leaflet so I was keen to use that and luckily came across a Stack Overflow answer showing how to do what I wanted.

I created a HTML file and manually pasted in a couple of my runs (not forgetting to escape those backslashes!) to check that they worked:


    <title>Mapping my runs</title>
    <script src=""></script>
    <script type="text/javascript" src=""></script>
    <link rel="stylesheet" href="" />
    <div id="map" style="width: 100%; height: 100%"></div>
    var map = L.map('map').setView([55.609818, 13.003286], 13);
    L.tileLayer(
        'http://{s}{z}/{x}/{y}.png', {
            maxZoom: 18,
        }).addTo(map);
    var encodedRoutes = [
      "{zkrIm`inANPD?BDXGPKLATHNRBRFtAR~AFjAHl@D|ALtATj@HHJBL?`@EZ?NQ\\Y^MZURGJKR]RMXYh@QdAWf@[~@aAFGb@?j@YJKBU@m@FKZ[NSPKTCRJD?`@Wf@Wb@g@HCp@Qh@]z@SRMRE^EHJZnDHbBGPHb@NfBTxBN|DVbCBdA^lBFl@Lz@HbBDl@Lr@Bb@ApCAp@Ez@g@bEMl@g@`B_AvAq@l@    QF]Rs@Nq@CmAVKCK?_@Nw@h@UJIHOZa@xA]~@UfASn@U`@_@~@[d@Sn@s@rAs@dAGN?NVhAB\\Ox@@b@S|A?Tl@jBZpAt@vBJhATfGJn@b@fARp@H^Hx@ARGNSTIFWHe@AGBOTAP@^\\zBMpACjEWlEIrCKl@i@nAk@}@}@yBOWSg@kAgBUk@Mu@[mC?QLIEUAuAS_E?uCKyCA{BH{DDgF`AaEr@uAb@oA~@{AE}AKw@    g@qAU[_@w@[gAYm@]qAEa@FOXg@JGJ@j@o@bAy@NW?Qe@oCCc@SaBEOIIEQGaAe@kC_@{De@cE?KD[H[P]NcAJ_@DGd@Gh@UHI@Ua@}Bg@yBa@uDSo@i@UIICQUkCi@sCKe@]aAa@oBG{@G[CMOIKMQe@IIM@KB]Tg@Nw@^QL]NMPMn@@\\Lb@P~@XT",
    ];
    for (let encoded of encodedRoutes) {
      var coordinates = L.Polyline.fromEncoded(encoded).getLatLngs();
      // draw each decoded route as a blue line on the map
      L.polyline(coordinates, {
              color: 'blue',
              weight: 2,
              opacity: .7,
              lineJoin: 'round'
      }).addTo(map);
    }

We can spin up a Python web server over that HTML file to see how it renders:

$ python -m http.server
Serving HTTP on port 8000 ( ...

And below we can see both runs plotted on the map.


Automating Strava API to Open Street Map

The final step is to automate the whole thing so that I can see all of my runs.

I wrote the following script to call the Strava API and save the polyline for every run to a CSV file:

import requests
import os
import sys
import csv
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}
with open("runs.csv", "w") as runs_file:
    writer = csv.writer(runs_file, delimiter=",")
    writer.writerow(["id", "polyline"])
    page = 1
    while True:
        r = requests.get("{0}".format(page), headers = headers)
        response = r.json()
        if len(response) == 0:
            # an empty page means we've read every activity
            break
        else:
            for activity in response:
                r = requests.get("{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                polyline = r.json()["map"]["polyline"]
                writer.writerow([activity["id"], polyline])
            page += 1

I then wrote a simple script using Flask to parse the CSV files and send a JSON representation of my runs to a slightly modified version of the HTML page that I described above:

from flask import Flask
from flask import render_template
import csv
import json
app = Flask(__name__)
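@app.route("/")  # the route path is an assumption; the decorator is missing here
def my_runs():
    runs = []
    with open("runs.csv", "r") as runs_file:
        reader = csv.DictReader(runs_file)
        for row in reader:
            # collect the encoded polyline for each run
            runs.append(row["polyline"])
    return render_template("leaflet.html", runs = json.dumps(runs))

if __name__ == "__main__":
    app.run(port = 5001)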

I changed the following line in the HTML file:

var encodedRoutes = {{ runs|safe }};

Now we can launch our Flask web server:

$ python 
 * Running on (Press CTRL+C to quit)

And if we navigate to that address in a browser we can see all my runs that went near Westminster:


The full code for all the files I’ve described in this post is available on GitHub. If you give it a try you’ll need to provide your Strava token in the ‘TOKEN’ environment variable before running the scripts.

Hope this was helpful and if you have any questions ask me in the comments.

Written by Mark Needham

April 29th, 2017 at 3:36 pm

Posted in Javascript


Python: Flask – Generating a static HTML page

without comments

Whenever I need to quickly spin up a web application, Python’s Flask library is my go-to tool, but I recently found myself wanting to generate a static HTML page to upload to S3 and wondered if I could use it for that as well.

It’s actually not too tricky. If we’re in the scope of the app context then we have access to the template rendering that we’d normally use when serving the response to a web request.

The following code will generate a HTML file based on a template file templates/blog.html:

from flask import render_template
import flask
app = flask.Flask('my app')
if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html', \
            title = "My Generated Page", \
            people = [{"name": "Mark"}, {"name": "Michael"}])
        # write the rendered page to stdout
        print(rendered)


<!doctype html>
	<title>{{ title }}</title>
	<h1>{{ title }}</h1>
  {% for person in people %}
    <li>{{ person.name }}</li>
  {% endfor %}

If we execute the Python script it will generate the following HTML:

$ python 
<!doctype html>
	<title>My Generated Page</title>
	<h1>My Generated Page</h1>

And we can finish off by redirecting that output into a file:

$ python  > blog.html

We could also write to the file from Python but this seems just as easy!
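
For completeness, writing the file from Python might look something like the sketch below. `rendered` here is a stand-in string so that the example runs on its own; in the script above it would be the value returned by render_template. The temporary directory is also just for the example.

```python
import tempfile
from pathlib import Path

# Stand-in for the string returned by render_template
rendered = "<!doctype html>\n<title>My Generated Page</title>\n"

# Write to a temporary directory so the example leaves the working directory alone
output_path = Path(tempfile.mkdtemp()) / "blog.html"
output_path.write_text(rendered, encoding="utf-8")

print(output_path.read_text(encoding="utf-8"))
```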

Written by Mark Needham

April 27th, 2017 at 8:59 pm

Posted in Python


AWS Lambda: Programmatically scheduling a CloudWatchEvent

with one comment

I recently wrote a blog post showing how to create a Python ‘Hello World’ AWS lambda function and manually invoke it, but what I really wanted to do was have it run automatically every hour.

To achieve that in AWS Lambda land we need to create a CloudWatch Event. The documentation describes them as follows:

Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams.


This is actually really easy from the Amazon web console as you just need to click the ‘Triggers’ tab and then ‘Add trigger’. It’s not obvious that there are actually three steps involved, as they’re abstracted from you.

So what are the steps?

  1. Create rule
  2. Give permission for that rule to execute
  3. Map the rule to the function

I forgot to do step 2) initially and then you just end up with a rule that never triggers, which isn’t particularly useful.

The following code creates a ‘Hello World’ lambda function and runs it once an hour:

import boto3
lambda_client = boto3.client('lambda')
events_client = boto3.client('events')
fn_name = "HelloWorld"
fn_role = 'arn:aws:iam::[your-aws-id]:role/lambda_basic_execution'
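# Runtime and Handler are assumptions (their values are missing here);
# use a runtime that AWS Lambda currently supports
fn_response = lambda_client.create_function(
    FunctionName=fn_name, Runtime='python2.7', Role=fn_role,
    Handler="{0}.lambda_handler".format(fn_name),
    Code={'ZipFile': open("{0}.zip".format(fn_name), 'rb').read(), },
)
fn_arn = fn_response['FunctionArn']
frequency = "rate(1 hour)"
name = "{0}-Trigger".format(fn_name)
# Step 1: create the rule
rule_response = events_client.put_rule(Name=name, ScheduleExpression=frequency, State='ENABLED')
# Step 2: give the rule permission to invoke the function
lambda_client.add_permission(
    FunctionName=fn_name, StatementId="{0}-Event".format(name),
    Action='lambda:InvokeFunction', Principal='events.amazonaws.com',
    SourceArn=rule_response['RuleArn'],
)
# Step 3: map the rule to the function
events_client.put_targets(Rule=name, Targets=[{'Id': "1", 'Arn': fn_arn}])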

We can now check if our trigger has been configured correctly:

$ aws events list-rules --query "Rules[?Name=='HelloWorld-Trigger']"
        "State": "ENABLED", 
        "ScheduleExpression": "rate(1 hour)", 
        "Name": "HelloWorld-Trigger", 
        "Arn": "arn:aws:events:us-east-1:[your-aws-id]:rule/HelloWorld-Trigger"
$ aws events list-targets-by-rule --rule HelloWorld-Trigger
    "Targets": [
            "Id": "1", 
            "Arn": "arn:aws:lambda:us-east-1:[your-aws-id]:function:HelloWorld"
$ aws lambda get-policy --function-name HelloWorld
    "Policy": "{\"Version\":\"2012-10-17\",\"Id\":\"default\",\"Statement\":[{\"Sid\":\"HelloWorld-Trigger-Event\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"\"},\"Action\":\"lambda:InvokeFunction\",\"Resource\":\"arn:aws:lambda:us-east-1:[your-aws-id]:function:HelloWorld\",\"Condition\":{\"ArnLike\":{\"AWS:SourceArn\":\"arn:aws:events:us-east-1:[your-aws-id]:rule/HelloWorld-Trigger\"}}}]}"

All looks good so we’re done!
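
Since the permission step is the one I forgot, it may be worth scripting the policy check as well. The sketch below inspects a get-policy style response for a statement that lets CloudWatch Events invoke the function for a given rule. `rule_can_invoke` is a hypothetical helper, and the sample response uses a made-up account id.

```python
import json


def rule_can_invoke(policy_response, rule_arn):
    """Check whether the policy lets events.amazonaws.com invoke the function for this rule."""
    policy = json.loads(policy_response["Policy"])  # the policy is itself a JSON string
    for statement in policy["Statement"]:
        principal = statement.get("Principal")
        service = principal.get("Service") if isinstance(principal, dict) else None
        source_arn = statement.get("Condition", {}).get("ArnLike", {}).get("AWS:SourceArn")
        if (statement.get("Effect") == "Allow"
                and statement.get("Action") == "lambda:InvokeFunction"
                and service == "events.amazonaws.com"
                and source_arn == rule_arn):
            return True
    return False


# Made-up account id; the structure mirrors the get-policy output above
rule_arn = "arn:aws:events:us-east-1:123456789012:rule/HelloWorld-Trigger"
sample_response = {"Policy": json.dumps({
    "Version": "2012-10-17",
    "Id": "default",
    "Statement": [{
        "Sid": "HelloWorld-Trigger-Event",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloWorld",
        "Condition": {"ArnLike": {"AWS:SourceArn": rule_arn}},
    }],
})}

print(rule_can_invoke(sample_response, rule_arn))
```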

Written by Mark Needham

April 5th, 2017 at 11:49 pm

Posted in Software Development


AWS Lambda: Encrypted environment variables

with 3 comments

Continuing on from my post showing how to create a ‘Hello World’ AWS lambda function I wanted to pass encrypted environment variables to my function.

The following function takes in both an encrypted and unencrypted variable and prints them out.

Don’t print out encrypted variables in a real function, this is just so we can see the example working!

import boto3
import os
from base64 import b64decode
def lambda_handler(event, context):
    encrypted = os.environ['ENCRYPTED_VALUE']
    decrypted = boto3.client('kms').decrypt(CiphertextBlob=b64decode(encrypted))['Plaintext']
    # Don't print out your decrypted value in a real function! This is just to show how it works.
    print("Decrypted value:", decrypted)
    plain_text = os.environ["PLAIN_TEXT_VALUE"]
    print("Plain text:", plain_text)

Now we’ll zip up our function into HelloWorldEncrypted.zip, ready to send to AWS.


Now it’s time to upload our function to AWS and create the associated environment variables.

If you’re using a Python editor then you’ll need to install boto3 locally to keep the editor happy but you don’t need to include boto3 in the code you send to AWS Lambda – it comes pre-installed.

Now we write the following code to automate the creation of our Lambda function:

import boto3
from base64 import b64encode
fn_name = "HelloWorldEncrypted"
kms_key = "arn:aws:kms:[aws-zone]:[your-aws-id]:key/[your-kms-key-id]"
fn_role = 'arn:aws:iam::[your-aws-id]:role/lambda_basic_execution'
lambda_client = boto3.client('lambda')
kms_client = boto3.client('kms')
encrypt_me = "abcdefg"
encrypted = b64encode(kms_client.encrypt(Plaintext=encrypt_me, KeyId=kms_key)["CiphertextBlob"])
plain_text = 'hijklmno'
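# Runtime and Handler are assumptions (their values are missing here);
# use a runtime that AWS Lambda currently supports
lambda_client.create_function(
        FunctionName=fn_name, Runtime='python2.7', Role=fn_role,
        Handler="{0}.lambda_handler".format(fn_name),
        Code={ 'ZipFile': open("{0}.zip".format(fn_name), 'rb').read(),},
        Environment={
            'Variables': {
                'ENCRYPTED_VALUE': encrypted,
                'PLAIN_TEXT_VALUE': plain_text,
            }
        },
)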

The tricky bit for me here was figuring out that I needed to base64 encode the output of the value encrypted by the KMS client before passing it in as an environment variable. The KMS client relies on a KMS key that we need to set up. We can see a list of all our KMS keys by running the following command:

$ aws kms list-keys

The format of these keys is arn:aws:kms:[zone]:[account-id]:key/[key-id].
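
On the base64 step mentioned above: the ciphertext that KMS returns is arbitrary binary data, whereas environment variables hold text, so the bytes need a text-safe encoding on the way in and decoding on the way out. A minimal sketch of that round trip, using made-up bytes in place of a real CiphertextBlob:

```python
from base64 import b64decode, b64encode

# Stand-in for the bytes returned in CiphertextBlob; real ciphertext is arbitrary binary
ciphertext_blob = bytes([0, 255, 16, 128, 10])

# Environment variables hold text, so encode the bytes before storing them
env_value = b64encode(ciphertext_blob).decode("ascii")
print(env_value)

# Inside the Lambda function the value is decoded back to the original bytes
round_tripped = b64decode(env_value)
print(round_tripped == ciphertext_blob)
```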

Now let’s run our script to create the Lambda function:

$ python

Let’s check it got created:

$ aws lambda list-functions --query "Functions[*].FunctionName"

And now let’s execute the function:

$ aws lambda invoke --function-name HelloWorldEncrypted --invocation-type RequestResponse --log-type Tail /tmp/out | jq ".LogResult"

That’s a bit hard to read, some decoding is needed:

START RequestId: 9bce3a50-1830-11e7-b1e6-af41d63361d9 Version: $LATEST
('Decrypted value:', 'abcdefg')
('Plain text:', 'hijklmno')
END RequestId: 9bce3a50-1830-11e7-b1e6-af41d63361d9
REPORT RequestId: 9bce3a50-1830-11e7-b1e6-af41d63361d9	Duration: 360.04 ms	Billed Duration: 400 ms 	Memory Size: 128 MB	Max Memory Used: 24 MB
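
The decoding step can also be done in Python, since the LogResult field of the invoke response is base64 encoded. The sketch below builds a stand-in response locally so that it runs without AWS access; `decode_log_result` is a hypothetical helper.

```python
import json
from base64 import b64decode, b64encode


def decode_log_result(invoke_response_json):
    """Return the log tail from an invoke response as readable text."""
    response = json.loads(invoke_response_json)
    return b64decode(response["LogResult"]).decode("utf-8")


# Stand-in response built locally; a real one comes from `aws lambda invoke --log-type Tail`
log_tail = "('Decrypted value:', 'abcdefg')\n('Plain text:', 'hijklmno')\n"
sample_response = json.dumps(
    {"LogResult": b64encode(log_tail.encode("utf-8")).decode("ascii")}
)

print(decode_log_result(sample_response))
```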

And it worked, hoorah!

Written by Mark Needham

April 3rd, 2017 at 5:49 am

Posted in Software Development
