Mark Needham

Thoughts on Software Development

Neo4j: Cypher – Rounding of floating point numbers/BigDecimals

I was doing some data cleaning a few days ago and wanted to multiply a value by 1 million. My Cypher code to do this looked like this:

with "8.37" as rawNumeric 
RETURN toFloat(rawNumeric) * 1000000 AS numeric
 
╒═════════════════╕
│"numeric"        │
╞═════════════════╡
│8369999.999999999│
└─────────────────┘

Unfortunately that suffers from the classic rounding error you get when working with floating point numbers. I couldn’t figure out a way to solve it using pure Cypher, but there tends to be an APOC function for every problem, and this was no exception.
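As a sanity check, the same behaviour is easy to reproduce in Python, whose floats are IEEE 754 doubles just like Neo4j’s. The decimal module shows the exact arithmetic we’re after:

from decimal import Decimal

print(8.37 * 1000000)             # 8369999.999999999, the same rounding error
print(Decimal("8.37") * 1000000)  # 8370000.00, exact decimal arithmetic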

I’m using Neo4j 3.2.3 so I downloaded the corresponding APOC jar and put it in a plugins directory:

$ ls -lh plugins/
total 3664
-rw-r--r--@ 1 markneedham  staff   1.8M  9 Aug 09:14 apoc-3.2.0.4-all.jar

I’m using Docker so I needed to tell it where my plugins folder lives:

$ docker run -v $PWD/plugins:/plugins \
    -p 7474:7474 \
    -p 7687:7687 \
    -e NEO4J_AUTH="none" \
    neo4j:3.2.3

Now we’re ready to try out our new function:

with "8.37" as rawNumeric 
RETURN apoc.number.exact.mul(rawNumeric,"1000000") AS apocConversion
 
╒════════════════╕
│"apocConversion"│
╞════════════════╡
│"8370000.00"    │
└────────────────┘

That almost does what we want, but the result is a string rather than a numeric value. It’s not too difficult to fix though:

with "8.37" as rawNumeric 
RETURN toFloat(apoc.number.exact.mul(rawNumeric,"1000000")) AS apocConversion
 
╒════════════════╕
│"apocConversion"│
╞════════════════╡
│8370000         │
└────────────────┘

That’s more like it!

Written by Mark Needham

August 13th, 2017 at 7:23 am

Posted in neo4j

Serverless: AWS HTTP Gateway – 502 Bad Gateway

In my continued work with Serverless and AWS Lambda I ran into a problem when trying to call an HTTP gateway.

My project looked like this:

serverless.yaml

service: http-gateway
 
frameworkVersion: ">=1.2.0 <2.0.0"
 
provider:
  name: aws
  runtime: python3.6
  timeout: 180
 
functions:
  no-op:
      name: NoOp
      handler: handler.noop
      events:
        - http: POST noOp

handler.py

def noop(event, context):
    return "hello"

Let’s deploy to AWS:

$ serverless  deploy
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (179 B)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
..............
Serverless: Stack update finished...
Service Information
service: http-gateway
stage: dev
region: us-east-1
api keys:
  None
endpoints:
  POST - https://29nb5rmmd0.execute-api.us-east-1.amazonaws.com/dev/noOp
functions:
  no-op: http-gateway-dev-no-op

And now we’ll try to call it using cURL:

$ curl -X POST https://29nb5rmmd0.execute-api.us-east-1.amazonaws.com/dev/noOp
{"message": "Internal server error"}

That didn’t work so well. What do the logs have to say?

$ serverless  logs --function no-op	
START RequestId: 64ab69b0-7d8f-11e7-9db5-13b228cd4cb6 Version: $LATEST
END RequestId: 64ab69b0-7d8f-11e7-9db5-13b228cd4cb6
REPORT RequestId: 64ab69b0-7d8f-11e7-9db5-13b228cd4cb6	Duration: 0.27 ms	Billed Duration: 100 ms 	Memory Size: 1024 MB	Max Memory Used: 21 MB

So the function is completely fine. It turns out I’m not very good at reading the manual and should have been returning a map instead of a string:

API Gateway expects to see a json map with keys “body”, “headers”, and “statusCode”.

Let’s update our handler function and re-deploy.

def noop(event, context):
    return {
        "body": "hello",
        "headers": {},
        "statusCode": 200
    }
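As an aside, if we wanted to return JSON rather than plain text, the body still has to be a string, so we’d serialise it ourselves. A minimal sketch along the same lines (not deployed here):

import json

def noop(event, context):
    # API Gateway expects "body" to be a string, even when it contains JSON
    return {
        "body": json.dumps({"message": "hello"}),
        "headers": {"Content-Type": "application/json"},
        "statusCode": 200
    }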

Now we’re ready to try the endpoint again:

$ curl -X POST https://29nb5rmmd0.execute-api.us-east-1.amazonaws.com/dev/noOp
hello

Much better!

Written by Mark Needham

August 11th, 2017 at 4:01 pm

Serverless: Python – virtualenv – { "errorMessage": "Unable to import module 'handler'" }

I’ve been using the Serverless library to deploy and run some Python functions on AWS lambda recently and was initially confused about how to handle my dependencies.

I tend to create a new virtualenv for each of my projects so let’s get that set up first:

Prerequisites

$ npm install serverless
$ virtualenv -p python3 a
$ . a/bin/activate

Now let’s create our Serverless project. I’m going to install the requests library so that I can use it in my function.

My Serverless project

serverless.yaml

service: python-starter-template

frameworkVersion: ">=1.2.0 <2.0.0"

provider:
  name: aws
  runtime: python3.6
  timeout: 180

functions:
  starter-function:
      name: Starter
      handler: handler.starter

handler.py

import requests
 
def starter(event, context):
    print("event:", event, "context:", context)
    r = requests.get("http://www.google.com")
    print(r.status_code)

$ pip install requests

OK, we’re now ready to try out the function. A nice feature of Serverless is that it lets us try out functions locally before we deploy them to one of the cloud providers:

$ ./node_modules/serverless/bin/serverless invoke local --function starter-function
event: {} context: <__main__.FakeLambdaContext object at 0x10bea9a20>
200
null

So far so good. Next we’ll deploy our function to AWS. I’m assuming you’ve already got your credentials set up, but if not you can follow the tutorial on the Serverless page.

$ ./node_modules/serverless/bin/serverless deploy
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (26.48 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.........
Serverless: Stack update finished...
Service Information
service: python-starter-template
stage: dev
region: us-east-1
api keys:
  None
endpoints:
  None
functions:
  starter-function: python-starter-template-dev-starter-function

Now let’s invoke our function:

$ ./node_modules/serverless/bin/serverless invoke --function starter-function
{
    "errorMessage": "Unable to import module 'handler'"
}
 
  Error --------------------------------------------------
 
  Invoked function failed
 
     For debugging logs, run again after setting the "SLS_DEBUG=*" environment variable.
 
  Get Support --------------------------------------------
     Docs:          docs.serverless.com
     Bugs:          github.com/serverless/serverless/issues
     Forums:        forum.serverless.com
     Chat:          gitter.im/serverless/serverless
 
  Your Environment Information -----------------------------
     OS:                     darwin
     Node Version:           6.7.0
     Serverless Version:     1.19.0

Hmmm, that’s odd – I wonder why it can’t import our handler module? We can call the logs function to check. The logs are usually a few seconds behind so we’ll have to be a bit patient if we don’t see them immediately.

$ ./node_modules/serverless/bin/serverless logs  --function starter-function
START RequestId: 735efa84-7ad0-11e7-a4ef-d5baf0b46552 Version: $LATEST
Unable to import module 'handler': No module named 'requests'
 
END RequestId: 735efa84-7ad0-11e7-a4ef-d5baf0b46552
REPORT RequestId: 735efa84-7ad0-11e7-a4ef-d5baf0b46552	Duration: 0.42 ms	Billed Duration: 100 ms 	Memory Size: 1024 MB	Max Memory Used: 22 MB

That explains it – the requests module wasn’t included in the zip file we deployed.

If we look inside .serverless/python-starter-template.zip we can see that the requests module is hidden inside the a directory, and the instance of Python that runs on Lambda doesn’t know where to find it.
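We can confirm that from Python by listing the archive’s contents. A quick sanity check (the entries shown are only indicative):

import zipfile

# Dependencies sit under the virtualenv's directory rather than at the top level of the zip
names = zipfile.ZipFile(".serverless/python-starter-template.zip").namelist()
print([name for name in names if "requests" in name][:3])
# e.g. ['a/lib/python3.6/site-packages/requests/__init__.py', ...]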

I’m sure there are other ways of solving this but the easiest one I found is a Serverless plugin called serverless-python-requirements.

So how does this plugin work?

A Serverless v1.x plugin to automatically bundle dependencies from requirements.txt and make them available in your PYTHONPATH.

Doesn’t sound too tricky – we can use pip freeze to get our list of requirements and write them into a file. Let’s rework serverless.yaml to make use of the plugin:

My Serverless project using serverless-python-requirements

$ npm install --save serverless-python-requirements
$ pip freeze > requirements.txt
$ cat requirements.txt 
certifi==2017.7.27.1
chardet==3.0.4
idna==2.5
requests==2.18.3
urllib3==1.22

serverless.yaml

service: python-starter-template

frameworkVersion: ">=1.2.0 <2.0.0"

provider:
  name: aws
  runtime: python3.6
  timeout: 180

plugins:
  - serverless-python-requirements

functions:
  starter-function:
      name: Starter
      handler: handler.starter

package:
  exclude:
    - a/** # virtualenv

We have two changes from before:

  • We added the serverless-python-requirements plugin
  • We excluded the a directory since we don’t need it

Let’s deploy again and run the function:

$ ./node_modules/serverless/bin/serverless deploy
Serverless: Parsing Python requirements.txt
Serverless: Installing required Python packages for runtime python3.6...
Serverless: Linking required Python packages...
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Unlinking required Python packages...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (14.39 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.........
Serverless: Stack update finished...
Service Information
service: python-starter-template
stage: dev
region: us-east-1
api keys:
  None
endpoints:
  None
functions:
  starter-function: python-starter-template-dev-starter-function
$ ./node_modules/serverless/bin/serverless invoke --function starter-function
null

Looks good. Let’s check the logs:

$ ./node_modules/serverless/bin/serverless logs --function starter-function
START RequestId: 61e8eda7-7ad4-11e7-8914-03b8a7793a24 Version: $LATEST
event: {} context: <__main__.LambdaContext object at 0x7f568b105f28>
200
END RequestId: 61e8eda7-7ad4-11e7-8914-03b8a7793a24
REPORT RequestId: 61e8eda7-7ad4-11e7-8914-03b8a7793a24	Duration: 55.55 ms	Billed Duration: 100 ms 	Memory Size: 1024 MB	Max Memory Used: 29 M

All good here as well so we’re done!

Written by Mark Needham

August 6th, 2017 at 7:03 pm

Posted in Software Development

AWS Lambda: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

I’ve been working on an AWS Lambda job to convert an HTML page to PDF using a Python wrapper around the wkhtmltopdf library, but ended up with the following error when I tried to execute it:

b'/bin/sh: ./binary/wkhtmltopdf: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory\n': Exception
Traceback (most recent call last):
  File "/var/task/handler.py", line 33, in generate_certificate
    wkhtmltopdf(local_html_file_name, local_pdf_file_name)
  File "/var/task/lib/wkhtmltopdf.py", line 64, in wkhtmltopdf
    wkhp.render()
  File "/var/task/lib/wkhtmltopdf.py", line 56, in render
    raise Exception(stderr)
Exception: b'/bin/sh: ./binary/wkhtmltopdf: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory\n'

It turns out this is the error you get if you run a 32-bit binary on a 64-bit operating system, which is what AWS Lambda uses. The AWS documentation spells it out:

If you are using any native binaries in your code, make sure they are compiled in this environment. Note that only 64-bit binaries are supported on AWS Lambda.
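A quick way to check which flavour a binary is, before shipping it to Lambda, is to read the EI_CLASS byte of the ELF header: offset 4 holds 1 for 32-bit and 2 for 64-bit. A small Python sketch, using the path from the error above:

# Read the ELF magic number plus the EI_CLASS byte
with open("binary/wkhtmltopdf", "rb") as f:
    header = f.read(5)

assert header[:4] == b"\x7fELF", "not an ELF binary"
print("64-bit" if header[4] == 2 else "32-bit")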

I changed to the 64-bit binary and am now happily converting HTML pages to PDF.

Written by Mark Needham

August 3rd, 2017 at 5:24 pm

Posted in Software Development

PHP vs Python: Generating a HMAC

I’ve been writing a bit of code to integrate with a ClassMarker webhook, where you’re required to check that an incoming request actually came from ClassMarker by verifying a base64-encoded HMAC SHA256 hash.

The example in the documentation is written in PHP which I haven’t done for about 10 years so I had to figure out how to do the same thing in Python.

This is the PHP version:

$ php -a
php > echo base64_encode(hash_hmac("sha256", "my data", "my_secret", true));
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=

The Python equivalent is a bit more code but it’s not too bad.

Import all the libraries

import hmac
import hashlib
import base64

Generate that hash

data = "my data".encode("utf-8")
digest = hmac.new(b"my_secret", data, digestmod=hashlib.sha256).digest()
 
print(base64.b64encode(digest).decode())
vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=

We’re getting the same value as the PHP version so it’s good times all round.
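To actually verify an incoming ClassMarker request we’d compute the hash ourselves and compare it with the one sent in the request. A sketch of how that check might look (the function name and arguments are mine), using hmac.compare_digest so the comparison doesn’t leak timing information:

import base64
import hashlib
import hmac

def is_valid_signature(payload, received_signature, secret):
    # Recompute the HMAC SHA256 hash and compare in constant time
    digest = hmac.new(secret, payload, digestmod=hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, received_signature)

print(is_valid_signature(b"my data", "vyniKpNSlxu4AfTgSJImt+j+pRx7v6m+YBobfKsoGhE=", b"my_secret"))
True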

Written by Mark Needham

August 2nd, 2017 at 6:09 am

Posted in Python

Docker: Building custom Neo4j images on Mac OS X

I sometimes need to create custom Neo4j Docker images to try things out, and wanted to share my workflow, mostly for future Mark but also in case it’s useful to someone else.

There’s already a docker-neo4j repository so we’ll just tweak the files in there to achieve what we want.

$ git clone git@github.com:neo4j/docker-neo4j.git
$ cd docker-neo4j

If we want to build a Docker image for Neo4j Enterprise Edition we can run the following build target:

$ make clean build-enterprise
Makefile:9: *** This Make does not support .RECIPEPREFIX. Please use GNU Make 4.0 or later.  Stop.

Denied at the first hurdle! What version of make have we got on this machine?

$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
 
This program built for i386-apple-darwin11.3.0

We can sort that out by installing a newer version using brew:

$ brew install make
$ gmake --version
GNU Make 4.2.1
Built for x86_64-apple-darwin15.6.0
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

That’s more like it! brew installs make with the ‘g’ prefix and since I’m not sure if anything else on my system relies on the older version of make I won’t bother changing the symlink.

Let’s retry our original command:

$ gmake clean build-enterprise
Makefile:14: *** NEO4J_VERSION is not set.  Stop.

It’s still not happy with us! Let’s set that environment variable to the latest released version as of writing:

$ export NEO4J_VERSION="3.2.2"
$ gmake clean build-enterprise
...
Successfully built c16b6f2738de
Successfully tagged test/18334:latest
Neo4j 3.2.2-enterprise available as: test/18334

We can see that image in Docker land by running the following command:

$ docker images | head -n2
REPOSITORY                                     TAG                          IMAGE ID            CREATED             SIZE
test/18334                                     latest                       c16b6f2738de        4 minutes ago       303MB

If I wanted to deploy that image to my own Docker Hub I could run the following commands:

$ docker login --username=markhneedham
$ docker tag c16b6f2738de markhneedham/neo4j:3.2.2
$ docker push markhneedham/neo4j

Putting Neo4j Enterprise 3.2.2 on my Docker Hub isn’t very interesting though – that version is already on the official Neo4j Docker Hub.

I’ve actually been building versions of Neo4j against the HEAD of the Neo4j 3.2 branch (i.e. 3.2.3-SNAPSHOT), deploying those to S3, and then building a Docker image based on those archives.

To change the source of the Neo4j artifact we need to tweak this line in the Makefile:

$ git diff Makefile
diff --git a/Makefile b/Makefile
index c77ed1f..98e05ca 100644
--- a/Makefile
+++ b/Makefile
@@ -15,7 +15,7 @@ ifndef NEO4J_VERSION
 endif
 
 tarball = neo4j-$(1)-$(2)-unix.tar.gz
-dist_site := http://dist.neo4j.org
+dist_site := https://s3-eu-west-1.amazonaws.com/core-edge.neotechnology.com/20170726
 series := $(shell echo "$(NEO4J_VERSION)" | sed -E 's/^([0-9]+\.[0-9]+)\..*/\1/')
 
 all: out/enterprise/.sentinel out/community/.sentinel

We can then update the Neo4j version environment variable:

$ export NEO4J_VERSION="3.2.3-SNAPSHOT"

And then repeat the Docker commands above. You’ll need to sub in your own Docker Hub user and repository names.

I’m using these custom images as part of Kubernetes deployments but you can use them anywhere that accepts a Docker container.

If anything on the post didn’t make sense or you want more clarification let me know @markhneedham.

Written by Mark Needham

July 26th, 2017 at 10:20 pm

Posted in neo4j

Pandas: ValueError: The truth value of a Series is ambiguous.

I’ve been playing around with Kaggle in my spare time over the last few weeks and came across an unexpected behaviour when trying to add a column to a dataframe.

First let’s get pandas into our program scope:

Prerequisites

import pandas as pd

Now we’ll create a data frame to play with for the duration of this post:

>>> df = pd.DataFrame({"a": [1,2,3,4,5], "b": [2,3,4,5,6]})
>>> df
   a  b
0  1  2
1  2  3
2  3  4
3  4  5
4  5  6

Let’s say we want to create a new column which contains True if either of the numbers is odd, and False if not.

We’d expect to see a column full of True values so let’s get started.

>>> divmod(df["a"], 2)[1] > 0
0     True
1    False
2     True
3    False
4     True
Name: a, dtype: bool
 
>>> divmod(df["b"], 2)[1] > 0
0    False
1     True
2    False
3     True
4    False
Name: b, dtype: bool

So far so good. Now let’s combine those two calculations together and create a new column in our data frame:

>>> df["anyOdd"] = (divmod(df["a"], 2)[1] > 0) or (divmod(df["b"], 2)[1] > 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/kaggle/house-prices/a/lib/python3.6/site-packages/pandas/core/generic.py", line 953, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Hmmm, that was unexpected! Unfortunately Python’s or and and keywords don’t work element-wise on pandas Series, so instead we need to use the bitwise or (|) and and (&) operators.

Let’s update our example:

>>> df["anyOdd"] = (divmod(df["a"], 2)[1] > 0) | (divmod(df["b"], 2)[1] > 0)
>>> df
   a  b  anyOdd
0  1  2    True
1  2  3    True
2  3  4    True
3  4  5    True
4  5  6    True
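As an aside, the modulo operator gets us the same answer a little more directly. The parentheses are still needed, because | binds more tightly than the comparison operators:

>>> df["anyOdd"] = (df["a"] % 2 > 0) | (df["b"] % 2 > 0)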

Much better. And what if we want to check whether both values are odd?

>>> df["bothOdd"] = (divmod(df["a"], 2)[1] > 0) & (divmod(df["b"], 2)[1] > 0)
>>> df
   a  b  anyOdd  bothOdd
0  1  2    True    False
1  2  3    True    False
2  3  4    True    False
3  4  5    True    False
4  5  6    True    False

Works exactly as expected, hoorah!

Written by Mark Needham

July 26th, 2017 at 9:41 pm

Posted in Data Science

Pandas/scikit-learn: get_dummies test/train sets – ValueError: shapes not aligned

I’ve been using pandas’ get_dummies function to generate dummy columns for categorical variables to use with scikit-learn, but noticed that it sometimes doesn’t work as I expect.

Prerequisites

import pandas as pd
import numpy as np
from sklearn import linear_model

Let’s say we have the following training and test sets:

Training set

train = pd.DataFrame({"letter":["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
y_train = train["value"]

Test set

test = pd.DataFrame({"letter":["D", "D", "B", "E"], "value": [4, 5, 7, 19]})
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
y_test = test["value"]

Now say we want to train a linear model on our training set and then use it to predict the values in our test set:

Train the model

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model

model.score(X_test, y_test)
ValueError: shapes (4,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)

Hmmm that didn’t go to plan. If we print X_train and X_test it might help shed some light:

Checking the train/test datasets

print(X_train)
   letter_A  letter_B  letter_C  letter_D
0         1         0         0         0
1         0         1         0         0
2         0         0         1         0
3         0         0         0         1
print(X_test)
   letter_B  letter_D  letter_E
0         0         1         0
1         0         1         0
2         1         0         0
3         0         0         1

They do indeed have different shapes and some different column names because the test set contained some values that weren’t present in the training set.
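One quick way to patch that up (an alternative to the approach we’ll take below) is to reindex the test frame against the training columns, filling anything missing with zeros. Note that this silently drops letter_E, which is arguably fine given the model learned nothing about it:

# Align the dummy columns of the test set with those seen during training
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)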

We can fix this by making the ‘letter’ field categorical before we run the get_dummies method over the dataframe. At the moment the field is of type ‘object’:

Column types

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
letter    4 non-null object
value     4 non-null int64
dtypes: int64(1), object(1)
memory usage: 144.0+ bytes

We can fix this by converting the ‘letter’ field to the type ‘category’ and setting the list of allowed values to be the unique set of values in the train/test sets.

All allowed values

all_data = pd.concat((train,test))
for column in all_data.select_dtypes(include=[np.object]).columns:
    print(column, all_data[column].unique())
letter ['A' 'B' 'C' 'D' 'E']

Now let’s update the type of our ‘letter’ field in the train and test dataframes.

Type: ‘category’

all_data = pd.concat((train,test))
 
for column in all_data.select_dtypes(include=[np.object]).columns:
    train[column] = train[column].astype('category', categories = all_data[column].unique())
    test[column] = test[column].astype('category', categories = all_data[column].unique())

And now if we call get_dummies on either dataframe we’ll get the same set of columns:

get_dummies: Take 2

X_train = train.drop(["value"], axis=1)
X_train = pd.get_dummies(X_train)
print(X_train)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         1         0         0         0         0
1         0         1         0         0         0
2         0         0         1         0         0
3         0         0         0         1         0
X_test = test.drop(["value"], axis=1)
X_test = pd.get_dummies(X_test)
print(X_test)
   letter_A  letter_B  letter_C  letter_D  letter_E
0         0         0         0         1         0
1         0         0         0         1         0
2         0         1         0         0         0
3         0         0         0         0         1

Great! Now we should be able to train our model and use it against the test set:

Train the model: Take 2

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)

Test the model: Take 2

model.score(X_test, y_test)
-1.0604490500863557

And we’re done!

Written by Mark Needham

July 5th, 2017 at 3:42 pm

Posted in Python

Pandas: Find rows where column/field is null

In my continued playing around with the Kaggle house prices dataset I wanted to find any columns/fields that contain null values.

If we want to get a count of the number of null fields by column we can use the following code, adapted from Poonam Ligade’s kernel:

Prerequisites

import pandas as pd

Count the null columns

train = pd.read_csv("train.csv")
null_columns=train.columns[train.isnull().any()]
train[null_columns].isnull().sum()
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64
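As an aside, dividing those counts by the number of rows turns them into percentages, which can be easier to scan:

print((train[null_columns].isnull().sum() / len(train) * 100).round(2).sort_values(ascending=False))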

So there are lots of different columns containing null values. What if we want to find the solitary row which has ‘Electrical’ as null?

Single column is null

print(train[train["Electrical"].isnull()][null_columns])
      LotFrontage Alley MasVnrType  MasVnrArea BsmtQual BsmtCond BsmtExposure  \
1379         73.0   NaN       None         0.0       Gd       TA           No   
 
     BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType  GarageYrBlt  \
1379          Unf          Unf        NaN         NaN    BuiltIn       2007.0   
 
     GarageFinish GarageQual GarageCond PoolQC Fence MiscFeature  
1379          Fin         TA         TA    NaN   NaN         NaN

And what if we want to return every row that contains at least one null value? That’s not too difficult – it’s just a combination of the code in the previous two sections:

All null columns

print(train[train.isnull().any(axis=1)][null_columns].head())
   LotFrontage Alley MasVnrType  MasVnrArea BsmtQual BsmtCond BsmtExposure  \
0         65.0   NaN    BrkFace       196.0       Gd       TA           No   
1         80.0   NaN       None         0.0       Gd       TA           Gd   
2         68.0   NaN    BrkFace       162.0       Gd       TA           Mn   
3         60.0   NaN       None         0.0       TA       Gd           No   
4         84.0   NaN    BrkFace       350.0       Gd       TA           Av   
 
  BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType  GarageYrBlt  \
0          GLQ          Unf      SBrkr         NaN     Attchd       2003.0   
1          ALQ          Unf      SBrkr          TA     Attchd       1976.0   
2          GLQ          Unf      SBrkr          TA     Attchd       2001.0   
3          ALQ          Unf      SBrkr          Gd     Detchd       1998.0   
4          GLQ          Unf      SBrkr          TA     Attchd       2000.0   
 
  GarageFinish GarageQual GarageCond PoolQC Fence MiscFeature  
0          RFn         TA         TA    NaN   NaN         NaN  
1          RFn         TA         TA    NaN   NaN         NaN  
2          RFn         TA         TA    NaN   NaN         NaN  
3          Unf         TA         TA    NaN   NaN         NaN  
4          RFn         TA         TA    NaN   NaN         NaN

Hope that helps future Mark!

Written by Mark Needham

July 5th, 2017 at 2:31 pm

Posted in Python

Shell: Create a comma separated string

I recently needed to generate a string with comma separated values, based on iterating a range of numbers.

e.g. we should get the following output where n = 3

foo-0,foo-1,foo-2

I only had the shell available to me so I couldn’t shell out into Python or Ruby for example. That means it’s bash scripting time!

If we want to iterate a range of numbers and print them out on the screen we can write the following code:

n=3
for i in $(seq 0 $(($n > 0? $n-1: 0))); do 
  echo "foo-$i"
done
 
foo-0
foo-1
foo-2

Combining them into a string is a bit trickier, but luckily I found a great blog post by Andreas Haupt which shows what to do. Andreas is solving a more complicated problem than mine, but these are the bits of code that we need from the post.

n=3
combined=""
 
for i in $(seq 0 $(($n > 0? $n-1: 0))); do 
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
done
echo $combined
 
foo-0,foo-1,foo-2

This won’t work if you set n<0 but that’s ok for me! I’ll let Andreas explain how it works:

  • ${combined:+,} will return either a comma (if combined exists and is set) or nothing at all.
  • In the first invocation of the loop combined is not yet set and nothing is put out.
  • In the next rounds combined is set and a comma will be put out.

We can see how it works by printing out the value of $combined after each iteration of the loop:

n=3
combined=""
 
for i in $(seq 0 $(($n > 0 ? $n-1: 0))); do 
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
  echo $combined
done
 
foo-0
foo-0,foo-1
foo-0,foo-1,foo-2

Looks good to me!
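For comparison, had Python been available on the machine, the whole exercise would have been a one-liner:

n = 3
print(",".join("foo-%d" % i for i in range(n)))
# foo-0,foo-1,foo-2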

Written by Mark Needham

June 23rd, 2017 at 12:26 pm

Posted in Shell Scripting
