Mark Needham

Thoughts on Software Development

Archive for the ‘Data Science’ Category

Exploring (potential) data entry errors in the Land Registry data set

without comments

I’ve previously written a couple of blog posts describing the mechanics of analysing the Land Registry data set and I thought it was about time I described some of the queries I’ve been running and the discoveries I’ve made.

To recap, the Land Registry provides a 3GB, 20 million line CSV file containing all the property sales in the UK since 1995.

We’ll load and query the data in R using the data.table package:

> library(data.table)
> dt = fread("pp-complete.csv", header = FALSE)
> dt[1:5]
                                       V1     V2               V3       V4 V5
1: {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T
2: {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D
3: {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F
4: {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D
5: {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D
   V6 V7  V8 V9           V10        V11         V12
1:  N  F 106     READING ROAD   NORTHOLT    NORTHOLT
2:  N  F  58     ADAMS MEADOW  ILMINSTER   ILMINSTER
3:  N  L  58    WHELLOCK ROAD                 LONDON
4:  N  F  17         WESTGATE    MORPETH     MORPETH
5:  N  F   4    MASON GARDENS WEST WINCH KING'S LYNN
                            V13            V14 V15
1:                       EALING GREATER LONDON   A
2:               SOUTH SOMERSET       SOMERSET   A
3:                       EALING GREATER LONDON   A
4:               CASTLE MORPETH NORTHUMBERLAND   A
5: KING'S LYNN AND WEST NORFOLK        NORFOLK   A

For our first query we’re going to find the most expensive property sold in each year from 1995 to 2015.

The first thing we’ll need to do is make column ‘V2’ (price) numeric and convert column ‘V3’ (sale date) to date format so we can do date arithmetic on it:

> dt = dt[, V2:= as.numeric(V2)]
> dt = dt[, V3:= as.Date(V3)]

Now let’s write the query:

> dt[, .SD[which.max(V2)], by=year(V3)][order(year)][, .(year,V9,V8,V10,V12,V14,V4,V2)]
    year             V9               V8                   V10            V12            V14       V4       V2
 1: 1995                  THORNETS HOUSE       BUILDER GARDENS    LEATHERHEAD         SURREY KT22 7DE  5610000
 2: 1996                              24             MAIN ROAD MELTON MOWBRAY LEICESTERSHIRE LE14 3SP 17250000
 3: 1997                              42        HYDE PARK GATE         LONDON GREATER LONDON  SW7 5DU  7500000
 4: 1998                              19     NEW BRIDGE STREET         LONDON GREATER LONDON EC4V 6DB 11250000
 5: 1999                  TERMINAL HOUSE LOWER BELGRAVE STREET         LONDON GREATER LONDON SW1W 0NH 32477000
 6: 2000         UNIT 3     JUNIPER PARK            FENTON WAY       BASILDON          ESSEX SS15 6RZ 12600000
 7: 2001                              19        BABMAES STREET         LONDON GREATER LONDON SW1Y 6HD 24750000
 8: 2002                              72        VINCENT SQUARE         LONDON GREATER LONDON SW1P 2PA  8300000
 9: 2003                              81          ADDISON ROAD         LONDON GREATER LONDON  W14 8ED  9250000
10: 2004                              29   HOLLAND VILLAS ROAD         LONDON GREATER LONDON  W14 8DH  7950000
11: 2005 APARTMENT 1102              199         KNIGHTSBRIDGE         LONDON GREATER LONDON  SW7 1RH 15193950
12: 2006                               1     THORNWOOD GARDENS         LONDON GREATER LONDON   W8 7EA 12400000
13: 2007                              36         CADOGAN PLACE         LONDON GREATER LONDON SW1X 9RX 17000000
14: 2008             50                         CHESTER SQUARE         LONDON GREATER LONDON SW1W 9EA 19750000
15: 2009                       CASA SARA     HEATHERSIDE DRIVE VIRGINIA WATER         SURREY GU25 4JU 13800000
16: 2010                              10   HOLLAND VILLAS ROAD         LONDON GREATER LONDON  W14 8BP 16200000
17: 2011                WHITESTONE HOUSE       WHITESTONE LANE         LONDON GREATER LONDON  NW3 1EA 19250000
18: 2012                              20           THE BOLTONS         LONDON GREATER LONDON SW10 9SU 54959000
19: 2013   APARTMENT 7F              171         KNIGHTSBRIDGE         LONDON GREATER LONDON  SW7 1DW 39000000
20: 2014                  APARTMENT 6, 5          PRINCES GATE         LONDON GREATER LONDON  SW7 1QJ 50000000
21: 2015                              37       BURNSALL STREET         LONDON GREATER LONDON  SW3 3SR 27750000
    year             V9               V8                   V10            V12            V14       V4       V2

The results mostly make sense – the majority of the highest priced properties are around Hyde Park and often somewhere near Knightsbridge which is one of the most expensive places in the country.

There are some oddities though, e.g. in 1996 the top priced property is in Melton Mowbray, Leicestershire and sold for just over £17m. I looked it up on the Land Registry site to quickly see what it subsequently sold for:

(Screenshot of the property’s subsequent sale prices on the Land Registry site)

Based on the subsequent prices I think we can safely assume that the initial price is incorrect and should actually have been £17,250.

We can also say the same about our 2000 winner, Juniper Park in Basildon, which sold for £12.6 million. If we look at the next sale price after that it’s £172,500 in 2003, so most likely it was actually sold for £126,000 – only out by a factor of 100!
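
We can sanity check that against the data set itself – something like the following query pulls back every recorded sale at that postcode and street so we can eyeball the neighbouring prices:

> dt[V4 == "SS15 6RZ" & V10 == "FENTON WAY"][order(V3)][, .(V3, V2, V8, V9)]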

I wanted to follow up on this observation and see if I could find other anomalies by comparing adjacent sale prices of properties.

First we’ll create a ‘fullAddress’ field which we’ll use as an identifier for each property. It’s not completely unique but it’s not far off:

> dt = dt[, fullAddress := paste(V8, V9, V10, V11, V12, V13, V4, sep=", ")]
> setkey(dt, fullAddress)
 
> dt[, .(fullAddress, V2)][1:5]
                                                                                  fullAddress     V2
1:                ''NUTSHELL COTTAGE, 72, , KIRKLAND, KENDAL, KENDAL, SOUTH LAKELAND, LA9 5AP  89000
2:                         'FARRIERS', , FARRIERS CLOSE, WOODLEY, READING, WOKINGHAM, RG5 3DD 790000
3: 'HOLMCROFT', 40, , BRIDGNORTH ROAD, WOMBOURNE, WOLVERHAMPTON, SOUTH STAFFORDSHIRE, WV5 0AA 305000
4:                            (AKERS), , CHAPEL STREET, EASINGWOLD, YORK, HAMBLETON, YO61 3AE 118000
5:                                       (ANNINGS), , , FARWAY, COLYTON, EAST DEVON, EX24 6DF 150000

Next we’ll add a column to the data table which contains the previous sale price and another column which calculates the difference between the two prices:

> dt[, lag.V2:=c(NA, V2[-.N]), by = fullAddress]
> dt[, V2.diff := V2 - lag.V2]
 
> dt[!is.na(lag.V2),][1:10][, .(fullAddress, lag.V2, V2, V2.diff)]
                                                                                   fullAddress lag.V2     V2 V2.diff
 1:                                       (ANNINGS), , , FARWAY, COLYTON, EAST DEVON, EX24 6DF 150000 385000  235000
 2:                  (BARBER), , PEACOCK CORNER, MOULTON ST MARY, NORWICH, BROADLAND, NR13 3NF 115500 136000   20500
 3:                      (BELL), , BAWBURGH ROAD, MARLINGFORD, NORWICH, SOUTH NORFOLK, NR9 5AG 128000 300000  172000
 4:                      (BEVERLEY), , DAWNS LANE, ASLOCKTON, NOTTINGHAM, RUSHCLIFFE, NG13 9AD  95000 210000  115000
 5: (BLACKMORE), , GREAT STREET, NORTON SUB HAMDON, STOKE-SUB-HAMDON, SOUTH SOMERSET, TA14 6SJ  53000 118000   65000
 6:                        (BOWDERY), , HIGH STREET, MARKINGTON, HARROGATE, HARROGATE, HG3 3NR 140000 198000   58000
 7:                  (BULLOCK), , MOORLAND ROAD, INDIAN QUEENS, ST. COLUMB, RESTORMEL, TR9 6HN  50000  50000       0
 8:                                   (CAWTHRAY), , CAWOOD ROAD, WISTOW, SELBY, SELBY, YO8 3XB 130000 120000  -10000
 9:                                   (CAWTHRAY), , CAWOOD ROAD, WISTOW, SELBY, SELBY, YO8 3XB 120000 155000   35000
10:                                 (COATES), , , BARDSEA, ULVERSTON, SOUTH LAKELAND, LA12 9QT  26000  36000   10000

Let’s find the properties which have the biggest £ value difference in adjacent sales:

> dt[!is.na(V2.diff)][order(-abs(V2.diff))][, .(fullAddress, lag.V2, V2, V2.diff)][1:20]
                                                                fullAddress   lag.V2       V2   V2.diff
 1:     , 50, CHESTER SQUARE, LONDON, LONDON, CITY OF WESTMINSTER, SW1W 9EA  1135000 19750000  18615000
 2:         44, , LANSDOWNE ROAD, , LONDON, KENSINGTON AND CHELSEA, W11 2LU  3675000 22000000  18325000
 3:      24, , MAIN ROAD, ASFORDBY VALLEY, MELTON MOWBRAY, MELTON, LE14 3SP 17250000    32500 -17217500
 4:           11, , ORMONDE GATE, , LONDON, KENSINGTON AND CHELSEA, SW3 4EU   250000 16000000  15750000
 5:     2, , HOLLAND VILLAS ROAD, , LONDON, KENSINGTON AND CHELSEA, W14 8BP  8675000 24000000  15325000
 6:          1, , PEMBRIDGE PLACE, , LONDON, KENSINGTON AND CHELSEA, W2 4XB  2340250 17000000  14659750
 7:     10, , CHESTER SQUARE, LONDON, LONDON, CITY OF WESTMINSTER, SW1W 9HH   680000 15000000  14320000
 8:        12, , SOUTH EATON PLACE, , LONDON, CITY OF WESTMINSTER, SW1W 9JA  4250000 18550000  14300000
 9:     32, FLAT 1, HOLLAND PARK, , LONDON, KENSINGTON AND CHELSEA, W11 3TA   420000 14100000  13680000
10:       42, , EGERTON CRESCENT, , LONDON, KENSINGTON AND CHELSEA, SW3 2EB  1125000 14650000  13525000
11:   36, , CADOGAN PLACE, LONDON, LONDON, KENSINGTON AND CHELSEA, SW1X 9RX  3670000 17000000  13330000
12:        22, , ILCHESTER PLACE, , LONDON, KENSINGTON AND CHELSEA, W14 8AA  3350000 16250000  12900000
13:                3, , BOLNEY GATE, , LONDON, CITY OF WESTMINSTER, SW7 1QW  5650000 18250000  12600000
14:        JUNIPER PARK, UNIT 3, FENTON WAY, , BASILDON, BASILDON, SS15 6RZ 12600000   172500 -12427500
15:           10, , WALTON PLACE, , LONDON, KENSINGTON AND CHELSEA, SW3 1RJ   356000 12750000  12394000
16: 84, MAISONETTE C, EATON SQUARE, , LONDON, CITY OF WESTMINSTER, SW1W 9AG  1500000 13400000  11900000
17:          3, , CHESTERFIELD HILL, , LONDON, CITY OF WESTMINSTER, W1J 5BJ   955000 12600000  11645000
18:   39, , ENNISMORE GARDENS, LONDON, LONDON, CITY OF WESTMINSTER, SW7 1AG  3650000 15250000  11600000
19:       76, FLAT 2, EATON SQUARE, , LONDON, CITY OF WESTMINSTER, SW1W 9AW  3500000 15000000  11500000
20:                            85, , AVENUE ROAD, , LONDON, CAMDEN, NW8 6JD   519000 12000000  11481000

Most of the entries here are in Westminster or Hyde Park and don’t look particularly dodgy at first glance. We’d have to drill into the sale dates to confirm.

What you might also have noticed is that our Melton Mowbray and Juniper Park properties both show up, and although they don’t have the biggest £ value difference they would probably rank top if we calculated the multiplier instead. Let’s give that a try:

> dt[, V2.multiplier := ifelse(V2 > lag.V2, V2 / lag.V2, lag.V2 / V2)]
 
> dt[!is.na(V2.multiplier)][order(-V2.multiplier)][, .(fullAddress, lag.V2, V2, V2.multiplier)][1:20]
                                                                            fullAddress   lag.V2       V2 V2.multiplier
 1:                  24, , MAIN ROAD, ASFORDBY VALLEY, MELTON MOWBRAY, MELTON, LE14 3SP 17250000    32500     530.76923
 2:                          LEA HAVEN, FLAT 1, CASTLE LANE, , TORQUAY, TORBAY, TQ1 3BE    38000  7537694     198.36037
 3:   NIGHTINGALE HOUSE, , BURLEIGH ROAD, ASCOT, ASCOT, WINDSOR AND MAIDENHEAD, SL5 7LD     9500  1100000     115.78947
 4:                    JUNIPER PARK, UNIT 3, FENTON WAY, , BASILDON, BASILDON, SS15 6RZ 12600000   172500      73.04348
 5:                           9, , ROTHSAY GARDENS, BEDFORD, BEDFORD, BEDFORD, MK40 3QA    21000  1490000      70.95238
 6:       22, GROUND FLOOR FLAT, SEA VIEW AVENUE, , PLYMOUTH, CITY OF PLYMOUTH, PL4 8RU    27950  1980000      70.84079
 7: 91A, , TINTERN AVENUE, WESTCLIFF-ON-SEA, WESTCLIFF-ON-SEA, SOUTHEND-ON-SEA, SS0 9QQ    17000  1190000      70.00000
 8:     204C, , SUTTON ROAD, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SS2 5ES    18000  1190000      66.11111
 9:            PRIORY COURT, FLAT 3, PRIORY AVENUE, TOTNES, TOTNES, SOUTH HAMS, TQ9 5HS  2226500    34000      65.48529
10:      59, , ST ANNS ROAD, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SOUTHEND-ON-SEA, SS2 5AT    18250  1190000      65.20548
11:                                    15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ    13500   880000      65.18519
12:                       11, , ORMONDE GATE, , LONDON, KENSINGTON AND CHELSEA, SW3 4EU   250000 16000000      64.00000
13:                         WOODEND, , CANNONGATE ROAD, HYTHE, HYTHE, SHEPWAY, CT21 5PX    19261  1200000      62.30206
14:                 DODLESTON OAKS, , CHURCH ROAD, DODLESTON, CHESTER, CHESTER, CH4 9NG    10000   620000      62.00000
15:         CREEKSIDE, , CURLEW DRIVE, WEST CHARLETON, KINGSBRIDGE, SOUTH HAMS, TQ7 2AA    28000  1700000      60.71429
16:                              20, , BRANCH ROAD, BURNLEY, BURNLEY, BURNLEY, BB11 3AT     9000   540000      60.00000
17:             THE BARN, , LEE WICK LANE, ST OSYTH, CLACTON-ON-SEA, TENDRING, CO16 8ES    10000   600000      60.00000
18:                           11, , OAKWOOD GARDENS, KNAPHILL, WOKING, WOKING, GU21 2RX     6000   357000      59.50000
19:                              23, , OLDHAM ROAD, GRASSCROFT, OLDHAM, OLDHAM, OL4 4HY     8000   475000      59.37500
20:                  THE SUNDAY HOUSE, , WATER LANE, GOLANT, FOWEY, RESTORMEL, PL23 1LF     8000   475000      59.37500

This is much better! Our Melton Mowbray property comes in first by miles and Juniper Park is there in 4th. The rest of the price changes look implausible as well, but let’s drill into a couple of them:

> dt[fullAddress == "15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ"][, .(fullAddress, V3, V2)]
                                        fullAddress         V3     V2
1: 15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ 1995-06-29  13500
2: 15, , BREWERY LANE, LEIGH, LEIGH, WIGAN, WN7 2RJ 2008-03-28 880000

If we look at some other properties on the same road, and at this property’s features, it seems more likely that the price was meant to be £88,000.
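
For example, something like this pulls back the other sales recorded at the same postcode for comparison:

> dt[V10 == "BREWERY LANE" & V4 == "WN7 2RJ"][order(V2)][, .(fullAddress, V3, V2)]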

I noticed a similar trend when looking at some of the others on this list, but I also realised that the data needs a bit of cleaning up as the ‘fullAddress’ column isn’t uniquely identifying properties. e.g. sometimes a property might have a Town/City of ‘London’ and a District of ‘London’, but on another transaction the District could be blank.

On top of that, my strategy of looking for subsequent prices to spot anomalies falls down when trying to explore properties which only have one sale.
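
As a rough sketch of both of those follow-ups – and this is just one way of doing it – we could build a version of the address that leaves out the district column (which seems to be the main culprit for the duplicates) and then count how many properties only ever appear once:

> dt = dt[, cleanAddress := paste(V8, V9, V10, V11, V12, V4, sep=", ")]
> salesPerProperty = dt[, .N, by = cleanAddress]
> salesPerProperty[, .(properties = .N), by = .(singleSale = N == 1)]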

So I have a couple of things to look into for now but once I’ve done those it’d be interesting to write an algorithm/program that could predict which transactions are likely to be anomalies.

I can imagine how that might work if I had a labelled training set but I’m not sure if I could do it with an unsupervised algorithm so if you have any pointers let me know.
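
In the meantime, the crudest rule-based version – flag any adjacent pair of sales whose multiplier is implausibly large and review them by hand – only needs the columns we’ve already calculated. The threshold of 50 below is an arbitrary guess:

> suspect = dt[!is.na(V2.multiplier) & V2.multiplier > 50]
> suspect[order(-V2.multiplier)][, .(fullAddress, V3, lag.V2, V2, V2.multiplier)][1:20]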

Written by Mark Needham

October 18th, 2015 at 10:03 am

Posted in Data Science

Tagged with

Data Science: Mo’ Data Mo’ Problems

without comments

Over the last couple of years I’ve worked on several proof of concept style Neo4j projects and on a lot of them people have wanted to work with their entire data set, which I don’t think makes sense so early on.

In the early parts of a project we’re trying to prove out our approach rather than prove we can handle big data – something that Ashok taught me a couple of years ago on a project we worked on together.

In a Neo4j project that means coming up with an effective way to model and query our data and if we lose track of this it’s very easy to get sucked into working on the big data problem.

This could mean optimising our import scripts to deal with huge amounts of data or working out how to handle different aspects of the data (e.g. variability in shape or encoding) that only seem to reveal themselves at scale.

These are certainly problems that we need to solve but in my experience they end up taking much more time than expected and therefore aren’t the best problem to tackle when time is limited. Early on we want to create some momentum and keep the feedback cycle fast.

We probably want to tackle the data size problem as part of the implementation/production stage of the project, to use Michael Nygard’s terminology.

At this stage we’ll have some confidence that our approach makes sense and then we can put aside the time to set things up properly.

I’m sure there are some types of projects where this approach doesn’t make sense so I’d love to hear about them in the comments so I can spot them in future.

Written by Mark Needham

June 28th, 2014 at 11:35 pm

Posted in Data Science

Tagged with

Data Science: Don’t build a crawler (if you can avoid it!)

with one comment

On Tuesday I spoke at the Data Science London meetup about football data and I started out by covering some lessons I’ve learnt about building data sets for personal use when open data isn’t available.

When that’s the case you often end up scraping HTML pages to extract the data that you’re interested in and then storing that in files or in a database if you want to be more fancy.

Ideally we want to spend our time playing with the data rather than gathering it, so we want to keep this stage to a minimum, which we can do by following these rules.

Don’t build a crawler

One of the most tempting things to do is build a crawler which starts on the home page and then follows some/all the links it comes across, downloading those pages as it goes.

This is incredibly time consuming and yet this was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in neo4j about 18 months ago.

Ashok wanted to get the same data a few months later and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted and then built up a list of pages to download.

It took him in the order of minutes to build a script that would get the data whereas I spent many hours using the crawler based approach.
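
For example, if the pages follow a predictable numeric pattern then the whole list can be generated in one line rather than discovered by crawling – something along these lines, using a made up site:

$ for i in $(seq 1 60); do echo "https://www.some-made-up-place.com/page${i}.html"; done > uris.txt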

If there is no discernible URI structure or if you want to get every single page then the crawler approach might make sense but I try to avoid it as a first port of call.

Download the files

The second thing I learnt is that running Web Driver or nokogiri or enlive against live web pages and then only storing the parts of the page we’re interested in is suboptimal.

We pay the network cost every time we run the script and at the beginning of a data gathering exercise we won’t know exactly what data we need so we’re bound to have to run it multiple times until we get it right.

It’s much quicker to download the files to disk and work on them locally.

Use wget

Having spent a lot of time writing different tools to download the ThoughtWorks data set Ashok asked me why I wasn’t using wget instead.

I couldn’t think of a good reason so now I favour building up a list of URIs and then letting wget take care of downloading them for us. e.g.

$ head -n 5 uris.txt
https://www.some-made-up-place.com/page1.html
https://www.some-made-up-place.com/page2.html
https://www.some-made-up-place.com/page3.html
https://www.some-made-up-place.com/page4.html
https://www.some-made-up-place.com/page5.html
 
$ cat uris.txt | time xargs wget
...
Total wall clock time: 3.7s
Downloaded: 60 files, 625K in 0.7s (870 KB/s)
        3.73 real         0.03 user         0.09 sys

If we need to speed things up we can always use the ‘-P’ flag of xargs to do so:

$ cat uris.txt | time xargs -n1 -P10 wget
        1.65 real         0.20 user         0.21 sys

It pays to be reasonably sensible when using tools like this and of course read the terms and conditions of the site to check what they have to say about downloading copies of pages for personal use.

Given that you can get the pages using a web browser anyway it’s generally fine but it makes sense not to bombard their site with requests for every single page and instead just focus on the data you’re interested in.

Written by Mark Needham

September 19th, 2013 at 6:55 am

Posted in Data Science

Tagged with

Micro Services Style Data Work Flow

with 2 comments

Having worked on a few data related applications over the last ten months or so, Ashok and I were recently discussing some of the things that we’ve learnt.

One of the things he pointed out is that it’s very helpful to separate the different stages of a data work flow into their own applications/scripts.

I decided to try out this idea with some football data that I’m currently trying to model and I ended up with the following stages:

Data workflow

The stages do the following:

  • Find – Finds web pages which have the data we need and writes the URLs of those pages to a text file.
  • Download – Reads in the URLs and downloads the contents to the file system.
  • Extract – Reads in the web pages from the file system and using CSS selectors extracts appropriate data and saves JSON files to disk.
  • Import – Reads in the JSON files and creates nodes/relationships in neo4j.

It’s reasonably similar to micro services except instead of using HTTP as the protocol between each part we use text files as the interface between different scripts.

In fact it’s more like a variation of Unix pipelining as described in The Art of Unix Programming except we store the results of each stage of the pipeline instead of piping them directly into the next one.
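
To make that concrete, a run of the work flow might look something like this from the shell – the script names are made up, but the important bit is that each stage only reads the files written by the one before it:

$ ruby find.rb > urls.txt                # Find: write out the URLs we care about
$ xargs -n1 wget -P pages/ < urls.txt    # Download: save the raw HTML locally
$ ruby extract.rb pages/ json/           # Extract: CSS selectors -> JSON files on disk
$ ruby import.rb json/                   # Import: create nodes/relationships in neo4j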

If following the Unix way isn’t enough of a reason to split up the problem like this there are a couple of other reasons why this approach is useful:

  • We end up tweaking some parts more than others, so it’s good if we don’t have to run all the steps each time we make a change. e.g. I find that I spend much more time in the extract & import stages than in the other two; once I’ve got the script for getting all the data written it doesn’t seem to change that substantially.
  • We can choose the appropriate technology to do each of the jobs. In this case I find that any data processing is much easier to do in Ruby but the data import is significantly quicker if you use the Java API.
  • We can easily make changes to the work flow if we find a better way of doing things.

That third advantage became clear to me on Saturday when I realised that waiting 3 minutes for the import stage to run each time was becoming quite frustrating.

All node/relationship creation was happening via the REST interface from a Ruby script since that was the easiest way to get started.

I was planning to plugin some Java code using the batch importer to speed things up until Ashok pointed me to a CSV driven batch importer which seemed like it might be even better.

That batch importer takes CSV files of nodes and edges as its input so I needed to add another stage to the work flow if I wanted to use it:

Data workflow 2

I spent a few hours working on the ‘Extract to CSV’ stage and then replaced the initial ‘Import’ script with a call to the batch importer.

It now takes 1.3 seconds to go through the last two stages instead of 3 minutes for the old import stage.

Since all I added was another script that took a text file as input and created text files as output it was really easy to make this change to the work flow.

I’m not sure how well this scales if you’re dealing with massive amounts of data but you can always split the data up into multiple files if the size becomes unmanageable.

Written by Mark Needham

February 18th, 2013 at 10:16 pm

Posted in Data Science

Tagged with

Data Science: Don’t filter data prematurely

without comments

Last year I wrote a post describing how I’d gone about getting data for my ThoughtWorks graph and, in retrospect, one mistake in my approach is that I filtered the data too early.

My workflow looked like this:

  • Scrape internal application using web driver and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j

The problem with the first step is that I was trying to determine up front what data was useful and as a result I ended up running the scraping application multiple times when I realised I didn’t have all the data I wanted.

Since it took a couple of hours to run each time it was tremendously frustrating but it took me a while to realise how flawed my approach was.

For some reason I kept tweaking the scraper just to get a little bit more data each time!

It wasn’t until Ashok and I were doing some similar work and had to extract data from an existing database that I realised the filtering didn’t need to be done so early in the process.

We weren’t sure exactly what data we needed but on this occasion we got everything around the area we were working in and looked at how we could actually use it at a later stage.

Given that it’s relatively cheap to store the data I think this approach makes sense more often than not – we can always delete the data if we realise it’s not useful to us at a later stage.

It especially makes sense if it’s difficult to get more data either because it’s time consuming or we need someone else to give us access to it and they are time constrained.

If I could rework that work flow it’d now be split into three steps:

  • Scrape internal application using web driver and save pages as HTML documents
  • Parse HTML documents and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j

I think my experiences tie in reasonably closely with those I heard about at Strata Conf London but of course I may well be wrong so if anyone has other points of view I’d love to hear them.

Written by Mark Needham

February 17th, 2013 at 8:02 pm

Posted in Data Science

Tagged with

Data Science: Discovery work

with one comment

Aaron Erickson recently wrote a blog post where he talks through some of the problems he’s seen with big data initiatives where organisations end up buying a product and expecting it to magically produce results.

[…] corporate IT departments are suddenly looking at their long running “Business Intelligence” initiatives and wondering why they are not seeing the same kinds of return on investment. They are thinking… if only we tweaked that “BI” initiative and somehow mix in some “Big Data”, maybe *we* could become the next Amazon.

He goes on to suggest that a more ‘agile’ approach might be more beneficial whereby we drive our work from a business problem with a small team in a short discovery exercise. We can then build on top of that if we’re seeing good results.

A few months ago Ashok and I were doing this type of work for one of our clients and afterwards we tried to summarise how it differed to a normal project.

Hacker Mentality

Since the code we’re writing is almost certainly going to be throwaway it doesn’t make sense to spend a lot of time making it beautiful. It just needs to work.

We didn’t spend any time setting up a continuous integration server or a centralised source control repository since there were only two of us. These things make sense when you have a bigger team and more time but for this type of work it feels overkill.

Most of the code we wrote was in Ruby because that was the language in which we could hack together something useful in the least amount of time but I’m sure others could go just as fast in other languages. We did, however, end up moving some of the code to Java later on after realising the performance gains we’d get from doing so.

2 or 3 hour ‘iterations’

As I mentioned in a previous post we took the approach of finding questions that we wanted the answers to and then spending a few hours working on those before talking to our client again.

Since we don’t really know what the outcome of our discovery work is going to be we want to be able to quickly change direction and not go down too many rabbit holes.

1 or 2 weeks in total

We don’t have any data to prove this but it seems like you’d need a week or two to iterate through enough ideas that you’d have a reasonable chance of coming up with something useful.

It took us 4 days before we zoomed in on something that was useful to the client and allowed them to learn something that they didn’t previously know.

If we do find something worth pursuing then we’d want to bake that work into the normal project backlog and then treat it the same as any other piece of work, driven by priority and so on.

Small team

You could argue that small teams are beneficial all the time but it’s especially the case here if we want to keep the feedback cycle tight and the communication overhead low.

Our thinking was that 2 or 3 people would probably be sufficient where 2 of the people would be developers and 1 might be someone with a UX background to help do any visualisation work.

If the domain was particularly complex then that 3rd person could be someone with experience in that area who could help derive useful questions to answer.

Written by Mark Needham

December 9th, 2012 at 10:36 am

Posted in Data Science

Tagged with

Nygard Big Data Model: The Investigation Stage

without comments

Earlier this year Michael Nygard wrote an extremely detailed post about his experiences in the world of big data projects and included in the post was the following diagram which I’ve found very useful.

Nygard’s Big Data Model (shamelessly borrowed by me because it’s awesome)

Ashok and I have been doing some work in this area helping one of our clients make sense of and visualise some of their data and we realised retrospectively that we were acting very much in the investigation stage of the model.

In particular Nygard makes the following suggestions about the way that we work when we’re in this mode:

We don’t want to invest in fully automated machine learning and feedback. That will follow once we validate a hypothesis and want to integrate it into our routine operation.

Ad-hoc analysis refers to human-based data exploration. This can be as simple as spreadsheets with line graphs…the key aspect is that most of the tools are interactive. Questions are expressed as code, but that code is usually just “one shot” and is not meant for production operations.

In our case we weren’t doing anything as complicated as machine learning – most of our work was working out the relationships between things, how best to model those and what visualisations would best describe what we were seeing.

We didn’t TDD any of the code, we copy/pasted a lot, and when we had a longer running query we didn’t try to optimise it there and then; instead we ran it once, saved the results to a file and then used the file to load it onto the UI.

We were able to work in iterations of 2/3 hours during which we tried to answer a question (or more than one if we had time), then showed our client what we’d managed to do and decided where we wanted to go next.

To start with we did all this with a subset of the actual data set and then once we were on the right track we loaded in the rest of the data.

We can easily get distracted by the difficulties of loading large amounts of data before checking whether what we’re doing makes sense.

We iterated 4 or 5 different ideas before we got to one that allowed us to explore an area which hadn’t previously been explored.

Now that we’ve done that we’re rewriting the application from scratch, still using the same ideas as from the initial prototype, but this time making sure the queries can run on the fly and making the code a bit closer to production quality!

We’ve moved into the implementation stage of the model for this avenue although, if I understand the model correctly, it would be ok to go back into investigation mode if we want to do some discovery work with other parts of the data.

I’m probably quoting this model way too much to people that I talk to about this type of work but I think it’s really good so nice work Mr Nygard!

Written by Mark Needham

October 10th, 2012 at 12:00 am

Posted in Data Science

Tagged with

Strata Conf London: Day 2 Wrap Up

without comments

Yesterday I attended the second day of Strata Conf London and these are some of the things I learned from the talks I attended:

  • John Graham-Cumming opened the series of keynotes with a talk describing the problems British Rail had in 1955 when trying to calculate the distances between all train stations, comparing them to the problems we have today.

    British Rail were trying to solve a graph problem at a time when people didn’t know about graphs and Dijkstra’s algorithm hadn’t been invented – it was effectively invented on this project but never publicised. John’s suggestion here was that we need to share the stuff that we’re doing so that people don’t re-invent the wheel.

    He then covered the ways they simplified the problem by dumping partial results to punch cards, partitioning the data & writing really tight code – all things we do today when working with data that’s too big to fit in memory. Our one advantage is that we have lots of computers that we can get to do our work – something that wasn’t the case in 1955.

    There is a book titled ‘A Computer called LEO: Lyons Tea Shops and the world’s first office computer‘ which was recommended by one of the attendees and covers some of the things from the talk.

    The talk is online and worth a watch – he’s pretty entertaining as well as informative!

  • Next up was Alasdair Allan who gave a talk showing some of the applications of different data sources that he’s managed to hack together.

    For example he showed an application which keeps track of where his credit card is being used via his bank’s transaction record and it sends those details to his phone and compares it to his current GPS coordinates.

    If they differ then the card is being used by someone else, and he was actually able to detect fraudulent use of his card more quickly than his bank on one occasion!

    He also got access to the data on an RFID chip of his hotel room swipe card and was able to chart the times at which people went into/came out of the room and make inferences about why some times were more popular than others.

    The final topic covered was how we leak our location too easily on social media platforms – he referenced a paper by some guys at the University of Rochester titled ‘Following your friends and following them to where you are‘ in which the authors showed that it’s quite easy to work out your location just by looking at where your friends currently are.

  • Ben Goldacre did the last keynote in which he covered similar ground as in his TED talk about pharmaceuticals not releasing the results of failed trials.

    I didn’t write down anything from the talk because it takes all your concentration to keep up with him but he’s well worth watching if you get the chance!

  • I attended a panel about how journalists use data and an interesting point was made about being sceptical about data and finding out how the data was actually collected rather than just trusting it.

    Another topic discussed was whether the open data movement might be harmed if people come up with misleading data visualisations – something which is very easy to do.

    If the data is released and people cause harm by making cause/effect claims that don’t actually exist then people might be less inclined to make their data open in the first place.

    We were encouraged to think about where the gaps are in what’s being reported. What isn’t being reported but perhaps should be?

  • The coolest thing I saw at Strata was the stuff that Narrative Science are doing – they have developed some software which is able to take in a load of data and convert it into an article describing the data.

    We were shown examples of this being done for football matches, company reports and even feedback on your performance in an exam, suggesting areas in which you need to focus your future study.

    Wired had an article a few months ago where they interviewed Kristian Hammond, one of the co-founders and the guy who gave this talk.

    I have no idea how they’re doing what they’re doing but it’s very very clever!

  • I’d heard about DataSift before coming to Strata – they are one of the few companies that has access to the twitter fire hose and have previously been written up on the High Scalability blog – but I still wanted to see it in action!

    The talk was focused around five challenges DataSift have had:

    There was a very meta demo where the presenter showed DataSift’s analysis of the strataconf hash tag which suggested that 60% of tweets showed no emotion but 15% were extremely enthusiastic – ‘that must be the Americans’.

  • I then went to watch another talk by Alasdair Allan – this time pairing with a colleague of his, Zena Wood, talking about the work they’re doing at the University of Exeter.

    It mostly focused on tracking movement around the campus based on which wifi mast your mobile phone was currently closest to and allowed them to make some fascinating observations.

    e.g. Alasdair often took a different route out of the campus which was apparently because that route was more scenic. However, he would only take it if it was sunny!

    They discussed some of the questions they want to answer with the work they’re doing such as:

    • Do people go to lectures if they’re on the other side of the campus?
    • How does the campus develop? Are new buildings helping build the social network of students?
    • Is there a way to stop freshers’ flu from spreading?
  • The last talk I went to was by Thomas Levine of ScraperWiki talking about different tools he uses to clean up the data he’s working with.

    There were references to ‘head’, ‘tail’, ‘tr’ and a few other Unix tools and a Python library called unidecode which is able to convert Unicode data into ASCII.

    He then moved onto tools for converting PDFs into something more digestible and mentioned pdftohtml, pdftotext and inkscape.

    He suggested saving any data you’re working with into a database rather than working with raw files – CouchDB was his preference and he’s also written a document like interface over SQLite called DumpTruck.

    In the discussion afterwards someone mentioned Apache Tika which is a tool for extracting meta data using parser libraries. It looks neat as well.

  • A general trend at this conference was that some of the talks ended up feeling quite salesy and some presenters would only describe what they were doing up to a certain point at which the rest effectively became ‘magic’.

    I found this quite strange because in software conferences that I’ve attended people are happy to explain everything to you but I think here the ‘magic’ is actually how people are making money so it doesn’t make sense to expose it.

Overall it was an enjoyable couple of days and it was especially fascinating to see the different ways that people have come up with for exploring and visualising data and creating useful applications on top of that.

Written by Mark Needham

October 3rd, 2012 at 6:46 am

Posted in Data Science

Tagged with

Strata Conf London: Day 1 Wrap Up

with one comment

For the past couple of days I attended the first Strata Conf to be held in London – a conference which seems to bring together people from the data science and big data worlds to talk about the stuff they’re doing.

Since I’ve been playing around with a couple of different things in this area over the last 4/5 months I thought it’d be interesting to come along and see what people much more experienced in this area had to say!

  • My favourite talk of the morning was by Jake Porway talking about his company DataKind – “an organisation that matches data from non-profit and government organisations with data scientists”.

    In particular he focused on data dive – weekend events DataKind run where they bring together NGOs who have data they want to explore and data scientists/data hackers/statisticians who can help them find some insight in the data.

    There was an event in London last weekend and there’s an extensive write up on one that was held in Chicago earlier in the year.

    Jake also had some good tips for working with data which he shared:

    • Start with a question not with the data
    • Team up with someone who knows the story of the data
    • Visualisation is a process not an end – need tools that allow you to explore the data
    • You don’t need big data to have big insights

    Most of those tie up with what Ashok and I have been learning in the stuff we’ve been working on but Jake put it much better than I could!

  • Jeni Tennison gave an interesting talk about the Open Data Institute – an organisation that I hadn’t heard about until the talk. Their goal is to help people find value from the Open Government Data that’s now being made available.

    There’s an Open Data Hack Day in London on October 25th/26th being run by these guys which sounds like it could be pretty cool.

    Jeni had another talk on the second day where I believe she went into more detail about how they are going about making government data publicly available, including the data of legislation.gov.uk.

  • Simon Rogers of the Guardian and Kathryn Hurley of Google gave a talk about the Guardian Data blog where Kathryn had recently spent a week working.

    Simon started out by talking about the importance of knowing what stories matter to you and your audience before Kathryn rattled through a bunch of useful tools for doing this type of work.

    Some of the ones I hadn’t heard of were Google Refine and Data Wrangler for cleaning up data, Google Data Explorer for finding interesting data sets to work with and finally CartoDB, DataWrapper and Tableau for creating data visualisations.

  • In the afternoon I saw a very cool presentation demonstrating Emoto 2012 – a bunch of visualisations done using London 2012 Olympics data.

    It particularly focused around sentiment analysis – working out the positive/negative sentiment of tweets – which the guys did using the Lexalytics Salience Engine.

    One of the more amusing examples showed the emotion of tweets about Ryan Lochte suddenly going very negative when he admitted to peeing in the pool.

  • Noel Welsh gave a talk titled ‘Making Big Data Small’ in which he ran through different streaming/online algorithms which we can use to work out things like the most frequent items or to learn classifiers/recommendation systems.

    It moved pretty quickly so I didn’t follow everything but he did talk about Hash functions, referencing the Murmur Hash 3 algorithm and also talked about the stream-lib library which has some of the other algorithms mentioned.

    Alex Smola’s blog was suggested as a good resource for learning more about this topic as well.

  • Edmund Jackson then gave an interesting talk about using clojure to do everything you’d want to do in the data science arena from quickly hacking something to building a production ready piece of machine learning code.

    He spent a bit of time at the start of the talk explaining the mathematical and platform problems that we face when working in this area and suggested that clojure sits nicely on the intersection.

    If we need to do anything statistics related we can use incanter; weka and Mahout give us machine learning algorithms; we can use JBLAS to do linear algebra; and cascalog is available to run queries on top of Hadoop.

    On top of that if we want to try some code out on a bit of data we have an easily accessible REPL and if we later need to make our code run in parallel it should be reasonably easy to do.

  • Jason McFall gave a talk about establishing cause and effect from data which was a good refresher in statistics for me and covered similar ground to some of the Statistics One course on coursera.

    In particular he talked about the danger of going on a fishing expedition where we decide what it is we want to conclude from our data and then go in search of things to support that conclusion.

    We also need to make sure we connect all the data sets – sometimes we can make wrong conclusions about something but when we have all the data that conclusion no longer makes sense.

    Think Stats was suggested as a good book for learning more in this area.

  • The last talk I saw was by Max Gadney talking about the work he’s done for the Government Digital Service (GDS) building a dashboard for departmental data, and for UEFA providing insight to users about what’s happening in a match.

    I’d seen some of the GDS stuff before but Max has written it up pretty extensively on his blog as well, so it was the UEFA stuff that intrigued me more!

    In particular he developed an ‘attacking algorithm’ which filtered through the masses of data they had and was able to determine which team had the attacking momentum – it was especially interesting to see how much Real Madrid dominated against Manchester City when they played each other a couple of weeks ago.

Written by Mark Needham

October 2nd, 2012 at 11:42 pm

Posted in Data Science

Tagged with

Data Science: Making sense of the data

with one comment

Over the past month or so Ashok and I have been helping one of our clients explore and visualise some of their data and one of the first things we needed to do was make sense of the data that was available.

Start small

Ashok suggested that we work with a subset of our eventual data set so that we could get a feel for the data and quickly see whether what we were planning to do made sense.

Although our data set isn’t at ‘big data’ levels – we’re working with hundreds of thousands/millions of data points rather than the terabytes of data I often read about – I think this worked out well for us.

The data we’re working with was initially stored in a SQL Server database so the quickest way to get moving was to export it as CSV files and then work with those.

One problem we had with that approach was that we hadn’t realised that some product names could have commas in them and since we were using a comma as our field separator this led to products being imported with some quite strange properties!

Since we only had a few thousand records we were able to see that quickly even when running queries which returned all records.
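
One cheap check I’d now add before importing – this is a suggestion rather than something we did at the time, and ‘products.csv’ is just a stand-in name for the export – is to count the fields on each line; any row with a different count from the rest almost certainly has a rogue separator in it:

> table(count.fields("products.csv", sep = ",", quote = ""))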

We’ve been asking questions of the data and with the small data set we were able to very quickly get an answer to these questions for a few records and decide whether it would be interesting to find the answer for the whole data set.

Visual verification of the data

Another useful, and in hindsight obvious, technique is to spend a little bit of time skimming over the data and looking for any patterns which stand out when visually scanning the file.

For example, one import that we did had several columns with NULL values in and we initially tried to parse the file and load it into neo4j.

After going back and skimming the file we realised that we hadn’t understood how one of the domain concepts worked and those NULL values did actually make sense and we’d need to change our import script.

On another occasion skimming the data made it clear that we’d made a mistake with the way we’d exported the data and we had to go back and try again.

I’m sure there are other useful techniques we can use when first playing around with a new data set so feel free to point those out in the comments!

Written by Mark Needham

September 30th, 2012 at 2:58 pm

Posted in Data Science

Tagged with