Mark Needham

Thoughts on Software Development

Archive for the ‘Data Science’ Category

Nygard Big Data Model: The Investigation Stage

Earlier this year Michael Nygard wrote an extremely detailed post about his experiences in the world of big data projects and included in the post was the following diagram which I’ve found very useful.

Nygard’s Big Data Model (shamelessly borrowed by me because it’s awesome)

Ashok and I have been doing some work in this area, helping one of our clients make sense of and visualise some of their data, and we realised retrospectively that we were acting very much in the investigation stage of the model.

In particular Nygard makes the following suggestions about the way that we work when we’re in this mode:

We don’t want to invest in fully automated machine learning and feedback. That will follow once we validate a hypothesis and want to integrate it into our routine operation.

Ad-hoc analysis refers to human-based data exploration. This can be as simple as spreadsheets with line graphs…the key aspect is that most of the tools are interactive. Questions are expressed as code, but that code is usually just “one shot” and is not meant for production operations.

In our case we weren’t doing anything as complicated as machine learning – most of our work was working out the relationships between things, how best to model those and what visualisations would best describe what we were seeing.

We didn’t TDD any of the code, we copy/pasted a lot and when we had a longer running query we didn’t try and optimise it there and then, instead we ran it once, saved the results to a file and then used the file to load it onto the UI.

We were able to work in iterations of two to three hours, during which we tried to answer a question (or more than one if we had time), showed our client what we'd managed to do and then decided where we wanted to go next.

To start with we did all this with a subset of the actual data set and then once we were on the right track we loaded in the rest of the data.

We can easily get distracted by the difficulties of loading large amounts of data before checking whether what we’re doing makes sense.

We iterated through 4 or 5 different ideas before we got to one that allowed us to explore an area which hadn't previously been explored.

Now that we’ve done that we’re rewriting the application from scratch, still using the same ideas as from the initial prototype, but this time making sure the queries can run on the fly and making the code a bit closer to production quality!

We’ve moved into the implementation stage of the model for this avenue although if I understand the model correctly, it would be ok to go back into investigation mode if we want to do some discovery work with others parts of the data.

I’m probably quoting this model way too much to people that I talk to about this type of work but I think it’s really good so nice work Mr Nygard!

Written by Mark Needham

October 10th, 2012 at 12:00 am

Posted in Data Science

Strata Conf London: Day 2 Wrap Up

Yesterday I attended the second day of Strata Conf London and these are some of the things I learned from the talks I attended:

  • John Graham-Cumming opened the series of keynotes with a talk describing the problems British Rail had in 1955 when trying to calculate the distances between all train stations, comparing them to the problems we have today.

    British Rail were trying to solve a graph problem at a time when people didn't know about graphs and Dijkstra's algorithm hadn't yet been invented; it was effectively invented on this project but never publicised. John's suggestion here was that we need to share the things we're doing so that people don't reinvent the wheel.

    He then covered the ways they simplified the problem by dumping partial results to punch cards, partitioning the data & writing really tight code – all things we do today when working with data that’s too big to fit in memory. Our one advantage is that we have lots of computers that we can get to do our work – something that wasn’t the case in 1955.

    There is a book titled ‘A Computer called LEO: Lyons Tea Shops and the world’s first office computer’ which was recommended by one of the attendees and covers some of the things from the talk.

    The talk is online and worth a watch; he's pretty entertaining as well as informative!

  • Next up was Alasdair Allan who gave a talk showing some of the applications of different data sources that he’s managed to hack together.

    For example, he showed an application which keeps track of where his credit card is being used via his bank's transaction record, sends those details to his phone and compares them to his current GPS coordinates.

    If they differ then the card is being used by someone else, and on one occasion he was actually able to detect fraudulent use of his card more quickly than his bank did! (A rough sketch of that distance check is included after this list.)

    He also got access to the data on an RFID chip of his hotel room swipe card and was able to chart the times at which people went into/came out of the room and make inferences about why some times were more popular than others.

    The final topic covered was how we leak our location too easily on social media platforms – he referenced a paper by some guys at the University of Rochester titled ‘Following your friends and following them to where you are’ in which the authors showed that it's quite easy to work out your location just by looking at where your friends currently are.

  • Ben Goldacre did the last keynote in which he covered similar ground as in his TED talk about pharmaceuticals not releasing the results of failed trials.

    I didn’t write down anything from the talk because it takes all your concentration to keep up with him but he’s well worth watching if you get the chance!

  • I attended a panel about how journalists use data and an interesting point was made about being sceptical about data and finding out how the data was actually collected rather than just trusting it.

    Another topic discussed was whether the open data movement might be harmed if people come up with misleading data visualisations – something which is very easy to do.

    If the data is released and people cause harm by claiming cause/effect relationships that don't actually exist, then people might be less inclined to make their data open in the first place.

    We were encouraged to think about where the gaps are in what’s being reported. What isn’t being reported but perhaps should be?

  • The coolest thing I saw at Strata was the stuff that Narrative Science are doing – they have developed some software which is able to take in a load of data and convert it into an article describing the data.

    We were shown examples of this being done for football matches and company reports, and even for giving feedback on your performance in an exam and suggesting areas on which to focus your future study.

    Wired had an article a few months ago in which they interviewed Kristian Hammond, one of the co-founders and the guy who gave this talk.

    I have no idea how they’re doing what they’re doing but it’s very very clever!

  • I’d heard about DataSift before coming to Strata – they are one of the few companies that has access to the twitter fire hose and have previously been written up on the High Scalability blog – but I still wanted to see it in action!

    The talk was structured around five challenges that DataSift have faced.

    There was a very meta demo in which the presenter showed DataSift's analysis of the strataconf hashtag, which suggested that 60% of tweets showed no emotion but 15% were extremely enthusiastic ('that must be the Americans').

  • I then went to watch another talk by Alasdair Allan – this time pairing with a colleague of his, Zena Wood, talking about the work they’re doing at the University of Exeter.

    It mostly focused on tracking movement around the campus based on which wifi mast your mobile phone was currently closest to and allowed them to make some fascinating observations.

    For example, Alasdair often took a different route out of the campus, apparently because that route was more scenic. However, he would only take it if it was sunny!

    They discussed some of the questions they want to answer with the work they’re doing such as:

    • Do people go to lectures if they’re on the other side of the campus?
    • How does the campus develop? Are new buildings helping build the social network of students?
    • Is there a way to stop freshers’ flu from spreading?
  • The last talk I went to was by Thomas Levine of ScraperWiki talking about different tools he uses to clean up the data he’s working with.

    There were references to ‘head’, ‘tail’, ‘tr’ and a few other Unix tools and a Python library called unidecode which is able to convert Unicode data into ASCII.

    He then moved onto tools for converting PDFs into something more digestible and mentioned pdftohtml, pdftotext and inkscape.

    He suggested saving any data you're working with into a database rather than working with raw files – CouchDB was his preference, and he's also written a document-like interface over SQLite called DumpTruck.

    In the discussion afterwards someone mentioned Apache Tika, a tool for extracting metadata using parser libraries. It looks neat as well.

  • A general trend at this conference was that some of the talks ended up feeling quite salesy, and some presenters would only describe what they were doing up to a certain point, after which the rest effectively became ‘magic’.

    I found this quite strange because in software conferences that I’ve attended people are happy to explain everything to you but I think here the ‘magic’ is actually how people are making money so it doesn’t make sense to expose it.
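
Alasdair didn't go into the code behind his credit card hack, but the check he described essentially comes down to comparing two pairs of coordinates. Here's a rough Python sketch of that idea – the 50km threshold is arbitrary and the coordinates are assumed to come from the bank's feed and the phone's GPS:

    from math import asin, cos, radians, sin, sqrt


    def distance_km(lat1, lon1, lat2, lon2):
        """Great-circle (haversine) distance between two points, in kilometres."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))


    def looks_fraudulent(txn_lat, txn_lon, phone_lat, phone_lon, threshold_km=50):
        """Flag a transaction whose location is a long way from the phone's GPS fix."""
        return distance_km(txn_lat, txn_lon, phone_lat, phone_lon) > threshold_km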

Overall it was an enjoyable couple of days and it was especially fascinating to see the different ways that people have come up with for exploring and visualising data and creating useful applications on top of that.

Written by Mark Needham

October 3rd, 2012 at 6:46 am

Posted in Data Science

Strata Conf London: Day 1 Wrap Up

For the past couple of days I attended the first Strata Conf to be held in London – a conference which seems to bring together people from the data science and big data worlds to talk about the stuff they’re doing.

Since I’ve been playing around with a couple of different things in this area over the last 4/5 months I thought it’d be interesting to come along and see what people much more experienced in this area had to say!

  • My favourite talk of the morning was by Jake Porway talking about his company DataKind – “an organisation that matches data from non-profit and government organisations with data scientists”.

    In particular he focused on data dives – weekend events DataKind run where they bring together NGOs who have data they want to explore and data scientists/data hackers/statisticians who can help them find some insight in the data.

    There was an event in London last weekend and there’s an extensive write up on one that was held in Chicago earlier in the year.

    Jake also had some good tips for working with data which he shared:

    • Start with a question not with the data
    • Team up with someone who knows the story of the data
    • Visualisation is a process not an end – need tools that allow you to explore the data
    • You don’t need big data to have big insights

    Most of those tie up with what Ashok and I have been learning in the stuff we’ve been working on but Jake put it much better than I could!

  • Jeni Tennison gave an interesting talk about the Open Data Institute – an organisation that I hadn’t heard about until the talk. Their goal is to help people find value from the Open Government Data that’s now being made available.

    There’s an Open Data Hack Day in London on October 25th/26th being run by these guys which sounds like it could be pretty cool.

    Jeni had another talk on the second day where I believe she went into more detail about how they are going about making government data publicly available, including the data of legislation.gov.uk.

  • Simon Rogers of the Guardian and Kathryn Hurley of Google gave a talk about the Guardian Data blog, where Kathryn had recently spent a week working.

    Simon started out by talking about the importance of knowing what stories matter to you and your audience before Kathryn rattled through a bunch of useful tools for doing this type of work.

    Some of the ones I hadn’t heard of were Google Refine and Data Wrangler for cleaning up data, Google Data Explorer for finding interesting data sets to work with and finally CartoDB, DataWrapper and Tableau for creating data visualisations.

  • In the afternoon I saw a very cool presentation demonstrating Emoto 2012 – a bunch of visualisations done using London 2012 Olympics data.

    It focused particularly on sentiment analysis – working out the positive/negative sentiment of tweets – which the guys used the Lexalytics Salience Engine to do.

    One of the more amusing examples showed the emotion of tweets about Ryan Lochte suddenly going very negative when he admitted to peeing in the pool.

  • Noel Welsh gave a talk titled ‘Making Big Data Small’ in which he ran through different streaming/online algorithms which we can use to work out things like the most frequent items or to learn classifiers/recommendation systems.

    It moved pretty quickly so I didn't follow everything, but he did talk about hash functions, referencing the MurmurHash3 algorithm, and also mentioned the stream-lib library which implements some of the other algorithms he covered. (A toy version of the frequent-items idea is included after this list.)

    Alex Smola’s blog was suggested as a good resource for learning more about this topic as well.

  • Edmund Jackson then gave an interesting talk about using Clojure to do everything you'd want to do in the data science arena, from quickly hacking something together to building a production-ready piece of machine learning code.

    He spent a bit of time at the start of the talk explaining the mathematical and platform problems that we face when working in this area and suggested that Clojure sits nicely at the intersection of the two.

    If we need to do anything statistics-related we can use Incanter; Weka and Mahout give us machine learning algorithms; JBLAS covers linear algebra; and Cascalog is available to run queries on top of Hadoop.

    On top of that if we want to try some code out on a bit of data we have an easily accessible REPL and if we later need to make our code run in parallel it should be reasonably easy to do.

  • Jason McFall gave a talk about establishing cause and effect from data, which was a good statistics refresher for me and covered similar ground to parts of the Statistics One course on Coursera.

    In particular he talked about the danger of going on a fishing expedition where we decide what it is we want to conclude from our data and then go in search of things to support that conclusion.

    We also need to make sure we connect all the data sets – sometimes we can draw the wrong conclusion from a partial data set, but once we have all the data that conclusion no longer makes sense.

    Think Stats was suggested as a good book for learning more in this area.

  • The last talk I saw was by Max Gadney, talking about the work he's done for the Government Digital Service (GDS) building a dashboard for departmental data, and for UEFA providing insight to users about what's happening in a match.

    I’d seen some of the GDS stuff before but Max has written it up pretty extensively on his blog as well so it was the uefa stuff that intrigued me more!

    In particular he developed an ‘attacking algorithm’ which filtered through the masses of data they had and was able to determine which team had the attacking momentum – it was especially interesting to see how much Real Madrid dominated against Manchester City when they played each other a couple of weeks ago.
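
As a footnote to Noel's talk: it moved too fast for me to capture any of the algorithms properly, but the frequent-items idea can be illustrated with the classic Misra-Gries summary (not necessarily the algorithm he showed). It keeps at most k-1 counters no matter how long the stream is, so it never needs to hold the full data set in memory:

    def misra_gries(stream, k):
        """Approximate heavy hitters: keep at most k-1 counters over the stream."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement every counter and drop the ones that hit zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters


    # Any item occurring more than len(stream) / k times is guaranteed to survive.
    print(misra_gries("aababcabad", k=3))  # {'a': 3, 'b': 1}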

Written by Mark Needham

October 2nd, 2012 at 11:42 pm

Posted in Data Science

Data Science: Making sense of the data

Over the past month or so Ashok and I have been helping one of our clients explore and visualise some of their data and one of the first things we needed to do was make sense of the data that was available.

Start small

Ashok suggested that we work with a subset of our eventual data set so that we could get a feel for the data and quickly see whether what we were planning to do made sense.

Although our data set isn’t at ‘big data’ levels – we’re working with hundreds of thousands/millions of data points rather than the terabytes of data I often read about – I think this worked out well for us.

The data we’re working with was initially stored in a SQL Server database so the quickest way to get moving was to export it as CSV files and then work with those.

One problem we had with that approach was that we hadn't realised that some product names could have commas in them, and since we were using a comma as our field separator this led to products being imported with some quite strange properties!

Since we only had a few thousand records we were able to see that quickly even when running queries which returned all records.
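
One way to avoid that kind of problem is to quote every field (or use a different separator) when exporting. A small sketch using Python's csv module, with made-up column names:

    import csv

    rows = [
        {"product": "Widget, Deluxe Edition", "price": "9.99"},  # name contains a comma
        {"product": "Plain Widget", "price": "4.99"},
    ]

    # QUOTE_ALL wraps every field in quotes so embedded commas no longer break the columns.
    with open("products.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["product", "price"], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(rows)

    with open("products.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row["product"], row["price"])  # the comma survives intact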

We’ve been asking questions of the data and with the small data set we were able to very quickly get an answer to these questions for a few records and decide whether it would be interesting to find the answer for the whole data set.

Visual verification of the data

Another useful, and in hindsight obvious, technique is to spend a little bit of time skimming over the data and looking for any patterns which stand out when visually scanning the file.

For example, one import that we did had several columns with NULL values in them, and we initially tried to parse the file and load it into neo4j.

After going back and skimming the file we realised that we hadn't understood how one of the domain concepts worked; those NULL values did actually make sense and we'd need to change our import script.

On another occasion skimming the data made it clear that we’d made a mistake with the way we’d exported the data and we had to go back and try again.
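
That kind of skim can be made a little more systematic with a few lines of code, for example by counting the empty/NULL values per column before writing any import script. A sketch, with the file name and the NULL convention assumed:

    import csv
    from collections import Counter

    nulls = Counter()
    total = 0
    with open("export.csv", newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            for column, value in row.items():
                if value in ("", "NULL"):
                    nulls[column] += 1

    # Columns that are mostly NULL often point at a domain concept we haven't understood yet.
    for column, count in nulls.most_common():
        print(f"{column}: {count}/{total} empty")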

I’m sure there are other useful techniques we can use when first playing around with a new data set so feel free to point those out in the comments!

Written by Mark Needham

September 30th, 2012 at 2:58 pm

Posted in Data Science

Data Science: Scraping the data together

On Friday Martin, Darren and I were discussing the ThoughtWorks graph that I was working on earlier in the year and Martin pointed out that an interesting aspect of this type of work is that the data you want to work with isn’t easily available.

You therefore need to find a way to scrape the data together to make some headway; then maybe at a later stage, once some progress has been made, it will become easier to replace that with a cleaner solution.

In this case I became curious about exploring the relationships between people in ThoughtWorks but there aren’t any APIs on our internal systems so I had to find another way to get the data that I wanted.

The obvious way to do that was to get a copy of the database used by our internal staffing system, but I didn't know anybody who worked on that team, so getting the data that way would have been slow and I'd have lost my initial enthusiasm.

My only alternative was to go via our staffing application and derive the data that way.

I ended up writing some Selenium scripts to crawl the application for people, projects and clients and save that data to JSON files which I later parsed to build up the graph.
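
The scripts were specific to our internal application so there's not much point sharing them, but a crawl-and-save loop of that kind looks roughly like the sketch below – the URL and the selectors here are placeholders rather than the real ones:

    import json

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://staffing.example.com/people")  # placeholder URL

    people = []
    for row in driver.find_elements(By.CSS_SELECTOR, "table.people tr.person"):  # placeholder selector
        cells = row.find_elements(By.TAG_NAME, "td")
        people.append({"name": cells[0].text, "project": cells[1].text, "client": cells[2].text})

    driver.quit()

    # Save the scraped records so the graph-building step can parse them later.
    with open("people.json", "w") as f:
        json.dump(people, f, indent=2)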

The other bit of data that I was curious about was the sponsor relationships inside the company, which are kept in a Google spreadsheet.

I wasn’t allowed access to that spreadsheet until I was able to show what I was going to use the data for so I first needed to put together something using the other data I’d screen scrapped.

Once I did get the spreadsheet I spent around 3 hours cleaning the data so I could integrate it with the other data I had.

This involved fixing misspelt names and updating the spreadsheets where I knew that the data was out of date – it’s certainly not very glamorous work but it helped me to get to a visualisation which I wrote about in an earlier post.
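
Some of that cleaning can be semi-automated, for example by getting Python's difflib to suggest the closest canonical spelling for each name and then checking the suggestions by hand. A sketch with invented names:

    from difflib import get_close_matches

    canonical = ["Jane Smith", "Rahul Sharma", "Peter Jones"]          # names as they appear in the staffing data
    from_spreadsheet = ["Jane Smyth", "Rahul Sharmaa", "Peter Jones"]  # names as they appear in the spreadsheet

    for name in from_spreadsheet:
        match = get_close_matches(name, canonical, n=1, cutoff=0.8)
        if match and match[0] != name:
            print(f"{name!r} -> {match[0]!r}")  # candidate fix to review by hand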

I haven’t done a lot of work in this area but I wouldn’t be surprised if it’s common that we have to use relatively guerilla tactics like the above to get us up and running.

Written by Mark Needham

September 30th, 2012 at 1:44 pm

Posted in Data Science
