For the past couple of days I attended the first Strata Conf to be held in London – a conference which seems to bring together people from the data science and big data worlds to talk about the stuff they’re doing.
Since I’ve been playing around with a couple of different things in this area over the last four or five months, I thought it’d be interesting to come along and see what people much more experienced in this area had to say!
- My favourite talk of the morning was by Jake Porway talking about his company DataKind – “an organisation that matches data from non-profit and government organisations with data scientists”.
In particular he focused on data dives – weekend events DataKind run where they bring together NGOs who have data they want to explore and data scientists/data hackers/statisticians who can help them find some insight in the data.
There was an event in London last weekend and there’s an extensive write-up on one that was held in Chicago earlier in the year.
Jake also had some good tips for working with data which he shared:
- Start with a question not with the data
- Team up with someone who knows the story of the data
- Visualisation is a process not an end – need tools that allow you to explore the data
- You don’t need big data to have big insights
Most of those tie up with what Ashok and I have been learning in the stuff we’ve been working on but Jake put it much better than I could!
- Jeni Tennison gave an interesting talk about the Open Data Institute – an organisation that I hadn’t heard about until the talk. Their goal is to help people find value from the Open Government Data that’s now being made available.
There’s an Open Data Hack Day in London on October 25th/26th being run by these guys which sounds like it could be pretty cool.
Jeni had another talk on the second day where I believe she went into more detail about how they are going about making government data publicly available, including the data behind legislation.gov.uk.
- Simon Rogers of the Guardian and Kathryn Hurley of Google gave a talk about the Guardian Data blog where Kathryn had recently spent a week working.
Simon started out by talking about the importance of knowing what stories matter to you and your audience before Kathryn rattled through a bunch of useful tools for doing this type of work.
Some of the ones I hadn’t heard of were Google Refine and Data Wrangler for cleaning up data, Google Public Data Explorer for finding interesting data sets to work with, and finally CartoDB, DataWrapper and Tableau for creating data visualisations.
- In the afternoon I saw a very cool presentation demonstrating Emoto 2012 – a bunch of visualisations done using London 2012 Olympics data.
It focused particularly on sentiment analysis – working out the positive/negative sentiment of tweets – which the guys used Lexalytics Salience Engine to do.
One of the more amusing examples showed the emotion of tweets about Ryan Lochte suddenly going very negative when he admitted to peeing in the pool.
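To make the idea of positive/negative scoring a bit more concrete, here is a deliberately tiny sketch of lexicon-based sentiment scoring. It is not how Lexalytics Salience Engine works (I don't know its internals), and the word lists and the `sentiment_score` function are made up purely for illustration.

```python
import re

# Hypothetical mini-lexicons, invented for this example only.
POSITIVE = {"great", "amazing", "love", "win", "brilliant"}
NEGATIVE = {"awful", "gross", "hate", "lose", "terrible"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives
```

A score above zero would count as a positive tweet and below zero as a negative one. A real engine has to cope with negation, sarcasm and context, which a word count like this ignores completely.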
- Noel Welsh gave a talk titled ‘Making Big Data Small’ in which he ran through different streaming/online algorithms which we can use to work out things like the most frequent items or to learn classifiers/recommendation systems.
It moved pretty quickly so I didn’t follow everything, but he did talk about hash functions, referencing the MurmurHash3 algorithm, and also talked about the stream-lib library which has some of the other algorithms mentioned.
Alex Smola’s blog was suggested as a good resource for learning more about this topic as well.
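As an example of what a streaming algorithm for finding the most frequent items can look like, here is a minimal sketch of the Misra-Gries algorithm. I'm not claiming this is one of the algorithms Noel covered or that stream-lib implements it this way; it is just a classic technique of this kind, and the function name `misra_gries` is my own.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Single pass over the stream using at most k - 1 counters.

    Any item that makes up more than 1/k of the stream is guaranteed to
    be among the returned keys. The counts are lower bounds, not exact.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No spare counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # 'a' appears in 6 of 12 positions, so it must survive with k = 3.
    print(misra_gries("abacadabaeab", 3))
```

The appeal is that memory use depends only on `k`, not on the length of the stream or the number of distinct items, which is what makes this kind of approach a way of making big data small.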
- Edmund Jackson then gave an interesting talk about using Clojure to do everything you’d want to do in the data science arena, from quickly hacking something together to building a production-ready piece of machine learning code.
He spent a bit of time at the start of the talk explaining the mathematical and platform problems that we face when working in this area and suggested that Clojure sits nicely at the intersection.
If we need to do anything statistics related we can use Incanter; Weka and Mahout give us machine learning algorithms; we can use JBLAS to do linear algebra; and Cascalog is available to run queries on top of Hadoop.
On top of that if we want to try some code out on a bit of data we have an easily accessible REPL and if we later need to make our code run in parallel it should be reasonably easy to do.
- Jason McFall gave a talk about establishing cause and effect from data which was a good refresher in statistics for me and covered similar ground to some of the Statistics One course on Coursera.
In particular he talked about the danger of going on a fishing expedition where we decide what it is we want to conclude from our data and then go in search of things to support that conclusion.
We also need to make sure we connect all the data sets – sometimes we draw a conclusion from part of the data which no longer makes sense once we have all of it.
Think Stats was suggested as a good book for learning more in this area.
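One well-known way a conclusion can flip once more data is connected is Simpson's paradox. The sketch below is my own illustration rather than an example from Jason's talk. It uses the figures usually quoted from a 1986 kidney stone treatment study, where treatment B looks better overall but treatment A is better for both small and large stones once stone size is taken into account. The names `OUTCOMES`, `group_rate` and `overall_rate` are mine.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, as commonly quoted
# from a 1986 kidney stone treatment study.
OUTCOMES = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def group_rate(treatment: str, group: str) -> Fraction:
    """Success rate for one treatment within one stone-size group."""
    successes, patients = OUTCOMES[treatment][group]
    return Fraction(successes, patients)


def overall_rate(treatment: str) -> Fraction:
    """Success rate for one treatment, ignoring stone size."""
    successes = sum(s for s, _ in OUTCOMES[treatment].values())
    patients = sum(p for _, p in OUTCOMES[treatment].values())
    return Fraction(successes, patients)


if __name__ == "__main__":
    for treatment in OUTCOMES:
        print(treatment, "overall", round(float(overall_rate(treatment)), 3))
        for group in OUTCOMES[treatment]:
            print(treatment, group, round(float(group_rate(treatment, group)), 3))
```

Looking only at the overall figures we'd pick B, but joining in the stone size data reverses that: the harder cases were mostly given treatment A, which drags its overall rate down.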
- The last talk I saw was by Max Gadney talking about the work he’s done for the Government Digital Service (GDS), building a dashboard for departmental data, and for UEFA, providing insight to users about what’s happening in a match.
I’d seen some of the GDS stuff before, and Max has written it up pretty extensively on his blog as well, so it was the UEFA stuff that intrigued me more!
In particular he developed an ‘attacking algorithm’ which filtered through the masses of data they had and was able to determine which team had the attacking momentum – it was especially interesting to see how much Real Madrid dominated against Manchester City when they played each other a couple of weeks ago.