3 Oct 2012

Strata Conf London: Day 2 Wrap Up

Yesterday I attended the second day of Strata Conf London and these are some of the things I learned from the talks I attended:

John Graham-Cumming opened the series of keynotes with a talk describing the problems British Rail had in 1955 when trying to calculate the distances between all train stations and comparing them to the problems we have today. British Rail were trying to solve a graph problem at a time when people didn’t know about graphs and Dijkstra’s algorithm hadn’t been invented. It was effectively invented on this project but never publicised. John’s suggestion here was that we need to share the stuff that we’re doing so that people don’t re-invent the wheel. He then covered the ways they simplified the problem by dumping partial results to punch cards, partitioning the data & writing really tight code - all things we do today when working with data that’s too big to fit in memory. Our one advantage is that we have lots of computers that we can get to do our work - something that wasn’t the case in 1955. There is a book titled 'A Computer called LEO: Lyons Tea Shops and the world’s first office computer' which was recommended by one of the attendees and covers some of the things from the talk. The talk is online and worth a watch, he’s pretty entertaining as well as informative!
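To make the graph problem a bit more concrete, here is a minimal sketch of computing the shortest distance between every pair of stations by running the textbook version of Dijkstra’s algorithm from each station in turn. This is not how the British Rail team did it (the talk didn’t go into that level of detail), and the station names and mileages are made up for illustration.

```python
import heapq


def shortest_distances(graph, source):
    """Textbook Dijkstra: shortest distance from source to every reachable station."""
    distances = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, station = heapq.heappop(queue)
        if dist > distances.get(station, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, miles in graph[station].items():
            candidate = dist + miles
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances


# Hypothetical stations and mileages, invented for this example.
graph = {
    "A": {"B": 4, "C": 10},
    "B": {"A": 4, "C": 3, "D": 8},
    "C": {"A": 10, "B": 3, "D": 2},
    "D": {"B": 8, "C": 2},
}

# One run per station gives the full table of distances.
all_pairs = {station: shortest_distances(graph, station) for station in graph}
print(all_pairs["A"]["D"])  # 9, via B and C
```

Each station’s run is independent of the others, so the work partitions naturally across machines, which is the advantage we have over 1955.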
Next up was Alasdair Allan who gave a talk showing some of the applications of different data sources that he’s managed to hack together. For example he showed an application which keeps track of where his credit card is being used via his bank’s transaction record, sends those details to his phone and compares them to his current GPS coordinates. If they differ then the card is being used by someone else, and he was actually able to detect fraudulent use of his card more quickly than his bank on one occasion! He also got access to the data on an RFID chip of his hotel room swipe card and was able to chart the times at which people went into/came out of the room and make inferences about why some times were more popular than others. The final topic covered was how we leak our location too easily on social media platforms - he referenced a paper by some guys at the University of Rochester titled 'Finding Your Friends and Following Them to Where You Are' in which the authors showed that it’s quite easy to work out your location just by looking at where your friends currently are.
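The card check boils down to comparing two sets of coordinates. Here is a rough sketch of that comparison using the haversine formula for great-circle distance. I don’t know how Alasdair’s application actually does it, so the function names, the 50 km threshold and the sample coordinates are all assumptions of mine.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def looks_suspicious(card_location, phone_location, threshold_km=50):
    """Flag a transaction made further from the phone than the (assumed) threshold."""
    return haversine_km(*card_location, *phone_location) > threshold_km


# Approximate city-centre coordinates, used purely as sample data.
london = (51.5074, -0.1278)
edinburgh = (55.9533, -3.1883)

print(looks_suspicious(edinburgh, london))  # True: card used far from the phone
print(looks_suspicious(london, london))     # False: card and phone in the same place
```

A real version would also have to allow for online purchases and for the delay between a transaction happening and it showing up in the bank’s records.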
Ben Goldacre did the last keynote in which he covered similar ground to his TED talk about pharmaceutical companies not releasing the results of failed trials. I didn’t write down anything from the talk because it takes all your concentration to keep up with him but he’s well worth watching if you get the chance!
I attended a panel about how journalists use data and an interesting point was made about being sceptical about data and finding out how the data was actually collected rather than just trusting it. Another topic discussed was whether the open data movement might be harmed if people come up with misleading data visualisations - something which is very easy to do. If the data is released and people cause harm by making cause/effect claims that don’t actually exist then people might be less inclined to make their data open in the first place. We were encouraged to think about where the gaps are in what’s being reported. What isn’t being reported but perhaps should be?
The coolest thing I saw at Strata was the stuff that Narrative Science are doing - they have developed some software which is able to take in a load of data and convert it into an article describing the data. We were shown examples of this being done for football matches, company reports and even giving feedback on your performance in an exam and suggesting areas in which you need to focus your future study. Wired had an article a few months ago where they interviewed Kristian Hammond, one of the co-founders and the guy who gave this talk. I have no idea how they’re doing what they’re doing but it’s very very clever!
I’d heard about DataSift before coming to Strata - they are one of the few companies that have access to the Twitter firehose and have previously been written up on the High Scalability blog - but I still wanted to see it in action! The talk was focused around five challenges DataSift have had:
- Digging through unstructured data volumes - they take tweets and convert them into 94 different files using some NLP wizardry. They use Lexalytics Salience Engine to help them do this.
- Filtering - separating the signal from the noise. Popular hash tags end up getting massively spammed so those tweets need to be excluded.
- Analysing - real time filtering and tagging of data. They use the Cloudera Hadoop distribution.
- Variety - integrating data from different sources, e.g. showing the Facebook stock price vs the Twitter sentiment analysis of the company.
- Make it work 24/7
There was a very meta demo where the presenter showed DataSift’s analysis of the strataconf hash tag which suggested that 60% of tweets showed no emotion but 15% were extremely enthusiastic - 'that must be the Americans'.
I then went to watch another talk by Alasdair Allan - this time pairing with a colleague of his, Zena Wood, talking about the work they’re doing at the University of Exeter. It mostly focused on tracking movement around the campus based on which wifi mast your mobile phone was currently closest to and allowed them to make some fascinating observations. For example, Alasdair often took a different route out of the campus which was apparently because that route was more scenic. However, he would only take it if it was sunny! They discussed some of the questions they want to answer with the work they’re doing such as:
- Do people go to lectures if they’re on the other side of the campus?
- How does the campus develop? Are new buildings helping build the social network of students?
- Is there a way to stop freshers' flu from spreading?
The last talk I went to was by Thomas Levine of ScraperWiki talking about different tools he uses to clean up the data he’s working with. There were references to 'head', 'tail', 'tr' and a few other Unix tools and a Python library called unidecode which is able to convert Unicode data into ASCII. He then moved on to tools for converting PDFs into something more digestible and mentioned pdftohtml, pdftotext and inkscape. He suggested saving any data you’re working with into a database rather than working with raw files - CouchDB was his preference and he’s also written a document-like interface over SQLite called DumpTruck. In the discussion afterwards someone mentioned Apache Tika which is a tool for extracting metadata using parser libraries. It looks neat as well.
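As a rough idea of what that clean-then-store workflow can look like, here is a sketch that uses only the Python standard library. It is not unidecode or DumpTruck: the `to_ascii` helper is a crude stand-in that just strips accents, and the `documents` table that stores each row as a JSON string is my own made-up approximation of a document-like layer over SQLite. The sample rows are invented too.

```python
import json
import sqlite3
import unicodedata


def to_ascii(text):
    """Crude ASCII folding: split accented characters apart and drop the accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Invented sample rows standing in for scraped data.
rows = [
    {"name": "Café Zoë", "city": "Málaga"},
    {"name": "Plain Name", "city": "London"},
]
cleaned = [{key: to_ascii(value) for key, value in row.items()} for row in rows]

# Store each cleaned row as a JSON document in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [(json.dumps(row),) for row in cleaned],
)
conn.commit()

stored = [json.loads(body) for (body,) in conn.execute("SELECT body FROM documents ORDER BY id")]
print(stored[0])  # {'name': 'Cafe Zoe', 'city': 'Malaga'}
```

unidecode goes further than this and transliterates non-Latin scripts, whereas this stand-in silently drops any character that has no ASCII decomposition.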
A general trend at this conference was that some of the talks ended up feeling quite salesy and some presenters would only describe what they were doing up to a certain point at which the rest effectively became 'magic'. I found this quite strange because at software conferences that I’ve attended people are happy to explain everything to you but I think here the 'magic' is actually how people are making money so it doesn’t make sense to expose it.
Overall it was an enjoyable couple of days and it was especially fascinating to see the different ways that people have come up with for exploring and visualising data and creating useful applications on top of that.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.