On Tuesday I spoke at the Data Science London meetup about football data and I started out by covering some lessons I’ve learnt about building data sets for personal use when open data isn’t available.
When that’s the case you often end up scraping HTML pages to extract the data that you’re interested in and then storing that in files or in a database if you want to be more fancy.
Ideally we want to spend our time playing with the data rather than gathering it so we we want to keep this stage to a minimum which we can do by following these rules.
Don’t build a crawler
One of the most tempting things to do is build a crawler which starts on the home page and then follows some/all the links it comes across, downloading those pages as it goes.
This is incredibly time consuming and yet this was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in neo4j about 18 months ago.
Ashok wanted to get the same data a few months later and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted and then built up a list of pages to download.
It took him in the order of minutes to build a script that would get the data whereas I spent many hours using the crawler based approach.
If there is no discernible URI structure or if you want to get every single page then the crawler approach might make sense but I try to avoid it as a first port of call.
Download the files
We pay the network cost every time we run the script and at the beginning of a data gathering exercise we won’t know exactly what data we need so we’re bound to have to run it multiple times until we get it right.
It’s much quicker to download the files to disk and work on them locally.
Having spent a lot of time writing different tools to download the ThoughtWorks data set Ashok asked me why I wasn’t using wget instead.
I couldn’t think of a good reason so now I favour building up a list of URIs and then letting wget take care of downloading them for us. e.g.
$ head -n 5 uris.txt https://www.some-made-up-place.com/page1.html https://www.some-made-up-place.com/page2.html https://www.some-made-up-place.com/page3.html https://www.some-made-up-place.com/page4.html https://www.some-made-up-place.com/page5.html $ cat uris.txt | time xargs wget ... Total wall clock time: 3.7s Downloaded: 60 files, 625K in 0.7s (870 KB/s) 3.73 real 0.03 user 0.09 sys
If we need to speed things up we can always use the ‘-P’ flag of xargs to do so:
cat uris.txt | time xargs -n1 -P10 wget 1.65 real 0.20 user 0.21 sys
It pays to be reasonably sensible when using tools like this and of course read the terms and conditions of the site to check what they have to say about downloading copies of pages for personal use.
Given that you can get the pages using a web browser anyway it’s generally fine but it makes sense not to bombard their site with requests for every single page and instead just focus on the data you’re interested in.