26 Aug 2013

Clojure/Enlive: Screen scraping a HTML file from disk

I wanted to play around with some Champions League data and I came across the Rec Sport Soccer Statistics Foundation which has collected results of all matches since the tournament started in 1955.

I wanted to get a list of all the matches for a specific season so I started out by downloading the file:

$ pwd
/tmp/football
$ wget http://www.rsssf.com/ec/ec200203det.html

The next step was to load that page and then run a CSS selector over it to extract the matches. In Ruby land I usually use nokogiri or Web Driver to do this but I’d heard that Clojure’s enlive is good for this type of work so I thought I’d give it a try.

I found a couple of examples showing how to get started but they both seemed to rely on the web page being at a HTTP URI rather than on disk.

I eventually spotted an example which passed in HTML as a string to html-resource and decided to load the contents of my file as a string and then pass that in:

(ns ranking-algorithms.parse
  (:use [net.cgrand.enlive-html]))

(defn fetch-page
  [file-path]
  (html-resource (java.io.StringReader. (slurp file-path))))

The next step was to take that page representation and extract the matches. Since the page isn’t particularly well laid out for that purpose I ended up writing a regular expression to find the matching parts:

(defn matches [file]
  (->> file
       fetch-page
       extract-rows
       (map extract-content)
       (filter recognise-match?)))

(defn extract-rows [page]
  (select page [:div.Section1 :p :span]))

(defn extract-content [row]
  (first (get row :content)))

(defn recognise-match? [row]
  (and (string? row) (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row)))

The interesting part is extract-rows where we apply the CSS selector 'div.Section1 p span', the only difference being that we prefix the selector with ':'.

We then filter everything through recgonise-match? to find the matches since almost every row of the page is returned by our CSS selector. Unfortunately I don’t think there is a more specific selector that I could have used.

When I execute that function I ended up with the following output:

> (matches "/tmp/football/ec200203det.html")
( ... "Lokomotiv\nMoskou-Borussia Dortmund 1 - 2" "Borussia\nDortmund-AC Milan 0 - 1"
"Real\nMadrid-Lokomotiv Moskou 2 - 2" "Real\nMadrid-Borussia Dortmund 2 - 1"
"AC Milan-Lokomotiv\nMoskou 1 - 0" "Borussia Dortmund-Real\nMadrid 1 - 1"
"Lokomotiv\nMoskou-AC Milan 0 - 1" ... )

The next step was to split out the strings into a structure that I can use in a rankings algorithm so I applied another function to each string to pull out the appropriate parts:

(defn matches [file]
  (->> file
       fetch-page
       extract-rows
       (map extract-content)
       (filter recognise-match?)
       (map as-match)))

(defn cleanup [word]
  (clojure.string/replace word "\n" " "))

(defn as-match
  [row]
  (let [match
        (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" row))]
    {:home (cleanup (nth match 1)) :away (cleanup (nth match 2))
     :home_score (nth match 3) :away_score (nth match 4)}))

If we run the function now we get a much nicer output to play with:

> (matches "/tmp/football/ec200203det.html")
( ...  {:home "AC Milan", :away "Internazionale Milaan", :home_score "0", :away_score "0"}
       {:home "Juventus Turijn", :away "Real Madrid", :home_score "3", :away_score "1"}
       {:home "Internazionale Milaan", :away "AC Milan", :home_score "1", :away_score "1"} )

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.