14 Sep 2013

Clojure: All things regex

I’ve been doing some scraping of web pages recently using Clojure and Enlive, and as part of that I’ve had to write regular expressions to extract the data I’m interested in.

On my travels I’ve come across a few different functions and I’m never sure which is the right one to use, so I thought I’d document what I’ve tried for future me.

Check if regex matches

The first regex I wrote was while scraping the Champions League results from the Rec.Sport.Soccer Statistics Foundation, where I wanted to determine which spans contained a match result and which didn’t.

A matching line would look like this:

Real Madrid-Juventus Turijn 2 - 1

And a non-matching one like this:

53’Nedved 0-1, 66'Xavi Hernández 1-1, 114’Zalayeta 1-2

I wrote the following regex to detect match results:

[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]

I then wrote the following function using http://clojuredocs.org/clojure_core/clojure.core/re-matches which would return true or false depending on the input:

(defn recognise-match? [row]
  (not (clojure.string/blank? (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row))))

> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1")
true
> (recognise-match? "53’Nedved 0-1, 66'Xavi Hernández 1-1, 114’Zalayeta 1-2")
false

re-matches only returns a match if the whole string matches the pattern, which means that a line with some spurious text after the score wouldn’t match:

> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1 abc")
false

If we don’t mind that and we just want some part of the string to match our pattern then we can use http://clojuredocs.org/clojure_core/clojure.core/re-find instead:

(defn recognise-match? [row]
  (not (clojure.string/blank? (re-find #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row))))

> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1 abc")
true
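
Both re-matches and re-find return nil when nothing matches, so a small variation is to coerce the result with boolean rather than going through clojure.string/blank?. This is only a sketch: the names match-result-pattern, whole-row-match? and contains-match? are made up for illustration, and the pattern is the one from above.

```clojure
;; The pattern from the post, named once so both predicates share it.
(def match-result-pattern
  #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]")

;; True only when the entire row is a match result (re-matches).
(defn whole-row-match? [row]
  (boolean (re-matches match-result-pattern row)))

;; True when a match result appears anywhere in the row (re-find).
(defn contains-match? [row]
  (boolean (re-find match-result-pattern row)))
```

A side benefit of this shape is that it should keep working if capture groups are added to the pattern later, since boolean doesn’t care whether it is given a string or a vector.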

Extract capture groups

The next thing I wanted to do was to capture the teams and the score of the match which I initially did using http://clojuredocs.org/clojure_core/clojure.core/re-seq:

> (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

I then extracted the various parts like so:

> (def result (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1")))

> result
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]


> (nth result 1)
"FC Valencia"

> (nth result 2)
"Internazionale Milaan"

re-seq returns a lazy sequence of successive matches of the regex. Each element is a string if the pattern has no capture groups, or a vector containing the whole match followed by each capture group if it does.

For example, if we now match only runs of letters or whitespace and remove the rest of the pattern from above we’d get the following results:

> (re-seq #"([a-zA-Z\s]+)" "FC Valencia-Internazionale Milaan 2 - 1")
(["FC Valencia" "FC Valencia"] ["Internazionale Milaan " "Internazionale Milaan "] [" " " "] [" " " "])

> (re-seq #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1")
("FC Valencia" "Internazionale Milaan " " " " ")
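
Where re-seq earns its keep is a line that contains several matches, such as the goal lines that were rejected earlier. The following is a rough sketch under a few assumptions: goal-line is a simplified, ASCII-only version of that line, and goal-pattern and goals are hypothetical names rather than anything from the scraper.

```clojure
;; Simplified, ASCII-only version of the goal line shown earlier (an assumption).
(def goal-line "53'Nedved 0-1, 66'Xavi Hernandez 1-1, 114'Zalayeta 1-2")

;; Hypothetical pattern: minute, apostrophe, scorer, then the running score.
(def goal-pattern #"(\d+)'([a-zA-Z ]+) (\d)-(\d)")

;; re-seq yields one vector per match; destructure each vector into a map.
(def goals
  (for [[_ minute scorer home away] (re-seq goal-pattern goal-line)]
    {:minute minute :scorer scorer :score (str home "-" away)}))
```

Each goal in the line becomes its own map, which is the sort of result re-find can’t give us in a single call.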

In our case re-find or re-matches makes more sense, since we only want to match the pattern once. re-find returns only the first match and ignores any further ones, while re-matches requires the whole string to match, e.g.

> (re-find #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1")
"FC Valencia"

> (re-matches #"[a-zA-Z\s]*" "FC Valencia-Internazionale Milaan 2 - 1")
nil

re-matches returns nil here because the string contains characters which the pattern doesn’t allow, i.e. the hyphen between the two team names and the digits and hyphen in the score.

If we tie that in with our capture groups we end up with the following:

> (def result
    (re-find #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))

> result
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

> (nth result 1)
"FC Valencia"

> (nth result 2)
"Internazionale Milaan"

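Since the result is just a vector, positional destructuring is an alternative to picking items out with nth. A minimal sketch, assuming we want a map back and the scores as integers; parse-match and the map keys are hypothetical names chosen for this example.

```clojure
(def score-pattern
  #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])")

;; Hypothetical helper: returns a map for a matching row, nil otherwise.
;; The leading _ skips the whole-match string at index 0.
(defn parse-match [row]
  (when-let [[_ home away home-goals away-goals] (re-find score-pattern row)]
    {:home       home
     :away       away
     :home-goals (Integer/parseInt home-goals)
     :away-goals (Integer/parseInt away-goals)}))
```

Because re-find returns nil when there is no match, when-let lets the same function double as the check for whether a row is a match result at all.
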
I also came across the http://clojuredocs.org/clojure_core/clojure.core/re-pattern function which provides a more verbose way of creating a pattern and then evaluating it with re-find:

> (re-find (re-pattern "([a-zA-Z\\s]+)-([a-zA-Z\\s]+) ([0-9])[\\s]?.[\\s]?([0-9])") "FC Valencia-Internazionale Milaan 2 - 1")
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

One difference here is that the pattern is now an ordinary string, so I had to escape the backslash in '\s' (writing '\\s'), otherwise I was getting the following exception:

RuntimeException Unsupported escape character: \s  clojure.lang.Util.runtimeException (Util.java:170)
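
To make the escaping difference concrete, here is a small sketch comparing the two spellings, followed by the one case where re-pattern seems useful to me: building a pattern out of string fragments. The names from-literal, from-string, team, score and score-pattern are all made up for the example.

```clojure
;; The same pattern written as a regex literal and as an escaped string.
(def from-literal #"[a-zA-Z\s]+")
(def from-string  (re-pattern "[a-zA-Z\\s]+"))

;; re-pattern lets a pattern be assembled from smaller string pieces.
(def team  "([a-zA-Z\\s]+)")
(def score "([0-9])[\\s]?.[\\s]?([0-9])")
(def score-pattern (re-pattern (str team "-" team " " score)))
```

Calling str on either from-literal or from-string gives back the same pattern text, and score-pattern behaves like the literal used in the earlier examples.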

I wanted to play around with http://clojuredocs.org/clojure_core/clojure.core/re-groups as well but that seemed to throw an exception reasonably frequently when I expected it to work.
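
One likely cause of those exceptions, offered as an assumption rather than something checked against the original REPL session: re-groups takes a matcher, and it only works once a match has been attempted and has succeeded on that matcher. Before that, the underlying java.util.regex.Matcher throws an IllegalStateException. The sketch below shows both situations; groups-or-error is a hypothetical helper that exists only to make the failure visible.

```clojure
(def score-pattern
  #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])")

;; Hypothetical helper: calls re-groups, but returns :no-match-yet instead of
;; letting the IllegalStateException escape.
(defn groups-or-error [matcher]
  (try
    (re-groups matcher)
    (catch IllegalStateException _ :no-match-yet)))

(def fresh (re-matcher score-pattern "FC Valencia-Internazionale Milaan 2 - 1"))

(def before-find (groups-or-error fresh)) ; no match has been attempted yet
(def found       (re-find fresh))         ; runs the match on the matcher
(def after-find  (groups-or-error fresh)) ; groups of the most recent match
```

Here before-find is :no-match-yet, while after-find is the same vector that re-find returned.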

The last function I looked at was http://clojuredocs.org/clojure_core/clojure.core/re-matcher, which takes a pattern and a string and returns a matcher object (a java.util.regex.Matcher) that can then be passed to re-find. For a single match it is just a long-hand for the two-argument re-find used earlier in the post:

> (re-find (re-matcher #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]
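
The one thing a matcher adds is state: calling re-find repeatedly on the same matcher walks through the matches one at a time. A small sketch using the letters-and-whitespace pattern from the re-seq example; m, first-match, second-match and remaining are names invented here.

```clojure
(def m (re-matcher #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1"))

;; Each call to re-find on the same matcher returns the next match.
(def first-match  (re-find m)) ; "FC Valencia"
(def second-match (re-find m)) ; "Internazionale Milaan "

;; Drain whatever is left, stopping when re-find returns nil.
(def remaining (doall (take-while identity (repeatedly #(re-find m)))))
```

This is roughly what re-seq does for us, which is probably why the matcher rarely needs to be handled directly.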

In summary

I think most use cases are covered by re-find and re-matches, with re-seq for the occasions when we want every match in a string. I couldn’t see where I’d use the other functions but I’m happy to be proved wrong.
