16 Feb 2013

Regular Expressions: Non greedy matching

I was playing around with some football data earlier in the week and I wanted to try and extract just the name 'Rooney' from the following bit of text:

Rooney 8′, 27′

My initial regular expression was the following which annoyingly captures the time of the first goal:

RUBY > "Rooney 8′, 27′".match(/(.*)\s\d(.*)/)[1]
=> "Rooney 8,"

It works fine if the player has only scored one goal…

RUBY > "Rooney 8′".match(/(.*)\s\d(.*)/)[1]
=> "Rooney"

...but since the second part of the regex ("\s\d(.*)") appears twice only the last instance of it is matched and the rest of the text gets captured by the first part of the regex.

One way around this is to make the first part of the regex non greedy/lazy so that it will make as little as possible rather than as much as possible.

We can do that by using the lazy version of '' which is '?':

RUBY > "Rooney 8′, 27′".match(/(.*?)\s\d(.*)/)[1]
=> "Rooney"

Of course you could also argue that I’m being very lazy by using ".*" in the first place and you probably have a point! The following more explicit regular expression achieves the same thing:

RUBY > "Rooney 8′, 27′".match(/([A-Za-z\s-]+)\s\d(.*)/)[1]
=> "Rooney"

As a side note I find that when I’m playing around with regular expressions it really makes sense to have a bunch of test cases that I can run after each change to make sure I haven’t inadvertently broken everything.

regexper.com is also really helpful for visualising what the regular expression is actually doing!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.