Mark Needham

Thoughts on Software Development

Regular Expressions: Non greedy matching

with 2 comments

I was playing around with some football data earlier in the week and I wanted to try and extract just the name ‘Rooney’ from the following bit of text:

Rooney 8′, 27′

My initial regular expression was the following which annoyingly captures the time of the first goal:

> "Rooney 8′, 27′".match(/(.*)\s\d(.*)/)[1]
=> "Rooney 8,"

It works fine if the player has only scored one goal…

> "Rooney 8′".match(/(.*)\s\d(.*)/)[1]
=> "Rooney"

…but since the second part of the regex (“\s\d(.*)”) appears twice only the last instance of it is matched and the rest of the text gets captured by the first part of the regex.

One way around this is to make the first part of the regex non greedy/lazy so that it will make as little as possible rather than as much as possible.

We can do that by using the lazy version of ‘*’ which is ‘*?’:

> "Rooney 8′, 27′".match(/(.*?)\s\d(.*)/)[1]
=> "Rooney"

Of course you could also argue that I’m being very lazy by using “.*” in the first place and you probably have a point! The following more explicit regular expression achieves the same thing:

> "Rooney 8′, 27′".match(/([A-Za-z\s-]+)\s\d(.*)/)[1]
=> "Rooney"

As a side note I find that when I’m playing around with regular expressions it really makes sense to have a bunch of test cases that I can run after each change to make sure I haven’t inadvertently broken everything.

regexper.com is also really helpful for visualising what the regular expression is actually doing!

Be Sociable, Share!

Written by Mark Needham

February 16th, 2013 at 12:17 pm