Mark Needham

Thoughts on Software Development

gawk: Getting story numbers from git commit messages

As I mentioned in my previous post, I’ve been writing a little application to create graphs based on our git repository history. For one of them we wanted to show which people had been working on which stories.

I needed a way to extract the story number from each git commit message and then store them all in a text file.

A typical commit message with a story number in it might look like this:

Mark/Uday #689 some awesome scala refactoring

I couldn’t think of an easy way to do this with my current knowledge of sed or the Mac version of awk, but the match function of gawk (GNU awk) makes it really easy.

match(string, regexp [, array])

Search string for the longest, leftmost substring matched by the regular expression, regexp and return the character position, or index, at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.

If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp.

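The array argument is what I needed: when regexp contains parentheses, the integer-indexed elements of the array hold the text matched by each parenthesised group, so element 1 is the story number without the leading #. According to the documentation, the array argument is only available as a gawk extension.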

I ended up with the following command to extract the story numbers:

git log --no-merges --pretty="format:%s" | 
gawk '{ match($0, /#([0-9]+)/, arr); if(arr[1] != "") print arr[1] }'
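If you want to try this without a repository to hand, something like the following sketch should behave the same way. The helper name `story_numbers`, the output file `story_numbers.txt`, and the second and third commit subjects are all made up for illustration; the printf stands in for the git log call, and gawk is assumed to be installed.

```shell
# Hypothetical helper: reads commit subjects on stdin and prints
# one story number per line, skipping subjects without a #number.
story_numbers() {
  gawk 'match($0, /#([0-9]+)/, arr) { print arr[1] }'
}

# Sample subjects standing in for:
#   git log --no-merges --pretty="format:%s"
printf '%s\n' \
  'Mark/Uday #689 some awesome scala refactoring' \
  'Fix the build' \
  'Mark #702 tidy up graph rendering' |
  story_numbers > story_numbers.txt

cat story_numbers.txt
```

Only the first story number on each line is picked up, which is also true of the original command.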

I had to install gawk using MacPorts on my Mac, but on Fedora the default installation of awk is gawk.

Written by Mark Needham

September 12th, 2011 at 7:05 am

  • Greg

    The Mac version of awk sets the RSTART and RLENGTH globals when match is called, allowing you to extract the matched string using substr().

    It’s idiomatic in awk to place the condition for processing the line outside the braces rather than using if, e.g.

    | awk 'match($0, /#[0-9]+/) { print substr($0, RSTART, RLENGTH) }'

  • http://www.markhneedham.com/blog Mark Needham

    Oh cool, I was trying to work out if there was a way to make use of groups without using the third parameter to match!
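One thing to watch with the RSTART/RLENGTH approach from the comments is that the matched string includes the leading #, so the command prints #689 rather than 689. A minimal sketch of one way around that, assuming any POSIX-style awk, is to shift the substr() window by one character. The function name `story_numbers_posix` is hypothetical.

```shell
# Hypothetical helper using only two-argument match(), so it does not
# depend on the gawk array extension.
story_numbers_posix() {
  # RSTART points at the '#', so start one character later and
  # take one character fewer to keep just the digits.
  awk 'match($0, /#[0-9]+/) { print substr($0, RSTART + 1, RLENGTH - 1) }'
}

printf '%s\n' 'Mark/Uday #689 some awesome scala refactoring' | story_numbers_posix
```

This only emulates a single group anchored right after a fixed prefix; for anything more involved the gawk array argument is probably still the simpler option.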