Mark Needham

Thoughts on Software Development

Ruby: Regex – Matching the Trademark ™ character

without comments

I’ve been playing around with some World Cup data and while cleaning up the data I wanted to strip out the year and host country for a world cup.

I started with a string like this which I was reading from a file:

1930 FIFA World Cup Uruguay ™

And I wanted to be able to extract just the ‘Uruguay’ bit without getting the trademark or the space preceding it. I initially tried the following to match all parts of the line and extract my bit:

p text.match(/\d{4} FIFA World Cup (.*?)/)[1]

Unfortunately that doesn’t actually compile:

tm.rb:4: syntax error, unexpected $end, expecting ')'
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]

I was initially able to work around the problem by matching the unicode code point instead:

p text.match(/\d{4} FIFA World Cup (.*?) \u2122/)[1]

While working on this blog post I also remembered that you can specify the character set of your Ruby file and by default it’s ASCII which would explain why it doesn’t like the ™ character.

If we add the following line at the top of the file then we can happily use the ™ character in our regex:

# encoding: utf-8
# ...
p text.match(/\d{4} FIFA World Cup (.*?)/)[1]
# returns "Uruguay"

This post therefore ends up being more of a reminder for future Mark when he comes across this problem again having forgotten about Ruby character sets!

Be Sociable, Share!

Written by Mark Needham

June 8th, 2014 at 1:34 am

Posted in Ruby

Tagged with