
Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I was recently doing some text scrubbing and had difficulty working out how to remove the '†' character from strings.

e.g. I had a string like this:

>>> u'foo †'
u'foo \u2020'

I wanted to get rid of the '†' character and then strip any trailing spaces so I’d end up with the string 'foo'. I tried to do this in one call to 'replace':

>>> u'foo †'.replace(" †", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

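It took me a while to work out that " †" was a plain byte string, which Python 2 tries to decode as ASCII rather than UTF-8 before it can be used with a unicode string. Let’s fix that by passing a unicode literal instead: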

>>> u'foo †'.replace(u' †', "")
u'foo'
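
To make the failure a bit more visible, here is a small sketch that does by hand what Python 2 does implicitly when a byte string is passed to a method on a unicode string: it decodes the bytes with the 'ascii' codec. The variable names are my own, and the byte values assume the '†' was typed into a UTF-8 terminal (where it is encoded as 0xe2 0x80 0xa0).

```python
# UTF-8 bytes for a space followed by the dagger character (U+2020).
needle_bytes = b' \xe2\x80\xa0'

try:
    # This is the decode that Python 2 attempts behind the scenes.
    needle_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as e:
    # The space at position 0 is valid ASCII; 0xe2 at position 1 is not.
    error = e

# Decoding with the codec the bytes were actually written in works.
needle = needle_bytes.decode('utf-8')
result = u'foo \u2020'.replace(needle, u'')
```

The error is raised at position 1 because the leading space is valid ASCII and the first byte of the dagger is not, which matches the traceback above.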

I think the following call to unicode, which I’ve written about before, is equivalent, provided the byte string really is UTF-8 encoded:

>>> u'foo †'.replace(unicode(' †', "utf-8"), "")
u'foo'
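
If the text being scrubbed might arrive as either bytes or unicode, one option is to normalise everything to unicode before replacing. This is only a sketch: `scrub` and `DAGGER` are hypothetical names, not something from my actual script, and it assumes any byte input is UTF-8 encoded.

```python
DAGGER = u'\u2020'  # the dagger character

def scrub(value, encoding='utf-8'):
    """Remove dagger characters, then strip trailing whitespace."""
    # Decode byte strings explicitly rather than relying on the
    # implicit ASCII decode. (In Python 2, bytes is an alias for str.)
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return value.replace(DAGGER, u'').rstrip()

cleaned_from_unicode = scrub(u'foo \u2020')
cleaned_from_bytes = scrub(b'foo \xe2\x80\xa0')
```

Both calls should leave just the text 'foo', since the decode happens with an explicit codec before any replacing is done.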

Now back to the scrubbing!
