Mark Needham

Thoughts on Software Development

A rogue “\357\273\277” (UTF-8 byte order mark)

with 4 comments

We’ve been loading some data into neo4j from a CSV file – creating one node per row and using the value in the first column as the index lookup for the node.

Unfortunately the index lookup wasn’t working for the first row but was for every other row.

By coincidence we started saving each row into a hash map and were then able to see what was going wrong:

require 'rubygems'
require 'fastercsv'
 
things = FasterCSV.read("things.csv", :col_sep => "|")
 
saved_things = {}
things.each do |row|
  saved_things[row[0]] = row[1]
end
 
p saved_things

This is what we saw when we ran the script:

{"\357\273\2771"=>"Thing1", "2"=>"Thing2"}
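The first key looks like “1” but isn’t. As a minimal sketch (not part of our original script, and assuming Ruby 1.9 or later), we can rebuild that hash by hand and see why the lookup for the first row misses:

```ruby
# Rebuild the hash from the output above. "\357\273\277" is the octal
# escape form of three extra bytes sitting in front of the "1".
saved_things = { "\357\273\2771" => "Thing1", "2" => "Thing2" }

bom_key = saved_things.keys.first

# The key is four bytes long, not one.
p bom_key.bytes.to_a    # [239, 187, 191, 49]

# Looking up the plain "1" finds nothing, because that is not the stored key.
p saved_things["1"]     # nil
p saved_things["2"]     # "Thing2"
```

The same mismatch would explain an index lookup that fails for the first row only.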

A bit of googling suggests that “\357\273\277” is the UTF-8 byte order mark (BOM), which apparently isn’t needed anyway:

The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8 so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.

We’re not converting the CSV file back into any other format so the following awk command can be used to clean it up:

awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' things.csv > things.nobom.csv
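An alternative that avoids creating a second file is to have Ruby skip the BOM when it opens the file. The sketch below is an assumption on my part rather than what we ran: it needs Ruby 1.9 or later, where FasterCSV became the standard library’s CSV, and it relies on the “bom|utf-8” encoding prefix that File.open accepts. The sample file is written first only so the sketch runs on its own, and its name is made up.

```ruby
require 'csv'
require 'tmpdir'

# Write a small sample file that starts with the UTF-8 BOM (illustrative data).
path = File.join(Dir.tmpdir, "things_with_bom.csv")
File.open(path, "wb") { |f| f.write("\xEF\xBB\xBF1|Thing1\r\n2|Thing2\r\n") }

saved_things = {}

# "bom|utf-8" asks Ruby to discard a leading BOM, if there is one, and to
# treat the rest of the file as UTF-8.
File.open(path, "r:bom|utf-8") do |file|
  CSV.new(file, :col_sep => "|").each do |row|
    saved_things[row[0]] = row[1]
  end
end

# The first key is now a plain "1".
p saved_things
```

This keeps the original file untouched, at the cost of every reader of the file having to remember the encoding option.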

If we use the hexdump tool we can see that the BOM has been removed:

Before:

$ hexdump things.csv
0000000 ef bb bf 31 7c 50 72 69 6d 61 72 69 65 73 0d 0a
...

After:

$ hexdump things.nobom.csv
0000000 31 7c 50 72 69 6d 61 72 69 65 73 0d 0a 31 30 7c
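The same check can be scripted. A rough Ruby equivalent, assuming Ruby 1.9.3 or later for File.binread, reads the first three bytes of a file and compares them with the BOM. has_bom? is a hypothetical helper name, and the two sample files are stand-ins written only to keep the sketch self-contained.

```ruby
require 'tmpdir'

# Hypothetical helper: true when the file starts with the bytes EF BB BF.
def has_bom?(path)
  first_bytes = File.binread(path, 3)   # nil when the file is empty
  !first_bytes.nil? && first_bytes.unpack("C*") == [0xEF, 0xBB, 0xBF]
end

# Stand-ins for things.csv and things.nobom.csv.
with_bom = File.join(Dir.tmpdir, "bom_check_things.csv")
without_bom = File.join(Dir.tmpdir, "bom_check_things.nobom.csv")
File.open(with_bom, "wb") { |f| f.write("\xEF\xBB\xBF1|Primaries\r\n") }
File.open(without_bom, "wb") { |f| f.write("1|Primaries\r\n") }

p has_bom?(with_bom)      # true
p has_bom?(without_bom)   # false
```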

I was initially curious why Ruby and hexdump were printing out different values, but it’s just a case of Ruby showing the octal version of the BOM while hexdump shows the hexadecimal version. The values translate like so:

Octal | Hexadecimal | Decimal
357   | EF          | 239
273   | BB          | 187
277   | BF          | 191
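
The conversion is easy to confirm in Ruby. This small sketch, assuming Ruby 1.9 or later, prints each BOM byte in all three bases using Kernel#format:

```ruby
# The BOM written with octal escapes, as it appeared in the hash output.
bom = "\357\273\277"

bom.each_byte do |byte|
  # %o is octal, %X is upper-case hexadecimal, %d is decimal.
  puts format("%o | %X | %d", byte, byte, byte)
end
# 357 | EF | 239
# 273 | BB | 187
# 277 | BF | 191

# The octal and hexadecimal escapes describe the same three bytes.
p bom.bytes.to_a == "\xEF\xBB\xBF".bytes.to_a   # true
```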

Written by Mark Needham

September 3rd, 2012 at 6:31 am

  • patrick

    Mark, cleaning out the BOM works, but shouldn’t software recognize the BOM for what it is? 

  • http://www.markhneedham.com/blog Mark Needham

    In theory yes it should but in practice it seems not to – in both Ruby and Java libraries – so I figured it’s easier just to get rid of it.

  • Esko Luontola

    Just last week I had to fix an issue that the layout on some of my web sites was completely broken on IE (even the latest versions). The reason was that a couple of template files contained the BOM.

  • goofansu

    Useful, solve my problem!