10 Jun 2012

CSV parsing/UTF-8 encoding

I was recently trying to parse a CSV file which I’d converted from an Excel spreadsheet, but I was having problems with characters outside the ASCII range.

This is an example of what was going wrong:

> require 'csv'
> people = CSV.open("sponsors.csv", 'r', ?,, ?\r).to_a
["Erik D\366rnenburg", "N/A"]

> people.each { |sponsee, sponsor| puts "#{sponsee} #{sponsor}" }
Erik D?rnenburg N/A

I came across a Ruby gem called chardet (http://snippets.aktagon.com/snippets/159-Detecting-file-data-encoding-with-Ruby-and-the-chardet-RubyGem), which allowed me to work out the character set of Erik’s name like so:

> require 'chardet'
> require 'UniversalDetector'

> UniversalDetector::chardet("Erik D\366rnenburg")
=> {"encoding"=>"ISO-8859-2", "confidence"=>0.879630020576305}
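
To see why the name was being mangled, it can help to check the raw byte directly. The following is a minimal sketch that assumes Ruby 2.0 or later (the session above looks like an older Ruby, where strings carry no encoding); the variable names are mine, not from the original session. The byte `\366` (hex `F6`) is not a valid UTF-8 sequence on its own, but it decodes to "ö" under the detected ISO-8859-2 encoding.

```ruby
# The raw bytes as they appear in the CSV file; \xF6 is the same byte as octal \366.
raw = "Erik D\xF6rnenburg".b

# Interpreted as UTF-8, the lone 0xF6 byte is invalid, hence the "?" when printing.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?

# Interpreted as ISO-8859-2 (the encoding chardet reported), the bytes are valid
# and can be transcoded to UTF-8.
as_latin2 = raw.dup.force_encoding("ISO-8859-2")
puts as_latin2.valid_encoding?
puts as_latin2.encode("UTF-8")
```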

I’d forgotten that you can work out much the same thing by using the file command like so:

> file sponsors.csv
sponsors.csv: ISO-8859 text, with CR line terminators

We can then make use of iconv (http://docs.moodle.org/22/en/Converting_files_to_UTF-8) to change the file encoding like this:

> iconv -f iso-8859-2 -t utf-8 sponsors.csv > sponsors_conv.csv

> file sponsors_conv.csv
sponsors_conv.csv: UTF-8 Unicode text, with CR line terminators
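
If iconv isn't available, roughly the same conversion can be done from Ruby itself. This is a sketch assuming Ruby 1.9.3 or later; `convert_to_utf8` is a hypothetical helper name of my own, not part of any library.

```ruby
# Hypothetical helper: re-encode a file's contents to UTF-8,
# roughly what the iconv command above does.
def convert_to_utf8(source_path, target_path, from_encoding)
  bytes = File.binread(source_path)                         # read raw bytes, no decoding
  text = bytes.force_encoding(from_encoding).encode("UTF-8") # label the bytes, then transcode
  File.binwrite(target_path, text)                          # write the UTF-8 bytes unchanged
end

# Example usage with the file names from this post:
# convert_to_utf8("sponsors.csv", "sponsors_conv.csv", "ISO-8859-2")
```

Reading and writing in binary mode keeps the CR line terminators exactly as they were.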

Now if we parse the UTF-8 encoded file it doesn’t ruin Erik’s name!

> people = CSV.open("sponsors_conv.csv", 'r', ?,, ?\r).to_a
["Erik D\303\266rnenburg", "N/A"]

> people.each { |sponsee, sponsor| puts "#{sponsee} #{sponsor}" }
Erik Dörnenburg N/A
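
On newer Rubies (1.9 and later) the CSV library can transcode while reading, so the separate conversion step may not be needed. This is a minimal sketch, assuming the file really is ISO-8859-2 with CR line terminators as detected above; `read_sponsors` is a hypothetical helper name.

```ruby
require "csv"

# Hypothetical helper: parse the CSV, transcoding from the source encoding
# to UTF-8 on read. "ISO-8859-2:UTF-8" means external:internal encoding.
def read_sponsors(path, source_encoding = "ISO-8859-2")
  CSV.read(path, encoding: "#{source_encoding}:UTF-8", row_sep: "\r")
end

# Example usage with the original, unconverted file:
# read_sponsors("sponsors.csv").each do |sponsee, sponsor|
#   puts "#{sponsee} #{sponsor}"
# end
```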

Hopefully I’ll now remember what to do next time I come across this problem!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.