CSV parsing/UTF-8 encoding
I was recently trying to parse a CSV file that I’d converted from an Excel spreadsheet, but I ran into problems with characters outside the ASCII range.
This is an example of what was going wrong:
> require 'csv'
> people = CSV.open("sponsors.csv", 'r', ?,, ?\r).to_a
["Erik D\366rnenburg", "N/A"]
> people.each { |sponsee, sponsor| puts "#{sponsee} #{sponsor}" }
Erik D?rnenburg N/A
I came across a Ruby gem called chardet (http://snippets.aktagon.com/snippets/159-Detecting-file-data-encoding-with-Ruby-and-the-chardet-RubyGem), which allowed me to work out the character set of Erik’s name like so:
> require 'chardet'
> require 'UniversalDetector'
> UniversalDetector::chardet("Erik D\366rnenburg")
=> {"encoding"=>"ISO-8859-2", "confidence"=>0.879630020576305}
I’d forgotten that the Unix file command reports much the same thing, although it only narrows the encoding down to the ISO-8859 family:
> file sponsors.csv
sponsors.csv: ISO-8859 text, with CR line terminators
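As a rough complement to chardet and file, Ruby itself can at least say whether a run of bytes is valid UTF-8. This is a minimal sketch that assumes Ruby 1.9 or later, where strings carry an encoding; `valid_utf8?` is a hypothetical helper name, and the check only rules UTF-8 in or out rather than identifying the actual character set.

```ruby
# Hypothetical helper: true when the given bytes form a valid UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

latin = "Erik D\xF6rnenburg".b      # bytes as stored in the original file (\366 is \xF6)
utf8  = "Erik D\xC3\xB6rnenburg".b  # the same name encoded as UTF-8

puts valid_utf8?(latin)  # prints false
puts valid_utf8?(utf8)   # prints true
```

A lone \xF6 byte is never valid in UTF-8, which is why the name printed with a question mark.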
We can then make use of iconv (http://docs.moodle.org/22/en/Converting_files_to_UTF-8) to change the file encoding like this:
> iconv -f iso-8859-2 -t utf-8 sponsors.csv > sponsors_conv.csv
> file sponsors_conv.csv
sponsors_conv.csv: UTF-8 Unicode text, with CR line terminators
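The same conversion can probably be done from Ruby without shelling out to iconv. The sketch below assumes Ruby 1.9 or later and uses `String#encode`; `convert_file` is a hypothetical helper, and the temporary directory and sample row are only there to make the example self-contained.

```ruby
require 'tmpdir'

# Hypothetical helper: re-encode a whole file, roughly like `iconv -f from -t to`.
def convert_file(source, target, from: "ISO-8859-2", to: "UTF-8")
  bytes = File.binread(source)  # raw bytes, no encoding interpretation yet
  # Label the bytes with their real encoding, then transcode them.
  File.binwrite(target, bytes.force_encoding(from).encode(to))
end

Dir.mktmpdir do |dir|
  source = File.join(dir, "sponsors.csv")
  target = File.join(dir, "sponsors_conv.csv")

  # Stand-in for the exported spreadsheet: one row, CR line terminator.
  File.binwrite(source, "Erik D\xF6rnenburg,N/A\r".b)

  convert_file(source, target)
  puts File.read(target, encoding: "UTF-8")
end
```

Reading and writing in binary mode keeps Ruby from guessing at the encoding or touching the CR line terminators.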
Now if we parse the UTF-8 encoded file, Erik’s name survives intact:
> people = CSV.open("sponsors_conv.csv", 'r', ?,, ?\r).to_a
["Erik D\303\266rnenburg", "N/A"]
> people.each { |sponsee, sponsor| puts "#{sponsee} #{sponsor}" }
Erik Dörnenburg N/A
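The `CSV.open(path, 'r', ?,, ?\r)` call above is the Ruby 1.8 API. On Ruby 1.9 and later, the CSV library can transcode while it reads, which may make the separate iconv step unnecessary. This is a sketch under that assumption: `read_sponsors` is a hypothetical helper, and the `encoding` value names the source encoding followed by the encoding the parsed strings should have.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: parse a CR-terminated CSV file, transcoding to UTF-8 on read.
def read_sponsors(path, source_encoding = "ISO-8859-2")
  CSV.read(path, encoding: "#{source_encoding}:UTF-8", col_sep: ",", row_sep: "\r")
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sponsors.csv")

  # Stand-in for the original, unconverted file.
  File.binwrite(path, "Erik D\xF6rnenburg,N/A\r".b)

  people = read_sponsors(path)
  people.each { |sponsee, sponsor| puts "#{sponsee} #{sponsor}" }
end
```

The source encoding still has to be known up front, so chardet or file remain useful for working out what to pass in.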
Hopefully I’ll now remember what to do next time I come across this problem!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.