Mark Needham

Thoughts on Software Development

Unix: Stripping first n bytes in a file / Byte Order Mark (BOM)

without comments

I’ve previously written a couple of blog posts showing how to strip out the byte order mark (BOM) from CSV files to make loading them into Neo4j easier and today I came across another way to clean up the file using tail.

The BOM is 3 bytes long at the beginning of the file so if we know that a file contains it then we can strip out those first 3 bytes tail like this:

$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv
 
real	0m31.945s
user	0m31.370s
sys	0m0.518s

The -c command is described thus;

-c number
             The location is number bytes.

So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.

Although using tail is quite simple, it took 30 seconds to process a 300MB CSV file which might actually be slower than opening the file with a Hex editor and manually deleting the bytes!

Be Sociable, Share!

Written by Mark Needham

August 19th, 2015 at 11:27 pm

Posted in Shell Scripting

Tagged with