I’ve previously written a couple of blog posts showing how to strip out the byte order mark (BOM) from CSV files to make loading them into Neo4j easier, and today I came across another way to clean up the file using tail.
The UTF-8 BOM is 3 bytes long and sits at the very beginning of the file, so if we know that a file contains it we can strip out those first 3 bytes using tail like this:
$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv

real 0m31.945s
user 0m31.370s
sys 0m0.518s
The -c option is described in the man page thus:
-c number The location is number bytes.
So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.
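Since tail -c +4 drops the first 3 bytes whether or not they are a BOM, it can be worth checking before stripping. The following is a minimal sketch of that idea, not something from my earlier posts: strip_bom is a hypothetical helper name, and it assumes the file is UTF-8, where the BOM is the byte sequence EF BB BF.

```shell
# strip_bom is a hypothetical helper: copy INPUT to OUTPUT, dropping a
# leading UTF-8 BOM (bytes EF BB BF) only if one is actually present.
strip_bom() {
  in="$1"
  out="$2"

  # Dump the first 3 bytes as hex with no offsets, then remove whitespace.
  first3=$(head -c 3 "$in" | od -An -tx1 | tr -d ' \n')

  if [ "$first3" = "efbbbf" ]; then
    # BOM found: start output at byte 4, i.e. skip the first 3 bytes.
    tail -c +4 "$in" > "$out"
  else
    # No BOM: copy the file through unchanged.
    cat "$in" > "$out"
  fi
}
```

It could then be called as strip_bom Casualty7904.csv Casualty7904_stripped.csv, and a file without a BOM would come through untouched rather than losing its first 3 characters.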
Although using tail is quite simple, it took just over 30 seconds to process a 300MB CSV file, which might actually be slower than opening the file with a hex editor and manually deleting the bytes!