11 Jun 2015

Mac OS X: GNU sed - Hex string replacement / replacing new line characters

Recently I was working with a CSV file which contained both Windows and Unix line endings which was making it difficult to work with.

The actual line endings were HEX '0A0D' i.e. Windows line breaks but there were also HEX 'OA' i.e. Unix line breaks within one of the columns.

I wanted to get rid of the Unix line breaks and discovered that you can do HEX sequence replacement using the GNU version of sed - unfortunately the Mac ships with the BSD version which doesn’t have this functionaltiy.

The first step was therefore to install the GNU version of sed.

BASH brew install coreutils
brew install gnu-sed --with-default-names

I wanted to replace my system sed so that’s why I went with the '--with-default-names' flag - without that flag I believe the sed installation would be accessible as 'gs-sed'.

The following is an example of what the lines in the file look like:

BASH $ echo -e "Hello\x0AMark\x0A\x0D"
Hello
Mark

We want to get rid of the new line in between 'Hello' and 'Mark' but leave the other one be. I adapted one of the commands from this tutorial to look for lines which end in '0A' where that isn’t followed by a '0D':

BASH $ echo -e "Hello\x0AMark\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark

Let’s go through the parts of the sed command:

N - this creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The two lines are separated by a new line.
/\x0A[^\x0D]/ - this matches any lines which contain 'OA' not followed by 'OD'
/s/\n/ / - this substitutes the new line character with a space for those matching lines from the previous command.

Now let’s check it works if we have multiple lines that we want to squash:

BASH $ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D"
Hello
Mark
Hello
Michael

$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark
Hello Michael

Looks good! The actual file is a bit more nuanced so I’ve still got a bit more work to do but this is a good start.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.