# Mark Needham

Thoughts on Software Development

## Shell: Create a comma separated string

I recently needed to generate a string with comma separated values, based on iterating a range of numbers.

e.g. we should get the following output where n = 3

```
foo-0,foo-1,foo-2
```

I only had the shell available to me so I couldn’t shell out into Python or Ruby for example. That means it’s bash scripting time!

If we want to iterate a range of numbers and print them out on the screen we can write the following code:

```shell
n=3
for i in $(seq 0 $(($n > 0 ? $n-1 : 0))); do
  echo "foo-$i"
done
```

```
foo-0
foo-1
foo-2
```

Combining them into a string is a bit more tricky, but luckily I found a great blog post by Andreas Haupt which shows what to do. Andreas is solving a more complicated problem than me but these are the bits of code that we need from the post.

```shell
n=3
combined=""

for i in $(seq 0 $(($n > 0 ? $n-1 : 0))); do
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
done
echo $combined
```

```
foo-0,foo-1,foo-2
```

This won’t work if you set n<0 but that’s ok for me! I’ll let Andreas explain how it works:

• \${combined:+,} will return either a comma (if combined exists and is set) or nothing at all.
• In the first invocation of the loop combined is not yet set and nothing is put out.
• In the next rounds combined is set and a comma will be put out.
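A quick way to convince yourself of that behaviour, independent of the loop:

```shell
combined=""
echo "[${combined:+,}]"    # combined is empty, so the expansion produces nothing: []
combined="foo-0"
echo "[${combined:+,}]"    # combined is set and non-empty, so we get a comma: [,]
```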

We can see it in action by printing out the value of $combined after each iteration of the loop:

```shell
n=3
combined=""

for i in $(seq 0 $(($n > 0 ? $n-1 : 0))); do
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
  echo $combined
done
```

```
foo-0
foo-0,foo-1
foo-0,foo-1,foo-2
```

Looks good to me!

Written by Mark Needham

June 23rd, 2017 at 12:26 pm

Posted in Shell Scripting


## Unix: Find files greater than date

For the latter part of the week I’ve been running some tests against Neo4j which generate a bunch of log files and I wanted to filter those files based on the time they were created to do some further analysis.

This is an example of what the directory listing looks like:

```shell
$ ls -alh foo/database-agent-*
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    11K 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    12K 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
-rw-r--r--  1 markneedham  wheel   5.8K 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   9.5K 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.7K 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.8K 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel    13K 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.9K 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz
```

I wanted to split the files in half so that I could have the ones created before and after 12pm on the 23rd June.

I discovered that this type of filtering is actually quite easy to do with the ‘find’ command. So if I want to get the files after 12pm I could write the following:

```shell
$ find foo -name 'database-agent*' -newermt "Jun 23, 2016 12:00" -ls
121939705  8 -rw-r--r--  1 markneedham  wheel   2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704  8 -rw-r--r--  1 markneedham  wheel   2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591 24 -rw-r--r--  1 markneedham  wheel  11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707 16 -rw-r--r--  1 markneedham  wheel   4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612  8 -rw-r--r--  1 markneedham  wheel   3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708 16 -rw-r--r--  1 markneedham  wheel   4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589 24 -rw-r--r--  1 markneedham  wheel  12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720  8 -rw-r--r--  1 markneedham  wheel   2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706 16 -rw-r--r--  1 markneedham  wheel   5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588  8 -rw-r--r--  1 markneedham  wheel   2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590 32 -rw-r--r--  1 markneedham  wheel  13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587  8 -rw-r--r--  1 markneedham  wheel   3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611  8 -rw-r--r--  1 markneedham  wheel   3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
```

(The glob is quoted here so the shell doesn't expand it against the current directory before find sees it.)

And to get the ones before 12pm:

```shell
$ find foo -name 'database-agent*' -not -newermt "Jun 23, 2016 12:00" -ls
121879391 24 -rw-r--r--  1 markneedham  wheel   8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394 24 -rw-r--r--  1 markneedham  wheel   8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390 24 -rw-r--r--  1 markneedham  wheel   9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393  8 -rw-r--r--  1 markneedham  wheel   2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413  8 -rw-r--r--  1 markneedham  wheel   3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414  8 -rw-r--r--  1 markneedham  wheel   3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392  8 -rw-r--r--  1 markneedham  wheel   3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz
```

Or we could even find the ones last modified between 12pm and 2pm:

```shell
$ find foo -name 'database-agent*' -not -newermt "Jun 23, 2016 14:00" -newermt "Jun 23, 2016 12:00" -ls
121934591 24 -rw-r--r--  1 markneedham  wheel  11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121934612  8 -rw-r--r--  1 markneedham  wheel   3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121934589 24 -rw-r--r--  1 markneedham  wheel  12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121934588  8 -rw-r--r--  1 markneedham  wheel   2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590 32 -rw-r--r--  1 markneedham  wheel  13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587  8 -rw-r--r--  1 markneedham  wheel   3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611  8 -rw-r--r--  1 markneedham  wheel   3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
```

Or we can filter by relative time e.g. to find the files last modified in the last 1 day, 5 hours:

```shell
$ find foo -name 'database-agent*' -mtime -1d5h -ls
121939705  8 -rw-r--r--  1 markneedham  wheel   2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704  8 -rw-r--r--  1 markneedham  wheel   2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591 24 -rw-r--r--  1 markneedham  wheel  11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707 16 -rw-r--r--  1 markneedham  wheel   4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612  8 -rw-r--r--  1 markneedham  wheel   3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708 16 -rw-r--r--  1 markneedham  wheel   4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589 24 -rw-r--r--  1 markneedham  wheel  12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720  8 -rw-r--r--  1 markneedham  wheel   2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706 16 -rw-r--r--  1 markneedham  wheel   5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588  8 -rw-r--r--  1 markneedham  wheel   2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590 32 -rw-r--r--  1 markneedham  wheel  13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587  8 -rw-r--r--  1 markneedham  wheel   3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611  8 -rw-r--r--  1 markneedham  wheel   3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
```

Or the ones modified more than 1 day, 5 hours ago:

```shell
$ find foo -name 'database-agent*' -mtime +1d5h -ls
121879391 24 -rw-r--r--  1 markneedham  wheel   8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394 24 -rw-r--r--  1 markneedham  wheel   8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390 24 -rw-r--r--  1 markneedham  wheel   9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393  8 -rw-r--r--  1 markneedham  wheel   2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413  8 -rw-r--r--  1 markneedham  wheel   3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414  8 -rw-r--r--  1 markneedham  wheel   3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392  8 -rw-r--r--  1 markneedham  wheel   3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz
```

There are lots of other flags you can pass to find but these ones did exactly what I wanted!
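If you want to experiment with these flags without the original log files, the before/after split is easy to reproduce with a couple of synthetic files (hypothetical names; GNU `touch -d`/`find` syntax shown, BSD `touch` takes `-t` instead):

```shell
# two files either side of midday on the 23rd June
mkdir -p demo-logs
touch -d "2016-06-23 11:49" demo-logs/agent-old.tar.gz
touch -d "2016-06-23 14:00" demo-logs/agent-new.tar.gz

find demo-logs -name 'agent-*' -newermt "2016-06-23 12:00"        # agent-new.tar.gz
find demo-logs -name 'agent-*' -not -newermt "2016-06-23 12:00"   # agent-old.tar.gz
```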

Written by Mark Needham

June 24th, 2016 at 4:56 pm

Posted in Shell Scripting


## Unix: Find all text below string in a file

I recently wanted to parse some text out of a bunch of files so that I could do some sentiment analysis on it. Luckily the text I want is at the end of the file and doesn’t have anything after it, but there is text before it that I want to get rid of.

The files look like this:

```
# text I don't care about

= Heading of the bit I care about

# text I care about
```

In other words I want to find the line that contains the Heading and then get all the text after that point.

I figured sed was the tool for the job but my knowledge of the syntax was a bit rusty. Luckily this post served as a refresher.

Effectively what we want to do is delete from the beginning of the file up until the line after the heading. We can do this with the following command:

```shell
$ cat /tmp/foo.txt
# text I don't care about

= Heading of the bit I care about

# text I care about

$ cat /tmp/foo.txt | sed '1,/Heading of the bit I care about/d'

# text I care about
```

That still leaves an extra empty line after the heading which is a bit annoying but easy enough to get rid of by passing another command to sed that strips empty lines:

```shell
$ cat /tmp/foo.txt | sed -e '1,/Heading of the bit I care about/d' -e '/^\s*$/d'
# text I care about
```

The only difference here is that we’re now passing the ‘-e’ flag to allow us to specify multiple commands. If we just pass them sequentially then the 2nd one will be interpreted as the name of a file.
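The whole thing can be reproduced with a throwaway file if you want to experiment (hypothetical content; `[[:space:]]` is the portable spelling of `\s`, which is a GNU sed extension):

```shell
# a stand-in for the real files: noise, a heading, then the text we want
printf 'noise\n\n= The Heading =\n\nkeep this line\n' > /tmp/demo.txt

# delete from line 1 to the heading, then strip blank lines
sed -e '1,/The Heading/d' -e '/^[[:space:]]*$/d' /tmp/demo.txt
# keep this line
```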

Written by Mark Needham

June 19th, 2016 at 8:36 am

Posted in Shell Scripting


## Unix: Split string using separator

I recently found myself needing to iterate over a bunch of ‘/’ separated strings on the command line and extract just the text after the last ‘/’.

e.g. an example of one of the strings

```
A/B/C
```

I wanted to write some code that could split on ‘/’ and then pick the 3rd item in the resulting collection.

One way of doing this is to echo the string and then pipe it through cut:

```shell
$ string="A/B/C"
$ echo ${string} | cut -d"/" -f3
C
```

or awk:

```shell
$ echo ${string} | awk -F"/" '{ print $3 }'
C
```

I don’t like having to echo the string – it feels a bit odd so I wanted to see if there was a way to do the parsing more ‘inline’.

I came across this post which explains how to change the internal field separator (IFS) on the shell and then parse the string into an array using read. I gave it a try:

```shell
$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[2]}
C
```

Works! We can even refer to the last item in the array using -1 instead of its absolute position:

```shell
$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[-1]}
C
```

I’d not come across this use of the ‘read’ function before. The key is the ‘-a’ parameter. From the man page:

```
-a aname
    The words are assigned to sequential indices of the array variable aname,
    starting at 0. All elements are removed from aname before the assignment.
    Other name arguments are ignored.
```

So we’re resetting the internal field separator and then reading the string into another variable as an array split on the ‘/’ character.

Pretty neat although now it’s longer than the original command and I’m sure I’ll forget the syntax.

Further down the page is another suggestion which seems even harder to remember but is much shorter:

```shell
$ echo ${string##*/}
C
```

This drops from the beginning of the string up until the last occurrence of ‘/’ which is exactly what we want.

This way is the nicest and doesn’t require any echoing if we just want to assign the result to a variable. The echo is only used here to see the output.
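For reference, the related expansion operators only differ in which end they trim from and how greedily: `#`/`##` strip a prefix, `%`/`%%` strip a suffix, and doubling the character makes the match greedy.

```shell
string="A/B/C"
echo "${string##*/}"   # C    - longest match of */ removed from the front
echo "${string#*/}"    # B/C  - shortest match of */ removed from the front
echo "${string%/*}"    # A/B  - shortest match of /* removed from the end
echo "${string%%/*}"   # A    - longest match of /* removed from the end
```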

Written by Mark Needham

June 19th, 2016 at 7:22 am

Posted in Shell Scripting


## Unix parallel: Populating all the USB sticks

The day before Graph Connect Europe 2016 we needed to create a bunch of USB sticks containing Neo4j and the training materials. We eventually iterated our way to a half-decent approach which made use of the GNU parallel command, which I’ve always wanted to use!

But first I needed to get a USB hub so I could do lots of them at the same time. I bought the EasyAcc USB 3.0 but there are lots of other ones that do the same job.

Next I mounted all the USB sticks and then renamed the volumes to be NEO4J1 -> NEO4J7:

```shell
for i in 1 2 3 4 5 6 7; do diskutil renameVolume "USB DISK" NEO4J${i}; done
```

I then created a bash function called ‘duplicate’ to do the copying work:

```shell
function duplicate() {
  i=${1}
  echo ${i}
  time rsync -avP --size-only --delete --exclude '.*' --omit-dir-times \
    /Users/markneedham/Downloads/graph-connect-europe-2016/ /Volumes/NEO4J${i}/
}
```

We can now call this function in parallel like so:

```shell
# the function needs to be exported so the shells spawned by parallel can see it
export -f duplicate
seq 1 7 | parallel duplicate
```

And that’s it. We didn’t get a 7x improvement in the throughput of USB creation from doing 7 in parallel, but it took ~9 minutes to complete all 7 compared to ~5 minutes to copy a single stick. Presumably there’s still some part of the copying that is sequential further down – Amdahl’s law #ftw.

I want to go and find other things that I can pipe into parallel now!

Written by Mark Needham

June 1st, 2016 at 5:53 am

Posted in Shell Scripting


## Unix: Stripping first n bytes in a file / Byte Order Mark (BOM)

I’ve previously written a couple of blog posts showing how to strip out the byte order mark (BOM) from CSV files to make loading them into Neo4j easier and today I came across another way to clean up the file using tail.

The BOM is 3 bytes long at the beginning of the file, so if we know that a file contains it then we can strip out those first 3 bytes with tail like this:

```shell
$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv

real    0m31.945s
user    0m31.370s
sys     0m0.518s
```

The -c flag is described thus:

```
-c number
    The location is number bytes.
```

So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.
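To see it working end to end without a 300MB file, we can fabricate a small CSV with a BOM (hypothetical data) and check the first three bytes before and after:

```shell
# \357\273\277 is EF BB BF, the UTF-8 byte order mark
printf '\357\273\277id,name\n1,foo\n' > /tmp/bom.csv

od -An -tx1 -N3 /tmp/bom.csv        # ef bb bf
tail -c +4 /tmp/bom.csv > /tmp/bom_stripped.csv
head -n 1 /tmp/bom_stripped.csv     # id,name
```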

Although using tail is quite simple, it took 30 seconds to process a 300MB CSV file which might actually be slower than opening the file with a Hex editor and manually deleting the bytes!

Written by Mark Needham

August 19th, 2015 at 11:27 pm

Posted in Shell Scripting


## Unix: Redirecting stderr to stdout

I’ve been trying to optimise some Neo4j import queries over the last couple of days and as part of the script I’ve been executing I wanted to redirect the output of a couple of commands into a file to parse afterwards.

I started with the following script which doesn’t do any explicit redirection of the output:

```shell
#!/bin/sh

./neo4j-community-2.2.3/bin/neo4j start
```

Now let’s run that script and redirect the output to a file:

```shell
$ ./foo.sh > /tmp/output.txt
Unable to find any JVMs matching version "1.7".

$ cat /tmp/output.txt
Starting Neo4j Server...WARNING: not changing user
process [48230]... waiting for server to be ready.... OK.
http://localhost:7474/ is ready.
```

So the line about not finding a matching JVM is being printed to stderr. That’s reasonably easy to fix:

```shell
#!/bin/sh

./neo4j-community-2.2.3/bin/neo4j start 2>&1
```

Let’s run the script again:

```shell
$ ./foo.sh > /tmp/output.txt

$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Starting Neo4j Server...WARNING: not changing user
process [47989]... waiting for server to be ready.... OK.
http://localhost:7474/ is ready.
```

Great, that worked as expected. Next I extended the script to stop Neo4j, delete all its data, start it again and execute a cypher script:

```shell
#!/bin/sh

./neo4j-community-2.2.3/bin/neo4j start 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
./neo4j-community-2.2.3/bin/neo4j start 2>&1
time ./neo4j-community-2.2.3/bin/neo4j-shell --file foo.cql 2>&1
```

Let’s run that script and redirect the output:

```shell
$ ./foo.sh > /tmp/output.txt
Unable to find any JVMs matching version "1.7".

real    0m0.604s
user    0m0.334s
sys     0m0.054s

$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
+---------+
| "hello" |
+---------+
| "hello" |
+---------+
1 row
4 ms
```

It looks like our stderr -> stdout redirection on the last line didn’t work. My understanding is that the ‘time’ command swallows all the arguments that follow, so the redirection gets applied to the command being timed rather than to the output of ‘time’ itself, which still goes to the shell’s stderr.
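This is easy to see in isolation with bash’s `time` keyword and a do-nothing command (`sleep 0` here is just a stand-in):

```shell
time sleep 0 > /tmp/timing.txt 2>&1
# /tmp/timing.txt ends up empty: the redirections were applied to sleep,
# while the real/user/sys report went to the shell's own stderr
```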

We can work our way around this problem by putting the actual command in a code block and redirecting the output of that:

```shell
#!/bin/sh

./neo4j-community-2.2.3/bin/neo4j start 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
./neo4j-community-2.2.3/bin/neo4j start 2>&1
{ time ./neo4j-community-2.2.3/bin/neo4j-shell --file foo.cql; } 2>&1
```

```shell
$ ./foo.sh > /tmp/output.txt

$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
+---------+
| "hello" |
+---------+
| "hello" |
+---------+
1 row
4 ms

real    0m0.615s
user    0m0.316s
sys     0m0.050s
```

Much better!

Written by Mark Needham

August 15th, 2015 at 3:55 pm

Posted in Shell Scripting


## Sed: Using environment variables

I’ve been playing around with the BBC football data set that I wrote about a couple of months ago and I wanted to write some code that would take the import script and replace all instances of remote URIs with a file system path.

For example the import file contains several lines similar to this:

```
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" AS row
```

And I want that to read:

```
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
```

The start of that path also happens to be my working directory:

```shell
$ echo $PWD
/Users/markneedham/repos/neo4j-bbc
```

So I wanted to write a script that would look for occurrences of ‘https://raw.githubusercontent.com/mneedham/neo4j-bbc/master’ and replace it with $PWD. I’m a fan of sed so I thought I’d try and use it to solve my problem.

The first thing we can do to make life easy is to change the default delimiter. Sed usually uses ‘/’ to separate parts of the command but since we’re using URIs that’s going to be horrible so we’ll use an underscore instead.

For a first cut I tried just removing that first part of the URI but not replacing it with anything in particular:
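The command itself hasn’t survived in this archived copy; it would have looked something like this (recreating one line of the import script inline rather than editing the real file):

```shell
line='LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" AS row'

echo "$line" | sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__'
# LOAD CSV WITH HEADERS FROM "/data/matches.csv" AS row
```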

Cool! That worked as expected. Now let’s try and replace it with $PWD:
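This snippet is also missing from the archive; with the whole sed script kept in single quotes it would have been along these lines:

```shell
line='LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" AS row'

echo "$line" | sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://$PWD_'
# LOAD CSV WITH HEADERS FROM "file://$PWD/data/matches.csv" AS row
```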

Hmmm that didn’t work as expected. The $PWD is being treated as a literal instead of being evaluated like we want it to be.

It turns out this is a popular question on Stack Overflow and there are lots of suggestions – I tried a few of them and found that single quotes did the trick:
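Again the snippet didn’t survive extraction; the single-quote trick is to close the quoted script just before the variable and reopen it afterwards, so the shell expands $PWD before sed ever sees it:

```shell
line='LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" AS row'

echo "$line" | sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://'$PWD'_'
# with $PWD=/Users/markneedham/repos/neo4j-bbc this gives:
# LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
```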

We could also use double quotes everywhere if we prefer:

Written by Mark Needham

August 13th, 2015 at 7:30 pm

Posted in Shell Scripting


## Neo4j: Using LOAD CSV to help explore CSV files

During the Neo4j How I met your mother hackathon that we ran last week one of the attendees noticed that one of the CSV files we were importing wasn’t creating as many records as they expected it to.

This is typically the case when there’s some odd quoting in the CSV file but we decided to look into it.

The file in question was one containing references made in HIMYM. The first 5 lines look like this:

```shell
$ head -n 5 data/import/references.csv
ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText
168,184,"Marshall will eventually hear back from the New York State Judicatory Committee in Something New, which will become a main plot point of Season 9."
168,169,Barney proclaiming to be done with Robin will be the focal point of Lobster Crawl.
58,57,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
58,63,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
```

And this is how many lines the Unix ‘wc’ command sees:

```shell
$ wc -l data/import/references.csv
     782 data/import/references.csv
```

So we might expect that there are going to be 782 records created if we import that file into Neo4j. Let’s run a quick query in Neo4j to see what it thinks:

```cypher
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
RETURN count(*)
```

```
==> +----------+
==> | count(*) |
==> +----------+
==> | 636      |
==> +----------+
==> 1 row
```

So we have 146 fewer records than we expected, which means Neo4j is treating multiple lines as one CSV record in some cases.

Let’s go back to the Unix command line to try and work out which lines those are. There must be some lines which start with part of the ‘ReferenceText’ rather than a ‘ReferencedEpisodeId’ so let’s extract the first column and see what’s going on there:

```shell
$ cat data/import/references.csv | cut -d"," -f1 | grep -v '[0-9]\+$' | head -n 10
ReferencedEpisodeId
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
Also
```

We’ve extracted the first column and then filtered the output to keep only the rows which don’t consist entirely of numbers – those are our rogue rows.

Let’s switch back to Neo4j land to see which rows it thinks contains these fragments of text:

```cypher
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
```

```
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText |
==> | "45" | "37"  | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45" | "184" | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
```

Interestingly it only returns two rows containing that phrase whereas we see it at least 8 times. Initially I thought this was an issue with the LOAD CSV command but if we filter the rows to only return ones that have a ‘ReferencedEpisodeId’ of ’45’ then we do see them returned:
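That query hasn’t survived in this copy of the post, but it would have been something along these lines:

```cypher
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferencedEpisodeId = "45"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
```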

So the actual problem is that the regex matcher doesn’t deal with the new line in the string.

Our next step is therefore to get rid of new lines within strings. I spent ages trying to find the appropriate command before coming across the following use of awk which does the job:

```shell
$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" | wc -l
     637

$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" > data/import/refs.csv
```

Let’s try the LOAD CSV command again:
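The query is missing from this copy; presumably it was the same count as before, pointed at the cleaned-up file:

```cypher
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/refs.csv" AS row
RETURN count(*)
```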

And there we go!

### Update

Michael pointed out that I could have used the dotall regex flag at the beginning of the regular expression in order to search across new lines without having to remove them! In that case the query would read like this:
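That query doesn’t appear in this copy either; with the dotall flag `(?s)` at the start of the pattern it would read roughly:

```cypher
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ "(?s).*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
```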

Written by Mark Needham

June 11th, 2015 at 11:15 pm

Posted in neo4j,Shell Scripting


## Mac OS X: GNU sed – Hex string replacement / replacing new line characters

Recently I was working with a CSV file which contained both Windows and Unix line endings, which made it difficult to work with.

The actual line endings were HEX ‘0A0D’ i.e. Windows line breaks, but there were also HEX ‘0A’ i.e. Unix line breaks within one of the columns.

I wanted to get rid of the Unix line breaks and discovered that you can do HEX sequence replacement using the GNU version of sed – unfortunately the Mac ships with the BSD version which doesn’t have this functionality.

The first step was therefore to install the GNU version of sed.

```shell
brew install coreutils
brew install gnu-sed --with-default-names
```

I wanted to replace my system sed so that’s why I went with the ‘--with-default-names’ flag – without that flag the GNU sed installation would be accessible as ‘gsed’.

The following is an example of what the lines in the file look like:

```shell
$ echo -e "Hello\x0AMark\x0A\x0D"
Hello
Mark
```

We want to get rid of the new line in between ‘Hello’ and ‘Mark’ but leave the other one be. I adapted one of the commands from this tutorial to look for lines which end in ‘0A’ where that isn’t followed by a ‘0D’:

```shell
$ echo -e "Hello\x0AMark\x0A\x0D" | \
    sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark
```

Let’s go through the parts of the sed command:

• N – this creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The two lines are separated by a new line.
• /\x0A[^\x0D]/ – this matches any lines which contain ‘0A’ not followed by ‘0D’
• s/\n/ / – this substitutes the new line character with a space on the lines matched by the previous address.

Now let’s check it works if we have multiple lines that we want to squash:

```shell
$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D"
Hello
Mark
Hello
Michael

$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D" | \
    sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark
Hello Michael
```

Looks good! The actual file is a bit more nuanced so I’ve still got a bit more work to do but this is a good start.

Written by Mark Needham

June 11th, 2015 at 9:38 pm

Posted in Shell Scripting
