Mark Needham

Thoughts on Software Development

Archive for the ‘Shell Scripting’ Category

Unix: Find files greater than date

without comments

For the latter part of the week I’ve been running some tests against Neo4j which generate a bunch of log files and I wanted to filter those files based on the time they were created to do some further analysis.

This is an example of what the directory listing looks like:

$ ls -alh foo/database-agent-*
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    11K 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    12K 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
-rw-r--r--  1 markneedham  wheel   5.8K 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   9.5K 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.7K 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.8K 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel    13K 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.9K 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

I wanted to split the files in half so that I could have the ones created before and after 12pm on the 23rd June.

I discovered that this type of filtering is actually quite easy to do with the ‘find’ command. So if I want to get the files after 12pm I could write the following:

$ find foo -name database-agent* -newermt "Jun 23, 2016 12:00" -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

And to get the ones before 12pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 12:00" -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

Or we could even find the ones last modified between 12pm and 2pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 14:00" -newermt "Jun 23, 2016 12:00" -ls
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or we can filter by relative time e.g. to find the files last modified in the last 1 day, 5 hours:

$ find foo -name database-agent* -mtime -1d5h -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or the ones modified more than 1 day, 5 hours ago:

$ find foo -name database-agent* -mtime +1d5h -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

There are lots of other flags you can pass to find but these ones did exactly what I wanted!

Written by Mark Needham

June 24th, 2016 at 4:56 pm

Posted in Shell Scripting

Tagged with

Unix: Find all text below string in a file

without comments

I recently wanted to parse some text out of a bunch of files so that I could do some sentiment analysis on it. Luckily the text I want is at the end of the file and doesn’t have anything after it but there is text before it that I want to get rid.

The files look like this:

# text I don't care about
 
= Heading of the bit I care about
 
# text I care about

In other words I want to find the line that contains the Heading and then get all the text after that point.

I figured sed was the tool for the job but my knowledge of the syntax was a bit rusty. Luckily this post served as a refresher.

Effectively what we want to do is delete from the beginning of the file up until the line after the heading. We can do this with the following command:

$ cat /tmp/foo.txt 
# text I don't care about
 
= Heading of the bit I care about
 
# text I care about
$ cat /tmp/foo.txt | sed '1,/Heading of the bit I care about/d'
 
# text I care about

That still leaves an extra empty line after the heading which is a bit annoying but easy enough to get rid of by passing another command to sed that strips empty lines:

$ cat /tmp/foo.txt | sed -e '1,/Heading of the bit I care about/d' -e '/^\s*$/d'
# text I care about

The only difference here is that we’re now passing the ‘-e’ flag to allow us to specify multiple commands. If we just pass them sequentially then the 2nd one will be interpreted as the name of a file.

Written by Mark Needham

June 19th, 2016 at 8:36 am

Posted in Shell Scripting

Tagged with

Unix: Split string using separator

without comments

I recently found myself needing to iterate over a bunch of ‘/’ separated strings on the command line and extract just the text after the last ‘/’.

e.g. an example of one of the strings

A/B/C

I wanted to write some code that could split on ‘/’ and then pick the 3rd item in the resulting collection.

One way of doing this is to echo the string and then pipe it through cut:

$ string="A/B/C"
$ echo ${string} | cut -d"/" -f3
C

or awk:

$ echo ${string} | awk -F"/" '{ print $3}'
C

I don’t like having to echo the string – it feels a bit odd so I wanted to see if there was a way to do the parsing more ‘inline’.

I came across this post which explains how to change the internal field separator (IFS) on the shell and then parse the string into an array using read. I gave it a try:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[2]}
C

Works! We can even refer to the last item in the array using -1 instead of it’s absolute position:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[-1]}
C

I’d not come across this use of the ‘read’ function before. The key is the ‘-a’ parameter. From the man page:

-a aname
The words are assigned to sequential indices of the array variable aname,
starting at 0. All elements are removed from aname before the assignment.
Other name arguments are ignored.

So we’re resetting the internal field separator and then reading the string into another variable as an array split on the ‘/’ character.

Pretty neat although now it’s longer than the original command and I’m sure I’ll forget the syntax.

Further down the page is another suggestion which seems even harder to remember but is much shorter:

$ echo ${string##*/} 
C

This drops from the beginning of the string up until the last occurrence of ‘/’ which is exactly what we want.

This way is the nicest and doesn’t require any echoing if we just want to assign the result to a variable. The echo is only used here to see the output.

Written by Mark Needham

June 19th, 2016 at 7:22 am

Posted in Shell Scripting

Tagged with

Unix parallel: Populating all the USB sticks

without comments

The day before Graph Connect Europe 2016 we needed to create a bunch of USB sticks containing Neo4j and the training materials and eventually iterated our way to a half decent approach which made use of the GNU parallel command which I’ve always wanted to use!


But first I needed to get a USB hub so I could do lots of them at the same time. I bought the EasyAcc USB 3.0 but there are lots of other ones that do the same job.

Next I mouunted all the USB sticks and then renamed the volumes to be NEO4J1 -> NEO4J7:

for i in 1 2 3 4 5 6 7; do diskutil renameVolume "USB DISK" NEO4J${i}; done

I then created a bash function called ‘duplicate’ to do the copying work:

function duplicate() {
  i=${1}
  echo ${i}
  time rsync -avP --size-only --delete --exclude '.*' --omit-dir-times /Users/markneedham/Downloads/graph-connect-europe-2016/ /Volumes/NEO4J${i}/
}

We can now call this function in parallel like so:

seq 1 7 | parallel duplicate

And that’s it. We didn’t get a 7x improvement in the throughput of USB creation from doing 7 in parallel but it took ~ 9 minutes to complete 7 compared to 5 minutes each. Presumably there’s still some part of the copying that is sequential further down – Amdahl’s law #ftw.

I want to go and find other things that I can use pipe into parallel now!

Written by Mark Needham

June 1st, 2016 at 5:53 am

Posted in Shell Scripting

Tagged with

Unix: Stripping first n bytes in a file / Byte Order Mark (BOM)

without comments

I’ve previously written a couple of blog posts showing how to strip out the byte order mark (BOM) from CSV files to make loading them into Neo4j easier and today I came across another way to clean up the file using tail.

The BOM is 3 bytes long at the beginning of the file so if we know that a file contains it then we can strip out those first 3 bytes tail like this:

$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv
 
real	0m31.945s
user	0m31.370s
sys	0m0.518s

The -c command is described thus;

-c number
             The location is number bytes.

So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.

Although using tail is quite simple, it took 30 seconds to process a 300MB CSV file which might actually be slower than opening the file with a Hex editor and manually deleting the bytes!

Written by Mark Needham

August 19th, 2015 at 11:27 pm

Posted in Shell Scripting

Tagged with

Unix: Redirecting stderr to stdout

without comments

I’ve been trying to optimise some Neo4j import queries over the last couple of days and as part of the script I’ve been executed I wanted to redirect the output of a couple of commands into a file to parse afterwards.

I started with the following script which doesn’t do any explicit redirection of the output:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start

Now let’s run that script and redirect the output to a file:

$ ./foo.sh > /tmp/output.txt
Unable to find any JVMs matching version "1.7".
 
$ cat /tmp/output.txt
Starting Neo4j Server...WARNING: not changing user
process [48230]... waiting for server to be ready.... OK.
http://localhost:7474/ is ready.

So the line about not finding a matching JVM is being printed to stderr. That’s reasonably easy to fix:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start 2>&1

Let’s run the script again:

$ ./foo.sh > /tmp/output.txt
 
$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Starting Neo4j Server...WARNING: not changing user
process [47989]... waiting for server to be ready.... OK.
http://localhost:7474/ is ready.

Great, that worked as expected. Next I extended the script to stop Neo4j, delete all it’s data, start it again and execute a cypher script:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
./neo4j-community-2.2.3/bin/neo4j start 2>&1
time ./neo4j-community-2.2.3/bin/neo4j-shell --file foo.cql 2>&1

Let’s run that script and redirect the output:

$ ./foo.sh > /tmp/output.txt
Unable to find any JVMs matching version "1.7".
 
real	0m0.604s
user	0m0.334s
sys	0m0.054s
 
$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
+---------+
| "hello" |
+---------+
| "hello" |
+---------+
1 row
4 ms

It looks like our stderr -> stdout redirection on the last line didn’t work. My understanding is that the ‘time’ command swallows all the arguments that follow whereas we want the redirection to be run afterwards.

We can work our way around this problem by putting the actual command in a code block and redirected the output of that:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
./neo4j-community-2.2.3/bin/neo4j start 2>&1
{ time ./neo4j-community-2.2.3/bin/neo4j-shell --file foo.cql; } 2>&1
$ ./foo.sh  > /tmp/output.txt
 
$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
+---------+
| "hello" |
+---------+
| "hello" |
+---------+
1 row
4 ms
 
real	0m0.615s
user	0m0.316s
sys	0m0.050s

Much better!

Written by Mark Needham

August 15th, 2015 at 3:55 pm

Posted in Shell Scripting

Tagged with

Sed: Using environment variables

without comments

I’ve been playing around with the BBC football data set that I wrote about a couple of months ago and I wanted to write some code that would take the import script and replace all instances of remote URIs with a file system path.

For example the import file contains several lines similar to this:

LOAD CSV WITH HEADERS 
FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" 
AS row

And I want that to read:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" 
AS row

The start of that path also happens to be my working directory:

$ echo $PWD
/Users/markneedham/repos/neo4j-bbc

So I wanted to write a script that would look for occurrences of ‘https://raw.githubusercontent.com/mneedham/neo4j-bbc/master’ and replace it with $PWD. I’m a fan of Sed so I thought I’d try and use it to solve my problem.

The first thing we can do to make life easy is to change the default delimiter. Sed usually uses ‘/’ to separate parts of the command but since we’re using URIs that’s going to be horrible so we’ll use an underscore instead.

For a first cut I tried just removing that first part of the URI but not replacing it with anything in particular:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__' import.cql
 
$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/subs.csv" AS row

Cool! That worked as expected. Now let’s try and replace it with $PWD:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://$PWD_' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file://$PWD/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/subs.csv" AS row

Hmmm that didn’t work as expected. The $PWD is being treated as a literal instead of being evaluated like we want it to be.

It turns out this is a popular question on Stack Overflow and there are lots of suggestions – I tried a few of them and found that single quotes did the trick:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://'$PWD'_' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/subs.csv" AS row

We could also use double quotes everywhere if we prefer:

$ sed "s_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://"$PWD"_" import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/subs.csv" AS row

Written by Mark Needham

August 13th, 2015 at 7:30 pm

Posted in Shell Scripting

Tagged with

Neo4j: Using LOAD CSV to help explore CSV files

without comments

During the Neo4j How I met your mother hackathon that we ran last week one of the attendees noticed that one of the CSV files we were importing wasn’t creating as many records as they expected it to.

This is typically the case when there’s some odd quoting in the CSV file but we decided to look into it.

The file in question was one containing references made in HIMYM. The first 5 lines look like this:

$ head -n 5 data/import/references.csv
ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText
168,184,"Marshall will eventually hear back from the New York State Judicatory Committee in Something New, which will become a main plot point of Season 9."
168,169,Barney proclaiming to be done with Robin will be the focal point of Lobster Crawl.
58,57,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
58,63,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."

And this is how many lines the Unix ‘wc’ command sees:

$ wc -l data/import/references.csv
     782 data/import/references.csv

So we might expect that there are going to be 782 records created if we import that file into Neo4j. Let’s run a quick query in Neo4j to see what it thinks:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" 
AS row
return count(*)
 
==> +----------+
==> | count(*) |
==> +----------+
==> | 636      |
==> +----------+
==> 1 row

So we have 146 less records than we expected which means Neo4j is treating multiple lines as one CSV line in some cases.

Let’s go back to the Unix command line to try and work out which lines those are. There must be some lines which start with part of the ‘ReferenceText’ rather than a ‘ReferenceEpisodeId’ so let’s extract the first column and see what’s going on there:

$ cat data/import/references.csv | cut -d"," -f1 | grep -v  '[0-9]\+$'| head -n 10
ReferencedEpisodeId
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
Also

We’ve extracted the first column and then filter the output to only keep rows which don’t contain all numbers which will be our rogue rows.

Let’s switch back to Neo4j land to see which rows it thinks contains these fragments of text:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                     |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Interestingly it only returns two rows containing that phrase whereas we see it at least 8 times. Initially I thought this was an issue with the LOAD CSV command but if we filter the rows to only return ones that have a ‘ReferencedEpisodeId’ of ’45’ then we do see them returned:

==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "53", ReferenceText -> "The website counting down to the next slap (slapcountdown.com) that Marshall sends Barney reaches zero in Slapsgiving, when the third slap is delivered."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "55", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "57", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "63", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "92", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So the actual problem is that the regex matcher doesn’t deal with the new line in the string.

Our next step is therefore to get rid of new lines within strings. I spent ages trying to find the appropriate command before coming across the following use of awk which does the job:

$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" | wc -l
637
 
$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" > data/import/refs.csv

Let’s try the LOAD CSV command again:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/refs.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "56"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "200"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "100"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "86"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "113"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "161"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And there we go!

Update

Michael pointed out that I could have used the dotall regex flag at the beginning of the regular expression in order to search across new lines without having to remove them! In that case the query would read like this:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ "(?s).*This is the Mother.*"
RETURN row
 
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Written by Mark Needham

June 11th, 2015 at 11:15 pm

Posted in neo4j,Shell Scripting

Tagged with

Mac OS X: GNU sed – Hex string replacement / replacing new line characters

without comments

Recently I was working with a CSV file which contained both Windows and Unix line endings which was making it difficult to work with.

The actual line endings were HEX ‘0A0D’ i.e. Windows line breaks but there were also HEX ‘OA’ i.e. Unix line breaks within one of the columns.

I wanted to get rid of the Unix line breaks and discovered that you can do HEX sequence replacement using the GNU version of sed – unfortunately the Mac ships with the BSD version which doesn’t have this functionaltiy.

The first step was therefore to install the GNU version of sed.

brew install coreutils
brew install gnu-sed --with-default-names

I wanted to replace my system sed so that’s why I went with the ‘–with-default-names’ flag – without that flag I believe the sed installation would be accessible as ‘gs-sed’.

The following is an example of what the lines in the file look like:

$ echo -e "Hello\x0AMark\x0A\x0D"
Hello
Mark

We want to get rid of the new line in between ‘Hello’ and ‘Mark’ but leave the other one be. I adapted one of the commands from this tutorial to look for lines which end in ‘0A’ where that isn’t followed by a ‘0D’:

$ echo -e "Hello\x0AMark\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark

Let’s go through the parts of the sed command:

  • N – this creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The two lines are separated by a new line.
  • /\x0A[^\x0D]/ – this matches any lines which contain ‘OA’ not followed by ‘OD’
  • /s/\n/ / – this substitutes the new line character with a space for those matching lines from the previous command.

Now let’s check it works if we have multiple lines that we want to squash:

$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D"
Hello
Mark
Hello
Michael
 
$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark
Hello Michael

Looks good! The actual file is a bit more nuanced so I’ve still got a bit more work to do but this is a good start.

Written by Mark Needham

June 11th, 2015 at 9:38 pm

Posted in Shell Scripting

Tagged with

Unix: Converting a file of values into a comma separated list

without comments

I recently had a bunch of values in a file that I wanted to paste into a Java program which required a comma separated list of strings.

This is what the file looked like:

$ cat foo2.txt | head -n 5
1.0
1.0
1.0
1.0
1.0

And the idea is that we would end up with something like this:

"1.0","1.0","1.0","1.0","1.0"


The first thing we need to do is quote each of the values. I found a nice way to do this using sed:

$ sed 's/.*/"&"/g' foo2.txt | head -n 5
"1.0"
"1.0"
"1.0"
"1.0"
"1.0"

Now that we’ve got all the values quoted we need to get rid of the new lines and replace them with commas. The way I’d normally do this is using ‘tr’ and then just not copy the final comma…

$ sed 's/.*/"&"/g' foo2.txt | tr '\n' ','
"1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0",

…but I learnt that we can actually do one better than this using ‘paste’ which allows you to replace new lines excluding the last one.

The only annoying thing about paste is that you can’t pipe to it so we need to use process substitution instead:

$ paste -s -d ',' <(sed 's/.*/"&"/g' foo2.txt)
"1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0"

If we’re only a Mac we could even automate the copy/paste step too by piping to ‘pbcopy’:

$ paste -s -d ',' <(sed 's/.*/"&"/g' foo2.txt) | pbcopy

Written by Mark Needham

June 8th, 2015 at 10:23 pm

Posted in Shell Scripting

Tagged with