Mark Needham

Thoughts on Software Development

Unix: Find files greater than date

For the latter part of the week I’ve been running some tests against Neo4j which generate a bunch of log files, and I wanted to filter those files based on the time they were created so I could do some further analysis.

This is an example of what the directory listing looks like:

$ ls -alh foo/database-agent-*
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    11K 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    12K 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
-rw-r--r--  1 markneedham  wheel   5.8K 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   9.5K 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.7K 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.8K 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel    13K 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.9K 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

I wanted to split the files in half so that I could have the ones created before and after 12pm on the 23rd June.

I discovered that this type of filtering is actually quite easy to do with the ‘find’ command. So to get the files created after 12pm I could write the following:

$ find foo -name "database-agent*" -newermt "Jun 23, 2016 12:00" -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

And to get the ones before 12pm:

$ find foo -name "database-agent*" -not -newermt "Jun 23, 2016 12:00" -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

Or we could even find the ones last modified between 12pm and 2pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 14:00" -newermt "Jun 23, 2016 12:00" -ls
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or we can filter by relative time e.g. to find the files last modified in the last 1 day, 5 hours:

$ find foo -name "database-agent*" -mtime -1d5h -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or the ones modified more than 1 day, 5 hours ago:

$ find foo -name "database-agent*" -mtime +1d5h -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz
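
As an aside, the unit suffixes in ‘-mtime -1d5h’ are a BSD find feature (this was on a Mac). I believe GNU find doesn’t accept them, but ‘-mmin’ with the equivalent number of minutes (1 day, 5 hours = 1740 minutes) should do the same job – a sketch I haven’t tested:

$ find foo -name "database-agent*" -mmin -1740 -ls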

There are lots of other flags you can pass to find but these ones did exactly what I wanted!

Written by Mark Needham

June 24th, 2016 at 4:56 pm

Posted in Shell Scripting

Unix: Find all text below string in a file

I recently wanted to parse some text out of a bunch of files so that I could do some sentiment analysis on it. Luckily the text I want is at the end of the file and doesn’t have anything after it, but there is text before it that I want to get rid of.

The files look like this:

# text I don't care about
 
= Heading of the bit I care about
 
# text I care about

In other words I want to find the line that contains the Heading and then get all the text after that point.

I figured sed was the tool for the job but my knowledge of the syntax was a bit rusty. Luckily this post served as a refresher.

Effectively what we want to do is delete everything from the beginning of the file up to and including the heading line. We can do this with the following command:

$ cat /tmp/foo.txt 
# text I don't care about
 
= Heading of the bit I care about
 
# text I care about
$ cat /tmp/foo.txt | sed '1,/Heading of the bit I care about/d'
 
# text I care about

That still leaves an extra empty line after the heading which is a bit annoying but easy enough to get rid of by passing another command to sed that strips empty lines:

$ cat /tmp/foo.txt | sed -e '1,/Heading of the bit I care about/d' -e '/^\s*$/d'
# text I care about

The only difference here is that we’re now passing the ‘-e’ flag to allow us to specify multiple commands. If we just pass them sequentially then the 2nd one will be interpreted as the name of a file.
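
We could probably do the whole thing in one pass with awk too – something like this, which skips everything until the heading and then prints only the non-blank lines (a sketch I haven’t tested against these files):

$ awk '/Heading of the bit I care about/ {found=1; next} found && NF' /tmp/foo.txt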

Written by Mark Needham

June 19th, 2016 at 8:36 am

Posted in Shell Scripting

Unix: Split string using separator

I recently found myself needing to iterate over a bunch of ‘/’ separated strings on the command line and extract just the text after the last ‘/’.

e.g. here’s an example of one of the strings:

A/B/C

I wanted to write some code that could split on ‘/’ and then pick the 3rd item in the resulting collection.

One way of doing this is to echo the string and then pipe it through cut:

$ string="A/B/C"
$ echo ${string} | cut -d"/" -f3
C

or awk:

$ echo ${string} | awk -F"/" '{ print $3}'
C

I don’t like having to echo the string – it feels a bit odd so I wanted to see if there was a way to do the parsing more ‘inline’.

I came across this post which explains how to change the internal field separator (IFS) on the shell and then parse the string into an array using read. I gave it a try:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[2]}
C

Works! We can even refer to the last item in the array using -1 instead of its absolute position:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[-1]}
C

I’d not come across this use of the ‘read’ function before. The key is the ‘-a’ parameter. From the man page:

-a aname
The words are assigned to sequential indices of the array variable aname,
starting at 0. All elements are removed from aname before the assignment.
Other name arguments are ignored.

So we’re resetting the internal field separator and then reading the string into another variable as an array split on the ‘/’ character.

Pretty neat although now it’s longer than the original command and I’m sure I’ll forget the syntax.

Further down the page is another suggestion which seems even harder to remember but is much shorter:

$ echo ${string##*/} 
C

This drops from the beginning of the string up until the last occurrence of ‘/’ which is exactly what we want.

This way is the nicest and doesn’t require any echoing if we just want to assign the result to a variable. The echo is only used here to see the output.
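
For completeness, the mirror image expansions go the other way – these are standard bash parameter expansions, so they should behave as commented:

$ echo ${string%/*}  # drops from the last '/' to the end of the string
A/B
$ echo ${string%%/*} # drops from the first '/' to the end of the string
A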

Written by Mark Needham

June 19th, 2016 at 7:22 am

Posted in Shell Scripting

Python: Regex – matching foreign characters/unicode letters

I’ve been back in the land of screen scraping this week, extracting data from the Game of Thrones wiki, and needed to write a regular expression to pull out characters and actors.

Here are some examples of the format of the data:

Peter Dinklage as Tyrion Lannister
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
Filip Lozić as Young Nobleman
Morgan C. Jones as a Braavosi captain
Adewale Akinnuoye-Agbaje as Malko

So the pattern is:

<actor> as <character>

optionally followed by some other text that we’re not interested in.

The output I want to get is:

Peter Dinklage, Tyrion Lannister
Daniel Naprous, Oznak zo Pahl
Filip Lozić, Young Nobleman
Morgan C. Jones, a Braavosi captain
Adewale Akinnuoye-Agbaje, Malko

I started using the ‘split’ command on the word ‘as’ but that broke down when I realised some of the characters had the letters ‘as’ in the middle of their name. So regex it is!

This was my first attempt:

import re
 
strings = [
    "Peter Dinklage as Tyrion Lannister",
    "Filip Lozić as Young Nobleman",
    "Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)",
    "Morgan C. Jones as a Braavosi captain",
    "Adewale Akinnuoye-Agbaje as Malko"
]
 
regex = "([A-Za-z\-'\. ]*) as ([A-Za-z\-'\. ]*)"
 
for string in strings:
    print string
    match = re.match( regex, string)
    if match is not None:
        print match.groups()
    else:
        print "FAIL"
	print ""
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Filip Lozić as Young Nobleman
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

It works for 4 of the 5 scenarios but not for Filip Lozić. The ‘ć’ character causes the issue, so we need to be able to match foreign characters, which the character set I defined in the regex doesn’t capture.

I came across this Stack Overflow post which said that in some regex libraries you can use ‘\p{L}’ to match all letters. I gave that a try:

regex = "([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)"

And then re-ran the script:

Peter Dinklage as Tyrion Lannister
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
FAIL
 
Filip Lozić as Young Nobleman
FAIL
 
Morgan C. Jones as a Braavosi captain
FAIL
 
Adewale Akinnuoye-Agbaje as Malko
FAIL

Hmmm, not sure if I did it wrong or if that isn’t available in Python. I’ll assume the latter but feel free to correct me in the comments and I’ll update the post.
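
For what it’s worth, I believe the built-in re module doesn’t support ‘\p{L}’ but the third-party regex package does. A rough sketch, assuming the package is installed and the string is decoded to unicode first:

import regex  # the third-party 'regex' package, not the built-in re
 
string = u"Filip Lozi\u0107 as Young Nobleman"  # 'ć' written as a unicode escape
match = regex.match(ur"([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)", string)
if match is not None:
    print match.groups()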

I went searching again and found this post which suggested another approach:

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means “Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore”.

Let’s try plugging that in:

regex = "([A-Za-z\-'\.^\W\d_ ]*) as ([A-Za-z\-'\.^\W\d_ ]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous as Oznak zo Pahl(credited', 'Stunt Performer)')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

So that’s fixed Filip but now Daniel Naprous is being incorrectly parsed.

For Attempt #4 I decided to try excluding what I don’t want instead:

regex = "([^0-9\(]*) as ([^0-9\(]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

That does the job but has exposed my lack of regex skillz. If you know a better way let me know in the comments.
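
One other idea that occurred to me afterwards (a sketch, not run against the full dataset): decoding the byte strings to unicode means the re.UNICODE flag makes ‘\w’ cover letters like ‘ć’, and making the first group non-greedy stops it swallowing the ‘(credited as ...)’ suffix:

import re
 
regex = ur"([\w\-'\. ]+?) as ([\w\-'\. ]+)"
 
for string in strings:
    # decode the UTF-8 byte string so \w matches non-ASCII letters
    match = re.match(regex, string.decode("utf-8"), re.UNICODE)
    if match is not None:
        print match.groups()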

Written by Mark Needham

June 18th, 2016 at 7:38 am

Posted in Python

Unix parallel: Populating all the USB sticks

The day before Graph Connect Europe 2016 we needed to create a bunch of USB sticks containing Neo4j and the training materials, and we eventually iterated our way to a half decent approach which made use of the GNU parallel command, which I’ve always wanted to use!

But first I needed to get a USB hub so I could do lots of them at the same time. I bought the EasyAcc USB 3.0 but there are lots of other ones that do the same job.

Next I mounted all the USB sticks and then renamed the volumes to be NEO4J1 -> NEO4J7:

for i in 1 2 3 4 5 6 7; do diskutil renameVolume "USB DISK" NEO4J${i}; done

I then created a bash function called ‘duplicate’ to do the copying work:

function duplicate() {
  i=${1}
  echo ${i}
  time rsync -avP --size-only --delete --exclude '.*' --omit-dir-times /Users/markneedham/Downloads/graph-connect-europe-2016/ /Volumes/NEO4J${i}/
}

We can now call this function in parallel like so:

seq 1 7 | parallel duplicate
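
One caveat: parallel runs each job in a fresh shell, so the bash function may need exporting first for the sub-shells to see it:

export -f duplicate
seq 1 7 | parallel duplicate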

And that’s it. We didn’t get a 7x improvement in the throughput of USB creation from doing 7 in parallel, but it took ~9 minutes to complete all 7 compared to ~5 minutes each if we’d done them sequentially. Presumably there’s still some part of the copying that is sequential further down – Amdahl’s law #ftw.

I want to go and find other things that I can pipe into parallel now!

Written by Mark Needham

June 1st, 2016 at 5:53 am

Posted in Shell Scripting

Neo4j vs Relational: Refactoring – Extracting node/table

In my previous blog post I showed how to add a new property to nodes with a given label, and the equivalent field to records in a table, for a football transfers dataset that I’ve been playing with.

After introducing this ‘nationality’ property I realised that I now had some duplication in the model:

[relational model diagram]

players.nationality and clubs.country refer to the same countries, but both store them as plain strings, so we can’t enforce the integrity of our countries or be sure that we’re referring to the same country.

We have the same issue in the graph model as well:

[graph model diagram]

This time Player.nationality and Club.country refer to the same countries.

We can solve our problem by introducing a countries table in the relational model and a set of nodes with a ‘Country’ label in the graph model. Let’s start with relational.

This is the model we’re driving towards:

[target relational model diagram]

The first thing we need to do is create a countries table and populate it:

CREATE TABLE countries (
    "code" CHARACTER VARYING(3) NOT NULL PRIMARY KEY,
    "name" CHARACTER VARYING(50) NOT NULL
);
INSERT INTO countries VALUES('MNE', 'Montenegro');
INSERT INTO countries VALUES('SWZ', 'Swaziland');
...

Next let’s update the clubs table to reference the countries table:

ALTER TABLE clubs
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);

And let’s run a query to populate that column:

UPDATE clubs AS cl
SET country_id = c.code
FROM clubs
INNER JOIN countries AS c 
ON c.name = clubs.country
WHERE cl.id = clubs.id;

This query iterates over all the clubs, looks up the country code for each row in the countries table and then stores it in the ‘country_id’ field. Finally we can remove the ‘country’ field:

ALTER TABLE clubs
DROP COLUMN country;

Now we do the same drill for the players table:

ALTER TABLE players
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);
UPDATE players AS p
SET country_id = c.code
FROM players
INNER JOIN countries AS c 
ON c.name = players.nationality
WHERE p.id = players.id;
ALTER TABLE players
DROP COLUMN nationality;
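
A sanity check worth running before each of the DROP COLUMN steps – a simple query to look for rows that didn’t get matched to a country:

SELECT COUNT(*) AS unmatched
FROM players
WHERE country_id IS NULL;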

Now it’s time for the graph. This is the model we want to get to:

[target graph model diagram]

First we’ll create the countries:

CREATE CONSTRAINT ON (c:Country)
ASSERT c.id IS UNIQUE

LOAD CSV WITH HEADERS FROM "file:///countries.csv"
AS row
MERGE (country:Country {id: row.countryCode})
ON CREATE SET country.name = row.country

And now let’s get clubs and players to point at those countries nodes and get rid of their respective nationality/country properties:

MATCH (club:Club)
MATCH (country:Country {name: club.country})
MERGE (club)-[:PART_OF]->(country)
REMOVE club.country

MATCH (player:Player)
MATCH (country:Country {name: player.nationality})
MERGE (player)-[:PLAYS_FOR]->(country)
REMOVE player.nationality

And that’s it, we can now write queries against our new model.
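
For example, a query along these lines (a quick sketch) groups players and clubs by their canonical country node:

MATCH (country:Country)
OPTIONAL MATCH (player:Player)-[:PLAYS_FOR]->(country)
OPTIONAL MATCH (club:Club)-[:PART_OF]->(country)
RETURN country.name, COUNT(DISTINCT player) AS players, COUNT(DISTINCT club) AS clubs
ORDER BY players DESC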

Written by Mark Needham

May 22nd, 2016 at 9:58 am

Posted in neo4j

Neo4j vs Relational: Refactoring – Add a new field/property

A couple of months ago I presented a webinar comparing how you’d model and evolve a data model using a PostgreSQL database and Neo4j.

This is what the two data models looked like after the initial data import and before any refactoring/migration had been done:

Relational

[relational model diagram]

Graph

[graph model diagram]

I wanted to add a ‘nationality’ field to the players table in the SQL schema, and a corresponding ‘nationality’ property to the nodes with the ‘Player’ label in the graph.

This refactoring is quite easy in both models. In the relational database we first run a query to add the ‘nationality’ field to the table:

ALTER TABLE players 
ADD COLUMN nationality CHARACTER VARYING(30);

And then we need to generate UPDATE statements from our data dump to update all the existing records:

UPDATE players 
SET nationality = 'Brazil'
WHERE players.id = '/aldair/profil/spieler/4151';
 
...

In the graph we can do this in a single step by processing our data dump using the LOAD CSV command and then setting a property on each player:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///transfers.csv" AS row
MATCH (player:Player {id: row.playerUri})
SET player.nationality = row.playerNationality
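
A quick way to eyeball the result – for example, counting players by the new property:

MATCH (player:Player)
RETURN player.nationality AS nationality, COUNT(*) AS players
ORDER BY players DESC
LIMIT 10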

If we wanted to make the nationality field non-nullable we could go back and run the following statements:

ALTER TABLE players 
ALTER COLUMN nationality SET NOT NULL;

CREATE CONSTRAINT ON (player:Player) 
ASSERT exists(player.nationality)

And we’re done!

Written by Mark Needham

May 22nd, 2016 at 9:09 am

Posted in neo4j

R: substr – Getting a vector of positions

I recently found myself writing an R script to extract parts of a string based on a beginning and end index, which is reasonably easy using the substr function:

> substr("mark loves graphs", 0, 4)
[1] "mark"

But what if we have a vector of start and end positions?

> substr("mark loves graphs", c(0, 6), c(4, 10))
[1] "mark"

Hmmm, that didn’t work as I expected! It turns out substr only uses the first element of the start and stop vectors, so we actually need the substring function instead, which recycles its arguments – something that wasn’t initially obvious to me on reading the documentation:

> substring("mark loves graphs", c(0, 6, 12), c(4, 10, 17))
[1] "mark"   "loves"  "graphs"

Easy when you know how!

Written by Mark Needham

April 18th, 2016 at 7:49 pm

Posted in R

R: tm – Unique words/terms per document

I’ve been doing a bit of text mining over the weekend using the R tm package and I wanted to only count a term once per document, which isn’t how it works out of the box.

For example let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text, and then output a TermDocumentMatrix which shows how frequently each term shows up across the corpus. I had to tweak the default minimum word length of 3 to make sure we could see ‘am’ and ‘cool’.

But we’ve actually got some duplicate terms in each of our documents so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!
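
An alternative I spotted afterwards (treat this as a sketch): TermDocumentMatrix supports binary weighting, which also counts each term at most once per document without transforming the corpus first:

tdm = as.matrix(TermDocumentMatrix(corpus, control = list(weighting = weightBin, wordLengths = c(1, Inf))))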

Written by Mark Needham

April 11th, 2016 at 5:40 am

Posted in R

Neo4j: A procedure for the SLM clustering algorithm

In the middle of last year I blogged about the Smart Local Moving algorithm, which is used for community detection in networks, and with the upcoming introduction of procedures in Neo4j I thought it’d be fun to make that code accessible as one.

If you want to grab the code and follow along it’s sitting on the SLM repository on my github.

At the moment the procedure is hard coded to work with a KNOWS relationship between two nodes but that could easily be changed.

To check it’s working correctly I thought it’d make most sense to use the Karate Club data set described on the SLM home page. I think this data set is originally from Networks, Crowds and Markets.

I wrote the following LOAD CSV script to create the graph in Neo4j:

LOAD CSV FROM "file:///Users/markneedham/projects/slm/karate_club_network.txt" as row
FIELDTERMINATOR "\t"
MERGE (person1:Person {id: row[0]})
MERGE (person2:Person {id: row[1]})
MERGE (person1)-[:KNOWS]->(person2)
[imported graph visualisation]

Next we need to call the procedure, which will add an appropriate label to each node depending on which community it belongs to. This is what the procedure code looks like:

public class ClusterAllTheThings
{
    @Context
    public org.neo4j.graphdb.GraphDatabaseService db;
 
    @Procedure
    @PerformsWrites
    public Stream<Cluster> knows() throws IOException
    {
        String query = "MATCH (person1:Person)-[r:KNOWS]->(person2:Person) \n" +
                       "RETURN person1.id AS p1, person2.id AS p2, toFloat(1) AS weight";
 
        Result rows = db.execute( query );
 
        ModularityOptimizer.ModularityFunction modularityFunction = ModularityOptimizer.ModularityFunction.Standard;
        Network network = Network.create( modularityFunction, rows );
 
        double resolution = 1.0;
        int nRandomStarts = 1;
        int nIterations = 10;
        long randomSeed = 0;
 
        double modularity;
 
        Random random = new Random( randomSeed );
 
        double resolution2 = modularityFunction.resolution( resolution, network );
 
        Map<Integer, Node> cluster = new HashMap<>();
        double maxModularity = Double.NEGATIVE_INFINITY;
 
        for ( int randomStart = 0; randomStart < nRandomStarts; randomStart++ )
        {
            network.initSingletonClusters();
 
            int iteration = 0;
            do
            {
                network.runSmartLocalMovingAlgorithm( resolution2, random );
                iteration++;
 
                modularity = network.calcQualityFunction( resolution2 );
            } while ( (iteration < nIterations) );
 
            if ( modularity > maxModularity )
            {
                network.orderClustersByNNodes();
                cluster = network.getNodes();
                maxModularity = modularity;
            }
        }
 
        for ( Map.Entry<Integer, Node> entry : cluster.entrySet() )
        {
            Map<String, Object> params = new HashMap<>();
            params.put("userId", String.valueOf(entry.getKey()));
 
            db.execute("MATCH (person:Person {id: {userId}})\n" +
                       "SET person:`" + (format( "Community-%d`", entry.getValue().getCluster() )),
                    params);
        }
 
        return cluster
                .entrySet()
                .stream()
                .map( ( entry ) -> new Cluster( entry.getKey(), entry.getValue().getCluster() ) );
    }
 
    public static class Cluster
    {
        public long id;
        public long clusterId;
 
        public Cluster( int id, int clusterId )
        {
            this.id = id;
            this.clusterId = clusterId;
        }
    }
}

I’ve hardcoded some parameters to use defaults which could be exposed through the procedure to allow more control if necessary. The Network#create function assumes it is going to receive a stream of rows containing columns ‘p1’, ‘p2’ and ‘weight’ to represent the ‘source’, ‘destination’ and ‘weight’ of the relationship between them.

We call the procedure like this:

CALL org.neo4j.slm.knows()

It will return each of the nodes and the cluster it’s been assigned to, and if we then visualise the network in the neo4j browser we’ll see this:

[clustered graph visualisation]

which is similar to the visualisation from the SLM home page:

[SLM home page visualisation]
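
If you’d rather inspect the communities with a query than a visualisation, something along these lines should count the members of each ‘Community-*’ label (a sketch):

MATCH (person:Person)
UNWIND [label IN labels(person) WHERE label STARTS WITH "Community-"] AS community
RETURN community, COUNT(*) AS members
ORDER BY members DESC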

If you want to play around with the code feel free. You’ll need to run the following commands to create the JAR for the plugin and deploy it.

$ mvn clean package 
$ cp target/slm-1.0.jar /path/to/neo4j/plugins/ 
$ /path/to/neo4j/bin/neo4j restart

And you’ll need the latest milestone of Neo4j which has procedures enabled.

Written by Mark Needham

February 28th, 2016 at 8:40 pm

Posted in neo4j
