Mark Needham

Thoughts on Software Development

Archive for July, 2012

London Bus Stops API: Mapping northing/easting values to lat/long

with one comment

I started playing around with the TFL Bus stop location and routes API and one of the annoying things about the data is that it uses easting/northing values to describe the location of bus stops rather than lat/longs.

The first few lines of the CSV file look like this:

1000,91532,490000266G,WESTMINSTER STN <> / PARLIAMENT SQUARE,530171,179738,177,0K08,0
10001,72689,490013793E,TREVOR CLOSE,515781,174783,78,NB16,0
10002,48461,490000108F,HIGHBURY CORNER,531614,184603,5,C902,0

For each of the stops I wanted to convert the easting/northing value to the equivalent lat/long value. I couldn’t find a simple way of doing the conversion in code, although I did come across an API that would do it for me.

I wrote the following script to save a new CSV file with all the London bus stops and their lat/long location:

require 'rubygems'
require 'csv'
require 'open-uri'
require 'json'
 
data_dir = File.expand_path('data') + '/'
file_name = data_dir + "stops.csv"
 
stops = CSV.read(file_name).drop(1)
 
out = CSV.open(data_dir + "stops_with_lat_longs.csv","w")
 
stops.each do |stop|
  code = stop[1]
  easting = stop[4]
  northing = stop[5]
  # look up the lat/long for this easting/northing pair via the API
  url = "http://www.uk-postcodes.com/eastingnorthing.php?easting=#{easting}&northing=#{northing}"
  location = JSON.parse(open(url).read)
 
  puts "Processing #{stop[3]}: #{location['lat']}, #{location['lng']}"
 
  out << [code, location['lat'],location['lng']]
end
 
out.close

I’ve uploaded the file with the mapping from bus stop code to lat/long to GitHub as well.

Peter Hicks has a blog post showing another way of doing this using just Ruby code but I couldn’t get the ‘proj4’ gem to install and I didn’t fancy shaving that yak when I had another solution which worked.

Written by Mark Needham

July 30th, 2012 at 10:28 pm

Puppet: Keeping the discipline

without comments

For the last 5 weeks or so I’ve been working with puppet every day to automate the configuration of various nodes in our stack. My most interesting observation so far is that you really need to keep your discipline when doing this type of work.

We can keep that discipline in three main ways when developing modules.

Running from scratch

Configuring various bits of software seems to follow the 80/20 rule and we get very close to having each thing working quite quickly but then end up spending a disproportionate amount of time tweaking the last little bits.

When it’s ‘just little bits’ there is a massive temptation to make changes manually and then retrospectively put it into a puppet module and assume that it will work if we run from scratch.

From doing that a few times and seeing a module fail when run at a later stage we’ve realised it isn’t a very good approach and we’ve started to get into the discipline of starting from scratch each time.

Ideally that would mean creating snapshots of a Virtual Machine instance and rolling back after each run but that ends up taking quite a bit of time so more often we’re removing everything that a module installs and then running that module again.

Running as part of environment

We tend to run the modules individually when we start writing them but it’s also important to run them as part of an environment definition to make sure modules play nicely with each other.

This is particularly useful if there are dependencies between our module and another one – perhaps our module will be notified if there’s a configuration file change in another module for example.

Running on different machines

Even if we do the above two things we’ve still found that we catch problems with our modules when we try them out on other machines.

Machines which have been in use for a longer period of time will have earlier versions of a module which aren’t easy to upgrade and we’ll need to remove the old version and install the new one.

We’ve also come across occasions where other machines have left over files from modules that have since been removed and although the files seem harmless they can interfere with new modules that we’re writing.

Written by Mark Needham

July 29th, 2012 at 9:53 pm

Unix: tee

without comments

I’ve read about the Unix ‘tee’ command before but never found a reason to use it until the last few weeks.

One of the things I repeatedly do by mistake is open /etc/hosts without sudo and then try to make changes to it:

$ vi /etc/hosts
# Editing it leads to the dreaded 'W10: Changing a readonly file'

I always used to close the file and then re-open it with sudo but I recently came across an approach which allows us to use ‘tee’ to get around the problem.

Once we’ve made our change we can run the following command:

:w !sudo tee %

The ‘:w’ is usually used to write the buffer to the current file but here we’re substituting a shell command to receive the buffer instead.

In this case we’re sending the buffer to the ‘sudo tee %’ command where ‘%’ refers to the current file name.

tee is defined like so:

The tee utility copies standard input to standard output, making a copy in zero or more files. The output is unbuffered.

So we’re sending the buffer for the current file to tee which then saves it to the current file. Since we’ve run tee with sudo it will give us the appropriate permissions to write to the file that we didn’t initially have.

This Stack Overflow post explains what’s going on in more detail.

Another thing we’ve been working on is hooking up the logs of our various services to logstash and to test it out we needed to write to files in the /var/log directory.

We started out trying to echo a message to the end of the apache log file like this:

$ echo "random message" >> /var/log/apache2/access_log 
zsh: permission denied: /var/log/apache2/access_log

As you can see that doesn’t work since we don’t have permissions to write to files in /var/log so we need to find another way to do that and Nick pointed out that tee would do the job here as well.

$ echo "random message" | sudo tee -a /var/log/apache2/access_log
random message

Passing the ‘-a’ flag to tee means that it appends standard input to the file instead of replacing the whole file like before. If we don’t want standard input to be directed to standard output we can redirect that to /dev/null like so:

$ echo "random message" | sudo tee -a /var/log/apache2/access_log > /dev/null
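That append-versus-truncate distinction is the same one we get with file modes in Ruby, if a comparison helps:

```ruby
require 'tempfile'

# mode "w" truncates the file like plain tee; mode "a" appends like tee -a
tmp = Tempfile.new("log")
File.open(tmp.path, "w") { |io| io.puts "first" }
File.open(tmp.path, "a") { |io| io.puts "second" }
puts File.read(tmp.path)   # prints "first" then "second"
```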

Written by Mark Needham

July 29th, 2012 at 7:11 pm

Posted in Shell Scripting


neo4j: Multiple starting nodes by index lookup

without comments

I spent a bit of time this evening extracting some data from the ThoughtWorks graph for our marketing team who were interested in anything related to our three European offices in London, Manchester and Hamburg.

The most interesting things we can explore relate to the relationship between people and the offices.

The model around people and offices looks like this:

Current home office

I added a ‘current_home_office’ relationship to make it easier to quickly get to the nodes of people who are currently working in a specific office.

I previously had to use a bit of a convoluted query to achieve the same thing. The ‘current_home_office’ relationship would need to be maintained as people move offices, but since I’m just playing around with this data it makes my life much easier.

I wanted to find the number of different offices that people in Europe had worked in which I’ve previously calculated by running three separate queries and manually compiling the results:

START office=node:offices(name = "London - UK South") 
MATCH person-[:current_home_office]->office, person-[:member_of]->otherOffice 
RETURN distinct person.name, count(distinct(otherOffice)) AS offices, ID(person)
ORDER BY offices DESC

START office=node:offices(name = "Manchester - UK North") 
MATCH person-[:current_home_office]->office, person-[:member_of]->otherOffice 
RETURN distinct person.name, count(distinct(otherOffice)) AS offices, ID(person)
ORDER BY offices DESC

START office=node:offices(name = "Hamburg - Germany") 
MATCH person-[:current_home_office]->office, person-[:member_of]->otherOffice 
RETURN distinct person.name, count(distinct(otherOffice)) AS offices, ID(person)
ORDER BY offices DESC

Here we start from a specific office and then match all the people who have this as their current home office. We also match all the offices that person has been a member of and then return a count of the unique offices the person has been to.

That approach works fine but it always felt like I was missing out on a simple way of doing all three of those queries in one since they are identical except for the office name.

What I wanted to do was define 3 starting nodes by combining those Lucene index lookups into one which captured all 3 offices.
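Since the only thing changing between the queries is the office name, building that combined Lucene string lends itself to a bit of Ruby (just a sketch, with the office names hard coded):

```ruby
# build a single Lucene query that matches any of the three offices
offices = ["London - UK South", "Manchester - UK North", "Hamburg - Germany"]
lucene  = offices.map { |name| %Q{name:"#{name}"} }.join(" OR ")
puts lucene
# => name:"London - UK South" OR name:"Manchester - UK North" OR name:"Hamburg - Germany"
```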

I started just trying to get the offices to return while following the Lucene query parser syntax documentation.

For some reason my initial attempt returned neither an error nor any results:

neo4j-sh (0)$ START office=node:offices("name:'London - UK South' OR name:'Hamburg - Germany' 
              OR name:'Manchester - UK North'")  RETURN office
neo4j-sh (0)$

When I switched the quotes around it was much happier:

neo4j-sh (0)$ START office=node:offices('name:"London - UK South" OR name:"Hamburg - Germany" 
              OR name:"Manchester - UK North"')  RETURN office
==> +-------------------------------------------------------------------------+
==> | office                                                                  |
==> +-------------------------------------------------------------------------+
==> | Node[4171]{type->"office",country->"UK",name->"Manchester - UK North"}  |
==> | Node[4168]{type->"office",country->"UK",name->"London - UK South"}      |
==> | Node[4177]{type->"office",country->"Germany",name->"Hamburg - Germany"} |
==> +-------------------------------------------------------------------------+
==> 3 rows, 0 ms
==>

We can then reduce the queries which find how many offices people from ThoughtWorks Europe have worked in down to the following query:

START office=node:offices('name:"London - UK South" OR name:"Hamburg - Germany" OR name:"Manchester - UK North"') 
MATCH person-[:current_home_office]->office, person-[:member_of]->otherOffice 
RETURN distinct person.name, count(distinct(otherOffice)) AS offices
ORDER BY offices DESC

Written by Mark Needham

July 28th, 2012 at 11:32 pm

Posted in neo4j


R: Mapping a function over a collection of values

with 2 comments

I spent a bit of Sunday playing around with R and one thing I wanted to do was map a function over a collection of values and transform each value slightly.

I loaded my data set using the ‘Import Dataset’ option in R Studio (suggested to me by Rob) which gets converted to the following function call:

> data <-  read.csv("~/data.csv", header=T, encoding="ISO-8859")
> data
  Column1 InterestingColumn
1    Mark             12.50
2    Dave            100.00
3    John          1,231.00

data.csv

Column1, InterestingColumn
Mark, 12.50
Dave, 100.00 
John, 1,231.00

data is a table with the type ‘3 obs. of 2 variables’ in R Studio.

I was only interested in the values in the 2nd column so I selected those like this:

> data$InterestingColumn
[1]  12.50     100.00    1,231.00
Levels:  1,231.00  100.00  12.50

I wanted to apply a function over each of the numbers and return a new list containing those transformations.

I initially had a look at doing this with a for loop but it didn’t turn out to be as easy as I’d expected and I eventually came across the lapply function which allows you to apply a function over a list or vector.

> values <- data$InterestingColumn
> lapply(values, function(x) 5000 - as.numeric(gsub("\\s|,","", x)))
[[1]]
[1] 4987.5
 
[[2]]
[1] 4900
 
[[3]]
[1] 3769

We define a function which subtracts the value in the column from 5000 since the CSV file contained derived values and I was interested in the original value.

In order to do that subtraction I needed to cast the value from the CSV file to be numeric which first involved stripping out any spaces or commas using gsub and then casting the string using as.numeric.
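For comparison, the same clean-and-subtract step written in Ruby (a sketch using the three raw values from the CSV above):

```ruby
# strip spaces and commas, cast to a number, then subtract from 5000
values = ["12.50", "100.00", "1,231.00"]
transformed = values.map { |v| 5000 - v.gsub(/\s|,/, "").to_f }
p transformed   # => [4987.5, 4900.0, 3769.0]
```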

If we want to have a table structure then we can use the ‘by’ function to do a similar thing:

> as.table(by(data$InterestingColumn, data$Column1, function(x) 5000 - as.numeric(gsub("\\s|,","", x))))
data$Column1
  Dave   John   Mark 
4900.0 3769.0 4987.5

I don’t know enough R to know how to keep the data in exactly the same structure as we got it so if anyone can point me in the right direction that’d be cool.

Written by Mark Needham

July 23rd, 2012 at 11:25 pm

Posted in R


neo4j: Graph Global vs Graph Local queries

with one comment

A few weeks ago I did a presentation at the ThoughtWorks EU away day on the graph I’ve been developing using neo4j and I wanted to show who the most connected people in each of our European offices were.

I started with the following cypher query:

START n = node(*)
 
MATCH n-[r:colleagues*1..2]->c, n-[r2:member_of]->office 
 
WHERE n.type? = 'person'
AND (NOT(HAS(r2.end_date))) 
AND office.name = 'London - UK South' 
AND (NOT(HAS(c.thoughtquitter)))
 
RETURN n.name, count(distinct(c)) AS connections, office.name 
 
ORDER BY connections DESC

The intention is to find all the people who currently work in the London office and find their 1st and 2nd level connections i.e. people they have worked with or people who have worked with people they have worked with.

That gives a rough idea of their reach in the organisation.

I also wanted to exclude anyone who had left, hence the ‘thoughtquitter’ check.

When I ran this query it took about 15 minutes to complete which immediately made me realise I was doing something wrong although I wasn’t sure exactly what.

I showed the query to Michael Hunger and he pointed out that it’s a graph global query i.e. it probably touches most of the nodes on the graph even though it doesn’t necessarily need to.

Michael suggested restructuring the query to start from the London office and work out from there:

START office=node:offices(name='London - UK South')
 
MATCH office<-[r2:member_of]-n-[r:colleagues*1..2]->c 
 
WHERE n.type? = 'person' 
AND (NOT(HAS(r2.end_date))) 
AND (NOT(HAS(c.thoughtquitter)))
 
RETURN n.name, count(distinct(c)) AS connections, office.name 
 
ORDER BY connections DESC

Now we are starting at the London office and finding all the people who have been members of it at some stage. We go and find their level 2 colleagues and then do the same filters that we had in the first query.

The query time went down to around 45 seconds, presumably because the number of nodes on the graph that we touch has gone down substantially.

The lesson for me from this is that when doing a graph traversal it makes much more sense to pick specific starting and/or finishing nodes rather than selecting all of them.

Written by Mark Needham

July 23rd, 2012 at 10:23 pm

Posted in neo4j


neo4j: Embracing the sub graph

with 3 comments

In May I wrote a blog post explaining how I’d been designing a neo4j graph by thinking about what questions I wanted to answer about the data.

In the comments Josh Adell gave me the following advice:

The neat things about graphs is that multiple subgraphs can live in the same data-space.

Keep your data model rich! Don’t be afraid to have as many relationships as you need. The power of graph databases comes from finding surprising results when you have strongly interconnected data.

At the time I didn’t really understand the advice but I’ve since updated my graph so that it includes ‘colleagues’ relationships which can be derived by looking at the projects that people had worked together on.


When I was showing the graph to Marc he thought it would be quite interesting to see all the people that he hadn’t worked with before, a query that’s very easy to write when we have these two sub graphs.

We could write the following cypher query to find who Marc hadn’t worked with on a specific project:

START project=node:projects(name="Project X"), me=node:people(name="Marc Hofer")
MATCH project<-[:worked_on]-neverColleague-[c?:colleagues]->me
WHERE c is NULL
RETURN neverColleague

We do an index lookup to find the appropriate project node and to find the node that represents Marc.

The MATCH clause starts from the project and then works backwards to the people who have worked on it. We then follow an optional colleagues relationship back to Marc.

The WHERE clause makes sure that we only return people that Marc doesn’t have a ‘colleagues’ relationship with i.e. people Marc hasn’t worked with.

Along the same lines we could also find out all the people that Marc has worked with from a specific office:

START office=node:offices(name="London - UK South"), me=node:people(name="Marc Hofer")
MATCH office<-[r:member_of]-colleague-[c:colleagues]->me
WHERE (NOT(HAS(r.end_date)))
RETURN colleague, c

This is reasonably similar to the previous query except the colleagues relationship is no longer optional since we only want to find people that Marc has worked with.

I’m sure there are other queries we could run but these were a couple that I hadn’t thought of before I had multiple sub graphs together on the same overall graph.

Written by Mark Needham

July 21st, 2012 at 10:46 pm

Posted in neo4j


neo4j: Shortest Path with and without cypher

without comments

I was looking back at some code I wrote a few months ago to query a neo4j database to find the shortest path between two people via the colleagues relationships that exist.

The initial code, written using neography, looked like this:

neo = Neography::Rest.new
 
start_node = neo.get_node(start_node_id)
destination_node = neo.get_node(destination_node_id)
 
neo.get_paths(start_node,
              destination_node,
              { "type" => "colleagues" },
              depth = 3,
              algorithm = "shortestPath")

The neography code eventually makes a POST request to /node/{start_id}/paths and provides a JSON payload containing the other information about the query.
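For illustration, the payload probably looks something like this. The field names are based on my reading of the neo4j REST API docs rather than anything captured from the wire, and the node id is made up:

```ruby
require 'json'

destination_node_id = 123  # hypothetical node id
# a sketch of the body POSTed to /node/{start_id}/paths
payload = {
  "to"            => "http://localhost:7474/db/data/node/#{destination_node_id}",
  "relationships" => { "type" => "colleagues" },
  "max_depth"     => 3,
  "algorithm"     => "shortestPath"
}
puts JSON.pretty_generate(payload)
```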

It works fine and is pretty readable but since I wrote it I’ve learnt how to write queries using the cypher query language so I thought it would be interesting to contrast the two bits of code against each other.

This is the equivalent written using cypher:

query =  "START source=node(#{start_node_id}), destination=node(#{destination_node_id}) "
query << "MATCH p = allShortestPaths(source-[r:colleagues*..3]->destination) "
query << "RETURN NODES(p)"
 
neo.execute_query(query)

The number of lines of code is pretty similar but the thing I like about cypher is that it feels much more declarative and therefore, I think at least, easier to understand.

Having said that, the learning curve for the non-cypher API is a bit easier and it’s probably best to start with that to get a feel of what sorts of things you can do with a graph.

When I first started learning cypher I was always forgetting which order to put the different key words and what the syntax was for selecting your starting node.

After a while you do get the hang of it though and it starts to feel like any other programming language.

If we didn’t have the node ids of our source and destination nodes but we had two people’s names which had been stored in a ‘people’ Lucene index then we could just as easily find the path between them like this:

START source=node:people(name="Mark Needham"), destination=node:people(name="Cameron Swords")
MATCH p = allShortestPaths(source-[r:colleagues*..3]->destination)
RETURN NODES(p)

I use cypher for pretty much everything I do when reading from the graph and since version 1.8 M01 it’s been possible to construct queries which mutate the graph as well.

Written by Mark Needham

July 19th, 2012 at 7:57 pm

Posted in neo4j


neo4j: java.security.NoSuchAlgorithmException: Algorithm [JKS] of type [KeyStore] from provider [org.bouncycastle.jce.provider.BouncyCastleProvider: name=BC version=1.4]

without comments

I’ve spent the last couple of hours moving my neo4j graph from my own machine onto a vanilla CentOS VM and initially tried to run neo using a non-Sun version of Java which I installed like so:

yum install java

This is the version of Java that was installed:

$ java -version
java version "1.5.0"
gij (GNU libgcj) version 4.4.6 20120305 (Red Hat 4.4.6-4)

When I tried to start neo4j:

neo4j/bin/neo4j start

I ended up with the following exception in the neo4j log @ neo4j/data/log/neo4j.0.0.log:

Caused by: java.security.NoSuchAlgorithmException: Algorithm [JKS] of type [KeyStore] 
from provider [org.bouncycastle.jce.provider.BouncyCastleProvider: name=BC version=1.4] is not found

I assumed this algorithm was only being packaged with the Sun JDK (although the website suggests otherwise) so I’d need to replace my version of Java to get it working.

I got the 64 bit ‘.bin’ file from the Oracle download page and then installed it using rpm:

rpm -ivh jdk-6u33-linux-x64-rpm.bin

I now have this version of Java installed and neo4j is much happier:

java -version
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)

After googling around for a bit I’m still not entirely sure why it wasn’t able to work with the original version of Java I installed so if anyone could point me in the right direction that’d be useful.

Written by Mark Needham

July 17th, 2012 at 12:02 am

Posted in neo4j


tcpdump: Learning how to read UDP packets

with 2 comments

Phil and I spent some of Friday afternoon configuring statsd:

A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services

We configured it to listen on its default port 8125 and then used netcat to send UDP packets to see if it was working like so:

echo -n "blah:36|c" | nc -w 1 -u -4 localhost 8125

We used tcpdump to capture any UDP packets on port 8125 like so:

tcpdump -i lo udp port 8125 -vv -X

To briefly explain the options we passed to it:

  • -i lo only captures packets on the local loopback i.e. packets sent to localhost
  • udp means that only UDP packets will be captured. Other types of packets we might capture could be tcp or icmp for example.
  • -vv just gives us more verbose output
  • -X prints out the data in the UDP packets in ASCII as well as hex. If we just wanted the latter we could use the -x option

This is what one of the messages received by tcpdump looks like:

13:16:40.317636 IP (tos 0x0, ttl 64, id 58103, offset 0, flags [DF], proto UDP (17), length 37)
    localhost.48358 > localhost.8125: [bad udp cksum 7c8f!] UDP, length 9
	0x0000:  4500 0025 e2f7 4000 4011 59ce 7f00 0001  E..%..@.@.Y.....
	0x0010:  7f00 0001 bce6 1fbd 0011 fe24 626c 6168  ...........$blah
	0x0020:  3a33 367c 63                             :36|c

The last three lines of this output detail the IP header, UDP header and the data in the packet. The following diagram is quite useful for understanding what each part of the IP header is defining:

IP Header

Source: WTCS.org

This diagram defines things in terms of bits whereas the tcpdump output is in hexadecimal. Each block of 4 hexadecimal digits is equivalent to 16 bits.

There are a couple of parts of the IP header that might be interesting to us in this case.

The first 4 bits/1 digit define the IP version which is 4 in this case since we’re using IPv4.

The next 4 bits define the Internet Header length – the number of 32 bit words in the header. In this case the value is 5 so we know the total length of the IP header will be 160 bits (5 * 32 = 160).
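We can sanity check those two nibbles with a couple of lines of Ruby, using the first header byte from the dump:

```ruby
# the IP version and header length both live in the first byte, 0x45 above
first_byte = 0x45
version = first_byte >> 4    # high nibble: IP version
ihl     = first_byte & 0xf   # low nibble: header length in 32-bit words
puts version                 # => 4
puts ihl * 32                # => 160 (header length in bits)
```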

The next few bits aren’t that interesting but the source IP address starts at an offset of 96 bits and covers the next 32 bits:

0x0000: 4500 0025 e2f7 4000 4011 59ce 7f00 0001

We know this is going to represent an IPv4 address so it will be represented as ‘x.x.x.x’ where the maximum value of x is 255.

In this case since we’re just sending packets locally it translates to 127.0.0.1:

  • 7f => 127
  • 00 => 0
  • 00 => 0
  • 01 => 1
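The same translation is a one liner in Ruby if we take the source address bytes from the dump:

```ruby
# split '7f000001' into byte pairs and convert each from hex to decimal
octets = "7f000001".scan(/../).map { |pair| pair.to_i(16) }
puts octets.join(".")   # => 127.0.0.1
```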

The next 32 bits are the destination IP address which has now gone onto the next line but is exactly the same:

0x0010: 7f00 0001 bce6 1fbd 0011 fe24 626c 6168

We’ve now covered 160 bits which means that the IP header is complete and we can move onto the IP payload which starts with the UDP header:

UDP header

Source: ciscoskills.net

We start with the source port which is ‘bce6’ or 48358 in decimal. We can see that value referenced in the 2nd line of the tcpdump output as well.

The next 16 bits/4 digits are the destination port which is ‘1fbd’ or 8125 in decimal – exactly what we’d expect.
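Both port conversions are easy to verify in Ruby:

```ruby
# convert the two 16-bit port fields from hex to decimal
source_port = "bce6".to_i(16)
dest_port   = "1fbd".to_i(16)
puts source_port   # => 48358
puts dest_port     # => 8125
```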

The next 32 bits/2 blocks of 4 digits define the length and checksum but after that we reach the data part of the packet which should contain ‘blah:36|c’.

The word ‘blah’ is defined like so:

626c 6168

0x62 is 98 in decimal and we can use a UTF-8 encoding table to see that 98 maps to the letter ‘b’.

0x6c is 108 or the letter ‘l’, 0x61 is 97 or the letter ‘a’ and 0x68 is 104 or the letter ‘h’.

We wrap onto the last line for the rest of the data we wanted to send to statsd:

0x0020:  3a33 367c 63

It follows the same pattern though where 0x3a is 58 or the ‘:’ character and so on.
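In fact the whole data section can be decoded in one go. Ruby’s Array#pack will turn the hex from the dump straight back into the statsd message:

```ruby
# strip the spaces and unpack the hex bytes back into ASCII
hex = "626c 6168 3a33 367c 63".delete(" ")
message = [hex].pack("H*")
puts message   # => blah:36|c
```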

And now I have a slightly better idea of how to read tcpdump’s output than I did when I started writing this! As usual any tips or hints are welcome.

—–

I found this article useful for initially understanding how to read the output but I think the diagrams above work best! TechRepublic’s ‘anatomy of a data packet’ also provides a good explanation.

Written by Mark Needham

July 15th, 2012 at 1:29 pm

Posted in Networking
