neo4j: When the web console returns nothing…use the data browser!
In my time playing around with neo4j I’ve run into a problem a few times where I executed a query using the web console (usually accessible at http://localhost:7474/webadmin/#/console/) and got absolutely no response.
I noticed a similar thing today when Rickard and I were having a look at why a Lucene index query wasn’t behaving as we expected.
I set up some data in a neo4j database using neography with the following code:
require 'neography'

@neo = Neography::Rest.new
@neo.create_node_index("Id_Index", "exact", "lucene")

node1 = @neo.create_node("Hour" => 1, "name" => "Max")
node2 = @neo.create_node("Hour" => 2, "name" => "Mark")
node3 = @neo.create_node("Hour" => 3, "name" => "Rickard")

@neo.add_node_to_index("Id_Index", "Hour", 1, node1)
@neo.add_node_to_index("Id_Index", "Hour", 2, node2)
@neo.add_node_to_index("Id_Index", "Hour", 3, node3)
I then ran the following query which I was expecting to return all the nodes:
start hour=node:Id_Index("Hour:[00 TO 02] or Hour:[03 TO 05]") RETURN hour
Instead it returned nothing and I couldn’t see anything being logged either.
Rickard pointed out that this was because the exception is only returned to the API caller, and that it would be better to run the query from the Data Browser, which is typically accessible from http://localhost:7474/webadmin/#/data/search/
If we run the query from there then we can see what’s going wrong:
BadInputException
StackTrace:
org.neo4j.server.rest.repr.RepresentationExceptionHandlingIterable.exceptionOnHasNext(RepresentationExceptionHandlingIterable.java:50)
org.neo4j.helpers.collection.ExceptionHandlingIterable$1.hasNext(ExceptionHandlingIterable.java:60)
org.neo4j.helpers.collection.IteratorWrapper.hasNext(IteratorWrapper.java:42)
org.neo4j.server.rest.repr.ListRepresentation.serialize(ListRepresentation.java:58)
org.neo4j.server.rest.repr.Serializer.serialize(Serializer.java:75)
org.neo4j.server.rest.repr.MappingSerializer.putList(MappingSerializer.java:61)
org.neo4j.server.rest.repr.CypherResultRepresentation.serialize(CypherResultRepresentation.java:57)
org.neo4j.server.rest.repr.MappingRepresentation.serialize(MappingRepresentation.java:42)
org.neo4j.server.rest.repr.OutputFormat.assemble(OutputFormat.java:179)
org.neo4j.server.rest.repr.OutputFormat.formatRepresentation(OutputFormat.java:131)
org.neo4j.server.rest.repr.OutputFormat.response(OutputFormat.java:117)
org.neo4j.server.rest.repr.OutputFormat.ok(OutputFormat.java:55)
org.neo4j.server.rest.web.CypherService.cypher(CypherService.java:94)
java.lang.reflect.Method.invoke(Method.java:597)
There seemed to be some strangeness in how Lucene handles the query when a default search field isn’t provided, but we noticed that it behaved as expected if we didn’t use an OR, since Lucene has an implicit OR between clauses anyway.
start hour=node:Id_Index("Hour:[00 TO 02] Hour:[03 TO 05]") RETURN hour
Either way, the lesson for me was if the console isn’t giving a result run the query in the data browser to work out what’s going wrong!
Book Review: The Signal and the Noise – Nate Silver
Nate Silver is famous for having correctly predicted the winner of all 50 states in the 2012 United States elections and Sid recommended his book so I could learn more about statistics for the A/B tests that we were running.
I thought the book was a really good introduction to applied statistics. By using real-life examples which most people can relate to, it makes a potentially dull subject interesting.
Reasonably early on the author points out that there’s a difference between making a prediction and making a forecast:
- Prediction – a definitive and specific statement about when and where something will happen e.g. a major earthquake will hit Kyoto, Japan, on June 28.
- Forecast – a probabilistic statement over a longer time scale e.g. there is a 60% chance of an earthquake in Southern California over the next 30 years.
The book mainly focuses on the latter.
We then move on to quite an interesting section about overfitting, which is where we mistake noise for signal in our data.
I first came across this term when Jen and I were working through one of the Kaggle problems and were using a random forest of deliberately overfitted decision trees to do digit recognition.
It’s not a problem when we combine lots of decision trees together and use a majority wins algorithm to make our prediction, but if we use just one of them its predictions on new data are likely to be poor.
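The ‘majority wins’ step can be sketched in a few lines of Ruby. This is only an illustration, not the code we used on the Kaggle problem: the majority_vote helper and the sample predictions are made up, and ties are not handled specially.

```ruby
# Hypothetical helper: return the label predicted by the most trees.
def majority_vote(predictions)
  counts = Hash.new(0)
  predictions.each { |label| counts[label] += 1 }
  counts.max_by { |_label, count| count }.first
end

# Made-up predictions from five overfitted trees for a single digit image.
tree_predictions = [3, 8, 3, 3, 5]
puts majority_vote(tree_predictions)
```

Any individual tree here may be wrong (two of the five are), but the combined vote still lands on the most common answer.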
Later on in the book he points out that a lot of conspiracy theories arise when we look at data retrospectively: with hindsight we can easily separate signal from noise, whereas at the time it was much more difficult.
He also points out that sometimes there isn’t actually any signal, it’s all noise, and we can fall into the trap of looking for something that isn’t there. I think this ‘noise’ is what we’d refer to as random variation in the context of an A/B test.
Silver also encourages us to make sure that we understand the theory behind any inference we make:
Statistical inferences are much stronger when backed up by theory or at least some deeper thinking about their root causes.
When we were running A/B tests Sid encouraged people to think whether a theory about why conversion had changed made logical sense before assuming it was true which I think covers similar ground.
A big chunk of the book covers Bayes’ theorem, which forces us to make explicit the prior beliefs that we often hold when making forecasts.
For example there is a section which talks about the probability a lady is being cheated on given that she’s found some underwear that she doesn’t recognise in her house.
In order to work out the probability she’s being cheated on we need to know the probability that she was being cheated on before she found the underwear. Silver suggests that since 4% of married partners cheat on their spouses that would be a good number to use.
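To make the calculation concrete, here is a rough sketch of Bayes’ theorem applied to this example. Only the 4% prior comes from the text above; the two likelihoods (how probable the underwear is if she is or isn’t being cheated on) are values I’ve assumed for illustration, and the posterior helper is a made-up name.

```ruby
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
def posterior(prior, likelihood_if_true, likelihood_if_false)
  evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
  (likelihood_if_true * prior) / evidence
end

prior = 0.04                  # from the text: 4% of married partners cheat
underwear_if_cheating = 0.5   # assumption for illustration
underwear_if_faithful = 0.05  # assumption for illustration

result = posterior(prior, underwear_if_cheating, underwear_if_faithful)
puts format('%.1f%%', result * 100)
```

With these assumed numbers the probability rises from 4% to roughly 29%, which shows how much the answer depends on the prior we started with.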
He then goes on to show multiple other problems throughout the book that we can apply Bayes’ theorem to.
Some other interesting things I picked up are that if we’re good at forecasting then being given more information should make our forecast better and that when we don’t have any special information we’re better off following the opinion of the crowd.
Silver also showed a clever trick for inferring data points on a data set which follows a power law i.e. the long tail distribution where there are very few massive events but lots of really small ones.
We have a power law distribution when modelling the number of terrorist attacks vs the number of fatalities, but if we change both scales to be logarithmic we can come up with a probability of how likely more deadly attacks are.
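As a sketch of the trick: on logarithmic scales a power law becomes a straight line, so we can fit a line and extend it to sizes we haven’t observed. The data below is entirely made up and the fit_log_log helper is my own illustration rather than anything from the book.

```ruby
# Least-squares fit of log10(y) = intercept + slope * log10(x).
def fit_log_log(points)
  xs = points.map { |x, _| Math.log10(x) }
  ys = points.map { |_, y| Math.log10(y) }
  n = points.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  slope = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) } /
          xs.sum { |x| (x - mean_x)**2 }
  [slope, mean_y - slope * mean_x]
end

# Made-up data: [fatalities, number of attacks with at least that many].
attacks = [[1, 10_000], [10, 1_000], [100, 100], [1_000, 10]]

slope, intercept = fit_log_log(attacks)

# Extrapolate along the straight line to a size that isn't in the data.
expected_at_10_000 = 10**(intercept + slope * Math.log10(10_000))
puts slope.round(2)
puts expected_at_10_000.round(2)
```

The extrapolated value is only as good as the assumption that the same straight line continues beyond the observed data.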
There is then some discussion of how we can make changes in the way that we treat terrorism to try and impact the shape of the chart e.g. in Israel Silver suggests that they really want to avoid a very deadly attack but at the expense of there being more smaller attacks.
A lot of the book is spent discussing weather/earthquake forecasting which is very interesting to read about but I couldn’t quite see a link back to the software context.
Overall though I found it an interesting read although there are probably a few places that you can skim over the detail and still get the gist of what he’s saying.
Sublime: Overriding default file type/Assigning specific files to a file type
I’ve been using Sublime a bit recently and one thing I wanted to do was put neo4j cypher queries into files with arbitrary extensions and have them recognised as cypher files every time I open them.
I’m using the cypher Sublime plugin to get the syntax highlighting but since I’ve got my cypher in a .haml file it only remembers that it should have cypher highlighting as long as the file is open.
As soon as I close and then re-open the file it goes back to being highlighted as HAML.
I initially thought that the way around this would be to write a plugin which kept track of files that you’d manually assigned a syntax to but then I came across the ApplySyntax plugin which seems even better.
ApplySyntax allows you to assign syntaxes to files based on regular expression matching on the file name or on the first line of the file.
At the moment, the easiest way to detect that a file is a cypher query is that the first line will begin with ‘START’ so I wrote the following in my user settings file:
~/Library/Application Support/Sublime Text 2/Packages/User/ApplySyntax.sublime-settings
{
  "reraise_exceptions": false,
  "new_file_syntax": false,
  "syntaxes": [
    {
      "name": "Cypher",
      "rules": [
        {"first_line": "^START"}
      ]
    }
  ]
}
ApplySyntax is a pretty neat plugin, worth having a look if you have this problem to solve!
Ruby 1.9.3 p0: Investigating weirdness with HTTP POST request in net/http
Thibaut and I spent the best part of the last couple of days trying to diagnose a problem we were having trying to make a POST request using rest-client to one of our services.
We have nginx fronting the application server, so the request passes through there first.

The problem we were having was that the request was timing out on the client side before it had been processed and the request wasn’t reaching the application server.
We initially thought there might be a problem with our nginx configuration because we don’t have many POST requests with largish (40 KB) payloads, so we first tried tweaking the proxy buffer size.
It was a bit of a long shot because changing that setting only reduces the likelihood that nginx writes the request body to disk and then loads it later, which shouldn’t impact performance that much.
The next thing we tried was replicating the request using cURL with a smaller payload which worked fine. cURL had no problem with the bigger payload either.
We therefore thought there must be a difference in the request headers being sent by rest-client and our initial investigation suggested that it might be to do with the ‘Content-Length’ header.
There was a 1 byte difference in the value being sent by cURL and the one being sent by rest-client which was to do with the last character of the payload being a 0A (linefeed) character.
We changed the ‘Content-Length’ header on our cURL request to match that of the rest-client request (i.e. 1 byte too large) and were able to replicate the timeout problem.
At this stage we thought that calling ‘strip’ on the body of our rest-client request would solve the problem as the ‘Content-Length’ header would now be set to the correct value. It did set the ‘Content-Length’ header properly but unfortunately didn’t get rid of the timeout.
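The 1 byte difference is easy to reproduce in isolation. ‘Content-Length’ counts bytes, so a trailing linefeed adds one to it. The payload below is a made-up stand-in for ours:

```ruby
# Hypothetical payload ending in a linefeed (byte 0x0A).
body = "{\"name\":\"Mark\"}\n"

puts body.bytesize        # byte count including the trailing linefeed
puts body.strip.bytesize  # one byte shorter once the linefeed is stripped
```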
Our next step was to check whether or not we could get any request to work from rest-client so we tried using a smaller payload which worked fine.
At this stage Jason heard us discussing what to do next and said that he’d come across it earlier and that upgrading our Ruby version from ‘1.9.3p0’ would solve all our woes.
That Ruby version is a couple of years old and most of our servers are running ‘1.9.3p392’ but somehow this one had slipped through the net.
We spun up a new server with that version of Ruby installed and it did indeed fix the problem.
However, we were curious what the fix was and had a look at the change log of the first patch release after ‘1.9.3p0’. We noticed the following which seemed relevant:
Tue May 31 17:03:24 2011 Hiroshi Nakamura
* lib/net/http.rb, lib/net/protocol.rb: Allow to configure to wait
server returning '100 continue' response before sending HTTP request
body. See NEWS for more detail. See #3622.
Original patch is made by Eric Hodel.
* test/net/http/test_http.rb: test it.
* NEWS: Add new feature.
One thing we noticed from looking at the requests with ngrep was that cURL was setting the ‘Expect: 100-continue’ request header and rest-client wasn’t.
When the payload size was small nginx didn’t seem to send a ‘100 Continue’ response, which was presumably why we weren’t seeing a problem with the small payloads.
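For anyone wanting to experiment with this, below is a minimal sketch of building a POST with the ‘Expect: 100-continue’ header directly in net/http, using the continue_timeout setting that the change log entry appears to describe. The host, port, path and payload are placeholders, and this doesn’t explain our timeout; it only shows the knobs involved.

```ruby
require 'net/http'

# Hypothetical host and port; nothing is sent until `start` is called.
http = Net::HTTP.new('localhost', 8080)
# Seconds to wait for a '100 Continue' response before sending the body anyway.
http.continue_timeout = 2

request = Net::HTTP::Post.new('/some/endpoint') # placeholder path
request['Expect'] = '100-continue'
request['Content-Type'] = 'application/json'
request.body = '{"example": "payload"}'

# Uncomment to send the request to a server you control:
# response = http.start { |connection| connection.request(request) }
# puts response.code
```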
I wasn’t sure how to go about finding out exactly what was going wrong but given how long it took us to get to this point I thought I’d summarise what we tried and see if anyone could explain it to me.
So if you’ve come across this problem (probably 2 years ago!) it’d be cool to know exactly what the problem was.
Mac OS X: A couple of neat tools
When I first started working at uSwitch, Sid installed a couple of ‘productivity applications’ on my Mac which I’ve found pretty useful, but from talking to others I realised they aren’t known or used by everyone.
Alfred
Alfred is a Quicksilver replacement which allows you to quickly open applications, find files, search Google and more. Even though we’re not using half of its features it’s still proved to be useful.
I quite like the calculator feature, which we’ve been using for ad hoc calculations like working out how much free memory there was on a server or the conversion rate on part of an A/B test.
Moom
The other application is Moom which allows you to move/resize windows.
I didn’t see the point when I first saw it but it’s actually really useful when you’re working on a big monitor and want to put say the terminal alongside the browser.
We have shortcuts set up that allow us to type ‘Ctrl + Space’ to make the window fill the left hand side of the screen, ‘Alt + Space’ to make it fill the right hand side of the screen and ‘Alt + Ctrl + Space’ to fill the whole screen.
You can also set up shortcuts to allow you to move a window between displays or to rearrange the windows based on certain events.
Highly recommended!
If anyone knows any other cool tools like this I’d love to hear about them.
neo4j/cypher: Returning a row with zero count when no relationship exists
I’ve been trying to see if I can match some of the football stats that OptaJoe posts on twitter and one that I was looking at yesterday was around the number of red cards different teams have received.
1 – Sunderland have picked up their first PL red card of the season. The only team without one now are Man Utd. Angels.
To refresh this is the sub graph that we’ll need to look at to work it out:
I started off with the following query which traverses out from each match, finds the players who were sent off in the match and then groups the sendings off by the team they were playing for:
START game = node:matches('match_id:*')
MATCH game<-[:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, COUNT(game) AS redCards
ORDER BY redCards
LIMIT 5
When we run this we get the following results:
+------------------------------+
| team.name         | redCards |
+------------------------------+
| "Sunderland"      | 1        |
| "West Ham United" | 1        |
| "Norwich City"    | 1        |
| "Reading"         | 1        |
| "Liverpool"       | 2        |
+------------------------------+
5 rows
The problem we have here is that it hasn’t returned Manchester United because they haven’t yet received any red cards and therefore none of their players match the ‘sent_off_in’ relationship.
I ran into something similar in a post I wrote about a month ago where I was working out which day of the week players scored on.
The first step towards getting Manchester United to return with a count of 0 is to make the ‘sent_off_in’ relationship optional.
However, that on its own isn’t enough because it now returns a count of all the player performances for each team:
START game = node:matches('match_id:*')
MATCH game<-[?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, COUNT(game) AS redCards
ORDER BY redCards ASC
LIMIT 5
+-----------------------------+
| team.name        | redCards |
+-----------------------------+
| "Chelsea"        | 448      |
| "Wigan Athletic" | 459      |
| "Fulham"         | 460      |
| "Liverpool"      | 466      |
| "Everton"        | 467      |
+-----------------------------+
5 rows
Instead what we need to do is collect up all the ‘sent_off_in’ relationships and sum them up.
We can use the COLLECT function to do that and the neat thing about COLLECT is that it doesn’t bother collecting the empty relationships so we end up with exactly what we need:
START game = node:matches('match_id:*')
MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, COLLECT(r) AS redCards
LIMIT 5
+-----------------------------------------------------------------------------------------------------+
| team.name | redCards |
+-----------------------------------------------------------------------------------------------------+
| "Wigan Athletic" | [:sent_off_in[26443] {},:sent_off_in[37785] {}] |
| "Everton" | [:sent_off_in[6795] {minute:61},:sent_off_in[21735] {},:sent_off_in[34594] {}] |
| "Newcastle United" | [:sent_off_in[434] {minute:75},:sent_off_in[32389] {},:sent_off_in[34915] {}] |
| "Southampton" | [:sent_off_in[49393] {minute:70},:sent_off_in[49392] {minute:82}] |
| "West Ham United" | [:sent_off_in[21734] {minute:67}] |
+-----------------------------------------------------------------------------------------------------+
5 rows
We then just need to call the LENGTH function to work out how many red cards there are in each collection and then we’re done:
START game = node:matches('match_id:*')
MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
likeThis-[:for]->team
RETURN team.name, LENGTH(COLLECT(r)) AS redCards
ORDER BY redCards
LIMIT 5
+--------------------------------+
| team.name           | redCards |
+--------------------------------+
| "Manchester United" | 0        |
| "West Ham United"   | 1        |
| "Sunderland"        | 1        |
| "Norwich City"      | 1        |
| "Reading"           | 1        |
+--------------------------------+
5 rows
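The difference between counting rows and counting collected relationships can be mimicked outside of cypher. The Ruby sketch below uses made-up rows, one per player performance, where the second element is nil whenever the optional ‘sent_off_in’ relationship didn’t match; it’s an analogy for what the query does, not how neo4j implements it.

```ruby
# Made-up rows of [team, sent_off_in relationship or nil].
rows = [
  ['Manchester United', nil],
  ['Manchester United', nil],
  ['Sunderland', nil],
  ['Sunderland', :sent_off_in],
  ['Liverpool', :sent_off_in],
  ['Liverpool', :sent_off_in]
]

red_cards = rows.group_by { |team, _| team }.map do |team, team_rows|
  # compact drops the nils, in the same way COLLECT skips empty relationships.
  [team, team_rows.map { |_, relationship| relationship }.compact.length]
end

red_cards.sort_by { |_, count| count }.each do |team, count|
  puts "#{team}: #{count}"
end
```

Counting team_rows directly would give 2 for every team here, which is the same mistake as the COUNT(game) version of the query.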
A/B Testing: Reporting
A few months ago I wrote about my initial experiences with A/B testing and since then we’ve been working on another one and learnt some things around reporting on these types of tests that I thought were interesting.
Reporting as a first class concern
One thing we changed from our previous test after a suggestion by Mike was to start treating the reporting of data related to the test as a first class citizen.
To do this we created an end point which the main application could send POST requests to in order to record page views and various other information about users.
On our previous test we’d derived the various conversion rates from our main transactional data store but it was really slow and painful because the way we structure data in there is optimised for a completely different use case.
Having just the data we want to report on in a separate data store has massively reduced the time spent generating reports.
However, one thing that we learnt about this approach is that you need to spend some time thinking about what data is going to be needed up front.
If you don’t then it will have to be added later on and the reporting on that metric won’t cover the whole test duration.
Drilling down to get insight
In the first test we ran we only really looked at conversion at quite a high level which is good for getting an overview but doesn’t give much insight into what’s going on.
For this test we started off with higher level metrics but a few days in became curious about what was going on between two of the pages and so created a report that segmented users based on an action they’d taken on the first page.
This allowed us to rule out a theory about a change in conversion which we had initially thought was down to a change we’d made but actually proved to be because of a change in an external factor.
The frustrating part of drilling down into the data is that you don’t really know what it is you’re going to want to zoom in on, so you have to write code for the specific scenario each time!
Detecting bugs
We generate browser specific metrics on each test that we run and while the conversion rate is generally similar between them there have been some times when there’s a big drop in one browser.
More often than not when we’ve drilled into this we’ve found that there was actually a Javascript bug that we hadn’t detected and we can then go back and sort that out.
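A rough sketch of that kind of check is below. The visitor and conversion counts are made up, and the ‘less than half the overall rate’ threshold is an arbitrary choice for the example rather than the rule we actually applied.

```ruby
# Made-up counts of visitors and conversions per browser.
stats = {
  'Chrome'  => { visitors: 1_000, conversions: 52 },
  'Firefox' => { visitors: 800,   conversions: 40 },
  'IE8'     => { visitors: 500,   conversions: 5 }
}

rates = stats.transform_values { |s| s[:conversions].to_f / s[:visitors] }
overall = stats.values.sum { |s| s[:conversions] }.to_f /
          stats.values.sum { |s| s[:visitors] }

# Flag browsers converting at less than half the overall rate
# (the threshold is arbitrary).
suspects = rates.select { |_, rate| rate < overall / 2 }.keys
puts suspects.inspect
```

A flagged browser isn’t proof of a bug, just a prompt to go and look at what’s happening for those users.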
An alternative approach would be to have an automated Javascript/Web Driver test suite which ran against each browser. We’ve effectively traded off the maintenance cost of that for what is usually a small period of inconvenience for some users.
Treat servers as cattle: Spin them up, tear them down
A while ago I wrote a post about treating servers as cattle, not as pets, in which I described an approach to managing virtual machines at uSwitch whereby we frequently spin up new ones and delete the existing ones.
I’ve worked on teams previously where we’ve also talked about this mentality but ended up not doing it because it was difficult, usually for one of two reasons:
- Slow spin up – this might be due to the cloud provider’s infrastructure, doing too much on spin up or I’m sure a variety of other reasons.
- Manual steps involved in spin up – the process isn’t 100% automated so we have to do some manual tweaks. Once the machine is finally working we don’t want to have to go through that again.
Martin Fowler wrote a post a couple of years ago where he said the following:
One of my favorite soundbites is: if it hurts, do it more often. It has the happy property of seeming nonsensical on the surface, but yielding some valuable meaning when you dig deeper
I think it applies in this context too and I have noticed that the more frequently we tear down and spin up new nodes the easier it becomes to do so.
Part of this is because there’s been less time for changes to have happened in package repositories but we are also more inclined to optimise things that we have to do frequently so the whole process is faster as well.
For example in one of our sets of machines we need to give one machine a specific tag so that when the application is deployed it sets up a bunch of cron jobs to run each evening.
Initially this was done manually and we were quite reluctant to ever tear down that machine but we’ve now got it all automated and it’s not a big deal anymore – it can be cattle just like the rest of them!
One neat rule of thumb Phil taught me is that if we make major changes to our infrastructure we should spin up some new machines to check that it still actually works.
If we don’t do this then when we actually need to spin up a new node because of a traffic spike or machine corruption problem it’s not going to work and we’re going to have to fix things in a much more stressful context.
For example we recently moved some repositories around in github and although it’s a fairly simple change spinning up new nodes helped us see all the places where we’d failed to make the appropriate change.
While I appreciate taking this approach is more time consuming in the short term I’d argue that if we automate as much of the pain as possible in the long run it will probably be beneficial.
Puppet: Package Versions – To pin or not to pin
Over the last year or so I’ve spent quite a bit of time working with puppet and one of the things that we had to decide when installing packages was whether or not to specify a particular version.
On the first project I worked on we didn’t bother and just let the package manager choose the most recent version.
Therefore if we were installing nginx the puppet code would read like this:
package { 'nginx':
  ensure => 'present',
}
We can see which version that would install by checking the version table for the package:
$ apt-cache policy nginx
nginx:
  Installed: (none)
  Candidate: 1:1.2.6-1~43~precise1
  Version table:
     1:1.2.6-1~43~precise1 0
        500 http://ppa.launchpad.net/brightbox/ruby-ng/ubuntu/ precise/main amd64 Packages
     1.4.0-1~precise 0
        500 http://nginx.org/packages/ubuntu/ precise/nginx amd64 Packages
     1.1.19-1ubuntu0.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ precise-updates/universe amd64 Packages
     1.1.19-1 0
        500 http://us.archive.ubuntu.com/ubuntu/ precise/universe amd64 Packages
In this case if we don’t specify a version the Brightbox ‘1:1.2.6-1~43~precise1’ version will be installed.
Running dpkg with the ‘compare-versions’ flag shows us that this version is considered higher than the nginx.org one:
$ dpkg --compare-versions '1:1.2.6-1~43~precise1' gt '1.4.0-1~precise' ; echo $?
0
From what I understand you can pin versions higher up the list by associating a higher number with them but given that all these versions are set to ‘500’ I’m not sure how it decides on the order!
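One possible explanation, which is my reading of the Debian versioning rules rather than something we confirmed at the time: a version string may start with an epoch (the number before the colon) which is compared before anything else, and when several candidates share the same priority apt prefers the highest version. The sketch below only extracts the epoch; the epoch_of helper is made up and this is not a full implementation of Debian version comparison.

```ruby
# Returns the epoch of a Debian version string, or 0 when none is given.
def epoch_of(version)
  version.include?(':') ? version.split(':', 2).first.to_i : 0
end

brightbox = '1:1.2.6-1~43~precise1'
nginx_org = '1.4.0-1~precise'

puts epoch_of(brightbox)
puts epoch_of(nginx_org)
puts epoch_of(brightbox) > epoch_of(nginx_org)
```

If that reading is right, the Brightbox package wins because of its ‘1:’ prefix even though 1.2.6 looks older than 1.4.0.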
The problem with not specifying a version is that when a new version becomes available the next time puppet runs it will automatically upgrade the version for us.
Most of the time this isn’t a problem but there were a couple of occasions when a version got bumped and something elsewhere stopped working and it took us quite a while to work out what had changed.
The alternative approach is to pin the package installation to a specific version. So if we want the recent 1.4.0 version installed we’d have the following code:
package { 'nginx':
  ensure => '1.4.0-1~precise',
}
The nice thing about this approach is that we always know which version is going to be installed.
The problem we now introduce is that when an updated version is added to the repository the old one is typically removed which means a puppet run on a new machine will fail because it can’t find the version.
After working with puppet for a few months it becomes quite easy to see when this is the reason for the failure but it creates the perception that ‘puppet is always failing’ for newer people which isn’t so good.
I think on balance I prefer to have the versions explicitly defined because I find it easier to work out what’s going on that way but I’m sure there’s an equally strong argument for just picking the latest version.
Unix: Checking for open sockets on nginx
Tim and I were investigating a weird problem we were having with nginx where it was getting in a state where it had exceeded the number of open files allowed on the system and started rejecting requests.
We can find out the maximum number of open files that we’re allowed on a system with the following command:
$ ulimit -n
1024
Our hypothesis was that some socket connections were never being closed and therefore the number of open files was climbing slowly upwards until it exceeded the limit.
We wanted to check how many sockets nginx had open so to start with we needed to know the process IDs it was running under:
$ ps aux | grep nginx | grep -v grep
root      1089  0.0  0.7 105152  2736 ?        Ss   17:34   0:00 nginx: master process /usr/sbin/nginx
www-data 17474  0.0  0.6 105300  2296 ?        S    21:49   0:04 nginx: worker process
www-data 17475  0.0  0.7 105300  2856 ?        S    21:49   0:04 nginx: worker process
www-data 17476  0.0  0.7 105300  2792 ?        S    21:49   0:03 nginx: worker process
www-data 17477  0.0  0.7 105300  2668 ?        S    21:49   0:04 nginx: worker process
So the process IDs we’re interested in are 1089, 17474, 17475, 17476 and 17477.
We can check which file descriptors they have open with the following command:
$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd
/proc/17476/fd:
total 0
dr-x------ 2 www-data www-data 0 Apr 23 23:40 .
...
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]

/proc/17477/fd:
total 0
...
lrwx------ 1 www-data www-data 64 Apr 23 23:40 56 -> socket:[52213]
lrwx------ 1 www-data www-data 64 Apr 23 23:40 57 -> anon_inode:[eventpoll]
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]
We can narrow that down to just show us how many sockets are open:
$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd | grep socket | wc -l
189
We could also use lsof although for some reason that returns a slightly different number:
$ sudo lsof -p 1089,17474,17475,17476,17477 | grep socket | wc -l
184
If we want to use brace expansion to do that it becomes a bit more tricky:
$ sudo lsof -p `echo {1089,174{74,75,76,77}} | sed 's/ /,/g'` | grep socket | wc -l
184
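If we wanted to track this number over time, rather than running the commands by hand, the same count could be scripted. The sketch below is one way to do it in Ruby, assuming a Linux /proc file system; the socket_count helper is made up, the process IDs are the ones from the ps output above, and it would need to be run as root to read another user’s descriptors.

```ruby
# Counts the entries in a /proc/<pid>/fd style directory that are
# symlinks pointing at sockets.
def socket_count(fd_dir)
  Dir.children(fd_dir).count do |fd|
    begin
      File.readlink(File.join(fd_dir, fd)).start_with?('socket:')
    rescue SystemCallError
      false # the descriptor was closed while we were looking
    end
  end
rescue SystemCallError
  0 # no such process, or not allowed to read its descriptors
end

# Process IDs taken from the ps output; replace them with your own.
pids = %w[1089 17474 17475 17476 17477]
total = pids.sum { |pid| socket_count("/proc/#{pid}/fd") }
puts total
```

This follows the ls approach rather than the lsof one, so I’d expect it to be closer to the first of the two counts.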
Annoyingly we couldn’t actually replicate the error but think that it’s been solved in nginx 1.2.0 (we were using 1.1.19) by this change:
Bugfix: a segmentation fault might occur in a worker process if the
"try_files" directive was used; the bug had appeared in 1.1.19.