Mark Needham

Thoughts on Software Development

Ruby/Python: Constructing a taxonomy from an array using zip

with one comment

As I mentioned in my previous blog post I’ve been hacking on a product taxonomy and I wanted to create a ‘CHILD’ relationship between a collection of categories.

For example, I had the following array and I wanted to transform it into an array of ‘SubCategory, Category’ pairs:

taxonomy = ["Cat", "SubCat", "SubSubCat"]
# I wanted this to become [("Cat", "SubCat"), ("SubCat", "SubSubCat")]

In order to do this we need to zip the first 2 items with the last 2, which I found reasonably easy to do using Python:

>>> zip(taxonomy[:-1], taxonomy[1:])
[('Cat', 'SubCat'), ('SubCat', 'SubSubCat')]

Here we’re using Python’s list slicing notation to get all but the last item of ‘taxonomy’ and then all but the first item of ‘taxonomy’, and zipping them together.

I wanted to achieve that effect in Ruby though because my import job was written in that!

We can’t achieve the open ended slicing in Ruby 1.9.3 as far as I can tell, so the following gives us an error:

> taxonomy[..-1]
SyntaxError: (irb):10: syntax error, unexpected tDOT2, expecting ']'
taxonomy[..-1]
           ^
	from /Users/markhneedham/.rbenv/versions/1.9.3-p327/bin/irb:12:in `<main>'

Ruby’s ‘..’ ranges include the end index, so negative indexing works a bit differently: to remove the last item of the array we use ‘-2’ rather than ‘-1’:

> taxonomy[0..-2].zip(taxonomy[1..-1])
=> [["Cat", "SubCat"], ["SubCat", "SubSubCat"]]
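If you're running the Python version on Python 3, zip returns a lazy iterator rather than a list, so it needs wrapping in list() to see the pairs. A minimal sketch, where the function name adjacent_pairs is just something I made up for illustration:

```python
def adjacent_pairs(items):
    """Pair each item with the one that follows it."""
    # all but the last item, zipped with all but the first item
    return list(zip(items[:-1], items[1:]))


taxonomy = ["Cat", "SubCat", "SubSubCat"]
pairs = adjacent_pairs(taxonomy)
print(pairs)
```

A single item list has no neighbours to pair up, so it simply gives back an empty list.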

Written by Mark Needham

May 19th, 2013 at 10:44 pm

Posted in Python,Ruby


neo4j/cypher: Keep longest path when finding taxonomy

without comments

I’ve been playing around with modelling a product taxonomy and one thing that I wanted to do was find out the full path where a product sits under the tree.

I created a simple data set to show the problem:

CREATE (cat { name: "Cat" })
CREATE (subcat1 { name: "SubCat1" })
CREATE (subcat2 { name: "SubCat2" })
CREATE (subsubcat1 { name: "SubSubCat1" })
CREATE (product1 { name: "Product1" })
CREATE (cat)-[:CHILD]->(subcat1)-[:CHILD]->(subsubcat1)
CREATE (product1)-[:HAS_CATEGORY]->(subsubcat1)

I wanted to write a query which would return ‘product1’ and the tree ‘Cat -> SubCat1 -> SubSubCat1’, and initially wrote the following query:

START product=node:node_auto_index(name="Product1") 
MATCH product-[:HAS_CATEGORY]-category, taxonomy=category<-[:CHILD*1..]-parent 
RETURN product, EXTRACT(n IN NODES(taxonomy): n.name)

which returns:

==> +--------------------------------------------------------------------+
==> | product                    | EXTRACT(n IN NODES(taxonomy): n.name) |
==> +--------------------------------------------------------------------+
==> | Node[888]{name:"Product1"} | ["SubSubCat1","SubCat1"]              |
==> | Node[888]{name:"Product1"} | ["SubSubCat1","SubCat1","Cat"]        |
==> +--------------------------------------------------------------------+
==> 2 rows

I didn’t want to return the first row since that isn’t the full tree, and Andres suggested that keeping only the paths whose final node has no incoming ‘CHILD’ relationship, i.e. the root category, would help me do that:

START product=node:node_auto_index(name="Product1") 
MATCH product-[:HAS_CATEGORY]-category, 
      taxonomy=category<-[:CHILD*1..]-parent 
WHERE NOT parent<-[:CHILD]-() 
RETURN product, EXTRACT(n IN NODES(taxonomy): n.name)

which returns:

==> +--------------------------------------------------------------------+
==> | product                    | EXTRACT(n IN NODES(taxonomy): n.name) |
==> +--------------------------------------------------------------------+
==> | Node[888]{name:"Product1"} | ["SubSubCat1","SubCat1","Cat"]        |
==> +--------------------------------------------------------------------+
==> 1 row

If we want to reverse the taxonomy so it’s in the right order we can follow Wes Freeman’s advice from the following Stack Overflow thread:

START product=node:node_auto_index(name="Product1") 
MATCH product-[:HAS_CATEGORY]-category, taxonomy=category<-[:CHILD*1..]-parent 
WHERE NOT parent<-[:CHILD]-() 
RETURN product, 
       REDUCE(acc=[], cat IN EXTRACT(n IN NODES(taxonomy): n.name): cat + acc) AS taxonomy

which returns:

==> +-------------------------------------------------------------+
==> | product                    | taxonomy                       |
==> +-------------------------------------------------------------+
==> | Node[888]{name:"Product1"} | ["Cat","SubCat1","SubSubCat1"] |
==> +-------------------------------------------------------------+
==> 1 row
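To make the idea behind the final query a bit more concrete, here's a rough Python sketch of the same logic: walk up from the product's category until we reach a category with no parent, then reverse the path. The two dictionaries are a hypothetical in-memory copy of the data set above, not anything neo4j gives us:

```python
# Hypothetical in-memory copy of the data set: child category -> parent category
parent_of = {"SubSubCat1": "SubCat1", "SubCat1": "Cat"}
category_of = {"Product1": "SubSubCat1"}


def full_taxonomy(product):
    """Walk up from the product's category until we hit a category with no parent."""
    path = [category_of[product]]
    while path[-1] in parent_of:  # the root has no incoming CHILD, so we stop there
        path.append(parent_of[path[-1]])
    return list(reversed(path))  # root first, like the REDUCE in the query


taxonomy = full_taxonomy("Product1")
print(taxonomy)
```

Stopping only at the root plays the same role as the WHERE NOT clause: the shorter path that ends at SubCat1 never gets returned.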

Written by Mark Needham

May 19th, 2013 at 10:15 pm

Posted in neo4j


Unix: Working with parts of large files

without comments

Chris and I were looking at the neo4j log files of a client earlier in the week and wanted to do some processing of the file so we could ask the client to send us some further information.

The log file was over 10,000 lines long but the bit of the file we were interested in was only a few hundred lines.

I usually use Vim and the ‘:set number’ when I want to refer to line numbers in a file but Chris showed me that we can achieve the same thing with e.g. ‘less -N data/log/neo4j.0.0.log’.

We can then operate on a range of lines, say lines 10-15, by passing the ‘-n’ flag to sed along with the ‘p’ (print) command:

-n By default, each line of input is echoed to the standard output after all of the commands have been applied to it. The -n option suppresses this behavior.

$ sed -n '10,15p' data/log/neo4j.0.0.log
INFO: Enabling HTTPS on port [7473]
May 19, 2013 11:11:52 AM org.neo4j.server.logging.Logger log
INFO: No SSL certificate found, generating a self-signed certificate..
May 19, 2013 11:11:53 AM org.neo4j.server.logging.Logger log
INFO: Mounted discovery module at [/]
May 19, 2013 11:11:53 AM org.neo4j.server.logging.Logger log
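If sed isn't to hand, the same line range can be pulled out with a few lines of Python. This is only a sketch: the function name line_range is made up, and the sample list stands in for an open log file:

```python
from itertools import islice


def line_range(lines, first, last):
    """Return lines first..last (1-based, inclusive), like sed -n 'first,lastp'."""
    return list(islice(lines, first - 1, last))


# Stand-in for an open log file; any iterable of lines works
sample = ["line %d\n" % n for n in range(1, 21)]
selected = line_range(sample, 10, 15)
print("".join(selected), end="")
```

With a real file we'd pass the file object itself, e.g. inside a `with open(...) as f:` block, and islice would stop reading once it had passed the last line we asked for.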

We then used a combination of grep, awk and sort to work out which log files we needed.

Written by Mark Needham

May 19th, 2013 at 9:44 pm

Posted in Shell Scripting


A/B Testing: User Experience vs Conversion

without comments

I’ve written a couple of posts over the last few months about my experiences with A/B testing and one conversation we often used to have was around user experience vs conversion rate.

Once you start running an A/B test it encourages you to focus more on the conversion rate of users in different parts of the flow and your inclination is to make changes that increase that conversion rate.

Another one of our drivers is to provide the best user experience that we can to our customers. Sometimes the best thing for them is not to switch, so it seems that these two drivers must be in conflict.

I found it particularly interesting seeing how the conversion rate could be impacted by the way that information was displayed to a user.

This was an idea that I first came across when reading about how the Obama campaign used A/B testing where they noticed big changes in conversion rates by making small tweaks to sentences and imagery.

Our goal from a user experience perspective was to put all the information in front of the user so that they could make an informed choice about what to do.

Initially we made the negative features of the plans very prominent and had them in a large font which led to a drop in conversion.

We assumed that people were now giving more importance to the negative features than was warranted e.g. some plans had a cancellation fee but it typically only accounted for 5% of the saving they’d make by switching to the plan.

When the product is a bit more complicated we could argue that we improve the user experience by helping the user to make an appropriate choice.

On a website the way that we do this is by how we display information by changing the font size, font weight, positioning and a variety of other things.

It’s an interesting balance to find between the two drivers, but if we veer towards conversion at all costs then although we’ll get a higher conversion rate, in the long term we’ll have some frustrated customers who won’t use our website again.

If we look at it that way then the two drivers don’t seem so opposed to each other.

Written by Mark Needham

May 18th, 2013 at 8:18 pm

Posted in Software Development


neo4j: When the web console returns nothing…use the data browser!

without comments

In my time playing around with neo4j I’ve run into a problem a few times where I executed a query using the web console (usually accessible @ http://localhost:7474/webadmin/#/console/) and have got absolutely no response.

I noticed a similar thing today when Rickard and I were having a look at why a Lucene index query wasn’t behaving as we expected.

I set up some data in a neo4j database using neography with the following code:

require 'neography'
 
@neo = Neography::Rest.new
 
@neo.create_node_index("Id_Index", "exact", "lucene")
 
node1 = @neo.create_node("Hour" => 1, "name" => "Max")
node2 = @neo.create_node("Hour" => 2, "name" => "Mark")
node3 = @neo.create_node("Hour" => 3, "name" => "Rickard")
 
@neo.add_node_to_index("Id_Index", "Hour", 1, node1)
@neo.add_node_to_index("Id_Index", "Hour", 2, node2) 
@neo.add_node_to_index("Id_Index", "Hour", 3, node3)

I then ran the following query which I was expecting to return all the nodes:

start hour=node:Id_Index("Hour:[00 TO 02] or Hour:[03 TO 05]") RETURN hour

Instead it returned nothing and I couldn’t see anything being logged either.

Rickard pointed out that this was because the exception is only returned to the API caller, and that it would be better to run the query from the Data Browser, which is typically accessible from http://localhost:7474/webadmin/#/data/search/

If we run the query from there then we can see what’s going wrong:

BadInputException
 
StackTrace:
org.neo4j.server.rest.repr.RepresentationExceptionHandlingIterable.exceptionOnHasNext(RepresentationExceptionHandlingIterable.java:50)
org.neo4j.helpers.collection.ExceptionHandlingIterable$1.hasNext(ExceptionHandlingIterable.java:60)
org.neo4j.helpers.collection.IteratorWrapper.hasNext(IteratorWrapper.java:42)
org.neo4j.server.rest.repr.ListRepresentation.serialize(ListRepresentation.java:58)
org.neo4j.server.rest.repr.Serializer.serialize(Serializer.java:75)
org.neo4j.server.rest.repr.MappingSerializer.putList(MappingSerializer.java:61)
org.neo4j.server.rest.repr.CypherResultRepresentation.serialize(CypherResultRepresentation.java:57)
org.neo4j.server.rest.repr.MappingRepresentation.serialize(MappingRepresentation.java:42)
org.neo4j.server.rest.repr.OutputFormat.assemble(OutputFormat.java:179)
org.neo4j.server.rest.repr.OutputFormat.formatRepresentation(OutputFormat.java:131)
org.neo4j.server.rest.repr.OutputFormat.response(OutputFormat.java:117)
org.neo4j.server.rest.repr.OutputFormat.ok(OutputFormat.java:55)
org.neo4j.server.rest.web.CypherService.cypher(CypherService.java:94)
java.lang.reflect.Method.invoke(Method.java:597)

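There seemed to be some strangeness going on with how Lucene handles the query when a default search field isn’t provided. One likely culprit is that Lucene’s query parser only recognises boolean operators written in upper case, so the lower case ‘or’ would have been treated as a search term with no field. Either way, the query behaved as expected if we left the operator out altogether, since Lucene has an implicit OR between clauses anyway: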

start hour=node:Id_Index("Hour:[00 TO 02] Hour:[03 TO 05]") RETURN hour

Either way, the lesson for me was if the console isn’t giving a result run the query in the data browser to work out what’s going wrong!

Written by Mark Needham

May 17th, 2013 at 12:00 am

Posted in neo4j


Book Review: The Signal and the Noise – Nate Silver

without comments

Nate Silver is famous for having correctly predicted the winner of all 50 states in the 2012 United States elections and Sid recommended his book so I could learn more about statistics for the A/B tests that we were running.

I thought the book was a really good introduction to applied statistics, and by using real life examples which most people would be able to relate to, it makes a potentially dull subject interesting.

Reasonably early on the author points out that there’s a difference between making a prediction and making a forecast:

  • Prediction – a definitive and specific statement about when and where something will happen e.g. a major earthquake will hit Kyoto, Japan, on June 28.
  • Forecast – a probabilistic statement over a longer time scale e.g. there is a 60% chance of an earthquake in Southern California over the next 30 years.

The book mainly focuses on the latter.

We then move onto quite an interesting section about over fitting which is where we mistake noise for signal in our data.

I first came across this term when Jen and I were working through one of the Kaggle problems and were using a random forest of deliberately over fitted Decision Trees to do digit recognition.

It’s not a problem when we combine lots of decision trees together and use a majority wins algorithm to make our prediction, but if we use just one of them its predictions on new data are likely to be poor.

Later on in the book he points out that a lot of conspiracy theories come when we look at data retrospectively and can easily separate signal from noise in the data when at the time it was much more difficult.

He also points out that sometimes there isn’t actually any signal, it’s all noise, and we can fall into the trap of looking for something that isn’t there. I think this ‘noise’ is what we’d refer to as random variation in the context of an A/B test.

Silver also encourages us to make sure that we understand the theory behind any inference we make:

Statistical inferences are much stronger when backed up by theory or at least some deeper thinking about their root causes.

When we were running A/B tests Sid encouraged people to think whether a theory about why conversion had changed made logical sense before assuming it was true which I think covers similar ground.

A big chunk of the book covers Bayes’ theorem and how often when we’re making forecasts we have prior beliefs which it forces us to make explicit.

For example there is a section which talks about the probability a lady is being cheated on given that she’s found some underwear that she doesn’t recognise in her house.

In order to work out the probability she’s being cheated on we need to know the probability that she was being cheated on before she found the underwear. Silver suggests that since 4% of married partners cheat on their spouses that would be a good number to use.
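To make the arithmetic concrete, here's a minimal sketch of the Bayes' theorem calculation. Only the 4% prior comes from the paragraph above; the two conditional probabilities (how likely the underwear is to turn up if she is or isn't being cheated on) are illustrative assumptions rather than figures quoted in this post:

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' theorem: probability of the hypothesis given the evidence."""
    true_branch = prior * p_evidence_if_true
    false_branch = (1 - prior) * p_evidence_if_false
    return true_branch / (true_branch + false_branch)


prior = 0.04                     # from the post: 4% of married partners cheat
p_underwear_if_cheating = 0.5    # assumption, for illustration only
p_underwear_if_faithful = 0.05   # assumption, for illustration only

result = posterior(prior, p_underwear_if_cheating, p_underwear_if_faithful)
print(round(result, 3))
```

Under those assumptions the probability comes out at roughly 29%, which is a lot lower than intuition might suggest, precisely because the prior was so low to begin with.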

He then goes on to show multiple other problems throughout the book that we can apply Bayes’ theorem to.

Some other interesting things I picked up are that if we’re good at forecasting then being given more information should make our forecast better and that when we don’t have any special information we’re better off following the opinion of the crowd.


Silver also showed a clever trick for inferring data points on a data set which follows a power law i.e. the long tail distribution where there are very few massive events but lots of really small ones.

We have a power law distribution when modelling the number of terrorist attacks vs number of fatalities but if we change both scales to be logarithmic we can come up with a probability of how likely more deadly attacks are.

There is then some discussion of how we can make changes in the way that we treat terrorism to try and impact the shape of the chart e.g. in Israel Silver suggests that they really want to avoid a very deadly attack but at the expense of there being more smaller attacks.

A lot of the book is spent discussing weather/earthquake forecasting which is very interesting to read about but I couldn’t quite see a link back to the software context.

Overall though I found it an interesting read although there are probably a few places that you can skim over the detail and still get the gist of what he’s saying.

Written by Mark Needham

May 14th, 2013 at 12:16 am

Posted in Books


Sublime: Overriding default file type/Assigning specific files to a file type

without comments

I’ve been using Sublime a bit recently and one thing I wanted to do was put neo4j cypher queries into files with arbitrary extensions and have them recognised as cypher files every time I open them.

I’m using the cypher Sublime plugin to get the syntax highlighting but since I’ve got my cypher in a .haml file it only remembers that it should have cypher highlighting as long as the file is open.

As soon as I close and then re-open the file it goes back to being highlighted as HAML.

I initially thought that the way around this would be to write a plugin which kept track of files that you’d manually assigned a syntax to but then I came across the ApplySyntax plugin which seems even better.

ApplySyntax allows you to assign syntaxes to files based on regular expression matching on the file name or on the first line of the file.

At the moment, the easiest way to detect that a file is a cypher query is that the first line will begin with ‘START’ so I wrote the following in my user settings file:

~/Library/Application Support/Sublime Text 2/Packages/User/ApplySyntax.sublime-settings

{
	"reraise_exceptions": false,
	"new_file_syntax": false,
	"syntaxes": [
		{			
			"name": "Cypher",
			"rules": [
				{"first_line": "^START"}
			]
		}	
	]
}
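To show what the ‘first_line’ rule is checking, here's a rough Python sketch of a first line match. It's only my approximation of the idea, not ApplySyntax's actual implementation, and the function name looks_like_cypher is made up:

```python
import re

# Same pattern as the "first_line" rule in the settings above
FIRST_LINE_RULE = re.compile(r"^START")


def looks_like_cypher(text):
    """Return True if the first line of the text matches the rule."""
    first_line = text.split("\n", 1)[0]
    return FIRST_LINE_RULE.search(first_line) is not None


print(looks_like_cypher("START n=node(*)\nRETURN n"))
print(looks_like_cypher("%p some haml"))
```

One thing the sketch makes obvious is that the pattern is case sensitive, so a query beginning with a lower case ‘start’ wouldn't match this particular rule.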

ApplySyntax is a pretty neat plugin, worth having a look if you have this problem to solve!

Written by Mark Needham

May 5th, 2013 at 12:03 am

Posted in Software Development


Ruby 1.9.3 p0: Investigating weirdness with HTTP POST request in net/http

with one comment

Thibaut and I spent the best part of the last couple of days trying to diagnose a problem we were having trying to make a POST request using rest-client to one of our services.

We have nginx fronting the application server so the request passes through there first:

[Diagram: POST request passing from the client through nginx to the application server]

The problem we were having was that the request was timing out on the client side before it had been processed and the request wasn’t reaching the application server.

We initially thought there might be a problem with our nginx configuration because we don’t have many POST requests with largish (40kb) payloads so we initially tried tweaking the proxy buffer size.

It was a bit of a long shot because changing that setting only reduces the likelihood that nginx writes the request body to disc and then loads it later which shouldn’t impact performance that much.

The next thing we tried was replicating the request using cURL with a smaller payload which worked fine. cURL had no problem with the bigger payload either.

We therefore thought there must be a difference in the request headers being sent by rest-client and our initial investigation suggested that it might be to do with the ‘Content-Length’ header.

There was a 1 byte difference in the value being sent by cURL and the one being sent by rest-client which was to do with the last character of the payload being a 0A (linefeed) character.

We changed the ‘Content-Length’ header on our cURL request to match that of the rest-client request (i.e. 1 byte too large) and were able to replicate the timeout problem.

At this stage we thought that calling ‘strip’ on the body of our rest-client request would solve the problem as the ‘Content-Length’ header would now be set to the correct value. It did set the ‘Content-Length’ header properly but unfortunately didn’t get rid of the timeout.
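The 1 byte difference is easy to reproduce, since ‘Content-Length’ is a count of the bytes in the body and a trailing linefeed is one of those bytes. A small sketch, using a made up payload rather than our real one:

```python
# Hypothetical payload ending in a linefeed (0x0A), as ours did
payload = '{"id": 42}\n'

with_newline = len(payload.encode("utf-8"))
stripped = len(payload.strip().encode("utf-8"))

print(with_newline, stripped)
```

The first number is what we'd expect to see in the header when the body is sent as is, and the second is what we'd expect after calling ‘strip’ on it.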

Our next step was to check whether or not we could get any request to work from rest-client so we tried using a smaller payload which worked fine.

At this stage Jason heard us discussing what to do next and said that he’d come across it earlier and that upgrading our Ruby version from ‘1.9.3p0’ would solve all our woes.

That Ruby version is a couple of years old and most of our servers are running ‘1.9.3p392’ but somehow this one had slipped through the net.

We spun up a new server with the newer version of Ruby installed and it did indeed fix the problem.

However, we were curious what the fix was and had a look at the change log of the first patch release after ‘1.9.3p0’. We noticed the following which seemed relevant:

Tue May 31 17:03:24 2011 Hiroshi Nakamura

* lib/net/http.rb, lib/net/protocol.rb: Allow to configure to wait
server returning '100 continue' response before sending HTTP request
body. See NEWS for more detail. See #3622.
Original patch is made by Eric Hodel.

* test/net/http/test_http.rb: test it.

* NEWS: Add new feature.

One thing we noticed from looking at the requests with ngrep was that cURL was setting the ‘Expect: 100-continue’ request header and rest-client wasn’t.

When the payload size was small nginx didn’t seem to send a ‘100 Continue’ response which was presumably why we weren’t seeing a problem with the small payloads.

I wasn’t sure how to go about finding out exactly what was going wrong but given how long it took us to get to this point I thought I’d summarise what we tried and see if anyone could explain it to me.

So if you’ve come across this problem (probably 2 years ago!) it’d be cool to know exactly what the problem was.

Written by Mark Needham

April 30th, 2013 at 9:37 pm

Posted in Ruby


Mac OS X: A couple of neat tools

with 7 comments

When I first started working at uSwitch Sid installed a couple of ‘productivity applications’ on my Mac which I’ve found pretty useful but from talking to others I realised they aren’t known/being used by everyone.

Alfred

Alfred is a Quicksilver replacement which allows you to quickly open applications, find files, search Google and more. Even though we’re not using half of its features it’s still proved to be useful.

I quite like the calculator feature which we’ve been using for ad hoc calculations like working out how much free memory there was on a server or the conversion rate on part of an A/B test.

[Screenshot: Alfred’s calculator feature]

Moom

The other application is Moom which allows you to move/resize windows.

I didn’t see the point when I first saw it but it’s actually really useful when you’re working on a big monitor and want to put say the terminal alongside the browser.

We have the following shortcuts set up:

[Screenshot: Moom keyboard shortcut settings]

That allows us to type ‘Ctrl + Space’ to make the window fill the left hand side of the screen, ‘Alt + Space’ to make it fill the right hand side of the screen and ‘Alt + Ctrl + Space’ to fill the whole screen.

You can also set up shortcuts to allow you to move a window between displays or to rearrange the windows based on certain events.

Highly recommended!

If anyone knows any other cool tools like this I’d love to hear about them.

Written by Mark Needham

April 30th, 2013 at 8:07 pm

neo4j/cypher: Returning a row with zero count when no relationship exists

with 3 comments

I’ve been trying to see if I can match some of the football stats that OptaJoe posts on twitter and one that I was looking at yesterday was around the number of red cards different teams have received.

1 – Sunderland have picked up their first PL red card of the season. The only team without one now are Man Utd. Angels.

To recap, this is the sub graph that we’ll need to look at to work it out:

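[Diagram: sub graph showing the ‘sent_off_in’, ‘played’, ‘in’ and ‘for’ relationships between players, matches and teams]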

I started off with the following query which traverses out from each match, finds the players who were sent off in the match and then groups the sendings off by the team they were playing for:

START game = node:matches('match_id:*')
MATCH game<-[:sent_off_in]-player-[:played]->likeThis-[:in]->game, 
      likeThis-[:for]->team
RETURN team.name, COUNT(game) AS redCards
ORDER BY redCards
LIMIT 5

When we run this we get the following results:

+------------------------------+
| team.name         | redCards |
+------------------------------+
| "Sunderland"      | 1        |
| "West Ham United" | 1        |
| "Norwich City"    | 1        |
| "Reading"         | 1        |
| "Liverpool"       | 2        |
+------------------------------+
5 rows

The problem we have here is that it hasn’t returned Manchester United because they haven’t yet received any red cards and therefore none of their players match the ‘sent_off_in’ relationship.

I ran into something similar in a post I wrote about a month ago where I was working out which day of the week players scored on.

The first step towards getting Manchester United to return with a count of 0 is to make the ‘sent_off_in’ relationship optional.

However, that on its own isn’t enough because it now returns a count of all the player performances for each team:

START game = node:matches('match_id:*')
MATCH game<-[?:sent_off_in]-player-[:played]->likeThis-[:in]->game, 
      likeThis-[:for]->team
RETURN team.name, COUNT(game) AS redCards
ORDER BY redCards ASC
LIMIT 5

which returns:

+-----------------------------+
| team.name        | redCards |
+-----------------------------+
| "Chelsea"        | 448      |
| "Wigan Athletic" | 459      |
| "Fulham"         | 460      |
| "Liverpool"      | 466      |
| "Everton"        | 467      |
+-----------------------------+
5 rows

Instead what we need to do is collect up all the ‘sent_off_in’ relationships and count them.

We can use the COLLECT function to do that, and the neat thing about COLLECT is that it ignores the null values produced when the optional relationship doesn’t exist, so we end up with exactly what we need:

START game = node:matches('match_id:*')
MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game, 
      likeThis-[:for]->team
RETURN team.name, COLLECT(r) AS redCards
LIMIT 5

which returns:

+-----------------------------------------------------------------------------------------------------+
| team.name          | redCards                                                                       |
+-----------------------------------------------------------------------------------------------------+
| "Wigan Athletic"   | [:sent_off_in[26443] {},:sent_off_in[37785] {}]                                |
| "Everton"          | [:sent_off_in[6795] {minute:61},:sent_off_in[21735] {},:sent_off_in[34594] {}] |
| "Newcastle United" | [:sent_off_in[434] {minute:75},:sent_off_in[32389] {},:sent_off_in[34915] {}]  |
| "Southampton"      | [:sent_off_in[49393] {minute:70},:sent_off_in[49392] {minute:82}]              |
| "West Ham United"  | [:sent_off_in[21734] {minute:67}]                                              |
+-----------------------------------------------------------------------------------------------------+
5 rows

We then just need to call the LENGTH function to work out how many red cards there are in each collection and then we’re done:

START game = node:matches('match_id:*')
MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game, 
      likeThis-[:for]->team
RETURN team.name, LENGTH(COLLECT(r)) AS redCards
ORDER BY redCards
LIMIT 5

which returns:

+--------------------------------+
| team.name           | redCards |
+--------------------------------+
| "Manchester United" | 0        |
| "West Ham United"   | 1        |
| "Sunderland"        | 1        |
| "Norwich City"      | 1        |
| "Reading"           | 1        |
+--------------------------------+
5 rows
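The difference between counting the games and counting the collected relationships can be sketched in plain Python. The rows below are made up stand-ins for what the optional match produces, with None wherever a player wasn't sent off:

```python
from collections import defaultdict

# Hypothetical rows: (team, sent_off relationship or None if the player wasn't sent off)
rows = [
    ("Manchester United", None),
    ("Manchester United", None),
    ("Sunderland", "sent_off_in[1]"),
    ("Sunderland", None),
]

appearances = defaultdict(int)  # roughly what COUNT(game) ends up counting
red_cards = defaultdict(int)    # roughly what LENGTH(COLLECT(r)) counts

for team, sent_off in rows:
    appearances[team] += 1
    red_cards[team] += 0 if sent_off is None else 1  # nulls are skipped

print(dict(appearances))
print(dict(red_cards))
```

Every row bumps the appearance count, which is why the earlier query returned numbers in the hundreds, whereas only the non-null relationships contribute to the red card count, so a team with no sendings off still shows up with 0.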

Written by Mark Needham

April 30th, 2013 at 7:02 am

Posted in neo4j
