Archive for the ‘Software Development’ Category
On Wednesday evening I attended an interesting spin on the monthly Neo4j meetup, where instead of the usual ‘talk then go to the pub afterwards’ format, my colleagues Rik and Arturas organised Graph Café in the Doggett’s Coat and Badge pub in Blackfriars.
The format was changed as well – the evening consisted of ~10 lightning talks spread out over about 3 hours, an approach Rik has used at similar events in Belgium and Holland earlier in the year.
In the gaps in between the talks people mingled with each other and shared tips/talked through the problems they were trying to solve using graphs.
There was a strong turnout and it was much more interactive than a normal meet-up, where the main interaction comes from people asking the speaker questions. While there’s usually a pub trip after a talk, there’s always a noticeable drop-off in numbers, so it was good to have everyone together chatting this time.
Frank Gibson described what I thought was the coolest use of graphs of the evening. He’s modelling different drugs, which medical conditions they treat and which other drugs they aren’t compatible with. The next step is to bring that together with patients’ medical records to help doctors make treatment recommendations.
As well as the talks, tablecloths were laid out on the tables so people could sketch out the problems they were working on and get input from others. Tobias went a bit meta and drew a graph about graph databases:
While it’s often said that graphs are whiteboard friendly I was still surprised at how effective this was. When someone was explaining what they were working on I found myself sketching out what I’d interpreted and then they’d join in and point out bits I’d misunderstood and bits they were thinking about changing.
Overall it was a fun meet up and now we need to try and work out how to keep the interactive aspect when it isn’t 25 degrees outside and we’re not in a pub overlooking St Paul’s.
I often go off on massive tangents reading all about a new topic, but I don’t record what I’ve read, so if I come back to the topic in the future I have to start from scratch, which is quite frustrating.
I started off by reading a paper written by James Keener about the Perron-Frobenius Theorem and the ranking of American football teams.
The Perron-Frobenius Theorem asserts the following:
a real square matrix with positive entries has a unique largest real eigenvalue and that the corresponding eigenvector has strictly positive components
This is applicable to network-based ranking systems: we can build up a matrix of teams, store a value representing their performance against each other, and then calculate an ordered ranking based on eigenvector centrality.
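To make that idea a bit more concrete, here’s a minimal sketch of eigenvector-based ranking using power iteration – the team names and result values below are entirely made up for illustration:

import numpy as np

# Hypothetical results matrix: entry [i][j] is team i's "performance" against
# team j, e.g. points scored or a smoothed win fraction. Values are invented.
teams = ["Team A", "Team B", "Team C", "Team D"]
A = np.array([
    [0.0, 0.8, 0.6, 0.9],
    [0.2, 0.0, 0.7, 0.5],
    [0.4, 0.3, 0.0, 0.6],
    [0.1, 0.5, 0.4, 0.0],
])

# Power iteration converges to the dominant eigenvector, which Perron-Frobenius
# guarantees has strictly positive components for a matrix with positive entries
# (and, with a bit more care, for irreducible non-negative matrices like this one).
r = np.ones(len(teams))
for _ in range(100):
    r = A @ r
    r = r / np.linalg.norm(r)

for team, score in sorted(zip(teams, r), key=lambda x: -x[1]):
    print(f"{team}: {score:.3f}")

The components of the resulting vector give each team’s ranking score, with the ordering determined by how well they did against other highly ranked teams.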
I also came across the following articles describing different network-based approaches to ranking teams/players in tennis and basketball respectively:
- A network-based dynamical ranking system for competitive sports
- Using Graph Theory to Predict NCAA March Madness Basketball
Unfortunately I haven’t come across any corresponding code showing how to implement those algorithms so I need to do a bit more reading and figure out how to do it.
In the world of non-network-based ranking systems I came across 3 algorithms:
- Elo – this is a method originally developed to calculate the relative skill of chess players.
Players start out with an average rating which then increases or decreases based on the games they take part in. If they beat someone ranked much higher they gain a lot of points, whereas losing to someone similarly ranked doesn’t affect their rating too much.
- Glicko – this method was developed because its author, Mark Glickman, identified some flaws in the Elo rating system around the reliability of players’ ratings.
This algorithm therefore introduces the concept of a ratings deviation (RD) to measure uncertainty in a rating. If a player plays regularly they’ll have a low RD; if they don’t, it’ll be higher. This is then taken into account when assigning points based on games between different players.
- TrueSkill – this one was developed by Microsoft Research to rank players using XBox Live. This seems similar to Glicko in that it has a rating and uncertainty for each player. TrueSkill’s FAQs suggest the following difference between the two:
Glicko was developed as an extension of ELO and was thus naturally limited to two player matches which end in either win or loss. Glicko cannot update skill levels of players if they compete in multi-player events or even in teams. The logistic model would make it computationally expensive to deal with team and multi-player games. Moreover, chess is usually played in pre-set tournaments and thus matching the right opponents was not considered a relevant problem in Glicko. In contrast, the TrueSkill ranking system offers a way to measure the quality of a match between any set of players.
Scott Hamilton has an implementation of all these algorithms in Python which I need to play around with. He based his algorithms on a blog post written by Jeff Moser, in which Moser explains probabilities, the Gaussian distribution, Bayesian probability and factor graphs while deciphering the TrueSkill algorithm. Moser’s also created a project implementing TrueSkill in C# on GitHub.
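Before digging into that code, here’s a minimal sketch of the basic Elo update just to show the shape of the calculation – the K-factor and ratings are arbitrary illustrative values, not taken from any of the implementations above:

def expected_score(rating_a, rating_b):
    # Probability that player A beats player B under the Elo model
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    # score_a is 1 for a win, 0.5 for a draw, 0 for a loss
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# A 1400-rated player beating an 1800-rated player gains far more
# than they would for beating a similarly rated opponent.
print(update_elo(1400, 1800, 1))   # roughly (1429, 1771)
print(update_elo(1400, 1400, 1))   # roughly (1416, 1384)

Glicko and TrueSkill build on this same intuition but also track how uncertain each rating is, so the size of the update depends on how much the system already knows about the players involved.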
I follow tennis and football reasonably closely so I thought I’d do a bit of reading about the main two rankings I know about there as well:
- UEFA club coefficients – used to rank football clubs that have taken part in a European competition over the last 5 seasons. It takes into account the importance of the match but not the strength of the opposition.
- ATP Tennis Rankings – used to rank tennis players on a rolling basis over the last 12 months. They take into account the importance of a tournament and the round a player reached to assign ranking points.
Now that I’ve recorded all that it’s time to go and play with some of them!
One of the common feature requests on the ThoughtWorks projects I worked on was that the application should be almost infinitely configurable to cover potential future use cases.
My experience of attempting to do this was that you ended up with an extremely complicated code base and those future use cases often didn’t come to fruition.
It therefore made more sense to solve the problem at hand and then make the code more configurable if/when the need arose.
Now that I’m working on a product and associated tools I’m trying to understand whether those rules of application development apply.
One thing which I think makes sense is the idea of convention over configuration, an approach that I became familiar with after working with Ruby/Rails in 2010/2011.
The phrase essentially means a developer only needs to specify unconventional aspects of the application.
Even if we do this I wonder if it goes far enough. The more things we make configurable, the more complexity we add and the more opportunity there is for people to create problems for themselves through misconfiguration.
Perhaps we should only make a few things configurable and have our application work out appropriate values for everything else.
There are a reasonable number of people using a product who don’t have much interest in learning how to configure it. They just want to use it to solve a problem they have without having to think too much.
Although I haven’t used it I’m told that Azul’s Zing JVM takes the minimal configuration approach by only requiring you to specify one parameter – the heap size – and it handles everything else for you.
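As a purely hypothetical sketch of that ‘one knob, derive the rest’ idea – none of these setting names or ratios come from a real product – it might look something like this:

def derive_settings(heap_size_mb):
    # The user specifies a single value; everything else is calculated from it
    # using ratios that the product's developers believe are sensible defaults.
    return {
        "heap_size_mb": heap_size_mb,
        "page_cache_mb": int(heap_size_mb * 0.5),
        "query_cache_entries": max(100, heap_size_mb // 10),
        "worker_threads": max(2, heap_size_mb // 1024),
    }

print(derive_settings(8192))

The appeal is that most users never see the derived values at all, while the people who wrote the product get to encode their knowledge of what sensible values look like.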
Of course I’m still new to this so perhaps it still does make sense to default most things but allow power users full control in case their use case differs from the average one that the defaults were created for.
I’d be interested in hearing the opinions of people more experienced in this arena, of which there are undoubtedly many.
On the recommendation of Ian Robinson I’ve been reading the 2nd edition of William Kent’s ‘Data and Reality‘, and the author makes an interesting observation at the end of the first chapter which resonated with me:
Once more: we are not modelling reality, but the way information about reality is processed, by people.
It reminds me of similar advice in Eric Evans’ Domain Driven Design and it’s advice which I believe is helpful when designing a model in a graph database.
Last year I wrote a post explaining how I’d be using an approach of defining questions that I wanted to ask before modelling my data and in neo4j land we can do this by writing cypher queries up front.
We can then play around with increasing the size of our data set in different ways to check that our queries are still performant and tweak our model if necessary.
For example one simple optimisation would be to run an offline query to make implicit relationships explicit.
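As a rough sketch of what such an offline job could look like – the labels, relationship types and connection details below are invented, and I’m using the Python driver purely for illustration:

from neo4j import GraphDatabase

# If queries keep traversing Person -> Post <- Person to find people who have
# interacted, we can periodically materialise a direct relationship instead.
query = """
MATCH (a:Person)-[:WROTE]->(post:Post)<-[:COMMENTED_ON]-(b:Person)
MERGE (a)-[:INTERACTED_WITH]->(b)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(query)
driver.close()

Subsequent queries can then follow the single INTERACTED_WITH hop rather than the longer implicit path.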
Although graphs are very whiteboard friendly and it can be tempting to design our whole model before writing any queries this often causes problems later on.
When we eventually get to asking questions of our data we may find that we’ve modelled some things unnecessarily or have designed the model in a way that leads to inefficient queries.
I’ve found an effective approach is to keep the feedback loop tight by minimising the amount of time between drawing parts of our model on a whiteboard and writing queries against it.
If you’re interested in learning more, Ian has a slide deck from a talk he did at JAX 2013 which covers this idea and others when building out graph database applications.
We have a test in our code which checks for unresolvable hosts and it started failing for me because instead of throwing an UnknownHostException from the following call:
InetAddress.getByName( "host.that.is.invalid" )
I was getting back a valid although unreachable host. When I called ping it was easier to see what was going on:
$ ping host.that.is.invalid

PING host.that.is.invalid (18.104.22.168): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
As you can see, that hostname is resolving to an actual IP address, which I thought was a bit weird, but dig confirmed that this was happening:
$ dig host.that.is.invalid

; <<>> DiG 9.8.3-P1 <<>> host.that.is.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30043
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;host.that.is.invalid.		IN	A

;; ANSWER SECTION:
host.that.is.invalid.	300	IN	A	126.96.36.199
It turns out that BT intercepts failed DNS lookups and redirects you to one of their pages instead – something I hadn’t noticed before.
The site they direct you to is www.webaddresshelp.bt.com which contains a list of sponsored results for the search term ‘host.that.is.invalid’ in this case.
$ ping www.webaddresshelp.bt.com

PING www.webaddresshelp.bt.com (188.8.131.52): 56 data bytes
If we then wait a little while for the DNS cache to clear, the ping fails with an unknown host error, which is what we’d expect for an invalid hostname.
I recently wanted to attach an EBS volume to an existing EC2 instance that I had running and, since it was for a one-off task (famous last words), I decided to configure it manually.
I created the EBS volume through the AWS console and one thing that initially caught me out is that the EC2 instance and EBS volume need to be in the same region and zone.
Therefore, if I create my EC2 instance in ‘eu-west-1b’ then I need to create my EBS volume in ‘eu-west-1b’ as well, otherwise I won’t be able to attach it to that instance.
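If you’d rather script this step than click through the console, something along these lines should work – I’m using boto3 here purely as an illustration, and the instance id is a placeholder:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
instance_id = "i-0123456789abcdef0"  # placeholder

# Look up the instance's availability zone first, because the volume
# has to be created in the same zone before it can be attached.
instance = ec2.describe_instances(InstanceIds=[instance_id])
az = instance["Reservations"][0]["Instances"][0]["Placement"]["AvailabilityZone"]

volume = ec2.create_volume(Size=50, AvailabilityZone=az)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(VolumeId=volume["VolumeId"], InstanceId=instance_id, Device="/dev/sdf")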
I attached the device as /dev/sdf although the UI gives the following warning:
Linux Devices: /dev/sdf through /dev/sdp
Note: Newer linux kernels may rename your devices to /dev/xvdf through /dev/xvdp internally, even when the device name entered here (and shown in the details) is /dev/sdf through /dev/sdp.
After attaching the EBS volume to the EC2 instance my next step was to SSH onto my EC2 instance and make the EBS volume available.
The first step is to create a file system on the volume:
$ sudo mkfs -t ext3 /dev/sdf

mke2fs 1.42 (29-Nov-2011)
Could not stat /dev/sdf --- No such file or directory

The device apparently does not exist; did you specify it correctly?
It turns out that warning was handy – the device had in fact been renamed. We can confirm this by calling fdisk:
$ sudo fdisk -l

Disk /dev/xvda1: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/xvda1 doesn't contain a valid partition table

Disk /dev/xvdf: 53.7 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders, total 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/xvdf doesn't contain a valid partition table
/dev/xvdf is the one we’re interested in so I re-ran the previous command:
$ sudo mkfs -t ext3 /dev/xvdf

mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
3276800 inodes, 13107200 blocks
655360 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
400 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
Once I’d done that I needed to create a mount point for the volume and I thought the best place was probably a directory under /mnt:
$ sudo mkdir /mnt/ebs
The final step is to mount the volume:
$ sudo mount /dev/xvdf /mnt/ebs
And if we run df we can see that it’s ready to go:
$ df -h

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  883M  6.7G  12% /
udev            288M  8.0K  288M   1% /dev
tmpfs           119M  164K  118M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            296M     0  296M   0% /run/shm
/dev/xvdf        50G  180M   47G   1% /mnt/ebs
Last week I had a ~10GB file I wanted to download to my machine, but Chrome’s initial estimate was that it would take 10+ hours, which meant I’d probably have shut down my machine before it completed.
It seemed to make more sense to spin up an EC2 instance and download it onto there instead but I didn’t want to have to keep an SSH session open to that machine either.
screen was the tool I was most familiar with for keeping a session alive after disconnecting, so I decided to use that, and I thought I should make a quick note of some of its basic flags for future me.
Starting a new session
Starting a new screen session is as simple as typing the following command:

$ screen

which leads to the following output:
Screen version 4.00.03jw4 (FAU) 2-May-06

Copyright (c) 1993-2002 Juergen Weigert, Michael Schroeder
Copyright (c) 1987 Oliver Laumann

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
...
We can now start downloading our file using cURL, wget or a download accelerator like axel which is my personal favourite.
Detaching/Exiting from a session without it dying
Once I’d got the download running I wanted to close my SSH session to the AWS instance but first I wanted to detach from my screen session without killing it.
My first attempt was to use Ctrl + D, but that actually terminates the session, and the download stops along with it, which isn’t quite what we wanted. The way to detach without killing the session is to press Ctrl + a followed by d, which leaves the session (and the download) running in the background.
Reattaching to a session
After about an hour I wanted to check up on my download and I assumed just typing screen would take me back to my session, but instead it created a new one.
Alistair pointed out that I could get a listing of all the open screen sessions by typing the following command:
$ screen -ls

There are screens on:
	23397.pts-0.ip-10-243-5-102	(07/31/2013 05:25:30 AM)	(Detached)
	3981.pts-0.ip-10-243-5-102	(07/26/2013 07:59:28 AM)	(Detached)
	3910.pts-0.ip-10-243-5-102	(07/26/2013 07:58:42 AM)	(Detached)
	1094.pts-0.ip-10-243-5-102	(07/26/2013 07:49:31 AM)	(Detached)
4 Sockets in /var/run/screen/S-ubuntu.
As you can see, I’d created a bunch of extra sessions by mistake.
The one I had the download running on was ’1094.pts-0.ip-10-243-5-102′ and we can reattach to that one like this:
$ screen -x 1094.pts-0.ip-10-243-5-102
We can also attach using the ‘-r’ flag:
$ screen -r 1094.pts-0.ip-10-243-5-102
I’m not quite sure what the difference is between ‘-r’ and ‘-x’, they both seem to behave in the same way in this scenario.
The manual suggests that ‘-x’ is for attaching to a ‘not detached screen session’ which suggests to me that it shouldn’t have worked since I wanted to connect to a detached session.
Hopefully someone with more knowledge of how these things work can explain what’s going on!
Attach to an existing session or start a new one if none exists
I later learnt that, had I not accidentally created all those extra sessions, the following command would have been quite useful for finding the first available screen session and connecting to it:
$ screen -x -R
If there aren’t any existing screen sessions available then it will create a new one which means that in my particular situation this would have been a more appropriate command to start with.
I recently wanted to copy some large files from an AWS instance into an S3 bucket using s3cmd but ended up with the following error when trying to use the ‘put’ command:
$ s3cmd put /mnt/ebs/myfile.tar s3://mybucket.somewhere.com

/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [1 of 1]
   1077248 of 12185313280     0% in    1s   937.09 kB/s  failed
WARNING: Upload failed: /myfile.tar ([Errno 104] Connection reset by peer)
WARNING: Retrying on lower speed (throttle=0.00)
WARNING: Waiting 3 sec...
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [1 of 1]
   1183744 of 12185313280     0% in    1s  1062.18 kB/s  failed
WARNING: Upload failed: /myfile.tar ([Errno 104] Connection reset by peer)
WARNING: Retrying on lower speed (throttle=0.01)
WARNING: Waiting 6 sec...
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [1 of 1]
    417792 of 12185313280     0% in    1s   378.75 kB/s  failed
WARNING: Upload failed: /myfile.tar ([Errno 104] Connection reset by peer)
WARNING: Retrying on lower speed (throttle=0.05)
WARNING: Waiting 9 sec...
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [1 of 1]
     94208 of 12185313280     0% in    1s    81.04 kB/s  failed
WARNING: Upload failed: /myfile.tar ([Errno 32] Broken pipe)
WARNING: Retrying on lower speed (throttle=0.25)
WARNING: Waiting 12 sec...
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [1 of 1]
     28672 of 12185313280     0% in    1s    18.40 kB/s  failed
WARNING: Upload failed: /myfile.tar ([Errno 32] Broken pipe)
WARNING: Retrying on lower speed (throttle=1.25)
WARNING: Waiting 15 sec...
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [1 of 1]
     12288 of 12185313280     0% in    2s     4.41 kB/s  failed
ERROR: Upload of '/mnt/ebs/myfile.tar' failed too many times. Skipping that file.
I tried with a smaller file just to make sure I wasn’t doing anything stupid syntax-wise and that transferred without a problem, which led me to believe the problem might be with uploading larger files – the one I was uploading was around ~10GB in size.

Files that size are too large for a single S3 PUT, so they need to be sent using S3’s multipart upload, which s3cmd only gained support for in later versions. The Ubuntu repository comes with version 1.0.0, so I needed to find a way of getting a newer version onto the machine.
I eventually ended up downloading version 1.5.0 from sourceforge but I couldn’t get a direct URI to download it so I ended up downloading it to my machine, uploading to the S3 bucket through the web UI and then pulling it back down again using a ‘s3cmd get’. #epic
In retrospect the s3cmd PPA might have been a better option.
Anyway, when I used this version of s3cmd it uploaded using multipart just fine:
...
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [part 761 of 775, 15MB]
 15728640 of 15728640   100% in    3s     4.12 MB/s  done
/mnt/ebs/myfile.tar -> s3://mybucket.somewhere.com/myfile.tar  [part 762 of 775, 15MB]
...
My former colleague Anne Simmons recently wrote an interesting post in which she describes some of the reasons she finds herself not wanting to write about technical topics.
I wrote a post at the end of 2012 in which I explained some of the reasons why I think writing about what you learn is a good idea but Anne brought up some things I hadn’t thought of which I think are worth addressing.
She’s already described her own mantras to overcome these but I thought it’d still be interesting to share my experience as well:
What do I know that the internet doesn’t already?!
I’ve found that the posts I write in which I aggregate information from different places tend to be my most-read posts.
A lot of the people I’ve worked with (myself included), when confronted with a stack trace, will paste it straight into Google to try and solve their problem.
I’ve had the same experience as Anne in spending ages trying to solve a problem and thinking it would be cool to save someone else (usually future me) from having to rediscover the solution in the future.
Will people judge me about what knowledge I do have?/What if I’m wrong?!
In 5 years of writing I’ve only had a couple of occasions when people commented in what I thought was an unnecessary manner; the majority of feedback has been very positive.

Frequently people actually teach me something I didn’t know rather than criticising what I do know, so for me writing has been a net gain.
A strange side effect is that people think I know much more than I do based on writing about things I’ve been working on.
It can only be a good thing for the internet if more people write about the things that they’re working on so I hope Anne keeps to her one post a month target!
Creative brains are a valuable, limited resource. They shouldn’t be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there.
To behave like a hacker, you have to believe that the thinking time of other hackers is precious — so much so that it’s almost a moral duty for you to share information, solve problems and then give the solutions away just so other hackers can solve new problems instead of having to perpetually re-address old ones.
Until I started working on the uSwitch energy website around 8 months ago I had not really done any support of a production system so I learnt some interesting lessons in my time there.
Look at the new code first
We had our application wired up to Airbrake so whenever a user did anything which resulted in an exception being thrown we received a report with the stack trace, environment variables and which page they were on.
When trying to work out what had happened I initially started from scratch, working backwards from the source and creating a scenario in my head of what the user might have done to trigger that error.
After a few times of doing this it became clear that there was a reasonable chance that if a user was experiencing a problem it was probably because of some new code that we’d just introduced.
We therefore tweaked our bug-hunting approach to first check code that had been changed recently, and only after ruling that out did we work back from first principles.
It may never have worked
Sometimes it became clear that new code wasn’t to blame, even though it seemed implausible that the error could ever have happened.

There was a tendency to assume that the user must be deliberately doing something to make the application break, but it soon became clear that they had simply hit a code path that had never been exercised before.

Even if you’ve done extensive testing on a system, users still seem to find paths through the code that have never failed previously, so it seems best to just assume that is going to happen at some stage.
Log all the things
As I mentioned earlier, we were using a third-party service to collect errors and other helpful information, which was really useful for finding the root cause of problems.

The type of logging you need varies: for a product like neo4j, as well as logging exceptions we also log system information and memory settings.
Obviously I’m quite new to this type of work so I’m sure others will have useful bits of advice to share as well.