Mark Needham

Thoughts on Software Development

Archive for the ‘devops’ tag

Treat servers as cattle: Spin them up, tear them down

with one comment

A few weeks ago I wrote a post about treating servers as cattle, not as pets, in which I described an approach to managing virtual machines at uSwitch whereby we frequently spin up new ones and delete the existing ones.

I’ve worked on teams previously where we’ve also talked about this mentality but ended up not doing it because it was difficult, usually for one of two reasons:

  • Slow spin up – this might be due to the cloud provider's infrastructure, doing too much work on spin up or, I'm sure, a variety of other reasons.
  • Manual steps involved in spin up – the process isn't 100% automated so we have to make some manual tweaks. Once the machine is finally working we don't want to have to go through that again.

Martin Fowler wrote a post a couple of years ago where he said the following:

One of my favorite soundbites is: if it hurts, do it more often. It has the happy property of seeming nonsensical on the surface, but yielding some valuable meaning when you dig deeper.

I think it applies in this context too and I have noticed that the more frequently we tear down and spin up new nodes the easier it becomes to do so.

Part of this is because there’s been less time for changes to have happened in package repositories but we are also more inclined to optimise things that we have to do frequently so the whole process is faster as well.

For example in one of our sets of machines we need to give one machine a specific tag so that when the application is deployed it sets up a bunch of cron jobs to run each evening.

Initially this was done manually and we were quite reluctant to ever tear down that machine but we’ve now got it all automated and it’s not a big deal anymore – it can be cattle just like the rest of them!

One neat rule of thumb Phil taught me is that if we make major changes to our infrastructure we should spin up some new machines to check that it still actually works.

If we don’t do this then when we actually need to spin up a new node because of a traffic spike or machine corruption problem it’s not going to work and we’re going to have to fix things in a much more stressful context.

For example we recently moved some repositories around in github and although it’s a fairly simple change spinning up new nodes helped us see all the places where we’d failed to make the appropriate change.

While I appreciate taking this approach is more time consuming in the short term I’d argue that if we automate as much of the pain as possible in the long run it will probably be beneficial.

Written by Mark Needham

April 27th, 2013 at 2:22 pm

Posted in DevOps


Puppet: Installing Oracle Java – oracle-license-v1-1 license could not be presented

with 5 comments

In order to run the neo4j server on my Ubuntu 12.04 Vagrant VM I needed to install the Oracle/Sun JDK which proved to be more difficult than I’d expected.

I initially tried to install it via the OAB-Java script but was running into some dependency problems and eventually came across a post which specified a PPA that had an installer I could use.

I wrote a little puppet Java module to wrap the commands in:

class java($version) {
  package { "python-software-properties": }

  exec { "add-apt-repository-oracle":
    command => "/usr/bin/add-apt-repository -y ppa:webupd8team/java",
    notify  => Exec["apt_update"],
  }

  package { 'oracle-java7-installer':
    ensure  => "${version}",
    require => [Exec['add-apt-repository-oracle']],
  }
}

I then included this in my default node definition:

node default {
  class { 'java': version => '7u21-0~webupd8~0', }
}

(as Dave Yeung points out in the comments, you may need to tweak the version. Running aptitude versions oracle-java7-installer should indicate the latest version.)

Unfortunately when I ran that I ended up with the following error:

err: /Stage[main]/Java/Package[oracle-java7-installer]/ensure: change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install oracle-java7-installer' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
The following extra packages will be installed:
Suggested packages:
Unpacking oracle-java7-installer (from .../oracle-java7-installer_7u21-0~webupd8~0_all.deb) ...
oracle-license-v1-1 license could not be presented
try 'dpkg-reconfigure debconf' to select a frontend other than noninteractive
dpkg: error processing /var/cache/apt/archives/oracle-java7-installer_7u21-0~webupd8~0_all.deb (--unpack):
 subprocess new pre-installation script returned error exit status 2
Processing triggers for man-db ...
Errors were encountered while processing:
E: Sub-process /usr/bin/dpkg returned an error code (1)

I came across this post on Ask Ubuntu which explained a neat trick for getting around it by making it look like we’ve agreed to the licence. This is done by passing options to debconf-set-selections.

For a real server I guess you’d want some step where a person accepts the licence but since this is just for my hacking it seems to make sense.

My new Java manifest looks like this:

class java($version) {
  package { "python-software-properties": }

  exec { "add-apt-repository-oracle":
    command => "/usr/bin/add-apt-repository -y ppa:webupd8team/java",
    notify  => Exec["apt_update"],
  }

  exec { 'set-licence-selected':
    command => '/bin/echo debconf shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections',
  }

  exec { 'set-licence-seen':
    command => '/bin/echo debconf shared/accepted-oracle-license-v1-1 seen true | /usr/bin/debconf-set-selections',
  }

  package { 'oracle-java7-installer':
    ensure  => "${version}",
    require => [Exec['add-apt-repository-oracle'], Exec['set-licence-selected'], Exec['set-licence-seen']],
  }
}

Written by Mark Needham

April 18th, 2013 at 11:36 pm

Posted in DevOps


dpkg/apt-cache: Useful commands

without comments

As I’ve mentioned in a couple of previous posts I’ve been playing around with creating a Vagrant VM that I can use for my neo4j hacking which has involved a lot of messing around with installing apt packages.

There are loads of different ways of working out what’s going on when packages aren’t installing as you’d expect so I thought it’d be good to document the ones I’ve been using so I can find them more easily next time.

Finding reverse dependencies

A couple of times I found myself wondering how a certain package had ended up on the VM because I hadn’t specified that it should be installed so I wanted to know who had!

I wanted to find out the reverse dependencies of the package, e.g. to find out what depends on make, which we can do with the following command:

$ apt-cache rdepends make
Reverse Depends:

The nice thing about ‘rdepends’ is that it will tell us reverse dependencies even for a package that we haven’t installed. This was helpful here as I had forgotten to install ‘build-essential’ and this made it obvious.

Finding which version of a package is installed

I added one of the Brightbox repositories to get a more recent Ruby version and noticed that something weird was going on with the version of ‘nginx-common’ that puppet was trying to install.

It seemed like one of my dependencies was trying to pull in the 'latest' version of 'nginx-common' when I'd expected it to be '1.1.19-1ubuntu0.1'.

By passing the ‘policy’ flag to apt-cache I was able to see that there was a more recent version available via Brightbox:

$ apt-cache policy nginx-common
  Installed: 1.1.19-1ubuntu0.1
  Candidate: 1:1.2.6-1~43~precise1
  Version table:
     1:1.2.6-1~43~precise1 0
        500 precise/main amd64 Packages
 *** 1.1.19-1ubuntu0.1 0
        500 precise-updates/universe amd64 Packages
        100 /var/lib/dpkg/status
     1.1.19-1 0
        500 precise/universe amd64 Packages

Finding which versions of a package are available

Another flag that we can pass to apt-cache is ‘madison’ which shows us the available versions for a package but doesn’t indicate which version is installed:

$ apt-cache madison nginx-common
nginx-common | 1:1.2.6-1~43~precise1 | precise/main amd64 Packages
nginx-common | 1.1.19-1ubuntu0.1 | precise-updates/universe amd64 Packages
nginx-common |   1.1.19-1 | precise/universe amd64 Packages
     nginx |   1.1.19-1 | precise/universe Sources
     nginx | 1.1.19-1ubuntu0.1 | precise-updates/universe Sources
     nginx | 1:1.2.6-1~43~precise1 | precise/main Sources

Finding which package a file belongs to

At some stage I wanted to check which exact package was installing nginx which I was able to do with the following command:

$ dpkg -S `which nginx`
nginx-extras: /usr/sbin/nginx

I had installed ‘nginx-common’, which I learnt depends on ‘nginx-extras’, by using our ‘rdepends’ command:

$ apt-cache rdepends nginx-extras
Reverse Depends:

Finding the dependencies of a package

I wanted to check the dependencies of the ‘ruby1.9.1’ package to see whether or not I needed to explicitly install ‘libruby1.9.1’ or if that would be taken care of.

Passing the ‘-s’ flag to dpkg let me check this:

$ dpkg -s ruby1.9.1
Package: ruby1.9.1
Status: install ok installed
Architecture: amd64
Version: 1:
Replaces: irb1.9.1, rdoc1.9.1, rubygems1.9.1
Provides: irb1.9.1, rdoc1.9.1, ruby-interpreter, rubygems1.9.1
Depends: libruby1.9.1 (= 1:, libc6 (>= 2.2.5)
Suggests: ruby1.9.1-examples, ri1.9.1, graphviz, ruby1.9.1-dev, ruby-switch
Conflicts: irb1.9.1 (<<, rdoc1.9.1 (<<, ri (<= 4.5), ri1.9.1 (<<, ruby (<= 4.5), rubygems1.9.1

These are the ones that I’ve found useful so far. I’d love to hear other people’s favourites though as I’m undoubtedly missing some.

Written by Mark Needham

April 18th, 2013 at 9:54 pm

Posted in DevOps


Puppet Debt

without comments

I’ve been playing around with a puppet configuration to run a neo4j server on an Ubuntu VM and one thing that has been quite tricky is getting the Sun/Oracle Java JDK to install repeatably.

I adapted Julian’s Java module which uses OAB-Java and although it was certainly working cleanly at one stage I somehow ended up with it not working because of failed dependencies:

[2013-04-12 07:03:10] Notice: /Stage[main]/Java/Exec[install OAB repo]/returns:  [x] Installing Java build requirements failed
[2013-04-12 07:03:10] Notice: /Stage[main]/Java/Exec[install OAB repo]/returns:  [i] Showing the last 5 lines from the logfile (/root/
[2013-04-12 07:03:10] Notice: /Stage[main]/Java/Exec[install OAB repo]/returns:  nginx-common
[2013-04-12 07:03:10] Notice: /Stage[main]/Java/Exec[install OAB repo]/returns:  nginx-extras
[2013-04-12 07:03:10] Notice: /Stage[main]/Java/Exec[install OAB repo]/returns: E: Sub-process /usr/bin/dpkg returned an error code (1)
[2013-04-12 07:03:10] Warning: /Stage[main]/Java/Package[sun-java6-jdk]: Skipping because of failed dependencies
[2013-04-12 07:03:10] Notice: /Stage[main]/Java/Exec[default JVM]: Dependency Exec[install OAB repo] has failures: true
[2013-04-12 07:03:10] Warning: /Stage[main]/Java/Exec[default JVM]: Skipping because of failed dependencies

I spent a few hours looking at this but couldn’t quite figure out how to sort out the dependency problem, and ended up running one command manually, after which applying puppet again worked.

Obviously this is a bit of a cop out because ideally I’d like it to be possible to spin up the VM in one puppet run without manual intervention.

A couple of days ago I was discussing the problem with Ashok and he suggested that it was probably good to know when I could defer fixing the problem to a later stage since having a completely automated spin up isn’t my highest priority.

i.e. when I could take on what he referred to as ‘Puppet debt’.

I think this is a reasonable way of looking at things and I have worked on projects where we’ve been baffled by puppet’s dependency graph and have set up scripts which run puppet twice until we have time to sort it out.

If we’re spinning up new instances frequently then we have less ability to take on this type of debt because it’s going to hurt us much more but if not then I think it is reasonable to defer the problem.

This feels like another type of technical debt to me but I’d be interested in others’ thoughts and whether I’m just a complete cop out!

Written by Mark Needham

April 16th, 2013 at 8:57 pm

Posted in DevOps


Capistrano: Deploying to a Vagrant VM

with 2 comments

I’ve been working on a tutorial around thinking through problems in graphs using my football graph and I wanted to deploy it on a local vagrant VM as a stepping stone to deploying it in a live environment.

My Vagrant file for the VM looks like this:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant::Config.run do |config|
  config.vm.box = "precise64"
  config.vm.box_url = ""

  config.vm.define :neo01 do |neo|
    neo.vm.network :hostonly, ""
    neo.vm.host_name = 'neo01.local'
    neo.vm.forward_port 7474, 57474
    neo.vm.forward_port 80, 50080
  end

  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "puppet/manifests"
    puppet.manifest_file  = "site.pp"
    puppet.module_path = "puppet/modules"
  end
end

I’m port forwarding ports 80 and 7474 to 50080 and 57474 respectively so that I can access the web app and neo4j console from my browser.

There is a bunch of puppet code to configure the machine in the location specified.

Since the web app is written in Ruby/Sinatra the easiest deployment tool to use is probably capistrano and I found the tutorial on the beanstalk website really helpful for getting me setup.

My config/deploy.rb file which I’ve got Capistrano setup to read looks like this:

require 'capistrano/ext/multistage'
set :application, "thinkingingraphs"
set :scm, :git
set :repository,  ""
set :scm_passphrase, ""
set :ssh_options, {:forward_agent => true}
set :default_run_options, {:pty => true}
set :stages, ["vagrant"]
set :default_stage, "vagrant"

In my config/deploy/vagrant.rb file I have the following:

set :user, "vagrant"
server "", :app, :web, :db, :primary => true
set :deploy_to, "/var/www/thinkingingraphs"

So that IP there is the same one that I assigned in Vagrantfile. If you didn’t do that then you’d need to use ‘vagrant ssh’ to go onto the VM and then ‘ifconfig’ to grab the IP instead.

I figured there was probably another step required to tell Capistrano where it should get the vagrant public key from but I thought I’d try and deploy anyway just to see what would happen.

$ bundle exec cap deploy

It asked me to enter the vagrant user’s password, which is ‘vagrant’ by default, and I eventually found a post on StackOverflow which suggested changing the ‘ssh_options’ to the following:

set :ssh_options, {:forward_agent => true, keys: ['~/.vagrant.d/insecure_private_key']}

And with that the deployment worked flawlessly! Happy days.

Written by Mark Needham

April 13th, 2013 at 11:17 am

Posted in DevOps


Treating servers as cattle, not as pets

with 2 comments

Although I didn’t go to Dev Ops Days London earlier in the year I was following the hash tag on twitter and one of my favourite things that I read was the following:

“Treating servers as cattle, not as pets” #DevOpsDays

I think this is particularly applicable now that a lot of the time we’re using virtualised production environments via AWS, Rackspace or <insert-cloud-provider-here>.

At uSwitch we use AWS and over the last week Sid and I spent some time investigating a memory leak by running our applications against two different versions of Ruby.

One of them was from the Brightbox repository and the other was custom built but they had annoyingly different puppet configurations so we decided to treat them as separate machine types.

We spun up one of the custom built Ruby nodes, put it in the load balancer alongside 11 of the existing nodes and left it serving traffic for the day.

The next day we had a look at the New Relic memory consumption for both node types and it was clear that the custom built one’s memory usage was climbing much more slowly than the other one’s.

Instead of trying to work out how to change the Ruby version of the 11 existing nodes we realised it would probably be quicker to just spin up 11 new ones with the custom built Ruby and swap them with the existing ones.

This was pretty much as easy as removing the existing nodes from the load balancer and putting the new ones in although we do have one ‘special’ machine which runs some background jobs.

We needed to make sure there weren’t any jobs on its queue that hadn’t been processed and then make sure that we tagged one of the new machines so that they could take over that role.

One thing that made it particularly easy for us to do this is that spin up of new VMs is extremely quick and completely automated including the installation and start up of applications.

The only manual step we have is to put the new nodes into the load balancer which I think works ok as a manual step because it gives us a chance to quickly scan the box and check everything spun up correctly.

We install all packages/configuration on nodes using puppet headless which makes spin up easier than if you use server/client mode where you have to coordinate node registration with the master on spin up.

I do like this philosophy to machines and although I’m sure it doesn’t apply to all situations we’re almost at the point where if something breaks on a node we might as well spin up a new one while we’re investigating and see which finishes first!

Written by Mark Needham

April 7th, 2013 at 11:41 am

Posted in DevOps


Incrementally rolling out machines with a new puppet role

with one comment

Last week Jason and I, with (a lot of) help from Tim, worked on moving several of our applications from Passenger to Unicorn and decided that the easiest way to do this was to create a new set of nodes with this setup.

The architecture we’re working with looks like this at a VM level:


The ‘nginx LB’ nodes are responsible for routing all the requests to their appropriate application servers and the ‘web’ nodes serve the different applications initially using Passenger.

We started off by creating a new ‘nginx LB’ node which we pointed to a new ‘web ELB’ and just put one ‘unicorn web’ node behind it so that we could test everything was working.

We then pointed ‘’ at the IP of our new ‘nginx LB’ node in our /etc/hosts file and checked that the main flows through the various applications were working correctly.

Once we were happy this was working correctly we increased the number of ‘unicorn web’ nodes to three and then repeated our previous checks while tailing the log files across the three machines to make sure everything was ok.

The next step was to send some of the real traffic to the new nodes and check whether they were able to handle it.

Initially we thought that we could put our ‘unicorn web’ nodes alongside the ‘web’ nodes but we realised that we’d made some changes on our new ‘nginx LB’ nodes which meant that the ‘unicorn web’ nodes needed to receive requests proxied through there rather than from the old style nodes.

Between them, Jason and Sid came up with the idea of just plugging our new ‘nginx LB’ into the ‘nginx ELB’ and having the whole request processed separately by the new stack.

Our intermediate architecture therefore looked like this:

Architecture rollover

We initially served 1/4 of the requests from the Unicorn stack and watched the performance of the nodes via New Relic to check that everything was working as expected.

One thing we did notice was that the CPU usage on the Unicorn nodes was really high because we’d set up each Unicorn process with 5 workers which meant that we had 25 workers on the VM in total. In comparison our Passenger instances used 5 workers in total.

Once we’d sorted that out we removed one of the ‘nginx LB’ nodes from the ‘nginx ELB’ and served 1/3 of the traffic from our new stack.

We didn’t see any problems so we removed all the ‘nginx LB’ nodes and served all the traffic from our new stack for half an hour.

Again we didn’t notice any problems so our next step before we can decommission the old nodes is to run the new stack for a day and iron out any problems before using it for real.

Written by Mark Needham

March 24th, 2013 at 10:52 pm

Posted in DevOps


Understanding what lsof socket/port aliases refer to

with one comment

Earlier in the week we wanted to check which ports were being listened on and by what processes which we can do with the following command on Mac OS X:

$ lsof -ni | grep LISTEN
idea       2398 markhneedham   58u  IPv6 0xac8f13f77b903331      0t0  TCP *:49410 (LISTEN)
idea       2398 markhneedham   65u  IPv6 0xac8f13f7799a4af1      0t0  TCP *:58741 (LISTEN)
idea       2398 markhneedham  122u  IPv6 0xac8f13f7799a4711      0t0  TCP (LISTEN)
idea       2398 markhneedham  249u  IPv6 0xac8f13f777586711      0t0  TCP *:63342 (LISTEN)
idea       2398 markhneedham  253u  IPv6 0xac8f13f777586331      0t0  TCP (LISTEN)
java      16973 markhneedham  152u  IPv6 0xac8f13f777586af1      0t0  TCP *:56471 (LISTEN)
java      16973 markhneedham  154u  IPv6 0xac8f13f779e6b711      0t0  TCP *:menandmice-dns (LISTEN)
java      16973 markhneedham  168u  IPv6 0xac8f13f77b902f51      0t0  TCP (LISTEN)
java      16973 markhneedham  171u  IPv6 0xac8f13f77b013711      0t0  TCP (LISTEN)

One of the interesting things about this output is that for the most part it shows the port number and which IPs it will accept a connection from but sometimes it uses a socket/port alias.

In this case we can see that the 3rd last line refers to ‘menandmice-dns’ but others could be ‘http-alt’ or ‘mysql’.

We can find out what port those names refer to by looking in /etc/services:

$ cat /etc/services | grep menandmice-dns
menandmice-dns  1337/udp    # menandmice DNS
menandmice-dns  1337/tcp    # menandmice DNS
$ cat /etc/services | grep http-alt
http-alt	591/udp     # FileMaker, Inc. - HTTP Alternate (see Port 80)
http-alt	591/tcp     # FileMaker, Inc. - HTTP Alternate (see Port 80)
http-alt	8008/udp     # HTTP Alternate
http-alt	8008/tcp     # HTTP Alternate
http-alt	8080/udp     # HTTP Alternate (see port 80)
http-alt	8080/tcp     # HTTP Alternate (see port 80)

There’s a massive XML document on the IANA website with a full list of the port assignments which is presumably where /etc/services is derived from.
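If you don’t want to grep /etc/services by hand, Python’s socket module exposes the same lookup table. A quick sketch, using the ‘http’ entry since that exists on pretty much every system (unlike menandmice-dns):

```python
import socket

# Look up the port behind a service name, the same mapping lsof uses
print(socket.getservbyname("http", "tcp"))   # 80

# And go the other way: from a port number back to its name
print(socket.getservbyport(80, "tcp"))       # http
```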

Written by Mark Needham

March 17th, 2013 at 2:00 pm

Posted in DevOps


telnet/netcat: Waiting for a port to be open

with one comment

On Friday Nathan and I were setting up a new virtual machine and we needed a firewall rule to be created to allow us to connect to another machine which had some JAR files we wanted to download.

We wanted to know when it had been done by one of our operations team and I initially thought we might be able to do that using telnet:

$ telnet 8081
telnet: connect to address Operation timed out
telnet: Unable to connect to remote host

We wanted to put a watch on the command so that it would be repeated every few seconds and indicate when we could connect to the port. However, as far as I can tell there’s no way to reduce the length of the telnet timeout, so Nathan suggested using netcat instead.

We ended up with the following command…

$ nc -v -w 1 8081
nc: connect to port 8081 (tcp) failed: Connection refused

…which we can then wire up with watch like so:

$ watch "nc -v -w 1 8081"
Every 2.0s: nc -v -w 1 8081                         Sun Jan 20 15:48:05 2013
nc: connect to port 8081 (tcp) timed out: Operation now in progress

And then when it works:

Every 2.0s: nc -v -w 1 8081                         Sun Jan 20 15:49:53 2013
Connection to 8081 port [tcp] succeeded!

Written by Mark Needham

January 20th, 2013 at 3:53 pm

Posted in DevOps,Software Development


Fabric/Boto: boto.exception.NoAuthHandlerFound: No handler was ready to authenticate. 1 handlers were checked. [‘QuerySignatureV2AuthHandler’] Check your credentials

with one comment

In our Fabric code we make use of Boto to connect to the EC2 API and pull back various bits of information and the first time anyone tries to use it they end up with the following stack trace:

  File "/Library/Python/2.7/site-packages/fabric/", line 717, in main
    *args, **kwargs
  File "/Library/Python/2.7/site-packages/fabric/", line 332, in execute
    results['<local-only>'] =*args, **new_kwargs)
  File "/Library/Python/2.7/site-packages/fabric/", line 112, in run
    return self.wrapped(*args, **kwargs)
  File "/Users/mark/projects/forward-puppet/", line 131, in running
    instances = instances_by_zones(running_instances(region, role_name))
  File "/Users/mark/projects/forward-puppet/", line 19, in running_instances
    ec2conn = ec2.connect_to_region(region)
  File "/Library/Python/2.7/site-packages/boto/ec2/", line 57, in connect_to_region
    for region in regions(**kw_params):
  File "/Library/Python/2.7/site-packages/boto/ec2/", line 39, in regions
    c = EC2Connection(**kw_params)
  File "/Library/Python/2.7/site-packages/boto/ec2/", line 94, in __init__
  File "/Library/Python/2.7/site-packages/boto/", line 936, in __init__
  File "/Library/Python/2.7/site-packages/boto/", line 548, in __init__
    host, config, self.provider, self._required_auth_capability())
  File "/Library/Python/2.7/site-packages/boto/", line 633, in get_auth_handler
    'Check your credentials' % (len(names), str(names)))
boto.exception.NoAuthHandlerFound: No handler was ready to authenticate. 1 handlers were checked. ['QuerySignatureV2AuthHandler'] Check your credentials

We haven’t told Boto about our AWS credentials and I’ve come across two ways of providing them:

As environment variables

export AWS_ACCESS_KEY_ID="aws_access_key_id"
export AWS_SECRET_ACCESS_KEY="aws_secret_access_key"

In the file ~/.boto

[Credentials]
aws_access_key_id = aws_access_key_id
aws_secret_access_key = aws_secret_access_key

And that should do the trick!
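If you want your Fabric tasks to fail fast with a friendlier message than that stack trace, a small pre-flight check covering the two credential sources above is easy to write. This helper is my own sketch, not part of boto or Fabric, and it only checks these two sources:

```python
import os

def boto_credentials_present():
    """Return True if boto should find AWS credentials in the environment or in ~/.boto."""
    env_ok = ("AWS_ACCESS_KEY_ID" in os.environ and
              "AWS_SECRET_ACCESS_KEY" in os.environ)
    file_ok = os.path.exists(os.path.expanduser("~/.boto"))
    return env_ok or file_ok
```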

Written by Mark Needham

January 15th, 2013 at 12:37 am

Posted in DevOps
