Archive for the ‘DevOps’ Category
On Friday Nathan and I were setting up a new virtual machine and needed a firewall rule created so that we could connect to another machine which had some JAR files we wanted to download.
We wanted to know when one of our operations team had created the rule, and I initially thought we might be able to check using telnet:
$ telnet 10.0.0.1 8081
Trying 10.0.0.1...
telnet: connect to address 10.0.0.1: Operation timed out
telnet: Unable to connect to remote host
We wanted to put a watch on the command so that it would be repeated every few seconds and indicate when we could connect to the port. However, as far as I can tell there’s no way to reduce the length of the telnet timeout, so Nathan suggested using netcat instead.
We ended up with the following command…
$ nc -v -w 1 10.0.0.1 8081
nc: connect to 10.0.0.1 port 8081 (tcp) failed: Connection refused
…which we can then wire up with watch like so:
$ watch "nc -v -w 1 10.0.0.1 8081"

Every 2.0s: nc -v -w 1 10.0.0.1 8081          Sun Jan 20 15:48:05 2013

nc: connect to 10.0.0.1 port 8081 (tcp) timed out: Operation now in progress
And then when it works:
Every 2.0s: nc -v -w 1 10.0.0.1 8081          Sun Jan 20 15:49:53 2013

Connection to 10.0.0.1 8081 port [tcp] succeeded!
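For reference, the same reachability check can be sketched in Python using a socket with a short timeout — a rough stand-in for nc -w 1, with the host and port as placeholders rather than anything from our actual setup:

```python
import socket
import time

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# Poll every couple of seconds, like `watch`, until the port opens
# (hypothetical host/port):
# while not port_open("10.0.0.1", 8081):
#     time.sleep(2)
```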
Fabric/Boto: boto.exception.NoAuthHandlerFound: No handler was ready to authenticate. 1 handlers were checked. [‘QuerySignatureV2AuthHandler’] Check your credentials
In our Fabric code we make use of Boto to connect to the EC2 API and pull back various bits of information and the first time anyone tries to use it they end up with the following stack trace:
File "/Library/Python/2.7/site-packages/fabric/main.py", line 717, in main
    *args, **kwargs
File "/Library/Python/2.7/site-packages/fabric/tasks.py", line 332, in execute
    results['<local-only>'] = task.run(*args, **new_kwargs)
File "/Library/Python/2.7/site-packages/fabric/tasks.py", line 112, in run
    return self.wrapped(*args, **kwargs)
File "/Users/mark/projects/forward-puppet/ec2.py", line 131, in running
    instances = instances_by_zones(running_instances(region, role_name))
File "/Users/mark/projects/forward-puppet/ec2.py", line 19, in running_instances
    ec2conn = ec2.connect_to_region(region)
File "/Library/Python/2.7/site-packages/boto/ec2/__init__.py", line 57, in connect_to_region
    for region in regions(**kw_params):
File "/Library/Python/2.7/site-packages/boto/ec2/__init__.py", line 39, in regions
    c = EC2Connection(**kw_params)
File "/Library/Python/2.7/site-packages/boto/ec2/connection.py", line 94, in __init__
    validate_certs=validate_certs)
File "/Library/Python/2.7/site-packages/boto/connection.py", line 936, in __init__
    validate_certs=validate_certs)
File "/Library/Python/2.7/site-packages/boto/connection.py", line 548, in __init__
    host, config, self.provider, self._required_auth_capability())
File "/Library/Python/2.7/site-packages/boto/auth.py", line 633, in get_auth_handler
    'Check your credentials' % (len(names), str(names)))
boto.exception.NoAuthHandlerFound: No handler was ready to authenticate. 1 handlers were checked. ['QuerySignatureV2AuthHandler'] Check your credentials
We haven’t told Boto about our AWS credentials and I’ve come across two ways of providing them:
As environment variables
export AWS_ACCESS_KEY_ID="aws_access_key_id"
export AWS_SECRET_ACCESS_KEY="aws_secret_access_key"
In the file ~/.boto
[Credentials]
aws_access_key_id = aws_access_key_id
aws_secret_access_key = aws_secret_access_key
And that should do the trick!
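As a rough sketch of that lookup order — environment variables first, then the [Credentials] section of ~/.boto — here it is in plain Python. This is an illustration of the two mechanisms rather than Boto’s actual resolution code, and the function name is mine:

```python
import os
from configparser import ConfigParser

def aws_credentials(boto_path="~/.boto"):
    """Resolve AWS credentials: environment variables first, then the
    [Credentials] section of a .boto-style file. Returns (key, secret)
    or (None, None) if neither source is configured."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return key, secret
    config = ConfigParser()
    config.read(os.path.expanduser(boto_path))
    if config.has_section("Credentials"):
        return (config.get("Credentials", "aws_access_key_id"),
                config.get("Credentials", "aws_secret_access_key"))
    return None, None
```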
We wanted to tail one of the log files simultaneously on 12 servers this afternoon to see whether a particular event was being logged and, rather than opening 12 SSH sessions, decided to get Fabric to help us out.
My initial attempt to do this was the following:
fab -H host1,host2,host3 -- tail -f /var/www/awesome/current/log/production.log
It works, but the problem is that by default Fabric runs the specified command on one machine after the other, so we’d actually blocked Fabric with the tail command on ‘host1’.
The output of host1’s log file will be printed to the terminal but nothing from the other two hosts.
Nathan showed me how to get around this problem by making use of Fabric’s parallel execution which we can enable with the ‘-P’ option:
fab -P --linewise -H host1,host2,host3 -- tail -f /var/www/awesome/current/log/production.log
We also used the ‘--linewise’ flag to ensure that output from the different tail processes didn’t get mixed up, although this wasn’t strictly necessary because Fabric defaults to linewise output when you’re using parallel execution mode anyway.
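The difference between the two modes can be sketched with plain Python. This isn’t how Fabric implements parallel execution internally — it’s just an illustration of running a command on every host at once so that a long-running command on the first host can’t block the rest; run_on_host is a stand-in for the real SSH call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host, command):
    # Stand-in for the real SSH call; a real version might shell out
    # to ssh or use a library such as paramiko.
    return "%s: ran %r" % (host, command)

def run_everywhere(hosts, command):
    """Run the command on all hosts concurrently, so a blocking command
    like `tail -f` on one host doesn't hold up the others."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(lambda host: run_on_host(host, command), hosts))
```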
On a side note, Paul Ingles wrote up the approach taken to make data from log files more accessible using a Kafka-driven event pipeline, but in this case we haven’t got round to wiring this data up yet, so Fabric it is for now.
The idea is to get the simplest implementation of a pipeline in place, prioritizing a fully working skeleton that stretches across the full path to production over a fully featured, final-design functionality for each stage of the pipeline.
Kief goes on to explain in detail how we can go about executing this and it reminded me of a project I worked on almost 3 years ago where we took a similar approach.
We were building an internal application for an insurance company and didn’t have any idea how difficult it was going to be to put something into production so we decided to find out on the first day of the project.
We started small – our initial goal was to work out what the process would be to get a ‘Hello world’ text file onto production hardware.
Although we were only putting a text file into production we wanted to try and make the pipeline as similar as possible to how it would actually be so we set up a script to package the text file into a ZIP file. We then wired up a continuous integration server to generate this artifact on each run of the build.
What we learnt from this initial process was how far we’d be able to automate things. We were working closely with one of the guys in the operations team and he showed us where we should deploy the artifact so that he could pick it up and put it into production.
Our next step after this was to do the same thing but this time with a web application just serving a ‘Hello world’ response from one of the end points.
This was relatively painless but we learnt some other intricacies of the process when we wanted to deploy a script that would make changes to a database.
Since these changes had to be verified by a different person they preferred it if we put the SQL scripts in a different artifact which they could pick up.
We found all these things out within the first couple of weeks which made our life much easier when we put the application live a couple of months down the line.
Although there were a few manual steps in the process I’ve described we still found the idea of driving out the path to production early a useful exercise.
Read Kief’s post for ideas about how to handle some of the problems you’ll come across when it’s all a bit more automated!
A few weeks ago Jez Humble wrote a blog post titled “There’s no such thing as a ‘DevOps team’” where he explains what DevOps is actually supposed to be about and describes a model of how developers and operations folk can work together.
Jez’s suggestion is for developers to take responsibility for the systems they create but he notes that:
[…] they need support from operations to understand how to build reliable software that can be continuously deployed to an unreliable platform that scales horizontally. They need to be able to self-service environments and deployments. They need to understand how to write testable, maintainable code. They need to know how to do packaging, deployment, and post-deployment support.
His suggestions sound reasonably similar to the way Spotify have their teams set up, whereby product teams own their product from idea to production but can get help from an operations team to make this happen.
At Spotify there is a separate operations team, but their job is not to make releases for the squads – their job is to give the squads the support they need to release code themselves; support in the form of infrastructure, scripts, and routines. They are, in a sense, “building the road to production”.
It’s an informal but effective collaboration, based on face-to-face communication rather than detailed process documentation.
On a few of the projects that I’ve worked on in the last 18 months or so we’ve tried to roughly replicate this model, but there are a few challenges in doing so.
In a number of the organisations that I’ve worked at there is a mentality that people should only take responsibility for ‘their bit’ which in this case means developers code the application and operations deploy it.
This manifests itself when you hear comments such as “it must be an application problem” when something isn’t working rather than working together to solve the problem.
There’s also a more subtle version of this where we buy into the belief that developers are only responsible for putting points on the board and therefore shouldn’t spend time doing operations-y work.
Even if we’ve got beyond the idea that people should only be responsible for their silo and have operations and developers working closely together, things can still revert to type when people are under pressure.
When a big release is coming up there’ll often be a push to ensure that the expected features have been completed and this leads us back towards the silo mentality, at least temporarily.
Presumably with a more frequent release schedule this becomes less of an issue but I haven’t worked for long enough in that way to say for sure.
In some environments there is often quite tight security around who is allowed to push into production and this would typically be folks in the operations team.
Obviously this means that the product team can’t actually push their own changes unless they arrange to work together with one of the operations folks to do so.
We still don’t have the ‘throw it over the wall’ mentality in this setup, but it does create more of a bottleneck in the system than we’d have otherwise.
These are just some of the obstacles that I’ve seen that can get in the way of our optimal setup.
I’m sure there are others that I haven’t come across yet but the nice thing is that two of these are more a mindset thing than anything else so that can be fixed over time.
On most of the projects I’ve worked on over the last couple of years we’ve made use of feature toggles to turn pending features on and off while they were still being built, but while reading Web Operations I came across another usage.
In the chapter titled ‘Dev and Ops Collaboration and Cooperation’ Paul Hammond suggests the following:
Eventually some of your infrastructure will fail in an unexpected way. When that happens, you’ll want the ability to disable just the features that rely on it, and keep the rest of the site running. Feature flags make this possible.
We’d mainly use this approach to disable peripheral functionality such as the ability to comment on a site whose main purpose is to deliver news.
From what I understand this means we’d permanently have if statements (or some equivalent) in the appropriate places in our code base which could be dynamically toggled if we start experiencing problems.
This differs slightly from the feature toggle approach we’ve used because those toggles would eventually be removed when the feature was running successfully in production.
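A minimal sketch of what such a permanent flag might look like — the flag store, feature name, and rendering function here are all hypothetical; a production version would read flags from somewhere that can be changed at runtime, such as a config service or database:

```python
class FeatureFlags:
    """In-memory flag store; in practice the values would come from
    somewhere operators can flip without a redeploy."""
    def __init__(self, **flags):
        self._flags = dict(flags)

    def enabled(self, name):
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False

def render_article(article, comments, flags):
    """Serve the core content always; include peripheral functionality
    (comments) only when its flag is on."""
    page = {"body": article}
    if flags.enabled("comments"):
        page["comments"] = comments
    return page
```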
Hammond goes on to suggest using feature flags for any external services that we rely on e.g. Flickr relies on the Yahoo address book, del.icio.us and last.fm but can gracefully disable that functionality if need be.
He also points out that it’s useful to think hard about what features are absolutely core to serving your site e.g. Flickr can disable photo uploads but still allow people to continue viewing photos.
Overall this sounds like a pretty neat idea and apart from the slight complexity of having the conditionals in the code I can’t really think of any reasons why you wouldn’t want to do it. Happy to hear opposing views though!
In the latest version of the ThoughtWorks Technology Radar one of the areas covered is ‘configuration in DNS’, a term which I first came across earlier in the year from a mailing list post by my former colleague Daniel Worthington-Bodart.
The radar describes it like so:
Application deployments often suffer from an excess of environment-specific configuration settings, including the hostnames of dependent services. Configuration in DNS is a valuable technique to reduce this complexity by using standard hostnames like ‘mail’ or ‘db’ and have DNS resolve to the correct host for that environment. This can be achieved in multiple ways, including split-horizon DNS or configuring search subdomains. Collaboration between development teams and IT operations is essential to achieve this, but that is unfortunately still difficult in some organizations.
As I alluded to in my post about creating environment agnostic machines one of the techniques that we’ve used to achieve this is configuration in DNS, whereby we use fully qualified domain names (FQDN) for services in our configuration files and have DNS resolve them.
For example for our frontend application we use the service name frontend.production.domain-name.com which resolves to a load balancer that routes requests to the appropriate machine.
Shodhan pointed out that the ‘production’ in the name is a bit misleading because it suggests the name is environment specific which isn’t the case.
We use the same name in staging as well and since the two environments are on different virtual networks the name will resolve to a different machine.
Now that we’re using this approach I was trying to remember what other ways I’ve seen environment-specific configuration handled and I can think of two other ways:
One way is to hard code IP addresses in our configuration files and then either vary those based on the environment or have the environments use separate private networks in which case we could use the same ones.
The disadvantage of this approach is that IPs can be quite difficult to remember and are easy to mistype.
An approach which is slightly better than using IPs is to make use of a specific machine’s FQDN which would then be resolved by DNS.
If the machine’s IP changes then that would be taken care of and we wouldn’t need to change our configuration file but it isn’t particularly flexible to change.
For example if we wanted to change one of our services to make it load balanced that would be much more difficult to do if we’ve hard coded machine names into configuration files than if we have something like db.backend.production which we can instead route to a load balancer’s IP.
Given that we wanted to use service oriented host names we then had to decide how those names would be resolved:
/etc/hosts
One way to do this, and the way that we’re currently using, is to store all the FQDNs in the /etc/hosts file on every machine and point those names at the appropriate machines.
In our case we’re often pointing them at a hardware load balancer but it could also be an individual machine where that makes more sense e.g. when pointing to the master in a MySQL setup.
This could become quite difficult to manage as the number of FQDNs increases, but we’re managing it through Puppet, so the most annoying thing is that it can take up to half an hour for a newly added FQDN to become resolvable across the whole stack.
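The /etc/hosts format itself is simple enough that the name-to-IP mapping can be sketched in a few lines of Python — the entries in the test below are hypothetical, in the style of the service names discussed above:

```python
def parse_hosts(text):
    """Parse /etc/hosts-style text into a {name: ip} mapping,
    ignoring blank lines and comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:  # one IP can serve several FQDNs
            mapping[name] = ip
    return mapping
```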
Internal DNS server
An alternative approach is to setup an internal DNS server which would mainly be used to service DNS requests for our own network.
The disadvantage of this approach is that the DNS server becomes a single point of failure: it’s an attractive target to compromise for anyone who manages to break into the network, as well as a potential performance bottleneck.
Of course the latter can be solved by making use of caching or making the DNS server redundant but I do like the simplicity of the /etc/hosts approach.
I’d be interested in hearing about other/better approaches to solving the configuration problem if people know of any!
On my current project we’ve been setting up production and staging environments and Shodhan came up with the idea of making staging and production identical to the point that a machine wouldn’t even know what environment it was in.
Identical in this sense means:
- Puppet doesn’t know which environment the machine is in. Our Facter variables suggest the environment is production.
- We set the RACK_ENV variable to production so applications don’t know what environment they’re in.
- The IPs and host names of equivalent machines in production/staging are identical
The only difference is the external IPs used to access the machines, and therefore the NATed address each environment presents to the world when making outgoing requests.
The only place where we do something different based on an environment is when deploying applications from Jenkins.
At that stage if there needs to be a different configuration file or initializer depending on the environment we’ll have it placed in the correct location.
Why are we doing this?
The benefit of doing this is that we can have a much higher degree of confidence that something which works in staging is also going to work in production.
Previously we’d noticed that we were inclined to write switch or if/else statements in puppet code based on the environment which meant that the machines were being configured differently depending on the environment.
There have been a few problems with this approach, most of which we’ve come up with a solution for.
Applications that rely on knowing their environment
One problem we had while doing this was that some applications did internally rely on knowing which environment they were deployed on.
For example, we have an email processing job which relies on the RACK_ENV variable and we would end up processing production emails on staging if we deployed the job there with RACK_ENV set to ‘production’.
Our temporary fix for this was to change the application’s deployment mechanism so that this job wouldn’t run in staging but the long term fix is to make it environment agnostic.
Addressing machines from outside the network
We sometimes need to SSH into machines via a jumpbox and this becomes more difficult now that machines in production and staging have identical IPs and host names.
We got around this with some cleverness to rewrite the hostname if it ended in staging. The ~/.ssh/config file reads like this:
Host *.staging
  IdentityFile ~/.ssh/id_rsa
  ProxyCommand sh -c "/usr/bin/ssh_staging %h %p"
And in /usr/bin/ssh_staging we have the following:
HOSTNAME=`echo $1 | sed s/staging/production/`
ssh staging-jumpbox1 nc $HOSTNAME $2
If we were to run ‘ssh some-box.staging’, that would SSH us onto the staging jumpbox and then proxy an SSH connection onto some-box.production using netcat.
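The hostname rewrite the script performs can be sketched in Python. Note one small difference from the shell version: unlike the sed expression (which has no /g flag), str.replace substitutes every occurrence, which makes no difference for names like these:

```python
def proxy_command(hostname, port):
    """Build the jumpbox proxy command for a .staging hostname,
    mirroring the rewrite done by /usr/bin/ssh_staging."""
    target = hostname.replace("staging", "production")
    return "ssh staging-jumpbox1 nc %s %s" % (target, port)
```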
Since web facing applications in both staging and production are referred to by the same fully qualified domain name (FQDN) we need to update our /etc/hosts file to access the staging version.
When SSHing you can’t tell which environment a machine is in
One problem with having the machines not know their environment is that we couldn’t tell whether or not we’d SSH’d into a staging or production machine by looking at our command prompt since they have identical host names.
We ended up defining PS1 based on the public IP address of the machine, which we found out by calling icanhazip.com.
Overall, despite the fact that doing this was initially painful – and Shodhan gamely took most of that pain – I think it makes sense as an idea and I’d probably want to have it baked in from the beginning on anything I work on in future.
I’ve been playing around with Chef Solo on Fedora and executing the following:
sudo chef-solo -c config/solo.rb -j config/node.json
leads to the following error:
...
ERROR: Running exception handlers
ERROR: Exception handlers complete
FATAL: Stacktrace dumped to /home/mark/chef-solo/chef-stacktrace.out
FATAL: ArgumentError: Attribute domain is not defined!
A bit of googling led me to believe that this error happens because the machine doesn’t have a fully qualified domain name (FQDN) defined, which we can confirm with the following command:
$ hostname -f
hostname: Name of service not known
One way to fix it is to add an entry to /etc/hosts mapping the machine’s hostname to an IP address, after which the script runs with no errors:
...
INFO: Running report handlers
INFO: Report handlers complete
A suggestion I read while googling about fqdn was to add the hostname of the machine into a file called /etc/HOSTNAME but that didn’t seem to have any impact for me.
On the Mac, hostname -f works fine even without such an entry in /etc/hosts, so I’m not entirely sure how it all works!
If anyone could explain it to me that’d be awesome.
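For what it’s worth, a rough Python-side equivalent of hostname -f is socket.getfqdn(), which asks the resolver (consulting /etc/hosts among other sources) for a fully qualified name and falls back to the bare hostname when resolution fails rather than erroring — which may be part of why different machines behave differently:

```python
import socket

def machine_fqdn():
    """Roughly what `hostname -f` reports: the machine's fully
    qualified name, or just the hostname if it can't be resolved."""
    return socket.getfqdn()
```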
I’ve been flicking through Continuous Deployment and one section early on about changing configuration information in our applications particularly caught my eye:
In our experience, it is an enduring myth that configuration information is somehow less risky to change than source code. Our bet is that, given access to both, we can stop your system at least as easily by changing the configuration as by changing the source code.
In many organisations where I’ve worked this is generally adhered to except when it comes to configuration which is controlled from the database!
If there was a change to be made to source code or configuration on the file system then the application would go through a series of regression tests (often manual) to ensure that the application would still work after the change.
If it was a database change then it would just be made, and there would be no such process.
Making a change to database configuration is pretty much the same as making any other change and if we don’t treat it the same way then we can run into all kinds of trouble as the authors point out:
If we change the source code, there are a variety of ways in which we are protected from ourselves; the compiler will rule out nonsense, and automated tests should catch most other errors. On the other hand, most configuration information is free-form and untested.
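One way to give free-form configuration some of the protection that the compiler and tests give source code is to assert things about it before (or just after) a release. A sketch with hypothetical setting names — the point is the fail-fast checks, not these particular rules:

```python
from urllib.parse import urlparse

REQUIRED_KEYS = {"database_url", "smtp_host"}  # hypothetical settings

def validate_config(config):
    """Return a list of problems with the configuration, so a deploy
    can fail fast instead of surfacing the error in production."""
    errors = []
    for key in REQUIRED_KEYS - set(config):
        errors.append("missing setting: %s" % key)
    url = config.get("database_url", "")
    if url and not urlparse(url).scheme:
        errors.append("database_url has no scheme: %r" % url)
    return errors
```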
Alex recently wrote about his use of deployment smoke tests – another suggestion of the authors – to ensure that we don’t break our application by making configuration changes.
Organisations often have painful processes for releasing software, but I think it makes more sense to try and fix those rather than circumventing them and potentially making our application behave unexpectedly in production.