Operations is a development concern 2009-01-19


Over the years I've alternated between development and operations several times, often with severely blurred lines (read: a startup where hiring ops people was far down the list of priorities, as well as Yahoo, where application ops work was largely done by developers). Over the last two years the operational side has been what I've focused on, largely because it's been treated as an unloved step-child and has really needed my attention most at the places I've worked.

This post explores a lot of things that many people will find obvious (at least I hope so), but that I wanted to write about because I keep seeing people do this badly over and over again, even though the underlying principles I'm basing this post on are not new. It's also a sort of set of delayed New Year's resolutions, in the sense that a few of these are things I'm not being as disciplined about myself as I want to be, but that I intend to increasingly enforce at my day job.

Overview

Operations is often kept separate from application development for good reason: Fundamentally different mindsets are needed for many types of tasks.
For example, developers have a tendency to want wide-ranging access, and often complain loudly when they don't get root on a box or have some other restriction placed on them (I've had more than one developer insist he couldn't do his job without root access everywhere), but people with operations experience know how each extra account means one more person who can cause trouble (whether intentionally or not). Developers are used to lax discipline with changes when testing things, while a well maintained production system requires traceability and documentation of changes, not least so you know you can reproduce the system state.

I've sinned as much as most people who start out in a development capacity. Every point on this list is a reaction to things I've had first hand experience of doing badly at some point or other.

My new personal mantra is that for web services the entire system is a joint development and operations concern, but that instead of letting developers loose on the production system, operations people should be let loose on development. Not writing application code, but helping put in place an infrastructure that treats the whole system the way we typically build applications, and remove or reduce the reasons for development staff to need (or want) to touch the live system.

This should be obvious, but for some reason I keep seeing projects where operational concerns are first addressed when development is ready to hand things over.

As part of the development process I want to:

- develop a full set of operational tools
- write tests for the operational tools and the server images
- create a test suite of the tests that are suitable to run on the live system
- ensure there is an appropriate way of configuring the running app without manual changes
- write scripts to create the server images
- wipe and recreate the whole QA system, and QA the whole thing, not just "the application"

Many shops do get most of this right, but most of the development teams I've worked with or observed from the outside lack experience with actually maintaining a large running system, and so lack knowledge of a lot of operational issues.

As a result, in most places it's still very much hit and miss which parts the development team successfully addresses. Operations is not "sexy"; it is often treated as an afterthought that the operations people can fiddle with, and often something to be dealt with after launch.

A wide variety of tools exists to support these concerns, but bringing it all together requires planning, and far too often that work gets left until late in the development process (or until after development is "complete"). Instead of treating the service as a whole as the deliverable from engineering, engineering tends to deliver a set of components that excludes the OS and standard system services, as well as some or all of the tools required for efficient operational management.

It's like a car manufacturer handing you the custom-manufactured parts of a car in pieces, along with a likely incomplete list of standard parts for you to source from elsewhere.

Let's look at these concerns one by one.

Develop a full set of operational tools

It's my belief that if keeping an application running requires additional custom tools that don't exist, then the development team have delivered an incomplete solution. Yet development teams keep delivering solutions that need expensive babysitting (time is money - operations staff isn't free, and neither is ongoing support from the developers) to keep working.

I've been guilty of this more than once, and I see it happen all the time - it's usually the first corner to be cut when time gets tight. Specifically, if operations staff need to write tools to monitor the application, or probes to plug in to the monitoring system, then something is wrong.

While it is important to get operations input and involvement in what is being monitored, leaving this to operations alone means that people who do not know the internals of the application are trying to ascertain how the application may fail and what is "unusual" behavior. An important symptom of this is "alert fatigue", where monitoring ends up generating so many false alerts that people learn to ignore them completely (do you filter your Nagios alerts to a separate folder you only check occasionally? Not good). But it can also lead to monitoring that is too limited, where whole failure scenarios are ignored because that is easier than figuring out what actually constitutes an error.

Both are equally bad. Both lead to unnecessary downtime, or worse (ongoing data corruption, for example).

It is easy to skimp on this, because users (including management) often won't notice that things are missing or broken until the actual service runs into a major problem and you don't catch it, or don't have the tools to fix it. It's also easy to pretend that you can avoid thinking about this until after launch, when "things calm down". But it's at crunch times like a major new release that you need proper operational tools the most, because that is when you're likely to run into unforeseen problems and would benefit from having the monitoring and maintenance tools in place.

At the same time, if planned in advance and done properly, the effort required can be significantly reduced by taking the operational needs into account when designing the overall system, which brings me to my next two points:

Write tests for the operational tools and the server images

A probe for a monitoring tool like Nagios is essentially a test that is executed regularly in a production environment. You are going to need probes. Lots of them. If you plan ahead and actually ensure the people who do development and operations talk to each other, a significant number of probes can have the dual function of driving a test case.

Take a probe that checks that you can log in to the application. If it fails, login doesn't work. Which means you can slot it in as an integration test / functional test.

But you also want to make sure your probes will actually detect failure. So you run the test "the other way": set up the system so that login is expected to fail (and then so that it is expected to succeed), check that the probe reports the right result each time, and you also have a test for your login probe.

In addition to ensuring you have proper operational tools, you are also ensuring they are suitably tested. And if you are writing these tests anyway, you get part of it for "free". You are writing tests, are you not?
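
To make this concrete, here is a rough sketch (not code from any system described in this post) of what such a dual-purpose check might look like in Python. The login URL, credentials and form field names are made-up placeholders; the same function drives a couple of functional tests, and the command-line entry point follows the usual Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL.

    # check_login.py - sketch of a probe that doubles as a functional test.
    # The URL, form fields and credentials below are hypothetical placeholders.
    import sys
    import urllib.parse
    import urllib.request

    LOGIN_URL = "https://example.com/login"   # hypothetical login endpoint

    def login_works(username, password):
        """Return True if the application accepts this login."""
        data = urllib.parse.urlencode({"user": username, "pass": password}).encode()
        try:
            with urllib.request.urlopen(LOGIN_URL, data, timeout=10) as resp:
                return resp.status == 200
        except Exception:
            return False

    # Run under a test runner such as pytest, these are functional tests...
    def test_login_succeeds_with_valid_credentials():
        assert login_works("testuser", "testpass")

    # ...including the "other way around" test that proves the probe detects failure.
    def test_login_fails_with_bad_credentials():
        assert not login_works("testuser", "not-the-password")

    # Run from the command line, the same check is a Nagios-style probe.
    if __name__ == "__main__":
        if login_works(sys.argv[1], sys.argv[2]):
            print("LOGIN OK")
            sys.exit(0)
        print("LOGIN CRITICAL - could not log in")
        sys.exit(2)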

But you should also test properties of the server images as a whole: What starts failing when you run out of memory? What starts failing when the disk is full? Are logs correctly rotated so you're not facing the eternal grind of reclaiming space?

Those latter cases are ignored so often I get a headache just thinking about it. They are simple to fix, but fixing them in the images before launch, and writing tests to find them, is infinitely preferable to running into them in production and having to update lots of systems without accidentally disrupting anything - writing test cases that exercise your complete images can save you endless pain after deployment. Too often this is skipped because the "environment" rarely becomes an issue in development and test/QA setups unless it is exercised on purpose, and developers are very good at ignoring it and focusing on testing application functionality instead.
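
As a minimal sketch of what a couple of image-level tests might look like - assuming a Debian-style logrotate setup and a hypothetical application log directory, neither of which comes from this post - something this small already covers the log rotation and disk space cases:

    # test_image.py - sketch of whole-image tests. The application log
    # directory is a hypothetical placeholder; adjust to the actual image.
    import glob
    import os

    APP_LOG_DIR = "/var/log/myapp"      # hypothetical application log location
    LOGROTATE_DIR = "/etc/logrotate.d"

    def test_app_logs_are_covered_by_logrotate():
        """Fail if no logrotate config file mentions the application's logs."""
        covered = any(
            APP_LOG_DIR in open(path).read()
            for path in glob.glob(os.path.join(LOGROTATE_DIR, "*"))
            if os.path.isfile(path)
        )
        assert covered, "no logrotate entry found for %s" % APP_LOG_DIR

    def test_root_filesystem_has_headroom():
        """Fail early if the image already ships with the disk nearly full."""
        stat = os.statvfs("/")
        free_fraction = stat.f_bavail / float(stat.f_blocks)
        assert free_fraction > 0.2, "less than 20% of the root filesystem is free"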

Create a test suite of the tests that are suitable to run on the live system

If you are writing a ton of tests anyway, how much pain is it to identify which of them can safely run on the live system?

Not very much.

Ideally you'd turn all of them into probes for your monitoring system, and make that your "test suite", but that may not always be appropriate depending on your monitoring system (most of them have strict timeouts for probes to ensure they don't start lagging behind). What you can always do is wrap any tests that don't "fit well" with your monitoring system in a test suite that is scheduled to run regularly, and use a probe to check the status of the last run so you can alert on it.
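
One way to wire that up - sketched here on the assumption that the scheduled suite writes "PASS" or "FAIL" to a status file when it finishes; the path and the freshness window are arbitrary choices - is a probe that just checks the result and the age of the last run:

    # check_last_test_run.py - sketch of a probe for a scheduled test suite.
    # Assumes the suite writes "PASS" or "FAIL" to STATUS_FILE when it finishes;
    # the path and the 26-hour window are arbitrary, illustrative choices.
    import os
    import sys
    import time

    STATUS_FILE = "/var/run/nightly-tests.status"   # hypothetical
    MAX_AGE_SECONDS = 26 * 3600                     # daily run, plus some slack

    def main():
        try:
            age = time.time() - os.path.getmtime(STATUS_FILE)
            result = open(STATUS_FILE).read().strip()
        except OSError:
            print("TESTSUITE CRITICAL - no status file found")
            return 2
        if age > MAX_AGE_SECONDS:
            print("TESTSUITE CRITICAL - last run is %d hours old" % (age // 3600))
            return 2
        if result != "PASS":
            print("TESTSUITE CRITICAL - last run reported: %s" % result)
            return 2
        print("TESTSUITE OK - last run passed")
        return 0

    if __name__ == "__main__":
        sys.exit(main())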

What matters is that whether you write test cases only for the application or for the whole system, you will have a bunch of tests that verify, or attempt to verify, the correct functioning of the system or parts of it. For any such test that does not degrade the system (you obviously won't include load tests), cause integrity problems (you won't include tests that mess with the database), or expose lots of test data, it's a shame not to leverage it to ensure that the system continues to work as expected after deployment.

Keeping this goal in mind also helps you structure your tests so that it is easy to run them on the live system (making it easy to strip out mocks and turn off any preloading of data into the database, for example). Doing so has the potential to significantly reduce the cost and effort involved in writing probes, by largely making them a differently packaged version of part of your test code.
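
A tiny sketch of that kind of structure - the LIVE_SYSTEM flag and the seed_test_data helper are hypothetical names used only for illustration, not part of any particular framework - might be no more than:

    # conftest-style sketch: keep fixtures and test data loading out of the way
    # when the same tests run against the live system. LIVE_SYSTEM and
    # seed_test_data are hypothetical names used only for illustration.
    import os

    RUNNING_LIVE = os.environ.get("LIVE_SYSTEM") == "1"

    def setup_module(module):
        if RUNNING_LIVE:
            return                  # never preload data into the live database
        seed_test_data()            # only seed fixtures in dev/QA environments

    def seed_test_data():
        """Load a known data set into the test database (dev/QA only)."""
        pass                        # left as a stub in this sketch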

Knowing what your operations people care about helps you write appropriate tests; knowing what the developers consider failures and what they consider normal helps the operations people address and/or escalate the right things.

Ensure there is an appropriate way of configuring the running app without manual changes

How many servers do you have to manage before manually editing config files becomes a threat to the stability and scalability of your service and your ability to recover from problems? One.

If you want the ability to test a full system, you need to be able to reproduce it. To be able to reproduce it, you either need to take a copy or be able to recreate it from scratch. Being able to recreate it from scratch buys you the important property of being able to cleanly create a new image with significant changes. Manual changes massively complicate this process, not least because they increase the odds that something will be forgotten. And as the number of servers increases, manual changes lessen your ability to ensure the servers stay in sync, or that further site-wide changes won't break some of them.

I've personally spent well over a hundred hours "reverse engineering" and documenting the installation process for a system I was part of building a couple of years back. It gets immensely painful as the system grows and accretes additional dependencies without anyone regularly verifying that it can still be rebuilt from scratch. It was a painful experience - one I never want to repeat.

I would like to think that people keep all their source versioned, and avoid manual changes on their live systems or, as an absolute worst case, version things "after the fact" if pressured into making manual changes during an outage / emergency (rarely an excuse - how long does it take to commit a change?). I would also like to think this was the case for config files and other parts of the OS install and third party applications, but I know this is not the case in a lot of places.

There is really no excuse for not being able to reproduce the configuration changes for a system from scratch. Even more so, there is really no reason to manually change config files when you have a proper system in place, and there is a wide variety of tools, such as Puppet, that allow you to automate this process. Another option is to package all changes up in testable packages, using your preferred OS or distribution's package management support. Worst case, a script that pulls the files from version control and uses scp/ssh to push them out to a number of hosts and keep them in sync is trivial to create.
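
As a crude sketch of that worst case (a real configuration management tool such as Puppet is still the better option), pushing a version-controlled config tree out to a set of hosts might look something like this; the host names and target directory are made up:

    # push_config.py - crude sketch of the "worst case" approach: sync a
    # version-controlled config tree to each host with rsync over ssh.
    # Host names and paths are hypothetical; prefer a proper tool like Puppet.
    import subprocess
    import sys

    HOSTS = ["web1.example.com", "web2.example.com"]   # made-up host names
    CONFIG_DIR = "configs/etc/myapp/"   # checked-out copy from version control
    REMOTE_TARGET = "/etc/myapp/"       # hypothetical application config dir

    def main():
        failed = []
        for host in HOSTS:
            # -a preserves permissions, --checksum only transfers real changes
            status = subprocess.call(
                ["rsync", "-a", "--checksum", CONFIG_DIR,
                 "root@%s:%s" % (host, REMOTE_TARGET)])
            if status != 0:
                failed.append(host)
        if failed:
            print("failed to sync: %s" % ", ".join(failed))
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())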

After messing this up myself - and I know how tempting it can be to think "I'll just make that quick little change and check it in later" - I've resorted to writing scripts to compare my virtual server images. I have them compare my template image with a backup of a running system and report any differences, such as changed config files or added/removed packages.

There are also existing tools for this, such as Tripwire, that are used far more rarely than they deserve. For my part, my main reason for writing a script was to get prettified, context-sensitive output (rather than being told that the package database has changed checksums and that lots of files have been added, for example, I'd rather be told that package XYZ was added).
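
A stripped-down sketch of that kind of comparison - assuming Debian-style image roots whose package state dpkg-query can read, and ignoring the config file side entirely - might be as simple as:

    # compare_images.py - stripped-down sketch: report packages added to or
    # removed from a backup of a running system relative to the template image.
    # Assumes Debian-style image roots with a dpkg database under /var/lib/dpkg.
    import subprocess
    import sys

    def package_set(image_root):
        """Return the set of package names recorded in an image root's dpkg database."""
        out = subprocess.check_output(
            ["dpkg-query", "--admindir=%s/var/lib/dpkg" % image_root,
             "-W", "--showformat=${Package}\\n"])
        return set(out.decode().split())

    def main(template_root, backup_root):
        template, backup = package_set(template_root), package_set(backup_root)
        for pkg in sorted(backup - template):
            print("added:   %s" % pkg)
        for pkg in sorted(template - backup):
            print("removed: %s" % pkg)

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])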

I'll consider open sourcing some of those scripts (they're pretty trivial) if there's any interest.  

Another reason to do this is so that I can create a development image, make modifications to it and compare it to a "clean" image to get a full list of what I changed - it's saved me days of taking detailed notes while experimenting with config changes. Having tools that do integrity checking in place is invaluable: sooner or later you will forget and make a manual change, or someone will hack your system and help themselves - the worst hackers are not the ones that deface your homepage for the cred, but the ones that quietly corrupt your backups by changing your backup script, or silently steal data over time; the ones you don't notice at a glance. Proper tools will significantly increase your chances of picking this up, and can also help you roll any genuine changes back into your standard images.

Write scripts to create the server images

I've seen a number of places that are great about creating servers from images, but where the images are manually created. I hinted at why I don't like this in the section above, but let's make it explicit:

If you build the server images with a script, you are set for continuous integration not just of the app, but of the whole environment including a script that documents a "from scratch" installation of the OS.

When you decide to make major changes (say, upgrading to the next major version of Apache), you update your list of required packages, and the next build will tell you immediately whether it works; you can additionally guarantee, trivially, that there are no unversioned, unpackaged changes.

You also effectively have documentation of what is required: a list of packages, a set of patches or replacements for config files, and a number of actions that need to be carried out. If the time comes to change OS or distribution, the script serves as documentation of what you have actually been using. Rather than dredging through an old filesystem to determine how you customized the base version of the OS, it's all sitting there in a neat little script.
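
As a toy sketch of the idea - the package list, config overlay and target path here are hypothetical, and a real build script needs rather more setup and error handling than this - the skeleton can be as simple as:

    # build_image.py - toy sketch of a scripted "from scratch" image build.
    # The target path, suite, package list and config overlay are hypothetical;
    # a real script also needs things like apt-get update, resolv.conf, cleanup.
    import subprocess

    TARGET = "/var/lib/images/webserver"        # where the new image root goes
    SUITE = "stable"                            # distribution release to install
    PACKAGES = ["apache2", "postgresql-client", "rsyslog"]   # required packages
    CONFIG_OVERLAY = "image-config/webserver/"  # version-controlled overrides

    def run(cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    def build():
        # 1. Base OS install from scratch - this step doubles as documentation
        #    of exactly what the image starts out as.
        run(["debootstrap", SUITE, TARGET])
        # 2. Required packages, from a versioned list.
        run(["chroot", TARGET, "apt-get", "install", "-y"] + PACKAGES)
        # 3. Config overrides, kept in version control alongside this script.
        run(["rsync", "-a", CONFIG_OVERLAY, TARGET + "/etc/"])

    if __name__ == "__main__":
        build()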

At my day job at Aardvark Media we've been transitioning to this approach for our customer sites, and it's a world of difference from manually managing servers or managing images. When we need a customized image for a specific client, it's a matter of updating a script that's anything from 5 to a couple of hundred lines (plus some generic boilerplate functions), depending on the server functions required over and above what is part of our base image (which is also created with a script), rather than customizing a base image by hand and starting over, or cleaning up as best we can, if we mess something up.

Upgrades, say of Postgres, happen the same way: I copy a build script, make a change, build a new image and test it using OpenVZ. Increasingly we also deploy using OpenVZ.

Wipe and recreate the whole QA system and QA the whole thing, not just "the application"

I've previously advocated rebuilding the build server on each build. Likewise, rebuilding the QA system regularly using scripts as described above, or alternatively using the comparison tools mentioned earlier to validate the QA system against production, ensures that the QA system truly is in sync with the production system. If you are QA'ing on a system that is manually maintained, or not deliberately validated against production, you are testing things blind. It's like replacing components in your app with different versions than you'll deploy with and still expecting the bugs to be the same.

The reason I prefer the idea of recreating the QA system from scratch, rather than syncing it manually, is that I know how easy it is for it to get out of sync, and how easy it can be to set things up to rebuild it from scratch... I'll admit to not being as disciplined as I'd like about this one. Working on it.


... in closing

Making the whole end-to-end operational support a development concern goes a long way towards creating a system that is designed not just to be easy to use, but to be easy to operate. Reducing operational complexity translates directly into spending fewer operational resources on maintaining the system, helping you either cut costs or free up resources for more important tasks. While some extra resources might be spent upfront, operations costs you time and money for the entire time the system is in service, whether or not you make further feature upgrades.

Automating more operational tasks up front, and including operations staff from the start of design and development, also aids in the creation of a system where errors and problems are caught early, and reduces the amount of time that development staff will need to spend handling escalated operational issues - for a complex enough system, it is perfectly possible to recoup the initial added development overhead just in reduced escalations within months.
