Posted by benji under ContinuousDelivery, XP.

Can you afford not to do continuous deployment?

Continuous deployment is the practice of regularly (more than daily) deploying updated software to production.

Arguments in favour of continuous deployment often focus on how it enables us to continually, regularly, and rapidly deliver value to the business, allowing us to move fast. It’s also often noted that it reduces release risk by making deployments an everyday event: smaller, less risky changes, delivered by a fully automated process.

I want to consider another reason.

Unforeseen Production Issues

It can be tempting to reduce the frequency of deployments in response to risk. If a deployment with a bug could result in losing a significant amount of money or catastrophic reputational damage, it’s tempting to shy away from the risk and deploy less often. Why not plan to release once a month instead of every day? Then we’re only taking the risk monthly.

Let’s leave aside the many reasons why doing things less often is unlikely to reduce the risk.

There will always be things that can break your system in production that are outside your control and are unlikely to be caught by your pre-deployment testing. If you hit one of these issues, you will need to perform a deployment to fix it.

Time-sensitive Bugs

How often do we run into bugs caused by failing to anticipate some detail of times and dates? We hear stories of systems breaking on 29/02 nearly every leap year. Have you anticipated backwards time corrections from NTP? Are you sure you handle leap seconds? What about every framework and third-party service you are using?

Yes, most of these can be tested for in advance with sufficiently rigorous testing, but we always miss test cases.
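
To make that concrete, here is a minimal sketch in Python (the billing rule is hypothetical, chosen purely for illustration) of the kind of edge case that tends to stay out of a test suite until a leap day finds it for you:

```python
import unittest
from datetime import date

def next_billing_date(from_date):
    # Hypothetical billing rule: same day next year, falling back to
    # 28 February when the anniversary of a leap day does not exist.
    try:
        return from_date.replace(year=from_date.year + 1)
    except ValueError:
        return from_date.replace(year=from_date.year + 1, day=28)

class LeapDayTest(unittest.TestCase):
    def test_anniversary_of_a_leap_day(self):
        # The ordinary case passes whether or not you thought about leap days...
        self.assertEqual(date(2024, 3, 1), next_billing_date(date(2023, 3, 1)))
        # ...this one only passes because the fallback above exists.
        self.assertEqual(date(2025, 2, 28), next_billing_date(date(2024, 2, 29)))

if __name__ == "__main__":
    unittest.main()
```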

The passage of time means we cannot be sure that software which worked when it was deployed will continue to work.

Capacity Problems

If your production system is overloaded due to unforeseen bottlenecks, you may need to fix them and re-deploy. Even if you do capacity planning, you may find the software needs to scale in ways that you did not anticipate when you wrote your capacity tests.

When you find an unexpected bottleneck in a production system, you’re going to need to make a change and deploy. How quickly depends on how much early warning your monitoring software gave you.
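
As a sketch of what useful early warning could look like (the metric and thresholds here are illustrative assumptions, not recommendations from any particular tool), alert on shrinking headroom rather than on outright failure, so there is still time to make a change and deploy it:

```python
# Minimal sketch of an early-warning capacity check. CAPACITY_RPS is whatever
# throughput your capacity tests demonstrated; the 70% ratio is an
# illustrative choice.
CAPACITY_RPS = 500
EARLY_WARNING_RATIO = 0.7

def capacity_alerts(current_rps):
    if current_rps >= CAPACITY_RPS:
        yield "CRITICAL: at or beyond tested capacity"
    elif current_rps >= CAPACITY_RPS * EARLY_WARNING_RATIO:
        yield "WARNING: approaching tested capacity, investigate and deploy a fix now"

print(list(capacity_alerts(current_rps=380)))
# ['WARNING: approaching tested capacity, investigate and deploy a fix now']
```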

Third Party Dependencies

When a service you depend on goes down for an extended period of time, you suddenly need to change your system so that it no longer relies on it.
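
A common shape for that kind of emergency change, sketched here with a hypothetical recommendations service, is to wrap the dependency in a timeout and degrade gracefully when it is unavailable:

```python
import urllib.request

# Sketch only: the URL and the empty-list fallback are hypothetical,
# standing in for whatever third party your system leans on.
RECOMMENDATIONS_URL = "https://thirdparty.example.com/recommendations"

def recommendations_for(user_id):
    try:
        with urllib.request.urlopen(
                f"{RECOMMENDATIONS_URL}?user={user_id}", timeout=2) as response:
            return response.read()
    except OSError:
        # Degrade gracefully instead of failing the whole page while the
        # third party is down: exactly the kind of change you may find
        # yourself writing and deploying at short notice.
        return b"[]"
```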

Infrastructure

Do you handle all the infrastructure-related issues that can go wrong? What happens if DNS is intermittently returning incorrect results? What if there’s a routing problem between two services in your system?

Third Party Influence

“Noisy neighbours” in virtualised or shared hosting. Third-party JavaScript on your page overwriting your variables or event listeners. In most environments you can be affected by others over whom you have no control.

Response

Monitor Monitor Monitor

We cannot foresee all production issues and guard against them in our deployment pipeline. Even for those we could foresee, in many cases the benefit would not warrant the cost.

Any of the above factors (and more) can cause your system to stop meeting its acceptance criteria at some point after it has been deployed.

The first response to this fact is that we need to monitor everything important in production. This could mean running your automated acceptance tests against your production environment. You can do this both for "feature" acceptance tests and for "non-functional" ones such as capacity tests.

If you have end-to-end feature tests for your system, why not run them against your production app? If you’re worried about cluttering production with fake test data, you can build in a cleanup mechanism, or a means of excluding your test data from affecting real users.
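
For example, a minimal sketch of such a production feature test (every URL, path, and field name here is a hypothetical stand-in for your own API):

```python
import json
import unittest
import uuid
from urllib.request import Request, urlopen

# All endpoints and fields below are hypothetical; the pattern is what
# matters: tag synthetic data so production reporting and fulfilment can
# exclude it, then clean it up afterwards.
BASE_URL = "https://shop.example.com/api"
TEST_TAG = "synthetic-check"

def post(path, payload):
    request = Request(f"{BASE_URL}{path}",
                      data=json.dumps(payload).encode(),
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)

class PlaceOrderInProduction(unittest.TestCase):
    def test_tagged_customer_can_place_an_order(self):
        # A unique, tagged customer so real users never see this data
        # and so cleanup can find it afterwards.
        customer = post("/customers", {"name": f"test-{uuid.uuid4()}",
                                       "tags": [TEST_TAG]})
        try:
            order = post("/orders", {"customer": customer["id"], "item": "widget"})
            self.assertEqual("ACCEPTED", order["status"])
        finally:
            # The built-in cleanup mechanism suggested above.
            post(f"/customers/{customer['id']}/delete", {})

if __name__ == "__main__":
    unittest.main()
```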

If you have capacity tests for your system, why not run them against your production app? If you’re worried about the risk that it won’t cope, at least you can control when the capacity tests run. Wouldn’t you rather your system failed during working hours due to a capacity test which you could simply turn off, rather than failing in the middle of the night due to a real spike in traffic?
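
Again as a rough sketch, with a hypothetical endpoint and kill-switch file: a capacity test that runs against production on a schedule you choose, and that you can switch off the moment it causes trouble:

```python
import os
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target and kill switch. The point is that *you* choose when
# this load arrives, and you can stop it instantly, unlike a real traffic
# spike in the middle of the night.
TARGET_URL = "https://shop.example.com/health"
KILL_SWITCH = "/etc/capacity-test/disabled"   # create this file to stop the run
REQUESTS_PER_SECOND = 50
DURATION_SECONDS = 300

def hit_target(_):
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5) as response:
            return response.status
    except OSError:
        return None  # a failed or timed-out request

def run_capacity_test():
    deadline = time.monotonic() + DURATION_SECONDS
    with ThreadPoolExecutor(max_workers=REQUESTS_PER_SECOND) as pool:
        while time.monotonic() < deadline and not os.path.exists(KILL_SWITCH):
            statuses = list(pool.map(hit_target, range(REQUESTS_PER_SECOND)))
            print(f"sent {len(statuses)} requests, {statuses.count(None)} failed")
            time.sleep(1)

if __name__ == "__main__":
    run_capacity_test()
```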

Be ready to deploy at any point

Tying this back to continuous delivery, I think our second response should be to remain ready to deploy at any point.

Some of these potential production issues you can prevent with sufficient testing, but will you have thought of every possible case? Some can be mitigated in production by modifying infrastructure and/or application configuration, but:

  1. Should you be modifying those independently of the application? They’re just as likely to break your production system as code changes are.
  2. Not everything will be (or should be) configurable.

At the point you find yourself needing to make an emergency bug fix in production, and you have not been practising continuous delivery, you have two options.

  1. You pull up the branch/tag of code last deployed to production a few weeks ago, make your changes, and deploy.
  2. You risk deploying what you have now: weeks of changes which have not been deployed, any of which may introduce new, subtle issues.

You may not even currently be in an integrable state. If you were not expecting to deploy, why would you be? If you’re not, then you need to:

  1. Identify quickly what code is actually in production (see the sketch after this list).
  2. Context-switch back to the code-base as it was then. Your mental model of the code will be as it is now, not as it was a few weeks ago.
  3. Create re-integration pain for yourself when you have to merge your bugfix with the (potentially many) changes that have been made since. This is especially fun if you have performed significant refactorings in the meantime.
  4. Perform a risky release. You haven’t deployed for a few weeks, and lots could have broken since then. Are your acceptance tests still working? Or do they have some of the same problems as your production environment? If you have fixed a time-sensitive bug in your tests since your last deploy, you will now have to back-port that fix to the branch you are making your changes on.
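
On point 1: one way to make “what is actually in production?” a five-second lookup rather than an investigation is to stamp every build with its commit and expose it over HTTP. A minimal sketch, assuming a Flask app and environment variables set by the build pipeline (both are assumptions for illustration, not details from this post):

```python
import os
from flask import Flask, jsonify

app = Flask(__name__)

# The build pipeline is assumed to bake these in, e.g.
#   GIT_COMMIT=$(git rev-parse HEAD)  BUILD_TIME=$(date -u +%FT%TZ)
@app.route("/version")
def version():
    return jsonify(
        commit=os.environ.get("GIT_COMMIT", "unknown"),
        built_at=os.environ.get("BUILD_TIME", "unknown"),
    )
```

Checking out the commit it reports gets you straight back to the code that is actually running, with no archaeology through deployment logs.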

Summary

We’ll never be able to foresee everything that will happen in production. It’s even more important to be able to notice and react quickly to production problems than it is to stop broken software getting into production.
