People screw up.
Even simple tasks eventually get done wrong. Your star engineer will have a bad day, forget to deploy to one of your servers, and cause a big problem.
In some industries an error is tolerable. Your burger didn't have enough ketchup, or the bun was crushed; most people won't care.
An error in engineering is costly: people and dogs can die. Software is no exception. In software the cost mostly shows up as lost revenue, lost customer trust, and wasted engineering time. Nobody wants to be the person who has to fix it at 2am.
Something went wrong
Whenever something goes wrong, it's a good idea to find the root cause. The "Five Whys" is a great way to do just that. Often, the issue was that someone screwed up. They forgot to test a piece of code, or they forgot to configure it correctly for production.
Then someone says, "well, let's make a checklist so our engineers don't mess up!" This is not the right solution.
Let's say there is an engineer who is in charge of upgrading the live site. To him this might just be a matter of running a few commands with the right build once every few weeks.
In this situation he can make all kinds of errors:
- He can type in the wrong build and realize it only after the fact
- He can type in the wrong build and not realize it
- He can type in the right build but it has the wrong configuration
- He can type in the right build with the right config, but it's released at the wrong time
A process-centric approach would suggest solutions like the following, all of which introduce new problems:
- The engineer emails the build and commands for others to verify
- The engineer should always consult the documentation before making any change
- The engineer should always double check the configuration
But these can easily go wrong:
- This assumes the other employees don't make mistakes too, or don't eventually start rubber-stamping "OK" without looking. It also adds delay when the reviewers aren't available.
- The engineer doesn't read the documentation, or it's hard to follow, out of date, or, even worse, wrong.
- Production went down at 2am on New Year's Day; the engineer was tired, forgot to check the documentation, and missed a step. The outage lasted six hours longer than necessary.
See the pattern? Every time you solve a problem with process, you can introduce another one. And the fundamental problem is still present: human error.
The ideal way of solving these problems is:
- Builds are marked good or bad depending on if the tests pass
- Deployment happens with the latest good build every so often
- Ideally, this would be automatic via continuous delivery
- Monitoring is top notch so that production outages can be detected quickly
This is why automation is great: it reduces human error and, with it, the risk.
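The first two steps above can be sketched in a few lines. This is a minimal illustration, not a real deployment system; the `Build` record, `latest_good_build`, and `deploy` names are all hypothetical, standing in for whatever your CI system and deploy tooling provide.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Build:
    version: int
    tests_passed: bool  # marked good/bad by the test suite, not by a human

def latest_good_build(builds: list[Build]) -> Optional[Build]:
    """Return the newest build whose tests passed, or None if there is none."""
    good = [b for b in builds if b.tests_passed]
    return max(good, key=lambda b: b.version) if good else None

def deploy(builds: list[Build], deploy_fn: Callable[[Build], None]) -> Build:
    """Deploy the latest good build; no human ever types a build number."""
    build = latest_good_build(builds)
    if build is None:
        raise RuntimeError("no good build available to deploy")
    deploy_fn(build)
    return build
```

The point is that the build selection, the riskiest keystroke in the earlier list of mistakes, is no longer a human decision at all.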
Guide to postmortems
After every outage, you should hold a postmortem meeting to discuss what went wrong and how to fix it.
First, identify the root cause of the problem via a method such as the Five Whys. An easy mistake is to place blame on people during the Five Whys. Don't do this: you're a team, and you're all accountable for each other.
After you've identified the problem, it's time for solutions. Sometimes root causes are aberrations and aren't worth addressing, but almost all the time, the root cause is a human.
Try to figure out not only how to avoid repeating this mistake, but how to eliminate the possibility of it occurring at all.
If a human typed in the wrong deployment command, ask yourself: why does a human need to do that at all? Could it be automated?
If a human configured the service wrong, ask, why didn't monitoring or deployment tests catch this and revert the deployment?
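The "catch it and revert" idea can be sketched as a tiny wrapper around the deploy step. This is only a shape, under the assumption that you have some deploy, health-check, and rollback hooks to plug in; the three callables here are hypothetical names, not a real API.

```python
from typing import Callable

def deploy_with_rollback(do_deploy: Callable[[], None],
                         health_check: Callable[[], bool],
                         rollback: Callable[[], None]) -> bool:
    """Deploy, verify with a post-deploy check, and auto-revert on failure.

    Returns True if the deploy passed the check, False if it was rolled back.
    A misconfigured service fails the health check and never stays live.
    """
    do_deploy()
    if not health_check():
        rollback()  # the bad config is reverted without waking anyone up
        return False
    return True
```

A real system would check health over a window rather than once, but even this shape means a bad configuration is a non-event instead of an outage.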
If monitoring didn't catch the outage, it had better catch it next time. A good monitoring system should also, as much as is reasonable, use your system the way a real user does. If you do file uploads and downloads, have it create a new user account and upload and download a file every five minutes. The worst thing your team can say is "Oh, it's this issue again." Learn from your mistakes.
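That user-style probe can be sketched as below. It's a minimal synthetic check, assuming hypothetical `create_user`, `upload`, and `download` hooks for your service's client; wire in your real client and alerting in place of the `print`.

```python
import os
import time
from typing import Callable

def probe(create_user: Callable, upload: Callable, download: Callable) -> None:
    """One synthetic check: behave like a real user and verify the round trip."""
    user = create_user()                    # fresh account, like a new signup
    payload = os.urandom(32)                # random file contents
    upload(user, "canary.bin", payload)
    if download(user, "canary.bin") != payload:
        raise RuntimeError("upload/download round-trip mismatch")

def run_forever(create_user: Callable, upload: Callable, download: Callable,
                interval_s: int = 300) -> None:
    """Run the probe every five minutes; replace print with real alerting."""
    while True:
        try:
            probe(create_user, upload, download)
        except Exception as exc:
            print(f"ALERT: synthetic probe failed: {exc}")
        time.sleep(interval_s)
```

Because the probe exercises signup, upload, and download end to end, it catches the class of failure a CPU or disk graph never will: the product not working.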
If the process you follow wasn't followed, maybe you should wonder if it's too difficult and should be simplified or removed.
Key: no human involvement required
Process always seems cheaper. You know what isn't cheap? Losing customers and trust, and winning them back. Process is a quick bandage.
Automation is, at first, more expensive than process, but given good developers and tests, your automation will fail less often than a human.
Automate your process. Don't let bad process make more process.