I don’t particularly enjoy outages, but I do enjoy reading the root cause analyses that follow them. They’re a valuable way to learn from mistakes made, and they often reveal insights into (the technology behind) an organization that you normally wouldn’t get to see.
And last November’s Azure outage is no different: a very detailed write-up with enough internals to keep things interesting. The outage occurred during planned maintenance, deploying an improvement to the storage infrastructure that would make Storage Tables faster.
During this deployment, there were two operational errors:
1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.
2. Although the change had been validated in test and pre-production against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.
As with most problems, this one was human-induced. Technology rarely fails on its own; it fails when engineers make mistakes or implement it badly. In this case, a combination of several human errors was the cause.
In summary, Microsoft Azure had clear operating guidelines, but there was a gap in the deployment tooling: it relied on human decisions and on following protocol. With the tooling updates, the policy is now enforced by the deployment platform itself.
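That shift, from policy on paper to policy enforced in code, is easy to picture. Here’s a minimal sketch of what such a deployment gate could look like; this is not Azure’s actual tooling, and the rules, service names and function names are made up for illustration. It refuses any rollout that skips incremental flighting or flips a switch for a service type the change was never validated against:

```python
# Hypothetical deployment gate: enforce flighting policy in code instead of
# relying on engineers to follow a written procedure. Illustrative only.

from dataclasses import dataclass, field


@dataclass
class DeploymentRequest:
    change_id: str
    target_service: str               # e.g. "table-frontend" or "blob-frontend"
    target_slices: list[str]          # production slices this rollout touches
    validated_services: set[str] = field(default_factory=set)  # services the change passed validation on


MAX_INITIAL_SLICES = 1  # the first production step may touch only one small slice


def check_deployment(request: DeploymentRequest) -> list[str]:
    """Return a list of policy violations; an empty list means the rollout may proceed."""
    violations = []

    # Rule 1: incremental flighting -- start with a single small slice.
    if len(request.target_slices) > MAX_INITIAL_SLICES:
        violations.append(
            f"change {request.change_id} targets {len(request.target_slices)} slices at once; "
            f"start with at most {MAX_INITIAL_SLICES}"
        )

    # Rule 2: a configuration switch may only be enabled for service types
    # it was actually validated against in test and pre-production.
    if request.target_service not in request.validated_services:
        violations.append(
            f"change {request.change_id} was never validated against '{request.target_service}'"
        )

    return violations


if __name__ == "__main__":
    # A rollout resembling the one in the write-up: validated against Table
    # storage Front-Ends, but enabled broadly for Blob storage Front-Ends.
    request = DeploymentRequest(
        change_id="faster-table-frontends",
        target_service="blob-frontend",
        target_slices=["us-east", "us-west", "eu-west"],
        validated_services={"table-frontend"},
    )
    for violation in check_deployment(request):
        print("BLOCKED:", violation)
```

The point isn’t the specific rules; it’s that the platform, not the engineer, gets the final say on whether a rollout follows policy.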
Not everything can be solved with procedures. Even with every step clearly outlined, a procedure still relies on engineers following it to the letter and not making mistakes. But we make mistakes. We all do.
We can only hope those mistakes don’t happen at critical times.