Azure Cloud Outage Root Cause Analysis

Mattias Geniar, December 17, 2014

I don’t particularly enjoy outages, but I do like reading the root cause analyses that follow. They’re a valuable way to learn from the mistakes that were made, and they often share insights into (the technology behind) an organization that you normally wouldn’t get to see.

Last November’s Azure outage is no different. Microsoft published a very detailed write-up with enough internals to keep things interesting. The outage occurred during planned maintenance to deploy an improvement to the storage infrastructure, one that was meant to make Storage Tables faster.

During this deployment, there were two operational errors:

  1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.

  2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.

As with most problems, these were human-induced. Technology doesn’t often fail by itself; it fails when engineers make mistakes or implement it badly. In this case, a combination of several human errors was the cause.

In summary, Microsoft Azure had clear operating guidelines but there was a gap in the deployment tooling that relied on human decisions and protocol. With the tooling updates the policy is now enforced by the deployment platform itself.
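To make that idea concrete, here is a minimal sketch of what "enforced by the deployment platform" could look like. It is not Microsoft's tooling: the class `GuardedRollout`, its fields and the 5% slice cap are all hypothetical names and numbers I made up for illustration. The point is only that both operational errors listed above (skipping incremental slices, and enabling the switch on a front-end type that was never validated) become hard failures in code instead of steps in a checklist.

```python
from dataclasses import dataclass, field


class PolicyViolation(Exception):
    """Raised when a rollout request breaks the deployment policy."""


@dataclass
class GuardedRollout:
    # Hypothetical model: none of these names come from the Azure write-up.
    validated_targets: frozenset      # front-end types the change was tested against
    fleet_size: int                   # total number of nodes in the fleet
    max_slice_fraction: float = 0.05  # assumed cap: one slice is at most 5% of the fleet
    deployed: set = field(default_factory=set)
    pending_health_check: bool = False

    def deploy_slice(self, target, nodes):
        # Refuse targets the change was never validated against.
        if target not in self.validated_targets:
            raise PolicyViolation(f"change was never validated against {target!r}")
        # Refuse to continue until the previous slice was confirmed healthy.
        if self.pending_health_check:
            raise PolicyViolation("previous slice has not been confirmed healthy")
        # Refuse slices that are larger than the policy allows.
        if len(nodes) > self.max_slice_fraction * self.fleet_size:
            raise PolicyViolation("slice is larger than the policy allows")
        self.deployed.update(nodes)
        self.pending_health_check = True

    def confirm_healthy(self):
        # Called once monitoring shows the last slice is fine.
        self.pending_health_check = False


if __name__ == "__main__":
    rollout = GuardedRollout(validated_targets=frozenset({"table"}), fleet_size=100)
    rollout.deploy_slice("table", ["node-1", "node-2"])
    rollout.confirm_healthy()
    try:
        rollout.deploy_slice("blob", ["node-3"])
    except PolicyViolation as err:
        print(f"blocked: {err}")
```

In this sketch an engineer can still ask for the wrong thing, but the request is rejected before it touches a single node. That is the difference between a guideline and a guardrail.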

Not everything can be solved with procedures. Even with every step clearly outlined, it still relies on engineers following every step to the letter, and not making mistakes. But we make mistakes. We all do.

All we can do is hope those mistakes don’t happen at critical times.


