Microsoft has published a full, frank, and ugly account of just what went wrong when Azure Storage entered Total Inability To Support Usual Performance (TITSUP) mode in November.
The nub of the problem was that Azure's update procedures and code had “... a gap in the deployment tooling that relied on human decisions and protocol.”
At the time of the incident, Microsoft said it was caused by “... an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting (testing).”
Microsoft says its flighting process works like this:
“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’ When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”
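The process described above — deploy to a small slice, watch health checks, then widen — can be sketched in a few lines. This is purely illustrative: the stage fractions, function names, and health-check stub below are invented, not Microsoft's actual tooling.

```python
# Illustrative sketch of progressive "flighting": a change is deployed to
# increasingly large slices of the infrastructure, with a health check
# gating each expansion. All names and stage sizes are hypothetical.

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of infrastructure per stage

def health_check(slice_fraction):
    """Stand-in for real telemetry; returns True if the slice looks healthy."""
    return True  # a real system would inspect monitoring signals here

def flight(change, deploy, rollback):
    """Deploy `change` stage by stage; roll back and stop on any failure."""
    for target in STAGES:
        deploy(change, target)        # expand the change to the next slice
        if not health_check(target):  # gate on health before widening further
            rollback(change, target)
            return False
    return True  # change reached 100% of the infrastructure
```

The point of the structure is that no single step jumps straight to 100%: each expansion is conditional on the previous slice demonstrating "successful results", as the quote puts it.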
The new analysis of the outage fingers faulty flighting as the cause of the mess, saying it started with “a software change to improve Azure Storage performance by reducing CPU footprint of the Azure Storage Table Front-Ends.”
During the upgrade, “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.” The engineer doing the upgrade “believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk.”
But it wasn't, because “the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.”
“Enabling this change on the Azure Blob storage Front-Ends exposed a bug which resulted in some Azure Blob storage Front-Ends entering an infinite loop and unable to service requests.”
Microsoft has since changed its processes and “released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration.”
Those updates mean “policy is now enforced by the deployment platform itself.”
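In other words, the incident happened because the policy was a human convention; the fix makes the tooling itself reject a rollout that skips stages. A minimal sketch of what platform-level enforcement might look like (the stage table, names, and exception are assumptions for illustration, not Microsoft's implementation):

```python
# Hypothetical sketch of policy enforced by the deployment platform itself:
# an operator's request to widen a rollout is validated against the allowed
# stage progression, so "low risk, push it everywhere" is no longer possible.

class PolicyViolation(Exception):
    """Raised when a deployment request skips flighting stages."""

# Each stage may only advance to the next slice size (fractions are invented).
ALLOWED_NEXT = {None: 0.01, 0.01: 0.05, 0.05: 0.25, 0.25: 1.0}

def request_deployment(current_stage, requested_stage):
    """Permit only the next incremental slice; reject jumps to full rollout."""
    if ALLOWED_NEXT.get(current_stage) != requested_stage:
        raise PolicyViolation(
            f"cannot go from {current_stage} to {requested_stage}: "
            "flighting stages must not be skipped")
    return requested_stage
```

Under a scheme like this, the engineer's judgment call in the November incident would have been blocked: a request to go from a partial flight straight to the whole infrastructure raises an error rather than relying on "human decisions and protocol".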
That's an unusual amount of explanatory material for any incident and a far more detailed dump than cloudy rivals have offered after their own outages.
With cloud services now hard to differentiate on price, and often not highly differentiated in terms of features, might this kind of openness sway customers? Or is it safer to assume that if Microsoft has suffered one big SNAFU like this, there are others waiting to happen and Azure is best avoided?
You be the judge. ®