Microsoft has published a full, frank, and ugly account of just what went wrong when Azure Storage entered Total Inability To Support Usual Performance (TITSUP) mode in November.
The nub of the problem was that Azure's update procedures and code had “... a gap in the deployment tooling that relied on human decisions and protocol.”
At the time of the incident, Microsoft said it was caused by “... an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting (testing).”
Microsoft says its flighting process works like this:
“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’ When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”
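The process described above — deploy to a small slice, watch health checks, then widen — can be sketched in a few lines. This is purely illustrative: the stage fractions, function names, and health-check stub below are invented, not Microsoft's actual tooling.

```python
# Illustrative sketch of progressive "flighting": a change is deployed to
# increasingly large slices of the infrastructure, with a health check
# gating each expansion. All names and stage sizes are hypothetical.

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of infrastructure per stage

def health_check(slice_fraction):
    """Stand-in for real telemetry; returns True if the slice looks healthy."""
    return True  # a real system would inspect monitoring signals here

def flight(change, deploy, rollback):
    """Deploy `change` stage by stage; roll back and stop on any failure."""
    for target in STAGES:
        deploy(change, target)        # expand the change to the next slice
        if not health_check(target):  # gate on health before widening further
            rollback(change, target)
            return False
    return True  # change reached 100% of the infrastructure
```

The point of the structure is that no single step jumps straight to 100%: each expansion is conditional on the previous slice demonstrating "successful results", as the quote puts it.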
The new analysis of the outage fingers faulty flighting as the cause of the mess, saying it started with “a software change to improve Azure Storage performance by reducing CPU footprint of the Azure Storage Table Front-Ends.”
During the upgrade, “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.” The engineer doing the upgrade “believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk.”
But it wasn't, because “the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.”
“Enabling this change on the Azure Blob storage Front-Ends exposed a bug which resulted in some Azure Blob storage Front-Ends entering an infinite loop and unable to service requests.”
Microsoft has since changed its processes and “released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration.”
Those updates mean “policy is now enforced by the deployment platform itself.”
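In other words, the incident happened because the policy was a human convention; the fix makes the tooling itself reject a rollout that skips stages. A minimal sketch of what platform-level enforcement might look like (the stage table, names, and exception are assumptions for illustration, not Microsoft's implementation):

```python
# Hypothetical sketch of policy enforced by the deployment platform itself:
# an operator's request to widen a rollout is validated against the allowed
# stage progression, so "low risk, push it everywhere" is no longer possible.

class PolicyViolation(Exception):
    """Raised when a deployment request skips flighting stages."""

# Each stage may only advance to the next slice size (fractions are invented).
ALLOWED_NEXT = {None: 0.01, 0.01: 0.05, 0.05: 0.25, 0.25: 1.0}

def request_deployment(current_stage, requested_stage):
    """Permit only the next incremental slice; reject jumps to full rollout."""
    if ALLOWED_NEXT.get(current_stage) != requested_stage:
        raise PolicyViolation(
            f"cannot go from {current_stage} to {requested_stage}: "
            "flighting stages must not be skipped")
    return requested_stage
```

Under a scheme like this, the engineer's judgment call in the November incident would have been blocked: a request to go from a partial flight straight to the whole infrastructure raises an error rather than relying on "human decisions and protocol".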
That's an unusual amount of explanatory material for any incident and a far more detailed dump than cloudy rivals have offered after their own outages.
With cloud services now hard to differentiate on price, and often not highly differentiated in terms of features, might this kind of openness sway customers? Or is it safer to assume that if Microsoft has suffered one big SNAFU like this, there are others waiting to happen and Azure is best avoided?
You be the judge. ®