TALE OF FAIL: Microsoft offers blow-by-blow Azure outage account

Buggy software and bad deployment borked Redmond's cloud


Microsoft has published a full, frank, and ugly account of just what went wrong when Azure Storage entered Total Inability To Support Usual Performance – TITSUP - mode in November.

The nub of the problem was that Azure's update procedures and code had “... a gap in the deployment tooling that relied on human decisions and protocol.”

At the time of the incident, Microsoft said it was caused by “... an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting (testing).”

Microsoft says its flighting process works like this:

“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’ When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”

The new analysis of the outage fingers faulty flighting as the cause of the mess, saying it started with “a software change to improve Azure Storage performance by reducing CPU footprint of the Azure Storage Table Front-Ends.”

During the upgrade, “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.” The engineer doing the upgrade “believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk.”

But it wasn't, because “the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.”

“Enabling this change on the Azure Blob storage Front-Ends exposed a bug which resulted in some Azure Blob storage Front-Ends entering an infinite loop and unable to service requests.”

Microsoft has since changed its processes and “released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration.”

Those updates mean “policy is now enforced by the deployment platform itself.”

Microsoft's being very open about this issue. Not only is there a detailed article about the mess, there's also a video interview with more detail.

That's an unusual amount of explanatory material for any incident and a far more detailed dump than cloudy rivals have offered after their own outages.

With cloud services now hard to differentiate on price, and often not highly-differentiated in terms of features, might this kind of openness sway customers? Or is it safer to assume that if Microsoft has a big SNAFU like this in it, there are others waiting to happen and Azure is best avoided?

You be the judge. ®

Broader topics


Other stories you might like

  • Tesla driver charged with vehicular manslaughter after deadly Autopilot crash

    Prosecution seems to be first of its kind in America

    A Tesla driver has seemingly become the first person in the US to be charged with vehicular manslaughter for a deadly crash in which the vehicle's Autopilot mode was engaged.

    According to the cops, the driver exited a highway in his Tesla Model S, ran a red light, and smashed into a Honda Civic at an intersection in Gardena, Los Angeles County, in late 2019. A man and woman in the second car were killed. The Tesla driver and a passenger survived and were taken to hospital.

    Prosecutors in California charged Kevin George Aziz Riad, 27, in October last year though details of the case are only just emerging, according to AP on Tuesday. Riad, a limousine service driver, is facing two counts of vehicular manslaughter, and is free on bail after pleading not guilty.

    Continue reading
  • AMD returns to smartphone graphics with new Samsung chip for your pocket computer

    We're back in black

    AMD's GPU technology is returning to mobile handsets with Samsung's Exynos 2200 system-on-chip, which was announced on Tuesday.

    The Exynos 2200 processor, fabricated using a 4nm process, has Armv9 CPU cores and the oddly named Xclipse GPU, which is an adaptation of AMD's RDNA 2 mainstream GPU architecture.

    AMD was in the handheld GPU market until 2009, when it sold the Imageon GPU and handheld business for $65m to Qualcomm, which turned the tech into the Adreno GPU for its Snapdragon family. AMD's Imageon processors were used in devices from Motorola, Panasonic, Palm and others making Windows Mobile handsets.

    Continue reading
  • Big shock: Guy who fled political violence and became rich in tech now struggles to care about political violence

    'I recognize that I come across as lacking empathy,' billionaire VC admits

    Billionaire tech investor and ex-Facebook senior executive Chamath Palihapitiya was publicly blasted after he said nobody really cares about the reported human rights abuse of Uyghur Muslims in China.

    The blunt comments were made during the latest episode of All-In, a podcast in which Palihapitiya chats to investors and entrepreneurs Jason Calacanis, David Sacks, and David Friedberg about technology.

    The group were debating the Biden administration’s response to what's said to be China's crackdown of Uyghur Muslims when Palihapitiya interrupted and said: “Nobody cares about what’s happening to the Uyghurs, okay? ... I’m telling you a very hard ugly truth, okay? Of all the things that I care about … yes, it is below my line.”

    Continue reading

Biting the hand that feeds IT © 1998–2022