Devops

Asleep at the wheel: Why did it take 5 HOURS for Microsoft to acknowledge an Azure DevOps TITSUP*?

We'll have to wait until the US wakes up before we can answer that one

Wed 8 Apr 2020 // 15:30 UTC

In an impressively frank postmortem, Microsoft has admitted that at least part of its organisation was asleep at the wheel in a very real sense while its European DevOps tooling tottered.

The travails of Azure during the current surge in usage are well documented but, as well as showing the limits of cloudy tech, the pandemic-induced capacity constraints have also shown up some all-too-human failings at Microsoft.

Microsoft's Azure DevOps hosted pools were no exception to those constraints, and from 24 to 26 March customers in Europe and the UK experienced substantial delays in their pipelines.

It was bad. By Microsoft's own reckoning, during normal working hours over the three days, customers experienced an average delay of 21 minutes. The worst delay was nine hours.

The problem was that each Azure Pipelines job needs a fresh virtual machine and, well, there was no room at the inn as Azure reacted to the COVID-19 surge.

Oh, and the Primary Incident Manager (PIM) was asleep, but more on that later.

As the failures mounted, Azure Pipelines kept trying, and the queues kept getting longer. The gang had noted the potential for problems prior to the incident and had already been working on an update to deal with it (ephemeral OS disks for Linux agents and chunkier Azure VMs with nested virtualization for Windows). However, the change was a big one and took a while to roll out. After all, making things worse would be less than ideal. Fair enough.
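For the curious, the queue arithmetic is easy enough to sketch. What follows is a toy model, not Microsoft's scheduler: the function name `simulate_queue` and all the numbers are invented for illustration. It assumes a fixed number of jobs arrive per tick, a fixed number of fresh VMs can be allocated per tick, and any job that fails to get a VM simply stays in the queue to be retried.

```python
def simulate_queue(arrivals_per_tick, vms_per_tick, ticks):
    """Toy backlog model: each job needs one fresh VM; unserved jobs wait and retry.

    All parameters are hypothetical. Returns the backlog size after each tick.
    """
    queue = 0
    history = []
    for _ in range(ticks):
        queue += arrivals_per_tick          # new jobs join the backlog
        started = min(queue, vms_per_tick)  # only as many jobs start as VMs exist
        queue -= started                    # the rest wait for the next attempt
        history.append(queue)
    return history


# Demand above capacity: the backlog grows every tick.
constrained = simulate_queue(arrivals_per_tick=10, vms_per_tick=7, ticks=5)

# Capacity above demand: the backlog stays empty.
healthy = simulate_queue(arrivals_per_tick=10, vms_per_tick=12, ticks=5)
```

Under these made-up numbers the constrained run ends each tick three jobs deeper in the hole, which is the general shape of the problem: retrying does not help when the VMs simply are not there.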

However, what was not fair enough was the piss-poor communication from the Windows giant as users clamoured for answers. With refreshing frankness, Chad Kimes, director of engineering, said: "On the first day, when the impact was most severe, we didn't acknowledge the incident for approximately five hours, which is substantially worse than our target of 10 minutes."

Yes, Chad, it is.

Kimes proceeded to give an insight into how Microsoft handles this type of problem. Automated tooling detects customer request failures and performance wobbles. It then loops in both a Designated Responsible Individual (DRI) and the PIM. The PIM is the one who does the external communications.

However, a pipeline delay is picked up by a different process and the PIM wasn't informed.
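The gap Kimes describes can be pictured as a routing table with one entry missing. The sketch below is an assumption-laden illustration rather than anything Microsoft has published: the alert type names, the `ROUTES_AS_DESCRIBED` table and both helper functions are hypothetical, and only the DRI and PIM roles come from the postmortem.

```python
DRI = "DRI"  # Designated Responsible Individual: works the problem
PIM = "PIM"  # Primary Incident Manager: handles external communications

# Hypothetical routing table reflecting the behaviour described in the postmortem.
ROUTES_AS_DESCRIBED = {
    "request_failure": {DRI, PIM},
    "performance_degradation": {DRI, PIM},
    "pipeline_delay": {DRI},  # detected by a different process; PIM not informed
}


def route(alert_type, routes):
    """Return the set of roles paged for a given alert type."""
    return set(routes.get(alert_type, set()))


def unannounced(routes):
    """List alert types that page nobody responsible for external comms."""
    return sorted(t for t, roles in routes.items() if PIM not in roles)


# One assumed way to close the gap: page the PIM for every alert type.
ROUTES_WITH_PIM = {t: roles | {PIM} for t, roles in ROUTES_AS_DESCRIBED.items()}
```

A check along the lines of `unannounced(ROUTES_AS_DESCRIBED)` flags `pipeline_delay` as the one alert type that reaches an engineer but nobody who talks to customers, while the same check on `ROUTES_WITH_PIM` comes back empty.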

So, in this instance, while the DRI was frantically trying to work out why builds were borked in the UK and Europe, the PIM enjoyed the rest of the righteous, doubtless dreaming of gambolling through fields of purest Git.

It wasn't until the PIM awoke and signed into the incident bridge at the start of business hours in the Eastern US that – oops – the borkage was finally acknowledged, five hours after things had gone TITSUP*.

Anyone who has had to ask the Windows giant the simplest of questions will sympathise with having to wait for someone in the US to wake up before an answer can be dispensed. We humble hacks at El Reg's London office would therefore like to welcome those developers using the company's Azure DevOps hosted pools to our own little world of pain.

Still, kudos to Microsoft for laying out what happened so bluntly. As well as offering profuse apologies to developers and an array of technical mitigations, the company also said: "We are improving our live-site processes to ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types." ®
