Asleep at the wheel: Why did it take 5 HOURS for Microsoft to acknowledge an Azure DevOps TITSUP*?

We'll have to wait until the US wakes up before we can answer that one


In an impressively frank postmortem, Microsoft has admitted that at least part of its organisation was asleep at the wheel in a very real sense while its European DevOps tooling tottered.

The travails of Azure during the current surge in usage are well-documented but, as well as showing the limits of cloudy tech, the pandemic-induced capacity constraints have also shown up some all-too-human failings at Microsoft.

Microsoft's Azure DevOps hosted pools were no exception to those constraints, and from 24 to 26 March customers in Europe and the UK experienced substantial delays in their pipelines.

It was bad. By Microsoft's own reckoning, during normal working hours over the three days, customers experienced an average delay of 21 minutes. The worst delay was nine hours.

The problem was that each Azure Pipelines job needs a fresh virtual machine and, well, there was no room at the inn as Azure reacted to the COVID-19 surge.

Oh, and the Primary Incident Manager (PIM) was asleep, but more on that later.

As the failures mounted, Azure Pipelines kept trying, and the queues kept getting longer. The gang had noted the potential for problems prior to the incident and had already been working on an update to deal with it (ephemeral OS disks for Linux agents and chunkier Azure VMs with nested virtualization for Windows). However, the change was a big one and took a while to roll out. After all, making things worse would be less than ideal. Fair enough.

However, what was not fair enough was the pisspoor communication from the Windows giant as users clamoured for answers. With refreshing frankness, Chad Kimes, director of engineering, said: "On the first day, when the impact was most severe, we didn't acknowledge the incident for approximately five hours, which is substantially worse than our target of 10 minutes."

Yes, Chad, it is.

Kimes proceeded to give an insight into how Microsoft handles this type of problem. Automated tooling detects customer request failures and performance wobbles. It then loops in both a Designated Responsible Individual (DRI) and the PIM. The PIM is the one who does the external communications.

However, a pipeline delay is picked up by a different process and the PIM wasn't informed.
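For the curious, the gap can be pictured as a routing table with a hole in it. What follows is a minimal sketch in Python, not Microsoft's actual tooling: only the DRI and PIM roles come from the postmortem, while the alert type names, the `ROUTES` table and both helper functions are invented here for illustration.

```python
# Hypothetical routing table: which roles each kind of alert pages.
# Only the role names (DRI, PIM) come from the postmortem; the alert
# type names and this whole structure are illustrative assumptions.
ROUTES = {
    "request_failure": {"DRI", "PIM"},
    "performance_degradation": {"DRI", "PIM"},
    "pipeline_delay": {"DRI"},  # the gap: nobody handling external comms is paged
}


def roles_paged(alert_type):
    """Return the set of roles paged for an alert type (empty if unknown)."""
    return ROUTES.get(alert_type, set())


def routes_missing_role(role):
    """List, in sorted order, the alert types that would not page the given role."""
    return sorted(alert for alert, roles in ROUTES.items() if role not in roles)


if __name__ == "__main__":
    print(routes_missing_role("PIM"))
```

A check along the lines of `routes_missing_role("PIM")` is the sort of audit that would flag an alert path with nobody on the hook for talking to customers.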

So, in this instance, while the DRI was frantically trying to work out why builds were borked in the UK and Europe, the PIM enjoyed the rest of the righteous, doubtless dreaming of gambolling through fields of purest Git.

It wasn't until the PIM awoke and signed into the incident bridge at the start of business hours in the Eastern US that – oops – the borkage was finally acknowledged, five hours after things had gone TITSUP*.

Anyone who has had to ask the Windows giant the simplest of questions will sympathise with having to wait for someone in the US to wake up before an answer can be dispensed. We humble hacks at El Reg's London office would therefore like to welcome those developers using the company's Azure DevOps hosted pools to our own little world of pain.

Still, kudos to Microsoft for laying out what happened so bluntly. As well as offering profuse apologies to developers and an array of technical mitigations, the company also said: "We are improving our live-site processes to ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types." ®

* Total Inability To Service User Pipelines


Biting the hand that feeds IT © 1998–2022