PaaS + IaaS

This article is more than 1 year old

Microsoft admits Azure Resource Manager failed after code change

Just lucky Western Europe was asleep when it happened

Thu 23 Mar 2023 // 12:03 UTC

Any insomniacs, workaholics or those pulling an all-nighter related to a past deadline project may have noted a four-and-a-half hour failure of Azure Resource Manager in Europe this morning following a recent code change.

Should this have happened later in the day, admins would have pulled their hair in frustration over the impact to efficiency, but Microsoft appears to have dodged a bullet on this occasion.

Azure Status History website confirms:

"Between 02.41 UTC and 07.10 UTC on 23 Mar (sic) 2023 you may have experienced issues using Azure Resource Manager in West Europe when performing resource management operations. This would have impacted users of Azure Portal, Azure CLI, Azure PowerShell, as well as Azure services which depend upon ARM for their internal resource management operations."

What caused the multi-hour performance wobble? Readers will be pleased to know that quality control measures are still top of mind within Microsoft.

"This issue was the result of an interaction between a recent code change in West Europe which introduced a subtle performance regression, and a specific internal Azure workload isn West Europe which exercised this performance regression in a manner which resulted in significant resource saturation."

ARM, as Azure customers will know, is a deployment and management services (inc access control, locks and tags) that provides a management layer so admins can create, update and delete resources in their Azure account.

The result of the failure was "increased latency and timeouts for both internal and external ARM calls," Microsoft says.

Engineers for the cloud and software biz spotted and temporarily blocked the internal workload to "mitigate the impact to customers." Just after 0700 UTC, customer traffic in the region returned to "nominal success rate levels."

Microsoft was rolling back the code change in question in a "globally safe manner" and will "continue to monitor the situation after the rollback has been completed."

"We have identified the specific code responsible for this performance regression and are working on an alternative implementation which avoids the potential for this type of failure mode in the future," Microsoft adds.

A preliminary Post Incident Report on the root cause of the failure and repair items used will need to wait three days, and a final report "where we will share a deep-dive into the incident" is scheduled for a fortnight from today.

Today's incident won't have caused nearly as much gnashing of teeth as the one in the last week of January, when a WAN router IP address change was blamed for a worldwide Microsoft 365 outage.

Topics

Special Features

Vendor Voice

Resources

PaaS + IaaS

Microsoft admits Azure Resource Manager failed after code change

Just lucky Western Europe was asleep when it happened

More about

More about

Narrower topics

Broader topics

More about

More about

More about

Narrower topics

Broader topics

TIP US OFF

Other stories you might like

Microsoft foresees a new type of AI PC: A Surface designed with help from machines

Microsoft slammed for lax security that led to China's cyber-raid on Exchange Online

French lawmakers take a swing at cloud monopolies

Industrial systems integrating digitalisation

Researchers claim Windows Defender can be fooled into deleting databases

October 2025 will be a support massacre for a bunch of Microsoft products

Microsoft is a national security threat, says ex-White House cyber policy director

Microsoft breach allowed Russian spies to steal emails from US government

Open source versus Microsoft: The new rebellion begins

AI gold rush continues as Microsoft invests $1.5B in UAE's G42

Cloud Software Group and Microsoft pledge another eight years of co-opetition

Microsoft aims to triple datacenter capacity to fuel AI boom

About Us

Our Websites

Your Privacy