Microsoft admits Azure Resource Manager failed after code change

Just lucky Western Europe was asleep when it happened

Any insomniacs, workaholics, or those pulling an all-nighter ahead of a deadline may have noted a four-and-a-half-hour failure of Azure Resource Manager in Europe this morning, following a recent code change.

Had this happened later in the day, admins would have been pulling their hair out over the hit to productivity, but Microsoft appears to have dodged a bullet on this occasion.

The Azure Status History page confirms:

"Between 02.41 UTC and 07.10 UTC on 23 Mar (sic) 2023 you may have experienced issues using Azure Resource Manager in West Europe when performing resource management operations. This would have impacted users of Azure Portal, Azure CLI, Azure PowerShell, as well as Azure services which depend upon ARM for their internal resource management operations."

What caused the multi-hour performance wobble? Readers will be pleased to know that quality control measures are still top of mind within Microsoft.

"This issue was the result of an interaction between a recent code change in West Europe which introduced a subtle performance regression, and a specific internal Azure workload isn West Europe which exercised this performance regression in a manner which resulted in significant resource saturation."

ARM, as Azure customers will know, is a deployment and management service (including access control, locks, and tags) that provides a management layer so admins can create, update, and delete resources in their Azure account.
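For the uninitiated, pretty much any routine resource management call goes through that layer, whether it comes from the Portal, the CLI, PowerShell, or an SDK. A minimal sketch using the Azure SDK for Python (azure-mgmt-resource) illustrates the kind of operations that were timing out; the resource group name and subscription placeholder here are purely illustrative:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate and point the client at a subscription (placeholder ID)
credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# Create or update a resource group in West Europe -- an ARM management operation
client.resource_groups.create_or_update("example-rg", {"location": "westeurope"})

# List resource groups -- also served by ARM
for rg in client.resource_groups.list():
    print(rg.name, rg.location)

# Delete the group -- a long-running operation tracked via ARM
poller = client.resource_groups.begin_delete("example-rg")
poller.wait()
```

During the incident window, calls like these against West Europe would have seen the "increased latency and timeouts" Microsoft describes below.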

The result of the failure was "increased latency and timeouts for both internal and external ARM calls," Microsoft says.

Engineers for the cloud and software biz spotted and temporarily blocked the internal workload to "mitigate the impact to customers." Just after 0700 UTC, customer traffic in the region returned to "nominal success rate levels."

Microsoft said it was rolling back the code change in question in a "globally safe manner" and would "continue to monitor the situation after the rollback has been completed."

"We have identified the specific code responsible for this performance regression and are working on an alternative implementation which avoids the potential for this type of failure mode in the future," Microsoft adds.

A preliminary Post Incident Report covering the root cause of the failure and the repair items is due within three days, and a final report "where we will share a deep-dive into the incident" is scheduled for a fortnight from today.

Today's incident won't have caused nearly as much gnashing of teeth as the one in the last week of January, when a WAN router IP address change was blamed for a worldwide Microsoft 365 outage.

Gotta love the cloud. ®
