Microsoft says bug, sorry, 'a latent defect' in Safe Deployment Process system downed Azure Active Directory
We're taking steps to prevent this from happening again, says Redmond
Microsoft has blamed a software bug for the service disruption on Monday and Tuesday that affected customers using Azure Active Directory-dependent applications.
From about three hours, from 2125 UTC (1425 PDT) on September 28, 2020 to 0023 UTC (1723 PDT) on September 29, 2020, Azure Active Directory (AD), Microsoft's cloud-based authentication system, and Azure AD B2C, a white-label authentication service for businesses, misfired.
The authentication errors, which lingered until 0225 UTC (1925 PDT) for some, prevented customers from logging into multiple Microsoft 365 services and some Azure services.
It was suggested at the time that coincidental problems with the US 911 emergency dispatch system during Azure disruption were the result of dependency on Azure AD authentication. Microsoft insists that's not the case. "We’ve seen no indication that the multi-state 911 outage was a result of Monday’s service interruption," a spokesperson for the IT goliath said in an email to The Register.
Nonetheless, the outage had worldwide impact, though it was more severe in America. According to Microsoft's incident report, SM79-F88, only 17 per cent of authentication attempts succeeded in the US during the service disruption, rising to 37 per cent just prior to resolution.
Australia had it slightly better, with a 37 per cent success rate. In Asia, the authentication success rate hovered around 72 per cent for the first two hours of the incident but then dropped to 32 per cent as the business day got underway and login attempts rose. Europe meanwhile enjoyed an 81 per cent success rate during the service troubles.
Where are we now? Microsoft 363? 362? We've lost count because Exchange Online isn't playing nicely this morningREAD MORE
The Window biz reports that defenses implemented in its Managed Identity service for Virtual Machines, Virtual Machine Scale Sets, and Azure Kubernetes Services helped keep those tools up with availability of 99.8 per cent.
The disruption occurred, Microsoft says, because a bug in its Azure AD's Safe Deployment Process rendered it unsafe: the safeguard pushed through a crash-inducing update into production, bypassing the usual verification process, and ultimately broke AD.
The update was supposed to be be rolled out gradually across five rings over a period of several days; the rings range from a testing and validation environment through to the public Azure cloud. You'd expect the busted update to be caught at the validation stage. However, it turned out the Safe Deployment Process had "a latent defect" – a bug, in non-Redmond speak – that impaired the system's ability to read deployment metadata. As a result, all the deployment rings got the unstable update at once, and the service started to degrade.
"Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries," the incident report noted. "Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment."
A request to provide further details about the nature of the code defect was declined.
We are continuously taking steps to improve the Microsoft Azure Platform
Microsoft says it reverted the change using an automated rollback within minutes of the incident and this should have mitigated the issue. However, the bug in its Safe Deployment Process system "corrupted the deployment metadata," so the rollback had to be done manually.
The Windows biz says it has already fixed its defective code, returned functioning metadata to its rollback system, and expanded its rollback operation drills. It also says it intends to make its Safe Deployment System safer still with additional defenses against the issues that arose and to hasten the deployment of an Azure AD backup authentication system.
What's more, it has a plan to prime its automated communications pipeline to get outage information to customers within 15 minutes, so they spend less time in the dark.
"We sincerely apologize for the impact to affected customers," the company said in its incident report. "We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future."