With so many cloud services dependent on it, Azure Active Directory has become a single point of failure for Microsoft
Does Redmond have a reliability problem?
Comment Microsoft has fixed an issue that left users unable to sign in to its OneDrive and SharePoint services, a problem caused by a faulty remediation for the earlier Azure Active Directory outage.
"We're investigating an issue affecting access to multiple Microsoft 365 services. We're working to identify the full impact," said a Microsoft 365 status tweet at around 10:45pm GMT last night. It referred to a major outage across the company's cloud services, including both Microsoft 365 and some Azure services, which had begun perhaps 20 minutes earlier. The incident continued for hours until around 3:20am today, when Microsoft reported that "the majority of services are now recovered for most users".
The core service affected was Azure Active Directory, which controls login to everything from Outlook email to Teams to the Azure portal, which is used for managing other cloud services. The five-hour outage also produced productivity-stopping annoyances: some installations of Microsoft Office and Visual Studio, even on the desktop, declared that they could not check their licensing and therefore would not run.
There are claims that the US emergency 911 service was affected, which is not implausible given that the RapidDeploy Nimbus Dispatch system describes itself as "a Microsoft Azure–based Computer Aided Dispatch platform". If the problem is authentication, even resilient services with failover to other Azure regions may become inaccessible and therefore useless.
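The arithmetic behind that point can be sketched in a few lines of Python. This is a rough illustration only: the availability figures are invented for the example and are not Microsoft's numbers, the function names are hypothetical, and the model assumes failures are independent, which real outages often are not.

```python
def parallel(availability: float, replicas: int) -> float:
    """Availability of independent replicas where any single one suffices."""
    return 1.0 - (1.0 - availability) ** replicas


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures, chosen for illustration only.
region = 0.999  # one region hosting the application
auth = 0.999    # the shared authentication service

# Failover across two regions makes the application itself very resilient...
app_two_regions = parallel(region, 2)

# ...but every request must still pass through authentication first,
# so the end-to-end figure can never exceed that of the auth service.
end_to_end = serial(auth, app_two_regions)

print(f"app with failover: {app_two_regions:.6f}")
print(f"end to end:        {end_to_end:.6f}")
```

Under these assumptions, adding more regions pushes the application's own availability ever closer to 1, while the end-to-end figure stays capped by the authentication service sitting in front of it.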
The company has yet to provide full details, but a status report today said that "a recent configuration change impacted a backend storage layer, which caused latency to authentication requests".
Status tweets allow us to track some of the developments. 11:36pm: "We've rolled back the change that is likely the source of impact." 11:49pm: "We're not observing an increase in successful connections after rolling back a recent change." 12:48am: "We're rerouting traffic to alternate infrastructure to improve the user experience." 1:40am: "We're seeing improvement for multiple services after applying mitigation steps."
It was not completely over even after the main outage was fixed. Microsoft reported today via the Admin Center that "some users were unable to access SharePoint Online or OneDrive for Business" between 7:20am and 11:52am UK time. The problem was that "a change put in place to mitigate impact during the recent AAD outage caused this issue". Microsoft added: "We're reviewing our deployment and provisioning procedures to help prevent similar problems in the future."
Every IT administrator will feel sympathy for the engineers working under stress to fix issues that have such wide consequences. "We acknowledge the unfortunate reality that – given the scale of our operations and the pace of change – we will never be able to avoid outages entirely," said CTO Mark Russinovich on 17 August. Subsequent events proved the truth of those words, especially in the UK, where a major Azure data centre suffered an outage only two weeks ago.
Outages may be inevitable, but nevertheless Microsoft has some hard questions to answer. Measuring cloud reliability is non-trivial since what matters is not the number of outages but their extent and impact.
Microsoft seems to have more than its fair share of problems. Gartner noted recently that it "continues to have concerns related to the overall architecture and implementation of Azure, despite resilience-focused efforts and improved service availability metrics during the past year". The analyst's reservations were based in part on the low ratio of availability zones to regions, and on the fact that "a limited set of services support the availability zone model".
Gartner's concerns are valid, but they do not describe the cause of the recent disruption. Bill Witten, identity architect at Okta, was to the point, commenting: "So, does everyone get why the mono-directory is not a good idea?"
Microsoft has built so much on Azure Active Directory that it is a single point of failure. The company either needs to make it so resilient that failure is near-impossible (which is likely to be its intention), or consider gradually reducing the dependence of so many services on it.
The recent outages are an embarrassment for the company, coming so soon after the Ignite online conference. Microsoft does not talk about reliability much, but it is perhaps the single biggest issue facing its cloud ambitions and its ability to continue its catch-up effort with AWS. ®