SaaS


Microsoft reveals terrible trio of bugs that knocked out Azure, Office 362.5 multi-factor auth logins for 14 hours

Breakdown in MFA's cache, response, and event handling all contributed to TITSUP

Tue 27 Nov 2018 // 00:29 UTC

Microsoft has delivered its postmortem report detailing the failures that led to unlucky folks being unable to log into its cloud services for 14 hours last week.

Redmond said on Monday this week that there were three separate cock-ups that combined to cause the cascading mess that left Azure and Office 365 users unable to sign in for much of Monday, November 19 via multi-factor authentication.

"There were three independent root causes discovered," the Microsofties explained. "In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time."

All three glitches occurred within a single system: Azure Active Directory Multi-Factor Authentication. Microsoft uses that service to handle multi-factor login for the Azure, Office 365, and Dynamics services.

The first problem, Microsoft said, was undesirably high latency between the MFA frontend and its cache, caused by a high number of users attempting to log in that Monday morning. Latency is pretty important because MFA login codes are short-lived, typically 30 or 60 seconds, so if codes expire before they can be used, people will attempt to sign in again, adding more strain to the system.
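To see why slow responses snowball, here is a minimal sketch of the expiry arithmetic. It is not Microsoft's code: the names `MfaCode`, `arrives_in_time`, and `attempts_needed` are hypothetical, and it assumes each retry issues a fresh code that the user takes a fixed time to enter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MfaCode:
    issued_at: float     # seconds on some shared clock (assumed)
    ttl_seconds: float   # validity window, e.g. 30 or 60


def arrives_in_time(code: MfaCode, submitted_at: float, latency_seconds: float) -> bool:
    """True if the code reaches the verifier before it expires."""
    verified_at = submitted_at + latency_seconds
    return verified_at <= code.issued_at + code.ttl_seconds


def attempts_needed(user_delay: float, latencies: List[float], ttl_seconds: float) -> Optional[int]:
    """Count sign-in attempts until one lands inside the validity window.

    Every attempt gets a fresh code issued at t=0 for that attempt; the user
    needs user_delay seconds to enter it, and latencies[i] is the service
    latency seen by attempt i. Returns None if every attempt expired.
    """
    for attempt, latency in enumerate(latencies, start=1):
        code = MfaCode(issued_at=0.0, ttl_seconds=ttl_seconds)
        if arrives_in_time(code, submitted_at=user_delay, latency_seconds=latency):
            return attempt
    return None
```

With a 30-second code and a user who needs 10 seconds to type it, anything over 20 seconds of service latency turns one login into two or more requests, which is the extra strain described above.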

From there, a race condition in the frontend's handling of responses from the MFA backend caused frontend server processes to recycle. Finally, that second problem triggered a third, previously undetected bug in the backend servers, where processes piled up until the machines ran out of resources.

On the one hand, it's nice that Redmond is being transparent and upfront. On the other hand, paying subscribers unable to log in for 14 hours may feel this is the very least the Windows giant could do. Here's Microsoft's explanation in full in case it disappears from the website:

There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.

The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.

1. The first root cause manifested as latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger second root cause.

2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend.

3. The third identified root cause, was previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.


As a result, Microsoft's multi-factor servers were falling over while at the same time its administrators were being told that everything was fine. The series of screw-ups first hit EMEA and APAC customers, then as the day progressed, US subscribers. Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations.

Because the services had presented themselves as healthy, actually identifying and mitigating the trio of bugs took some time.

"The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues," Microsoft explained.

"This was made more acute by the gaps in telemetry that would identify the backend server issue."
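The telemetry gap Microsoft describes, a backend that cannot process requests yet still looks healthy, is easy to reproduce in miniature. The sketch below is purely illustrative and is not how Azure monitoring works: `BackendStats`, `naive_health`, `backlog_aware_health`, and the `max_pending` threshold are all assumed names and values.

```python
from dataclasses import dataclass


@dataclass
class BackendStats:
    process_alive: bool          # does the server process respond at all?
    pending_requests: int        # requests accepted but not yet finished
    completed_last_minute: int   # requests finished in the last minute


def naive_health(stats: BackendStats) -> bool:
    """Liveness-only check: healthy as long as the process is up."""
    return stats.process_alive


def backlog_aware_health(stats: BackendStats, max_pending: int = 1000) -> bool:
    """Also treat a large or stalled backlog as unhealthy.

    max_pending is an arbitrary illustrative threshold.
    """
    if not stats.process_alive:
        return False
    if stats.pending_requests > max_pending:
        return False
    # Work is waiting but nothing is finishing: the server is stalled.
    if stats.pending_requests > 0 and stats.completed_last_minute == 0:
        return False
    return True


# A backend that is up but has stopped completing work.
stalled = BackendStats(process_alive=True, pending_requests=5000, completed_last_minute=0)
```

Here `naive_health(stalled)` reports the server as fine while `backlog_aware_health(stalled)` flags it, which is the kind of signal that would have shortened the diagnosis.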

Now, Microsoft says, it is looking to prevent a recurrence of the fiasco by reviewing how it handles updates and testing, as well as reviewing its internal monitoring services and how it contains failures once they begin. ®
