WAN router IP address change blamed for global Microsoft 365 outage
Command line not vetted using full qualification process, says Redmond. We think it involved chewing gum somewhere
The global outage of Microsoft 365 services that last week prevented some users from accessing resources for more than half a working day was down to a packet bottleneck caused by a router IP address change.
Microsoft's wide area network toppled a bunch of services from 07:05 UTC on January 25, and although some regions and services had come back online by 09:00, intermittent packet loss woes weren't fully mitigated until 12:43. The wobble also affected Azure Government cloud services.
In a postmortem, Microsoft said that changes made to its WAN had hit connectivity between clients and Azure, as well as connectivity across regions and cross-premises connectivity via ExpressRoute.
"As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them.
"The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed."
This meant users were unable to access resources hosted in Azure, as well as Microsoft 365 and Power Platform services.
Microsoft said monitoring systems detected DNS and WAN-related troubles at 07:12, some seven minutes after they began.
By 08:20, resident techies at Microsoft had spotted the "problematic command that triggered the issues" and some 40 minutes later networking telemetry indicated many of the services were running again.
However, Microsoft said the initial problem with the WAN meant automated systems for maintaining its health were paused. This included systems for identifying and expelling unhealthy devices, as well as the traffic engineering system for optimizing the flow of data across the network.
"Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC," the postmortem added.
Steps Microsoft is taking to make similar incidents less likely or less severe include blocking "highly impactful command from getting executed on the devices" and requiring all command execution on devices to follow safe guidelines.
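For readers wondering what such a guard might look like, here is a minimal sketch in Python. It is not Microsoft's tooling: the `CommandGate` class, the device model names, and the `change-router-ip` command string are all hypothetical, invented purely for illustration. The idea is simply that a command only runs on a device model for which it has passed qualification, since the postmortem notes that the same command behaves differently on different network devices.

```python
from dataclasses import dataclass, field


@dataclass
class CommandGate:
    """Hypothetical pre-execution check; not any vendor's real API."""

    # (device_model, command) pairs that have passed qualification.
    qualified: set = field(default_factory=set)
    # Commands treated as highly impactful (illustrative list only).
    high_impact: set = field(default_factory=set)

    def qualify(self, device_model: str, command: str) -> None:
        """Record that a command has been vetted on a given device model."""
        self.qualified.add((device_model, command))

    def check(self, device_model: str, command: str) -> tuple:
        """Return (allowed, reason) for running command on device_model."""
        if (device_model, command) in self.qualified:
            return True, "qualified on this device model"
        if command in self.high_impact:
            return False, "high-impact command blocked: not qualified on this device model"
        return False, "not qualified on this device model"


# Assumed example data: one command, vetted on one model but not another.
gate = CommandGate(high_impact={"change-router-ip"})
gate.qualify("model-a", "change-router-ip")

allowed_a, reason_a = gate.check("model-a", "change-router-ip")
allowed_b, reason_b = gate.check("model-b", "change-router-ip")
print(allowed_a, reason_a)
print(allowed_b, reason_b)
```

Keying the check on the device model and command together, rather than on the command alone, is the point of the sketch: a command vetted on one kind of router says nothing about how it behaves on another.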
The final post-incident report is scheduled to be published a fortnight after the outage. ®