Microsoft reveals train of mistakes that killed Azure in the South Central US 'incident'

Thunderbolt and lightning, Azure outage frightening

Microsoft has published its preliminary findings on what it calls “the South Central US incident”, but what many will call “the day the Azure cloud fell from the sky”, and it doesn’t make for happy reading.

Thunder and lightning, very very frightening

As is well known now, high energy storms hit Southern Texas early in the morning on 4 September, with multiple Azure data centers in the region seeing what Microsoft described as “voltage sags and swells.” A lightning strike at 0842 UTC caused one data center to switch to generator power and also overloaded suppressors on the mechanical cooling system, shutting it down.

The data center struggled on for a bit, but as its thermal buffers were depleted, temperatures rose, and a shutdown started. Alas, this was not before temperatures had risen to the point where actual hardware, including storage units and network devices, was damaged.

It’s at this point that a fateful decision was taken by engineers. The team could have failed over to another data center but instead put a higher priority on the integrity of customer data, since the asynchronous nature of geo-replication could have led to data loss. Thus the engineers began working through the damaged hardware, replacing where necessary and migrating customer data to healthy servers as needed, while customers kept hitting Refresh and staring at their screens in bafflement.
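To make the trade-off concrete, here is a minimal sketch of the kind of check that sits behind such a decision. It is not Microsoft's procedure: the `ReplicaState` type, its fields, and the `max_acceptable_loss_seconds` threshold are all hypothetical names invented for illustration. The idea is simply that with asynchronous replication, any writes the secondary has not yet received are lost on failover, so a team that tolerates zero loss will not fail over while any lag exists.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    """Hypothetical view of an asynchronously replicated store."""
    last_primary_write: float      # time of newest write on the primary (seconds)
    last_replicated_write: float   # time of newest write the secondary has received


def replication_lag_seconds(state: ReplicaState) -> float:
    """Window of writes that would be lost if we failed over right now."""
    return max(0.0, state.last_primary_write - state.last_replicated_write)


def should_fail_over(state: ReplicaState, max_acceptable_loss_seconds: float) -> bool:
    """Fail over only if the potential data loss is within the tolerated window."""
    return replication_lag_seconds(state) <= max_acceptable_loss_seconds


# Illustrative values only: the secondary is 10 seconds behind the primary.
state = ReplicaState(last_primary_write=1000.0, last_replicated_write=990.0)
print(should_fail_over(state, max_acceptable_loss_seconds=0.0))   # zero tolerance
print(should_fail_over(state, max_acceptable_loss_seconds=15.0))  # some loss accepted
```

With a zero-loss tolerance the check refuses to fail over, which is the choice described above: availability is sacrificed to keep every acknowledged write.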

In the preliminary report, Microsoft admitted that “this particular set of issues also caused a cascading impact to services outside of the region”.

And goodness, it certainly did cause an impact.

For customers directly using the South Central US region, problems began at 0929 UTC on 4 September with pretty much everything going down. Microsoft states that the majority of Azure services were up again just over a day later, by 1100 UTC on 5 September, but it was not until 0840 UTC on 7 September that “full mitigation” was complete.

Azure Service Manager does not support automatic failover

So far so bad. However, as Azure users know all too well, the problem was not isolated to the South Central US region. Microsoft has revealed that the legacy Azure Service Manager (ASM) which manages ‘classic’ resource types uses South Central US as its primary site to store resource metadata. While it also uses other locations to store metadata, ASM does not support automatic failover.

Uh oh. It wasn’t until 0110 UTC on 5 September that service was fully resumed.

Microsoft is keen to point out that its shiny new Azure Resource Manager (ARM) features global resiliency and stores data in every region. Unfortunately, ARM also struggled, with customers experiencing timeouts and, of course, problems with resources that had underlying dependencies.

The incident also served as a pointer to weaknesses in Azure Active Directory (AAD). The affected data center was, unfortunately, one of the AAD sites for North America.

The good news is that as the data center fell over, authentication traffic was routed to the other sites automatically. The bad news is that automatic throttling kicked in, leading to timeouts for customers. It took until 1440 UTC on 4 September for Microsoft to deal with routing and bump up capacity elsewhere.

Finally, Visual Studio Team Services (VSTS) customers discovered that the affected data center provided capabilities used by services in other regions. Again, the decision by engineers not to fail over in order to protect data led to a long wait for affected customers. The VSTS impact was not fully mitigated until 0005 UTC on 6 September, nearly two days after the initial failure.
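For anyone wanting to put numbers on the waits, the recovery times quoted in this article can be measured against the 0929 UTC start of customer impact. The snippet below is a small sketch of that arithmetic. The year is an assumption, since the report excerpts here give only day and month, and it has no effect on the differences; the helper names are ours.

```python
from datetime import datetime, timezone

# Assumed year; the article gives only day and month, and the
# differences below do not depend on it.
YEAR = 2018


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in September from a day and an 'HHMM' string."""
    return datetime(YEAR, 9, day, int(hhmm[:2]), int(hhmm[2:]), tzinfo=timezone.utc)


IMPACT_START = utc(4, "0929")

# Recovery times as quoted in the article.
MILESTONES = {
    "AAD routing and capacity fixed": utc(4, "1440"),
    "ASM fully resumed": utc(5, "0110"),
    "Majority of Azure services up": utc(5, "1100"),
    "VSTS fully mitigated": utc(6, "0005"),
    "Full mitigation": utc(7, "0840"),
}


def elapsed(since: datetime, until: datetime) -> tuple[int, int]:
    """Return the gap between two timestamps as (hours, minutes)."""
    total_minutes = int((until - since).total_seconds()) // 60
    return divmod(total_minutes, 60)


for label, when in MILESTONES.items():
    hours, minutes = elapsed(IMPACT_START, when)
    print(f"{label}: {hours}h {minutes:02d}m after impact began")
```

On these figures VSTS customers waited 38 hours 36 minutes, and full mitigation took 71 hours 11 minutes from the first customer impact.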

We have drawn a discreet veil over the fact that it was over 12 hours before Azure was even able to reliably show its status page.

We are so so sorry

Microsoft obviously said it is very sorry, although you’ll have to check your service level agreement and October billing statement to see just how sorry. It also promised to deal with the hardware problems, be it the design of the data center itself or the lack of resilience of its storage units to “environmental factors”.

More importantly, Microsoft has seemingly recognised that the whole ASM thing isn’t good and plans to migrate dependencies away from it to ARM as rapidly as possible. Customers would be well advised to take a good long look at their own designs as well. ®

Biting the hand that feeds IT © 1998–2022