Couldn't connect to West Europe SQL Databases last week? Blame operator error

Clouds form over PICNIC (Problem In Chair Not In Computer)

Microsoft has blamed "operator error" for the multi-hour outage of its cloud SQL Server in Europe last week.

"Between 03:47 UTC and 13:30 UTC on 21 July 2022, customers using SQL Database and SQL Data Warehouse in West Europe may have experienced issues accessing services," said Microsoft.

The issues were severe for affected customers. Attempting to make new connections to databases in the West Europe region resulted in errors and time-outs. Existing connections were OK, but once closed, attempts to re-establish them hit the same problems.

And, of course, when Microsoft's SQL Database gets ill, so too do many of the services that depend upon it, including App Services, Automation, Backup and so on.

It took nearly two and a half hours to achieve partial recovery (at 06:12 UTC), and the company said the problem was solved at 13:30 UTC, although it didn't declare full mitigation until 18:45 UTC. ("No failures that occurred after 13:30 UTC were directly as a result of this incident," Microsoft said.)

So what happened? A PICNIC (Problem In Chair Not In Computer) by the sounds of things – "an operator error led to an incorrect action being performed in close sequence on all four persisted metadata caches," explained Microsoft.

Connections to the Azure SQL Database service are dealt with by regional gateway clusters (West Europe has two), and there are multiple persisted metadata caches used for connection routing (again, West Europe has two per gateway cluster).

That "operator error" meant the caches were unavailable to the gateways. Gateway processes in the West Europe region couldn't access connection routing metadata, and the incident kicked off.
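For the curious, here is a toy model of that failure mode. It is a sketch only: the class names, the `orders` database and the `node-7` target are all made up for illustration, and Microsoft has not published how its gateways actually look up routing metadata. The point is simply that redundancy helps when one cache goes away, and not at all when the same action hits every cache in close sequence.

```python
class CacheUnavailable(Exception):
    """Raised when a (hypothetical) persisted metadata cache cannot be reached."""


class MetadataCache:
    """Stand-in for one persisted metadata cache holding routing entries."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.available = True

    def lookup(self, database):
        if not self.available:
            raise CacheUnavailable(database)
        return self.entries[database]


class Gateway:
    """Stand-in for a gateway process that routes new connections."""

    def __init__(self, caches):
        self.caches = caches

    def route(self, database):
        # Try each cache in turn; one healthy cache is enough to route.
        for cache in self.caches:
            try:
                return cache.lookup(database)
            except CacheUnavailable:
                continue
        # Every cache is down, so the new connection cannot be routed.
        raise ConnectionError(f"no routing metadata for {database}")


caches = [MetadataCache({"orders": "node-7"}) for _ in range(2)]
gateway = Gateway(caches)

caches[0].available = False
with_one_cache_down = gateway.route("orders")  # still routable

# The same action applied to every cache: nothing left to fall back on.
for cache in caches:
    cache.available = False
```

After that last loop, any call to `gateway.route("orders")` raises `ConnectionError`, which is roughly what customers opening new connections would have seen as errors and time-outs.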

Once the error was identified, engineers were faced with a decision to either revive the caches or rebuild entirely new ones. With the latter choice likely to take a lot longer than the former, engineers got cracking with fixing what was already in place.

By 06:18 UTC, success rates hit approximately 60 per cent, but issues persisted. "Firstly," said Microsoft, "a timing issue in applying mitigation caused gateways in one of the two clusters to cache incorrect cache connection strings. Secondly, the metadata caches were not receiving updates for changes that happened while the caches were unavailable."

Cue a careful restart of all the gateway nodes in the cluster and a script to deal with stale cache entries (where updates had been missed).
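Microsoft did not share the script, so what follows is a guess at the general shape of such a clean-up rather than the real thing. It assumes there is some authoritative record of where each database lives (called `source_of_truth` here, a hypothetical name) and brings a cache back into line with it, reporting which entries had gone stale.

```python
def refresh_stale_entries(cache, source_of_truth):
    """Bring `cache` into line with `source_of_truth`, in place.

    Both arguments are plain dicts mapping a database name to a routing
    target. Returns a sorted list of the database names that were changed.
    """
    changed = []

    # Add entries the cache never received and fix ones that are out of date.
    for database, current in source_of_truth.items():
        if cache.get(database) != current:
            cache[database] = current
            changed.append(database)

    # Drop entries for databases that no longer exist in the source of truth.
    for database in list(cache):
        if database not in source_of_truth:
            del cache[database]
            changed.append(database)

    return sorted(changed)


# Illustrative data only: "orders" moved, "new" was created and "old" was
# dropped while the cache was unavailable.
cache = {"orders": "node-7", "billing": "node-2", "old": "node-9"}
source_of_truth = {"orders": "node-8", "billing": "node-2", "new": "node-1"}
changed = refresh_stale_entries(cache, source_of_truth)
```

Here `changed` comes back as `['new', 'old', 'orders']` and the cache ends up matching the source of truth, with the untouched `billing` entry left alone.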

Missing from Microsoft's detailed explanation for the outage was the fate of the unfortunate operator whose Who, Me? moment caused such chaos for customers in the West Europe region. It was not explained how one person could wreak such havoc. Perhaps that PICNIC had a side of iffy processes.

Instead, the company closed the stable door long after the horse bolted by "programmatically blocking any further executions of the action that led to the metadata caches becoming unavailable."

It has also thrown up stronger guardrails "to prevent human errors like the one that triggered the start of this incident."

In-memory caching of connection routing metadata is also to be implemented and the company is to take a long, hard look at service resiliency. ®
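What that in-memory caching might look like is anyone's guess, but the general pattern is well worn: remember the last good answer locally and fall back to it when the shared cache is out of reach. The sketch below is an assumption-laden illustration, with `RoutingLookup` and the fetch callback invented for the purpose, and not a description of Microsoft's design.

```python
class RoutingLookup:
    """Routing lookups with a local in-memory fallback (illustrative only)."""

    def __init__(self, fetch_from_persisted_cache):
        # Callable that returns a routing target for a database name, or
        # raises ConnectionError when the persisted cache is unreachable.
        self._fetch = fetch_from_persisted_cache
        self._memory = {}

    def route(self, database):
        try:
            target = self._fetch(database)
        except ConnectionError:
            # Persisted cache is down: serve the last known answer if any.
            if database in self._memory:
                return self._memory[database]
            raise
        # Remember the fresh answer for the next outage.
        self._memory[database] = target
        return target
```

The trade-off is staleness. A remembered entry may point at the wrong place if the database moved during the outage, which is the same class of problem the engineers had to script their way out of.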

Biting the hand that feeds IT © 1998–2022