Users of Microsoft's Azure storage service “may have experienced difficulties provisioning new resources or accessing their existing resources” for over eight hours on Wednesday and Thursday. Azure storage was also tough to provision for a short time on Wednesday night.
The first brownout hit the service's East US region, and Microsoft's status page says it impacted “Virtual Machines, Azure Media Services, Application Insights, Azure Logic Apps, Azure Data Factory, Azure Site Recovery, Azure Cache, Azure Search, Azure Service Bus, Azure Event Hubs, Azure SQL Database, API Management and Azure Stream Analytics.”
The second incident was shorter but had wider impact: Microsoft says users as far away as India would have experienced trouble provisioning storage “due to an underlying storage incident.”
The first incident was uncannily similar to Amazon Web Services' “S3-izure”, which also hit a single region in the Eastern USA. But unlike the Amazon S3 outage, the world seems to have kept turning during this incident: The Register has been unable to find the kind of “OMG, everything is down” panic that followed Amazon's incident.
There are two obvious reasons the incident hasn't led to an outcry about disruptions, the first of which is that this was not a total outage like the S3-izure. Azure Storage users may therefore have delivered a less-than-stellar experience, but there's no reason they would have gone down.
A second is that the problem struck overnight: it surfaced at 21:50 UTC on Wednesday evening, about 6PM Eastern Time, and therefore not at a time of heavy demand.
Those factors mean it is probably rather unkind to suggest that Microsoft's outage going largely unremarked-upon is an indication its cloud is unloved and powers no sites that would be missed or attract criticism for slow performance.
But the preliminary root cause of the incident - “one Storage cluster that lost power and became unavailable” - does suggest that Azure may not have superb resilience. A power loss is, however, something a cloud operator cannot always control. And it is a far more acceptable reason for an outage than the typo that took down Amazon's S3 service. ®