Microsoft Azure CTO Mark Russinovich, together with principal program manager overseeing "outage communications" Sami Kubba, has posted about advancing the "outage experience" – not the phrase that usually comes to mind when a cloud failure ruins your day.
Outages are "an unfortunate inevitability of the technology industry", Russinovich said.
Azure's reliability record is good – Russinovich reported in July 2019 that "Azure has operated core compute services at 99.995 percent average uptime across our global cloud infrastructure" – but these global figures do not tell the whole story.
Problems with Azure MFA (Multi-Factor Authentication), for example, which suffered major outages in November 2018 and again in October 2019, have a disproportionate impact, preventing many users from using Office 365 services such as email, SharePoint and Teams.
Another example: if an individual service such as Azure DevOps Pipelines fails, it can block application deployment, causing significant disruption to some customers while hardly registering on the global reliability numbers.
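To put the 99.995 percent figure in perspective, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: the function name is made up for the example, and it assumes a 365-day year and treats the percentage as an average over that whole year, which is not necessarily how Microsoft measures it.

```python
# Rough conversion of an average uptime percentage into yearly downtime.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


# The figure Russinovich quoted for core compute services.
print(round(downtime_minutes_per_year(99.995), 1))  # about 26.3 minutes
```

An average of roughly 26 minutes a year looks tiny, which is exactly why it can hide an outage lasting hours in a single service or region.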
If outages cannot be altogether avoided, the least Microsoft can do is to keep its customers informed. Russinovich said the aim is to notify customers in less than 15 minutes via an automated process – using "AIOps" to detect anomalies and inform both engineers and customers. In theory, this alerting process should mean that admins get alerted to problems without needing to take matters into their own hands, for example, by searching both on Azure and on social media to see if a problem is with Azure or on the customer's side.
Microsoft says Service Health in the Azure portal is the place to look for outage information – presuming that the portal itself is not borked.
It is here that Kubba's post revealed a key tip. The place to look, Kubba said, is in the Azure Portal under Service Health. The public Azure status page is "only used to communicate widespread outages", so it is unlikely to help in discovering what may be wrong. "Despite this, we constantly find that customers visit the Azure Status page to determine the health of services on Azure," complained Kubba – though we presume that if an outage blocks access to the portal, the Status page would then be the right place to look. In other cases, it is not much use since "more than 95 per cent of our incidents" do not appear there, according to Kubba.
A further complication is that Azure DevOps (which is where Pipelines live) is not integrated into the Azure portal, and many Azure DevOps users only hang around the DevOps portal. Therefore, there is a separate Azure DevOps status page, making this a third site that needs to be bookmarked for tracking outages.
There is another issue: the flow of information from admins to end users, who most likely do not even know that a service, or part of a service, is running on Azure. They are the ones whose work is affected, so it falls to admins to configure service health alerts so that the right people get the message, and then to translate that into effective communication with end users.
Kubba also said that "reliability is a shared responsibility", observing that the customer can reduce or remove the impact of outages by architecting reliable applications. It is a nuanced issue. If a customer runs a business-critical application on a single virtual machine (VM) and it goes down, then the customer is to blame for not using an appropriate architecture; but that would not make it OK for Microsoft to host an unreliable VM service. ®