Software

How to have a more positive 'outage experience' according to Microsoft: Please don't rely on the Azure Status page

'We will never be able to avoid outages entirely' CTO confesses

Tue 18 Aug 2020 // 11:42 UTC

Microsoft Azure CTO Mark Russinovich, together with principal program manager overseeing "outage communications" Sami Kubba, has posted about advancing the "outage experience" – not the phrase that usually comes to mind when a cloud failure ruins your day.

Outages are "an unfortunate inevitability of the technology industry", Russinovich said.

Azure's reliability record is good – Russinovich reported in July 2019 that "Azure has operated core compute services at 99.995 percent average uptime across our global cloud infrastructure" – but these global figures do not tell the whole story.
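To put that percentage in perspective, a quick back-of-the-envelope calculation converts an average uptime figure into the downtime it implies over a year. This is only a sketch: the function name is ours, it ignores leap years, and a global average says nothing about how the downtime is spread across regions, services or individual customers.

```python
# Rough conversion from an average uptime percentage to yearly downtime.
# Illustrative only: the helper name is hypothetical, not anything from Microsoft.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # 99.995 percent is the figure Russinovich quoted for core compute services.
    minutes = downtime_minutes_per_year(99.995)
    print(f"99.995% uptime implies about {minutes:.2f} minutes of downtime a year")
```

On that arithmetic, 99.995 percent works out to a little over 26 minutes a year on average, which is why a multi-hour outage of a single service can hurt badly without moving the headline number much.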

Problems with Azure MFA (Multi-Factor Authentication), for example, which suffered major outages in November 2018 and again in October 2019, have a disproportionate impact, preventing many users from using Office 365 services such as email, SharePoint and Teams.

Another example: if an individual service such as Azure DevOps Pipelines fails, it can block application deployment, causing significant disruption to some customers while hardly registering on the global reliability numbers.

If outages cannot be altogether avoided, the least Microsoft can do is to keep its customers informed. Russinovich said the aim is to notify customers in less than 15 minutes via an automated process – using "AIOps" to detect anomalies and inform both engineers and customers. In theory, this alerting process should mean that admins get alerted to problems without needing to take matters into their own hands, for example, by searching both on Azure and on social media to see if a problem is with Azure or on the customer's side.

Microsoft says Service Health in the Azure portal is the place to look for outage information – presuming that the portal itself is not borked

It is here that Kubba's post revealed a key tip. The place to look, Kubba said, is in the Azure Portal under Service Health. The public Azure status page is "only used to communicate widespread outages", so it is likely to be of little help in discovering what may be wrong. "Despite this, we constantly find that customers visit the Azure Status page to determine the health of services on Azure," complained Kubba – though we presume that if an outage blocks access to the portal, the Status page would then be the right place to look. In other cases, it is not much use since "more than 95 per cent of our incidents" do not appear there, according to Kubba.

A further complication is that Azure DevOps (which is where Pipelines live) is not integrated into the Azure portal, and many Azure DevOps users only hang around the DevOps portal. Therefore, there is a separate Azure DevOps status page, making this a third site that needs to be bookmarked for tracking outages.

There is another issue, which is the flow of information from admins to end users, who most likely do not even know that a service or part of a service is running on Azure. They are the ones whose work is affected, so it falls to admins to configure service health alerts so that the right people get the message, and then to translate that into effective communication with end users.

Kubba also said that "reliability is a shared responsibility", observing that the customer can reduce or remove the impact of outages by architecting reliable applications. It is a nuanced issue. If a customer runs a business-critical application on a single virtual machine (VM), and it goes down, then the customer is to blame for not using an appropriate architecture; but that would not make it OK for Microsoft to host an unreliable VM service. ®
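The single-VM point can be illustrated with some simple availability arithmetic. The sketch below assumes the application stays up as long as at least one replica is up and that replicas fail independently of one another; real outages often break that second assumption, since a regional or platform-wide failure can take out every replica at once. The 99.9 percent per-VM figure is a made-up input for illustration, not a number from Microsoft or from Kubba's post.

```python
# Availability of N redundant replicas under an independence assumption.
# Illustrative only: the names and the 0.999 input are hypothetical.


def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent instances is up.

    `single` is the availability of one instance as a fraction (0.0 to 1.0).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas must be down at the same time for the service to be down.
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    HYPOTHETICAL_VM_AVAILABILITY = 0.999  # assumed, not a published figure
    for n in (1, 2, 3):
        result = combined_availability(HYPOTHETICAL_VM_AVAILABILITY, n)
        print(f"{n} replica(s): {result:.7f}")
```

Under these assumptions each extra replica multiplies the chance of total failure by the per-instance failure rate, which is the customer's half of the bargain; the per-instance figure itself remains the provider's.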
