IT downtime not itself going down, power failures most common cause

2022 in a nutshell: Missing SLAs, failing to meet customer expectations

Infrastructure operators are struggling to reduce the rate of IT outages despite improving technology and strong investment in this area.

The Uptime Institute's 2022 Outage Analysis Report says that progress toward reducing downtime has been mixed. Investment in cloud technologies and distributed resiliency has helped to reduce the impact of site-level failures, for example, but has also added complexity. A growing number of incidents are being attributed to network, software or systems issues because of this intricacy.

The authors make it clear that critical IT systems are far more reliable than they once were, thanks to many decades of improvement. However, data covering 2021 and 2022 indicates that unscheduled downtime is continuing at a rate that is not significantly reduced from previous years.

Most organizations – 80 percent – have experienced an outage in the past three years, with about one in five of those surveyed saying they had a serious or severe outage during the same timeframe.

"Serious" and "severe" are the top two ratings in the Uptime Institute's five-level category ranking for outages. "Serious" covers disruption of services with possible financial losses or compliance breaches, while "severe" covers major and damaging disruption of services with potentially large financial losses.

Based on the data it has collected, the Uptime Institute report suggests that each year there will likely be at least 20 serious IT outages across the world that cause major financial loss, business and customer disruption, and reputational loss.

When it comes to the cause of outages, the report notes that, as well as a primary cause, most have other factors that also contribute to an incident. Power failures are the most common cause, listed as the primary factor in 43 percent of outages, followed by software, network, and cooling, each accounting for about 14 percent of incidents.

In the Uptime Institute's annual resiliency survey – one of the data sources for the Outage Analysis Report – network issues were listed as the most common cause of all end-to-end IT service outages generally, with power-related issues coming second.

The Uptime Institute also found that third-party commercial operators such as cloud, hosting and colocation providers accounted for almost 63 percent of all public outages over a five-year period, and this percentage has crept up year by year to 71 percent during 2021.

However, the key words here are "public outage," and the report authors note that the reliability of public cloud services has come under greater scrutiny in recent years as a result of some high-profile outages, as well as the growing interest in running critical services in the public cloud.

Nevertheless, the survey found that enterprise IT managers are "somewhat concerned" about the resiliency of public cloud services, with only 13 percent of respondents saying public cloud services are reliable enough to run all their workloads, and the number of "don't know" responses has increased since last year.

Drilling deeper into the causes, the Uptime Institute found that UPS failures are the most common reason for power-related outages, followed by failures of generators, transfer switches, and power distribution units.

Two causes tie as the most common reason behind a network-related outage: configuration/change management errors and third-party network provider failures. These are not surprising in modern network environments, the report states, where networks are constantly being updated to optimize performance or meet new requirements.

Another trend reported by the Uptime Institute is that the duration of outages appears to be increasing, at least for publicly reported outages. This is worrying because an outage is likely to be more costly and disruptive the longer it lasts.

In 2021, the proportion of publicly reported outages lasting longer than 48 hours was 16 percent, compared with 4 percent in 2017, while those lasting between 24 and 48 hours stood at 12 percent, compared with 4 percent in 2017.

The cost of outages has also risen. In 2019, 60 percent of major failures are estimated to have cost less than $100,000, while 28 percent cost between $100,000 and $1 million. In 2021, only 39 percent cost less than $100,000, while 47 percent were between $100,000 and $1 million. The proportion of outages costing over $1 million grew from 11 percent to 15 percent.

The data feeding into the Outage Analysis Report comes from four main data sources, according to the Uptime Institute. One of these is a public outages database it maintains, another is a confidential system for members to report abnormal incidents, and the other two are its Global Survey of IT and Data Center Managers and Data Center Resiliency Survey. ®

Biting the hand that feeds IT © 1998–2022