IT downtime not itself going down, power failures most common cause
2022 in a nutshell: Missing SLAs, failing to meet customer expectations
Infrastructure operators are struggling to reduce the rate of IT outages despite improving technology and strong investment in this area.
The Uptime Institute's 2022 Outage Analysis Report says that progress toward reducing downtime has been mixed. Investment in cloud technologies and distributed resiliency has helped to reduce the impact of site-level failures, for example, but has also added complexity. A growing number of incidents are being attributed to network, software or systems issues because of this intricacy.
The authors make it clear that critical IT systems are far more reliable than they once were, thanks to many decades of improvement. However, data covering 2021 and 2022 indicates that unscheduled downtime is continuing at a rate that is not significantly reduced from previous years.
Most organizations – 80 percent – have experienced an outage in the past three years, with about one in five of those surveyed saying they had a serious or severe outage during the same timeframe.
"Serious" and "severe" are the top two ratings in the Uptime Institute's five-level category ranking for outages. "Serious" covers disruption of services with possible financial losses or compliance breaches, while "severe" covers major and damaging disruption of services with potentially large financial losses.
Based on the data it has collected, the Uptime Institute report suggests that each year there will likely be at least 20 serious IT outages across the world that cause major financial loss, business and customer disruption, and reputational loss.
When it comes to the cause of outages, the report notes that most incidents have contributing factors in addition to a primary cause. Power failures are the most common primary cause, cited in 43 percent of outages, followed by software, network, and cooling, each accounting for about 14 percent of incidents.
In the Uptime Institute's annual resiliency survey – one of the data sources for the Outage Analysis Report – network issues were listed as the most common cause of all end-to-end IT service outages generally, with power-related issues coming second.
The Uptime Institute also found that third-party commercial operators such as cloud, hosting and colocation providers accounted for almost 63 percent of all public outages over a five-year period, and this percentage has crept up year by year to 71 percent during 2021.
However, the key words here are "public outage," and the report authors note that the reliability of public cloud services has come under greater scrutiny in recent years as a result of some high-profile outages, as well as the growing interest in running critical services in the public cloud.
Nevertheless, the survey found that enterprise IT managers are "somewhat concerned" about the resiliency of public cloud services. Only 13 percent of respondents said public cloud services are reliable enough to run all their workloads, and the number of "don't know" responses has increased since last year.
Drilling deeper into the causes, the Uptime Institute found that UPS failures are the most common reason for power-related outages, followed by failures of generators, transfer switches, and power distribution units.
Configuration/change management errors and third-party network provider failures tie as the most common reasons behind network-related outages. These are not surprising in modern network environments, the report states, where networks are constantly being updated to optimize performance or meet new requirements.
Another trend reported by the Uptime Institute is that the duration of outages also appears to be increasing, at least for publicly reported outages. This is worrying because an outage is likely to be more costly and disruptive the longer it lasts.
In 2021, the proportion of publicly reported outages lasting longer than 48 hours was 16 percent, compared with 4 percent in 2017, while the proportion lasting between 24 and 48 hours stood at 12 percent, compared with 4 percent in 2017.
The cost of outages has also risen. In 2019, 60 percent of major failures are estimated to have cost less than $100,000, while 28 percent cost between $100,000 and $1 million. In 2021, only 39 percent cost less than $100,000, while 47 percent were between $100,000 and $1 million. The proportion of outages costing over $1 million grew from 11 percent to 15 percent.
The data feeding into the Outage Analysis Report comes from four main data sources, according to the Uptime Institute. One of these is a public outages database it maintains, another is a confidential system for members to report abnormal incidents, and the other two are its Global Survey of IT and Data Center Managers and Data Center Resiliency Survey. ®