Outage rates fall, but major ones will cost more. Oh and don't bank on SLAs
Not what cost-cutting companies want to hear right now. Maybe they should hang onto those engineers?
The rate at which IT infrastructure outages happen seems to have fallen in recent years, but the flip side is that those that do occur are becoming more pricey for organizations suffering them.
IT outages are bad, and you could be forgiven for thinking they are on the rise. However, according to a fresh report from the Uptime Institute, the incidence of outages has been outpaced by the growth in datacenter infrastructure capacity itself. This means that while the total number of outages is still increasing year-on-year globally, the rate at which they occur is actually falling.
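To see how the total can rise while the rate falls, here is a minimal sketch in Python. The outage and site counts are made-up illustrative numbers, not figures from the Uptime report, and `outage_rate` is a hypothetical helper.

```python
# Hypothetical figures for illustration only; none come from the Uptime report.
def outage_rate(outages: int, sites: int) -> float:
    """Outages per site over some fixed period."""
    return outages / sites

# Assumed earlier period: 100 outages across 1,000 sites.
earlier = outage_rate(outages=100, sites=1_000)
# Assumed later period: more outages (120), but capacity grew faster (1,500 sites).
later = outage_rate(outages=120, sites=1_500)

print(f"earlier rate: {earlier:.2%}, later rate: {later:.2%}")
```

In this made-up case the outage count grows by a fifth while the per-site rate drops, because the number of sites grew by half.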
SLAs? Lower your expectations
Another notable finding from the report is that the frequency and duration of outages strongly suggest the performance of many service providers falls short of their service level agreements (SLAs). Customers should not regard SLAs or availability figures as reliable predictors of future availability, the report warns.
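For a sense of what an availability figure promises on paper, the sketch below converts a percentage into the downtime it permits per year. The availability targets are common illustrative values, not numbers from the report, and `allowed_downtime_hours` is a hypothetical helper.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def allowed_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime permitted by a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

# Illustrative targets only; real SLAs vary in terms and measurement windows.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")
```

By this arithmetic, a 99.9 percent annual target leaves under nine hours of downtime a year, so a single half-day outage would be enough to miss it.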
According to the research, major IT failures may seem more common because of today's greater reliance on IT and online services, and the increased visibility of outages being reported via the news and social media. The reality is that "decades of innovation, investment and better management mean that, overall, critical IT systems, networks and datacenters are far more reliable than they were," the report states.
However, it also finds that more than two-thirds of all outages are now costing organizations more than $100,000, and says the case for investing more in resiliency is becoming stronger.
Uptime's Annual Outages Analysis 2023 draws on data from three main sources: the Uptime Institute Annual Global Data Center Survey 2022, the Uptime Institute Data Center Resiliency Survey 2023, and publicly reported outages tracked by Uptime during 2022.
In four separate surveys from 2020 to 2022, the proportion of managers and datacenter operators who reported a significant or worse outage at their organization during the past three years fluctuated between 60 and 80 percent, according to the report.
Uptime said it has tracked a steady decline in the outage rate per site, with 60 percent of respondents to the 2022 Uptime annual survey reporting an outage in the past three years, a figure that is down from 69 percent in 2021 and 78 percent in 2020.
There are also signs from the data that the impact of some outages is actually declining. Uptime classifies outages on a scale of 1 to 5, and the top two categories (serious and severe) have previously accounted for about 20 percent of all outages, but by 2022, these had fallen to 14 percent.
According to Uptime, its survey findings regarding the causes leading to outages have been “remarkably consistent” over time, with on-site power problems remaining the biggest cause of significant site outages, accounting for 44 percent of these in last year’s data.
The next largest cause is network issues at 14 percent, with hardware/software failures and cooling issues both at 13 percent. However, when it comes to all outages, not just those that had a major impact, network issues are the greatest cause at 31 percent, with power problems coming second.
Cyberattacks on the rise
Meanwhile, for publicly recorded or reported outages, there is a different mix of causes, with cyberattacks and ransomware accounting for about 11 percent of these. This places them behind network issues and hardware/software failures, but it is a cause that is on the rise from the 8 percent reported in 2021.
Such attacks often lead to a lengthy shutdown of large parts of an organization's digital infrastructure, the report notes, with data loss common and a frequent need to rebuild systems and databases.
In a blow to the cloud operators, Uptime finds that many enterprise IT managers are concerned about the resiliency of public cloud services, such that only one in 10 survey respondents said that public cloud services are resilient enough for all their workloads.
Nearly a fifth (18 percent) indicated that public clouds are not resilient enough to run any of their workloads at all, representing a growing proportion, according to the report.
“These numbers are unlikely to change dramatically until [cloud providers] can offer greater reassurances on transparency — and perhaps new SLAs that give mission-critical customers more control and compensation,” says the report.
When it comes to publicly reported outages, the figures show that the majority (about 70 percent) are sorted out within 12 hours, and most are fixed much more quickly than that. Once again, however, there is a sting in the tail, with a rise in the number of outages from which services have still not recovered after 48 hours.
Since 2017, this type of outage has risen from about 4 percent to 16 percent of reported incidents. There may be several reasons for this, according to the report, such as the growing frequency of major ransomware attacks that require the shutdown of all potentially affected systems.
As to costs, Uptime reports that in its 2022 global survey, a quarter of respondents said their most recent outage had cost more than $1 million in direct and indirect costs, while a further 45 percent said it had cost between $100,000 and $1 million. This reflects a clear trend of increasing costs, with the figures from 2019 showing that 60 percent of respondents had indicated that major outage costs were below $100,000.
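The cost shares quoted above add up as follows. This is a rough Python sketch that treats "a quarter" as exactly 25 percent (an assumption); the variable names are illustrative.

```python
# Shares of respondents, in percent, as quoted from the 2022 survey.
over_1m = 25               # "a quarter", assumed to mean exactly 25 percent
between_100k_and_1m = 45

# Combined share whose most recent outage cost more than $100,000.
over_100k = over_1m + between_100k_and_1m

# In 2019, 60 percent put major outage costs below $100,000,
# so at most the remainder can have been at or above that level.
below_100k_2019 = 60
at_or_above_100k_2019_max = 100 - below_100k_2019

print(f"2022: {over_100k}% over $100,000")
print(f"2019: at most {at_or_above_100k_2019_max}% at or above $100,000")
```

The combined 2022 figure sits just above two-thirds, in line with the report's headline claim about outages costing more than $100,000.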
Finally, Uptime says that high availability and resiliency should be a priority for all involved in the digital infrastructure supply chain, but warns that things don't always move forward on this front.
The report welcomes the shift towards distributed architectures, which could reduce the impact of some localized failures. However, it warns that other trends may undermine progress; the transition to renewable energy and more distributed energy generation may reduce the reliability of the grid, for example. A skills shortage may also limit the availability of experienced staff with the know-how to achieve greater resiliency.
You hear that, Elon? ®