Datacenter outages are on the decline, but when they hit, they hit hard
Power snafus take limelight in latest downtime diary from Uptime Institute
The frequency and severity of datacenter outages are on the decline, yet when incidents do occur they can be very costly to the organization involved, with power issues leading to the most serious blackouts.
As the datacenter footprint expands to meet the demand sparked by generative AI mania and more, the overall number of datacenter-related outages is likely to increase.
However, according to a new report from Uptime Institute, there has been a consistent downward trend in the frequency and severity of outages relative to the growth in IT capacity over the past several years.
This means that while there are more incidents than before, their rate of increase is lower than the pace at which IT capacity itself is expanding.
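The distinction between a rising outage count and a falling outage rate can be made concrete with a little arithmetic. The sketch below uses purely hypothetical figures (they are not taken from the Uptime report) and assumes capacity is measured in megawatts; the function name `outages_per_unit_capacity` is likewise invented for illustration.

```python
def outages_per_unit_capacity(outages: int, capacity_mw: float) -> float:
    """Return the outage count normalized by IT capacity (outages per MW)."""
    if capacity_mw <= 0:
        raise ValueError("capacity_mw must be positive")
    return outages / capacity_mw


# Hypothetical figures for illustration only, NOT from the Uptime report.
earlier = {"outages": 100, "capacity_mw": 1000.0}
later = {"outages": 110, "capacity_mw": 1300.0}

earlier_rate = outages_per_unit_capacity(**earlier)
later_rate = outages_per_unit_capacity(**later)

# Growth of each quantity between the two periods, as a fraction.
outage_growth = later["outages"] / earlier["outages"] - 1
capacity_growth = later["capacity_mw"] / earlier["capacity_mw"] - 1

print(f"Outage count growth: {outage_growth:.0%}")
print(f"Capacity growth:     {capacity_growth:.0%}")
print(f"Outages per MW:      {earlier_rate:.3f} -> {later_rate:.3f}")
```

In this made-up scenario the absolute number of outages goes up, but because capacity grows faster, the number of outages per unit of capacity goes down, which is the pattern the report describes.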
There are a number of reasons for this, including that many organizations are investing more in physical infrastructure redundancy. Other reasons include the move to the public cloud and the adoption of new technology to help comply with regulations around the reporting and improvement of resiliency and energy performance.
Uptime cautions, however, that data relating to outages should be treated carefully as it is often commercially sensitive and subject to uncertainty. The Uptime report is based on data from its members, surveys of datacenter managers, and publicly available data.
The Annual Outage Analysis 2024 states that 55 percent of operator respondents indicated they had experienced an outage in the past three years, but this was down from 60 percent for the previous year, and 69 percent the year before that.
At the same time, only one in ten outages during the past year was categorized as either serious or severe – the top two rankings in a scale of five. This is an improvement of four percentage points from the previous year, and ten percentage points compared to the year before that.
Yet just over half of respondents (54 percent) indicated that their most recent significant, serious or severe outage cost the organization more than $100,000, with 16 percent saying that it cost them upwards of $1 million.
As far as the most severe outages go, disruption to on-site power distribution has consistently been the greatest single factor for several years, listed in 52 percent of incidents in the latest report.
Uptime claims there is some evidence that a shift toward more dynamic power grids using renewable energy is reducing grid reliability, and that datacenters may experience an increase in outages as this trend progresses. Many outages occur when an uninterruptible power supply (UPS) or generator fails to respond to a grid disruption, it notes.
Microsoft suffered such an outage to its Azure services in West Europe last year when a disturbance in the power supplied by the utility company caused it to move to generator power at one datacenter, but a subset of the generators failed to kick in as expected.
The second greatest cause is a failure or underperformance of cooling equipment. This was also seen last year, when 2.5 million payment transactions could not be completed after the cooling system failed at an Equinix datacenter used by two banks in Singapore – DBS and Citibank.
Uptime notes that third-party provider issues have seen a small but consistent uptick since 2020, rising by five percentage points to account for nearly one in ten outages in 2023. This likely includes failures at cloud operators, and the share may be rising because more workloads now run in the public cloud.
That said, the report also finds that human error is a contributing factor in somewhere between two-thirds and four-fifths of all downtime incidents. These might be caused by staff failing to follow procedures or by the procedures themselves being inadequate.
The New York Stock Exchange (NYSE), for example, suffered an incident last year after an employee failed to shut down a disaster recovery system at the exchange's secondary datacenter. As this system was left running overnight, the software that operates the NYSE acted as if trading had already begun and prevented opening auction prices from being set correctly.
Uptime asserts there is an opportunity here for organizations to further reduce outages through better staff training and a careful review of processes to iron out any potential failure points.
According to the company, there are typically about 10-20 high-profile IT outages or datacenter events globally each year that lead to serious financial loss or business and customer disruption. In many cases these also lead to reputational loss.
The full Uptime Institute report can be downloaded here. ®