Human error and power glitches to blame for most outages

Blackouts less frequent in 2024, still a PITA when the datacenter downtime demons visit


Datacenter outages are less frequent and severe, but human error remains one of the most persistent challenges, with between two-thirds and four-fifths of major wobbles involving some element of meatbag-related cause.

According to the latest Annual Outage Analysis report from Uptime Institute, the overall picture is one of improving reliability, but with the sting in the tail that when failures do occur they can be significant and costly.

"Outages overall have slowed down," said Andy Lawrence, Uptime executive director of research. "Datacenter operators are facing a growing number of external risks beyond their control, including power grid constraints, extreme weather, network provider failures and third-party software issues. And despite a more volatile risk landscape, improvements are occurring."

Some 53 percent of operators reported an outage in the past three years, but this compares with 60 percent in 2022, 69 percent in 2021, and 78 percent in 2020. Just 9 percent of reported incidents during 2024 were classified as serious or severe, which is the lowest level yet recorded.

But preventing human error remains one of the major stumbling blocks in datacenter operations. Uptime says it views human error as a contributing factor rather than a root cause in outages, though it directly or indirectly plays a part in most of them.

Code changes, for example, played a part in several recent Microsoft incidents, such as problems with Azure cloud services in January and a Microsoft 365 outage in March.

Nearly 40 percent of organizations have suffered a major outage caused by human error over the past three years, the report says. Staff failing to follow procedures was a feature in 58 percent of those cases, with faulty processes or procedures to blame in 45 percent.

The problem is also on the increase, with the proportion of human error-related outages caused by failing to follow procedures up by 10 percentage points from last year. The reason for this may be the rapid growth seen by the datacenter industry recently and the resulting staff shortages in many regions, Uptime suggests.

To combat this, a greater focus on staff training and real-time operational support may reduce risks more effectively than improving documentation and processes, although these are still important.

Backing this up, 80 percent of operators told Uptime they believe that better management and processes might have prevented their organization's most recent downtime disaster.

Power-related issues remain the leading cause of major outages. These account for more than half of all cases, while more than one in four respondents to the 2025 Uptime resiliency survey reported that a serious or severe IT outage was caused by a power glitch within the past three years.

The most frequent factor in these is UPS failure – something that recently led to a six-hour blackout at Google Cloud services in the US east zone.

Other elements in the power chain can also cause issues, such as intermittent faults in the supply and mismanaged or misconfigured failover to generators.

Grid instability is also listed as a growing concern by Uptime. Rising demand, aging infrastructure, extreme weather, and the variability of renewable energy sources may increase the frequency of power disruptions - making robust on-site systems even more essential. Datacenters near London's Heathrow airport managed to remain operational in March despite a power outage that closed the airport and disrupted a large number of flights.

Overall, it is a success story: investments in resiliency and the diligence of operators have led to a reduction in the severity and frequency of outages relative to the overall growth in online services.

However, Uptime warns the rising complexity of these environments, driven by AI, automation, and integration between IT and OT systems, is increasing exposure to operational errors and cybersecurity threats. ®
