Off-Prem

Google: We had to shut down a datacenter to save it during London’s heatwave

Can't say what caused cooling failure, admits to re-routing traffic away from working resources

Mon 1 Aug 2022 // 06:30 UTC

Google has revealed the root cause of the outage that disrupted services at its europe-west2-a zone, based in London, during a recent heatwave.

"One of the datacenters that hosts zone europe-west2-a could not maintain a safe operating temperature due to a simultaneous failure of multiple, redundant cooling systems combined with the extraordinarily high outside temperatures," states Google's incident report.

The report doesn't explain why the cooling systems failed, but does say Google first became aware of "an issue affecting two cooling systems in one of the datacenters that hosts europe-west2-a on Tuesday, 19 July 2022 at 06:33 US/Pacific and began an investigation."

The Register has checked weather records for the day in question: just before Google noticed cooling problems – 2:20PM in London – the temperature was 102°F/39°C.

That's a level of heat that's manageable in places where datacenter designers know to expect such temperatures. But London is not such a place: July 19 was the hottest day on record in the UK capital.

Engineers began working on mitigations for the failed cooling systems at 07:02 Pacific, but their efforts were unsuccessful.

London temperatures remained above 95°F/35°C deep into the evening, and at around 6PM in London, Google engineers "powered down this part of the zone to prevent an even longer outage or damage to machines."

In other words, they shut the zone down to save it from a worse outage.

Chaos kicked in after the shutdown decision. Closing the datacenter meant "Compute Engine terminated all VMs in the impacted datacenter, representing approximately 35 percent of the VMs in the europe-west2-a zone."

Google also made a mess of trying to provide redundancy.

"At the start of the incident, we inadvertently modified traffic routing for internal services to avoid all three zones in the europe-west2 region, rather than just the impacted europe-west2-a zone."

So while only part of europe-west2-a zone was down, Google told itself to ignore working resources.

Google and other cloud vendors advise users to employ multiple zones to improve resilience. Google's error, therefore, went against its own advice.
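To make the routing slip concrete, here is a toy sketch of the difference between avoiding one zone and avoiding a whole region. It is not Google's routing system, whose internals the report does not describe: the `ZONES` table and the `routable_zones` helper are hypothetical names invented for illustration.

```python
# Hypothetical zone inventory for illustration only; the report says the
# europe-west2 region has three zones, and these names follow its -a pattern.
ZONES = {"europe-west2": ("europe-west2-a", "europe-west2-b", "europe-west2-c")}


def routable_zones(region, avoid):
    """Return the zones in `region` still eligible for traffic after exclusions."""
    return [zone for zone in ZONES[region] if zone not in avoid]


# Intended change: steer internal traffic away from the overheating zone only.
intended = routable_zones("europe-west2", avoid={"europe-west2-a"})

# What the report describes: every zone in the region was avoided.
actual = routable_zones("europe-west2", avoid=set(ZONES["europe-west2"]))

print(intended)  # ['europe-west2-b', 'europe-west2-c']
print(actual)    # []
```

With only the troubled zone excluded, two healthy zones remain to carry traffic. With the whole region excluded, none do, which is why spreading workloads across zones could not help the affected internal services.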

The cooling system came back online at 14:13 Pacific – past 10PM in London, when temperatures were still sizzling.
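The report gives its times in US/Pacific, so the London equivalents depend on a timezone conversion. The sketch below shows one way to check them with Python's standard `zoneinfo` module. It assumes the report's "US/Pacific" label maps to the IANA zone `America/Los_Angeles` and that the system has timezone data installed; the event labels are ours, not Google's.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: "US/Pacific" in the report corresponds to this IANA zone.
PACIFIC = ZoneInfo("America/Los_Angeles")
LONDON = ZoneInfo("Europe/London")

# Times on 19 July 2022 quoted from the incident report (US/Pacific).
# The labels are our own shorthand, not wording from the report.
REPORT_TIMES = {
    "cooling issue noticed": (6, 33),
    "mitigation work begins": (7, 2),
    "cooling back online": (14, 13),
}


def to_london(hour, minute):
    """Convert a 19 July 2022 US/Pacific wall-clock time to London time."""
    pacific = datetime(2022, 7, 19, hour, minute, tzinfo=PACIFIC)
    return pacific.astimezone(LONDON)


for label, (hour, minute) in REPORT_TIMES.items():
    print(f"{label}: {to_london(hour, minute):%H:%M} in London")
```

Both cities were on daylight saving time that day, so London ran eight hours ahead of the report's clock.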

"Google engineers are actively conducting a detailed analysis of the cooling system failure that triggered this incident," the report states.

The search giant and cloud aspirant has also pledged to:

Investigate and develop more advanced methods to progressively decrease the thermal load within a single datacenter space, reducing the probability that a full shutdown is required;
Examine procedures, tooling, and automated recovery systems for gaps to substantially improve recovery times in the future;
Audit cooling system equipment and standards across the datacenters that house Google Cloud globally.

The incident report also gives a detailed account of the incident's impact on Google Cloud services, and puts the duration of the outage at 18 hours, 23 minutes – plus a "long tail duration" of 35 hours, 15 minutes before things were back to normal. ®
