Google Cloud’s so-called uninterruptible power supplies caused a six-hour interruption
When the power went out, they didn’t switch on
Google has revealed that a recent six-hour outage at one of its cloudy regions was caused by uninterruptible power supplies not doing their job.
The outage commenced on March 29th and caused “degraded service or unavailability” for over 20 Google Cloud services in the us-east5-c zone. The us-east5 region is centered on Columbus, Ohio.
Google’s incident report states that the outage started with “loss of utility power in the affected zone.”
Hyperscalers build to survive that sort of thing with uninterruptible power supplies (UPSes) that are supposed to immediately provide power if the grid goes dead, and keep doing so until diesel-powered generators kick in.
Google’s UPSes, however, suffered a “critical battery failure” and didn’t provide any juice. They also appear to have blocked generator power from reaching Google’s racks, because the incident report states the advertising giant’s engineers had to bypass the UPSes before power became available.
Engineers were alerted to the incident at 12:54 Pacific Time and their efforts saw generators come online at 14:49.
“The majority of Google Cloud services recovered shortly thereafter,” the incident report states, although “[a] few services experienced longer restoration times as manual actions were required in some cases to complete full recovery.”
Google is terribly sorry this happened and is “committed to preventing a repeat of this issue in the future.” To avoid similar messes, the web giant has promised to do the following:
- Harden cluster power failure and recovery path to achieve a predictable and faster time-to-serving after power is restored.
- Audit systems that did not automatically failover and close any gaps that prevented this function.
- Work with our uninterruptible power supply (UPS) vendor to understand and remediate issues in the battery backup system.
Oh, to be a fly on the wall when Google meets with that UPS vendor.
Hyperscalers promise resilience and mostly deliver, but even their plans can sometimes go awry. The lesson for the rest of us is that regular testing of all disaster recovery infrastructure and procedures, including what to do when public clouds have outages, is not optional and not something that can be put off. ®