Google Cloud partially evaporates for hours amid power supply failure: Two US East Coast zones rattled
Networking, Kubernetes, storage, virtual machine systems hit by outage
Google Cloud is having a wobbly Monday. Its Kubernetes platform and networking services were partially unavailable for hours today, and its virtual-machine hosting and in-memory storage systems had a limited outage.
The web giant's Cloud Networking service fell over around 0800 PT (1500 UTC) today due to a power supply failure. Connections to virtual machines in Google's us-east1-c and us-east1-d zones started failing, and the breakdown spread to other Google Cloud services, such as its Persistent Disk product.
Here's the official word from Google at time of writing, 1300 PT (2000 UTC), some five hours on from when trouble started:
We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.
Services in us-east1-d have been fully restored. Services in us-east1-c are fully restored except for Persistent Disk which is partially restored. No ETA for full recovery of Persistent Disk yet. Impact is due to power failure. A more detailed analysis will be available at a later time.
Our engineering team is working on recovery of impacted services. We will provide an update by Monday, 2020-06-29 13:00 US/Pacific with current details. We apologize to all who are affected by the disruption.
Diagnosis: Some services in us-east1-c and us-east1-d are failing, customers impacted by this incident would likely experience a total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.
Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.
And just as we were about to publish this article, Google said it expects to fully resolve the Networking and storage issues within the hour:
The issue with Cloud Networking and Persistent Disk has been resolved for the majority of affected projects as of Monday, 2020-06-29 10:20 US/Pacific, and we expect full mitigation to occur for remaining projects within the hour.
Next up, Google's Kubernetes Engine in the us-east1-c zone took a hit at around 0715 PT (1415 UTC) and, at time of writing nearly six hours on, is still down:
We are experiencing an issue with Google Kubernetes Engine in us-east1-c where clusters may be unavailable or unreachable beginning at Monday, 2020-06-29 07:15 US/Pacific. Mitigation work is currently underway by our engineering team.
Related problems with Compute Engine virtual machines, which lasted up to an hour and a half, have, we're told, been cleared up:
The issue with Google Compute Engine in zones us-east1-c and us-east1-d where existing VMs may be unavailable or unreachable, and new VM creation may fail beginning on Monday, 2020-06-29 08:20 US/Pacific has been resolved for all affected users as of Monday, 2020-06-29 09:45 US/Pacific.
The impact to us-east1-d has been mitigated by Monday, 2020-06-29 08:45 US/Pacific. The impact to us-east1-c has been mitigated by Monday, 2020-06-29 09:45 US/Pacific.
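For readers who want to check the arithmetic behind those timestamps, here is a minimal Python sketch that turns the times in Google's Compute Engine update into per-zone durations and UTC equivalents. It assumes both zones share the 08:20 start time given in the statement and that US/Pacific on June 29, 2020 was at UTC-7 (daylight time); the helper names are ours, purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: on 2020-06-29 US/Pacific was on daylight time, i.e. UTC-7.
PDT = timezone(timedelta(hours=-7), "PDT")

def pacific(hhmm: str) -> datetime:
    """Parse an HH:MM time on 2020-06-29 as US/Pacific local time."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 6, 29, hour, minute, tzinfo=PDT)

def outage_minutes(start: str, end: str) -> int:
    """Whole minutes between two same-day US/Pacific times."""
    delta = pacific(end) - pacific(start)
    return int(delta.total_seconds() // 60)

def to_utc(hhmm: str) -> str:
    """Convert a US/Pacific HH:MM time to its UTC HH:MM equivalent."""
    return pacific(hhmm).astimezone(timezone.utc).strftime("%H:%M")

# (start, mitigated) pairs taken from the Compute Engine status update.
# Assumption: the 08:20 start applies to both zones.
compute_engine = {
    "us-east1-d": ("08:20", "08:45"),
    "us-east1-c": ("08:20", "09:45"),
}

if __name__ == "__main__":
    for zone, (start, end) in compute_engine.items():
        print(f"{zone}: {to_utc(start)}-{to_utc(end)} UTC, "
              f"{outage_minutes(start, end)} minutes")
```

By these figures, us-east1-d was affected for 25 minutes and us-east1-c for 85 minutes.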
Finally, Cloud Memorystore in us-east1-c and us-east1-d fell over around 0800 PT and was restored at 1130 PT. ®
Updated to add
As this article was published, Google updated its status boards to say its Kubernetes Engine outage has been resolved.