Code crash? Russian hackers? Nope. Good ol' broken fiber cables borked Google Cloud's networking today

Connectivity to us-east1 knackered for hours, still no fix


Fiber-optic cables linking Google Cloud servers in its us-east1 region physically broke today, slowing down or effectively cutting off connectivity with the outside world.

For at least the past nine hours, and counting, netizens and applications have struggled to connect to systems and services hosted in the region, located on America's East Coast. Developers and system admins have been forced to migrate workloads to other regions, or redirect traffic, in order to keep apps and websites ticking over amid mitigations deployed by the Silicon Valley giant.

Starting at 0755 PDT (1455 UTC) today, according to Google, the search giant was "experiencing external connectivity loss for all us-east1 zones and traffic between us-east1, and other regions has approximately 10% loss."

By 0900 PDT, Google revealed the extent of the blunder: its cloud platform had "lost multiple independent fiber links within us-east1 zone." The fiber provider, we're told, "has been notified and are currently investigating the issue. In order to restore service, we have reduced our network usage and prioritised customer workloads."

By that, we understand, Google means it redirected traffic destined for its own Google.com services hosted in the datacenter region to other locations, allowing the remaining connectivity to carry customer packets.

By midday, Pacific Time, Google updated its status pages to note: "Mitigation work is currently underway by our engineering team to address the issue with Google Cloud Networking and Load Balancing in us-east1. The rate of errors is decreasing, however some users may still notice elevated latency."

However, at time of writing, the physically damaged cabling is not yet fully repaired, and us-east1 networking is thus still knackered. In fact, repairs could take as much as 24 hours to complete. The latest update, posted 1600 PDT, reads as follows:

The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles serving network paths in us-east1, and we expect a full resolution within the next 24 hours.

In the meantime, we are electively rerouting traffic to ensure that customers' services will continue to operate reliably until the affected fiber paths are repaired. Some customers may observe elevated latency during this period.

Customers using Google Cloud's Load Balancing service will automatically fail over to other regions, if so configured, minimizing the impact on their workloads, it is claimed. They can also migrate to, say, us-east4, though they may have to rejig their code and scripts to reference the new region, as sketched below.
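For illustration, here is a minimal sketch of that kind of region repointing, assuming the google-cloud-compute Python client library; the project ID and zone names are placeholders for this example, not details confirmed by Google or any affected customer:

    # A minimal sketch, assuming the google-cloud-compute client library and a
    # hypothetical project ID; only the zone strings relate to this outage.
    from google.cloud import compute_v1

    client = compute_v1.InstancesClient()

    # Previously this might have referenced a us-east1 zone, e.g. zone="us-east1-b".
    # Repointing a workload means updating each such hard-coded region or zone
    # reference in code, scripts, and deployment configs.
    for instance in client.list(project="my-project", zone="us-east4-a"):
        print(instance.name, instance.status)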

The Register asked Google for more details about the damaged fiber, such as how it happened. A spokesperson told us exactly what was already on the aforequoted status pages.

Meanwhile, a Google Cloud subscriber wrote a little ditty about the outage to the tune of Pink Floyd's Another Brick in the Wall. It starts: "We don't need no cloud computing..." ®
