Google has a canary problem: One clocked off and crocked its cloud

Again. So now Google's re-building dangerously centralized routing rigs

Fri 10 Feb 2017 // 01:35 UTC

Google's explained why new cloudy virtual machines in its cloud engine couldn't connect to the world for a couple of hours in January: a canary didn't fall off its perch, so the company was unaware of a problem.

The incident wasn't major: for a tick over two hours on January 30th newly created Google Compute Engine instances, Cloud VPNs and network load balancers became unavailable. The affected servers had public IP addresses, but couldn't be reached from the outside world or send any traffic. Nor would load balancing health checks work.

Google's explanation of the incident tells us a little about its infrastructure. Here's the Root Cause it's offered us all:

All inbound networking for GCE instances, load balancers and VPN tunnels enter via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated.
The issue was triggered by a large set of updates which were applied to a rarely used load balancing configuration. The application of updates to this configuration exposed an inefficient code path which resulted in the canary timing out. From this point all changes of public addressing were queued behind these changes that could not proceed past the testing phase.

Once Google's people figured out a confused canary was the problem, they “restarted the jobs responsible for programming changes to the network load balancers” and then processed the problematic changes in a batch. That sorted things out in short order.

For now, Google's “increasing the canary timeout so that updates exercising the inefficient code path merely slow network changes rather than completely stop them.”

This is not Google's first canary problem: in October 2016 one of its warning birds failed to chirp long or loud enough and bits of its cloud fell over.

Google's long-term fix is new tests to test more configurations and binning the inefficient code path. The company reckons it was already taking steps to address this sort of problem, by implementing more decentralized routing. That effort is “being accelerated as it will prevent issues with this layer having global impact.”

And of course Google says it's now building new metrics so it will be alerted to problems faster. Perhaps they need a DevOps UFO to flash news of problems? ®

More about

COMMENTS

TIP US OFF

Send us news

Topics

Special Features

Vendor Voice

Resources

SaaS

Google has a canary problem: One clocked off and crocked its cloud

Again. So now Google's re-building dangerously centralized routing rigs

More about

TIP US OFF

Other stories you might like

Your trainee just took down our business and has no idea how or why

UK unions publish AI bill to protect workers from 'risks and harms' of tech

Huawei's latest flagship smartphone contains no world-shaking silicon surprises

Protecting distributed branch office environments from ransomware

Oracle scores big win with Fujitsu Japan for its Alloy partner cloud

Meta lets Llama 3 LLM out to graze, claims it can give Google and Anthropic a kicking

US Air Force says AI-controlled F-16 fighter jet has been dogfighting with humans

Ransomware feared as IT 'issues' force Octapharma Plasma to close 150+ centers

Crooks exploit OpenMetadata holes to mine crypto – and leave a sob story for victims

Stability AI decimates staff just weeks after CEO's exit

IBM accused of cheating its own executive assistants out of overtime pay

Google fires 28 staff after sit-in protest against Israeli cloud deal ends in arrests

About Us

Our Websites

Your Privacy