Google cloud wobbles as workers patch wrong routers

Is 'Sorry about that, we promise to learn from our mistakes' any way to run a cloud?

Tue 1 Mar 2016 // 00:55 UTC

Add another SNAFU to the long list of Google cloud wobbles caused by human error: this time The Alphabet subsidiary decided to patch the wrong routers.

The wobble wasn't a big one: it lasted just 46 minutes and only hit Google Compute Engine Instances in the us-central1-f zone. Of course it wasn't minor if yours was one of the 25 per cent of network flows that just didn't make it into or out of that region.

Here's Google's explanation of the problem's root cause:

“Google follows a gradual rollout process for all new releases. As part of this process, Google network engineers modified a configuration setting on a group of network switches within the us-central1-f zone. The update was applied correctly to one group of switches, but, due to human error, it was also applied to some switches which were outside the target group and of a different type. The configuration was not correct for them and caused them to drop part of their traffic.

Face, meet palm. Repeat.

Google says “The traffic loss was detected by automated monitoring, which stopped the misconfiguration from propagating further, and alerted Google network engineers.” Which is just what you'd expect in this age of automation. Sadly, however, “Conflicting signals from our monitoring infrastructure caused some initial delay in correctly diagnosing the affected switches.”

“This caused the incident to last longer than it should have.”

El Reg's cloud desk keeps an eye on status feeds from the big four clouds – AWS, Google, SoftLayer and Azure – and Google's is by far the busiest. A great many of the reports concern human error.

It could well be that rival clouds aren't as forthcoming with reports of messes like this, and that the stream of SNAFUs Google reports is a sign of commendable openness and transparency.

Or they could be signs of immature processes. Google's notice about the incident says “To prevent recurrence of this issue, Google network engineers are refining configuration management policies to enforce isolated changes which are specific to the various switch types in the network. We are also reviewing and adjusting our monitoring signals in order to lower our response times.”

In related news, Google's cloudy operation has announced tweaks to Cloud Trace, a tool it offers to diagnose problems in cloudy applications. ®

More about

COMMENTS

TIP US OFF

Send us news

Topics

Special Features

Vendor Voice

Resources

SaaS

Google cloud wobbles as workers patch wrong routers

Is 'Sorry about that, we promise to learn from our mistakes' any way to run a cloud?

More about

TIP US OFF

Other stories you might like

Flaws in Chinese keyboard apps leave 750 million users open to snooping, researchers claim

Atlassian loses half its CEOs, but customers stay solid after Server products exit support

Intel excited by PC sales pop and GPU prospects, but investors aren’t because the outlook is poor

Reducing the cloud security overhead

What's up with Alphabet and Microsoft lately? Profits, sales – and AI costs

Amazon to blow $11B on cluster of Indiana bit barns

Cops cuff man for allegedly framing colleague with AI-generated hate speech clip

Ring dinged for $5.6M after, among other claims, rogue insider spied on 'pretty girls'

ByteDance 'would rather' torpedo TikTok than sell it off

FCC votes 3-2 to bring net neutrality back from the dead

Detecting drift and dealing with the Silicon Valley mindset

Two cuffed in Samourai Wallet crypto dirty money sting

About Us

Our Websites

Your Privacy