
Google cloud wobbles as workers patch wrong routers

Is 'Sorry about that, we promise to learn from our mistakes' any way to run a cloud?

Add another SNAFU to the long list of Google cloud wobbles caused by human error: this time the Alphabet subsidiary patched the wrong routers.

The wobble wasn't a big one: it lasted just 46 minutes and only hit Google Compute Engine instances in the us-central1-f zone. Of course it wasn't minor if yours was one of the 25 per cent of network flows that just didn't make it into or out of that zone.

Here's Google's explanation of the problem's root cause:

“Google follows a gradual rollout process for all new releases. As part of this process, Google network engineers modified a configuration setting on a group of network switches within the us-central1-f zone. The update was applied correctly to one group of switches, but, due to human error, it was also applied to some switches which were outside the target group and of a different type. The configuration was not correct for them and caused them to drop part of their traffic.”

Face, meet palm. Repeat.

Google says “The traffic loss was detected by automated monitoring, which stopped the misconfiguration from propagating further, and alerted Google network engineers.” Which is just what you'd expect in this age of automation. Sadly, however, “Conflicting signals from our monitoring infrastructure caused some initial delay in correctly diagnosing the affected switches.”

“This caused the incident to last longer than it should have.”
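For the curious, the rollout pattern Google describes boils down to: push a change to one group of devices at a time, and let automated monitoring halt propagation the moment traffic loss crosses a threshold. Here's a minimal Python sketch of that idea; every name and helper below is invented for illustration, and it is emphatically not Google's actual tooling.

```python
LOSS_THRESHOLD = 0.01  # halt if more than 1% of a group's flows are dropped


def apply_config(config: dict, group: str) -> None:
    """Stand-in for pushing a config to one group of switches (hypothetical)."""
    print(f"applied {config['id']} to {group}")


def measure_loss(group: str) -> float:
    """Stand-in for a monitoring query returning the group's flow-loss rate."""
    return 0.0  # pretend everything is healthy


def gradual_rollout(config: dict, groups: list[str]) -> list[str]:
    """Apply the config group by group; stop at the first sign of trouble."""
    done = []
    for group in groups:
        apply_config(config, group)
        done.append(group)
        if measure_loss(group) > LOSS_THRESHOLD:
            # Stop the misconfiguration from propagating further and page humans.
            print(f"traffic loss in {group}: halting rollout, alerting engineers")
            break
    return done


if __name__ == "__main__":
    gradual_rollout({"id": "ecmp-tweak-42"},
                    ["us-central1-f/rack-1", "us-central1-f/rack-2"])
```

As the incident shows, the halt is only as quick as the diagnosis: conflicting monitoring signals delay the "which switches?" answer even when the automation fires on time.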

El Reg's cloud desk keeps an eye on status feeds from the big four clouds – AWS, Google, SoftLayer and Azure – and Google's is by far the busiest. A great many of the reports concern human error.

It could well be that rival clouds aren't as forthcoming with reports of messes like this, and that the stream of SNAFUs Google reports is a sign of commendable openness and transparency.

Or it could be a sign of immature processes. Google's notice about the incident says “To prevent recurrence of this issue, Google network engineers are refining configuration management policies to enforce isolated changes which are specific to the various switch types in the network. We are also reviewing and adjusting our monitoring signals in order to lower our response times.”
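The promised fix maps to a simple guard rail: refuse to apply a change to any device whose type doesn't match the type the change was written for. Again, a hedged sketch with made-up names and models, not Google's actual configuration system:

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    model: str  # e.g. "tor-gen3" or "spine-gen2" (invented model names)


@dataclass
class ConfigChange:
    change_id: str
    target_model: str  # the only switch model this change is valid for


def enforce_isolation(change: ConfigChange,
                      switches: list[Switch]) -> list[Switch]:
    """Return the switches the change may touch; reject everything else."""
    rejected = [s for s in switches if s.model != change.target_model]
    if rejected:
        names = ", ".join(s.name for s in rejected)
        raise ValueError(
            f"{change.change_id} is written for {change.target_model}; "
            f"refusing to touch {names}")
    return switches


# Usage: mixing a spine switch into a top-of-rack change now fails loudly
# before deployment, instead of silently dropping traffic afterwards.
fleet = [Switch("sw-1", "tor-gen3"), Switch("sw-2", "spine-gen2")]
try:
    enforce_isolation(ConfigChange("ecmp-tweak-42", "tor-gen3"), fleet)
except ValueError as err:
    print(err)
```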

In related news, Google's cloudy operation has announced tweaks to Cloud Trace, a tool it offers to diagnose problems in cloudy applications. ®
