Google broke its own cloud AGAIN, with TWO software bugs

'VP of 24x7' apologises in person for latest TITSUP

Thu 14 Apr 2016 // 06:31 UTC

A couple of days ago Google's cloud went offline, just about everywhere, for 18 minutes. Now the Alphabet subsidiary has explained why and issued a personal apology penned by “Veep for 24x7” Benjamin Treynor Sloss.

And yes, that is Sloss' real title.

Sloss says the problem started when “engineers removed an unused Google Compute Engine (GCE) IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network.” Google announces the IP blocks it is using to the rest of the internet so that traffic can be routed into its cloud.

On this occasion, the propagation failed due to “a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management.”

When propagation fails, Google usually falls back to the configuration that was in place before the change. But on this occasion “a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.”

Google says it has a “canary step” designed to catch messes like the one described above.

But the canary had a bug “and thus the push system concluded that the new configuration was valid and began its progressive rollout.”
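For readers who want to see how the two safeguards are meant to fit together, here is a minimal sketch in Python. It is not Google's code: the function names `validate`, `canary_ok` and `choose_config`, the idea of modelling a configuration as a set of IP blocks, and the example addresses (taken from documentation ranges) are all assumptions made for illustration.

```python
from typing import FrozenSet

# Hypothetical model: a configuration is simply the set of IP blocks to announce.
Config = FrozenSet[str]


def validate(candidate: Config, second_file: Config) -> bool:
    """Hypothetical consistency check: both configuration files must agree."""
    return candidate == second_file


def canary_ok(candidate: Config) -> bool:
    """Hypothetical canary: a configuration that announces nothing is unhealthy."""
    return len(candidate) > 0


def choose_config(candidate: Config, second_file: Config,
                  last_known_good: Config) -> Config:
    """Return the configuration that should be pushed to the network."""
    if not validate(candidate, second_file):
        # Intended behaviour on a failed propagation: keep the old config.
        return last_known_good
    if not canary_ok(candidate):
        # Intended behaviour on a failed canary: stop the rollout.
        return last_known_good
    return candidate
```

In this sketch a mismatch between the two files, or an empty set of IP blocks, leaves the last known good configuration in place. The outage Sloss describes is what happens when both of those checks misbehave at once.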

Once the new configuration reached Google bit barns around the world, those that received the dud information stopped announcing their IP blocks, which made it rather hard to reach them. At first, the Google cloud kept working because traffic bound for an unreachable data centre was routed to another. But the dud IP configuration information kept moving from bit barn to bit barn, pulling each off the net in turn.
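The cascade described above can be illustrated with a toy model, again in Python. The site names, the `rollout_order` list and both functions are hypothetical; real routing is far more involved than picking the first site still standing.

```python
from typing import List, Optional


def sites_still_announcing(sites: List[str], rollout_order: List[str],
                           steps: int) -> List[str]:
    """Return the sites still announcing after `steps` sites got the bad config."""
    withdrawn = set(rollout_order[:steps])
    return [site for site in sites if site not in withdrawn]


def route(sites_up: List[str], preferred: str) -> Optional[str]:
    """Pick the preferred site if it is up, else any other site, else None."""
    if preferred in sites_up:
        return preferred
    return sites_up[0] if sites_up else None
```

With three sites and the bad configuration delivered to one of them, `route` still finds somewhere to send traffic. Once the rollout has reached all three, it returns `None`, which is the toy version of the outage.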

The rest is 18 minutes of cloud outage history.

Google says it's found the bugs in its network configuration software responsible for the mess, has killed 'em and has “14 distinct engineering changes planned spanning prevention, detection and mitigation”, with more expected to follow.

Sloss' apology follows:

“We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages. This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it. It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform.”

They're fine words, but the fact remains that Google's cloud has been felled by a typo, bungled change management, lightning, failed automation and an imperfect patch. And those problems have all happened since August 2015.
