Google broke its own cloud AGAIN, with TWO software bugs
'VP of 24x7' apologises in person for latest TITSUP
A couple of days ago Google's cloud went offline, just about everywhere, for 18 minutes. Now the Alphabet subsidiary has explained why and issued a personal apology penned by “Veep for 24x7” Benjamin Treynor Sloss.
And yes, that is Sloss' real title.
Sloss says the problem started when “engineers removed an unused Google Compute Engine (GCE) IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network.” Google announces the IP blocks it is using to help route traffic into its cloud.
On this occasion, the propagation failed due to “a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management.”
When propagation fails, Google usually falls back to the configuration that was in place before the change was attempted. But on this occasion “a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.”
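For the curious, the intended behaviour can be sketched in a few lines of Python. This is an illustration only, not Google's code: the function name `choose_config` and the idea of modelling each configuration file as a set of IP blocks are assumptions made for the example.

```python
from typing import FrozenSet


def choose_config(
    previous_good: FrozenSet[str],
    file_one: FrozenSet[str],
    file_two: FrozenSet[str],
) -> FrozenSet[str]:
    """Pick the set of IP blocks to push (hypothetical helper).

    file_one and file_two stand in for the two configuration files
    mentioned in the incident report.
    """
    if file_one != file_two:
        # The change has reached one file but not the other:
        # keep the previous known good configuration.
        return previous_good
    if not file_one:
        # Never push a configuration that announces nothing at all.
        return previous_good
    return file_one
```

The bug Google describes amounts to the first branch returning an empty set instead of `previous_good`.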
Google says it has a “canary step” designed to catch messes like that described above.
But the canary had a bug “and thus the push system concluded that the new configuration was valid and began its progressive rollout.”
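A canary step of the kind described might look roughly like the sketch below. Everything here is assumed for illustration: the `apply` and `is_healthy` callbacks, the site names, and the choice of the first site in the list as the canary. Google has not published its implementation.

```python
from typing import Callable, List, Sequence


def push_with_canary(
    config: Sequence[str],
    sites: List[str],
    apply: Callable[[str, Sequence[str]], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Push config to one canary site, then to the rest only if it is healthy.

    Returns the list of sites that received the configuration.
    Assumes sites is non-empty; the first entry acts as the canary.
    """
    canary, *rest = sites
    apply(canary, config)
    if not is_healthy(canary):
        # Canary failed: stop here so the bad config goes no further.
        return [canary]
    for site in rest:
        apply(site, config)
    return [canary, *rest]
```

The whole scheme depends on `is_healthy` telling the truth. In Google's case the check itself was buggy, so the rollout carried on regardless.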
Once the new configuration reached Google bit barns around the world, those that received the dud information stopped announcing their IP blocks, which made it rather hard to reach them. At first the Google cloud kept working, because traffic bound for an unreachable data centre was routed to another. But the dud IP configuration information was also moving from bit barn to bit barn, pulling them off the net.
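Why things looked fine for a while, and then very much did not, can be shown with a toy model. The function `pick_site`, the site names and the fixed preference order are all invented for this example; real internet routing is considerably more involved.

```python
from typing import Dict, List, Optional


def pick_site(announcing: Dict[str, bool], preference: List[str]) -> Optional[str]:
    """Return the first preferred site still announcing its IP blocks.

    Returns None when no site is announcing, i.e. a full outage.
    """
    for site in preference:
        if announcing.get(site, False):
            return site
    return None


# Hypothetical sites, all announcing to begin with.
state = {"site-a": True, "site-b": True, "site-c": True}
order = ["site-a", "site-b", "site-c"]

served_by = []
for failing in order:
    state[failing] = False  # the dud config arrives and the site withdraws
    served_by.append(pick_site(state, order))
```

After the loop, `served_by` records traffic shifting to the next site each time one withdraws, until nothing is left to take it.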
The rest is 18 minutes of cloud outage history.
Google says it's found the bugs in its network configuration software responsible for the mess, has killed 'em, and has “14 distinct engineering changes planned spanning prevention, detection and mitigation”, with more expected to follow.
Sloss' apology follows:
We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages. This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it. It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform.
They're fine words, but the fact remains that Google's cloud has been felled by a typo, bungled change management, lightning, failed automation and an imperfect patch. And those problems all happened since August 2015.
Incoming Google cloud boss Diane Greene has her work cut out for her. ®