Google has explained how it took a big slab of its Euro-cloud offline last week, and as usual the problem was of its own making.
The 9 December incident was brief and contained – it started at 18:31 PT (02:31 UTC), lasted 84 minutes and only impacted the europe-west2-a zone – but meant 60 per cent of VMs in the zone were unreachable from the outside world.
Google said VM creation and deletion operations stalled during the outage, while any VMs or hosts that had hardware or other faults during the outage were not repaired and restarted onto healthy hosts.
In its incident report, the company says its software-defined networking stack consists of distributed components that run across a fleet of servers, a design intended to deliver resilience.
“To achieve this, the control plane elects a leader from a pool of machines to provide configuration to the various infrastructure components,” the explanation said.
The leader election process depends on a local instance of what Google calls its “internal lock service”. That service provides Access Control List (ACL) mechanisms to control reading and writing of various files stored in the service.
But someone or something at Google changed the ACLs so that the process that picks a leader lost access to the files required for the job.
Without a leader to drive its networks, the zone was in trouble.
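The failure mode described above can be illustrated with a toy model. This is a minimal sketch only, not Google's implementation: the `LockService` class, the `/sdn/leader` path and the `leader-elector` principal are all hypothetical names invented for illustration. It shows how an election that depends on an ACL-guarded file returns no leader at all once the ACL stops naming the electing process.

```python
LEADER_FILE = "/sdn/leader"   # hypothetical path of the file the election uses
ROLE = "leader-elector"       # hypothetical principal the election process runs as


class LockService:
    """Toy lock service: each file has an ACL and at most one holder."""

    def __init__(self, acls):
        # acls maps a file path to the principals allowed to access it.
        self.acls = {path: set(principals) for path, principals in acls.items()}
        self.holders = {}

    def try_acquire(self, path, principal, machine):
        # The ACL check comes first: no access means no chance to take the lock.
        if principal not in self.acls.get(path, set()):
            raise PermissionError(f"{principal} may not access {path}")
        # The first machine to ask becomes the holder; later callers are refused.
        holder = self.holders.setdefault(path, machine)
        return holder == machine


def elect_leader(service, machines, path=LEADER_FILE, principal=ROLE):
    """Return the machine that holds the leader lock, or None if nobody can."""
    for machine in machines:
        try:
            if service.try_acquire(path, principal, machine):
                return machine
        except PermissionError:
            continue  # this candidate is locked out by the ACL
    return None


pool = ["machine-a", "machine-b", "machine-c"]
healthy = LockService({LEADER_FILE: {ROLE}})
broken = LockService({LEADER_FILE: {"some-other-role"}})  # ACL no longer lists ROLE

print(elect_leader(healthy, pool))  # machine-a
print(elect_leader(broken, pool))   # None
```

In the sketch every candidate in the pool is equally healthy, yet the second election still fails, because the fault sits in the shared ACL rather than in any one machine.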
Failure wasn’t instant, by design. But a few minutes after the Euro-cloud couldn’t elect a leader, “BGP routing between europe-west2-a and the rest of the Google backbone network was withdrawn, resulting in isolation of the zone and inaccessibility of resources in the zone.”
Google said the reason for the outage was that its production environment “contained ACLs not present in the staging or canary environments due to those environments being rebuilt using updated processes during previous maintenance events.”
That discrepancy meant that even those “canary” rigs designed to show trouble didn’t pick up the problem, because their ACLs no longer matched those in production.
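A consistency check of the kind such a discrepancy calls for can be sketched in a few lines. This is an illustrative example under stated assumptions, not Google's tooling: the environment contents, the paths and the principal names (`leader-elector`, `legacy-elector`) are hypothetical, and ACLs are modelled simply as a mapping from file path to a set of principals.

```python
def flatten(acls):
    """Turn {path: {principal, ...}} into a set of (path, principal) pairs."""
    return {
        (path, principal)
        for path, principals in acls.items()
        for principal in principals
    }


def acl_drift(production, other):
    """Report ACL entries present in one environment but not the other."""
    prod, oth = flatten(production), flatten(other)
    return {
        "only_in_production": prod - oth,
        "only_in_other": oth - prod,
    }


# Hypothetical data: production carries an entry the rebuilt canary lacks.
production = {"/sdn/leader": {"leader-elector", "legacy-elector"}}
canary = {"/sdn/leader": {"leader-elector"}}

drift = acl_drift(production, canary)
print(drift["only_in_production"])  # {('/sdn/leader', 'legacy-elector')}
print(drift["only_in_other"])       # set()
```

Any non-empty result means a change validated in the canary environment has not been tested against the ACLs production actually holds.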
While the outage was brief, and mostly impacted access from outside the region, the incident also hit App Engine and Cloud SQL, and caused an eight-hour outage for a small proportion of Cloud VPN users.
“Once zone europe-west2-a reconnected to the network, a combination of bugs in the VPN control plane were triggered by some of the now stale VPN gateways in the zone,” Google confessed. “This caused an outage to 4.5 per cent of Classic Cloud VPN tunnels in europe-west2 for a duration of 8 hours and 10 minutes after the main disruption had recovered.”
Google has apologized and acknowledged that some customers may wish to invoke their service level agreements and seek compensation.
It has also promised to audit all network ACLs to ensure consistency and to improve resilience on those occasions its network control plane isn’t available. “Improvements in visibility to recent changes will be made to reduce the time to mitigation,” the company pledged, and “Additional observability will be added to lock service ACLs allowing for additional validation when making changes to ACLs. We are also improving the canary and release process for future changes of this type to ensure these changes are made safely.” ®