Google Cloud caused outage by ignoring its usual code quality protections

Promises operational change – and improvements to customer comms when it crashes

Google Cloud has explained the massive outage it created last week and, as has happened many times previously, admitted that it broke itself.

The outage struck last Thursday and meant that Google Cloud customers could not access their rented infrastructure for at least three hours. Among the customers impacted by the event was Cloudflare, whose services wobbled because of Google’s errors – meaning its customers also experienced disruptions.


Google’s explanation of the incident opens by informing readers that its APIs, and Google Cloud’s, are served through “our Google API management and control planes.”

Those two planes are distributed regionally and “are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints.”

The core binary in this policy check system is known as “Service Control”.

“Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers,” the post states.
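Service Control’s internals aren’t public, but the flow Google describes – look up a policy in a regional datastore, check quota, admit or reject the request – has roughly this shape. A minimal Go sketch, with every name invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for the quota and policy metadata
// Service Control reads from its regional datastore.
type Policy struct {
	Name           string
	QuotaPerMinute int
}

// RegionalStore mimics a per-region datastore whose contents are
// replicated globally (here, just an in-memory map).
type RegionalStore struct {
	policies map[string]*Policy
}

func (s *RegionalStore) Lookup(api string) (*Policy, error) {
	p, ok := s.policies[api]
	if !ok {
		return nil, errors.New("no policy for " + api)
	}
	return p, nil
}

// CheckRequest makes the admission decision: each incoming API request
// is checked against quota and policy before it reaches its endpoint.
func CheckRequest(store *RegionalStore, api string, used int) error {
	p, err := store.Lookup(api)
	if err != nil {
		return err
	}
	if used >= p.QuotaPerMinute {
		return fmt.Errorf("quota exceeded for %s", api)
	}
	return nil
}

func main() {
	store := &RegionalStore{policies: map[string]*Policy{
		"storage.googleapis.com": {Name: "default", QuotaPerMinute: 100},
	}}
	fmt.Println(CheckRequest(store, "storage.googleapis.com", 42)) // <nil>: request admitted
}
```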

On May 29, Google added a new feature to Service Control to enable “additional quota policy checks.”

“This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code,” Google’s incident report explains.

The search monopolist appears to have had concerns about this change as it “came with a red-button to turn off that particular policy serving path.”

But the change “did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.”
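Google hasn’t published the offending code, but the failure it describes – a blank field dereferenced without a check – is a textbook null (in Go, nil) pointer crash. A minimal sketch of that shape, with all names invented:

```go
package main

import "fmt"

// QuotaSpec is a hypothetical nested message the new code path read.
type QuotaSpec struct {
	Limit int
}

type Policy struct {
	Quota *QuotaSpec // nil when the field arrives blank
}

// applyQuotaPolicy mimics the unprotected path: it assumes Quota is
// always populated, so a policy with blank fields kills the process.
func applyQuotaPolicy(p *Policy) int {
	return p.Quota.Limit // nil pointer dereference on blank data
}

func main() {
	blank := &Policy{} // the "unintended blank fields" case
	fmt.Println(applyQuotaPolicy(blank)) // panic: nil pointer dereference
}
```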

Google uses feature flags to catch issues in its code.

“If this had been flag protected, the issue would have been caught in staging.”
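Here is what flag protection plus error handling might have looked like on that same hypothetical path – the flag keeps the new check switched off everywhere until staging has exercised it, and the nil check turns bad data into an error rather than a crash:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

type QuotaSpec struct {
	Limit int
}

type Policy struct {
	Quota *QuotaSpec // nil when the field arrives blank
}

// A feature flag lets the new path ship disabled, then be switched on
// in staging first, where bad data crashes nothing that matters.
var enableNewQuotaChecks = flag.Bool("enable_new_quota_checks", false,
	"gate the new quota policy check")

// applyQuotaPolicySafe treats blank fields as an error, not a crash.
func applyQuotaPolicySafe(p *Policy) (int, error) {
	if !*enableNewQuotaChecks {
		return 0, nil // flag off: the new code path never executes
	}
	if p == nil || p.Quota == nil {
		return 0, errors.New("policy has blank quota fields")
	}
	return p.Quota.Limit, nil
}

func main() {
	flag.Parse()
	limit, err := applyQuotaPolicySafe(&Policy{})
	fmt.Println(limit, err) // an error (or a no-op), never a panic
}
```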

That unprotected code sat dormant inside Google until June 12th, when the company changed a policy that contained “unintended blank fields.”

Here’s what happened next:

Service Control then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.

Google’s post states that its Site Reliability Engineering team saw and started triaging the incident within two minutes, identified the root cause within 10 minutes, and was able to commence recovery within 40 minutes.

But in some larger Google Cloud regions, “as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on … overloading the infrastructure.”

Service Control wasn’t built to handle this, which is why it took almost three hours to resolve the issue in its larger regions.
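The standard defense against that herd effect is randomized exponential backoff: restarting tasks wait increasingly long, randomly jittered intervals before retrying their dependencies, so they don’t all hit the datastore at once. A generic Go sketch of the pattern – not Google’s code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// connectWithBackoff retries a dependency call with exponential backoff
// plus random jitter, the usual defense against thundering-herd restarts:
// tasks that all died together don't all retry together.
func connectWithBackoff(dial func() error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second
	for attempt := 0; attempt < 10; attempt++ {
		if err := dial(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration up to the current cap,
		// spreading simultaneous restarts across the window.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
	return fmt.Errorf("dependency still unavailable after retries")
}

func main() {
	calls := 0
	err := connectWithBackoff(func() error {
		calls++
		if calls < 4 {
			return fmt.Errorf("datastore overloaded") // simulated overload
		}
		return nil
	})
	fmt.Println(err, "after", calls, "calls")
}
```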

The teams running Google products that went down due to this mess then had to perform their own recovery chores.

More than the usual apology

Google has promised to stop repeating the mistakes that led to this outage – as it always does.

But this time the company has also promised a couple of operational changes, namely:

We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers.

We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity.

Those initiatives are effectively an admission that Google did not provide enough info during this outage, and a promise to do something about that.

Which means it is also effectively an admission that it can’t avoid big outages. ®
