Google Cloud Engine outage caused by 'large backlog of queued mutations'
Ad giant added memory to servers, restarted, watched things get worse ... is on top of things again now
A 14-hour Google cloud platform outage that we missed in the shadow of last week's G Suite outage was caused by a failure to scale, an internal investigation has shown.
The outage, which occurred on 26 March, brought down Google's cloud services in multiple regions, including Dataflow, Big Query, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, and Cloud Console. The systems were affected for a total of 14 hours.
The outage was caused by a lack of memory in the company's cache servers, according to an internal investigation by the company published today. "The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time," the investigation said.
"The processing of the backlog was degraded by a latent issue with the cache servers, which led to them running out of memory; this in turn resulted in requests to IAM timing out. The problem was temporarily exacerbated in various regions by emergency rollouts performed to mitigate the high memory usage."
Google resolved the issue by installing more memory into the cache servers and restarting them. But by this point, a heap of stale data had built up, which led to further issues which system engineers had to battle with for several more hours. The systems were back up and operating at 05:55AM UTC the following morning.
In response to the issues, Google said that it is "ensuring that the cache servers can handle bulk updates of the kind which triggered this incident" and that "efforts are underway to optimize the memory usage and protections on the cache servers, and allow emergency configuration changes without requiring restarts."
"To allow us to mitigate data staleness issues more quickly in future, we will also be sharding out the database batch processing to allow for parallelization and more frequent runs. We understand how important regional reliability is for our users and apologize for this incident." ®