Google has explained an August 11th brownout on its cloud as, yet again, a self-inflicted wound.
At the time of the incident Google said App Engine APIs were unavailable for a time.
It's now saying the almost-two-hour incident meant “18% of applications hosted in the US-CENTRAL region experienced error rates between 10% and 50%, and 3% of applications experienced error rates in excess of 50%. Additionally, 14% experienced error rates between 1% and 10%, and 2% experienced error rate below 1% but above baseline levels.”
Users also wore “a median latency increase of just under 0.8 seconds per request.”
Google's now revealed the root cause of the incident, which started with “... a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly.”
When Google does this sort of thing “... we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers.”
All of which sounds entirely sensible.
But while Google was draining the pool on this occasion, “a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity.”
“The server drain resulted in rescheduling of multiple instances of manually-scaled applications. App Engine creates new instances of manually-scaled applications by sending a startup request via the traffic routers to the server hosting the new instance.”
Some of those manually-scaled instances started up slowly “resulting in the App Engine system retrying the start requests multiple times which caused a spike in CPU load on the traffic routers. The overloaded traffic routers dropped some incoming requests.”
Google says it had enough routing capacity to handle the load, but that the routers weren't expecting all those retry requests. And so its cloud browned out.
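Google hasn't published the retry code in question, but the failure mode it describes is a classic retry storm: clients re-sending requests in lockstep and piling load onto an already-struggling component. The standard defence, sketched below in Python as an illustration (the function name, defaults, and parameters are all this author's invention, not Google's), is capped exponential backoff with jitter:

```python
import random

def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Yield a jittered wait time before each retry attempt.

    The ceiling doubles on every attempt (up to `cap`), and each delay
    is randomised below that ceiling, so retrying clients spread out
    instead of hammering the routers at the same instant.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling

# Example: one client's retry schedule.
for i, delay in enumerate(backoff_delays(), start=1):
    print(f"retry {i}: wait up to {delay:.2f}s")
```

Had the start-request retries been paced like this, the spike in CPU load on the traffic routers would at least have been smeared out over time rather than arriving all at once.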
The Alphabet subsidiary rolled back the update, restored services, and now promises that “In order to prevent a recurrence of this type of incident, we have added more traffic routing capacity in order to create more capacity buffer when draining servers in this region.”
“We will also change how applications are rescheduled so that the traffic routers are not called, and modify the system's retry behavior so that it cannot trigger this type of failure.”
No mention, however, of trying to schedule upgrades so it is only doing one at a time. ®