A week after Google suffered a TITSUP*, the gang at Mountain View has published a lengthy post-mortem on what went wrong. It was a known bug in the configuration pipeline.
Things went south on Tuesday 16 November after a fault in Google's cloud infrastructure made it all too clear just how many online outfits rely on it. Users found themselves faced with errors using services such as Spotify and Etsy – sites that used the Chocolate Factory's cloud-based load balancers.
According to Google, "issues" with the Google External Proxy Load Balancing (GCLB) service started at 09:35 Pacific Time (17:35 UTC). By "issues", the company meant the dread 404 error in response to HTTP/S requests. Engineers were on the case by 09:50 PT (17:50 UTC) and had rolled back to the last known good configuration by 10:08 PT (18:08 UTC), resolving the 404 problems. However, it wasn't until 11:28 PT (19:28 UTC) before customers were allowed to make changes to their load balancing configuration as engineers worried about the problem recurring.
"The total duration of impact," said Google, "was one hour and 53 minutes."
But what had happened? It transpired that six months ago a bug was introduced into the configuration pipeline that propagates customer configuration rules to GCLB. The bug itself permitted a race condition that "in very rare cases" could push a corrupted config file to GCLB and dodge the validation checks in the pipeline.
- US Defense Department invites four cloud firms to seek contracts for JEDI replacement system
- Google Cloud partially fixes load balancer SNAFU that hit Discord, Spotify, others today
- Another brick in the (kitchen) wall: Users report frozen 1st generation Google Home Hubs
- US states' antitrust lawsuit against Google's advertising business keeps growing
An engineer found the bug on 12 November, and the team had set about fixing it via a two-pronged approach – fix the bug itself and also add some extra validation to stop such a corrupted file making it into the system. It was declared a high-priority incident, but heck – the bug had been there for months without anything exploding, so a decision was taken not to opt for a same-day emergency patch, but instead roll out the fix in a steadier manner.
What could possibly go wrong?
By 15 November, the validation patch had been rolled out. On 16 November, the rollout of the patch to fix the bug itself was a mere 30 minutes away from being completed when the law of Sod struck and, as Google put it, "the race condition did manifest in an unpatched cluster, and the outage started."
It almost sounds like an entry for Who, Me?
To make matters worse, it transpired that the validation patch didn't actually handle the error produced by the race condition, meaning that the corruption was cheerfully accepted regardless.
It's all a bit embarrassing, although let he or she who has never had that weird, one-in-a-million bug that should never happen rear its head during a client meeting cast the first stone. Then again, not many of us are responsible for the cloud infrastructure of a multi-billion dollar ad company with a show-stopping coding cockup lurking on the servers.
As for Google, it has continued to apologise for the impact on its customers and insists that its services are in tiptop shape for Black Friday and Cyber Monday's festival of tat. ®
* Terrible IT Software Undermines Purchasing