A bug introduced 6 months ago brought Google's Cloud Load Balancer to its knees

Another 30 minutes and there would have been nothing to see

A week after Google suffered a TITSUP*, the gang at Mountain View has published a lengthy post-mortem on what went wrong. It was a known bug in the configuration pipeline.

Things went south on Tuesday 16 November after a fault in Google's cloud infrastructure made it all too clear just how many online outfits rely on it. Users found themselves faced with errors using services such as Spotify and Etsy – sites that used the Chocolate Factory's cloud-based load balancers.

According to Google, "issues" with the Google External Proxy Load Balancing (GCLB) service started at 09:35 Pacific Time (17:35 UTC). By "issues", the company meant the dread 404 error in response to HTTP/S requests. Engineers were on the case by 09:50 PT (17:50 UTC) and had rolled back to the last known good configuration by 10:08 PT (18:08 UTC), resolving the 404 problems. However, it wasn't until 11:28 PT (19:28 UTC) that customers were allowed to make changes to their load balancing configuration again, as engineers worried about the problem recurring.

"The total duration of impact," said Google, "was one hour and 53 minutes."

But what had happened? It transpired that six months ago a bug was introduced into the configuration pipeline that propagates customer configuration rules to GCLB. The bug itself permitted a race condition that "in very rare cases" could push a corrupted config file to GCLB and dodge the validation checks in the pipeline.

An engineer found the bug on 12 November, and the team set about fixing it via a two-pronged approach – fix the bug itself and also add some extra validation to stop such a corrupted file making it into the system. It was declared a high-priority incident, but heck – the bug had been there for months without anything exploding, so a decision was taken not to opt for a same-day emergency patch, but instead to roll out the fix in a steadier manner.

What could possibly go wrong?

By 15 November, the validation patch had been rolled out. On 16 November, the rollout of the patch to fix the bug itself was a mere 30 minutes away from being completed when the law of Sod struck and, as Google put it, "the race condition did manifest in an unpatched cluster, and the outage started."

It almost sounds like an entry for Who, Me?

To make matters worse, it transpired that the validation patch didn't actually handle the error produced by the race condition, meaning that the corruption was cheerfully accepted regardless.
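Google's write-up contains no code, so what follows is only a toy sketch of the general failure mode, not the actual pipeline. Every name in it (`publish`, `narrow_validate`, `strict_validate`, `choose_config`, the checksum-on-the-last-line file format) is an assumption made up for illustration. It shows how a validator that checks only for anticipated errors can wave through a truncated file, while one that fails closed falls back to the last known good configuration.

```python
import hashlib
import json


def publish(config: dict) -> bytes:
    """Serialise a config and append a SHA-256 checksum line (hypothetical format)."""
    body = json.dumps(config, sort_keys=True).encode()
    digest = hashlib.sha256(body).hexdigest().encode()
    return body + b"\n" + digest


def narrow_validate(blob: bytes) -> bool:
    """Checks for one anticipated failure only: an empty file."""
    return len(blob) > 0


def strict_validate(blob: bytes) -> bool:
    """Fail closed: reject anything whose checksum or structure looks wrong."""
    body, sep, digest = blob.rpartition(b"\n")
    if not sep:
        return False  # no checksum line at all
    if hashlib.sha256(body).hexdigest().encode() != digest:
        return False  # body and checksum disagree
    try:
        parsed = json.loads(body)
    except ValueError:
        return False  # not valid JSON (or not valid UTF-8)
    return isinstance(parsed, dict) and "routes" in parsed


def choose_config(candidate: bytes, last_known_good: bytes, validate) -> bytes:
    """Serve the candidate only if it validates, else keep the last known good."""
    return candidate if validate(candidate) else last_known_good


good = publish({"routes": {"/": "backend-a"}})
# Stand-in for the race: a reader picks up a half-written file.
corrupted = good[: len(good) // 2]

print(choose_config(corrupted, good, narrow_validate) == corrupted)  # corruption accepted
print(choose_config(corrupted, good, strict_validate) == good)       # corruption rejected
```

The design point is simply that the strict check rejects anything it cannot positively verify, rather than enumerating the errors someone thought of in advance.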

It's all a bit embarrassing, although let he or she who has never had that weird, one-in-a-million bug that should never happen rear its head during a client meeting cast the first stone. Then again, not many of us are responsible for the cloud infrastructure of a multibillion-dollar ad company with a show-stopping coding cockup lurking on the servers.

As for Google, it has continued to apologise for the impact on its customers and insists that its services are in tiptop shape for Black Friday and Cyber Monday's festival of tat. ®

* Terrible IT Software Undermines Purchasing


Biting the hand that feeds IT © 1998–2022