How a power blip briefly broke GitHub's boxes and tripped it offline

git-blame -l -t

Exactly how a momentary power failure managed to trigger a two-hour GitHub outage has been revealed in full today.

The popular and widely used source-code-hosting service fell off the internet last Wednesday, and soon after blamed the downtime on a "brief power disruption at our primary data center" that "caused a cascading failure." For those who keep the lights on in server warehouses and want to know what went wrong, here's a summary of how it went down, literally:

  1. At 0023 UTC on January 28, the power supply equipment in GitHub's main data center suffered a brief disruption that caused 25 per cent of the website's machines and networking gear to reboot. This triggered a bunch of alerts to on-call engineers.
  2. Load-balancing devices and front-end application servers largely managed to stay up, but couldn't reach the backend systems that remained unavailable after the reboot. As a result, users were served the unicorn-of-fail page by the public-facing web servers.
  3. The internal chat system used by GitHub staff was also knackered by the power blip, hampering attempts to organize a recovery for a short while. For that reason, engineers were late to raise the alarm.
  4. The team gradually worked out that some systems had rebooted: some servers' uptimes were in the minutes. Meanwhile, some backend database machines had disappeared, and app servers relying on them were failing to start.
  5. All of the offline Redis database machines used a particular hardware spec, and were spread out along rows of racks of servers. Connecting to their serial consoles revealed they had died during boot up as their physical drives were no longer recognized by the firmware. Gulp. Technicians had to manually disconnect the boxes from their power supply, plug them back in again, and turn them on in an attempt to restore them.
  6. Meanwhile, another team was trying to rebuild the missing Redis clusters on a second set of machines, an effort hampered by the fact that vital information was stuck on the dead hardware. Eventually, the standby Redis servers were up and running without any data loss, which allowed the app servers to start up properly.
  7. Two hours and six minutes after the start of the outage, the website recovered.
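
For those who want to picture the detective work in step 4, here is a minimal Python sketch of sorting a fleet into machines that recently rebooted and machines that never came back. It is not GitHub's tooling: the host names, the 30-minute threshold, and the idea of collecting uptimes into a dictionary are all assumptions made for illustration.

```python
from datetime import timedelta

# Hypothetical threshold: anything up for less than this is treated as
# having rebooted during the incident. Not a figure from GitHub.
RECENT_REBOOT = timedelta(minutes=30)


def recently_rebooted(uptimes, threshold=RECENT_REBOOT):
    """Return the sorted names of hosts whose uptime is below the threshold.

    `uptimes` maps a hostname to its uptime in seconds, or to None when the
    host could not be reached at all.
    """
    limit = threshold.total_seconds()
    return sorted(h for h, up in uptimes.items() if up is not None and up < limit)


def unreachable(uptimes):
    """Return the sorted names of hosts that did not report any uptime."""
    return sorted(h for h, up in uptimes.items() if up is None)


# Made-up fleet snapshot for illustration only.
fleet = {
    "app-01": 86_400 * 40,  # up for 40 days: never lost power
    "app-02": 300,          # up for five minutes: rebooted
    "redis-01": None,       # no answer: still down
}
print(recently_rebooted(fleet))
print(unreachable(fleet))
```

Keeping the two lists separate mirrors the article's timeline: short uptimes explain the flood of alerts, while silent hosts are the ones that need someone on the serial console.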

"We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, but we can take steps to ensure recovery occurs in a fast and reliable manner," GitHub engineer Scott Sanders wrote in a blog post on Wednesday explaining the cascade of failures. "We can also take steps to mitigate the negative impact of these events on our users.

"We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet. Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment."

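The firmware-tracking idea Sanders describes could look something like the Python sketch below, which compares the firmware a fleet is running against the newest version on offer and produces one issue title per outdated model. GitHub has not published its tooling, so the function names, the hardware model names, the dotted version format, and the wording of the issue titles are all hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.10.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def firmware_issues(installed, latest):
    """Return one issue title per hardware model running outdated firmware.

    `installed` maps a hardware model to the firmware version in the fleet;
    `latest` maps a model to the newest version the vendor has published.
    """
    titles = []
    for model in sorted(installed):
        current = installed[model]
        newest = latest.get(model)
        if newest is None:
            continue  # no vendor data for this model, nothing to compare
        if parse_version(newest) > parse_version(current):
            titles.append(
                f"Review firmware {newest} for {model} (fleet runs {current})"
            )
    return titles


# Made-up inventory for illustration only.
installed = {"storage-node-a": "2.9.0", "app-node-b": "1.4.2"}
latest = {"storage-node-a": "2.10.1", "app-node-b": "1.4.2"}
for title in firmware_issues(installed, latest):
    print(title)
```

Versions are compared as tuples of integers rather than as strings, because a plain string comparison would wrongly rank "2.9.0" above "2.10.1". Actually filing the issue is left out, since that depends on whichever tracker a team uses.
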
Sanders also said work will be carried out to make GitHub's app servers more resilient the next time its backend systems fall over.
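
GitHub has not said how it will do that. One common pattern, sketched here in Python purely as an assumption, is to retry the backend connection with exponential backoff at boot and then start in a degraded mode rather than refusing to start at all. The `connect` callable, the `BackendUnavailable` exception, and the retry settings are hypothetical names invented for this example.

```python
import time


class BackendUnavailable(Exception):
    """Raised by the hypothetical connect() callable when the backend is down."""


def start_app(connect, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try to reach a backend at boot, falling back to degraded mode.

    Returns ("ok", connection) on success, or ("degraded", None) if every
    attempt fails, so the process still starts instead of crashing.
    """
    for attempt in range(attempts):
        try:
            return "ok", connect()
        except BackendUnavailable:
            if attempt < attempts - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... between attempts.
                sleep(base_delay * 2 ** attempt)
    return "degraded", None


def always_down():
    """Stand-in for a backend that never answers."""
    raise BackendUnavailable("backend not reachable")


# Skip the real waiting so the example finishes immediately.
state, connection = start_app(always_down, sleep=lambda seconds: None)
print(state)
```

Passing `sleep` in as a parameter keeps the sketch easy to test; a real service would also need to decide which features can safely run without the backend.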

"All of us at GitHub would like to apologize for the impact of this outage," he added. ®

Biting the hand that feeds IT © 1998–2021