CSI GitHub: That big outage last month? It's always DNS. Or it was Kubernetes. Maybe it was a heady blend of both

'Impact was increased when a redeploy was triggered in an attempt to mitigate'


GitHub has reported on the reasons behind a severe four-and-a-half-hour outage on 13 July.

In its latest availability report, GitHub senior veep of engineering Keith Ballinger described the sequence of events that caused the issue. Kubernetes is intended to be a resilient platform, but not this time. "The incident started when our production Kubernetes Pods started getting marked as unavailable. This cascaded through our clusters resulting in a reduction in capacity, which ultimately brought down our services," he said.

A Kubernetes Pod is a group of one or more containers that are deployed and scheduled together. A node, which is a virtual or physical machine in a Kubernetes cluster, may run one or more Pods.

GitHub's Pod failures, said Ballinger, were caused by a single container within a Pod being terminated because it exceeded memory limits. "Even though that container is not required for production traffic to be processed, the nature of Kubernetes requires that all containers be healthy for a Pod to be marked as available," he said, so these Pods went offline.
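For readers unfamiliar with the mechanics, the following sketch uses the official Kubernetes Python client to define a Pod in which the main container serves traffic alongside a sidecar that carries a memory limit. The image names, container roles, and the 128Mi limit are illustrative assumptions, not GitHub's actual configuration. The point it shows is the behaviour Ballinger describes: if the sidecar is killed for exceeding its memory limit, the Pod as a whole stops being reported as Ready, even though the traffic-serving container is untouched.

from kubernetes import client, config

# Connect using the local kubeconfig; inside a cluster,
# config.load_incluster_config() would be used instead.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="web-with-sidecar"),  # hypothetical name
    spec=client.V1PodSpec(
        containers=[
            # Hypothetical main container: the one that serves production traffic.
            client.V1Container(
                name="app",
                image="registry.example.com/app:latest",
            ),
            # Hypothetical sidecar: not needed to serve traffic, but if it is
            # OOM-killed for exceeding its memory limit, the Pod as a whole is
            # no longer marked Ready.
            client.V1Container(
                name="metrics-sidecar",
                image="registry.example.com/metrics:latest",
                resources=client.V1ResourceRequirements(
                    limits={"memory": "128Mi"},
                ),
            ),
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)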

Normally a Pod will recover by discarding the bad container and replacing it with a new one. "However, due to a routine DNS maintenance operation... our clusters were unable to successfully reach our registry resulting in Pods failing to start."
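As a rough illustration of how that failure mode surfaces (again a sketch, using a hypothetical Pod name rather than anything from GitHub's clusters), an operator could use the same Python client to inspect why replacement containers are stuck. An unreachable registry typically shows up as an ErrImagePull or ImagePullBackOff waiting reason on the container, rather than as an explicit DNS error.

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Inspect why replacement containers are not starting (hypothetical Pod name).
pod = v1.read_namespaced_pod(name="web-with-sidecar", namespace="default")
for status in pod.status.container_statuses or []:
    waiting = status.state.waiting
    if waiting:
        # When the registry hostname cannot be resolved, the reason is usually
        # "ErrImagePull" or "ImagePullBackOff"; the DNS failure is buried in
        # the accompanying message.
        print(f"{status.name}: {waiting.reason} - {waiting.message}")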

Efforts to mitigate the problem made it worse, presumably because GitHub (or its automated systems) had not cottoned on to the DNS issue. "Impact was increased when a redeploy was triggered in an attempt to mitigate, and we saw the failure start to propagate across our production clusters," Ballinger reported. "It wasn't until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services."

The "areas to address" identified by Ballinger include enhanced monitoring, less dependency on the container image registry, better validation of DNS changes, and "reevaluating all the existing Kubernetes deployment policies" – with this last one perhaps particularly significant.

GitHub's reliability record is fair, considering its huge scale, though the 13 July outage was notable both for its length and the fact that it affected most of the company's services. It followed a two-hour-plus outage on 29 June caused by MySQL issues. One bright spot is that the Git operations that form the core functionality of GitHub were unaffected on 13 July.

The availability report is commendably frank, though it also illustrates both the challenges of administering Kubernetes and the tendency of self-healing systems to suffer cascading failures. Nothing of a similar impact has happened since, though only a few weeks have passed.

GitHub and Microsoft are encouraging developers to use the platform for DevOps beyond code and issue management, including continuous integration, and getting reliability right is critical to increased adoption. ®
