CSI GitHub: That big outage last month? It's always DNS. Or it was Kubernetes. Maybe it was a heady blend of both

'Impact was increased when a redeploy was triggered in an attempt to mitigate'


GitHub has reported on the reasons behind a severe four-and-a-half-hour outage on 13 July.

In its latest availability report, GitHub senior veep of engineering Keith Ballinger described the sequence of events that caused the issue. Kubernetes is intended to be a resilient platform, but not this time. "The incident started when our production Kubernetes Pods started getting marked as unavailable. This cascaded through our clusters resulting in a reduction in capacity, which ultimately brought down our services," he said.

A Kubernetes Pod is a group of containers. A node, which is a virtual or physical machine in a Kubernetes cluster, may run one or more Pods.

GitHub's Pod failures, said Ballinger, were caused by a single container within a Pod being terminated because it exceeded memory limits. "Even though that container is not required for production traffic to be processed, the nature of Kubernetes requires that all containers be healthy for a Pod to be marked as available," he said, so these Pods went offline.

Normally a Pod will recover by discarding the bad container and replacing it with a new one. "However, due to a routine DNS maintenance operation... our clusters were unable to successfully reach our registry resulting in Pods failing to start."
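The availability rule Ballinger describes can be sketched as a toy model in Python. This is an illustration of the logic only, not GitHub's configuration or the Kubernetes API: the `Container`, `Pod`, and `restart_container` names are hypothetical, and the model assumes that replacing a container always requires a fresh image pull from the registry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    name: str
    required_for_traffic: bool
    ready: bool = True


@dataclass
class Pod:
    containers: List[Container] = field(default_factory=list)

    def is_available(self) -> bool:
        # Every container must be ready, whether or not it serves traffic.
        return all(c.ready for c in self.containers)


def restart_container(container: Container, registry_reachable: bool) -> bool:
    """Try to replace a terminated container.

    Assumption for this sketch: the replacement needs an image pull,
    so it only succeeds when the registry can be reached.
    """
    if registry_reachable:
        container.ready = True
    return container.ready


# A Pod with one traffic-serving container and one auxiliary container.
pod = Pod([Container("app", True), Container("sidecar", False)])
pod.containers[1].ready = False  # sidecar terminated for exceeding memory limits
print(pod.is_available())  # False: the whole Pod is marked unavailable
```

In this model, a failed restart leaves the Pod unavailable even though the container that handles traffic never stopped working, which is the pattern the report describes.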

Efforts to mitigate the problem made it worse, presumably because GitHub (or its automated systems) had not cottoned on to the DNS issue. "Impact was increased when a redeploy was triggered in an attempt to mitigate, and we saw the failure start to propagate across our production clusters," Ballinger reported. "It wasn't until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services."
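The report does not say how the cached DNS records were used, so the following Python sketch shows only the general idea of falling back to a last known good address when a live lookup fails. The `resolve_with_cache` function, the injectable `lookup` parameter, and the hostname are assumptions made for illustration; only `socket.gethostbyname` is a real standard-library call.

```python
import socket
from typing import Callable, Dict


def resolve_with_cache(
    hostname: str,
    cache: Dict[str, str],
    lookup: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve hostname, falling back to the last known good address."""
    try:
        address = lookup(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError, so DNS failures land here.
        if hostname in cache:
            return cache[hostname]
        raise  # nothing cached: surface the original failure
    cache[hostname] = address  # remember the latest successful answer
    return address


def healthy_dns(hostname: str) -> str:
    # Stand-in for a working resolver (hypothetical address).
    return "10.0.0.5"


def broken_dns(hostname: str) -> str:
    # Stand-in for a resolver during a bad maintenance window.
    raise socket.gaierror("name resolution failed")


cache: Dict[str, str] = {}
resolve_with_cache("registry.example.internal", cache, healthy_dns)
# The live lookup now fails, but the cached record still answers.
print(resolve_with_cache("registry.example.internal", cache, broken_dns))
```

The trade-off is that a cached record can go stale, so a fallback like this suits recovery better than routine operation.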

The "areas to address" identified by Ballinger include enhanced monitoring, less dependency on the container image registry, better validation of DNS changes, and "reevaluating all the existing Kubernetes deployment policies" – with this last one perhaps particularly significant.

GitHub's reliability record is fair, considering its huge scale, though the 13 July outage was notable both for its length and the fact that it affected most of the company's services. It followed a two-hour-plus outage on 29 June caused by MySQL issues. One bright spot is that the Git operations that form the core functionality of GitHub were unaffected on 13 July.

The availability report is commendably frank, though it illustrates both the challenges of administering Kubernetes and the fact that self-healing systems can be prone to cascading failures. Nothing of similar impact has happened since, though only a few weeks have passed.

GitHub and Microsoft are encouraging developers to use the platform for DevOps beyond code and issue management, including continuous integration, and getting reliability right is critical to increased adoption. ®


Biting the hand that feeds IT © 1998–2020