GitHub has reported on the reasons behind a severe four-and-a-half-hour outage on 13 July.
In its latest availability report, GitHub senior veep of engineering Keith Ballinger described the sequence of events that caused the issue. Kubernetes is intended to be a resilient platform, but not this time. "The incident started when our production Kubernetes Pods started getting marked as unavailable. This cascaded through our clusters resulting in a reduction in capacity, which ultimately brought down our services," he said.
A Kubernetes Pod is a group of containers. A node, which is a virtual or physical machine in a Kubernetes cluster, may run one or more Pods.
GitHub's Pod failures, said Ballinger, were caused by a single container within a Pod being terminated because it exceeded memory limits. "Even though that container is not required for production traffic to be processed, the nature of Kubernetes requires that all containers be healthy for a Pod to be marked as available," he said, so these Pods went offline.
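The availability rule Ballinger describes can be sketched as a toy model. This is not Kubernetes code or GitHub's configuration: the `Container` class, its field names, and the container names "app" and "helper" are hypothetical and serve only to show why one non-essential container can take a whole Pod out of service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    name: str       # hypothetical container name
    ready: bool     # whether the container is currently healthy
    critical: bool  # whether production traffic depends on it (illustrative only)


def pod_is_available(containers: List[Container]) -> bool:
    # The Pod counts as available only if every container is ready,
    # regardless of whether a given container matters for traffic.
    return all(c.ready for c in containers)


pod = [
    Container("app", ready=True, critical=True),
    # Assumed scenario: a helper container killed for exceeding its memory limit.
    Container("helper", ready=False, critical=False),
]
print(pod_is_available(pod))
```

Note that the `critical` flag plays no part in the check, which is the point of the quote: the rule does not distinguish between containers that serve traffic and those that do not.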
Normally a Pod will recover by discarding the bad container and replacing it with a new one. "However, due to a routine DNS maintenance operation... our clusters were unable to successfully reach our registry resulting in Pods failing to start."
Efforts to mitigate the problem made it worse, presumably because GitHub (or its automated systems) had not cottoned on to the DNS issue. "Impact was increased when a redeploy was triggered in an attempt to mitigate, and we saw the failure start to propagate across our production clusters," Ballinger reported. "It wasn't until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services."
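The arithmetic behind that cascade can be illustrated with a deliberately crude model. It rests on assumptions that go beyond the report: every failed Pod needs a fresh image fetch from the registry before it can start, and a redeploy replaces every Pod at once. The function names and the Pod counts are invented for illustration.

```python
def available_pods(total: int, failed: int, registry_reachable: bool) -> int:
    # A failed Pod only comes back if its container image can be fetched.
    recovered = failed if registry_reachable else 0
    return total - failed + recovered


def after_redeploy(total: int, registry_reachable: bool) -> int:
    # Crude worst case: a redeploy replaces every Pod, so all of them
    # count as "failed" until their replacements have started.
    return available_pods(total, failed=total, registry_reachable=registry_reachable)


# Hypothetical cluster of 100 Pods, 10 of which lost a container.
print(available_pods(100, 10, registry_reachable=True))
print(available_pods(100, 10, registry_reachable=False))
print(after_redeploy(100, registry_reachable=False))
```

Under these assumptions a registry outage turns a partial loss of capacity into a permanent one, and a redeploy issued while the registry is unreachable removes the remaining healthy Pods as well.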
The "areas to address" identified by Ballinger include enhanced monitoring, less dependency on the container image registry, better validation of DNS changes, and "reevaluating all the existing Kubernetes deployment policies" – with this last one perhaps particularly significant.
GitHub's reliability record is fair, considering its huge scale, though the 13 July outage was notable both for its length and for the fact that it affected most of the company's services. It followed a two-hour-plus outage on 29 June caused by MySQL issues. One bright spot is that the Git operations that form the core functionality of GitHub were unaffected on 13 July.
The availability report is commendably frank, though it also illustrates both the challenges of administering Kubernetes and the way self-healing systems can be prone to cascading failures. Nothing of a similar impact has happened since, though only a few weeks have passed.
GitHub and Microsoft are encouraging developers to use the platform for DevOps beyond code and issue management, including continuous integration, and getting reliability right is critical to increased adoption. ®