GitHub is just like all of us: The week has just started but it needed 4 whole hours of downtime
Plus: New availability report highlights database issues
GitHub marked the start of the week with more than four hours of downtime, as GitHub Issues, Actions, Pages, Packages and API requests all reported "degraded performance."
A problem on the world's most popular code repository and developer collaboration site was first reported around 05:00 UK time (04:00 UTC) this morning and was resolved at 09:30 UK time (08:30 UTC). Basic Git operations were not affected.
GitHub, on the whole, is a relatively reliable site but the impact of downtime is considerable because of its wide use and critical importance. The site has over 44 million users and over 100 million repositories (nearly 34 million of which are public).
The last major outage before today was on 29 June, preceded by incidents on 19 June and on 22 and 23 May. For such a key service, that is not a great recent track record. “You are a dependency to our systems and if this keeps happening, many will say goodbye,” said developer Emad Mokhtar on Twitter.
What is going on? The Microsoft-owned company has not always been great at releasing prompt post-mortem explanations, but according to a post last week a new Availability Report will now be published on the first Wednesday of each month, so we can expect an explanation of today’s blip by 5 August, if not before.
In the same post, GitHub reported on what went wrong in May and June. It turns out that database issues are the most common problem. On May 5, “a shared database table’s auto-incrementing ID column exceeded the size that can be represented by the MySQL Integer type,” said GitHub’s SVP of engineering, Keith Ballinger.
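For readers wondering how an ID column "runs out", the arithmetic is simple. The sketch below is illustrative only: the helper names `max_id` and `headroom` are hypothetical, and GitHub's report as quoted here does not say whether the column was signed or unsigned, so both cases are shown. The storage widths (4 bytes for INT, 8 for BIGINT) are MySQL's documented sizes.

```python
# Storage widths in bytes for two MySQL integer types.
MYSQL_INT_BYTES = {"INT": 4, "BIGINT": 8}


def max_id(column_type: str, unsigned: bool = False) -> int:
    """Largest value an auto-incrementing column of this type can hold."""
    bits = MYSQL_INT_BYTES[column_type] * 8
    # A signed column gives up one bit to the sign.
    return 2 ** bits - 1 if unsigned else 2 ** (bits - 1) - 1


def headroom(current_id: int, column_type: str, unsigned: bool = False) -> float:
    """Fraction of the ID space still unused (hypothetical monitoring helper)."""
    return 1 - current_id / max_id(column_type, unsigned)


if __name__ == "__main__":
    print(max_id("INT"))                 # signed INT ceiling
    print(max_id("INT", unsigned=True))  # unsigned INT ceiling
    print(max_id("BIGINT"))              # signed BIGINT ceiling
    # Assumed example: a table whose latest ID is two billion.
    print(f"{headroom(2_000_000_000, 'INT'):.1%} of signed INT range left")
```

A signed INT tops out at a little over 2.1 billion rows' worth of IDs. Once the counter reaches that ceiling, the column cannot represent the next ID, which is why a wider type such as BIGINT is the usual remedy.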
May 22 was another bad day for the company’s MySQL servers. A primary MySQL instance was failed over for planned maintenance, but the newly promoted instance crashed after six seconds. “We manually redirected traffic back to the original primary,” said Ballinger. Recovering the six seconds of writes to the crashed instance, though, caused delays. “A restore of replicas from the new primary was initiated which took approximately four hours with a further hour for cluster reconfiguration to re-enable full read capacity,” he added.
On 19 June, Ballinger said, a dependency introduced in order to “better instrument A/B experimentation” caused “site-wide application errors for a percentage of users enrolled in the experiment.”
June 29th saw another MySQL-related issue. GitHub had recently updated its ProxySQL service to a new version. “The primary MySQL node on one of our main database clusters failed and was replaced automatically by a new host,” Ballinger said. “Within seconds, the newly promoted primary crashed.” Sounds familiar.
“After we recovered service manually, the new primary became CPU starved and crashed again. A new primary was promoted which also crashed shortly thereafter.”
The fix, it turned out, was to roll back to the previous version of ProxySQL.
The above reports suggest that GitHub’s reliability issues are not related to changes introduced as a result of Microsoft’s 2018 acquisition of the company, even though outages seem, if anything, to have become somewhat more frequent since then.
The correlation may have more to do with continued user growth, and additional features that increase the load. The core issue, judging from past reports, is database server reliability at massive scale – something which (in theory) Microsoft’s Azure expertise could help with. However, as the reports themselves illustrate, making changes to improve the service can itself be a downtime risk. ®