Updated Programmers, your snow day is well and truly over: GitHub's website has finally cleared its 24-hour outage, and reckons everything is operating normally again.
Last year, CEO Chris Wanstrath said the company was shooting for zero downtime, or at least five-nines of uptime. Well, that comes out at about five minutes of non-service per year, and it's safe to say we've blown past that.
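For anyone who wants to check the sums, here is a minimal sketch of the five-nines arithmetic. The function name and the 365.25-day year are our own assumptions for illustration, not anything GitHub has published.

```python
# Assumption: an average year of 365.25 days; a 365-day year gives a near-identical answer.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(nines: int) -> float:
    """Return the downtime allowed per year, in minutes, for N nines of uptime.

    Five nines means 99.999% availability, i.e. an unavailable fraction of 10**-5.
    """
    unavailable_fraction = 10 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


# Yearly allowance at five nines of uptime.
budget = downtime_budget_minutes(5)

# The outage described in this article: roughly 24 hours.
outage_minutes = 24 * 60

# How many years of five-nines allowance a single outage of that length consumes.
years_of_budget = outage_minutes / budget

print(f"Five-nines budget: {budget:.2f} minutes per year")
print(f"A 24-hour outage uses about {years_of_budget:.0f} years of that budget")
```

On those assumptions the yearly allowance comes out a little over five minutes, so a single day-long outage burns through well over two centuries' worth of it.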
Microsoft's hoped-for $7.5bn acquisition first reported “elevated error rates” on its website at 4pm US Pacific Time on Sunday, followed by intermittent service until an “all clear” message arrived almost exactly 24 hours later at 4pm PT Monday. For UK users, that was an all-of-Monday outage, since it ran 11pm Sunday to 11pm Monday; Australian users barely had time to log in on Monday before the site tripped over at 10am.
As we reported yesterday, the backend git services were working, but the website was frozen in time, serving out-of-date code repos and ignoring submitted material, Gists, and bug reports.
The collapse was attributed to a data storage system that died, understood to be one or more MySQL database servers. Now that things have returned to normal, GitHub's incident report has explained the problems, which started with “a network partition and subsequent database failure resulting in inconsistent information being presented on our website.”
To stop the errors propagating to repositories, GitHub said it decided to pause webhook events “and other internal processing systems.” That much worked, at least: the incident report claimed the outage “only impacted website metadata stored in our MySQL databases, such as issues and pull requests. Git repository data remains unaffected and has been available throughout the incident.”
An hour after the databases started playing up, admins tried failing over to a backup data storage system, but that didn't work. Three hours after the site went off the rails, the status message changed from announcing a migration of the data storage systems to “we continue to work to repair” the knackered storage backend.
The restoration of its databases took “longer than we anticipated,” and after that, the repair work had to be validated, and a huge backlog of events – Pages builds and webhooks, for instance – had to be processed.
The last “error” message was posted 1523 Monday PT (2223 UTC, 0923 Tuesday AEDT), and the welcome “everything operating normally” arrived 40 minutes later. ®
Updated to add
The MySQL database meltdown was sparked by a dodgy network link, which left data stores in inconsistent states.