A 43-second loss of connectivity on the US East Coast helped trigger GitHub's 24-hour TITSUP (Total Inability To Support User Pulls) earlier this month.
The bit bucket today published a detailed analysis of the outage, and explained that the brief loss of connectivity between its US East Coast network hub and the primary US East Coast data centre left it with an inconsistency between two MySQL databases.
The brief outage between the East Coast sites, at nearly 11pm on October 21, caused problems in the organisation's complex MySQL replication architecture: writes recorded by the West Coast replica weren't present in the East Coast databases.
The post explained that to handle its huge workload, GitHub operates multiple MySQL clusters responsible for pull requests and issues, authentication, coordinating background processing, and “additional functionality beyond raw Git storage”. Its Orchestrator software manages replication and topology to keep things tidy.
When the network link dropped, Orchestrator did what it was meant to do: it failed over the East Coast clusters, directing writes to the West Coast facility.
However, when connectivity returned, the system found “a brief period of writes that had not been replicated to the US West Coast facility”, and it ended up with clusters in both data centres containing writes that the other didn't have.
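The bind GitHub describes can be illustrated as a toy set comparison: each side holds writes the other lacks, so neither can safely be promoted. This is a hypothetical sketch only; the transaction IDs and helper below are invented, and a real MySQL deployment would compare GTID sets for the same purpose.

```python
from __future__ import annotations

# Hypothetical sketch of detecting divergent writes between two sites.
# The transaction IDs are invented for illustration; real MySQL
# deployments would compare GTID sets to make the same determination.

def divergent_writes(east: set[str], west: set[str]) -> tuple[set[str], set[str]]:
    """Return the writes unique to each side; both non-empty means split brain."""
    return east - west, west - east

# A few seconds of East Coast writes never reached the West...
east = {"txn:100", "txn:101", "txn:102"}
# ...while the West kept accepting new writes after the failover.
west = {"txn:100", "txn:103", "txn:104"}

only_east, only_west = divergent_writes(east, west)
if only_east and only_west:
    # Neither side can be promoted without discarding the other's data.
    print("split brain:", sorted(only_east), sorted(only_west))
```

With both result sets non-empty, failing the primary back east would silently drop the West's writes, which is exactly the situation that forced GitHub to choose integrity over recovery time.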
With two vast data stores out of sync, “we were unable to fail the primary back over to the US East Coast data centre safely”.
Two minutes after the network outage, “Our internal monitoring systems began generating alerts indicating that our systems were experiencing numerous faults” (ie, “oh no, look at all those red lights”), and Orchestrator lost track of the East Coast: “Querying the Orchestrator API displayed a database replication topology that only included servers from our US West Coast data centre”.
We are code red
The suffering sysadmins locked down deployment tooling, put the system into “yellow” status, and called in an incident coordinator who almost immediately escalated the outage to “status red”.
That was when GitHub's engineers – by now including people cursing the pagers that woke them well after 11pm – worked out the extent of the loss of sync: for roughly 40 minutes the West Coast had “ingested writes from our application tier”, while the “several seconds” of East Coast writes that had never replicated westward blocked new writes from being accepted in the east.
That, wrote Jason Warner, GitHub's senior veep of technology, led to the decision to suspend webhook delivery and GitHub Pages builds: “our strategy was to prioritise data integrity over site usability and time to recovery.”
Warner added an apology for how long it took to post an update for users: the decision to block Pages builds stopped GitHub publishing its own detailed status post. “We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints”, he wrote.
By 6am on October 22, East Coast data restoration was finished and the sysadmins began replicating new data from the West Coast; later, new clusters were run up on the East Coast so replication could catch up.
Even then, GitHub was still challenged by the amount of traffic banked up waiting for the databases. By 4pm on October 22, the webhook service had a queue of five million events, and there were 80,000 Pages builds queued. “We remained in degraded status until we had completed processing the entire backlog of data and ensured that our services had clearly settled back into normal performance levels.”
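Working through a backlog that size is essentially a drain loop: pull queued events in batches and stay in degraded status until the queue is empty. A rough sketch, with the queue, batch size, and delivery routine all invented for illustration:

```python
from collections import deque

def deliver(event) -> None:
    """Stand-in for the real webhook dispatch; a no-op in this sketch."""
    pass

# Hypothetical backlog drain: process queued events in batches until
# the queue is empty, returning how many events were handled.
def drain(queue: deque, batch_size: int = 1000) -> int:
    handled = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for event in batch:
            deliver(event)
            handled += 1
    return handled

backlog = deque(range(5000))  # stand-in; GitHub's real queue hit five million
print(drain(backlog), "events processed")
```

At GitHub's scale the real loop would also need rate limiting and retry handling, which is why clearing the queue took hours rather than minutes.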
With everything back to normal, Warner said, GitHub is going over MySQL logs to identify anyone who might have lost data during the brief East Coast outage (only a few thousand events at most, since one of the busiest clusters in the time window only recorded 954 writes). If the writes can't be automatically reconciled, GitHub will get in touch with repo owners.
Responding to the outage will take some time, Warner wrote, but so far GitHub has reconfigured Orchestrator so database primaries aren't promoted across regional boundaries, because 60 ms of “cross-country latency was a major contributing factor”; there's a new status reporting system so users get better information than “green/orange/red”; and at the macro scale, there's an effort underway to support “N+1 redundancy at the facility level”. ®
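The Orchestrator change amounts to a region-aware promotion rule: never promote a replica outside the failed primary's region. The sketch below is hypothetical – the class, function, and host names are invented, and the real tool expresses the policy as configuration rather than code – but it captures the idea.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical sketch of a region-aware promotion rule: when a primary
# fails, only promote a replica in the same region; otherwise refuse
# and leave the decision to a human. All names here are invented.

@dataclass
class Replica:
    host: str
    region: str
    lag_seconds: float

def pick_new_primary(failed_region: str, replicas: list[Replica]) -> Replica | None:
    """Promote the least-lagged replica in the failed primary's region."""
    local = [r for r in replicas if r.region == failed_region]
    if not local:
        return None  # refuse cross-region promotion; page a human instead
    return min(local, key=lambda r: r.lag_seconds)

candidates = [
    Replica("mysql-east-2", "us-east", lag_seconds=0.4),
    Replica("mysql-west-1", "us-west", lag_seconds=0.1),
]
chosen = pick_new_primary("us-east", candidates)
```

Note the rule picks the local replica even though the West Coast one has lower lag: trading a marginally fresher candidate for the guarantee that writes never hop across 60 ms of cross-country latency.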