Devops

GitHub lost a network link for 43 seconds, went TITSUP for a day

Database replication is hard

Wed 31 Oct 2018 // 00:28 UTC

A 43-second loss of connectivity on the US East Coast helped trigger GitHub's 24-hour TITSUP (Total Inability To Support User Pulls) earlier this month.

The bit bucket today published a detailed analysis of the outage, and explained that the brief loss of connectivity between its US East Coast network hub and the primary US East Coast data centre left it with an inconsistency between two MySQL databases.

The TITSUP began with planned maintenance work, GitHub's head of technology Jason Warner explained, to “replace failing 100G optical equipment”.

The brief outage between the East Coast sites, at nearly 11pm on October 21, caused problems in the organisation's complex MySQL replication architecture: writes recorded by the West Coast replica weren't present in the East Coast databases.

The post explained that to handle its huge workload, GitHub operates multiple MySQL clusters responsible for pull requests and issues, authentication, coordinating background processing, and “additional functionality beyond raw Git storage”. Its Orchestrator software manages replication and topology to keep things tidy.

When the network link dropped, Orchestrator did what it was meant to do: clusters on the East Coast were failed over so that writes were directed to the West Coast facility.

However, when connectivity returned, the system found “a brief period of writes that had not been replicated to the US West Coast facility”, and it ended up with clusters in both data centres containing writes that the other didn't have.
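To make that divergence concrete, here is a minimal Python sketch. It assumes each write can be identified by a transaction ID; the IDs and the names `diverged`, `east` and `west` are invented for illustration and do not come from GitHub's post. It shows why neither site could simply be treated as the authoritative copy.

```python
def diverged(east_writes, west_writes):
    """Return the writes each site holds that the other lacks."""
    east, west = set(east_writes), set(west_writes)
    only_east = east - west  # written in the East before failover, never replicated
    only_west = west - east  # accepted by the West after failover
    return only_east, only_west


# Hypothetical transaction IDs, invented for illustration only.
east = ["tx1", "tx2", "tx3"]         # tx3 never reached the West
west = ["tx1", "tx2", "tx4", "tx5"]  # tx4 and tx5 arrived after failover

only_east, only_west = diverged(east, west)

# True when both sides hold writes the other is missing,
# so neither data set is a superset of the other.
split = bool(only_east) and bool(only_west)
```

When `split` is true, failing back to either side without reconciliation would discard somebody's writes.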

With two vast data stores out of sync, “we were unable to fail the primary back over to the US East Coast data centre safely”.

Two minutes after the network outage, “Our internal monitoring systems began generating alerts indicating that our systems were experiencing numerous faults” (ie, “oh no, look at all those red lights”), and Orchestrator lost track of the East Coast: “Querying the Orchestrator API displayed a database replication topology that only included servers from our US West Coast data centre”.

We are code red

The suffering sysadmins locked down deployment tooling, put the system into “yellow” status, and called in an incident coordinator who almost immediately escalated the outage to “status red”.

That was when GitHub's engineers – by now, including people cursing the pagers that woke them well after 11pm – worked out the extent of the loss of sync. The West Coast had “ingested writes from our application tier” for nearly 40 minutes, and the “several seconds” of East Coast writes not replicated to the West stopped those new writes from being replicated back to the East.

That, Warner said, led to the decision to suspend webhook delivery and GitHub Pages builds: “our strategy was to prioritise data integrity over site usability and time to recovery.”

Warner added an apology for how long it took to post an update for users: the decision to block Pages builds stopped GitHub publishing its own detailed status post. “We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints”, he wrote.

By 6am on October 22, East Coast data restoration was finished and the sysadmins began replicating new data from the West Coast. Later, new clusters were run up on the East Coast so replication could catch up.

Even then, GitHub was still challenged by the amount of traffic banked up waiting for the databases. By 4pm on October 22, the webhook queue held 5 million events, and there were 80,000 Pages builds queued. “We remained in degraded status until we had completed processing the entire backlog of data and ensured that our services had clearly settled back into normal performance levels.”

With everything back to normal, Warner said, GitHub is going over MySQL logs to identify anyone who might have lost data during the brief East Coast outage (only a few thousand events at most, since one of the busiest clusters in the time window only recorded 954 writes). If the writes can't be automatically reconciled, GitHub will get in touch with repo owners.
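GitHub's post does not spell out its reconciliation rules, so the following Python sketch is an assumption about how such a pass might work, not a description of what the company did. It treats each unreplicated write as a row key and value (the keys, values and the `reconcile` function are hypothetical) and separates writes that can be replayed safely from those that conflict with a West Coast write and would need a human.

```python
def reconcile(east_only, west_only):
    """Split East-only writes into auto-replayable and manual-review groups.

    Both arguments map a row key to the value written on that side only.
    """
    auto, manual = {}, {}
    for key, value in east_only.items():
        if key in west_only and west_only[key] != value:
            # Both sites changed the same row differently: needs review.
            manual[key] = (value, west_only[key])
        else:
            # Row untouched (or identical) in the West: safe to replay.
            auto[key] = value
    return auto, manual


# Hypothetical rows, invented for illustration only.
east_only = {"issue:1": "closed", "comment:7": "typo fixed"}
west_only = {"issue:1": "reopened", "pr:3": "merged"}

auto, manual = reconcile(east_only, west_only)
```

In this example the comment edit replays cleanly, while the issue that was closed on one coast and reopened on the other is the kind of case that would end with a message to the repo owner.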

Responding to the outage will take some time, Warner wrote, but so far GitHub has reconfigured Orchestrator so database primaries aren't promoted across regional boundaries, because 60 ms of “cross-country latency was a major contributing factor”; there's a new status reporting system so users get better information than “green/orange/red”; and at the macro scale, there's an effort underway to support “N+1 redundancy at the facility level”. ®
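The no-cross-region promotion rule can be sketched in a few lines of Python. This is not Orchestrator's actual configuration or API: the `Replica` type, the region names and the least-lag tie-break are all assumptions made for illustration. The point is only that a replica in another region is never eligible, even if it is the most up to date.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    region: str
    lag_seconds: float


def pick_promotion_candidate(failed_region: str,
                             replicas: List[Replica]) -> Optional[Replica]:
    """Pick the least-lagged replica in the failed primary's own region.

    Returns None when no same-region replica exists, rather than
    promoting across a regional boundary.
    """
    same_region = [r for r in replicas if r.region == failed_region]
    if not same_region:
        return None
    return min(same_region, key=lambda r: r.lag_seconds)


# Hypothetical topology, invented for illustration only.
replicas = [
    Replica("east-1", "us-east", 0.4),
    Replica("east-2", "us-east", 0.1),
    Replica("west-1", "us-west", 0.0),  # freshest, but in the wrong region
]

chosen = pick_promotion_candidate("us-east", replicas)
no_candidate = pick_promotion_candidate("us-east", [replicas[2]])
```

Returning `None` instead of falling back to another region trades availability for consistency, which matches the priority Warner described.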
