GitHub explains outage string in incidents update

It was MySQL, with the resource contention, in the database cluster


Code shack GitHub is offering an explanation for a succession of lengthy outages this month: resource contention in its primary database cluster during peak loads is to blame, though more investigation is needed.

It's fair to say last week was not a fun time for either GitHub or its users. The service was unavailable for over five hours on Tuesday March 16; was out for two-and-a-half hours on Wednesday March 17; had issues for nearly three hours on Monday March 22; and went dark for a similar amount of time yesterday, on March 23.

The underlying theme, according to GitHub's Keith Ballinger, was resource contention in its mysql1 cluster, which caused things to topple over during periods of peak load.

The first outage, on March 16, took out pretty much all write operations as the company's database proxying tech hit its maximum number of connections. A combination of peak load and iffy query performance sent things off the rails until GitHub was able to fail over to a healthy replica while the team tried to work out what had gone wrong.
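GitHub hasn't published the proxy configuration involved, but the failure mode — a fixed-size connection pool at its ceiling refusing new work — is simple to sketch. Everything below (the class, names, and pool size) is invented for illustration:

```python
import queue

class ConnectionPool:
    """Toy fixed-size pool; a connection-limited proxy behaves similarly at the cap."""
    def __init__(self, max_connections):
        self._free = queue.Queue()
        for i in range(max_connections):
            self._free.put(f"conn-{i}")  # stand-ins for real database connections

    def acquire(self, timeout=0.01):
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            # Roughly what clients saw: the layer at max connections refuses new work.
            raise RuntimeError("pool exhausted: no connections available")

    def release(self, conn):
        self._free.put(conn)
```

Under normal load, connections are released as fast as they are acquired; when slow queries hold connections longer, the pool drains and every new request errors out at once — which is why a single bad query pattern can look like a total outage.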

It didn't have long to wait. The load ramped up again 24 hours later (we'd suggest it was the US waking up midway through the European working day) and again things began to wobble. This time, however, the gang reckoned they were ahead of the curve and hit the big red failover button before things got out of hand.

Alas, this threw up a new set of problems that caused connectivity issues once again. The good news, however, is that the team identified the load pattern and popped on an index to cure the main performance problem.
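GitHub hasn't shared the offending query or the index it added, so here is a generic sketch of the technique using Python's built-in sqlite3 — the table and column names are made up — showing how an index turns a full-table scan into an index lookup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, repo_id INTEGER, payload TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports how SQLite intends to execute the statement.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM events WHERE repo_id = 42")  # full scan of events
con.execute("CREATE INDEX idx_events_repo ON events (repo_id)")
after = plan("SELECT * FROM events WHERE repo_id = 42")   # now uses idx_events_repo
```

The same principle applies in MySQL, where `EXPLAIN` plays the role of `EXPLAIN QUERY PLAN`: if a hot query's filter column has no matching index, every execution scans the table, and at peak load those scans pile up.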

It's a solution familiar to all too many DBAs. Got a performance problem? Spray the database with indexes until it goes away. Not that this writer has ever done such a thing.

GitHub admitted "we were not fully confident in the mitigations" and, sure enough, the service started wobbling again on March 22 after it enabled memory profiling to better understand what was happening. Once again, client connections to mysql1 started to fail and once again a primary failover was required in order to recover.

Things went south at the same time on March 23, requiring another failover. This time, GitHub opted to throttle webhook traffic "as a mitigation to prevent future recurrence during peak load times" but investigations are ongoing.
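GitHub didn't describe how its webhook throttle works; the textbook mechanism for this sort of mitigation is a token bucket, sketched below in Python (the class, rate, and burst numbers are all our own invention, not GitHub's code):

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: sustained rate plus a bounded burst."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: drop or defer the webhook delivery

bucket = TokenBucket(rate=2, capacity=5)  # ~2 deliveries/sec, bursts of 5
```

Calls to `allow()` succeed until the burst allowance is spent, then fail until enough time passes for tokens to refill — which is how a throttle shaves off exactly the peak-load spikes that were tipping mysql1 over.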

And that, in a nutshell, is the problem.

While GitHub's transparency in explaining its failings is laudable, the fact that they have continued to happen and the company does not seem too clear on how to resolve them is concerning. "We have started an audit of load patterns for this particular database during peak hours," it said, while also promising to shunt traffic elsewhere, speed up failover time and review its change management procedures.

If only it had a parent company that was frequently found bragging about its own database technology, cloud elasticity and availability.

Promises to scale up infrastructure and beef up monitoring as loads increase are all well and good, but Reg readers may not escape the feeling that simply throwing resources at a problem won't deal with the underlying issues causing the outages.

Still, it's not as if GitHub also chose this time to alienate large chunks of its customer base with an unpopular social-media-like algorithmic feed. Ahem. ®


Biting the hand that feeds IT © 1998–2022