GitHub fixes pull request delay that derailed developers

Went down yesterday, too, longer and harder. Maybe we should call it GitFlub?

GitHub is experiencing a second day of degraded performance, following a botched configuration change that threw the code locker into chaos.

Microsoft's cloudy code collaboration service today advised users of degraded performance for Pull Requests.

Users who requested anonymity told The Register that, at the time of writing, delays of around ten minutes are the norm: commits are made and pushed to branches, but then aren't immediately visible to all team members.

GitHub first acknowledged the issue at 23:39 UTC on March 12.

Around two hours later, it Xeeted news that it had found a mitigation and was "currently monitoring systems for recovery."

It didn't have to monitor for long: five minutes later the fix was in, and the incident ended.

Without explanation – for now.

GitHub has, however, explained the previous day's outage, which struck at 22:45 UTC on March 11 and persisted until 00:48 UTC the next day.

During that incident, Secret Scanning and 2FA via GitHub Mobile produced error rates of up to 100 percent, before settling at around 30 percent for the final hour. Copilot error rates reached 17 percent, and API error rates hit one percent.

"This elevated error rate was due to a degradation of our centralized authentication service upon which many other services depend," according to GitHub's Status History page.

"The issue was caused by a deployment of network related configuration that was inadvertently applied to the incorrect environment," states GitHub's error report.

The error was spotted within four minutes and a rollback was initiated.

But the rollback failed in one datacenter, extending the time needed for recovery.

"At this point, many failed requests succeeded upon retrying," the status page adds.

Here's the rest of the service's mea culpa:

This failure was due to an unrelated issue that had occurred earlier in the day where the datastore for our configuration service was polluted in a way that required manual intervention. The bad data in the configuration service caused the rollback in this one datacenter to fail. A manual removal of the incorrect data allowed the full rollback to complete at 00:48 UTC thereby restoring full access to services. We understand how the corrupt data was deployed and continue to investigate why the specific data caused the subsequent deployments to fail.
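For the curious, here's a minimal sketch, in Python, of the kind of guardrail that could catch a config change aimed at the wrong place before it lands. It's purely illustrative: the ConfigChange and deploy names are our own invention under assumed semantics, not GitHub's actual tooling.

```python
# Hypothetical sketch: refuse to apply a configuration change whose declared
# target environment does not match the environment it is being deployed to.
# None of these names reflect GitHub's real systems.

from dataclasses import dataclass


@dataclass
class ConfigChange:
    name: str
    target_environment: str   # environment the change was authored for
    payload: dict


class EnvironmentMismatchError(RuntimeError):
    pass


def deploy(change: ConfigChange, current_environment: str) -> None:
    """Apply a config change only if it was written for this environment."""
    if change.target_environment != current_environment:
        # Fail closed: the March 11 incident began when a network config
        # change was "inadvertently applied to the incorrect environment."
        raise EnvironmentMismatchError(
            f"{change.name} targets {change.target_environment!r}, "
            f"but this is {current_environment!r}"
        )
    apply_to_datastore(change.payload)   # placeholder for the real apply step


def apply_to_datastore(payload: dict) -> None:
    print("applied:", payload)


if __name__ == "__main__":
    change = ConfigChange("tune-lb-timeouts", "staging", {"timeout_ms": 500})
    deploy(change, current_environment="staging")       # succeeds
    # deploy(change, current_environment="production")  # raises EnvironmentMismatchError
```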

GitHub has pledged to work on "various measures to ensure safety of this kind of configuration change, faster detection of the problem via better monitoring of the related subsystems, and improvements to the robustness of our underlying configuration system including prevention and automatic cleanup of polluted records such that we can automatically recover from this kind of data issue in the future."

Good. ®
