GitLab crawling back online after breaking its brain in two

Database replication SNAFU took down three out of five PostgreSQL servers

In a classic example of the genre, GitLab yesterday dented its performance by accidentally triggering a database failover.

The resulting “split-brain” problem left the code-hosting outfit trying to serve its users from a single database server, postgres-02, while it sorted out the remaining three.
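A split brain of this kind shows up when replicas disagree about which node is the real primary. As a rough illustration only (the node names mirror the incident, but the logic is a toy sketch, not GitLab's actual tooling; in real PostgreSQL the upstream would be read from each standby's `pg_stat_wal_receiver` view), a monitor might compare each replica's reported upstream against the intended primary:

```python
# Toy split-brain check: each standby reports the node it is replicating from.
# Values are hard-coded here to mirror the incident described above.
upstreams = {
    "postgres-03": "postgres-01",  # following the rogue primary
    "postgres-04": "postgres-01",  # following the rogue primary
}

def detect_split_brain(true_primary, upstreams):
    """Return the standbys following something other than the true primary."""
    return sorted(node for node, up in upstreams.items() if up != true_primary)

# With postgres-02 as the true primary, both standbys are flagged as rogue.
print(detect_split_brain("postgres-02", upstreams))
# → ['postgres-03', 'postgres-04']
```

Spotting the divergence is the easy part; as the rest of the ticket shows, rebuilding replication on the flagged standbys is what takes the time.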

The problem first arose at around 1:30am UTC on Thursday, and the resulting rebuilds are continuing.

When the accidental failover was triggered, GitLab's Alex Hanselka wrote, most of the fleet “continued to follow the true primary”, but the event was still painful:

“We shut down postgres-01 since it was the rogue primary. In our investigation, both postgres-03 and postgres-04 were trying to follow postgres-01. As such, we are rebuilding replication on postgres-03 as I write this issue and then postgres-04 when it is finished.”

Also hurting performance were a backup (needed because there hadn't been a full pg_basebackup since before the failover), and the shutdown of GitLab's Sidekiq cluster, which had been generating large queries.

That was the situation when things first broke: nearly 20 hours later, the ticket hasn't been closed.

For a start, the backup of postgres-03 ran at 75GB per hour and took until after 23:00 (11pm) to complete. There are still other database tasks to finish, but performance is starting to return to normal, according to posts from GitLab's Andrew Newdigate.
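At 75GB per hour, a rebuild like this scales linearly with database size. The article doesn't give the database's size, so the figure below is purely hypothetical, just to show the arithmetic:

```python
# Back-of-envelope: how long a base backup takes at a fixed throughput.
# The 75 GB/hour rate comes from the incident; the 500 GB size is hypothetical.
def backup_hours(size_gb, rate_gb_per_hour=75):
    return size_gb / rate_gb_per_hour

print(round(backup_hours(500), 1))
# → 6.7 (a hypothetical 500 GB database would take roughly 6.7 hours)
```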

There's also a timeline here.

At least the backups are working: in February 2017, a data replication error was compounded by backup failures: “So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place”.

The missing data was found on a staging server, and after much soul-searching, marketing veep Tim Anglade told The Register the company understood its role as “a critical place for people's projects and businesses”.

Working backups, it has to be said, indicate at least some of the lessons were learned. ®
