GitLab crawling back online after breaking its brain in two

Database replication SNAFU took down three out of five PostgreSQL servers

In a classic example of the genre, GitLab yesterday dented its performance by accidentally triggering a database failover.

The resulting “split-brain problem” left the code-collaboration service serving its users out of a single database server, postgres-02, while it sorted out the remaining three.

The problem first arose at around 1:30am UTC on Thursday, and the resulting rebuilds are continuing.

When the accidental failover was triggered, Alex Hanselka wrote that while the fleet “continued to follow the true primary”, the event was apparently painful:

“We shut down postgres-01 since it was the rogue primary. In our investigation, both postgres-03 and postgres-04 were trying to follow postgres-01. As such, we are rebuilding replication on postgres-03 as I write this issue and then postgres-04 when it is finished.”
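The ticket doesn't spell out the exact commands, but rebuilding replication on a standby that has been following the wrong primary generally means stopping it, discarding its diverged data directory, taking a fresh base backup from the true primary, and restarting it as a follower. Here's a minimal sketch in Python, assuming a recent PostgreSQL with pg_basebackup and pg_ctl on the path; the hostnames, data directory, and replication role are illustrative stand-ins, not GitLab's actual configuration:

    import shutil
    import subprocess

    PRIMARY_HOST = "postgres-02"                    # the true primary still serving traffic
    REPLICA_DATA_DIR = "/var/lib/postgresql/data"   # data directory on the replica being rebuilt
    REPL_USER = "replicator"                        # hypothetical replication role

    def reseed_standby():
        # 1. Stop the local instance so its diverged data directory can be replaced.
        subprocess.run(["pg_ctl", "stop", "-D", REPLICA_DATA_DIR, "-m", "fast"], check=True)

        # 2. Discard the old data directory; it was following the rogue primary.
        shutil.rmtree(REPLICA_DATA_DIR)

        # 3. Take a fresh base backup from the true primary, streaming WAL as it runs.
        #    -R writes the standby settings so the replica follows PRIMARY_HOST on start-up.
        subprocess.run([
            "pg_basebackup",
            "-h", PRIMARY_HOST,
            "-U", REPL_USER,
            "-D", REPLICA_DATA_DIR,
            "-X", "stream",
            "-R",
            "--checkpoint=fast",
            "--progress",
        ], check=True)

        # 4. Bring the instance back up as a standby of the true primary.
        subprocess.run(["pg_ctl", "start", "-D", REPLICA_DATA_DIR], check=True)

    if __name__ == "__main__":
        reseed_standby()

The base backup is the expensive step: it copies the whole cluster over the network, which goes a long way to explaining why the rebuild dragged on for most of the day.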

Also impacting performance were a backup (needed because there hadn't been a full pg_basebackup since before the failover) and the fact that GitLab had shut down its Sidekiq cluster because it was causing large queries.

That was the situation when things first broke: nearly 20 hours later, the ticket hasn't been closed.

For a start, the backup of postgres-03 ran at 75GB per hour and took until after 23:00 (11pm) UTC to complete. There are still other database tasks to finish, but performance is starting to return to normal, according to posts from Andrew Newdigate.
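For a sense of scale, copy time at that rate grows linearly with cluster size. A quick back-of-the-envelope check (the cluster size below is a hypothetical figure for illustration, not one taken from the ticket):

    # Rough copy time at the 75GB/hour rate quoted in the ticket.
    rate_gb_per_hour = 75
    cluster_size_gb = 1_500   # hypothetical cluster size, not GitLab's actual figure

    hours = cluster_size_gb / rate_gb_per_hour
    print(f"~{hours:.0f} hours to copy {cluster_size_gb} GB at {rate_gb_per_hour} GB/hour")
    # prints: ~20 hours to copy 1500 GB at 75 GB/hour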

GitLab has also published a timeline of the incident.

At least the backups are working this time. In February 2017, a data replication error was compounded by backup failures: “So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place”.

The missing data was found on a staging server, and after much soul-searching, marketing veep Tim Anglade told The Register that the company understood its role as “a critical place for peoples' projects and businesses”.

Working backups, it has to be said, indicate at least some of the lessons were learned. ®
