GitLab.com melts down after wrong directory deleted, backups fail

Upstart said it had outgrown the cloud – now five out of five restore tools have failed


Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups.

On Tuesday evening, Pacific Time, the startup issued a sobering series of tweets acknowledging the incident. Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.

Just 4.5GB remained by the time he cancelled the rm -rf command. The last potentially viable backup was taken six hours beforehand.
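
The failure mode is a painfully familiar one: a destructive command intended for one database server was run on another. Below is a minimal sketch of the kind of hostname guard that catches that class of mistake; the host names and data directory are hypothetical, not GitLab's actual setup:

    # Refuse to wipe the data directory unless this really is the host the
    # deletion was intended for. Host names and paths here are illustrative only.
    EXPECTED_HOST="db2.example.com"
    DATA_DIR="/var/opt/gitlab/postgresql/data"

    if [ "$(hostname -f)" != "${EXPECTED_HOST}" ]; then
        echo "Refusing to delete ${DATA_DIR}: this is $(hostname -f), not ${EXPECTED_HOST}" >&2
        exit 1
    fi

    rm -rf "${DATA_DIR}"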

A Google Doc linked from one of those tweets notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)."

So some solace there for users: not all is lost. But the document concludes with the following:

"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."

The world doesn't contain enough faces and palms to even begin to offer a reaction to that sentence. Or, perhaps, to summarise the mistakes the startup candidly details as follows (we take a closer look at the pg_dump failure after the list):

  • LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
  • Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
  • SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
  • Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
  • The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
  • The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
  • Our backups to S3 apparently don’t work either: the bucket is empty
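
That pg_dump bullet deserves unpacking: an older pg_dump aborts when pointed at a newer server's data, and because nothing surfaced the error, nobody noticed until the dumps were needed. Here's a hedged sketch of the sort of sanity check that would have flagged both the version mismatch and the few-bytes-long backup files; the paths and size threshold are assumptions, not GitLab's actual configuration:

    # Compare the PostgreSQL major version recorded in the data directory
    # with the pg_dump binary that will actually run. Paths are illustrative.
    DATA_DIR="/var/opt/gitlab/postgresql/data"
    DUMP_FILE="/var/opt/gitlab/backups/db.sql.gz"

    SERVER_VERSION="$(cat "${DATA_DIR}/PG_VERSION" 2>/dev/null || echo "missing")"
    CLIENT_VERSION="$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -n1)"
    if [ "${SERVER_VERSION}" != "${CLIENT_VERSION}" ]; then
        echo "WARNING: pg_dump is ${CLIENT_VERSION} but the data directory reports ${SERVER_VERSION}" >&2
    fi

    # A dump that is only a few bytes long is not a backup.
    MIN_BYTES=1000000
    SIZE="$(stat -c%s "${DUMP_FILE}" 2>/dev/null || echo 0)"
    if [ "${SIZE}" -lt "${MIN_BYTES}" ]; then
        echo "WARNING: ${DUMP_FILE} is only ${SIZE} bytes; the backup almost certainly failed" >&2
    fi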

Making matters worse is the fact that GitLab last year decreed it had outgrown the cloud and would build and operate its own Ceph clusters. GitLab's infrastructure lead Pablo Carranza said the decision to roll its own infrastructure “will make GitLab more efficient, consistent, and reliable as we will have more ownership of the entire infrastructure.”

It's since wound back that decision, it told us by tweet.

At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be “without webhooks” but is “the only available snapshot.” That source is six hours old, so there will be some data loss.

Last year, GitLab, founded in 2014, scored US$20m of venture funding. Those investors may just be a little more ticked off than its users right now.

"On Tuesday, GitLab experienced an outage for one of its products, the online service GitLab.com," a spokesperson for the San Francisco-based biz told The Register in an email, adding: "This outage did not affect our Enterprise customers."

"We have been working around the clock to resume service on the affected product, and set up long-term measures to prevent this from happening again," the spinner said. "We will continue to keep our community updated through Twitter, our blog and other channels."

Meanwhile, the sysadmin who accidentally nuked the live data reckons "it’s best for him not to run anything with sudo any more today." ®
