'No BS' web host Gandi emits outage postmortem, has 'only theories' on what went wrong

Also reckons that it should 'accurately document the data recovery procedure' for metadata corruption. Y'think?

Tue 28 Jan 2020 // 11:41 UTC

Hosting outfit Gandi has published its postmortem regarding this month's outage and concluded that while it still has "no clear explanation", the main problem was "the duration".

So that's OK then.

The mystery incident took down a storage unit in the company's Luxembourg facility at 14:51 UTC on 8 January. It wasn't until 13 January that data was restored and services were all back online, according to the postmortem published yesterday.

414 customers were impacted "at most", according to Gandi.

The problem was that after failover attempts failed, the company 'fessed up via its various social media orifices that customer snapshots could well be lost. The file system used, ZFS, allows these snapshots of disks to be taken, and a good number of customers had expected these to be preserved.

Not so, said Gandi, as its Twitter operatives twisted this way and that in justifying the company's take on things. It was, the outfit insisted, up to customers to maintain their backups.

The postmortem doubled down on this, stating: "Contractually, we don't provide a backup product for customers," before mumbling: "That may have not been explained clearly enough in our V5 documentation."

The technical timeline published would make a good candidate for The Register's Who, Me? or On Call columns, and while Gandi is to be commended for its honesty, the floundering of its technical team was palpable as the situation unfolded.

Though the discovery that the version of ZFS in use was too old to support some of the import options that turned up during frantic documentation searching is quite comical, those pointing smugly at their own storage and hosting setups would do well to take a careful look at Gandi's experience.

At least the story had a relatively happy ending (not counting the lengthy outage).

Unsurprisingly, Gandi plans to finish its ongoing upgrade of storage units to a newer version of ZFS and, in what will likely tip admins off their chairs, "accurately document the data recovery procedure in case of metadata corruption".

It added: "We have identified areas for improvement internally in order to be even more fluid and responsive in near-real time." We'd suggest that a slightly more sympathetic approach to customers panicking over lost data, and holding off on posting Game of Thrones GIFs, would be a start.

The actual cause of the metadata corruption that left those customers dangling remains a mystery. Gandi said it had ruled out fat-fingered keyboard jockeys, saying that it didn't have a clear explanation, "only theories".

Maybe it was the server RAM wot dunnit, the company wondered. We, in turn, wondered if they were using ECC memory. A company representative told us that, according to the company's self-proclaimed BOFH, the servers do indeed use the stuff.

Ultimately, it said: "We acknowledge the main problem was the duration."

Customers, likely surprised by the lack of a documented recovery procedure for duff metadata and the get-ready-to-restore messaging from Gandi, might not entirely agree. ®
