On Call Friday has arrived once again with a tale from the smouldering world of On Call.
Today's remembrance comes from "Phil", who a few short years ago found himself supporting an unnamed public cloud vendor that decided to base its product on OpenStack Grizzly.
It's safe to say that it wasn't a pleasant experience. "OpenStack," said Phil, with perhaps a little bitterness born of experience, was "mostly tens of thousands of lines worth of Python, some RabbitMQ, a MySQL database, some Ceph storage and the kind of SDN that was released into the wild with the express aim of hunting down ops teams and murdering them violently in the dark corners of data centres."
Say what you mean, Phil.
To be fair to the noble foundation, "the heaving monstrosity of pulsing evil that we built wasn't all OpenStack's fault," he admitted. The team had known nothing about the public cloud when the project kicked off and "our ineptitude resulted in some spectacular failures in the years following launch."
We'll leave it to you to guess the vendor in question.
"Some of the business decisions haunted us day in day out," recalled Phil, "ticket after soul-destroying ticket."
Phil had singularly failed to extract himself from the project (or "bastard of creation" as he put it) and was at home after a day spent nursing The Beast through another 12 hours of near-disaster.
The day had not gone well: compute nodes maxing out, 4TB disks on the Ceph cluster failing, "causing the recovery ops to lock client ops from performing read/writes (which further broke the already stressed compute nodes as the 8CPU 32GB RAM Windows boxes started to thrash local disk on the compute nodes while looking for their volumes on the Ceph cluster)."
Just another day dealing with the results of impressively iffy planning, by the sound of it.
"Thinking that I was done for the day and could unwind in front of the telly while praying for a speedy and sudden death," Phil's phone emitted a chirp of data centre distress. The Beast demanded attention.
One of the compute nodes had gone down. Ordinarily, this would present no problem – after all, by this time OpenStack could cheerfully migrate a guest away from a duff node to somewhere healthier and spin up a VM accordingly.
"Sadly, The Beast didn't have this."
"Everyone's 60GB attached root disk lived on the compute node, because the storage cluster was dozens upon dozens of 4TB 7,200rpm drives. The required IOPS to run root disks would have killed it. Migrating 20-30 guests across SSH to other nodes was impossible.
"We didn't have enough local disk on other compute nodes anyway."
Oh, and "backups didn't exist."
Phil had to get that node back online. A reset via remote access wasn't an option – the hardware appeared to be totally powered off. A trip to the data centre, a mere 45 minutes away, was in order.
"When I got there," remembered Phil, "I soon realised that repeatedly stabbing the power button on the compute node and swearing wasn't going to fix the issue."
The power cables checked out so Phil was forced to take a closer look at the compute node itself.
"We may have racked the boxes a little too closely together," he admitted.
Even after two hours of being off, the machine's lid "was hot enough to fry breakfast". A look inside the box confirmed the worst – the onboard RAID controller "was totally and utterly cooked, its chipset black and flaky [because] it'd been so hot when it popped."
By this time, the phone had begun ringing off the hook from clients who had opted to plonk their mission-critical machines "on a cheap-as-chips non-SLA-backed platform" – because of course they had.
With replacements from hardware suppliers on a "yeah, whenever" basis, Phil was in a bind. Unsurprisingly, he had no spare RAID cards in stock. What he did have was a new server, a slightly more recent model earmarked to add some much-needed extra capacity, but beggars can't be choosers...
First the mirrored OS drives from the dead server were ripped out and plugged into the new box. Ubuntu and OpenStack received some attention to help them understand their new homes. A couple of pings were run around the stack and the new box shut down, ready for the heart-stopping second stage.
"The guests," said Phil, his memory dimmed by the trauma of the experience, "were stored on a RAID 6 array that spanned 10 or so drives."
Carefully, he moved each drive from machine to machine, ensuring the bay numbers matched. When the deed was done, and with breath held, he hit the power button.
"When the dreaded 'Do you want to import this foreign RAID config?' popped up, I took another breath, pressed the 'ah fuck it' button and waited.
"What happened next surprised even me."
Everything just... worked. Guest VMs (needing a Virtual Interface reset) span up and all came back online.
Phil's work was done, leaving him "a little older, a little sadder, and a little closer to beginning some serious work on a full-time drinking problem."
Naturally, The Register put it to the OpenStack gang that maybe its software was a bit of a git to get running back in the day. Chief Operating Officer Mark Collier agreed and told us: "Back when OpenStack was launched with NASA, you literally had to be a rocket scientist to run it.
"But now, you have Adobe Advertising Cloud running over 100,000 cores of OpenStack with just four people, so you can say we have come a long way."
Four very, very clever people, we'd wager.
Ever been caught in the blast radius of an OpenStack installation gone horribly wrong? Or found yourself in the whirlwind of a project that nobody fully understands? What did you do when that fateful call came in? Drop On Call a line and tell us all about it. ®