Fear and Loathing in the co-lo cabinets

Is that a frog in your pocket or are The Reg's servers down again?


Nightclub shootings by Reg Towers, as happened last week, certainly didn't impress this staffer. I live in Hackney and frankly, if I want nightclub shootings there's absolutely no need for me to travel into the West End to get them. Plus if the psychotic Hackney Yardie gangster with a spray and pray gets narked in a nightclub queue, he's going to do a heap more damage than just wing a bouncer. Basically, I'm underwhelmed. And there are other jolly features of living in Hackney too — for instance, it's horribly close to the physical location of The Reg servers in Aldgate. Horribly, horribly...

There's really no way out. The techies are in Bradford, Cullen's conveniently out in the wilds of Kent, and when Birtles isn't out not quite clinching vast money-spinning deals with the movers and shakers over the odd chablis he's in flipping Merseyside. Or on a train. Or a plane.

Face it Lettice, you're IT. Which, given the state of The Register's Cisco gear over the past week*, has meant many happy hours down at Level 3 watching support engineers at play. Friends, it's been an education. But before I continue I should just pay due credit to Mike Banahan of GBdirect, who was IT for the day on Sunday of last week. He hared down from Bradford, patched things up into the wee small hours, then got to sit at King's Cross station while winos hurled abuse at one another until dawn, it being too late for him to get a hotel or catch the last train back. Well done, Mike.

Monday–Tuesday

Naturally, it then fell over again on Monday night, twice. After the second we left it down and called support, and at about 1am clinched an appointment with an expensive professional at 8am on Tuesday. At the time we seem not to have noticed that this didn't entirely fit with the four-hour response time guaranteed in our support contract, but there you go. This is where the handily-based IT comes in. I am authorised to wander into Level 3 at all hours of the day and night, obtain a cabinet key and sit morosely on the floor in the Arctic chill of the server room while security gaily shouts "put that mobile phone out!"

I can also email Level 3 support with the name of a contractor and get them on the door list at Level 3, but if I don't have the name of the engineer who's going to show up, I can't do this, and in general the support people don't seem to know who's going to show up until, er, they actually show up. Never mind, I can go down there and sign them in, which is what happens on Tuesday at 8am.

Or not. Episode One of the mighty Cisco support machine cranks into action approximately on cue at 8am, when I overhear a bloke from DHL asking for somebody called Simon Myers at Level 3 reception. They've never heard of him, of course, but he's in Bradford anyway, and despite him always stressing this and telling the support people to ask for John Lettice, his representative on earth, they always ask for Simon Myers. And it might be worth us just doing some kind of Vulcan mind-meld job, given the extent to which we're starting to overlap. Last night, for example, a Cisco rep who's clearly losing the plot entirely emailed some useful data for Simon to me, and left a gung ho message for him on my home phone line.

Anyway, I get the net over the DHL guy before Level 3 shoves the replacement LocalDirector behind the desk then starts denying it exists (if it's not on the list it doesn't, even if it's sitting next to you; one of our Compaqs may or may not currently be in this limbo there, but that's another story), and go back to waiting for the engineer. Who arrives at 9am, this having been the time his people had actually booked him for.

This triggers my first observation about how support works, or doesn't work. Note that the hardware arrives to schedule, on greased ball-bearings, and things only start to wobble when the humans get into the picture. The picture was exactly the same regarding the Compaq I just mentioned — the swapout parts were on the lorry (albeit to the wrong place) nanoseconds after we'd put the phone down, whereas the engineer... But as I say, that's another story.

Slipping our disks

Naturally, this being The Register, it's a little more complicated than just swapping out the LocalDirector, and it's this teensie complication that turns out to be fatal. We need to get the configuration off the faulty kit in order to put it back onto the replacement, but we can't do this without the password, and we don't have the password. This is sort of our fault — the company that put the system in place (did I tell you about the backup admin server that wasn't actually connected to anything else until we noticed? But that's another story as well) is now no longer with us, and the staff who'd done it were no longer with the company well before that happened. We know somebody who thinks he might know somebody who might know where the bloke who might know the password went, but for chrissakes do you really have to involve Interpol to nail down the password for your LocalDirector?

Resetting the password is actually dead simple, because all you need is a Cisco floppy disk that you shove into the floppy drive cunningly concealed under the plate on the front of the box. It's dead simple if you have the floppy disk, that is.

The engineer doesn't but never mind, Cisco's emailed it to him, he just has to get to his email. But... By happy coincidence the hard disk in his portable died that very morning. Fortunately Mr Lettice has a portable about his person, but unfortunately the floppy drive is elsewhere, Mr Lettice having more or less abandoned floppies quite some while ago. Never mind, the engineer is from NCR, so in a twinkling we can have a replacement portable shipped over from his base in Marylebone Road. This turns out to be one of life's more sustained twinklings.

Over a period of approximately five hours we wait while totting up the number of times Mr Lettice could have gone back to Hackney and got his floppy drive, the number of times the engineer could have gone to Marylebone Road and got his email, and the number of times we could have gone to whichever safe Cisco keeps the ruddy thing in and blown it.

Engineer's boss eventually elects to drive a replacement portable down himself, claiming he'll be there in 15 minutes. No, I think, that's how long it's going to take you to park.

Eventually, we have a total of three portables with no floppy drives (it's catching), and one geriatric Tosh running Win95 with — result! — a floppy drive. Boss instructs file to be emailed to him, and he can then collect it on one of the machines with, er, no floppy drive. Now all we need is a phone point.

You guessed it. Considering that Level 3 is hosting a giant pile of servers all of them connected to the Internet already, why the blazes would you want to bother with boring old analogue phone points? You could actually just connect the portable at the cabinet, filling in the right IPs etc (as the saintly Mike Banahan had done on Sunday), but the engineers elect to wheedle access to the only available analogue line, at the front desk fax machine, instead. I discover later that Level 3 reception doesn't care about the fax line anyway, given that the fax is out of toner.

Nokia 7110s, incidentally, seem to be general issue to NCR, but setting portables up for mobile data doesn't seem to be. Mr Lettice's 7110 is set up for mobile data, but as I do it via a Psion netBook which the Win2k portable obstinately refuses to talk to, that gets us nowhere. There's a 1.5 meg download on the Nokia site that lets you use the 7110 with Win2k, and I got it the following day, just in case. I bet the NCR guys didn't, though.

Numerous "will you send that bloody email!" phone calls to NCR later, we have the file on the machine with no floppy. So we just need to set up an IR connection to the old Tosh with the floppy, and we're in business (it is now 3pm, and the site is still down). Privately boggling over the optimism of people who think an IR connection on a Win95 machine will actually work, I decide to press the reset button on the LocalDirector and see what happens. The site comes back up, and of course the IR connection doesn't work.

I now have a dilemma. I know that pushing the button brings the site up, but I know it'll go down again at some random point. Could be a couple of hours, could be 16, could be you never know. So I can actually keep us alive so long as I camp in the cabinet room (no mobile phones, no data points, no food or drink, no chairs, Arctic microclimate) for the rest of my life. It's actually quite attractive compared to the day of futile tedium I've just been through, but it really is very cold in there, and although it's warmer by the Starlabs cabinet (what a lot of Compaqs they've got), it's still pretty damn parky.

Alternatively, I could not let these two poor suckers go until it's fixed, which I calculate at being another three hours, even if everything goes according to plan (and given experience so far, how could that be?). I crack, say we'll go with the resets for the moment and reconvene tomorrow morning, where they guarantee one of them will be there, plus another techie who's cleverer than either of them, plus all of the gear they need to do the job in the first place.

Another observation — maybe they honestly believe they'll do this, but you, I and they (subconsciously) know they're out of here, never to return. This is standard support procedure. And the guy who's going to wind up coming tomorrow is the mark, who's not going to know he's been stiffed until it's too late for him to do anything about it.

Wednesday

The next morning a guy arrives asking for Simon Myers (the password, you remember) and although he can't get a pass because he's not one of the two guys who definitely weren't coming I'd arranged passes for, I've taken the precaution of getting my own tail down there so I can sign him in. (Note in passing this wonderful opportunity for blowing several hours of expensive techie time. You don't know who's coming so you can't arrange for them to be let in, so numerous techies arrive and hang around morosely, barking into their mobiles in reception, and then eventually shuffle off, access denied.)

Engineer three — I'm sure you guessed this as well — doesn't actually have the critical floppy, but don't worry, it's been emailed to him, so all he's got to do is pick up his email, which he hasn't yet. I note he's got a Nokia 7110, but of course... I explain the case to him, show him how to wheedle the fax line out of reception, then head for the office after leaving instructions for him to press the reset button whenever we call him.

As the day goes by it gets weirder. He does actually get the floppy and use it, then discovers the firmware in the replacement LocalDirector is severely older than in the dead one. Simon, who's on his case while I'm in meetings, claims that "they" (which I assume means another engineer may have arrived at some point) came up with the wheeze of opening up both boxes and trying to swap the bits about, but retreated in horror on discovering that in there "it was just like a PC." I'm not entirely convinced this allegation is true, although I believe the bit about it being just like a PC, but eventually, after an afternoon of the site whipping up and down like a flasher on acid, we have an operational unit with up to date firmware and we know the password.

Thursday

Except that's not the end of the story. Down again goes the all-new kit in the early hours of Thursday morning, we press reset to confirm it's still a LocalDirector problem, and we call support. Support advises us to wait until it goes down again, then leave it down while they send an engineer ASAP. It goes down again at five, by which time the guy who told us that was off-shift and the replacement didn't seem to think we needed an engineer, because we were back down to a priority three, or something. We shout our way back up to a priority one again, and my email resumes filling up with automated messages; I've shouted at Cisco in the US to shout at Cisco in the Netherlands to get us an engineer now, with name and ETA and it's 6.45pm UK time.

But I think I'm going to Aldgate tonight, and when I get back I fear I'll have to write some more of this. I might even be able to post it when I'm through. Upside — we might not be getting our money's worth out of this support contract, but it's sure as hell cost somebody a packet more than the £2,300 annual wedge we pay for it, so there's a certain grim satisfaction attached.

The call from Cisco comes with the engineer's name at 8.20pm, and remarkably his ETA is 12.20am. Within four hours of them actually confirming the engineer's coming? And there was me thinking it was supposed to be from when we reported the problem... It turns out to be engineer number one again, whose mobile phone number I cunningly collected back on Tuesday. So I call him.

To the background of munching noises at his end (I'd already eaten everything in the house that could be manipulated with the 'not on hold to Cisco' hand) we discuss the case. But then the site that's supposed to be staying down until the engineer gets there comes back up. We agree this could change things, so I call Cisco, which can now consider a remote log in to check what happens. But actually they come up with the useful nugget that the site came back up 37 minutes ago, so it looks like the engineer's still coming. The engineer calls, but it's not engineer one after all — it's some mug who's now halfway from Cambridge. I tell him to call me again when he hits Tottenham, and I'll snag him and pilot him in.

He calls again, tells me he's been pulled off the case, and after a screaming u-turn is heading back to Cambridge. I call Cisco, they confirm the engineer ain't coming after all. Up in Yorkshire Simon thankfully heads for the pub before last orders, I demolish the bottle of wine I'd been looking glumly at all night, and go to bed.

Friday

We keeled again overnight, apparently, but Simon dug some luckless Level 3 techie out of bed at 6am so he could go in and push the button. Whew, could have been me. He tells me that Cisco has juiced up the logging on the LocalDirector, raving that they've only done so six days into the case. But we must be reasonable — given that we didn't know the password for some of that period, they could surely only have done so 36 hours ago, or thereabouts.

We resume waiting. Simon considers that as all of our kit is working and that we're running a complete new LocalDirector, it does kind of look like the problem is somewhere beyond our cabinet. What, for example, if some klutz has duped one of our IP addresses on another piece of kit? Wouldn't that slay us every time it gibbered into action? We're going to have the damned job getting whoever it is to admit it, but if we can get close maybe they'll stop and carry on denying they ever did it in the first place.
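
For the curious, the duplicate-address theory is at least checkable in principle. What follows is a minimal sketch in Python, not anything we actually ran: it assumes you can somehow get (IP, MAC) pairs out of an ARP table or a packet capture, and the function name and sample addresses are invented for illustration. The idea is simply that if one IP address turns up behind two different MAC addresses, somebody else is answering for it.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    `observations` is an iterable of (ip, mac) string pairs. How you collect
    them (ARP table dump, packet capture) is outside this sketch.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data: documentation-range IPs and made-up MACs.
sample = [
    ("192.0.2.10", "00:00:5e:00:53:01"),
    ("192.0.2.11", "00:00:5e:00:53:02"),
    ("192.0.2.10", "00:00:5E:00:53:99"),  # a second box claiming .10
]

duplicates = find_duplicate_ips(sample)
print(duplicates)
```

An empty result would not prove the theory wrong, since the offending kit only shows up while it is actually talking, so the observations would need to be gathered around the time of an outage.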

Through Friday we don't go down. I do go to Level 3 on the way home to seek out the MIA Compaq server which Compaq may or may not have repaired. My not having been in Level 3 reception when it arrived, it's in the bowels of the building addressed to an indeterminate person and signed in as being from the repair company, not Compaq. Easy-peasy. I prove it's mine by telling them what the label we stuck on the front says, shove it in the rack, plug in and run.

Saturday–Sunday

We still don't go down. Is it... over? Shush... If so, this untracked issue has probably cost us about a third of our traffic over the past week, has cost Cisco and/or NCR absolute piles of dough, and whatever it was fixed itself (maybe) without our being able to affect it. Weirdly, we do now seem to have a backup available, although it's not, er, exactly ours. Back on Tuesday engineer one and I pulled the replacement LocalDirector out of its packaging and shoved it in the cabinet before we aborted that day's mission. A plaintive call from engineer three on Wednesday however revealed that Level 3 security had tossed the packaging, which engineer one had left in the aisle. Cisco won't accept it back without documentation, so he shoves it in the cabinet. He calls me again Friday and gives me a phone number for the people at Cisco who can facilitate my returning it. I consider the alternative strategy of charging them for rack space, and security... ®

* In order to keep the helpful suggestions in line a tad, we'd just like to point out that we know having a single point of failure in the shape of the LocalDirector is unwise. But it's stayed up for over a year, and the entry cost of the lot when we bought it was around £12k. We really could not afford two of these, although we're now checking out eBay for bankrupt stock (buy two or three, stuff the support, just shove one in when the first one breaks). Or we're looking at a couple of alternatives, but they have to be next to free, and it's only the load balancing we've got to get back on top of, so we don't want new servers unless you give them to us for nothing, we don't want expensive total rip and replace services, and we're only interested in hosting deals if you savagely undercut Level 3. Feel free to make suggestions anyway, but vendors, you're in it for the glory, not the money, OK?


Biting the hand that feeds IT © 1998–2022