Crashed and alone in a remote location: When paid help is no help
Coping strategies for first-timers
This Damn War
I took the plunge and became a freelance IT consultant in 2001. Through an unlikely series of coincidences (former colleague from London goes to travel show in France and bumps into two guys from Yorkshire who are looking for a software and database architect) I ended up in North Yorkshire that summer, working for a holiday cottage rental company.
Now, I'll admit I heard the word “cottage” and thought “small company, run from someone's shed.” So when I rocked up to a former mill in the countryside with a 70-seat call centre and an eight-CPU Sun Microsystems server running the Oracle back end to their reservation system, I was more than a little surprised.
It was a nice, powerful setup. They'd got a good deal on the server hardware: the model they'd gone for was close to end-of-sale, so although they didn't really need all eight processors yet, they decided to fill it up because later expansion of obsolete kit could have been a problem. Support wasn't an issue – spares stock was formally preserved for a good few years, and of course they had a top-line (equals very pricey) maintenance agreement. Not only did this give them one-hour on-site attendance from a qualified engineer, but they also had proactive monitoring by the vendor via a dedicated leased line into the company's office.
The storage arrays were also many and full; as was normal at the time they had vast numbers of relatively small, high-spin-speed disks – maximising the number of spindles and platters and hence maximising concurrent reads and writes and minimising time lost to read/write head movements. SSD wasn't even heard of back then – the only option was what storage techs refer to as spinning rust.
It was a great box to work with – and very familiar, given that I cut my teeth as a Sun sysadmin back in the day. It was speedy and very underutilised, and I didn't have to worry about the witchcraft of managing the Oracle installation, as we had an Oracle guru visit from Cumbria for a week per month to do the laying on of hands, database optimisation, upgrades and the like.
It was also a great company to work with. The development manager was my primary partner in crime, and we got to know each other pretty well. Then one evening at about 6:45 I was having dinner with him at his house when we got a call from the call centre: “The reservation system's down.”
A few minutes later (he lived in the next village to the office) we trotted into the server room. The Sun console showed that the server was in mid-boot but had crashed part way through. We told it to reboot again, and we could see that the power-on self-test was showing a failed CPU. “Aha”, we thought, and popped one of the CPU boards out (there were four, each with two processors). One more reboot, a pause for a cuppa whilst we waited for it to do its “Hey, I crashed, I'm going to check all my disks for 20 minutes” cycle, and the call centre people were happy.
Then the phone rang. “Hi, this is the monitoring centre. We're seeing a problem on your server”. No shit, Sherlock. I pointed out that we were fully aware of this fact, citing as evidence the increasingly cold treacle sponge on Mark's dinner table.
We asked them to send an engineer to replace the faulty part, and told them he should drive carefully. There was a reason for this latter comment: there were two on-call engineers in that region at the time, each of whom lived an hour away… on a good day, in nice weather, and if there weren't any big yellow roadside cameras and/or uniformed people in cars with blue lights. An hour was optimistic, and as the service was actually running OK, albeit short of a CPU board, we didn't want the guy to kill himself on the way.
Interestingly, the spares must have been closer than the engineer, because a courier arrived with a brown box of bits several minutes before the repair guy. By 9:30 all was good and we were back to normal.
We had a couple more faults over the months I worked with that company, and on both occasions the same thing happened: we worked around the problem, and then the monitoring centre called us.
At one point the operator told me that if something crapped itself inelegantly (so there was no “I'm dying” event sent from the system to the monitor) they often wouldn't notice – but that they usually picked up the fact that it had come back up thanks to the “Hey, look at me, I crashed” critical log messages. Gee, thanks for that.
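For what it's worth, the usual way around the “no dying gasp, no alert” problem is to have the monitor watch for silence rather than wait for bad news. What follows is a minimal sketch of that idea and not a description of how the vendor's monitoring actually worked: the class name, the host name and the 60-second timeout are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Flags hosts that have gone quiet instead of waiting for a failure event.

    Hypothetical illustration only; not the vendor's monitoring system.
    """

    timeout_seconds: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, host: str, timestamp: float) -> None:
        # Each monitored host is expected to check in periodically.
        self.last_seen[host] = timestamp

    def silent_hosts(self, now: float) -> list:
        # A host is suspect once it has been silent for longer than the
        # timeout, whether or not it ever sent an "I'm dying" event.
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=60)  # assumed threshold
monitor.record_heartbeat("reservations-db", timestamp=0)  # made-up host name
print(monitor.silent_hosts(now=120))
```

The point of the design is that a box which falls over without warning still gets noticed, because the absence of a heartbeat is itself the alarm.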
I learned three important lessons from working with that system:
- Don't expect things always to behave as desired. We'd been told that if a redundant component failed the system would just invisibly swap it out and continue on the other seven; in reality it would sometimes just keel over and refuse to reboot until you took physical action to remove the failed part.
- If you're looking to have a specific time limit for on-site response in your maintenance contract, ask the vendor where the engineers are based. We were in the wilds of Yorkshire, and having got to know the engineers and where they lived, I reckon they probably couldn't legally get to us from home in an hour (and if they could, it was only just).
- If you're stern about failure with your maintenance vendor at renewal time, you can get lots of money off next year's service.
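The second lesson boils down to a bit of arithmetic you can do before signing anything. Here is a rough sketch, assuming you know roughly how far away the nearest engineer lives and a realistic average road speed. The function name and every number below are hypothetical, not figures from the actual contract.

```python
def can_meet_response_time(distance_miles: float,
                           average_speed_mph: float,
                           response_limit_minutes: float,
                           dispatch_delay_minutes: float = 0.0) -> bool:
    """Return True if a legal drive fits inside the contracted response time."""
    # Travel time in minutes at the assumed average speed.
    travel_minutes = distance_miles / average_speed_mph * 60
    return dispatch_delay_minutes + travel_minutes <= response_limit_minutes


# Hypothetical example: engineer 45 miles away on rural roads averaging
# 40 mph, against a one-hour on-site commitment.
print(can_meet_response_time(45, 40, 60))
```

If the answer comes back False at legal speeds, the one-hour promise is worth querying before you pay top-line prices for it.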