Automation is great. Until it breaks and nobody gets paid
An ill-considered cron job turned into a nasty 2AM job
On Call With Friday upon us, and a weekend next on the schedule, The Register once again brings you an instalment of On Call, our weekly reader-contributed tales of being dragged out at all hours to fix failures inflicted by the foolish, flummoxed, or fatuous.
This week, meet a reader we’ll Regomize as “Hugh”, who in the early 2000s scored a contract as a Linux admin for a global auto manufacturer.
Hugh spent regular weeks on call and told us those times were “sure to bring at least one sleepless night doing battle against failed software, or hardware.”
One of those incidents started at 2:00 AM when Hugh’s pager pinged with news that a host used by the HR team was in trouble.
Hugh did the shake yourself awake and turn on the laptop in the middle of the night thing and logged in to inspect the system, which clearly needed a reboot. So Hugh initiated a power cycle, watched it reboot without incident, then ran a series of tests. That effort produced nothing untoward, so Hugh prepared to retire for the night once again.
But just as he was about to hit the sack, a new alert arrived. The same host was in trouble again. Again, the system showed no sign of distress, and a reboot again brought it back to life. This time Hugh decided to run some extra checks to make sure he hadn’t missed anything the first time.
That extra time paid off because exactly five minutes after reboot the host locked up again.
Hugh decided that 300-second interval was a clue, so when the system came back to life, he disabled cron, the ubiquitous job scheduler found in Unix-esque systems.
Next, Hugh started looking for evidence of any scheduled jobs.
And found one.
“It's MASSIVE, and its time stamp was ... about five minutes in the past,” Hugh told On Call.
A little investigation led him to a script that he described as designed to “append itself to his crontab each time it runs, then execute his target script 16384 times, and copy itself again.”
“The job in question was to collect timesheets from various sources, and take that to payroll.”
But the payroll system, and the host Hugh was trying to fix, did not enjoy that influx of info and fell over.
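For readers wondering how a job that installs itself in a crontab can avoid multiplying like this, here is a minimal Python sketch. It is not the script from Hugh's story, which On Call never saw. The entry text, the script path, and both function names are hypothetical, invented purely for illustration. The idea is to append a crontab line only if it is not already there, and to have a quick way of spotting lines that have been duplicated.

```python
from collections import Counter


def add_entry_once(crontab_text: str, entry: str) -> str:
    """Return crontab_text with entry appended only if it is not already present."""
    lines = crontab_text.splitlines()
    if entry.strip() in (line.strip() for line in lines):
        return crontab_text  # already scheduled; do not add a second copy
    lines.append(entry.strip())
    return "\n".join(lines) + "\n"


def duplicated_entries(crontab_text: str) -> dict[str, int]:
    """Map each non-comment crontab line that appears more than once to its count."""
    counts = Counter(
        line.strip()
        for line in crontab_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    return {line: n for line, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical schedule and path, not taken from the story.
    entry = "*/5 * * * * /opt/hr/collect_timesheets.sh"
    tab = add_entry_once("", entry)
    tab = add_entry_once(tab, entry)  # second call changes nothing
    print(tab, end="")
    print(duplicated_entries(tab + tab))
```

Both functions work on the crontab as plain text, so they can be tried out without touching a real scheduler. Reading and writing the actual crontab is left out on purpose.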
Which was bad for Hugh seeing as he was now wide awake at 2:00 AM, and also because the function of this crappy cron job was to collect timesheet info for contractors.
Contractors like Hugh.
“Folks were not happy when they did not get paid on time,” Hugh told On Call, rounding out his tale with news that the chap who wrote the script had his cron privileges revoked.
What has automation messed up in your life? Click here to send On Call an email and we’ll automatically consider it for a future On Call. ®