When a deleted primary device file only takes 20 mins out of your maintenance window, but a whole year off your lifespan
In praise of UNIX and knowing when to deliberately drive a server off a cliff
Who, Me? The weekend has been deleted. Pause a moment before you start your own workplace odyssey and enjoy another's trip to Oopsville courtesy of Who, Me?
Today's story comes from "Jim", and concerns the time he and a colleague were performing an all-night hardware, OS, database and application upgrade of a daily newspaper's publishing system, running on a Sun/Sybase combo.
The fate of Sun Microsystems is sadly well documented, while Sybase continues to be a thing (although SAP has long since ditched the name).
Back then both were in rude health, and Jim and his pal were gainfully employed as engineers for a Sun/Sybase VAR and ISV.
The upgrade was going to plan. "Users and our team of trainers were expecting to arrive in the morning and log into a fully upgraded system," he explained. "In the middle of a critical phase of the upgrade, my buddy (he's still my buddy) suddenly got quiet – always a bad sign."
It transpired Jim's partner had been playing with Solaris's whizzy GUI file manager "and accidentally deleted the Sybase master device file while it was running."
It is difficult to describe how big a disaster this was. The loss of the primary device file would leave Sybase decidedly poorly. It is, however, an easy mistake for the unwary to make.
This hack fondly remembers a time toward the end of the last century when one particularly overconfident DBA decided to remove the chaff from a Microsoft SQL Server database directory by running del *.* "because the files it needs will be locked, right?"
More than 20 years on, still etched into my memory is his expression as Windows NT cheerfully shredded production database after production database, as ordered. And no, there were no recent backups.
For Jim, things weren't so dire. "In Solaris (and other UNIX boxes), deleting a file merely unlinks it from its directory.
"The file space isn't reclaimed as long as the file is held open by some process."
So the database would continue to work, even though a relatively major organ had been excised. However, no new process could open the file by name, so a dump of the system databases wasn't an option. Nor was a graceful shutdown: once Sybase closed the file, the last reference to it would be gone and its bytes cast to the wind.
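For readers who want to see the behaviour Jim describes, here is a minimal Python sketch, assuming a POSIX-style system such as Linux or Solaris. The file name master.dat and the function name are invented for illustration, and nothing here touches Sybase.

```python
import os
import tempfile


def demo_unlink_while_open():
    """Delete a file that a process still holds open, then keep using it."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "master.dat")  # hypothetical stand-in file

    # Create the file and put some data in it.
    with open(path, "wb") as f:
        f.write(b"important pages")

    # Play the part of the running database: open the file and hold it.
    holder = open(path, "r+b")

    # Play the part of the file manager: remove the directory entry.
    os.unlink(path)

    # The name is gone, so a new process could not open the file by path.
    visible_by_name = os.path.exists(path)

    # The holder's descriptor still works for both reads and writes.
    link_count = os.fstat(holder.fileno()).st_nlink
    holder.seek(0)
    data = holder.read()
    holder.seek(0, os.SEEK_END)
    holder.write(b" + more")
    holder.flush()
    size_after_write = os.fstat(holder.fileno()).st_size

    # Closing the last descriptor is the point where the space is released.
    holder.close()
    os.rmdir(workdir)
    return visible_by_name, link_count, data, size_after_write


if __name__ == "__main__":
    print(demo_unlink_while_open())
```

The first value shows that the path no longer exists, while the remaining values show the open handle carrying on regardless. On a local file system the link count is normally zero at that point; network file systems may report something different.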
Jim had wisely made backups, but a recovery from them would burn through the maintenance window "and probably kill the project for another week."
And that's without considering the employment prospects once the silliness had been found out.
What to do?
Out of ideas, Jim decided to crash (rather than halt) the system by typing the BREAK sequence at the console. The server would not get the chance to close the file cleanly...
"We said a small prayer, crossed our fingers, booted the server, and waited for the file system check (fsck) to repair the damage we had done," he recalled.
"I've never typed the letter 'y' more carefully than when asked if we wanted to re-link orphaned inodes."
With an elevated heart rate, Jim logged in and checked the file system's lost+found directory.
Sure enough, there were a handful of files with integer names ("all fsck knows is the inode number, so that becomes the file name," he explained). After a bit of investigation, he put the most likely file back in place, held his breath, and fired up the database.
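Why would fsck know only a number? On UNIX-style file systems the name lives in the directory entry, not in the file itself. The Python sketch below illustrates the point with a hard link, assuming a POSIX system whose file system supports hard links. The file names and the function name are invented for the example.

```python
import os
import tempfile


def inode_demo():
    """Show that two directory entries can point at one inode."""
    workdir = tempfile.mkdtemp()
    first = os.path.join(workdir, "master.dat")  # hypothetical name
    second = os.path.join(workdir, "alias.dat")  # hypothetical second name

    with open(first, "wb") as f:
        f.write(b"x")

    # Add a second directory entry for the same underlying file.
    os.link(first, second)

    # Both names resolve to the same inode, which counts its links.
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink

    # Clean up both names and the temporary directory.
    os.unlink(first)
    os.unlink(second)
    os.rmdir(workdir)
    return same_inode, link_count


if __name__ == "__main__":
    print(inode_demo())
```

Since the inode records how many names point to it but not what those names are, a file with no remaining directory entry can only be handed back under its number.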
"Using up all my good fortune, the database took off and we finished the upgrade.
"I wouldn't be so dramatic as to say I have PTSD from this, but retelling the story still raises the hair on the back of my neck.
"It only took 20 minutes from our maintenance window, but at least a year from my lifespan."
Ever had your bacon saved by the designers of the Unix file system? Or seen a simple task suddenly take on job-threatening proportions thanks to a co-worker's curiosity? Share your tales of near misses and near hits with an email to Who, Me? ®