One person's shortcut was another's long road to panic
Clever techie thought of everything – except someone else's stupidity
Who, Me? Why hello, dear reader – fancy seeing you here again on a Monday – the slot The Register reserves for a fresh installment of Who, Me? in which Register readers share their tales of tech tribulations.
This week meet a reader we'll Regomize as "Bart" who once held the grand title of "Scientist" at a research lab which did a lot of data-crunching for orgs like NASA and ESA.
Such orgs obviously produce a huge amount of data. Bart told Who, Me? that at the time he left there was at least 2.5 petabytes of storage on hand, with that total growing fast.
The machines doing all that processing had also grown rapidly and a bit haphazardly as and when need arose over the years, so the system was, let's say … quirky. Sometimes, for no apparent reason, processing jobs fell over. Because they were processing near-real-time data and had contracts to fulfill, priority was placed on getting stuff done fast rather than figuring out why a particular job crashed.
Thus, if a particular processing job failed, the system would leave it in a junk directory and start again. Storage capacity was not at a premium – time was.
Of course, even with all that capacity, the useless junk directories full of half-processed data would eventually build up and occupy considerable space. For a while, Bart would just tell people to delete their junk directories – but users were often in no hurry to comply. Doing it manually himself was also not a good use of Bart's time – he had more Scientist-y things to do than clearing out other people's junk.
So Bart created a shell script that he could run periodically to scan the workspace directory, dig down into any subdirectories, find the junk and clear it out.
How would it find the junk? Well, it was quite simple. You see, once a job was fully processed on the workspace directory it was transferred to one of the storage servers, where it would be collected by whoever owned it. Thus, the only data on the workspace directory should be work in progress. Anything that didn't have a "running" flag on it was obviously one of the failed jobs and could go.
Quite simple. Quite clever. And it was quick – usually a couple of minutes to free up terabytes of space.
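For the curious, the logic can be sketched in a few lines of Python. Bart's actual tool was a shell script that we have not seen, so treat everything here as an assumption made for illustration: the workspace/user/job directory layout, the `RUNNING` marker file standing in for the "running" flag, and the function name `find_junk`. This version only lists candidates and deletes nothing.

```python
import os

# Hypothetical marker file: the story does not say what form the "running" flag took.
RUNNING_FLAG = "RUNNING"


def find_junk(workspace):
    """Return job directories under workspace/<user>/ that carry no running flag.

    Dry run only: nothing is deleted.
    """
    junk = []
    with os.scandir(workspace) as users:
        for user in sorted(users, key=lambda e: e.name):
            # Note: is_dir() follows symlinks by default.
            if not user.is_dir():
                continue
            with os.scandir(user.path) as jobs:
                for job in sorted(jobs, key=lambda e: e.name):
                    flag = os.path.join(job.path, RUNNING_FLAG)
                    if job.is_dir() and not os.path.exists(flag):
                        junk.append(job.path)
    return junk
```

Returning a list instead of deleting on the spot is a deliberate choice in this sketch: it lets a human eyeball the candidates before anything irreversible happens.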
Of course, the trick was only ever to run the script on the workspace server, where all the live work was happening. Who, Me? mentions this for the sake of what we in the business call "dramatic foreshadowing."
One day, when the servers were starting to look kind of fullish, Bart ran his script on the workspace server and headed off to grab a coffee and talk to colleagues. Twenty minutes later, when he came back, the script was still running. Concerned, he killed it to find out what was going on.
Reader, you will not be surprised to learn that bad stuff was going on. For some reason, one of Bart's colleagues had left a symlink – kind of like an alias or a shortcut – from a directory on the workspace server to the root directory of one of the storage servers. The script had found the symlink, followed it as if it were a subdirectory, and begun deleting the contents of the storage server.
Because of course nothing on the storage server had a "running" flag. No-one would ever be crazy enough to do live processing on the storage server, would they?
It's important to note that there's no good reason Bart could think of why such a symlink would exist. The whole system was designed to keep live data on the workspaces away from the storage server. The script could have been written to ignore symlinks, but who would even think to do that?
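With the benefit of hindsight, here is a minimal Python sketch of a scan that refuses to treat symlinks as directories. It is not Bart's script, which was written in shell and which we have not seen. The workspace/user/job layout, the `RUNNING` marker file and the function names are all hypothetical. The one real mechanism on show is `os.DirEntry.is_dir(follow_symlinks=False)`, which reports a symlink to a directory as not being a directory.

```python
import os

# Hypothetical marker file: the story does not say what form the "running" flag took.
RUNNING_FLAG = "RUNNING"


def real_subdirs(path):
    """Yield the real subdirectories of path, skipping symlinks entirely."""
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            # follow_symlinks=False: a symlink to a directory is NOT a directory here.
            if entry.is_dir(follow_symlinks=False):
                yield entry.path


def find_junk_safely(workspace):
    """List unflagged job directories without ever stepping through a symlink."""
    junk = []
    for user_dir in real_subdirs(workspace):
        for job_dir in real_subdirs(user_dir):
            if not os.path.exists(os.path.join(job_dir, RUNNING_FLAG)):
                junk.append(job_dir)
    return junk
```

A stray link from the workspace to a storage server's root would simply be skipped by this version. If the deletion step were also done in Python, `shutil.rmtree` adds a second line of defense, as it raises an error when handed a path that is itself a symbolic link.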
Bart never found out who created the symlink, nor why. What he did find was gigabyte after gigabyte of empty disk space where data ought to be.
Data for which his employer had contracts.
Thankfully it didn't take long to work out that much of what had been deleted was superseded data, and the rest could be reasonably quickly reconstructed by re-processing the raw data. Nonetheless, Bart had some serious egg on his face.
If you've ever found yourself wearing a facial frittata after deleting the wrong data, look on the sunny side: you can share it with other readers via an email to Who, Me? and we'll make you (anonymously) famous. Go on, don't keep all your eggs in one basket – we could use some fresh tales to tell. ®