OK, we've got your data. But we really want to delete it ASAP
No kidding. We need the storage
Storage is a big deal for IT people and beancounters alike. For the IT team the story is pretty consistent: there's never quite enough, and the users seem to eat it up at an amazing rate. For the finance team it's a seemingly endless queue of IT people asking for funds for yet more storage because the rate of growth in stored data seems to accelerate more than anyone ever predicts.
Here's a novel thought: what about if, every so often, we proactively delete some data to free up space for new stuff? Perhaps we don't need to grow the volume of our storage systems if we remove old data to make way for the new.
The minimum retention period
Most of us are obliged by some kind of law or contract to retain data for at least a certain number of months or years, depending on where we live. Tax authorities have a habit of obliging you to keep key financial information for anything up to six or seven years, for instance, and of course if you're a service provider selling backup services to your customers then you'll probably want to keep the data for at least as long as the contract says you will.
Although knowing the minimum retention period for each type of data in each jurisdiction is a non-trivial task – because there are so many different rules based on location and data type – it is a known quantity and it's written in statute and/or the contract between you and the customer.
The maximum retention period
What people frequently fail to do, though, is delete data when it's been stored for long enough and it's eligible for deletion. After all, it's doing no harm sitting there on the disk with the exception that it's taking up space – which is fine, because when the time comes that you're running short of space you just dump the oldest folders. And if it's on tape then who cares – you can get terabytes of data in very little space when you stream it off to modern tape cartridges, and anyway when a cartridge is four or five years old you're probably not going to want to erase and re-use it as a new one will be cheap and more reliable.
Errr … no.
You must always work on the premise that the only reason you keep data is because you have a darned good reason to do so – which means you hold it legitimately and need it for business or that you're mandated to hang on to it. So if you're a landlord of residential properties and you have tenants who've been in your properties for 20 years, it makes sense (and it's legitimate) to keep their data. But is it right to keep data on someone with whom you've not dealt for 10 years? Probably not. And with regard to not re-using tapes: well, you probably wouldn't – but if the data on them is obsolete and irrelevant to your current business you should be securely destroying them rather than keeping them “just in case”.
Let's take a slightly exaggerated example of why you need to delete data. Imagine in 2005 you paid someone 50 quid for a dodgy email list of a million people and spammed them to promote your funky new online store. (Note: for the young people reading this, spam worked pretty well in those days – a million spams cost next to nothing to send but might get you 20,000 or 30,000 responses.) Now imagine that in 2017 the EU brings in a law that bans holding people's email addresses without their permission. If that data's nestling on a tape in your drawer, you're not going to get away with “Sorry, we forgot we had it”. So the moment you no longer have a legitimate need to keep it, bin it. Irrevocably and verifiably.
Deleting aged data doesn't actually save you that much storage space, ironically: the volume of data you add this year will be an order of magnitude greater than the volume you added five years ago. So the act of removing it is primarily one of covering your arse and protecting against either prosecution or, in these days of Freedom Of Information, embarrassing revelations. Transient data's another matter, though: it's what eats our storage like nobody's business.
Transient data is all the temporary stuff we generate but don't throw away. System A exports data to System C via System B, which does some transformations on it. Much of the time the transformation script is written to take the input data and write the output data but doesn't bother cleaning up the bit in the middle. And much of the time this data can be quite chunky – often way bigger than both the source and the output. After all, how often have we written a script that goes uncompress -> transform -> compress and leaves the bit in the middle lying around as raw text?
So bin the temporary crap. That's what takes the space without you noticing, and most of the time it's a simple case of writing your maintenance scripts to use temp directories as scratch space and to have automated mechanisms for purging the temp directories frequently.
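By way of illustration, here is a minimal Python sketch of the uncompress -> transform -> compress pattern that cleans up after itself. The function name, the file names and the upper-casing "transformation" are all made up for the example; the point is only that the raw intermediate files live in a scratch directory that vanishes when the job finishes, even if it fails halfway through.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def transform_compressed(src: Path, dst: Path) -> None:
    """Uncompress src, transform it, and write the recompressed result to dst.

    The upper-casing step is a stand-in for whatever a real job would do.
    """
    # The scratch directory and everything in it is removed when the
    # "with" block exits, whether the job succeeds or raises.
    with tempfile.TemporaryDirectory(prefix="etl-scratch-") as scratch:
        raw = Path(scratch) / "raw.txt"
        out = Path(scratch) / "transformed.txt"

        # Step 1: uncompress into scratch space.
        with gzip.open(src, "rb") as fin, open(raw, "wb") as fout:
            shutil.copyfileobj(fin, fout)

        # Step 2: transform line by line (placeholder transformation).
        with open(raw, "r", encoding="utf-8", newline="") as fin, \
                open(out, "w", encoding="utf-8", newline="") as fout:
            for line in fin:
                fout.write(line.upper())

        # Step 3: compress the result to its final destination.
        with open(out, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    # By this point the bulky raw text no longer exists on disk.
```

Scripts that can't be restructured like this are the ones that need the scheduled purge of their temp directories instead.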
Organise it properly
Going back to the material we're deliberately storing, then, the final thing you need to do is organise it properly. And this basically means ensuring that the software you use to store and archive the data is also able to delete it. We're all used to configuring backup programs to do weekly full backups and daily incrementals, perhaps with a monthly snapshot of the latest full dump, but how many of us have ever configured the same program to chuck away the December 2013 monthly snapshot automatically when it's created and verified the January 2015 one? Probably not even 10 per cent of us.
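If your backup software can't do this itself, the rule is simple enough to sketch in a few lines of Python. This is an illustration rather than any product's actual behaviour: the function name, the mapping of snapshot dates to a "verified" flag, and the default of keeping 13 monthly snapshots (which is what makes December 2013 expire once January 2015 is verified) are all assumptions for the example.

```python
from __future__ import annotations

from datetime import date


def snapshots_to_expire(snapshots: dict[date, bool], keep: int = 13) -> list[date]:
    """Return the snapshot dates that are now safe to delete.

    `snapshots` maps each monthly snapshot date to whether it has been
    verified. Nothing is expired until more than `keep` verified
    snapshots exist; then everything older than the newest `keep`
    verified snapshots is returned.
    """
    verified = sorted(d for d, ok in snapshots.items() if ok)
    if len(verified) <= keep:
        # Not enough verified copies yet, so expire nothing.
        return []
    cutoff = verified[-keep]  # oldest snapshot we still want to keep
    return sorted(d for d in snapshots if d < cutoff)
```

Tying expiry to verification is the important design choice here: an old snapshot is only thrown away once a newer one has been proven good, not merely created.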
And that's just backups – which are pretty straightforward to bin on a schedule even if you do need to include manual steps such as physically destroying tapes when they reach 'n' months of age. The real trial comes when you look at applications.
Take your CRM system, for instance. Does it have the ability to remove all customer and contact information for contacts that haven't been used for, say, three years? If not, maybe you want to consider an alternative, or at least to script something through its API that will allow you to do so.
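As a rough idea of what such a script might look like, here is a Python sketch. It doesn't talk to any real CRM: the `Contact` record, the `last_used` field and the `delete` callable are hypothetical stand-ins for whatever your product's API actually exposes, and the three-year threshold is simply the figure used above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


@dataclass
class Contact:
    """Hypothetical minimal view of a CRM contact record."""
    contact_id: str
    last_used: datetime


def purge_stale_contacts(
    contacts: Iterable[Contact],
    delete: Callable[[str], None],
    now: datetime,
    max_idle: timedelta = timedelta(days=3 * 365),
) -> List[str]:
    """Delete every contact idle for longer than max_idle.

    `delete` is whatever function removes a contact by ID in your CRM
    (assumed here, not a real API). The IDs removed are returned so the
    run can be logged and verified afterwards.
    """
    purged = []
    for contact in contacts:
        if now - contact.last_used > max_idle:
            delete(contact.contact_id)
            purged.append(contact.contact_id)
    return purged
```

Returning the list of purged IDs gives you something to log, which matters if you ever have to show that the data really was removed.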
It's pretty obvious, when you talk about data retention, that it's all about ensuring data is retained for long enough to be compliant with the law and to allow it to be processed for legitimate purposes.
But don't forget the bit that's always missed from the end of the sentence: “... and throwing it away when keeping it is no longer legitimate or relevant”.