If you have ever been asked to recover an old, lost or deleted file, you will know just how hard people find it to tell the difference between backup and archiving. The administrator's workload has grown so much that backup companies have even added user self-service portals to ease it.
The problem has been accentuated as companies have moved the backup process off tape and onto disk-based arrays and appliances to get faster backups and restores. After all, modern disk-to-disk backup appliances look remarkably similar to the sort of disk arrays typically used now for secondary storage.
But there are a lot of challenges and problems associated with using a backup as an archive. One of the most obvious is retrieval: once you have run backups for three or four years, finding a specific file is going to be all but impossible unless you have strong indexing and data management tools.
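To see why indexing matters, here is a minimal sketch (not any vendor's product, just an illustration) of the kind of catalogue a backup tool maintains: a map from each file path to the backup sets that contain it, so a restore request can be located without trawling through every set.

```python
from collections import defaultdict

def build_catalog(backup_sets):
    """Map each file path to the backup sets that contain it.
    `backup_sets` maps a set label (e.g. a run date) to the list
    of paths that run captured."""
    catalog = defaultdict(list)
    for label, paths in backup_sets.items():
        for path in paths:
            catalog[path].append(label)
    return catalog

def locate(catalog, path):
    """Return the labels of the backup sets holding a given file."""
    return catalog.get(path, [])

# Illustrative data: two nightly runs, one file dropped in the second.
runs = {
    "2023-01-01": ["/home/a.doc", "/home/b.doc"],
    "2023-01-02": ["/home/a.doc"],
}
catalog = build_catalog(runs)
```

Without a catalogue like this, answering "which tape or disk set holds last March's version of this file?" means mounting and scanning every backup in turn.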
Using your backup as if it were an archive is undoubtedly inefficient, but so is operating the two as independent systems, each with its own hardware and software.
Points on a spectrum
The best approach is to treat the two simply as separate points along a spectrum of data availability – after all, they often use much the same underlying hardware. The main differences lie in the roles they play.
Increasingly, that underlying hardware is more alike than we might think, says Steve Mackey, vice president international at Spectra Logic, a tape library vendor which a few years ago pivoted away from backup and towards archiving.
“We used to design tape libraries primarily for backup and then repurpose them for archiving. Now they are primarily designed for archiving,” Mackey says.
“It's all about integrity of data, the quality of the media, recoverability and so on. Every archiving system also has a disk cache. Archiving involves multiple different technologies.”
Briefly, a backup is usually a secondary copy of primary data kept for system recovery, while an archive is the primary copy of data that has been moved onto cheaper, lower-performance hardware, whether locally or in the cloud.
Indeed, you might very well need to back up your archive, suggests Frank Reichart, senior director product marketing storage at Fujitsu Technology Solutions.
“Typically we see three problems. The first is that a lot of users still don't see the difference and are treating backup as an archive,” he says.
“That is a very inefficient way to work, as there are typically multiple copies of the same data in backup processes (daily, weekly, monthly backups and so on), and also different versions depending on the time backups are made. Archived data needs to exist only once and in one final version.
“The second thing is retrieval. Backup is very poor for getting specific data back.”
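Reichart's first point can be made concrete. A backup cycle accumulates many copies of the same file across its daily, weekly and monthly runs; an archive wants exactly one, final version per file. The sketch below (an illustration, not Fujitsu's implementation) collapses a pile of backup copies down to that single latest version.

```python
def final_versions(backup_copies):
    """Reduce redundant backup copies to the one final version per
    file that an archive needs. `backup_copies` is a list of
    (path, timestamp, content) tuples drawn from successive backup
    runs; the copy with the latest timestamp per path wins."""
    latest = {}
    for path, ts, content in backup_copies:
        if path not in latest or ts > latest[path][0]:
            latest[path] = (ts, content)
    # Strip the timestamps, keeping only the surviving content.
    return {path: content for path, (ts, content) in latest.items()}

# Three backup runs captured report.doc three times, but the archive
# only needs the final version.
copies = [
    ("report.doc", 1, "draft"),
    ("report.doc", 2, "revised"),
    ("report.doc", 3, "final"),
    ("notes.txt", 1, "notes"),
]
```

The same reduction, run the other way round, shows why using a backup as an archive wastes capacity: every retained backup set carries another copy of largely unchanged data.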
One option is to have a single integrated and unified data protection appliance that can provide both functionalities.
“An intelligent backup appliance can keep backup and archiving logically separate while converging the hardware,” Reichart says.
“If people understand the difference and implement good backup and archiving, they typically have separate storage for each. That is not necessary with intelligent data protection appliances though, as they can share the hardware and have common services.”
There is even an emerging division within archiving, with two different classes of archive each needing different service levels, according to Bob Plumridge, the chair of storage industry group SNIA-Europe.
“You need to ask why you are archiving. Is it for compliance? For business reasons? Or just because you are keeping everything for safety's sake? People used to archive to tape because they thought they would never touch that data again,” he says.
“There are products that will scan your backups for age or rate of change – for example, should this element be in an archive instead because it hasn't changed in months? There are also a lot more online archives, where data is protected and immediately accessible, but not in your backup cycle. It's a third group of data, a lot of which has come about through regulatory changes.”
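The scanning Plumridge describes can be sketched in a few lines. This is a toy illustration of the idea, not any shipping product: walk a tree and flag files whose modification time suggests they belong in an archive rather than the backup cycle. The 180-day threshold is an assumed policy for the example.

```python
import os
import time

# Illustrative policy: anything untouched for ~6 months is a candidate.
ARCHIVE_AGE_SECONDS = 180 * 24 * 3600

def archive_candidates(root, now=None):
    """Walk a directory tree and yield paths of files that have not
    changed within the archive-age window."""
    now = now if now is not None else time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if now - mtime > ARCHIVE_AGE_SECONDS:
                yield path
```

A real product would also track rate of change across scans and feed the results into a policy engine, but the core question it asks per file is the same one shown here.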
Steve Mackey agrees. “Content is also an archive. We would describe it as an active archive, with people constantly accessing it. For example, in big broadcast archives you might need access at short notice to news segments to use in an obituary, say. When you need it, you need it fast, but you don't know when that will be,” he says.
“The other kind of archive is stuff you want to keep but you don't know if you will need to access it or not. You still want it secure, often because it's regulated, and you need the ability to prove deletion, for instance. You might need to recover it for an audit, but there is no requirement for regular access.”
It is important to keep in mind that these are just more points on that data storage spectrum. They are separate use-cases perhaps, but ideally they need to be part of the same integrated and co-ordinated management framework. They all use the same underlying technology, whether that be disk and tape arrays or de-duplication and data compression software.
If it sounds familiar, that is not too surprising: this is pretty much what hierarchical storage management promised back in the 1980s and 90s. The same ideas were subsequently repackaged a decade later, first as part of the information lifecycle management (ILM) concept.
And then when the ILM name became tainted by some high-profile project failures, they reappeared as storage tiering. Since then they have gradually worked their way into pretty much every serious storage management software or storage subsystem.
The ability to use the same disk-to-disk appliance for both backup and archiving could pay off in other ways as companies realise the value that still exists in their older data.
“If you think about big-data analytics, most is focused on real-time analysis. But in the future it could be more on what can we do with the last few years' data,” says Plumridge.
“So more and more organisations are looking at online archiving rather than tape. These are not particularly huge archives today but it will be interesting to see what happens when they get to multiple petabytes. Could some of them move to tape?
“One other aspect to consider is when the archive is not disused data as such, but is actually a working content store – it just happens to be older content that has been moved off the primary systems and storage.”
It seems clear though that for most organisations there are opportunities for optimisation when it comes to the overlaps between backup and that other kind of long-term “for safety's sake” storage.
Whether you call it deep archiving, ILM or storage tiering, and whether you store it on disk, tape or both, it seems there are significant storage savings to be made. ®