Whether it’s unstructured rich media, traditional business documents and files, or the audiovisual library of a media company, there is more data around than ever before. And more than ever, that data has potential value – whether to build new content, improve customer relationships, answer the demands of regulators, or even protect your organisation and its intellectual property in court.
Keeping all that data online is a massively complex and expensive task, however – and in some cases it may not even be technically possible. The answer, as some know and many more are learning, is archiving: take the data that is no longer needed on a daily or even monthly basis, but which must be kept accessible 'just in case', and move it to a cheaper form of storage.
The scale of the challenge was highlighted by Chris Luther, director of professional services at content archiving specialist SGL. Speaking at the 2015 Creative Storage Conference, he estimated that “90 percent of deep archived content is never restored. Of the 10 percent that is, 90 percent is restored only once. Since we have no way to determine what the remaining one percent is at the time of archive, [all we can do is] store everything, as economically as possible.”
The underlying concepts for doing this have been around for several decades under various names, the best known being HSM (hierarchical storage management) and ILM (information life-cycle management). Most notably, when an HSM process moves data to a cheaper near-line or off-line storage tier, it leaves a file stub or pointer behind. To the user, it looks as if the data is still in place, accessible and searchable, but with a slower access speed.
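The stub-and-pointer idea can be sketched in a few lines of Python – a toy illustration only, not how any particular HSM product works; the directory names and stub format here are invented for the example:

```python
import shutil
from pathlib import Path

def archive_with_stub(src: Path, archive_dir: Path) -> None:
    """Move a file to a cheaper tier, leaving a pointer stub behind."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / src.name
    shutil.move(str(src), str(dest))      # data now lives on the archive tier
    src.write_text(f"STUB -> {dest}\n")   # stub stays in the original location

# Demo: to the user, "primary/report.dat" still exists – it is just a stub.
Path("primary").mkdir(exist_ok=True)
f = Path("primary/report.dat")
f.write_text("cold data")
archive_with_stub(f, Path("archive"))
print(f.read_text())  # the stub, pointing at the archived copy
```

A real HSM transparently dereferences the stub on access; here the pointer is just text, which is enough to show why the data still "looks" in place.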
The difference now is that the enabling technologies have finally matured enough to make HSM a practicable reality, even for organisations without a team of dedicated storage engineers, says Matthew Addis, CTO of UK-based cloud-archiving specialist Arkivum. “Whether you say HSM or ILM, it's using different storage for different needs. That hierarchy has never gone away,” he says.
“The advantage of archiving is that where access to a backup must go through a portal or client, the archive data mover ensures that the file appears to be in the same place. It's HSM again,” agrees Stéphane Estevez, a senior product marketing manager with Quantum. “The owner of the data needs it transparently, so you need something that can do data movement but not get in the way. This HSM concept has been there for years but for most users it was too complex to integrate. Now it's ready.”
Tiers of joy
Two of the key enabling innovations here are policy-based automated tiering and of course cloud services, with the latter allowing smaller organisations to access enterprise-class applications and technologies. Others include improved discovery software, so you can figure out what data to archive, and better search software to work with it once it has been archived. And then there is the realisation that, as with so much else – most notably the cloud – storage is moving from file and block to object, and that object storage is potentially a natural fit with archiving.
But aren't we already doing elements of archiving when we add tiered storage arrays, perhaps with a tier one of flash or enterprise SAS disk and a tier two of cheaper bulk SATA disk, plus backups to tape or the cloud? Sadly not, because while some backup applications now include a self-service portal for users to recover old versions or files they've accidentally deleted, a backup is not an archive.
“A backup is for hot data, archive is for cold,” explains Walter Fusi, a senior VP of sales at QStar Technologies. He adds that while you still need backups, they are for recovering live systems, not for storing and finding old data – the latter is the job of the archive.
Indeed, if you are one of the many who simply back up everything, including cold data, you could be wasting an incredible amount of storage, argues Stéphane Estevez. “If you use traditional methods, with dailies, weeklies and so on, then for every terabyte of primary storage you could have 10TB of backup – and potentially another 10TB if you replicate everything,” he says. “Even if you de-duplicate, not all data is a good candidate for that – video, for example.”
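Estevez's multiplication is easy to work through. Assuming a hypothetical rotation of six dailies and four weeklies, each a full copy (no dedupe, as with video), the arithmetic looks like this:

```python
def backup_footprint(primary_tb: float, dailies: int = 6, weeklies: int = 4,
                     replicate: bool = True) -> float:
    """Rough storage cost of a traditional backup rotation for cold data.

    Assumed rotation: each daily and weekly is a full, un-deduplicated
    copy, so ten copies of every terabyte; replication doubles it again.
    """
    copies = dailies + weeklies           # 10 full copies
    total_tb = primary_tb * copies        # 10TB of backup per 1TB primary
    return total_tb * 2 if replicate else total_tb

print(backup_footprint(1.0))  # 1TB primary, replicated -> 20.0TB consumed
```

The exact rotation schedule is an assumption for illustration; the point is that cold data kept in the backup cycle is multiplied, while archived data is stored once (plus a small number of protection copies).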
In contrast, Addis says that Arkivum typically stores three copies of an archive: two online in different data centres, and a third offline on tape held in escrow. The escrow copy is not only a last-resort copy of the data, it also covers the client against failures in and of the cloud service.
Fixing the backup problem means figuring out what to archive, and this is a major area of development and innovation. “Discovery is the first thing,” Estevez says. “We use Rocket Software's Arkivio to understand what's in your NAS, say, and get a picture of its footprint, how cold it is, who owns it, and so on. This kind of thing is not new, but there's a lot more pressure now to do it. Plus, these tools have evolved – now they not only do a technical simulation, they also include business metrics such as costs and savings.”
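A discovery tool of the kind Estevez describes is, at its core, a scan that asks how cold each file is, how big it is, and who owns it. The sketch below is a crude stand-in for such a tool, not the product he names; the one-year threshold is an assumed policy:

```python
import time
from pathlib import Path

COLD_AFTER_DAYS = 365  # assumed policy: untouched for a year counts as cold

def discover_cold_files(root: str, now: float = None):
    """List files under `root` not accessed within the cold threshold.

    Returns (path, size_bytes, owner_uid) tuples: the footprint, coldness
    and ownership picture a discovery tool builds before archiving.
    """
    now = time.time() if now is None else now
    cutoff = now - COLD_AFTER_DAYS * 86400
    cold = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            if st.st_atime < cutoff:      # last access older than the cutoff
                cold.append((str(path), st.st_size, st.st_uid))
    return cold  # candidates for the archive tier
```

Real products layer business metrics – storage costs, projected savings – on top of exactly this kind of raw scan.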
Ingestion, not indigestion
Then it is the ingestion process, notes EMC's product marketing VP, Peter Smails. “For large-scale enterprises, there are two real areas of innovation,” he says. “The first is in metadata – indexing content on the ingestion phase is getting much more sophisticated. This upfront investment improves the archive experience across the entire life-cycle of the data, enabling near real-time search and easier, more Google-like searching. The second is in scalability, where exponential data growth is necessitating archiving at the petabyte scale.”
He adds, “We’re starting to see a crossover between big data analytics and archiving, as businesses recognise the inherent value of archived data. With improvements in indexing, and better metadata captured at the start of the archiving process, it’s now possible to glean valuable information from the archive. This changes it from being a cost burden for compliance purposes to a business asset.
“To be successful, however, businesses need to be able to search seamlessly across archive targets. The power of metadata comes from being able to treat the entire archive as a single pool of searchable information without incurring the costs of holding all the data on-site. Emerging archive strategies need to deliver tiered storage as well as holistic search in order to drive both cost savings and business transformation.”
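The principle Smails describes – capture metadata at ingest, then search the metadata as a single pool without touching the archived data itself – can be sketched simply. This is an illustrative toy, not EMC's implementation; the fields and function names are invented:

```python
from pathlib import Path

# Metadata index: searchable without recalling anything from the archive.
index = {}

def ingest_with_metadata(path: Path, target: str) -> None:
    """Capture metadata up front, at ingestion, for later search."""
    st = path.stat()
    index[str(path)] = {
        "name": path.name,
        "bytes": st.st_size,
        "target": target,   # which archive tier/target holds the data
    }

def search(term: str):
    """Search all archive targets as one pool of metadata."""
    return [p for p, meta in index.items() if term in meta["name"]]
```

Because the index spans every `target`, a query does not care whether the underlying bytes sit on disk, tape or in the cloud – which is the "holistic search" being argued for.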
Of course, this isn't the only area of data storage where terms such as holistic search, indexing and metadata are bandied about. Another one that is seeing major growth is object storage, and that's no wonder, because the two have a lot in common – indeed, the one could even be seen as an extension of the other.
“Object storage is becoming more of a go-to plan. You can put archiving into an object store – those two play together very well,” says Matt Starr, Spectra Logic's CTO. “The ability to search the archive is very important – you don't want 'dark data'.”
He continues, “With traditional file storage, people tend to talk speeds and feeds. With object storage it's more about putting in and getting out, the same as with archiving. File systems aren't going away, but over time the lion's share of data will be objects. In two to three years we could see a single-namespace object store with a traditional archive as part of that object store.”
The underlying hardware for the archive is evolving too. Tape is still the most common medium, but it is being used in new ways, notes Stéphane Estevez. “We usually see customers add one tier of archive,” he says. “The classic is tape – you already have it for backup, but now you're probably doing medium-term backups on disk instead, so the tape library could be re-purposed. We do have customers with two archive tiers – for example, a NAS landing zone for fast access, with an automatic copy to tape behind it.”
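The two-tier pattern Estevez describes – files land on fast NAS, then a second copy goes to tape automatically – can be sketched as a simple ingest step plus a copy queue. A minimal sketch under invented names; real data movers handle the tape copy as a background job:

```python
import shutil
from pathlib import Path

def ingest(src: Path, landing: Path, tape_queue: list) -> Path:
    """Two-tier archive sketch: land the file on fast NAS storage,
    then queue it for an automatic second copy to tape."""
    landing.mkdir(parents=True, exist_ok=True)
    fast_copy = landing / src.name
    shutil.copy2(src, fast_copy)   # tier 1: NAS landing zone, fast access
    tape_queue.append(fast_copy)   # tier 2: background job copies to tape
    return fast_copy
```

The landing zone gives users quick restores of recent archive material, while the queued tape copy provides the cheap long-term tier.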