Opinion One of the tech industry's longest-running quests is establishing a single source of truth within organizations. That is, no matter who or where you are within a business, when it comes to running the numbers or making a decision, your applications are accessing the same information as everybody else internally. No out-of-date, duplicated, or otherwise imperfect copies.
While that sounds admirable – obvious, actually – in reality it’s difficult to achieve. And it’s getting even harder as we generate and store more data.
IDC and others have suggested there could be up to seven copies of production data littering the IT hallways of your typical enterprise. Not fresh data, just in via the sales department or the website, but long-in-the-tooth data sets produced for long-forgotten purposes, now sitting on storage media taking up space and consuming power and cooling. Or, dozens of copies of the same data spinning on disks and doing nothing useful – again costing money. Or, worse yet, copies of data that some people think are current and are basing decisions upon.
Creating a single golden copy of your organization’s data can pay for itself in terms of storage costs, data accuracy, prevention of uncontrolled data sprawl, and regulation compliance.
So how do you build such a centralized repository in this day and age?
The move away from a single source has, in some ways, its roots in an excess of caution. Copies of a primary source are known as secondary data, and exist in many forms in today’s safety-first world: as backups, snapshots, replicas, and archives. Each of these was born out of necessity. Backups, for example, provide security and reliability should the original source be lost or corrupted. This data will be collected and stored by the second, minute, hour – or whatever your backup schedule dictates – and held long-term.
In other ways, secondary data is a result of modern life: aggregated records of, say, sales and manufacturing data for later analysis; structured databases; unstructured sources; and application development, test, and production environments, all mounting up in an enterprise in different places, all in the form of live data, replicas, snapshots, archives, backups, and more.
What we're saying is: it's understandable to have secondary data sprinkled everywhere. It's also a smart move to unify it into a single source of truth.
Clash of the titans
The fragmentation and duplication is understandable because your backups may be fractured across your organization due to incompatibilities between storage suppliers. For example, a Veeam backup file cannot be read by Veritas backup software, or by Acronis, Asigra, Commvault, Druva, or other vendors' products. Backup formats are proprietary, potentially forcing you to keep one chunk of your data in one architecture and another chunk in a different one – platforms chosen and deployed as your organization has grown at a fast or uneven rate – with a version of some data lingering in both.
The same goes for snapshots and replicas. One supplier's snapshots, from NetApp, for example, can't be read by another supplier's storage products, say Dell EMC's.
Your storage estate may therefore look rather like a library that stores its tomes in various formats: some books as paperbacks on shelves, some in boxes in the basement, some as audio tapes, some on microfilm or microfiche, some in multiple formats, though most in just one. Information cannot be easily transferred, and has to be obtained and managed in its own silo, divorcing it from any central primary source that may have existed, or is being built or rebuilt.
Further complicating matters is the distribution of storage systems and secondary data across the country or the wider world. Throw in hybrid cloud, and, typically, you now have the eccentricities of Amazon Web Services, Google Cloud Platform, and Microsoft Azure to deal with. And, don't forget, data growth seems to be unstoppable. We are in a petabyte era, and moving towards an exabyte one.
We need, somehow, to bring control and order to this confusion, to arrive at a single version of the truth with a coordinated and controllable system of secondary data storage – a converged repository, or a meta source of a single truth.
Two routes to the truth
There are two basic ways to arrive at a single version of the truth. One is to have a single physical repository of your primary data, which apps and users connect to. Such a single home for all your primary data would be a massive beast, and would need to be able to scale out to the exabyte level. Scale-out file systems could cope, though they may be complex to operate and manage.
The other approach is to access centralized primary source data through an abstraction layer that presents the underlying individual physical silos as a single virtual repository, and acts as a control or management plane to administer the data. The end result is the same – a single access point through which users and applications can access data – though the ability to scale can be limited by certain hardware choices. Ideally, a robust integrated hardware and software solution achieving this could provide a higher degree of security and certainty.
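To make the second approach concrete, here is a minimal sketch, in Python, of what such an abstraction layer might look like. All class and key names here are hypothetical illustrations, not any vendor's actual API: each `StorageSilo` stands in for one incompatible physical system, and `VirtualRepository` is the single access point that routes requests to whichever silo holds the data.

```python
from typing import Dict, Iterable, Optional


class StorageSilo:
    """One physical silo, e.g. a vendor-specific backup store (hypothetical)."""

    def __init__(self, name: str, objects: Dict[str, bytes]):
        self.name = name
        self._objects = objects

    def read(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)


class VirtualRepository:
    """Abstraction layer: presents many silos as one virtual repository."""

    def __init__(self, silos: Iterable[StorageSilo]):
        self._silos = list(silos)

    def read(self, key: str) -> Optional[bytes]:
        # Query each underlying silo in turn; the caller never needs
        # to know which physical system actually holds the data.
        for silo in self._silos:
            data = silo.read(key)
            if data is not None:
                return data
        return None


# Usage: two incompatible silos sit behind one access point.
silo_a = StorageSilo("vendor-a", {"sales/q1": b"sales records"})
silo_b = StorageSilo("vendor-b", {"mfg/line3": b"manufacturing data"})
repo = VirtualRepository([silo_a, silo_b])

data = repo.read("mfg/line3")  # found, regardless of which silo holds it
```

A real implementation would of course add writes, caching, deduplication, and access control in the management plane, but the design point is the same: users and applications see one namespace, while the silos stay physically separate underneath.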
One truth at last?
A single version of the truth for your business is a lofty yet essential goal to maximize business opportunity. It means starting from an initial project and widening the scope as your journey develops. Migrating away from the data sprawl that many are experiencing is going to become more important over time, and a golden master would seem to be the route to achieving that. It should save you a great deal of management time while ensuring data integrity and truthfulness. ®