Go back 15 years and big data as a concept was only just beginning.
We were still shivering in an AI winter because people hadn't yet figured out how to funnel large piles of data through fast graphics processors to train machine-learning software. Applications that gulped down data by the terabyte were thin on the ground. Today, these gargantuan projects are driving the reuse of data within organizations, enabling staff to access a central pool of information from multiple systems. The aim of the game is to give your organization a competitive advantage over its rivals.
Capturing, managing, and sharing data brings challenges involving everything from privacy and security to quality control, ownership, and storage. To surmount those challenges, you need what some folks call a “data strategy.” And you're probably wondering, rightly, what on Earth does that actually mean?
First, why do you need one?
A strategy at the top coordinates things at the bottom. Without a top-level data strategy, you may end up with multiple projects overlapping each other, creating data sets that are too similar or otherwise in conflict with each other. A data strategy should define a single source of truth in your organization, thus avoiding any inconsistencies and duplication. This approach also reduces your storage needs by making people draw from the same data pool and reuse the same information.
It also helps with compliance. Take, for example, GDPR and similar privacy regulations that give people the right to request and delete data you hold on them. How will your organization painlessly and reliably find these requested records if your piles of data are not properly classified and managed centrally somehow? Your data strategy should set out how you classify and organize your internal information.
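To make that concrete, here's a minimal sketch of how a centrally managed classification makes an erasure request painless: a catalog maps each data subject to every system and record holding their personal data. All names here (the `Catalog` class, system and record identifiers) are hypothetical, not from any particular product.

```python
# Sketch of servicing a "right to erasure" request via a central catalog.
# The catalog maps each data subject to every (system, record) location
# that holds their personal data -- names are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Catalog:
    # subject_id -> list of (system, record_id) locations holding their data
    locations: dict = field(default_factory=dict)

    def register(self, subject_id: str, system: str, record_id: str) -> None:
        self.locations.setdefault(subject_id, []).append((system, record_id))

    def erasure_plan(self, subject_id: str) -> list:
        """Return every record that must be deleted for this subject."""
        return self.locations.get(subject_id, [])

catalog = Catalog()
catalog.register("subject-42", "crm", "cust/9913")
catalog.register("subject-42", "billing", "inv/2024-118")

print(catalog.erasure_plan("subject-42"))
# -> [('crm', 'cust/9913'), ('billing', 'inv/2024-118')]
```

Without such an index, each deletion request becomes an ad-hoc trawl through every system; with one, it's a lookup.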
A data strategy can also cut response times for data requests. If you know what kinds of data you have, where it is all stored and how it is all organized, and that it is clean and easily accessible, as demanded by your strategy, you can answer queries fast. This is something McKinsey reckons can help reduce your IT costs and investments by 30 to 40 per cent. Finally, a data strategy should define how you determine and maintain the provenance of your enterprise data so that you always know exactly where it came from.
Inside a data strategy
Back to the crucial question at hand: what exactly is a data strategy? According to SAS [PDF], it should cover five components: identification, storage, governance, provisioning, and process. We'll summarize them here:
The strategy should insist data is consistently classified and labeled, and define how to do exactly that, so that information can be shared smoothly and easily across systems and teams. This must include a standard definition for metadata, such as the type of each information record or object, its sensitivity, and who owns it.
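One way to pin down such a standard metadata definition is as a typed record that every system must fill in when it stores an object. The field names and sensitivity levels below are illustrative assumptions, not taken from any particular standard.

```python
# A hypothetical standard metadata record: type, sensitivity, and owner
# must be supplied for every stored object, so all systems speak the
# same classification language.

from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"

@dataclass(frozen=True)
class Metadata:
    record_type: str          # e.g. "customer-record", "research-note"
    sensitivity: Sensitivity  # classification level for access control
    owner: str                # accountable person or team

meta = Metadata("customer-record", Sensitivity.CONFIDENTIAL, "crm-team")
print(meta.sensitivity.value)   # -> confidential
```

Making the record frozen (immutable) means classification can only be changed through a deliberate re-registration step, not silently mutated.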
This is something you should implement as soon as possible: for example, slipshod front-office data entry today will wreck the quality of your back-end databases tomorrow. Applications should be chosen and configured so that users enter information and metadata consistently, clearly, and accurately, and only enter relevant information.
This data needs to live somewhere, and your strategy should document where information is held and how it is backed up. Whether it is structured or unstructured, the data should at least be stored in its original raw format to avoid any loss in quality. You could even consider placing it all in what is now called a data lake. To give you an idea of what we mean by this, Goldman Sachs, in its attempt to become the Google of Wall Street, shunted more than 13PB of material into a data lake built on top of Hadoop. This deep pool holds all sorts of material, from research notes to emails and phone call recordings, and serves a variety of folks including analysts, salespeople, equities traders, and sales bots. The investment bank also uses these extensive archives for machine learning.
Data lakes are often built separately from core IT systems, and each one maintains a centralized index of the data it holds. They can coexist with enterprise data warehouses (EDWs), acting as test-and-learn environments for large cross-enterprise projects while EDWs handle intensive and important transactional tasks.
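The lake-plus-index idea can be sketched in a few lines: raw objects are kept untouched after ingest, while a separate catalog records where each object lives and what it is. Real lakes use dedicated catalog services (Hive Metastore, for instance); this toy version just shows the separation of concerns, with made-up paths and record kinds.

```python
# Toy data lake: an immutable raw store plus a centralized index.
# Queries go through the index rather than trawling the raw objects.

raw_store = {}   # path -> raw bytes, never modified after ingest
index = []       # catalog entries describing every object in the lake

def ingest(path: str, payload: bytes, kind: str) -> None:
    raw_store[path] = payload               # keep the original raw format
    index.append({"path": path, "kind": kind, "size": len(payload)})

ingest("notes/2024/meeting.txt", b"Q3 research note", "research-note")
ingest("calls/rec-001.wav", b"\x00\x01", "call-recording")

# Find everything of one kind via the index alone
print([e["path"] for e in index if e["kind"] == "research-note"])
# -> ['notes/2024/meeting.txt']
```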
Physically storing and backing up this data in a reliable and resilient manner is a challenge. As the amount of information that can be hoarded increases, factors like scale-out, distributed storage, and the provisioning of sufficient compute capacity to process it all must feature in your data strategy. You must also specify the usual table stakes of redundancy, fail overs, and so on, to keep systems operational in the event of disaster.
For on-premises storage, you can use distributed nodes in the form of just-a-bunch-of-disks (JBOD) boxes sitting close to your compute nodes. This is something Hadoop uses a lot, and it's good for massively parallel data processing. The downside is that JBOD gear typically needs third-party management software to be of any real use. An alternative approach is to use scale-out network-attached storage (NAS) boxes that include their own management functionality and provide things like tiered storage. On that subject, consider all-flash storage or even in-memory setups such as SAP HANA for high-performance data-crunching.
That's fine, you say, but what about servers with direct-attached storage containing legacy application data? You don't necessarily need to ditch all that older kit and bytes right away. With data repositories now so large, it's difficult to store them in one place, and with data living in so many areas of an organization, you may find that data virtualization is an option. Data virtualization can be used to create a software layer in front of these legacy data sources, allowing new systems to interface with the old.
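A minimal sketch of that virtualization layer, assuming two invented legacy backends with different query styles: a facade class presents them to new systems as one logical dataset. The class names and query methods here are made up for illustration; commercial data virtualization products do this federation at far greater scale.

```python
# Data virtualization sketch: a uniform facade in front of heterogeneous
# legacy sources, so new systems query a single interface.

class LegacySqlSource:
    def run_sql(self, customer_id):           # stand-in for a SQL backend
        return {"id": customer_id, "name": "Alice"}

class LegacyFileSource:
    def read_record(self, customer_id):       # stand-in for a flat-file store
        return {"id": customer_id, "balance": 120.0}

class VirtualCustomerView:
    """Presents multiple legacy sources as one logical 'customer' dataset."""
    def __init__(self, sql, files):
        self.sql, self.files = sql, files

    def get_customer(self, customer_id):
        merged = {}
        merged.update(self.sql.run_sql(customer_id))
        merged.update(self.files.read_record(customer_id))
        return merged

view = VirtualCustomerView(LegacySqlSource(), LegacyFileSource())
print(view.get_customer(7))
# -> {'id': 7, 'name': 'Alice', 'balance': 120.0}
```

The legacy systems stay where they are; only the facade needs to know how to talk to each of them.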
You don't need to do it all on-premises, of course. Cloud-based data lakes are a thing, though you may need to compose them from various services. For example, Amazon offers CloudFormation templates that bolt together separate services such as Lambda, S3, DynamoDB, and Elasticsearch to create a cloud-based data lake. Goldman Sachs uses a combination of AWS and Google Cloud Platform as part of its data strategy.
Finally, you need to include forward plans for your organization: you may not need, say, real-time analytics right now, though you may need to add this capability and others like it in future. Give yourself room in your data strategy to expand and adapt, using gap analysis to guide your decisions.
With all sorts of information flowing in, you don't want to end up with a cesspool of unmanaged and unclean records, owned and accounted for by no one. Experts call this a data swamp, and it has little value. Avoid this with the next part of your data strategy: governance.
Governance is a huge topic with many elements to it, so the Data Management Body of Knowledge [PDF], aka the DAMA DMBOK, is a good place to start when drafting your strategy. McKinsey, meanwhile, advises making people accountable for data by grouping information into domains – such as demographics, regulatory data, and so on – and putting someone in charge of each area. Then, have them all sit under a central governance unit that dictates things such as policies, tools, processes, security, and compliance.
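The domain-ownership idea boils down to a simple accountability map that the central governance unit maintains: every domain has a named owner and a set of applicable policies. Domain names, owners, and policy labels below are invented for illustration.

```python
# A hypothetical accountability map: each data domain has an owner and
# the policies the central governance unit applies to it.

domains = {
    "demographics": {"owner": "jane.doe", "policies": ["gdpr-retention"]},
    "regulatory":   {"owner": "compliance-team", "policies": ["sox-audit"]},
}

def owner_of(domain: str) -> str:
    """Who is accountable for this domain's data?"""
    return domains[domain]["owner"]

print(owner_of("regulatory"))   # -> compliance-team
```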
Your data strategy should not focus on just importing, organizing, storing, and retrieving information. You must document how you will process your pool of raw data into something useful for customers and staff: how your precious information will be transformed, presented, combined, and assembled, or whatever else is needed to turn it into a product. The aim here is to plan an approach that avoids overlapping efforts and duplicated code, applications, or processes. Just as you strive to reuse data across your organization without duplication, your strategy should ensure the same applies to your processes: well-defined pipelines or assembly lines that turn raw data into polished output.
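The "well-defined pipeline" idea can be sketched as one reusable chain of steps that turns a raw record into a polished product, so teams share stages instead of each reimplementing cleaning or enrichment. The steps here are placeholders standing in for whatever transformations your data actually needs.

```python
# A shared pipeline: raw record in, polished product out. Each stage is
# a small reusable function; teams extend the chain rather than fork it.

def clean(record: dict) -> dict:
    # Strip stray whitespace from every string field
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def enrich(record: dict) -> dict:
    # Tag the record with its origin (placeholder enrichment step)
    return {**record, "source": "central-pool"}

def present(record: dict) -> str:
    # Final presentation step: turn the record into display text
    return f"{record['name']} ({record['source']})"

PIPELINE = [clean, enrich, present]

def run(record: dict):
    out = record
    for step in PIPELINE:
        out = step(out)
    return out

print(run({"name": "  Acme Corp  "}))
# -> Acme Corp (central-pool)
```

Because each stage is a separate function, a new product can reuse `clean` and `enrich` and swap in its own final step, rather than duplicating the whole chain.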
Finally, you need to think about getting the data where it is needed. This involves not just defining sets of policies and processes on how data will be used, but also – potentially – changes to your IT infrastructure to accommodate them. Goldman Sachs, for example, published a set of APIs that allowed customers and partners to see and access data in addition to internal users.
Writing it all up
It's one thing to have aspirations about strategy. Now you have to write it down and stick to it. Don't be afraid to keep it simple. Be realistic. Make it crystal clear so that staff reading it know exactly what needs to be done. Grandiose and poorly defined initiatives are costly and difficult to implement. With vague goals and big price tags, you can quickly run into trouble with budgets and deadlines.
Break your data strategy and accompanying infrastructure changes into discrete goals, such as providing new reporting services, reaching a particular level of data quality, and setting up pilot projects. Identify the domains of data you intend to create, and develop a roll-out plan for each domain. Dedicate individual teams to each domain: they own it, they clean and maintain it, they govern it, and they provide it.
Your data lake doesn't have to start off as some vast ocean. It can grow in size and functionality over time, starting off as a resource of raw data that data scientists can experiment with. Later, as it matures, you can integrate it with EDWs and perhaps even use it to replace some operational data stores. There's nothing wrong with a medium-sized data pond to start with. Goldman Sachs' data lake contained just 5PB back in 2013.
An effective data strategy will mean tweaking your organization, your governance process, and your data gathering and management processes. More than that, though, it will mean taking a long, hard look at your infrastructure. It isn't feasible for most companies to wipe their entire IT portfolio clean and start again, though you can modernize parts of it as your strategy evolves. ®