How to tame tech's terrifying Fragmented Data Monster – the Cohesity way

As files pile up, customer numbers grow and storage systems spread, it's only going to get worse


Sponsored One customer, one customer order, right? Wrong.

Sales will have a copy of the original, as will shipping, who have probably copied it to a desktop. That’s two or three, right there. Credit control will receive a copy via email, which might get stored on a network drive, with a copy then sent to accounts receivable. Then the backup and archival processes kick in.

Repeat this, every day, and even the smallest company is soon swimming in copies of the same document.

Welcome to the world of mass data fragmentation - copying, slicing, dicing and then storing - in a multitude of locations - something that’s termed “secondary data.” That is, the data that lives outside the transaction systems in production databases. We’re thinking about data like backups, file and object storage, non-production test and development files, search and analytics. Archived data, too.

Why should you care? One reason is the hidden cost of storing those duplicates. If, and when, it comes to consolidating, you won’t know where to begin. And consolidate you should, for how can you be sure that everybody has the absolute latest and definitive view or understanding of the customer?

Stuart Gilks, systems engineering manager at data management company Cohesity, reckons mass data fragmentation is a function of data volume, infrastructure complexity, the number of physical data locations and cloud adoption. And guess what? They’re all growing at an astounding rate.

IT systems aren’t becoming any less complex, thanks to a combination of organic and inorganic IT growth. A succession of different project owners and IT teams layer different IT systems atop each other over the years, each of which contains secondary data and few of which talk to each other easily.

As far as data volume goes, enterprises are producing data more quickly than they can manage it. Last year, Cohesity surveyed 900 senior decision makers from companies across six countries, with 1,000 employees or more. Ninety eight per cent of them said their secondary storage had increased over the prior 18 months, and most said that they couldn’t manage it with existing IT tools.

Not content with generating more data than they can handle, companies are starting to fling it around more. They started by storing it with single cloud providers, but quickly gravitated to hybrid cloud and multi-cloud systems. Eighty five per cent are using multi-cloud environments, says IBM.

These multi-cloud environments spread data over different domains, each of which usually has its own data management tool. Oh, joy.

“You’ve got a proliferation of locations and you’ve got a proliferation of silos that have a specific purpose. You’re almost generating a problem in three dimensions,” Gilks says. “This makes it difficult to manage systems, risk and efficiency and deliver business value, especially at a time when budgets aren’t going up.”

Not all this data duplication is haphazard, mind. Organisational and legal drivers sometimes force companies to fragment their secondary data. Compliance or security concerns may make it necessary to draw hard lines between different departments or customers, serving each with different copies of the same data.

Multi-tenancy is a good example. You may provide a service to one company or department but be forced to isolate their data completely from company B in the same computing environment, even if some of it is identical. Other reasons may stem from office politics. Server huggers lurk in every department. We said there might be well-understood reasons for creating data silos, but we didn’t say they were all good.


This mass data fragmentation problem creates several impacts that can cripple a business.

The first is a lack of visibility. This secondary data is valuable because there is a wealth of corporate value locked up inside it. Analytics systems thrive on data ranging from call centre metadata to historical sales information. If data is the new oil, then carving it up into different silos chokes off your fuel supply.

The second is data insecurity. Much of that secondary data will be sensitive, including personally-identifiable customer information. Someone who stumbles on the right piece of secondary data in your organization could find and target members of your skunkworks product research team, or customer list, or email everyone in your company with a list of senior management salaries. None of these outcomes are good.

The third, linked impact is compliance. GDPR was a game-changing regulation that made it mandatory to know where your data is. When a customer demands that you reproduce all the data you hold on them, you’d better be able to find it. If it’s smeared across a dozen corporate systems and difficult to identify let alone retrieve, you’re in trouble.
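A toy sketch makes the point. The silo names and schema below are hypothetical, but they show why a subject-access request under GDPR is hard when one person's records are smeared across systems: every silo has to be searched, and missing one means an incomplete response.

```python
# Hypothetical silos holding fragments of customer data. In a real estate these
# would be separate systems (CRM, billing, backups) with separate query tools.
silos = {
    "crm":     [{"email": "jo@example.com", "name": "Jo"}],
    "billing": [{"email": "jo@example.com", "invoice": 1001}],
    "backup":  [{"email": "sam@example.com", "invoice": 1002}],
}

def subject_access_request(email: str) -> dict:
    """Gather every record held on one person, silo by silo."""
    return {name: [r for r in records if r.get("email") == email]
            for name, records in silos.items()}

report = subject_access_request("jo@example.com")
# Jo appears in crm and billing; the backup silo holds nothing on her.
```

With three tidy dictionaries this is trivial; with a dozen real corporate systems, each with its own access method and data format, it is anything but.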

Bloat and drag

Then, there’s the effect on business agility. Developing new systems invariably means supporting and drawing on secondary data sources. The Cohesity survey found that 63 per cent of respondents had between four and 15 copies of the same data, while 10 per cent had 11 copies or more. Those files aren’t just located on a company’s premises; they’re also stored off-site.

Developing new systems while ensuring integrity across all of those file copies might feel like dragging a garbage dump up a mountain. It not only hampers IT’s agility in supporting business requirements, but bloats development budgets too.

The numbers bear this out. Forty eight per cent of those answering Cohesity’s survey spent at least 30 per cent (and up to 100 per cent) of their time managing secondary data and apps. The average IT department spent four months of the working year grappling with fragmented secondary data.

This leaves IT employees feeling overworked and underappreciated. Over half of all survey respondents said that staff were working 10 hours of overtime or more to deal with mass data fragmentation issues. Thirty eight per cent were worried about “massive turnover” on the IT team.

Mass data fragmentation also affects a company’s immediate ability to do business. Ninety one per cent fretted about the level of visibility that the IT team had into secondary data across all sources. That translates directly into customer blindness. If the IT team can’t pull together customer data from different silos, then how can they draw on it for operations like CRM or customer analytics?

Taming in action

So much for the data fragmentation problem. Now, how do you solve it?

The least drastic version involves point systems that manage secondary data for specific workloads. Dedicated backup or email archiving systems are one example. They do one thing really well, although you may well end up needing more than one of them to cope with different departmental silos. In any case, according to Gilks, they don’t handle all of the workloads you might want to apply to secondary data. Instead, you need different software for different things.

Another option is a middleware or integration platform that makes the data accessible at a lower level, for consumption by a variety of applications. These products allow architects to create mappings between different systems. They can program those mappings to extract, transform and filter data from one location before loading it into another.
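The sort of mapping such middleware lets architects define can be sketched in a few lines. Everything here is illustrative - the field names, the source system and the in-memory "warehouse" are stand-ins for real connectors - but the extract-transform-load shape is the same.

```python
# A minimal extract-transform-load mapping. Field names and systems are
# hypothetical; real middleware would query databases or APIs instead of lists.

def extract(source: list) -> list:
    return list(source)  # in practice: a database query or REST call

def transform(records: list) -> list:
    """Normalise field names and filter out records that fail validation."""
    out = []
    for r in records:
        if r.get("customer_id") is None:
            continue  # filter: drop malformed records with no ID
        out.append({
            "id": r["customer_id"],
            "name": r.get("cust_name", "").strip().title(),
        })
    return out

def load(records: list, target: list) -> None:
    target.extend(records)

crm_export = [
    {"customer_id": 42, "cust_name": "  alice smith "},
    {"cust_name": "orphan record"},  # no customer_id: filtered out
]
warehouse = []
load(transform(extract(crm_export)), warehouse)
# warehouse now holds one clean record: {"id": 42, "name": "Alice Smith"}
```

Note what the sketch doesn't do: it moves and cleans data, but the source copy still exists afterwards, which is exactly Gilks' objection below.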

Gilks still sees problems. “Even if it’s completely successful, I still have 11 copies of my data,” he says. “At best, middleware is an effective Band-Aid.”

Ideally, he says, you’d want to consolidate those 11 copies down to a smaller number, whittling away those that weren’t there purely for security and compliance reasons.

“I’d probably want two or three resilient copies at most,” he continues. “I might think about a copy for my primary data centre, a copy for my secondary data centre, and a copy for the cloud.”

Some companies have had success with hyperconvergence in their primary systems. This approach simplifies IT infrastructure by merging computing, storage, and networking components. It uses software-defined management to coordinate commodity compute, storage and network components in single nodes that scale out.

Squeezing data into a collapsed compute-storage-network fabric has its pros and cons. While the hyperconverged kit has no internal silos, it might become its own silo, presenting barriers to the non-hyperconverged infrastructure in the rest of the server room.

Hyperconverged infrastructure also often needs you to scale compute and storage together, and it is typically difficult for non-virtualised legacy applications to access the virtual storage on these boxes.

Perhaps most importantly in this context, you’re unlikely to store the bulk of your secondary data on these systems, especially the archived stuff.

Cohesity applied the hyperconvergence approach to secondary data management, using software-defined nodes that can run on hardware appliances, or on virtualised machines on customer premises or in the cloud. It slurps data and then deduplicates, compresses and encrypts it to produce a smaller, more efficient dataset.
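The general technique behind that "slurp and shrink" step - content-addressed deduplication plus compression - can be sketched briefly. To be clear, the chunk size and store layout below are simplifying assumptions for illustration, not Cohesity's actual implementation (production systems typically use variable-size chunking and far more sophisticated stores).

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # fixed-size chunks for simplicity; real systems vary chunk size

def ingest(data: bytes, store: dict) -> list:
    """Split data into chunks, keep each unique chunk once (compressed),
    and return the list of chunk hashes that reconstructs the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # duplicate chunks are stored only once
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original bytes from the chunk recipe."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)

store = {}
original = b"customer order #1234\n" * 1000  # highly repetitive secondary data
recipe = ingest(original, store)
unique_chunks = len(store)

# Ingesting ten more identical "copies" adds no new chunks to the store:
for _ in range(10):
    ingest(original, store)

assert len(store) == unique_chunks
assert restore(recipe, store) == original
```

The payoff is visible in the last two lines: eleven logical copies cost barely more storage than one, while each remains fully restorable from its recipe.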

The company then offers scale-out storage for access via standard interfaces including NFS, SMB and S3, and provides a range of services via the platform ranging from anti-ransomware and backup/recovery through to an analytics workbench and App Marketplace.

Whichever approach you choose, fighting the multi-headed data beast now will save you budgetary woes later and free your IT department up to be more sprightly in future developments. If you can’t entirely slay the monster, then at least try to tame it a little.

Sponsored by Cohesity


Biting the hand that feeds IT © 1998–2020