Make sure big data doesn't land you in big trouble
Lock up your secrets
Size isn't everything. Big data may be about storing terabytes or petabytes of information but it is also about complexity, and complexity often brings security challenges. Are you ready to handle them?
Right now, someone in a marketing or finance role somewhere in your organisation is probably putting together a big data proposal, and if they aren't it won't be long before they think of it.
Beyond using a strong password to access a database, they probably have no idea of the security burden it will bring. That means the security part is down to you.
Here are some of the things you will have to think about to ensure that the business gets the data insights it wants, without also getting a truckload of unwanted legal headaches and headlines in the press.
Back in the day, mounds of customer data would be stored in data warehouses and drilled periodically for customer insight. The type of data stored on these systems, along with the batch-based queries that ran on it, differ significantly from today’s big data systems.
For one thing, big data is less structured. It is also drawn from a variety of sources, whereas data warehousing data was sourced mainly from internal systems.
These differences can present legal ownership problems, explains Kim Walker, a partner at law firm Thomas Eggar.
“Most things belong to someone and there will be a lawyer somewhere who has defined certain terms for their use,” she says.
“They will define what you can use, and then terms about the content and quality of the data.”
Victor White, director of marketing operations for Gigya, knows from practical experience about the need for establishing ownership boundaries and meeting licensing requirements.
Gigya is a business intelligence company which gathers information on its clients’ customers using tools such as logins on social media sites and managed online registration services. It vacuums up social data, among other data types, and uses it to deliver insights to clients about their own customers.
Those social media services have their own responsibilities for user data, which are passed on to companies using it for big data purposes. “If you store social data, you have to adhere to the terms of service,” White says.
The company introduced its own social compliance service to ensure that when it came to using third-party data, it coloured inside the lines.
The data that you source may also carry privacy protection issues. In addition to the simple customer records stored in yesteryear’s OLAP cubes, big data environments can derive information from sensitive sources, including, say, customer loyalty cards.
Those cards can also now be mobile apps, of course, potentially helping companies to understand where customers were when they accessed the apps.
Mobile phones these days are sensors in themselves, but there are also dedicated health sensors that can feed back customer data. A recent report on big data by the UK's Information Commissioner’s Office points out that the processing of personal data must be “fair”, and part of that comes down to how it is collected.
“Organisations need to be transparent when they collect data, and explaining how it will be used is an important element in complying with data protection principles,” says the report.
“We must design the collection of data so that we are stripping out information that really we don’t need"
Jim Reavis, co-founder and CEO of the Cloud Security Alliance, advises that smart companies design their data collection policies to minimise their compliance risk.
“We must design the collection of data so that we are stripping out information that really we don’t need and substituting it with other types of information,” he says.
Filters may be necessary to cut out personally identifiable information (PII) altogether. “We may potentially be able to find ourselves not required to comply with a whole host of regulations if we can do that in the right way,” he adds.
This considered approach involves understanding up front what the goals are for a big data solution, rather than simply slopping as much data as possible into a bucket and hoping that something insightful will trickle out in the end.
Because data arrives from so many different sources, often in real time, it is also important to validate the input so that malicious actors don't pollute the data set. Reavis's organisation recommends building systems that validate endpoints and detect malicious data feeds.
Time is precious
Another particular challenge of big data is that it is far more temporally focused than yesterday’s business intelligence systems. These were mostly static affairs, with snapshots of sales and other data updated on a regular, but far from real-time, basis. You could run reports on monthly sales data, for example.
Big data is different because it comes in from multiple sources much more frequently, some being updated in real or near-real time.
“Big data retains time cycles so that you can maintain concurrency information,” says Richard Chew, director of consulting for Emerald Management Group, who also served on IT security association ISACA’s project development team to create a white paper on privacy and big data.
Time data can create unforeseen problems because even carefully defined data set collection routines can impinge on privacy.
“It can lead to positive re-identification. There are situations where times and dates can link back to the same people,” warns Chew.
This is one example of a broader re-identification problem, where several layers of data can be aggregated to deduce someone’s details by removing entropy. This can happen even when there is no PII explicitly in the data set.
Entropy describes how uncertain we are about a particular piece of data. Princeton assistant professor of computer science Arvind Narayanan argues that with 6.6 billion people on the planet, you need only 33 bits of information (entropy) to identify them.
The more information you give away about that person, the less entropy you are left with. If you live in metro Vancouver (population 2.3 million), then that drops to 22 bits of entropy. And if someone can start cross-referencing that stuff with information from social media feeds and time series data, the bits will drop off fairly rapidly.
Narayanan sliced away enough entropy from Netflix’s anonymised data set of user activity to identify several individual user records.
We know who you are
Another famous case involved AOL, which released an anonymised search history for researchers to analyse and extract insights from. It was broken, revealing the identity and search history of one embarrassed internet user.
With dangers like these, companies must take care when developing filters to analyse their data. Whereas the data warehouses of old were interrogated using particular parameters, big data is often mined regularly for possible correlations between data sets, so far more of the structured and unstructured data is used in different combinations.
In some cases, software may even learn which algorithms are most successful, so that they can be used to create new search algorithms. All of this must be handled within privacy boundaries.
In some cases, the lines between big data and operational servers are blurred, with direct lines between big data sources and operational systems. It is more akin to the DevOps application development model, in which development and operation teams work together, rolling out changes to the usage of the information in real time.
Reavis warns that this can often involve making several changes to an application in a day, which contrasts with traditional warehouses where system operations were highly scheduled.
Big data moves systems away from the world of batch processing to one of constant change, which can create additional problems for security experts.
“That agility creates opportunities for mistakes to be made. That’s a consequence of wanting to solve problems faster,” Reavis says.
It also creates a need for real-time security monitoring, so that administrators can watch for unauthorised access, data poisoning attacks and snooping. Security information and event management tools may be useful here, as are specific tools to monitor requests to the Hadoop or alternative big data systems.
These infrastructural technical challenges can manifest themselves in various ways. The CSA's Big Data Working Group addresses several of them, including security in distributed processing. Hadoop, which is one of the most commonly used big data stores, uses MapReduce, a function to distribute data-processing tasks over multiple nodes.
Rogue or malfunctioning nodes could be used to snoop on queries or alter their results, says the CSA.
Threat mitigation involves establishing and maintaining trust between nodes, and also establishing mandatory access control for the nodes and their controller.
NoSQL and similar non-relational databases also carry risks, because database level security in these systems has been minimal. Injection attacks and unauthorised access are just two dangers observed by the CSA, which advises wrapping NoSQL databases in a secure middleware layer to shield direct access to the data.
The organisation also recommends encrypting data to help protect it from attack. Hadoop offers file layer encryption.
Nevertheless, things may be more complex in a big data environment where fast access is needed to data from different sources, which is all in the same data store and being accessed by different parties.
Arnab Roy, a member of research staff at Fujitsu Laboratories, notes that homomorphic encryption is receiving a lot of attention in the cryptographic world.
"The goal is to use encryption while performing computation on data," he says.
The other option is functional encryption, which enables applications to derive controlled knowledge in the cloud while working with encrypted data. So an application may be able to pinpoint specific search results while working with an encrypted data set, but nothing else.
Techniques such as these can also help to offset the dangers inherent in multi-tiered storage, where data may be moved between storage layers to save money. Points of transmission between tiers can be considered security weak spots, especially if service providers are involved.
Infrastructural challenges may also be influenced by broader legal considations that force other issues onto big data users, says Roy.
Roy has worked with the National Institute of Standards and Technology (NIST) and the CSA on big data security issues. NIST’s big data working group has a security and privacy subgroup which is attempting to map big data topics onto a generic security reference architecture.
"Big data management also has its own factors, such as the provenance of data as it moves through certain contexts, like geographical and legal and trust boundaries," says Roy.
White, who has experienced the challenges associated with legal and geographical boundaries firsthand, warns that companies using big data should be thinking hard about data residency issues. They must sometimes mirror their data governance efforts across multiple geographies, which can introduce its own security issues.
Gigya needs data centres all over the world, thanks to regional regulation. “Many governments want you to house customer data regionally. That represents a big challenge overall that will be hard for companies to accomplish,” White says.
Several countries have required organisations to store information about their citizens on local soil. Indonesia, for example, enacted Government Regulation PP82/2012, which forces data centres providing services for its citizens to be maintained within the country. Russia has passed similar laws which will come into effect in 2016.
Typically, companies roll up legal and technical risks into a broader governance framework designed to mitigate them. IT is well understood within corporate risk management frameworks but big data governance is still immature, warns Roy.
"There are compliance frameworks for specific verticals, but there are no compliance standards for big data," he says.
Big data can shine a light onto organisations' innermost secrets. The trick is to avoid giving up your customers' privacy into the bargain or allowing attackers to discover them. The challenges span areas including infrastructure, data privacy and data management.
And because big data solutions are so powerful, the stakes are high. Careful planning and foresight is crucial if you are to enjoy the benefits of this technology, while avoiding the security and privacy nightmares it could bring. ®