The world and its dog have been shocked by the Prism news story. Early in June, we found out that the US National Security Agency (NSA) had developed a secret data-gathering mechanism to steal all our data and store it in a large data warehouse.
We are outraged that it is being mined, searched and otherwise prodded. But do we really think that big data security problems stop at Google, Facebook, Microsoft and Fort Meade?
The private sector has been collecting data on all of us for ages. It is stored in massive data sets, often spread between multiple sources. What makes us think this is any more secure? At least the NSA is well trained in keeping it all under lock and key.
What does “big data” mean, anyway? Some describe it – wrongly – as simply a lot of data in a relational database. But if that were the case, then the security challenges would be the same as for conventional databases. And they aren’t.
Others view it as data sets so large that they cannot be handled by traditional relational tools. But we have had that kind of thing for years, in the form of data warehouses.
One difference is that modern large data sets often consist of far more varied data, including unstructured stuff such as tweets. Big data is inherently social, meaning that much of it is personal.
Big data is also supposed to perform better. Really large data sets can be tuned to look for “weak signals” – emerging trends that a traditional data warehouse-based business intelligence system may not have spotted.
The goal is also to have them work quickly, so that they can help companies predict and react to market trends efficiently. No more three-day turnarounds for specific reports here.
So big data is more complex, more flexible and faster. It is powerful stuff, but with power also comes risk. Big data carries unforeseen security consequences, warns Tony Lock, programme director at analyst firm Freeform Dynamics.
“Customers give you data to use for certain purposes, but they may not have allowed you to start crunching it to answer all kinds of questions,” he says.
Many companies have not considered those issues, he adds.
In practice, says PA Consulting IT specialist James Mucklow, this means you must have a clear policy, explaining what you are going to use customers’ data for.
Big data can provide deep-dive profiles of individuals by using sources that we are not always aware of. Take loyalty cards, for example.
The credit-card industry spends millions developing and enforcing data security and privacy guidelines for the storage of personal financial information. Anyone dealing with currency transactions of any sort is heavily regulated. But loyalty points are not currency and don’t face the same kinds of rules.
We know where you live
Yet loyalty card customers provide mounds of personal information, both directly and indirectly. They may hand over names, addresses, gender, phone numbers, birthdays and email addresses. Sometimes, they even reveal their income.
Even basic postal code information can enable companies to infer more information about you, based on the demographic data for your area.
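As a minimal sketch of how that inference works – with made-up demographic figures and a hypothetical `enrich` function, where a real system would join against commercial or census data sets – a postcode area alone is enough to attach an income and age profile to a customer record:

```python
# Hypothetical demographic lookup keyed on the outward postcode area.
# The numbers here are illustrative, not real census data.
area_demographics = {
    "SW1": {"median_income": 65000, "median_age": 38},
    "M1":  {"median_income": 28000, "median_age": 29},
}

def enrich(customer):
    """Attach inferred demographics to a customer record by postcode area."""
    area = customer["postcode"].split()[0]  # outward code, e.g. "SW1"
    profile = dict(customer)
    profile.update(area_demographics.get(area, {}))
    return profile

customer = {"name": "A. Smith", "postcode": "SW1 2AB"}
print(enrich(customer))
```

The customer never stated an income, but the record now carries one inferred from the area.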
Every purchasing decision can be tracked and sucked into a wider data set. Suddenly, data has got much bigger – and much more personal.
“You have to be sure that you are seen to be using the data in a responsible way,” says Mucklow. He cites the story of US retailer Target, which worked out that a teenage girl was pregnant before she had told her parents, and let the cat out of the bag by sending her leaflets advertising baby products.
One of the biggest challenges for companies holding big data sets is that they are like the pan-dimensional, hyper-intelligent beings that built Douglas Adams’s computer, Deep Thought.
They asked the computer the meaning of Life, the Universe, and Everything. After 7.5 million years, it told them that the answer was 42, and it transpired that they had never really understood the question.
Big data sets are massive pools of data, designed to answer questions that people don’t even know they want the answer to. It is tricky defining privacy policies that provide enough flexibility to make proper use of the data and enough privacy to ensure that customers are happy.
Ideally, all of this data would be rendered anonymous, but this can provide a false sense of security, warns Jamal Elmellas, technical director for Auriga, a security consulting firm.
“The mechanism you use to anonymise that data must be sufficiently robust to not breach the Data Protection Act but also leave the data in a state that is useful for what you want to achieve,” he says. “It is a very fine line.”
Unfortunately, companies often get it wrong. Data ends up being “pseudo-anonymised”, he warns, making it relatively easy to reassemble into information that can identify individuals.
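One common weak approach – sketched here as an illustration, not as any particular company's method – is to replace identifiers with a deterministic hash. Because anyone with a list of plausible identifiers can hash each candidate and look for a match, the “anonymisation” falls to a simple dictionary attack:

```python
import hashlib

def pseudonymise(email):
    """Replace an identifier with its SHA-256 hash.

    The mapping is deterministic: the same email always yields the same
    pseudonym, which is what makes the scheme reversible in practice.
    """
    return hashlib.sha256(email.encode()).hexdigest()

# The "anonymised" data set still carries the pseudonym with every record.
records = [{"id": pseudonymise("alice@example.com"), "spend": 420.00}]

# An attacker with a list of candidate email addresses hashes each one
# and looks for matches - a dictionary attack on the pseudonyms.
candidates = ["bob@example.com", "alice@example.com"]
lookup = {pseudonymise(c): c for c in candidates}

for r in records:
    if r["id"] in lookup:
        print(lookup[r["id"]], "re-identified; spend:", r["spend"])
```

Salting or tokenising the identifiers, rather than hashing them directly, is one way to blunt this particular attack, though it does not remove linkage risk from the rest of the record.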
We have seen this before. Researchers used re-identification techniques to recover user identities from an anonymised data set published by Netflix in 2006, by matching it against IMDb, a third-party source of movie reviews written by individuals.
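The mechanics of that kind of linkage attack can be sketched in a few lines. This toy example – with invented records and a hypothetical `link` function, not the researchers' actual method – matches quasi-identifiers (movie, date, rating) in an “anonymised” set against a public set that carries names:

```python
# "Anonymised" ratings: the user ID is a pseudonym, but the quasi-identifiers
# (movie, date, rating) survive intact.
anonymised = [
    {"user": "u1", "movie": "Heat", "date": "2005-03-01", "rating": 5},
    {"user": "u1", "movie": "Brazil", "date": "2005-04-12", "rating": 4},
]

# Public reviews, e.g. scraped from a review site, with real names attached.
public_reviews = [
    {"name": "Jane Doe", "movie": "Heat", "date": "2005-03-01", "rating": 5},
]

def link(anon, public):
    """Map pseudonymous user IDs to names where quasi-identifiers match."""
    matches = {}
    for a in anon:
        for p in public:
            if (a["movie"], a["date"], a["rating"]) == \
               (p["movie"], p["date"], p["rating"]):
                matches[a["user"]] = p["name"]
    return matches

print(link(anonymised, public_reviews))  # {'u1': 'Jane Doe'}
```

One overlapping review is enough to tie the pseudonym to a name – and every other record under that pseudonym comes with it.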
This shows how big data’s biggest strength – its ability to combine data from different sources – is also its biggest security weakness.
Increased exposure happens when multiple data sources are brought together
“Different consumer organisations collect information with one-dimensional views of a consumer,” says Hunter Albright, CEO of Beyond Analysis, a consulting firm that specialises in big data.
“It has limited value and risk because of that. Increased exposure happens for an individual when multiple data sources are brought together.”