How do you anonymize personal databases and protect people's privacy – over to you, NIST

Here are Uncle Sam's boffins' two cents

Fancier techniques for safeguarding privacy

In an effort to make this sort of identification harder, there are some novel techniques, including creating "synthetic" results – basically, fake records based on real ones – or adding noise to a dataset so that specific records become harder to pinpoint.
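The noise-adding idea can be sketched in a few lines. This is an illustration only, in the spirit of differential privacy rather than any specific NIST recipe; the epsilon and sensitivity parameters here are hypothetical choices:

```python
import random

def noisy_count(true_count, epsilon=1.0, sensitivity=1):
    """Return the true count plus Laplace noise of scale sensitivity/epsilon.

    Smaller epsilon means more noise and stronger privacy.
    """
    scale = sensitivity / epsilon
    # The difference of two independent exponentials is Laplace-distributed
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# Each query sees a slightly different answer, masking any one record
print(noisy_count(100))
print(noisy_count(100))
```

Any single noisy answer is close to the truth on average, but no individual record can be confirmed or ruled out from it.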

These techniques do seem to work – but they also require far more thought and effort than simply replacing or deleting certain data fields. There is also something uncomfortable about creating fake records, especially if people start relying on them to make other decisions.

Of course, all these techniques have long names and associated acronyms, like Privacy Preserving Data Mining (PPDM) or Privacy Preserving Data Publishing (PPDP).

And then there is the access question: how do you put the data out there?

For example, there is the "release and forget" model where you just put the spreadsheet out there and let people use it as they wish. This has the advantage of making it easily and widely available but it also becomes impossible to control.

Instead of that method, organizations can use two other broad systems: get them to sign a contract, or force them to do their searches through your system. The contract approach brings rules around what people can do with the data, and so discourages abuse. The paper views this as a very useful tool but one that gets less useful the more the data is shared.

And then there is the "enclave" model, where people query your database and are given the results only as summaries. In other words, they never see the raw data, so it becomes much harder to manipulate it in ways that identify people. Of course, this requires additional effort and resources on the part of the organization holding the data.
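A toy version of the enclave idea looks something like this – analysts get aggregates back, and the system refuses to answer when the matching group is so small that a summary would effectively expose an individual. The field names and the threshold of five are illustrative assumptions, not anything the paper specifies:

```python
MIN_GROUP_SIZE = 5  # hypothetical minimum cell size for a safe summary

RECORDS = [
    {"zip": "90210", "income": 72000},
    {"zip": "90210", "income": 65000},
    {"zip": "90210", "income": 81000},
    {"zip": "90210", "income": 59000},
    {"zip": "90210", "income": 70000},
    {"zip": "90210", "income": 68000},
    {"zip": "10001", "income": 55000},  # a group of one: must be refused
]

def average_income(records, zip_code):
    """Answer an aggregate query, or refuse if the group is too small."""
    matching = [r["income"] for r in records if r["zip"] == zip_code]
    if len(matching) < MIN_GROUP_SIZE:
        return None  # answering would effectively reveal individuals
    return sum(matching) / len(matching)

print(average_income(RECORDS, "90210"))  # an aggregate over six records
print(average_income(RECORDS, "10001"))  # None: query refused
```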


The bigger question is really: what is the risk of people actually being identified?

There have been a range of academic studies into this exact question and the broad answer is somewhere between 0.01 per cent and 0.25 per cent. In other words, out of 10,000 records, you will be able to identify between one and 25 people, depending on the type of data, the methods used to anonymize it, and how much effort people are prepared to put in.
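The arithmetic behind those figures is straightforward, as a quick sanity check shows:

```python
# The quoted range, expressed as expected identifications per 10,000 records
low_rate, high_rate = 0.0001, 0.0025  # 0.01% and 0.25% as fractions
records = 10_000

low = records * low_rate    # roughly 1 person
high = records * high_rate  # roughly 25 people
print(f"between {low:.0f} and {high:.0f} per {records:,} records")
```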

And it's there that the celebrity factor comes into play.

If you are a celebrity or a public figure, there is much more information on you out there – and so, many more data points that can be used to identify you. The best example is probably the photos of celebrities getting into taxis – not something that is going to be a problem for most people.

The truth is that you are going to be able to find out a lot more about Bradley Cooper than about your own next-door neighbor. As a result, it is much easier to "re-identify" celebrities from data.

Much of the discussion around re-identifying people has, naturally enough, focused on people that others are interested in knowing about, and as a result inflates the likely risk, the paper [PDF] argues.

What to do

It argues for a commonsense approach. "It is important to be realistic and consider plausible attacks, especially when there are data use agreements that prohibit re-identification, linking to other data, and sharing without permission," the paper notes.

What if you are still worried about the data you are thinking of putting out there? Well, you could carry out a "privacy attack," where you hire someone specifically to try to hack the data and identify people – a sort of intrusion detection approach but for data.
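The core move such an audit would attempt is a linkage attack: matching a "de-identified" release against publicly available information on shared quasi-identifiers. A toy sketch, with entirely made-up data and field names, might look like this:

```python
# A "de-identified" release: names removed, but quasi-identifiers remain
released = [
    {"zip": "90210", "birth_year": 1985, "diagnosis": "flu"},
    {"zip": "60601", "birth_year": 1972, "diagnosis": "asthma"},
]

# Publicly available information (voter rolls, social media, and so on)
public = [
    {"name": "A. Cooper", "zip": "90210", "birth_year": 1985},
    {"name": "B. Jones", "zip": "10001", "birth_year": 1990},
]

def link(released, public):
    """Re-identify released records that match exactly one public record."""
    hits = []
    for r in released:
        matches = [p for p in public
                   if p["zip"] == r["zip"] and p["birth_year"] == r["birth_year"]]
        if len(matches) == 1:  # a unique match is a likely re-identification
            hits.append((matches[0]["name"], r["diagnosis"]))
    return hits

print(link(released, public))
```

If the hired attacker's linkage step produces unique matches like this, the release needs more aggressive treatment before it goes out.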

Ultimately though, all organizations need to think about what data they have, the degree to which it needs protecting, and the best method for doing so. The paper suggests "employing a combination of several approaches to mitigate re-identification risk."

They include removing or altering sensitive data fields, keeping an eye out for other data that could be used in conjunction with your data to identify people, getting people to sign agreements that prohibit re-identification, and implementing controls that limit what people can do with your data.
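The first of those steps – removing or altering sensitive fields – can be sketched as follows. The field names and the coarsening scheme (decade age bands, three-digit ZIP prefixes) are illustrative assumptions, not the paper's prescription:

```python
# Direct identifiers to drop outright; a hypothetical list
DIRECT_IDENTIFIERS = {"name", "ssn", "email"}

def deidentify(record):
    """Drop direct identifiers and coarsen common quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in out:
        decade = (out["age"] // 10) * 10
        out["age"] = f"{decade}-{decade + 9}"  # e.g. 34 -> "30-39"
    if "zip" in out:
        out["zip"] = out["zip"][:3] + "**"     # keep only the 3-digit prefix
    return out

print(deidentify({"name": "Ann", "age": 34, "zip": "90210"}))
```

On its own this kind of field-level treatment only mitigates risk – which is why the paper pairs it with agreements, monitoring, and access controls.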

Ultimately, however, the biggest problem is that the privacy industry is still in its infancy. There are no widely accepted standards for anonymizing data or for testing its effectiveness. Even the phrases and jargon people use have different meanings for different groups.

"Given the growing interest in de-identification, there is a clear need for standards and assessment techniques that can measurably address the breadth of data and risks described in this paper," it concludes. ®
