
De-identify, re-identify: Anonymised data's dirty little secret

Jeffrey Singh, stamp-collecting bachelor (35) of Milwaukee, Wisconsin – is that you?


Feature Publishing data of all kinds offers big benefits for government, academic, and business users. Regulators demand that the data be anonymised first, so that it can deliver those benefits while protecting personal privacy. But what happens when people read between the lines?

Making data anonymous is known as de-identifying it, but doing it properly is more challenging than it seems, says Wei Wang, professor of computer science and director of the Scalable Analytics Institute at UCLA.

"It's one thing to remove the identity, but we also need to keep in mind that the remaining data right after we remove that entity is still useful," she says.

With a little work, people can often recreate your identity from these remaining data points. This process is called re-identification, and it can ruin lives.

In a recent case, an online newsletter outed a Catholic priest who was a frequent user of the gay hookup app Grindr. The newsletter had purchased Grindr usage data from a third-party data broker. Even though the data set contained no directly identifying information, the newsletter tracked him down using his device ID and location data: the ID showed up at gay bars, at his work address, and at family members' addresses, which was enough to put a name to it and out him. He later resigned.

The spectre of re-identification has grave implications for us all, and should give us pause as we rush to publish anonymous data sets. Unpicking them has become a sport for some: journalists mined supposedly anonymous AOL search queries in 2006, and academics identified individuals in de-identified Netflix viewing data. Both organisations had published the data in the name of research. Back in 2009, a gay woman sued Netflix, alleging that the data could have outed her.

How de-identification works

There are different ways to de-identify data. These include deleting identifiable fields from records, which theoretically should let researchers use the data without linking it back to an individual.
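
To make that concrete, here's a minimal sketch of the field-deletion approach in Python. The record layout and field names are invented for illustration; real schemas vary.

```python
# A toy illustration of de-identification by deleting fields.
# Field names and records are invented for this example.
DIRECT_IDENTIFIERS = {"name", "phone", "ssn", "email"}

def strip_direct_identifiers(records):
    """Remove fields that identify a person on their own.

    Note what survives: zip code, gender, and birth date stay in --
    and that is exactly where the re-identification risk lies.
    """
    return [
        {field: value for field, value in record.items()
         if field not in DIRECT_IDENTIFIERS}
        for record in records
    ]

records = [{
    "name": "Jane Doe", "phone": "555-0100",
    "zip": "02138", "gender": "F", "birth_date": "1945-07-31",
    "diagnosis": "hypertension",
}]
print(strip_direct_identifiers(records))
# [{'zip': '02138', 'gender': 'F', 'birth_date': '1945-07-31',
#   'diagnosis': 'hypertension'}]
```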

The danger here is that smart third parties could re-identify someone using data elements that were deemed innocuous enough to leave in the records. In an explainer on the topic, the Georgetown University Law Center describes multiple levels of identifiability.

These levels begin with data such as a phone number and social security number that can directly identify a person. At the level below that are items such as gender, birth date, and zip code. These might not identify an individual alone but can quickly single a person out when combined. At still lower levels, the data points relate less specifically to a single person, such as favourite restaurants and movies.

In the mid-nineties, the state of Massachusetts published scrubbed data on every state employee's hospital visits, but left in some level-two data: zip code, gender, and age.

Re-identification researcher Latanya Sweeney used public voter-registration records – which list zip code, birth date, and gender – to single out the one person matching all three: state governor William Weld. His full medical history, gleaned from the data set, landed on his desk shortly afterwards.
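
The linkage attack Sweeney ran can be sketched in a few lines. All the records below are invented; the point is only that a join on quasi-identifiers needs no names to start with.

```python
# A toy linkage attack: join a "de-identified" medical extract
# against a public roll on quasi-identifiers alone. All data invented.
QUASI = ("zip", "gender", "birth_date")

medical = [
    {"zip": "02138", "gender": "M", "birth_date": "1945-07-31",
     "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "F", "birth_date": "1962-03-14",
     "diagnosis": "asthma"},
]
voter_roll = [
    {"name": "A. Resident", "zip": "02138", "gender": "M",
     "birth_date": "1945-07-31"},
    {"name": "B. Resident", "zip": "02139", "gender": "F",
     "birth_date": "1971-09-02"},
]

# Index the public roll by its quasi-identifiers...
index = {tuple(v[q] for q in QUASI): v["name"] for v in voter_roll}

# ...then look each "anonymous" medical record up in it.
for record in medical:
    name = index.get(tuple(record[q] for q in QUASI))
    if name:
        print(name, "->", record["diagnosis"])
# Prints: A. Resident -> hypertension
```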

A token gesture

Another approach to de-identification replaces identifiable data with a token. This theoretically allows the data set's producer to map the tokens back to the user's real ID while leaving others guessing.

This approach is also sometimes vulnerable to attack. If the tokens aren't truly random, an attacker may be able to reverse-engineer them to recover the real-world attribute behind them and so find the data's owner. This happened in 2014, when someone reverse-engineered tokens created from New York taxi medallion numbers and mined information about specific taxi rides.
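
If the token space is small and the hashing is unsalted – as was reported of the taxi data – the attack is little more than a loop. The ID format below is illustrative, not the real medallion scheme.

```python
# A toy demonstration of reversing "anonymous" tokens built from a
# small ID space with an unsalted hash. The digit-letter-two-digits
# format is made up; real medallion formats are not much bigger.
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def tokenise(medallion):
    return hashlib.md5(medallion.encode()).hexdigest()

# The publisher believes this token is anonymous...
leaked = tokenise("7B42")

# ...but the whole ID space is only 10 * 26 * 100 = 26,000 strings.
def crack(target):
    for d, letter, n1, n2 in product(digits, ascii_uppercase,
                                     digits, digits):
        candidate = d + letter + n1 + n2
        if tokenise(candidate) == target:
            return candidate

print(crack(leaked))  # -> 7B42
```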

Even if you can't reverse-engineer the token, you can use it to correlate a single data subject's activity over time. That's how researchers pinpointed people in the 2006 AOL dataset; tokens representing individuals allowed them to group search queries and attribute them to a single person, gleaning lots of information about them.
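
The mechanics are trivial: group records by token and each group becomes a dossier. The queries below are paraphrased from contemporaneous reporting on the AOL release; the second token is invented.

```python
# Grouping "anonymous" search logs by a stable per-user token.
# Queries are paraphrased from reporting on the 2006 AOL release;
# the second token is invented.
from collections import defaultdict

log = [
    ("4417749", "landscapers in Lilburn GA"),
    ("4417749", "homes sold in Shadow Lake subdivision"),
    ("4417749", "dog that urinates on everything"),
    ("0230212", "cheap flights to Milwaukee"),
]

profiles = defaultdict(list)
for token, query in log:
    profiles[token].append(query)

# Each token is now a dossier on one person -- which is how reporters
# traced AOL user 4417749 back to a named individual.
for token, queries in profiles.items():
    print(token, queries)
```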

Using additional sources

The availability of multiple data sets compounds the problem of re-identification, warns Wang. "There's a lot of information that you can collect from different sources and correlate them together," she says. Taken individually, each data set might seem innocuous enough. Put them together, and you can cross-reference that information. "Then you can figure out a lot of information that's going to surprise you," she adds.

The problem, as the UK's ICO outlines in its own Anonymisation Code (PDF), is that you can never be sure what other data is out there and how someone might map it against your anonymous data set. Neither can you tell what data will surface tomorrow, or how re-identification techniques might evolve. Data brokers readily selling location data without its owners' knowledge only amplifies the danger.

Other de-identification techniques include aggregating data. This, the fourth level of data on Georgetown Law's list, includes summarised data such as census records.

You could aggregate neighbourhood-level health records at a county level. Even that can be dangerous, warns Wang. You might be able to correlate aggregate data with other data sets, especially if the number of people with a specific attribute at the aggregated level is low enough.
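
A common safeguard is to suppress small counts before publishing aggregates, on the logic that a count of one or two in a small area is nearly as identifying as a name. A minimal sketch, with an invented threshold and invented records:

```python
# Suppressing small groups before publishing aggregate counts.
# The threshold and records are invented for illustration.
from collections import Counter

K = 5  # the smallest group size we are willing to publish

records = [
    ("Elm County", "diabetes"), ("Elm County", "diabetes"),
    ("Elm County", "diabetes"), ("Elm County", "diabetes"),
    ("Elm County", "diabetes"),
    ("Elm County", "rare condition"),  # a count of 1 points at someone
]

counts = Counter(records)
published = {group: count if count >= K else "<suppressed>"
             for group, count in counts.items()}
print(published)
# {('Elm County', 'diabetes'): 5,
#  ('Elm County', 'rare condition'): '<suppressed>'}
```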

Concerns about re-identification have surfaced of late with NHS Digital's recent push to collect the public's health data en masse under its General Practice Data for Planning and Research initiative. The scheme would have transferred GP medical records for all of England's residents to a central research store, giving people a short window to opt out.

NHS Digital had outlined specific data fields that it would transfer under the scheme, which would have allowed it to share the data with third parties. After delaying the programme in response to pressure from GPs and relaxing the opt-out deadline, it had to put the project on hold.

Solving the re-identification problem

One theoretical way to cut through the whole tangled mess is to just keep removing data points that could reveal someone's identity. Taking out age, zip (post) code, and gender might have stopped Sweeney's Weld discovery, for example. But each piece of data that you take out lessens the data set's value, warns Eerke Boiten, professor of cybersecurity at De Montfort University's School of Computer Science and Informatics.

"If your objective is to make the information less specific, less specifically pinpointing one specific person, you're also taking out the utility," he says.

One way to reconcile anonymity and usefulness could be differential privacy. This technique adds statistical noise to the data by subtly altering parameters, perhaps shifting someone's age or zip code slightly, which makes it harder to correlate them.

Scientists can still filter out that noise with repeated database queries, so differential privacy also restricts the number of times they can access the data. That restriction is known as a privacy budget, expressed as a parameter called epsilon, and you can tune a database's anonymity by adjusting it.
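
In code, the core of the technique is small. The sketch below pairs the standard Laplace mechanism for counting queries with a crude budget check; the epsilon values are illustrative, and real deployments use audited libraries rather than hand-rolled noise.

```python
# A toy Laplace mechanism with a privacy budget. A counting query has
# sensitivity 1 (one person changes the answer by at most 1), so noise
# with scale 1/epsilon gives epsilon-differential privacy per query.
import random

class PrivateCounter:
    def __init__(self, data, budget=1.0):
        self.data = data
        self.budget = budget  # total epsilon we will ever spend

    def count(self, predicate, epsilon=0.1):
        if epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.budget -= epsilon
        true_count = sum(1 for row in self.data if predicate(row))
        # Difference of two exponentials ~ Laplace(0, 1/epsilon).
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

ages = [34, 45, 29, 61, 38, 52]
db = PrivateCounter(ages, budget=0.5)
print(db.count(lambda a: a > 40))  # a noisy answer near the true 3
print(db.count(lambda a: a > 40))  # asking again spends more budget
# Once the 0.5 budget is spent, further queries raise RuntimeError.
```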

That involves retaining control over the data, Boiten says, pointing out: "Control and accountability disappears when you hand it over." An alternative is to avoid publishing the data openly and instead make it available in a controlled research environment. "Rather than sharing the data set you share the access," he explains.

The ICO's Anonymisation Code makes it clear that in some scenarios, where re-identification could be damaging, organisations should seek consent before distributing anonymous data sets. Some situations might demand restricting disclosure to a closed community, it adds, and in some cases the data shouldn't be shared at all.

Regulating our way out of it

Scientists also call for more legislation around de-identification. The GDPR, for instance, excludes data that it deems anonymous from its scope.

The ICO warns that if the data can be re-identified using "any reasonably available means," then it won't pass muster under the EU General Data Protection Regulation. Olivier Thereaux, head of research and development for the non-profit Open Data Institute, says that misjudging this can get companies into hot water.

"GDPR does state that it does not apply to anonymous information, so anonymisation has sometimes been seen as a way to 'get out' of data protection obligations," he says. "That is often a mistake as there are many ways to anonymise data, and some may be regarded by data protection authorities as 'not reasonably anonymised'."

Danish taxi service Taxa 4x35 is a case in point. Regulators penalised it after finding that deleting the names attached to trip records after two years did not make the data anonymous: customers were still re-identifiable from what remained.

It's a question of risk

No de-identification technique is completely foolproof, though, warns Omer Tene, chief knowledge officer at the International Association of Privacy Professionals.

"While there are scientific remedies, most practical remedies are limited in terms of really being risk-based," he says. "They minimise or limit risk but don't completely eliminate it."

The ICO makes this clear in its Code, pointing out that it's "impossible to assess re-identification risk with absolute certainty."

It recommends what it calls a 'motivated intruder' test, which asks whether a person without prior knowledge could re-identify individuals in the data set using publicly available resources.

Does this mean that we shouldn't publish data at all? Not at all, says Thereaux; withholding it would have a chilling effect on research. "Statistics bodies like the ONS do publish data that is anonymised to a minute risk of re-identification, and that publication is hugely valuable to our society," he says.

Lowering risk involves taking a careful and multi-faceted approach to de-identification. Thereaux points to the UK Anonymisation Network, a non-profit originally created by the ICO to share best practice in de-identification. It publishes a decision-making framework to help organisations navigate the de-identification process.

The framework emphasises the need to engage with people who might be affected. "Making sure you are transparent and honest about how risks were mitigated, and how you are responding to a breach is key," Thereaux warns. "Organisations who fail to engage and plan for what they might do if anonymisation is breached are the ones who end up at the heart of data scandals."

The data broker who sold Grindr data without considering the implications could perhaps have done with some of that thinking. Come to that, so could everyone involved in that supply chain. Clearly, when it comes to understanding and protecting identities in anonymous data, there's still a lot of work to be done. ®
