No hack needed: Anonymisation beaten with a dash of SQL
Melbourne researchers warn government: don't publish data down to the individual, ever
Governments should not release anonymised data that refers to individuals, because re-identification is inevitable.
That's the conclusion from Melbourne University's Dr Chris Culnane, Dr Benjamin Rubinstein and Dr Vanessa Teague, who have shown that the Medicare data the Australian government briefly published last year can be re-identified – trivially.
The researchers demonstrated last year that the (hopefully deprecated) formula the government used to create "anonymous" identifiers for personal data was easily reversible.
The paper [PDF] examines the same data set that brought the wrath of Australia's sysadmin-in-chief, er, attorney general George Brandis, who proposed legislation (not yet passed) to criminalise unauthorised research into re-identification.
The researchers explained that there are simply too many facts in a data release to properly protect individuals' data.
Speaking to El Reg today, Dr Teague emphasised that from an academic point of view, nothing the trio was doing was either new or sophisticated.
“What this shows: de-identification of detailed individual records about people doesn't work,” she said.
As Dr Culnane said in the University of Melbourne's media release, “We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth.”
“Without decryption” is also an important point: there's no “hacking” involved here, and as Dr Teague told us, there's not even much by way of analysis.
Year of birth is important (and for most people easily found), “because the database index is tagged with your year of birth.”
With “one or two surgeries on particular dates, or knowing one or two unusual prescriptions,” Dr Teague said, “I can write a very simple database query to identify you”.
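To give a feel for how little is involved, here is a minimal sketch of that kind of linkage query, using Python's built-in sqlite3 module. It is not the researchers' code and does not reflect the real data set: the `claims` table, its columns, the item codes and every row are invented for illustration. The only idea taken from the article is that year of birth plus a couple of known, dated events can narrow the records down to one pseudonymous ID.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of made-up 'de-identified' claims.

    Assumption: table name, columns and all rows are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims ("
        "pseudo_id TEXT, year_of_birth INTEGER, "
        "service_date TEXT, item_code TEXT)"
    )
    rows = [
        ("A17", 1975, "2012-03-02", "KNEE_SURGERY"),
        ("A17", 1975, "2013-08-19", "APPENDECTOMY"),
        ("B42", 1975, "2012-03-02", "KNEE_SURGERY"),
        ("B42", 1975, "2014-01-05", "GP_VISIT"),
        ("C03", 1980, "2012-03-02", "KNEE_SURGERY"),
        ("C03", 1980, "2013-08-19", "APPENDECTOMY"),
    ]
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", rows)
    return conn


def candidate_ids(conn, year_of_birth, known_events):
    """Return the pseudonymous IDs consistent with everything we know.

    known_events is a list of (service_date, item_code) pairs that an
    outsider might already know about the target (hypothetical inputs).
    """
    ids = None
    for service_date, item_code in known_events:
        cur = conn.execute(
            "SELECT DISTINCT pseudo_id FROM claims "
            "WHERE year_of_birth = ? AND service_date = ? AND item_code = ?",
            (year_of_birth, service_date, item_code),
        )
        matches = {row[0] for row in cur}
        # Keep only IDs that match every known event so far.
        ids = matches if ids is None else ids & matches
    return sorted(ids or [])
```

In this toy table, one known surgery date plus a 1975 birth year leaves two candidate IDs; adding a second known event leaves one. Once a single ID remains, every other row carrying that ID belongs to the same person, which is why no decryption is needed.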
Open government boundaries
Dr Teague said the simplicity of re-identification is a wake-up call for a debate about the limits on what governments release as open data, such as health, tax, welfare, or census records.
In short: while publishing aggregate data (“14,000 births in Victoria”, for example) is safe, individual records should be protected.
An individual record is “not something that can be put back in the box after it's been on the Internet … What this shows: de-identification of detailed individual records about people doesn't work.”
Researchers, she said, should only have access to that level of research data in a secure environment, and those researchers need to have it drummed into them that the data is re-identifiable.
“The idea that the government can make open all the data about people is just wrong.”
She added that the government's attempt to prohibit re-identification research (the legislation has not yet passed) was “a misguided effort” that “prohibited the public demonstration that there is a problem, but didn't address the problem.
“That's not good for improving the science of privacy, and it's not good for public debate.” ®