The Australian Bureau of Statistics has made a hash of the census
Promising wonderful outcomes without explaining privacy protection burns the public's trust
The Australian Bureau of Statistics (ABS) has so badly mishandled the question of retaining names that its senior leadership need to consider their futures.
The ABS is – sorry, was – probably one of Australia's most trusted bureaucracies, alongside the Bureau of Meteorology, the Australian Electoral Commission, and Geoscience Australia.
But since deciding that this year's Australian census will retain participants' names and use them for ill-defined data-matching purposes, the Bureau has so alienated people that there are serious calls for name-boycotts and a persistent discussion about the scale of fines (AU$180 a day up to a maximum of AU$1,800, if you're interested). Those calls can undermine the census and its mission of providing policy-makers with useful data.
And the ABS persistently ignores questions put to it. Its first response when asked about the retention of names is typically a tweet that talks about collection, not retention.
It's a mess that the ABS created for itself.
It takes a lot to make me say “security is now no longer the primary consideration”, but that's what the ABS has achieved.
Its data is useless without the trust of the public, and I've never seen public goodwill burned as quickly as has happened since Australians learned – somewhat after the decision was made – that the Bureau wants to keep their names.
And since then, the Bureau has acted in a high-handed, condescending and dismissive manner.
That's when the Bureau spoke at all: mostly, it pointed people to an FAQ that didn't provide responsive answers to specific questions about security, and let researchers and academics fill the vacuum in the debate.
Unfortunately, many of the researchers are as high-handed, condescending, dismissive and unresponsive as the ABS – which hasn't helped.
Take this, from The Conversation: it puts the case that better data means better policy, with the following examples:
- “Linking census data to health or mortality data” helps compare immigrant lifespans to those born in Australia – which, however, doesn't address the question of intrusiveness and privacy;
- “Linking census data to social security data can better identify which individuals are likely to remain on welfare over the long term” – ditto;
- “Linking two or more censuses provides a longitudinal database” – ditto.
Or this, from Fairfax Media, in which demographer Liz Allen says “linking the census data to pharmaceutical benefits records can get that data and get it linked to all sorts of other information”.
That's useful, sure – but it's also exactly what people are scared of. Putting the use-case doesn't address rational fears.
Here's a speech from 2015, which is in no way reassuring, by the chief statistician David Kalisch.
The exact concerns being raised now, he dismissed last year: “Technology, expertise and confidentiality are not the issues or the constraints. It can take some time and resources for government agencies to provide better access to their data, even to an organisation such as the ABS with all the data protections and community support you would require.”
Ahem, confidentiality and technology certainly should be considered “constraints”, when the aim is to create a named identifier for all citizens, which Kalisch clearly admires.
Moreover: the ABS is not mandated to be the data integrator Kalisch imagines and desires. Kalisch is already advocating scope creep when he should be resisting it in the name of privacy.
In the presence of such sensitivities, transparency and trust are indispensable – but the Bureau dispensed with both.
And at last, I will come to the generally-demanded “tech angle” to this story: it's perfectly feasible to tie data to a unique identifier without the name being that identifier.
If two data sets – the Census and the Pharmaceutical Benefits Scheme, for example – contain enough data points to consistently identify me, then a hash of that data would work just as well for anonymous analysis.
“Richard Chirgwin” with a date of birth and an address will produce the same SHA-256 hash (c2483d63179b71b37334f730385272c81b5d6bd3ae6edffb49234cfeb7f7d9a6, I just tried it) no matter the source system – but the hash cannot be reversed to deliver my personal data.
If the data records with name are sufficient to identify me uniquely across two government systems, a hash of that data will be just as unique and will provide the same analytical link.
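For the curious, here is a minimal sketch of that idea in Python. It is an illustration only, not anything the ABS has described: the function names, the field layout and the sample person are all made up, and the normalisation step (folding case and collapsing whitespace) is my assumption about what two systems would need to agree on before their hashes can match.

```python
import hashlib
import unicodedata


def normalise(value: str) -> str:
    """Fold case and collapse whitespace so formatting quirks don't change the hash."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.casefold().split())


def linkage_key(name: str, date_of_birth: str, address: str) -> str:
    """Return a SHA-256 hex digest of the normalised identifying fields."""
    fields = [normalise(name), normalise(date_of_birth), normalise(address)]
    # The ASCII unit separator keeps field boundaries unambiguous.
    material = "\x1f".join(fields)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Hypothetical person, as recorded by two different systems.
census_key = linkage_key("Jane Citizen", "1970-01-01", "1 Example St, Sydney NSW 2000")
pbs_key = linkage_key("  JANE  citizen ", "1970-01-01", "1 example st, sydney nsw 2000")

print(census_key == pbs_key)  # True: same person, same key, no name stored
```

The point of the design is that each system can compute the key on its own and hand over only the digest. Messier differences between systems (abbreviated street names, misspellings) would need agreed cleaning rules before hashing, which this sketch does not attempt.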
The ABS – and the data users defending it – must explain why names are indispensable to the mission.
But the cack-handed mishandling of the public debate is so destructive, it should be the next chief statistician to give the explanation. ®
Bootnote: As a clarification, I need to point out: I am saying Census data (with a hash as an identifier) should never be brought together with a second source (example above, the PBS) with names intact on either side.
Should a researcher demonstrate a use-case to construct Census-versus-PBS queries, the names in PBS data should be hashed before the two datasets are brought together. ®
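A rough sketch of what that could look like, again in Python and again hypothetical throughout: the record layouts, field names and `SECRET` value are invented for illustration. One swap worth naming plainly: this version uses a keyed hash (HMAC-SHA-256) rather than the bare SHA-256 in the article, on the assumption that a secret held by whoever does the linking makes it harder to guess names by hashing candidate values.

```python
import hashlib
import hmac


def keyed_hash(secret: bytes, name: str, date_of_birth: str) -> str:
    """HMAC-SHA-256 over the normalised identifying fields."""
    material = "\x1f".join(part.strip().casefold() for part in (name, date_of_birth))
    return hmac.new(secret, material.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise(records, secret):
    """Replace name and date of birth with a link key; keep every other field."""
    out = []
    for record in records:
        key = keyed_hash(secret, record["name"], record["dob"])
        rest = {k: v for k, v in record.items() if k not in ("name", "dob")}
        out.append({"link_key": key, **rest})
    return out


def link(left, right):
    """Join two pseudonymised datasets on the link key."""
    index = {row["link_key"]: row for row in right}
    return [{**row, **index[row["link_key"]]} for row in left if row["link_key"] in index]


SECRET = b"hypothetical-linkage-secret"  # assumption: held by the linking party only

census = [
    {"name": "Jane Citizen", "dob": "1970-01-01", "postcode": "2000"},
    {"name": "John Sample", "dob": "1985-06-15", "postcode": "3000"},
]
pbs = [{"name": "jane citizen", "dob": "1970-01-01", "scripts_per_year": 4}]

# Names are stripped from BOTH sides before the datasets ever meet.
linked = link(pseudonymise(census, SECRET), pseudonymise(pbs, SECRET))
print(linked)
```

The linked rows carry the postcode and the prescription count side by side, with only the link key to tie them together and no name on either side.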