There's a fine line between getting hold of data that may be in the public interest and downright stealing data just because you can. And simply because the data is out there – having been stolen by online intruders and then leaked – does not mean it is right to use it.
A paper published in Nature Machine Intelligence this week is an effort to help guide data scientists and researchers through the ethical dilemmas that arise when they consider using information obtained from data breaches.
To kick off, Marcello Ienca, a research fellow at the Swiss Federal Institute of Technology, and Effy Vayena, deputy head of the Swiss Institute of Translational Medicine, offered the definition that "hacked" data is "data obtained in an unauthorized manner through illicit access to a computer or computer network." They claim it is increasingly being used in scientific research such as conflict modelling studies based on WikiLeaks datasets, and studies on sexual behaviour based on data leaked from Ashley Madison, a dating website whose database was pilfered by a group of attackers calling themselves The Impact Team in 2015.
But basing studies on such ill-gotten datasets raises problems analogous to those in earlier debates over research that uses data of unethical origin, such as data obtained from Nazi medical "experiments."
Even though "it may be lawful for researchers to use hacked data if they are publicly available, responsible research practices still require clear ethical justification for doing so," the paper argues.
Researchers might argue they are justified in using publicly available purloined datasets because the data offers public value, saves resources, is a unique source, and might provide cross-domain consistency. On the other hand, the people mentioned or implicated in the data have not consented to its use, and using it might cause secondary harm, breach privacy, and lower scientific standards.
The authors propose six ethical and procedural requirements that need to be addressed before going ahead with the use of stolen or leaked data for a project.
Firstly, they encourage researchers to consider uniqueness. Can they demonstrate that the leaked data could not have been collected using conventional methods? Next, can they show that their intended research is of high social value, and that the benefits clearly outweigh the possible harms? If hacked data is personally identifiable, researchers should obtain explicit and informed consent from those individuals.
If that's impossible, the research should only go ahead if the risk is minimal and the benefits obvious. They should also make sure they have a record of how and where all data has been obtained. Researchers should clearly state when they have accessed identifiable data without the subjects' consent, and say what they have done to ensure the data subjects' privacy and security. Those five conditions lead into a sixth: that the project is reviewed by an Institutional Review Board (IRB) or an analogous body such as a Research Ethics Committee.
Datasets made public via WikiLeaks or the Panama Papers can offer insight for the public good, but there are also risks and unintended consequences involved in researching illegally accessed datasets. Ienca and Vayena have offered an approach to getting some of those benefits while minimising the potential harm. ®