Is it OK to use stolen data? What if it's scientific research in the public interest?

Not always, but Swiss team says you can manage the risks

There's a fine line between getting hold of data that may be in the public interest and downright stealing data just because you can. And simply because the data is out there – having been stolen by online intruders and then leaked – does not mean it is right to use it.

A paper published in Nature Machine Intelligence this week aims to guide data scientists and researchers through the ethical dilemmas that arise when they consider using information obtained from data breaches.

To kick off, Marcello Ienca, a research fellow at the Swiss Federal Institute of Technology, and Effy Vayena, deputy head of the Swiss Institute of Translational Medicine, offered the definition that "hacked" data is "data obtained in an unauthorized manner through illicit access to a computer or computer network." They claim it is increasingly being used in scientific research such as conflict modelling studies based on WikiLeaks datasets, and studies on sexual behaviour based on data leaked from Ashley Madison, a dating website whose database was pilfered by a group of attackers calling themselves The Impact Team in 2015.

But basing studies on such ill-gotten datasets presents problems analogous to previous debates on research that uses data of unethical origin, such as data obtained from Nazi medical "experiments."

Even though "it may be lawful for researchers to use hacked data if they are publicly available, responsible research practices still require clear ethical justification for doing so," the paper argues.

Researchers might argue they are justified in using publicly available purloined datasets because they offer public value, save resources, provide a unique source, and might present cross-domain consistency. On the other hand, such research may lack the consent of people mentioned or implicated in the data, might cause secondary harm, could represent a privacy breach, and might lower scientific standards.

The authors propose six ethical and procedural requirements that need to be addressed before going ahead with the use of stolen or leaked data for a project.

Firstly, they encourage researchers to consider uniqueness. Can they demonstrate that the leaked data could not have been collected using conventional methods? Next, can they show that their intended research is of high social value, and that the benefits clearly outweigh the possible harms? If hacked data is personally identifiable, researchers should obtain explicit and informed consent from the individuals concerned.

If obtaining consent is impossible, the research should only go ahead if the risk is minimal and the benefits obvious. Researchers should also make sure they have a record of how and where all data has been obtained. They should clearly state when they have accessed identifiable data without the subjects' consent, and say what they have done to protect the data subjects' privacy and security. Those five conditions lead into a sixth: that Institutional Review Boards (IRBs) or analogous research ethics bodies are utilised.

Datasets made public via WikiLeaks or the Panama Papers can offer insight for the public good, but there are also risks and unintended consequences involved in researching illegally accessed datasets. Ienca and Vayena have offered an approach to getting some of those benefits while minimising the potential harm. ®

Biting the hand that feeds IT © 1998–2022