Linux data-sharing licences: So, will big data hogs take the plunge?

Experts weigh in

With its new open data licensing framework, announced on Tuesday, the Linux Foundation has drawn up licence agreements for sharing raw, unorganised data, hoping to tempt generous companies, nonprofits, government agencies and researchers into doing so.

But one expert says the licences' current ambiguity makes them risky, and others are concerned about licence compatibility.

Mike Dolan, the Linux Foundation's VP of strategic programs – who helped draft the licences – told El Reg that individuals or organisations working on machine learning, traffic flow or other data-heavy systems could gain a lot from sharing, such as improving algorithms and increasing resources.

But today (excluding sensitive data covered by law), you either keep your raw data a trade secret or release it with no IP restrictions, said Estelle Derclaye, an IP lawyer at the University of Nottingham. There are already comprehensive licence agreements for sharing and attributing data organised in a database (such as CC-BY, the Open Data Commons Open Database License, or the Open Data Commons Attribution License).

When Derclaye reviewed one of the two new licence agreements at The Reg's request, she told us: "I wouldn't want to sign it."

Why a new licence?

Dolan said the aim was "to ensure that data providers and users had clarity about their ability to curate, use, and share" in order to enable "the creation of open, collaborative data, collaborative data communities". Drafting began during the third quarter of 2016 because of a perceived gap in one-stop-shop licence agreements.

He gave the example of training a drone to fly autonomously – what if a dataset didn't include any examples of trees, a user trained their drone on the data, and it crashed into one? Whose fault would that be?

One licence agreement requires that changes to data be shared. There's also a permissive option, which Dolan expects to be the more popular of the two because it needs less legal approval legwork.

The Community Data License Agreement (CDLA) does not restrict any results produced by processing and analysing the data.

Dolan said that "well over 100 lawyers" had reviewed the agreements and that the licences take into account differences between countries. Nevertheless, the framework is open to iteration. The team is opening a mailing list to facilitate public feedback and will "monitor" discussion.

Whose data is it anyway?

Daniel Himmelstein, a data science postdoctoral researcher at the University of Pennsylvania, told The Register: "Until recently, there was little awareness that data licensing could be an issue. This is an exciting development since it reflects that major players are now considering the importance. If more people feel comfortable releasing data openly because of these licences, then that's a win".

However, he was uncertain about the benefits of having a more "data-focused" licence agreement compared to Creative Commons licences. "I will likely continue to release most of my data under a CC0 public domain dedication," he said.

Dolan responded: "We did not draft the CDLA agreements to cure or fix any specific issues with other licences, but rather to look at what the current use cases required and build an agreement from that point of view based on what we've learned in open-source software licensing.

"All that we say is if there are attribution notices, you cannot remove them... In many jurisdictions there can be severe penalties for removing attribution notices so we wanted to prevent that from happening."

Derclaye, the author of The Legal Protection of Databases, said she could understand why the Linux Foundation had created these restrictive licence agreements for raw data – saying they'd be an incentive for organisations to disclose it. At the same time, she thinks it "wouldn't have been much work" to modify the existing Creative Commons licences, such as the commonly used CC-BY, to accommodate raw data, instead of creating something from scratch.

Room to improve

What the Linux Foundation ended up writing is "too vague" and "might create problems", Derclaye said. She argued that:

  1. Unlike CC-BY, the sharing licence agreement does not explicitly state that the data is royalty-free. The licensee would need to check with the Linux Foundation.
  2. The licence does not include language for removing technological protection measures, such as encryption or other anti-copy tech. (Dolan claims the licence does have this, though "we made it even more broad than just technological protection measures").
  3. The agreement does not explicitly state that the licensee can sub-license the data to other parties without the Linux Foundation's approval. This might come up if a PhD student switched labs and wanted to sub-license data to their new boss. (Dolan said: "Everyone gets a licence to use, modify and distribute it to anyone under the licence they're all agreeing to use – the CDLA").
  4. CC-BY has language explicitly preserving existing fair use laws in the US and exception laws in other countries, whereas the Linux Foundation's licence does not touch on them. (Dolan responded that open-source software licences routinely don't explicitly reference such exceptions and that they would be dealt with by "applicable law").
  5. The licence adds explicit language stating that the data will not be considered a work of joint authorship – but the actual definition is unclear.
  6. The licence gives contradictory, confusing advice on moral obligations and attribution – CC-BY is clearer.

The database law prof said it's better to be clear, even if an agreement is more restrictive than it would be otherwise. Because of the vagueness, she added, if you're using a Linux Foundation agreement in a shared resource with other data under a different licence, there could be conflicts. "If they really want it to be useful, it's good to be aligned as possible."

Leigh Dodds, of the UK's Open Data Institute, told The Register: "Clear licensing, which gives anyone the permission to access, use and share data, is fundamental to the open data movement.

"While we welcome the efforts of the Linux Foundation, we are not yet clear on what these new licences bring to the ecosystem. Users need to understand how these new licences are compatible with, or different from, existing Creative Commons licences (especially CC-BY 4.0 and CC-BY-SA 4.0), and whether they allow for relicensing."

The org will "continue to recommend use of CC-BY 4.0 as it is already well adopted internationally". ®

Biting the hand that feeds IT © 1998–2022