Linux Foundation backs Project OpenBytes: An attempt to slash legal risk of sharing data for training AI

Common format and license floated to foster greater exchange of info for ML

The non-profit Linux Foundation on Tuesday said it has teamed up with dataset management platform Graviti to develop Project OpenBytes, an initiative to make open data less legally risky through the development of data standards and formats.

The goal of Project OpenBytes is to reduce the legal risks for organizations and individuals interested in sharing their datasets with other AI/ML projects. Those who control data often hesitate to share their datasets due to concerns about licensing limitations.

According to the Linux Foundation, being able to reassure data stewards that their data rights will be protected and their data will not be misused will help make more datasets open and accessible.

"The OpenBytes project and community will benefit all AI developers, both academic and professional, and at both large and small enterprises, by enabling access to more high-quality open datasets and making AI deployment faster and easier," said Mike Dolan, general manager and SVP of projects at the Linux Foundation, in a statement.

The legal risks of AI and machine learning can be seen in various recent lawsuits. Last year, for example, IBM was accused of violating the Illinois Biometric Information Privacy Act when it used the plaintiff's photos in its "Diversity in Faces" dataset. Separate lawsuits, also filed last year, target Amazon, Google, Microsoft, and facial-recognition biz FaceFirst for allegedly using that dataset to train their facial-recognition algorithms.

Then there's facial-recognition biz Clearview AI, which has been sued in the EU, UK, and US over claims it built its facial-recognition database by scraping various social media sites.

Let's be open

In an effort to avoid these sorts of legal entanglements, Project OpenBytes will require that data models, formats, labels, and other specifications be made available under the Community Specification License Agreement 1.0. Other relevant terms are outlined in the project's governance document.

Many large companies that deal with AI and ML datasets already operate under similar strictures, or at least say they do. But the Linux Foundation believes that it can provide vendor-neutral oversight of this aspiring data commons.

In a statement, Edward Cui, founder of Graviti, said many AI projects have been delayed due to the lack of high-quality data from real use cases. "Acquiring higher quality data is paramount if AI development is to progress," he said. "To accomplish that, an open data community built on collaboration and innovation is urgently needed."

Cui, in an email to The Register, said a wide range of data formats, file formats, annotation formats, and in-memory formats are possible.

"We are not talking about a specific format, but we plan to publish an IDL (Interface definition language) and a compiler to help users define the data structure in a way that is comprehensive and reusable, which will help users more easily understand and reuse data for the future model training, [and also] save the computational cost of converting data formats and increase the efficiency," he explained.

The benefit of this approach would be less resource-intensive data preparation.

"If the community can settle on a standard data handling procedure, certain guarantees will be met," said Cui. "The data produced through those procedures will not need any further cleaning or preparation."

Cui said that data formats alone are not enough to reduce liability risks. "However, establishing standards, promoting licenses, and neutral governance of data sharing might. Setting up data standards and formats is part of creating quality control mechanisms and facilitating the data distribution process," he said.

"Data standards involve multiple procedures including registering appropriate licenses, desensitifying the data, providing dataset information, and restricting the purpose of data usage before they are released to the public. For example, filter data that must be desensitized before release, add legal checks if there are any license restrictions on the data, standardize how privacy information or sensitive content shall be handled, etc. We have a plan on working with the community to establish the guidelines, which will reduce both publishers and users’ liability risks."

The OpenBytes Project, Cui said, aims to establish guidance to ensure data quality, with the help of the participating community.

"Both publishers and users shall follow the same good practices for its own data release processes," he explained. "For a language model, the key to understand if data bias exists is to compare the dataset with the benchmark dataset defined by OpenBytes Project. The preparation and promotion of such a procedure is within the scope of the OpenBytes project."

The problem of trust

Siddharth Garg, associate professor of electrical and computer engineering at NYU Tandon, says that while a common data format and license may prove helpful, that doesn't necessarily resolve concerns about trust.

"One of the biggest challenges that is faced in the machine learning pipeline is how to vet your data," Garg said in an interview with The Register.

"If I get a dataset from a non-trusted source, the untrusted provider could potentially introduce a few data samples that have some special properties that are designed to mislead or to create a potentially intended misbehaviors in any neural network or machine learning algorithm that is trained on the data set. And some of these can be very subtle and extremely hard to diagnose."

There are already so-called model zoos that offer AI/ML researchers pre-trained models that are ready to be deployed. However, according to Garg, many of these do not provide much in the way of security (as noted in a paper he co-authored [PDF]). "For example, they were hosting incorrect hashes of the model...and that opens the door to new vulnerabilities," he said.
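The integrity check Garg is describing is a basic one: verify a downloaded model's checksum against the hash the zoo publishes before loading it. A minimal sketch, using a throwaway file in place of real model weights:

```python
import hashlib

# Minimal sketch of a model-integrity check: compare the download's
# SHA-256 against the hash published by the model zoo. The file name
# and contents here are stand-ins, not a real model.

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path: str, expected_sha256: str) -> bool:
    return sha256_of(path) == expected_sha256

# Demo with a throwaway file standing in for model weights.
with open("model.bin", "wb") as f:
    f.write(b"fake weights")

published = hashlib.sha256(b"fake weights").hexdigest()
print(verify_model("model.bin", published))  # True
print(verify_model("model.bin", "0" * 64))   # False: tampered file or bad hash
```

The check only helps if the published hash itself is correct, which is exactly the failure Garg points to: a zoo hosting wrong hashes defeats the verification step it is supposed to enable.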

"I think it's good to have standards. I would hope that those standards have basic security functionality built in, in addition to licenses and so on."

There's also, he said, an open question about people getting compensated fairly and credited for data they create in the AI/ML community. Licensing can address that, according to Garg, but dealing with model integrity – the possibility of tampering – is a more fundamental problem.

There's also the subtler concern about whether a dataset collected for one purpose can really be applied to another purpose without unintended consequences.

"What other behaviors does your model inherit from either intentional or unintentional patterns that exist in the data set?" said Garg. "The bigger issue here is that training models to learn causal behavior is hard. You end up learning all manners of spurious correlations, intentionally inserted or unintentional spurious correlations." ®
