Linux Foundation backs Project OpenBytes: An attempt to slash legal risk of sharing data for training AI
Common format and license floated to foster greater exchange of info for ML
The non-profit Linux Foundation on Tuesday said it has teamed up with dataset management platform Graviti to develop Project OpenBytes, an initiative to make open data less legally risky through the development of data standards and formats.
The goal of Project OpenBytes is to reduce the legal risks for organizations and individuals interested in sharing their datasets with other AI/ML projects. Those who control data often hesitate to share their datasets due to concerns about licensing limitations.
According to the Linux Foundation, being able to reassure data stewards that their data rights will be protected and their data will not be misused will help make more datasets open and accessible.
"The OpenBytes project and community will benefit all AI developers, both academic and professional, and at both large and small enterprises, by enabling access to more high-quality open datasets and making AI deployment faster and easier," said Mike Dolan, general manager and SVP of projects at the Linux Foundation, in a statement.
The legal risks of AI and machine learning can be seen in various recent lawsuits. Last year, for example, IBM was accused of violating the Illinois Biometric Information Privacy Act when it used the plaintiff's photos in its "Diversity in Faces" dataset. Separate lawsuits, also filed last year, challenge Amazon, Google, Microsoft, and facial-recognition biz FaceFirst for allegedly using that dataset to train their facial-recognition algorithms.
Then there's facial-recognition biz Clearview AI, which has been sued in the EU, UK, and US over claims it built its facial-recognition database by scraping various social media sites.
Let's be open
In an effort to avoid these sorts of legal entanglements, Project OpenBytes will require that data models, formats, labels, and other specifications be made available under the Community Specification License Agreement 1.0. Other relevant terms are outlined in the project's governance document.
Many large companies that deal with AI and ML datasets already operate under similar strictures, or at least say they do. But the Linux Foundation believes that it can provide vendor-neutral oversight of this aspiring data commons.
In a statement, Edward Cui, founder of Graviti, said many AI projects have been delayed due to the lack of high-quality data from real use cases. "Acquiring higher quality data is paramount if AI development is to progress," he said. "To accomplish that, an open data community built on collaboration and innovation is urgently needed."
Cui, in an email to The Register, said a wide range of data formats, file formats, annotation formats, and in-memory formats are possible.
"We are not talking about a specific format, but we plan to publish an IDL (Interface definition language) and a compiler to help users define the data structure in a way that is comprehensive and reusable, which will help users more easily understand and reuse data for the future model training, [and also] save the computational cost of converting data formats and increase the efficiency," he explained.
The benefit of this approach would be less resource-intensive data preparation.
"If the community can settle on a standard data handling procedure, certain guarantees will be met," said Cui. "The data produced through those procedures will not need any further cleaning or preparation."
Cui said that data formats alone are not enough to reduce liability risks. "However, establishing standards, promoting licenses, and neutral governance of data sharing might. Setting up data standards and formats is part of creating quality control mechanisms and facilitating the data distribution process," he said.
"Data standards involve multiple procedures including registering appropriate licenses, desensitifying the data, providing dataset information, and restricting the purpose of data usage before they are released to the public. For example, filter data that must be desensitized before release, add legal checks if there are any license restrictions on the data, standardize how privacy information or sensitive content shall be handled, etc. We have a plan on working with the community to establish the guidelines, which will reduce both publishers and users’ liability risks."
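To make the steps Cui lists concrete, here is a minimal Python sketch of a pre-release check. It is purely illustrative: Project OpenBytes has not published such a tool, and the `DatasetRelease` record, the `SENSITIVE_FIELDS` set, and the `release_problems` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical list of column names treated as sensitive; a real
# project would define its own policy.
SENSITIVE_FIELDS = {"name", "email", "face_image"}


@dataclass
class DatasetRelease:
    """Hypothetical metadata a publisher fills in before release."""
    title: str
    license_id: str = ""          # e.g. an SPDX-style identifier
    description: str = ""         # dataset information for users
    permitted_uses: list = field(default_factory=list)
    columns: list = field(default_factory=list)


def release_problems(release):
    """Return a list of human-readable reasons to block the release."""
    problems = []
    if not release.license_id:
        problems.append("no license registered")
    if not release.description:
        problems.append("no dataset information provided")
    if not release.permitted_uses:
        problems.append("no usage purpose stated")
    # Flag any column that still carries data marked as sensitive.
    leaked = sorted(SENSITIVE_FIELDS & set(release.columns))
    if leaked:
        problems.append("sensitive columns not removed: " + ", ".join(leaked))
    return problems


draft = DatasetRelease(title="street-scenes", columns=["email", "label"])
for problem in release_problems(draft):
    print("BLOCKED:", problem)
```

An empty list from `release_problems` would mean the draft passes these four checks and nothing more; legal review of the license terms themselves is outside what a script like this can do.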
The OpenBytes Project, Cui said, aims to establish guidance to ensure data quality, with the help of the participating community.
"Both publishers and users shall follow the same good practices for its own data release processes," he explained. "For a language model, the key to understand if data bias exists is to compare the dataset with the benchmark dataset defined by OpenBytes Project. The preparation and promotion of such a procedure is within the scope of the OpenBytes project."
The problem of trust
Siddharth Garg, associate professor of electrical and computer engineering at NYU Tandon, says that while a common data format and license may prove helpful, that doesn't necessarily resolve concerns about trust.
"One of the biggest challenges that is faced in the machine learning pipeline is how to vet your data," Garg said in an interview with The Register.
"If I get a dataset from a non-trusted source, the untrusted provider could potentially introduce a few data samples that have some special properties that are designed to mislead or to create a potentially intended misbehaviors in any neural network or machine learning algorithm that is trained on the data set. And some of these can be very subtle and extremely hard to diagnose."
There are already so-called model zoos that offer AI/ML researchers pre-trained models that are ready to be deployed. However, according to Garg, many of these do not provide much in the way of security (as noted in a paper he co-authored [PDF]). "For example, they were hosting incorrect hashes of the model...and that opens the door to new vulnerabilities," he said.
"I think it's good to have standards. I would hope that those standards have basic security functionality built in, in addition to licenses and so on."
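The hash problem Garg describes is the kind of thing a downloader can check for itself, provided the published hash is correct and comes from a trusted channel. The Python sketch below shows one way to do that with the standard library; the function names `sha256_of_file` and `verify_artifact` are our own, and the choice of SHA-256 is an assumption rather than anything a particular model zoo or OpenBytes specifies.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Return True only if the file's SHA-256 matches the published value."""
    actual = sha256_of_file(path)
    # Normalise the published value, then compare the two hex strings.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A caller would refuse to load a model or dataset when `verify_artifact` returns False. The check only shows that the file matches the published hash; it says nothing about whether the data inside was poisoned before it was hashed.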
There's also, he said, an open question about people getting compensated fairly and credited for data they create in the AI/ML community. Licensing can address that, according to Garg, but dealing with model integrity – the possibility of tampering – is a more fundamental problem.
There's also the subtler concern about whether a data model collected for one purpose can really be applied to another purpose without unintended consequences.
"What other behaviors does your model inherit from either intentional or unintentional patterns that exist in the data set?" said Garg. "The bigger issue here is that training models to learn causal behavior is hard. You end up learning all manners of spurious correlations, intentionally inserted or unintentional spurious correlations." ®