Linux Foundation backs Project OpenBytes: An attempt to slash legal risk of sharing data for training AI

Common format and license floated to foster greater exchange of info for ML

Thomas Claburn in San Francisco Tue 2 Nov 2021 // 18:49 UTC

The non-profit Linux Foundation on Tuesday said it has teamed up with dataset management platform Graviti to develop Project OpenBytes, an initiative to make open data less legally risky through the development of data standards and formats.

The goal of Project OpenBytes is to reduce the legal risks for organizations and individuals interested in sharing their datasets with other AI/ML projects. Those who control data often hesitate to share their datasets due to concerns about licensing limitations.

According to the Linux Foundation, being able to reassure data stewards that their data rights will be protected and their data will not be misused will help make more datasets open and accessible.

"The OpenBytes project and community will benefit all AI developers, both academic and professional, and at both large and small enterprises, by enabling access to more high-quality open datasets and making AI deployment faster and easier," said Mike Dolan, general manager and SVP of projects at the Linux Foundation, in a statement.

The legal risks of AI and machine learning can be seen in various recent lawsuits. Last year, for example, IBM was accused of violating the Illinois Biometric Information Privacy Act when it used the plaintiff's photos in its "Diversity in Faces" dataset. And separate lawsuits also filed last year challenge Amazon, Google, Microsoft, and facial recognition biz FaceFirst for allegedly using that dataset to train their facial-recognition algorithms.

Then there's facial-recognition biz Clearview AI, which has been sued in the EU, UK, and US over claims it built its facial-recognition database by scraping various social media sites.

Let's be open

In an effort to avoid these sorts of legal entanglements, Project OpenBytes will require that data models, formats, labels, and other specifications be made available under the Community Specification License Agreement 1.0. Other relevant terms are outlined in the project's governance document.

Many large companies that deal with AI and ML datasets already operate under similar strictures, or at least say they do. But the Linux Foundation believes that it can provide vendor-neutral oversight of this aspiring data commons.

In a statement, Edward Cui, founder of Graviti, said many AI projects have been delayed due to the lack of high-quality data from real use cases. "Acquiring higher quality data is paramount if AI development is to progress," he said. "To accomplish that, an open data community built on collaboration and innovation is urgently needed."

Cui, in an email to The Register, said a wide range of data formats, file formats, annotation formats, and in-memory formats are possible.

"We are not talking about a specific format, but we plan to publish an IDL (Interface definition language) and a compiler to help users define the data structure in a way that is comprehensive and reusable, which will help users more easily understand and reuse data for the future model training, [and also] save the computational cost of converting data formats and increase the efficiency," he explained.

The benefit of this approach would be less resource-intensive data preparation.

"If the community can settle on a standard data handling procedure, certain guarantees will be met," said Cui. "The data produced through those procedures will not need any further cleaning or preparation."

Cui said that data formats alone are not enough to reduce liability risks. "However, establishing standards, promoting licenses, and neutral governance of data sharing might. Setting up data standards and formats is part of creating quality control mechanisms and facilitating the data distribution process," he said.

"Data standards involve multiple procedures including registering appropriate licenses, desensitizing the data, providing dataset information, and restricting the purpose of data usage before they are released to the public. For example, filter data that must be desensitized before release, add legal checks if there are any license restrictions on the data, standardize how privacy information or sensitive content shall be handled, etc. We have a plan on working with the community to establish the guidelines, which will reduce both publishers' and users' liability risks."

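The release steps Cui lists could be expressed as automated checks that run before a dataset is published. What follows is a minimal sketch in Python, not anything OpenBytes has specified: the `DatasetManifest` class, its field names, and the `release_blockers` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetManifest:
    # Every field name here is hypothetical, not part of any OpenBytes spec.
    name: str
    license_id: str = ""
    description: str = ""
    permitted_uses: List[str] = field(default_factory=list)
    sensitive_fields: List[str] = field(default_factory=list)
    desensitized_fields: List[str] = field(default_factory=list)


def release_blockers(manifest: DatasetManifest) -> List[str]:
    """Return the reasons a dataset should not be published yet."""
    blockers = []
    if not manifest.license_id:
        blockers.append("no license registered")
    if not manifest.description:
        blockers.append("no dataset information provided")
    if not manifest.permitted_uses:
        blockers.append("no usage purpose stated")
    # Any sensitive field that has not been desensitized blocks release.
    pending = sorted(
        set(manifest.sensitive_fields) - set(manifest.desensitized_fields)
    )
    if pending:
        blockers.append("fields still sensitive: " + ", ".join(pending))
    return blockers
```

An empty list from `release_blockers` would mean the four steps named in the quote have been addressed on paper. It would not show that the license is legally sound or that the desensitization actually worked.
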
The OpenBytes Project, Cui said, aims to establish guidance to ensure data quality, with the help of the participating community.

"Both publishers and users shall follow the same good practices for its own data release processes," he explained. "For a language model, the key to understand if data bias exists is to compare the dataset with the benchmark dataset defined by OpenBytes Project. The preparation and promotion of such a procedure is within the scope of the OpenBytes project."

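Cui did not say how the comparison with a benchmark dataset would be made. One simple possibility, offered here only as an assumption, is to measure how far a dataset's category frequencies drift from the benchmark's using total variation distance. The function names and the idea of tagging each sample with a category label are illustrative, not part of the project.

```python
from collections import Counter
from typing import Dict, Iterable


def frequencies(labels: Iterable[str]) -> Dict[str, float]:
    """Turn a sequence of category labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute frequencies of an empty dataset")
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(
    candidate: Iterable[str], benchmark: Iterable[str]
) -> float:
    """Return a value in [0, 1]: 0 for identical distributions, 1 for disjoint ones."""
    p = frequencies(candidate)
    q = frequencies(benchmark)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
```

A large distance would flag a dataset whose mix of categories differs from the benchmark. A small one would not prove the data is unbiased, since the benchmark itself may be skewed.
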
The problem of trust

Siddharth Garg, associate professor of electrical and computer engineering at NYU Tandon, says that while a common data format and license may prove helpful, that doesn't necessarily resolve concerns about trust.

"One of the biggest challenges that is faced in the machine learning pipeline is how to vet your data," Garg said in an interview with The Register.

"If I get a dataset from a non-trusted source, the untrusted provider could potentially introduce a few data samples that have some special properties that are designed to mislead or to create a potentially intended misbehaviors in any neural network or machine learning algorithm that is trained on the data set. And some of these can be very subtle and extremely hard to diagnose."

There are already so-called model zoos that offer AI/ML researchers pre-trained models that are ready to be deployed. However, according to Garg, many of these do not provide much in the way of security (as noted in a paper he co-authored [PDF]). "For example, they were hosting incorrect hashes of the model...and that opens the door to new vulnerabilities," he said.

"I think it's good to have standards. I would hope that those standards have basic security functionality built in, in addition to licenses and so on."

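The incorrect-hash problem Garg mentions concerns a basic integrity check: comparing the digest of a downloaded file with the one its publisher lists. A minimal sketch using Python's standard library follows. The function names are hypothetical, and SHA-256 is assumed as the hash algorithm.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """Compare the local digest with the digest the publisher lists."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

A match shows only that the file is the one the publisher described. It says nothing about whether the model or dataset was poisoned before it was hashed, which is the subtler risk Garg describes.
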
There's also, he said, an open question about people getting compensated fairly and credited for data they create in the AI/ML community. Licensing can address that, according to Garg, but dealing with model integrity – the possibility of tampering – is a more fundamental problem.

There's also the subtler concern about whether a data model collected for one purpose can really be applied to another purpose without unintended consequences.

"What other behaviors does your model inherit from either intentional or unintentional patterns that exist in the data set?" said Garg. "The bigger issue here is that training models to learn causal behavior is hard. You end up learning all manners of spurious correlations, intentionally inserted or unintentional spurious correlations." ®
