Filing NeMo: Nvidia's AI framework hit with copyright lawsuit

Claims allegedly pirated content from Books3 dataset trawled by its models

Nvidia is the latest tech giant to face allegations that it used copyrighted works to train AI models without obtaining the permission of the authors.

A proposed class action lawsuit [PDF] filed against the GPU supremo in San Francisco on Friday March 8 claims the company used copyrighted material to train large language models in the Megatron library for its NeMo generative AI framework.

The complaint was filed by three authors, Abdi Nazemian, Brian Keene, and Stewart O'Nan, who claim that books they wrote were among the material used to train the Megatron LLMs.

From the court filing, it appears that Nvidia is not accused of directly copying the authors' work itself, but of training the Megatron models on a dataset that was known to contain a number of unlicensed copyrighted works.

The lawsuit refers specifically to models that Nvidia released in September 2022, namely NeMo Megatron-GPT 1.3B, NeMo Megatron-GPT 5B, NeMo Megatron-GPT 20B, and NeMo Megatron-T5 3B.

These are hosted on the website operated by AI outfit Hugging Face, along with information about each model, including its training dataset. In this case, the information states that the models were trained on "The Pile" dataset prepared by EleutherAI.

The Pile is described as "an 800GB Dataset of Diverse Text for Language Modeling," and one of its constituent parts is a collection of books called Books3, which holds the text of about 196,640 books, including works written by the three authors.

According to the court filing, the Books3 dataset was available separately on Hugging Face until October 2023, when it was removed because it "is defunct and no longer accessible due to reported copyright infringement."

The authors want the case to proceed as a class action, with themselves serving as class representatives, and are asking for a jury trial and for damages for the alleged violations of their copyrights.

In a statement sent to The Register, an Nvidia spokesperson said: "We respect the rights of all content creators and believe we created NeMo in full compliance with copyright law."

This isn't the first case of an AI company being sued over accusations of copyright infringement regarding the data used to train AI models. In December last year, The New York Times launched a case against Microsoft and OpenAI over claims the pair had used its articles without permission to build ChatGPT and similar models.

That case was perhaps made more interesting by OpenAI's assertion in January that it would be "impossible" to build top-tier neural networks that meet today's needs without using people's copyrighted works.

Meanwhile, Nvidia is still priming the AI pump with the announcement of a new professional certification in generative AI, intended to help developers establish technical credibility in this area.

Set to become available to coincide with the Santa Clara-based giant's GTC event later this month, the professional certification program will offer two associate-level generative AI accreditations, focusing on proficiency in large language models and multimodal workflow skills. ®
