Supercomputer to train 176-billion-parameter open-source AI language model

BigScience is a collaborative effort by developers volunteering to make ML research more accessible

GTC BigScience – a team made up of roughly a thousand developers around the world – has started training its 176-billion-parameter open-source AI language model in a bid to advance research into natural language processing (NLP).

The transformer architecture makes it possible to train very large neural networks efficiently. Built around the self-attention mechanism, a transformer can ingest large amounts of data in one pass rather than having to break it down into smaller chunks and process them sequentially.

Transformers are particularly useful in NLP. Instead of analyzing the words in a sentence one at a time, they process all of them at once, which makes them better at modelling relationships across longer spans of text. They outperform older architectures, such as recurrent neural networks and long short-term memory networks, on tasks like text summarization and text generation.
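
To make the self-attention idea concrete, here is a minimal sketch in plain NumPy – an illustration only, not code from the BigScience project – of scaled dot-product attention, in which every token is compared against every other token in a single matrix operation:

    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model) token embeddings; w_*: learned projection matrices
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(k.shape[-1])          # every token scores every other token
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sequence
        return weights @ v                               # weighted mix of all positions

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))                          # five toy tokens, 8-dimensional embeddings
    w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = self_attention(x, w_q, w_k, w_v)               # shape (5, 8)

The attention weights span the whole sequence at once, which is why transformers handle long-range dependencies better than architectures that read a sentence token by token.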

These models have steadily grown in size and complexity, from tens of millions of parameters to hundreds of billions between 2018 and 2021. OpenAI's GPT-3, for example, has 175 billion parameters, and the Microsoft-Nvidia Megatron-Turing model has 530 billion.
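
As a rough back-of-the-envelope illustration of where those figures come from – an approximation, not a published breakdown – a decoder-only transformer's parameter count is dominated by its attention and feed-forward matrices, and so scales with layers × hidden-size squared:

    def approx_params(n_layers, d_model, vocab_size=50_000):
        # ~4*d^2 attention weights plus ~8*d^2 feed-forward weights per layer
        per_layer = 12 * d_model ** 2
        embeddings = vocab_size * d_model
        return n_layers * per_layer + embeddings

    # GPT-3's published shape: 96 layers, hidden size 12,288
    print(approx_params(96, 12_288) / 1e9)   # roughly 174 billion, close to the quoted 175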

"We keep having bigger and bigger large language models, which is very interesting to observe, but it's also slightly worrying when you consider that there are only very few places in the world that have the kinds of resources to facilitate training such large language models," Douwe Kiela, head of research at Hugging Face, a company leading the BigScience effort, said during a talk presented at this year's GPU Technology Conference hosted by Nvidia. 

BigScience is an open project and nearly a thousand developers have volunteered to help create and maintain the large datasets required to train language models. There are numerous groups focused on everything from building the 176-billion-parameter system to studying its social impacts. All the data and source code will be made available, making it easier for researchers to get under the hood to figure out how the technology works and its limitations.

The project's previous and latest open-source work can be found on GitHub.

Large language models developed by private companies – like OpenAI, Google, or Microsoft – are proprietary, making them difficult to probe. They all exhibit the same problematic behaviors, generating toxic speech, bias, and misinformation. But researchers can't understand these issues or fix them without access to the model and its training dataset, hence this open-science effort to create and share a large model.
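
When weights and tokenizers are published openly, that kind of probing becomes straightforward. As a sketch – the model name below is a placeholder, not a real checkpoint – Hugging Face's transformers library can pull down an open model so researchers can inspect its configuration, parameter count, and tokenization:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "some-org/open-model"                       # hypothetical identifier
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    print(model.config)                                # architecture details
    print(sum(p.numel() for p in model.parameters()))  # total parameter count
    print(tokenizer("Open models can be audited.").input_ids)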

"If we care about democratizing research progress as a community, and if we want to make sure that the whole wide world can make use of this technology, then we have to find a solution for that. And that is exactly what big science is trying to be," Kiela said. BigScience will be trained on data from 46 different languages. 

Backed by France's state-funded HPC company GENCI and its national supercomputer center IDRIS, the BigScience language model will be trained on the Jean Zay supercomputer. Its peak performance is over 28 petaFLOPS, and it contains multiple Nvidia V100 and A100 GPUs.
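
Training at that scale means splitting the work across many GPUs. The project's actual training stack is far more elaborate, but a minimal data-parallel sketch in PyTorch – assuming it is launched with torchrun, one process per GPU – shows the basic pattern:

    # Illustrative only; not the BigScience training code.
    # Launch with: torchrun --nproc_per_node=<gpus> train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])  # stand-in for a transformer
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device=rank)    # each process trains on its own shard of data
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                           # gradients are all-reduced across GPUs here
        opt.step()

    dist.destroy_process_group()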

The training process is expected to take roughly three to four months, Kiela said. "The main sort of side effect of this large effort is that it fosters a lot of discussion around the more pertinent research questions that we should not be afraid to ask as a scientific community.

"What are the capabilities and limitations of these models? How can we overcome biases and artifacts? What are the ethical considerations that we need to factor in what about the environment? And is this really something we need to be much more careful with when we train these models? What is the general role of these models in society? These sorts of important questions aren't often not publicly discussed. And definitely not discussed by the large industrial companies that are building these large language models," he said. ®
