Meta trains data2vec neural network to grok speech, images, text so it can 'understand the world'

Whatever it takes, Mark


Researchers at Facebook parent Meta have trained a single AI model capable of processing speech, images, and text, in the hope that these so-called multi-modal systems will power the company’s augmented reality and metaverse products.

The model, known as data2vec, can perform different tasks. Given an audio snippet, it can recognize speech. If it’s fed an image, it can classify objects. And when faced with text, it can check the grammar or analyse the writing’s tone and emotions.

AI algorithms are typically trained on one type of data, whereas data2vec is trained on three different modalities. It still, however, processes each form, whether it's speech, images, or text, separately.

Meta believes these multi-modal models will help computers adapt to a world where physical and digital environments blend into one. “People experience the world through a combination of sight, sound and words, and systems like this could one day understand the world the way we do,” Meta CEO Mark Zuckerberg said in a statement to El Reg.

“This will all eventually get built into AR glasses with an AI assistant so, for example, it could help you cook dinner, noticing if you miss an ingredient, prompting you to turn down the heat, or more complex tasks.”

Data2vec is a transformer-based neural network that uses self-supervised learning to pick up common patterns across audio, computer vision, and natural language processing. The model learns to operate with different types of data by learning to predict representations of the data it's given: it has to guess what a masked group of pixels contains in an image, reconstruct a masked stretch of speech in audio, or fill in missing words in a sentence.
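At a high level, the approach pairs a student network with a teacher network whose weights are a moving average of the student's: the student sees a masked view of the input and is trained to predict the teacher's internal representations of the full input. The sketch below illustrates that loop in PyTorch under simplifying assumptions; the TinyEncoder class, the 15 per cent masking rate, and the plain MSE loss are illustrative stand-ins rather than the paper's actual architecture, masking scheme, or loss.

# A minimal, hypothetical sketch of data2vec-style training in PyTorch.
# TinyEncoder, the masking rate, and the MSE loss are illustrative stand-ins;
# the real system uses a large Transformer, modality-specific masking
# (image patches, speech frames, text tokens), and a regression loss over
# averaged teacher layers.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    # Stand-in for the Transformer encoder shared across modalities.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

student = TinyEncoder()
teacher = copy.deepcopy(student)              # teacher tracks the student via EMA
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def ema_update(decay=0.999):
    # Teacher weights follow the student as an exponential moving average.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

def train_step(features):
    # features: (batch, seq, dim) output of a modality-specific front end,
    # whether that represents image patches, speech frames, or text tokens.
    mask = torch.rand(features.shape[:2]) < 0.15   # hide roughly 15% of positions
    corrupted = features.clone()
    corrupted[mask] = 0.0

    with torch.no_grad():
        target = teacher(features)                 # teacher sees the full input
    pred = student(corrupted)                      # student sees the masked view

    # Predict the teacher's latent representations at the masked positions,
    # rather than raw pixels, audio samples, or words.
    loss = F.mse_loss(pred[mask], target[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update()
    return loss.item()

print(train_step(torch.randn(8, 32, 64)))          # one step on dummy features

Because the objective is defined over latent representations rather than raw pixels, audio samples, or words, the same training recipe can be reused for each modality, which is the point the researchers emphasize.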

The researchers used a mix of 16 Nvidia V100 and A100 GPUs to train data2vec on 960 hours of speech audio, millions of words from books and Wikipedia pages, and images from ImageNet-1K.

"We train separate models for each modality but the process through which the models learn is identical," Alexei Baevski, a research engineer at Meta AI told The Register.

"We hope that it will enable future work to build high performing self-supervised models that combine modalities and are more effective than specialized models. Different modalities can add additional information to the same piece of content - for example body language from video, prosodic information from audio, and text can combine into a richer representation of a dialog. The algorithms that currently try to combine multi-modal information exist but they do not yet perform well enough to replace specialized algorithms and we hope our work will help change that."

Baevski said that in the future, multi-modal systems could incorporate a larger range of data to model concepts such as smell, 3D objects, or videos. He referred back to the idea of AR glasses helping wearers cook.

"Imagine having a model that has been trained on recordings of thousands of hours of cooking activity from various restaurants and chefs. Then, when you are cooking in a kitchen wearing your AR glasses that have access to this model, it’s able to overlay visual cues for what you need to do next, point out potential mistakes, or explain how adding a particular ingredient will affect the taste of your dish," he told us.

Previous research on multi-modal systems has shown they can be prone to simple adversarial attacks. OpenAI's CLIP model, for example, which was trained on images and text, will incorrectly identify an image of an apple as an iPod if the word "iPod" appears in the picture. It's unclear, however, whether data2vec suffers from similar weaknesses.
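That weakness stems from how CLIP scores an image against candidate text prompts. The snippet below, using the Hugging Face transformers library, is one way to probe for it; the image file name and the two prompts are hypothetical, chosen so you can compare the model's scores for an apple photo with and without the word "iPod" written on it.

# Probing CLIP's zero-shot classification, the mechanism the typographic
# attack exploits. The image path and prompts are hypothetical examples.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple_with_ipod_label.jpg")    # e.g. an apple with "iPod" written on it
labels = ["a photo of an apple", "a photo of an iPod"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image      # image-text similarity scores
probs = logits.softmax(dim=-1)[0]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")                     # a fooled model favours the iPod prompt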

"We have not specifically analyzed how our models will react to adversarial examples but since our current models are trained separately for each modality, we believe that existing research on adversarial attack analysis for each modality would be applicable to our work as well," Baevski said.

"In the future, we hope to use our work to enable high performance algorithms that combine modalities in one model and we plan to study how susceptible they are to adversarial attacks."

When the researchers tested data2vec, it outperformed some top models that had each been trained on a single data type, across a range of tasks. The preliminary results are described in a paper [PDF], and the code has been published on GitHub.

“Data2vec demonstrates that the same self-supervised algorithm can work well in different modalities — and often better than the best existing algorithms,” the researchers explained in a blog post this week.

“This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread. We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks.” ®

