Google DeepMind's latest model promises realistic audio for your AI-generated vids

Launch comes as Runway, Pika, Kling push the boundaries of machine-imagined video

Mainstream adoption of generative AI technologies has, in large part, centered around the creation of text and images. But, as it turns out, the statistical probabilities on which these models are based are just as good at generating all manner of other media.

The latest example of this came on Monday when Google's AI lab DeepMind detailed its work on a video-to-audio model capable of generating sound to match video samples.

The model works by taking a video stream and encoding it into a compressed representation. This, alongside natural language prompts, acts as a guide for a diffusion model which, over the course of several steps, refines random noise into something resembling audio relevant to the input footage. This audio is then converted into a waveform and combined with the original video source.

As we understand it, this approach isn't all that different from how image generation models work, but, rather than emit pictures or illustrations, it's been trained to reproduce audio patterns from video and text inputs.
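To make that pipeline concrete, here is a toy sketch of the steps described above: encode the video, encode the optional prompt, iteratively refine random noise under that guidance, decode to a waveform, and pair the result with the footage. It is not DeepMind's code. Every function name, the keyword gains, and the "denoiser" (a fixed formula standing in for a trained neural network) are assumptions made purely for illustration.

```python
import math
import random

# Hypothetical prompt "embedding": a loudness gain per known keyword.
PROMPT_GAINS = {"quiet": 0.25, "loud": 1.0}


def encode_video(frames):
    """Stand-in video encoder: compress each frame to its mean brightness (0..1)."""
    return [sum(frame) / len(frame) for frame in frames]


def encode_prompt(prompt):
    """Stand-in text encoder; an empty or unrecognized prompt gives a default gain."""
    gains = [PROMPT_GAINS[w] for w in prompt.lower().split() if w in PROMPT_GAINS]
    return min(gains) if gains else 0.5


def toy_denoiser(latent, video_code, gain):
    """Stand-in for the trained network: predict the clean audio latent.

    A real model learns this mapping from data and uses the noisy latent;
    this toy ignores it and returns brightness scaled by the prompt gain.
    """
    return [v * gain for v in video_code]


def generate_audio_latent(video_code, gain, steps=10, seed=0):
    """Refine random noise toward the conditioned prediction over several steps."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in video_code]  # start from pure noise
    for step in range(steps):
        predicted = toy_denoiser(latent, video_code, gain)
        remaining = steps - step
        # Close an equal share of the remaining gap on every step.
        latent = [x + (p - x) / remaining for x, p in zip(latent, predicted)]
    return latent


def decode_to_waveform(latent, samples_per_frame=100, tone_hz=440.0, sample_rate=8000):
    """Turn each latent value into the amplitude of a short sine-wave segment."""
    waveform = []
    for i, amplitude in enumerate(latent):
        for n in range(samples_per_frame):
            t = (i * samples_per_frame + n) / sample_rate
            waveform.append(amplitude * math.sin(2 * math.pi * tone_hz * t))
    return waveform


def video_to_audio(frames, prompt=""):
    """Full toy pipeline: encode, denoise under guidance, decode, pair with video."""
    video_code = encode_video(frames)
    gain = encode_prompt(prompt)
    latent = generate_audio_latent(video_code, gain)
    return {"video": frames, "audio": decode_to_waveform(latent)}


frames = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]  # three tiny two-pixel "frames"
clip = video_to_audio(frames, prompt="quiet footsteps")
print(len(clip["audio"]))  # 300 samples: 3 frames x 100 samples each
```

The point of the sketch is the shape of the loop: the starting noise is random, but the conditioning signal steers every refinement step, so brighter frames end up with louder audio and the prompt scales the result. The prompt argument is optional, mirroring the claim that the model can work with or without text.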

Here's one of several samples DeepMind released this week showing the model in action:

Youtube Video

DeepMind says it used a variety of datasets which included not only video and audio, as you might expect, but also AI-generated annotations and transcriptions to help teach the model to associate various visual events with different sounds. This, the researchers explained, means that the model can generate audio with or without a text prompt and doesn't require manual alignment of the audio and video tracks. However, there are still some hurdles to overcome.

For one, because of how audio is generated, the actual quality of the soundtrack is dependent on the source material. If the video quality is poor, the audio is likely to be as well. Lip sync has also proven to be quite challenging, to be polite.

DeepMind expects the new model to pair nicely with those designed for video generation, including its own in-house Veo model.

According to the DeepMind team, one of the problems with the current crop of text-to-video models is that they are usually limited to generating silent films. By combining them with its video-to-audio model, the DeepMind team claims, entirely AI-generated videos, complete with soundtracks and even dialogue, are possible.

Speaking of video-gen models, the category has grown considerably over the past year with more players entering the space. 

ML juggernaut OpenAI unveiled its own video-generation model called Sora back in February. But Sora is just one of several models pushing the envelope of what's possible.

Kling AI

Among these models is one from Kling AI. Developed by (partially state-owned) Chinese tech firm Kuaishou, Kling uses a combination of diffusion transformers to generate the frames and a "3D time-space attention system" to model motion and physical interactions within the scenes. Here's the system in action:

Youtube Video

The results are videos that, if you don't look too closely, could easily be confused for human-generated video footage. However, on closer inspection, you'll quickly start to notice visual artifacts and incongruities. With that said, this seems to be a common theme with many of the AI video generators on the market today.

While details on Kling are scarce, its developers claim it's more capable than OpenAI's Sora. The model can supposedly produce videos up to two minutes in length at 1080p resolution and 30 frames per second. Unfortunately, access to the model is, for the moment, limited to China.
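For a sense of scale, here is a quick back-of-the-envelope calculation of what a clip at those claimed limits involves. It assumes 1080p means 1920×1080 pixels, which Kling's developers don't spell out, and the `clip_budget` helper is our own illustration rather than anything from Kuaishou.

```python
def clip_budget(seconds, fps, width, height):
    """Return (frame count, total pixel count) for an uncompressed clip."""
    frames = seconds * fps
    pixels = frames * width * height
    return frames, pixels


# Two minutes at 30 frames per second, assuming 1080p = 1920x1080.
frames, pixels = clip_budget(seconds=120, fps=30, width=1920, height=1080)
print(frames)  # 3600
print(pixels)  # 7464960000
```

That works out to 3,600 frames and roughly 7.5 billion pixels that all have to stay consistent with one another, which goes some way to explaining why artifacts creep in.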


Runway

Another model builder working on video generation is Runway which, on Monday, revealed its Gen-3 Alpha model. Runway has been working on a number of image and video generation models going back to early 2023.

According to Runway, Gen-3 Alpha is one of several models currently under development and was trained on a combination of videos and images paired with highly descriptive captions. This, the startup says, allowed it to achieve more immersive transitions and camera movements than were possible with previous models. Here's this one in action:

Embedded MP4 video

The model will also introduce safeguards to prevent users from generating unsavory images or videos, or so we're told. Runway plans to build the latest model into its existing library of text-to-video, image-to-video, and text-to-image services as well as work with industry partners to create custom versions.


Pika

These AI video upstarts have clearly got investors' attention. Earlier this month, Pika picked up $80 million in Series B funding from Spark Capital and others to accelerate development of its AI video generation and editing platform.

The team's 1.0 model, which launched in beta last July, supports a variety of common scenarios including generating videos based on text, image, or video prompts. Over the past few months, Pika has added support for fine-grained editing, in-painting-style adjustments, sound effects, and lip sync. Here's where it's up to, below:

Youtube Video

Similar to the popular image-gen model Midjourney, users can interact with Pika's AI video service through Discord, or through the startup's web app.

This, of course, is by no means an exhaustive list, and we expect to see many more models and video-generation services crop up over the next few months, as AI devs' ambitions push beyond text and image models. Let us know below if you've seen any related stuff like this. ®
