Google brains plumb depths of the uncanny valley with latest image-to-video tool

VLOGGER needs just a still photo and an audio recording to generate footage, but it's far from perfect

Google has a new AI trick up its sleeve that can animate a still photo using nothing but a recording of a person's speech, and boy is it deepening the uncanny valley.

Dubbed VLOGGER in a paper [PDF] by a sextet of Google researchers (without any explanation for the name), the tool allegedly doesn't need any per-person training, face detection or other tweaking. Feed it a waist-up picture and an audio recording of whatever length you want, and it gets to work.

"Our objective is to bridge the gap between recent video synthesis efforts, which can generate dynamic videos with no control over identity or pose, and controllable image generation methods," the researchers say in the paper. "Industries like content creation, entertainment, or gaming all have high demand for human synthesis, yet the creation of realistic videos of humans is still complex and ripe with artifacts."

Whether VLOGGER does as good a job as the researchers seem to believe is debatable. El Reg readers can decide for themselves in videos posted to the project's GitHub page and on X. While impressive, none of the examples are likely to fool anyone – there's still something incredibly unrealistic about them.

Despite that, the researchers said VLOGGER outperforms previous state-of-the-art methods on image quality, identity preservation, and temporal consistency across three public benchmarks, and could be used to "not only ease creative processes, but also enable entirely new use cases, such as enhanced online communication, education, or personalized virtual assistants."

VLOGGER relies on a two-step process to generate uncanny videos from still photos. First, a stochastic, diffusion-based generative model predicts 3D body motion and facial expressions from the input audio, which the researchers say "is necessary to model the nuanced (one-to-many) mapping between speech and pose, gaze, and expression."

Second, an architecture based on recent image diffusion models is used to "provide control in the temporal and spatial domains," the researchers add in the paper.
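For readers who think in code, here's a rough sketch of how that two-stage pipeline fits together. Everything below – the function names, tensor shapes, and frame rate – is our own illustrative stand-in rather than Google's unreleased implementation, and the two stages are stubbed out where the real system runs diffusion models.

```python
# Minimal sketch of the two-stage audio-driven pipeline described in the paper.
# All names and dimensions here are illustrative placeholders, not VLOGGER's code.

import numpy as np

FPS = 25          # assumed output frame rate
POSE_DIMS = 64    # assumed size of the per-frame 3D pose/expression vector


def audio_to_motion(audio_waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Stage 1 (stand-in): stochastic audio-to-3D-motion prediction.

    The real system samples body motion and facial expression parameters from
    speech with a diffusion model; here we just return random parameters with
    the right shape, one vector per output video frame.
    """
    n_frames = int(len(audio_waveform) / sample_rate * FPS)
    rng = np.random.default_rng()
    return rng.standard_normal((n_frames, POSE_DIMS))


def render_frames(reference_image: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): temporally-aware image generation.

    The real system denoises video frames conditioned on the reference photo
    and the stage-1 motion controls; here we simply tile the reference image
    so the output has one frame per motion vector.
    """
    return np.stack([reference_image for _ in motion])


if __name__ == "__main__":
    # Fake inputs: a 256x256 RGB "photo" and one second of silence at 16 kHz.
    photo = np.zeros((256, 256, 3), dtype=np.uint8)
    audio = np.zeros(16_000, dtype=np.float32)

    controls = audio_to_motion(audio, sample_rate=16_000)   # (frames, POSE_DIMS)
    video = render_frames(photo, controls)                  # (frames, H, W, 3)
    print(video.shape)  # e.g. (25, 256, 256, 3)
```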

The project also required the creation of a new curated dataset, which the team calls MENTOR. The dataset is "one order of magnitude larger than existing ones," annotated with 3D pose and expression labels, and covers some 800,000 identities, including "dynamic gestures."

But … why?

The X-iverse didn't react kindly to the researchers' work, with many commenters saying how fake the videos looked, or that the effort fell short of Google's usual standards.

So why in the world is Google presenting such stiff, obviously AI-generated synthetic videos in 2024, and if the core research doesn't look good, what else could it be used for?

Neither Google nor the research team responded to questions for this article, but one possible application can be found near the end of the project's page: lip syncing to translate existing videos from one language to another.

"VLOGGER takes an existing video in a particular language, and edits the lip and face areas to be consistent with new audios," the paper notes.

A video embedded on the page showing a translation from English to Spanish was a … bit off, to say the least, and nowhere near ready for a real-world product. Quite frankly, better results were delivered years ago, though those systems were trained on hours of footage to get there, while VLOGGER is taking a shot without any per-person training.

It's not clear if Google plans to release VLOGGER or add the tech to its other AI products, or if this was purely a research project to assess feasibility. Feasible, sure, but more work is definitely needed. ®
