Nuh-uh, Meta, we can do text-to-video AI, too, says Google
Brace yourself for a weird future where everything is imagined by magic sand we taught how to think
Hot on the heels of Meta's Make-A-Video, Google said on Wednesday it too has built an AI-powered text-to-video system. This one's called Imagen Video.
We dare say the public reveal of Make-A-Video last week spurred the Big G to suddenly start shouting about its own competing system, lest it look like it had fallen behind Mark Zuckerberg's team. Or perhaps Meta learned of Google's planned announcement, and raced to spoil it with its own unveiling. It seems too much of a coincidence.
Given a text prompt, such as "sprouts in the shape of text 'Imagen Video' coming out of a fairytale book. Smooth video," Google's software generates a sequence of images to create the short clip as seen below.
Prompt: "Sprouts in the shape of text 'Imagen Video' coming out of a fairytale book." Model output: pic.twitter.com/FVgnM0UAAn
— Durk Kingma (@dpkingma) October 5, 2022
There are numerous other examples of the model fabricating footage entirely from prompts, such as "a teddy bear running through New York City," or "incredibly detailed science fiction scene set on an alien planet, view of a marketplace. Pixel art."
Imagen Video builds upon Google's previous text-to-image system, Imagen, launched in May. Instead of a single still picture, however, Imagen Video builds a video out of multiple frames of output.
Text-to-video systems are more computationally intensive to train and run than text-to-image systems. Imagen Video, for example, is made up of seven models. For one thing, it has to not only generate a frame from its text prompt but also predict the frames that follow to form a coherent moving animation – each frame a slight progression from the previous – rather than a series of related images that, played back, would look like a jumbled mess.
"Imagen Video generates high resolution videos with Cascaded Diffusion Models," according to a Google research note.
"The first step is to take an input text prompt and encode it into textual embeddings with a T5 text encoder.
"A base Video Diffusion Model then generates a 16-frame video at 24×48 resolution and three frames per second; this is then followed by multiple Temporal Super-Resolution (TSR) and Spatial Super-Resolution (SSR) models to upsample and generate a final 128-frame video at 1280×768 resolution and 24 frames per second – resulting in 5.3 seconds of high definition video."
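The frame and resolution arithmetic of that cascade can be sketched in a few lines. The even split into three temporal and two spatial stages below, and the intermediate resolution, are assumptions for illustration; only the starting and final figures come from the quoted research note.

```python
# A minimal, shape-only sketch of a cascaded upsampling pipeline.
# Stage ordering, per-stage factors, and the intermediate 320x192
# resolution are illustrative assumptions, not Google's actual design.

def tsr(frame_mult, fps_mult):
    """Temporal Super-Resolution: fill in frames between existing ones."""
    def apply(shape):
        frames, h, w, fps = shape
        return (frames * frame_mult, h, w, fps * fps_mult)
    return apply

def ssr(new_h, new_w):
    """Spatial Super-Resolution: upsample every frame to a higher resolution."""
    def apply(shape):
        frames, _, _, fps = shape
        return (frames, new_h, new_w, fps)
    return apply

# (frames, height, width, fps) from the base Video Diffusion Model
shape = (16, 24, 48, 3)

stages = [
    tsr(2, 2),          # 32 frames @ 6 fps
    ssr(192, 320),      # hypothetical intermediate resolution
    tsr(2, 2),          # 64 frames @ 12 fps
    ssr(768, 1280),     # full 1280x768 output resolution
    tsr(2, 2),          # 128 frames @ 24 fps
]
for stage in stages:
    shape = stage(shape)

frames, height, width, fps = shape
print(shape)                   # (128, 768, 1280, 24)
print(round(frames / fps, 1))  # 5.3 seconds of video
```

Whatever the real stage breakdown, the end-to-end numbers check out: 128 frames at 24 frames per second is the quoted 5.3 seconds of footage.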
Like Meta's Make-A-Video, the output of Google's Imagen Video is somewhat fuzzy. Edges of images are blurry, and the resolution isn't great yet. Research and development into generative visual models, however, moves quickly, and it'll only be a matter of time before new architectures create fake videos that are crisper, higher definition, and longer.
These models show that computers are good at learning the logical sequence of events needed to simulate a scene, such as a water balloon bursting or an ice cream melting. Boffins at Google Brain described Imagen Video as being "temporally-coherent" and "well-aligned with the given prompt" in a non-peer-reviewed research paper [PDF].
An internal Google dataset made up of 14 million video-text samples and 60 million image-text pairs, as well as information from the publicly available LAION-400M image-text dataset, was used to train Imagen Video.
- Text-to-image models are so last month, text-to-video is here
- SiFive RISC-V cores picked for Google AI compute nodes
- Someone's at last helping AI models understand those with speech disabilities
- Here's how crooks will use deepfakes to scam your biz
"Video generative models can be used to positively impact society, for example by amplifying and augmenting human creativity. However, these generative models may also be misused, for example to generate fake, hateful, explicit or harmful content," the researchers said. The LAION-400M dataset is also known to contain pornographic and other types of problematic images.
Although the team have applied content filters to block unsavory text prompts or images in videos generated by the model, Imagen Video is still prone to creating content with "social biases and stereotypes" and is not safe yet for people to experiment with. "We have decided not to release the Imagen Video model or its source code until these concerns are mitigated," they concluded.
So, like Meta's toy, Imagen Video isn't available to the general public, perhaps making these public unveilings more recruitment tools – hey, come work on cool stuff like this – than anything else right now. ®