Google debuts Gemini 1.5 Pro model in challenge to rivals

OpenAI meanwhile teases experimental text-to-vid system Sora

Google on Thursday introduced Gemini 1.5, a multi-modal model family for text, image, and audio interaction said to best rival models in benchmarks.

Gemini 1.5 Pro, the first member of the model family, performs comparably to the web titan's Ultra 1.0 model, which debuted last week, but does so with fewer computing resources, according to the Chocolate Factory.

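Demis Hassabis, CEO of Google DeepMind, said Gemini 1.5 Pro is more efficient to train and to serve, thanks to its Mixture-of-Experts (MoE) architecture. Rather than running one large neural network for every input, an MoE model is divided into smaller "expert" networks and activates only the most relevant ones for a given input. As with earlier Gemini models, text, image, and audio modes are incorporated from the outset rather than by combining text-only, image-only, and audio-only models in a cumbersome way at a secondary stage.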
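
Google has not published how Gemini 1.5 Pro routes inputs to its experts, so the following is only a generic illustration of the Mixture-of-Experts idea, not the web giant's implementation: a gate scores every expert, and only the top few actually run. The names `route_top_k` and `moe_forward`, the four toy experts, and the choice of two active experts are all assumptions made up for this sketch.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Four toy "experts" (hypothetical); real ones would be neural sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]

# Hypothetical gate scores; in a real model a learned router produces these.
scores = [0.1, 2.0, -1.0, 2.0]
chosen = route_top_k(scores, k=2)
output = moe_forward(3, experts, scores, k=2)
```

The efficiency argument rests on the unselected experts never being evaluated: here two of the four functions are skipped entirely for this input.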

Google's latest AI model apparently outperforms rival models in benchmark tests, particularly those that depend on the number of tokens it can accept in an input prompt – a token represents about four characters in English. On a practical level, Gemini 1.5 can be fed text, code, images, audio, and video, and answer natural-language questions about that material as well as generate that sort of content.

"Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of benchmarks," Google researchers wrote in a Gemini 1.5 Pro technical paper. [PDF]

That is to say, when presented with a long document to digest – up to 10M tokens – Gemini 1.5 can respond appropriately to a specific query more than 99 percent of the time. And according to Google researchers, Gemini's 10M token capacity represents "a generational leap over existing models such as Claude 2.1 and GPT-4 Turbo," which for the time being top out at 200K and 128K tokens respectively.
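
To put those limits in perspective, here is a back-of-the-envelope sketch that applies the four-characters-per-token rule of thumb to a document and checks which of the reported context limits could hold it. The function names and the two-million-character stand-in document are hypothetical, the limits are read as decimal thousands and millions, and real tokenizers will produce different counts.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb quoted in the article; real tokenizers vary

# Context limits as reported in the article (assumed decimal: 1K = 1,000 tokens).
# The Gemini figure is the upper bound tested by Google's researchers.
CONTEXT_LIMITS = {
    "Gemini 1.5 Pro": 10_000_000,
    "Claude 2.1": 200_000,
    "GPT-4 Turbo": 128_000,
}

def estimate_tokens(text):
    """Rough token count for English text, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def models_that_fit(token_count):
    """Names of models whose reported limit can hold token_count tokens."""
    return [name for name, limit in CONTEXT_LIMITS.items() if token_count <= limit]

document = "x" * 2_000_000  # stand-in for a two-million-character document
tokens = estimate_tokens(document)
fits = models_that_fit(tokens)
```

By this estimate the stand-in document comes to roughly half a million tokens, which only the largest of the three limits can accommodate in a single prompt.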

"[The] Gemini Ultra model currently beats all existing alternatives on a broad range of benchmarks," said François Chollet, creator of Keras and a software engineer at Google, in an online post, "and that's with Google having a state-of-the-art test set filtering mechanism that is unmatched externally, so the benchmarks are likely overestimating other models."

Citing such tests, Jeff Dean, chief scientist at Google DeepMind and Google Research, in an online post said, "For text, Gemini 1.5 Pro achieves 100 percent recall up to 530k tokens, 99.7 percent up to 1M tokens, and 99.2 percent accuracy up to 10M tokens."

Gemini 1.5 Pro's extensive capacity allows it to perform feats like ingesting the 402-page Apollo 11 flight transcript (326,914 tokens) and then, when prompted, finding "three comedic moments" in the banter between the Apollo 11 astronauts and identifying the transcript text that corresponds to a hand-drawn sketch of a boot walking on the lunar surface.

And when fed Sherlock Jr, a 45-minute Buster Keaton movie from 1924 (2,674 frames at 1FPS, 684K tokens), Gemini 1.5 Pro responded to the prompt, "Tell me some key information from the piece of paper that is removed from the person's pocket, and the timecode of that moment," by reciting the text on the note in the film and the time at which that scene occurred.
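
The figures quoted for the two demos allow some rough arithmetic on how much context each page or video frame consumes. This is a sketch from the article's numbers only: the 684K figure is rounded, so the per-frame result is approximate, and the variable names are invented for illustration.

```python
def per_unit(total_tokens, units):
    """Average tokens consumed per page, frame, or other unit."""
    return total_tokens / units

# Apollo 11 transcript: 326,914 tokens across 402 pages.
apollo_tokens_per_page = per_unit(326_914, 402)

# Sherlock Jr: roughly 684K tokens across 2,674 frames sampled at 1FPS.
film_tokens_per_frame = per_unit(684_000, 2_674)

# At one frame per second, the frame count is also the running time in seconds.
film_minutes = 2_674 / 60

print(f"Transcript: about {apollo_tokens_per_page:.0f} tokens per page")
print(f"Film: about {film_tokens_per_frame:.0f} tokens per frame")
print(f"Film length: about {film_minutes:.1f} minutes")
```

The results come to a little over 800 tokens per transcript page and around 256 tokens per frame, with the frame count matching a film of just under 45 minutes.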

Google is offering a limited preview of Gemini 1.5 Pro with a 1M token context window to developers and enterprise customers at no cost through its AI Studio and Vertex AI services. General availability, with a 128K token context window, will come later, as will word of the mega-corp's price structure.

Sora spot for DeepMind

Not to be outdone, OpenAI on Thursday revealed Sora, a text-to-video model. Given a text prompt, it will create a short video, up to one minute in length.

According to the AI biz, Sora can generate complex scenes with multiple characters that move and interact with the depicted world in a coherent way. The super lab tweeted examples of its output.

Jim Fan, senior research scientist at Nvidia, described Sora as a data-driven physics engine and speculated that it was trained on a lot of synthetic data from Unreal Engine 5. "The simulator learns intricate rendering, 'intuitive' physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths," he said in a social media post.

Sora is not yet available to the public because it requires further safety testing.

"The current model has weaknesses," OpenAI said in a blog post. "It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark."

The model also has trouble with spatial details – knowing left from right, for example – and isn't great with prompts that describe change over time.

As a result, Sora is being offered to "red teamers" who will test the model for harmful output, as well as assorted visual artists in order to gain feedback on how the model might be useful in their work.

According to OpenAI, once Sora is integrated into a publicly facing product, "our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others." ®

PS: Here's a Twitter thread comparing Sora's example vids with Midjourney's image output using the same text prompts.
