OpenAI says natively multimodal GPT-4o eats text, visuals, sound – and emits the same

The o is not short for 'Oh sh...' for Google and co

OpenAI on Monday showed off GPT-4o, its latest multimodal machine learning model, making it partially available to both free and paid customers through its ChatGPT service and its API.

"The big news today is that we are launching our new flagship model and we are calling it GPT-4o," said Mira Murati, CTO of OpenAI, in a streaming video presentation. "The special thing about GPT-4o is that it brings GPT-4 level intelligence to everyone, including our free users."

The super lab also introduced a desktop app for macOS, available for Plus users today and others in the coming weeks, as well as a web user interface update for ChatGPT. As foretold, there was no word about an AI search engine.

The "o" in GPT-4o stands for "omni," according to the Microsoft-backed outfit, in reference to the model's ability to accept visual, audio, and text input, and to generate output in any of those modes from a user's prompt or request. By visual, OpenAI means video and still pictures.

It responds to audio input far better than preceding models. Previously, using Voice Mode involved delays because the voice pipeline for GPT-3.5 or GPT-4 chained three models: one for transcription, one for handling text, and one for turning text back into audio. Latencies of several seconds were common as data flowed between these separate models.

GPT-4o combines these functions into a single model, so it can respond faster and can access information that in prior incarnations did not survive intra-model transit, such as tone of voice, multiple speakers, and background noises.
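That old three-hop flow can be sketched against OpenAI's existing public endpoints (Whisper for transcription, a chat model for reasoning, TTS for speech). This is our own illustrative sketch using the official `openai` Python SDK, not OpenAI's actual Voice Mode internals; note how only the bare transcript survives the first hop:

```python
def voice_pipeline(audio_path: str) -> bytes:
    """Legacy three-model Voice Mode sketch: speech -> text -> text -> speech.

    Tone of voice, overlapping speakers, and background noise are all lost
    at stage 1, because only the transcript string reaches the chat model.
    """
    from openai import OpenAI  # pip install openai (v1.x)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Stage 1: transcription (audio in, plain text out).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # Stage 2: text-only reasoning on the transcript alone.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript}],
    ).choices[0].message.content

    # Stage 3: text back to audio -- a third model, a third round trip.
    return client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    ).content
```

Each stage is a separate network round trip to a separate model, which is where those multi-second latencies came from.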

Not all of the model's powers will be immediately available, however, due to safety concerns. GPT-4o's text and image capability should be accessible to free-tier ChatGPT users and paid Plus customers, who have 5x higher usage limits. Teams and Enterprise users can count on even higher limits.

The improved Voice Mode should enter alpha testing within ChatGPT Plus in a few weeks.

Developers using OpenAI's API service should also have access to the text and vision capabilities of GPT-4o, which is said to be twice as fast, half the price, and subject to 5x higher rate limits than GPT-4 Turbo.
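For developers, switching is essentially a one-line model-name change in the Chat Completions API. A minimal sketch, assuming the official `openai` Python SDK (v1.x) and an `OPENAI_API_KEY` in the environment; the message body mixes text with an image URL, which GPT-4o accepts in a single request (the image URL here is a placeholder):

```python
import os

# Build a Chat Completions payload mixing text and an image URL.
# GPT-4o takes both in one message; audio and video input are not
# exposed through this endpoint at launch.
def build_request(prompt: str, image_url: str) -> dict:
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

payload = build_request("What is in this picture?",
                        "https://example.com/photo.png")

# Only hit the network when a key is actually configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(**payload)
    print(resp.choices[0].message.content)
```

Existing GPT-4 Turbo code should work unchanged apart from the `model` string, which is presumably the point.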

Via the API, audio and video capabilities will initially be limited to a small group of partners in the weeks ahead.

"GPT-4o presents new challenges for us when it comes to safety because we're dealing with real time audio, real time vision," said Murati. "And our team has been hard at work figuring out how to build in mitigations against misuse."

We're wondering how much of the background is real ... CTO Mira Murati during her presentation today

One such measure is that at least initially, spoken audio output will be limited to a specific set of voices, presumably to preclude scenarios like vocal impersonation fraud.

According to OpenAI, GPT-4o rates medium risk or below in the categories covered by its Preparedness framework.

The new flagship model scores well against its rivals, natch, apparently beating GPT-4T, GPT-4, Claude 3 Opus, Gemini Pro 1.5, Gemini Ultra 1.0, and Llama 3 400B in most of the listed benchmarks (for text: MMLU, GPQA, MATH, and HumanEval).

Google's annual developer conference begins tomorrow, and we suspect the Android titan's engineers are at this moment reviewing their presentations in light of OpenAI's product update.

At the OpenAI event, Murati invited Mark Chen, head of frontiers research at OpenAI, and Barret Zoph, head of the post-training team, on stage to demonstrate the new capabilities that will be rolled out over the next several weeks.

They showed off real-time audio language translation, with Murati speaking Italian and Chen speaking English. It was an impressive albeit carefully staged demo of a capability that's likely to be welcomed by travelers who don't speak the local language.

GPT-4o's ability to read and interpret programming code also looks promising, though the Python-based temperature graphing demo could have been explained just as easily by any competent Python programmer. A novice, though, might appreciate the AI guidance. We note that OpenAI did not ask its model to clarify minified JavaScript or obfuscated malware.

Another demo in which Chen consulted GPT-4o for help with anxiety was a bit more provocative because the model recognized Chen's rapid breathing and told him to calm down. The model also emulated emotion by making its generated voice sound more dramatic on demand.

It will be interesting to see whether OpenAI allows customers to use tone and simulated emotion to drive purchases or otherwise persuade people to do things. Will a pleading or hectoring AI application produce better results than neutral recitation? And will ethical guardrails prevent emotionally manipulative AI responses?

"We recognize that GPT-4o’s audio modalities present a variety of novel risks," OpenAI said, promising more details when it releases GPT-4o's system card. ®

PS: Yes, GPT-4o still hallucinates. And the absence of any GPT-5 news kinda suggests OpenAI is reaching a phase of diminishing returns, no?
