Amazon Nova Sonic AI doesn't just hear you, it takes tonal cues too
The foundation model supports real-time bi-directional speech
Amazon has introduced a foundation model that claims to grasp not just what you're saying, but how you're saying it - tone, hesitation, and more.
Amazon Nova Sonic, the latest member of the Nova family of foundation models first introduced in December 2024, accepts spoken input and responds with real-time speech, while also generating a transcript for developers.
Traditionally, voice-based AI apps stitch together three separate models: one for speech recognition, one to generate responses, and one to synthesize speech. Amazon claims Nova Sonic unifies these capabilities into a single model.
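The three-stage pipeline Amazon says it has replaced can be sketched with toy stubs. Nothing below is a real API; the functions exist only to show the text-only handoffs between stages, which is where tonal information gets lost:

```python
# Illustrative stand-ins for the three separate models a traditional
# voice pipeline chains together. Only plain text crosses each stage
# boundary, so the response and synthesis stages never hear *how*
# something was said - the gap a unified speech-to-speech model closes.

def speech_to_text(audio: bytes) -> str:
    """Stage 1: ASR model turns audio into a transcript (stubbed)."""
    return "what's the weather like?"

def generate_response(transcript: str) -> str:
    """Stage 2: text LLM produces a reply (stubbed)."""
    return f"You asked: {transcript!r}. It's sunny."

def text_to_speech(text: str) -> bytes:
    """Stage 3: TTS model synthesizes audio (stubbed waveform)."""
    return text.encode("utf-8")

def traditional_pipeline(audio: bytes) -> bytes:
    # Prosody is discarded at stage 1; downstream stages see text only.
    return text_to_speech(generate_response(speech_to_text(audio)))
```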
"This unification enables the model to adapt the generated voice response to the acoustic context (e.g., tone, style) and the spoken input, resulting in more natural dialogue," Amazon said in its announcement. "Nova Sonic even understands the nuances of human conversation, including the speaker’s natural pauses and hesitations, waiting to speak until the appropriate time, and gracefully handling barge-ins."
The e-souk has posted sample audio of an exchange that illustrates the point. In the recording, an AI travel assistant helping a customer book a trip adopts a reassuring tone after the customer voices concern about the ticket price.
"Amazon Nova Sonic doesn't just understand what you say," explains Osman Ipek, senior machine learning solutions architect at Amazon, in a video. "It also understands how you say it. So it adapts its responses to mirror your communication style. If you speak with excitement Nova Sonic's response will match with the similar enthusiasm. If you adopt a serious tone it will adjust accordingly by recognizing prosodic elements like pitch and emotion. It creates truly conversational interactions."
Available within Amazon Bedrock via the bidirectional streaming API, Nova Sonic "understands streaming speech in various speaking styles and generates expressive speech responses that dynamically adapt to the prosody of input speech."
Essentially, the model can modulate its voice and will pause when interrupted and then resume, which makes for more natural conversational flow.
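That pause-and-resume behavior amounts to a small state machine on the speaking side. The class below is a hypothetical sketch of the idea, not Nova Sonic's actual implementation: when the user barges in, output stops immediately, but the unspoken remainder of the turn is kept so speech can resume.

```python
class SpeakerState:
    """Toy model of barge-in handling - hypothetical, not Amazon's code."""

    def __init__(self, sentences):
        self.queue = list(sentences)  # response not yet spoken aloud
        self.speaking = False

    def speak_next(self):
        """Emit the next sentence of the response, if any remains."""
        if self.queue:
            self.speaking = True
            return self.queue.pop(0)
        self.speaking = False
        return None

    def barge_in(self):
        # User started talking over the model: stop immediately, but
        # keep the remainder of the turn so it can resume afterwards.
        self.speaking = False

    def resume(self):
        """Pick the turn back up after the interruption ends."""
        self.speaking = bool(self.queue)
        return self.speaking
```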
Application code can route the transcripts Nova Sonic generates into separate, analytics-based sentiment analysis. But much of the model's tonal variation is expected to be driven by prompts.
Nova Sonic models don't provide direct access to voice control parameters. Rather, the user instructs the model on the tone it should take via the system prompt. For example:
You are a friend. You and the user will engage in a spoken dialog exchanging the transcripts of a natural real-time conversation. Keep your responses short, generally two or three sentences for chatty scenarios. You may start each of your sentences with emotions in square brackets such as [amused], [neutral] or any other stage direction such as [joyful]. Only use a single pair of square brackets for indicating a stage command.
Nova Sonic supports a context window of 32K tokens for audio and has a default connection limit of eight minutes, which can be renewed to continue longer conversations. It can interface with enterprise systems via Retrieval Augmented Generation (RAG) and handle function calling and agent-oriented workflows, in a variety of speaking styles across its supported languages – currently just English (American and British).
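A client that wants conversations longer than the default limit has to renew the connection itself. The helper below is a hypothetical sketch of that bookkeeping, assuming the eight-minute figure from the article; the `renew` method is a placeholder for whatever reconnection logic a real client would run:

```python
import time

CONNECTION_LIMIT_SECONDS = 8 * 60  # default per-connection cap cited above

class RenewableSession:
    """Hypothetical client-side helper: renew a streaming connection
    shortly before the default eight-minute limit is reached."""

    def __init__(self, clock=time.monotonic, margin=30):
        self.clock = clock        # injectable for testing
        self.margin = margin      # renew this many seconds early
        self.started = clock()
        self.renewals = 0

    def needs_renewal(self) -> bool:
        elapsed = self.clock() - self.started
        return elapsed >= CONNECTION_LIMIT_SECONDS - self.margin

    def renew(self):
        # A real client would re-open the bidirectional stream and
        # carry the conversation context over; here we reset the timer.
        self.started = self.clock()
        self.renewals += 1
```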
IT consultancy Gartner in April published a report titled, "Market Guide for Conversational AI Solutions." The firm found, "Demand for [conversational AI] capabilities is increasing across numerous use cases, both customer and employee-facing. However, leaders find it challenging to discern solutions that can best meet their requirements in such a rapidly evolving market."
Gartner expects the conversational AI market to reach $36 billion in revenue by 2032, up from $8.2 billion in 2023. ®