Amazon Launches Nova Sonic, a Gen AI Model for Building Voice Applications and Agents
Amazon has introduced Amazon Nova Sonic, a foundation model that unifies speech understanding and speech generation into a single model, to enable more human-like voice conversations in artificial intelligence (AI) applications.
Available in Amazon Bedrock via a new bi-directional streaming API, the model simplifies the development of voice applications, such as customer service call automation and AI agents across a broad range of industries, including travel, education, healthcare, entertainment, and more.
"From the invention of the world's best personal AI assistant with Alexa, to developing AWS services like Connect, Lex, and Polly that are used across a wide range of industries, Amazon has long believed that voice-powered applications can make all of our customers' lives better and easier," said Rohit Prasad, senior vice president of Amazon Artificial General Intelligence, in a statement. "With Amazon Nova Sonic, we are releasing a new foundation model in Amazon Bedrock that makes it simpler for developers to build voice-powered applications that can complete tasks for customers with higher accuracy, while being more natural and engaging."
Nova Sonic has a unified model architecture that delivers speech understanding and generation, without requiring a separate model for each of these steps. This unification enables the model to adapt the generated voice response to the acoustic context (e.g., tone, style) and the spoken input.
Nova Sonic also understands the nuances of human conversation, including natural pauses and hesitations, waiting to speak until the appropriate time and gracefully handling barge-ins. It also generates a text transcript of the user's speech, enabling developers to use that text to call specific tools and APIs for building voice-enabled AI agents (e.g., an AI-powered travel agent that can book flights by retrieving up-to-date flight information).
Amazon claims that it rigorously tested Nova Sonic against a wide range of industry-standard benchmarks for speech understanding and generation and found that the model excels in natural dialogue handling, understanding and adapting to pauses, hesitations, and interruptions while maintaining conversational context throughout the interaction. This capability contributed to strong performance for overall quality and accuracy in turn-taking tests.
Amazon also claims that Nova Sonic demonstrates strong performance on overall conversation quality compared to other models, such as OpenAI's GPT-4o (Realtime) and Google's Gemini Flash 2.0. In its testing on single-turn dialogues, Nova Sonic's American English masculine voice achieved win rates of 51 percent and 69.7 percent against OpenAI's GPT-4o (Realtime) and Google's Gemini Flash 2.0, respectively, based on the Common Eval data set. Likewise, Nova Sonic's American English feminine voice scored win rates of 50.9 percent and 66.3 percent against OpenAI's GPT-4o (Realtime) and Google's Gemini Flash 2.0, respectively, on the same data set. Nova Sonic's U.K. English feminine voice also came out ahead, scoring a 58.3 percent win rate against OpenAI's GPT-4o (Realtime).
In additional testing on the Multilingual LibriSpeech (MLS) data set, Nova Sonic achieved a word error rate (WER) of 4.2 percent when averaged across English, French, Italian, German, and Spanish, according to Amazon. That is 36.4 percent lower, in relative terms, than the WER of OpenAI's GPT-4o Transcribe model. Amazon also claimed that on the English utterances of MLS, Nova Sonic's WER is 24.2 percent lower, in relative terms, than that of OpenAI's GPT-4o Transcribe model.
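To make the "relative" comparison concrete, the short Python sketch below shows how a relative WER reduction is computed and what baseline the reported figures would imply. The function names are illustrative, and the baseline value is back-calculated from the two numbers above as an assumption; it is not a figure Amazon published.

```python
def relative_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return the fraction by which model_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


def implied_baseline(model_wer: float, reduction: float) -> float:
    """Back out the baseline WER implied by a model WER and a relative reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return model_wer / (1.0 - reduction)


# Figures reported in the article (multilingual MLS average).
nova_wer = 4.2             # percent
reported_reduction = 0.364  # 36.4 percent relative

# Assumption: derived value, not a published number.
baseline = implied_baseline(nova_wer, reported_reduction)
print(f"Implied baseline WER: {baseline:.2f} percent")
print(f"Check: {relative_reduction(baseline, nova_wer):.1%} relative reduction")
```

Under these assumptions, a 36.4 percent relative reduction is a gap of roughly 2.4 percentage points in absolute WER, which is why the relative and absolute figures should not be confused.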
Nova Sonic is also robust to noisy conditions, with a 46.7 percent relative lower WER for English compared to OpenAI's GPT-4o Transcribe model when measured on the Augmented Multi-Party Interaction (AMI) meeting benchmark, which consists of real-world noisy and multi-speaker interactions, according to Amazon.
In addition, Nova Sonic delivers an average customer-perceived latency of 1.09 seconds from the time the customer is done talking to the time the system generates the first speech response, compared to 1.18 seconds for OpenAI's GPT-4o (Realtime) and 1.41 seconds for Google's Gemini Flash 2.0, the company said.
Nova Sonic also supports tool use for applications, like customer service call automation, that require responses to be factually grounded in enterprise data, such as pricing plans, available inventory, and schedule availability. Nova Sonic's native tool use also enables the model to resolve complex customer queries and complete tasks on behalf of customers, such as making a reservation or finding alternate flights.
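As a rough illustration of the tool-use pattern described above, the Python sketch below shows how an application might route a model's tool request to a local function and hand back a grounded result. It is not the Amazon Bedrock API: the tool name `find_flights`, the request shape (a tool name plus a JSON string of arguments), and the sample flight data are all hypothetical assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for enterprise data (e.g., a flight inventory system).
FLIGHTS: List[Dict[str, str]] = [
    {"flight": "XX100", "origin": "SEA", "destination": "JFK"},
    {"flight": "XX200", "origin": "SEA", "destination": "LAX"},
]


def find_flights(origin: str, destination: str) -> List[Dict[str, str]]:
    """Hypothetical tool: look up flights matching a route."""
    return [
        f for f in FLIGHTS
        if f["origin"] == origin and f["destination"] == destination
    ]


# Registry mapping tool names (as the model would request them) to functions.
TOOLS: Dict[str, Callable[..., Any]] = {"find_flights": find_flights}


def handle_tool_request(name: str, arguments_json: str) -> str:
    """Run the requested tool and return a JSON string to send back to the model.

    Assumption: the model's request arrives as a tool name plus JSON arguments.
    """
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    arguments = json.loads(arguments_json)
    return json.dumps({"result": TOOLS[name](**arguments)})


reply = handle_tool_request(
    "find_flights", '{"origin": "SEA", "destination": "JFK"}'
)
print(reply)
```

In a real deployment, the returned JSON would be passed back to the model so that its spoken response is grounded in the looked-up data rather than generated from memory.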