OpenAI Introduces Speech-to-Text and Text-to-Speech Audio Models
OpenAI has released speech-to-text and text-to-speech audio models in its API to help developers build more powerful, customizable, and intelligent voice agents.
OpenAI claims its latest speech-to-text models outperform existing models in accuracy and reliability, especially in challenging scenarios involving accents, noisy environments, and varying speech speeds. These improvements make the models well suited for use cases such as customer call centers and meeting transcription.
Developers can also instruct the text-to-speech model to speak in a specific way. This enables a wide range of tailored applications, from more empathetic and dynamic customer service voices to expressive narration for creative storytelling experiences.
Among the new speech-to-text models are gpt-4o-transcribe and gpt-4o-mini-transcribe, which OpenAI says achieve lower word error rates and better language recognition and accuracy than the original Whisper models.
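A minimal sketch of how a developer might call the new transcription models through the OpenAI Python SDK's existing transcriptions endpoint; the file name meeting.mp3 is a placeholder for any supported audio file:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local audio file with the new speech-to-text model.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcription.text)
```

Because the new models use the same endpoint as Whisper, switching is a one-line model-name change for existing transcription code.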
According to OpenAI, gpt-4o-transcribe's advancements stem directly from targeted innovations in reinforcement learning and extensive mid-training with diverse, high-quality audio datasets. As a result, the new speech-to-text models better capture the nuances of speech and reduce misrecognitions.
The new text-to-speech model, gpt-4o-mini-tts, offers better steerability, enabling developers to instruct the model not just on what to say but on how to say it.
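A sketch of what that steerability looks like in practice, assuming the instructions parameter OpenAI documents for gpt-4o-mini-tts; the voice name and prompt text here are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Generate speech whose delivery is steered by the `instructions` field:
# the `input` is *what* to say, the `instructions` are *how* to say it.
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in voices
    input="Thank you for calling. How can I help you today?",
    instructions="Speak in a warm, empathetic tone, like a patient support agent.",
)

# The response body is the raw audio; save it to a file.
with open("greeting.mp3", "wb") as f:
    f.write(response.content)
```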
The new audio models build on OpenAI's GPT-4o and GPT-4o-mini architectures and are pretrained on specialized audio-centric datasets. OpenAI also enhanced its distillation techniques, enabling knowledge transfer from its largest audio models to smaller, more efficient ones. Created with advanced self-play methodologies, the distillation datasets capture realistic conversational dynamics, replicating genuine user-assistant interactions to help the smaller models deliver strong conversational quality and responsiveness.
"These new audio models are available to all developers now," OpenAI said in a statement. "For developers already building conversational experiences with text-based models, adding our speech-to-text and text-to-speech models is the simplest way to build a voice agent."