-->

2024 Speech Industry Award Winner: OpenAI Breaks More Molds with Voice Introductions


OpenAI turned the tech world upside down with its release of ChatGPT in November 2022. Then, in yet another groundbreaking move, the San Francisco-based company gave ChatGPT a voice last September.

With the new capability, ChatGPT users could engage in back-and-forth voice conversations with the assistant.

The voice capability is powered by a new text-to-speech model, capable of generating humanlike audio from just text and a few seconds of sample speech. OpenAI collaborated with professional voice actors to create each of the voices. On the input side, Whisper, OpenAI’s open-source speech recognition system, transcribes the user's spoken words into text.

Other speech innovations from OpenAI included the November launch of the OpenAI Audio API, a text-to-speech application programming interface with six preset voices and two generative AI model variants. The Audio API provides a text-to-speech endpoint to narrate written blog posts, produce spoken audio in multiple languages, and give real-time audio output using streaming.

The speech endpoint requires just three inputs: the model name, the text that should be turned into audio, and the voice to be used. By default, the endpoint outputs an MP3 file of the spoken audio, but it can also be configured for other formats, like Opus (for internet streaming), AAC (for digital audio compression), and FLAC (for lossless audio compression).
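As a rough illustration of those three inputs, the sketch below builds and sends a request to the speech endpoint using only Python's standard library. The endpoint URL and the `response_format` field follow OpenAI's public API documentation; the helper names `build_speech_request` and `synthesize` are hypothetical, and the format list is limited to the formats named above.

```python
import json
import urllib.request

# Endpoint per OpenAI's public API documentation; verify before relying on it.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"
SUPPORTED_FORMATS = {"mp3", "opus", "aac", "flac"}  # formats named in the article


def build_speech_request(model, text, voice, response_format="mp3"):
    """Return the JSON body for a text-to-speech request (hypothetical helper)."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    return {
        "model": model,            # required: which TTS model to use
        "input": text,             # required: the text to turn into audio
        "voice": voice,            # required: one of the preset voices
        "response_format": response_format,  # optional: MP3 is the default
    }


def synthesize(body, api_key, out_path):
    """POST the body to the speech endpoint and save the returned audio bytes."""
    request = urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

Assuming a valid API key, a call such as `synthesize(build_speech_request("tts-1", "Hello from the Audio API.", "alloy"), api_key, "hello.mp3")` would write the spoken audio to `hello.mp3`; the model and voice names here are examples, not a recommendation.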

The Speech API also now supports real-time audio streaming using chunked transfer encoding. This means that playback can begin before the full audio file has been generated and made accessible.
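To make the streaming idea concrete, here is a minimal sketch of a client that hands audio to a player chunk by chunk instead of waiting for the whole file. The names `iter_chunks`, `play_while_downloading`, and `on_chunk` are hypothetical, and an in-memory buffer stands in for the network response. With a real HTTP response, Python's `http.client` decodes chunked transfer encoding transparently, so `response.read(n)` returns audio bytes without waiting for the full body.

```python
import io


def iter_chunks(stream, chunk_size=4096):
    """Yield successive byte chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def play_while_downloading(stream, on_chunk, chunk_size=4096):
    """Pass each chunk to on_chunk as soon as it is read; return total bytes handled."""
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        on_chunk(chunk)  # e.g. append to an audio player's buffer
        total += len(chunk)
    return total


# Stand-in for a streamed response: 10,000 bytes of placeholder audio data.
fake_response = io.BytesIO(b"\x00" * 10_000)
received = []
total_bytes = play_while_downloading(fake_response, received.append)
```

In a real client, `on_chunk` would feed an audio decoder or player, so the listener hears the first words while later chunks are still being generated.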

OpenAI also launched Whisper large-v3, the next version of its speech recognition model, which reportedly offers improved performance across languages.

With the release of DALL-E 3, OpenAI’s latest text-to-image model gained new format, quality, and resolution options.

And then there was the introduction of GPT-4 Turbo, with an expanded context window that accepts up to 128K tokens of text input, up from the roughly 3,000 words that previous GPT models could accept, and support for DALL-E 3 and TTS models.

And just this past summer, OpenAI finally began rolling out its much-anticipated ChatGPT Voice Mode to some users.

OpenAI’s advanced Voice Mode generated quite a bit of buzz when actress Scarlett Johansson accused OpenAI of modeling the “Sky” voice after her own. OpenAI denied using Johansson’s voice without permission but later removed “Sky” from the custom voices available.

The ChatGPT Voice Mode update enables users to have interactive voice conversations with ChatGPT. New features in the update include translation, the ability to create sound effects, and custom character voices. OpenAI also says Voice Mode can understand and respond with emotions and nonverbal cues, making conversations more natural.

And, surprisingly, OpenAI even made headlines with what it didn’t release.

It was all set to broadly release Voice Engine, a text-to-speech AI model for creating synthetic voices based on 15-second audio samples, but decided at the eleventh hour to delay the launch amid concerns about misuse of the technology.

OpenAI developed Voice Engine in 2022 and integrated it into ChatGPT’s text-to-speech feature. The company cited Voice Engine’s tremendous potential, particularly as an educational tool for reading, translating content, and helping people with difficulties speaking on their own, but pulled back the release when fears mounted around its potential misuse.

“We are choosing to preview but not widely release this technology at this time,” the company said in a blog post, while it waits for a larger “societal resilience against the challenges brought by ever more convincing generative models.”

At the same time, OpenAI said it is encouraging “steps like phasing out voice-based authentication as a security measure for accessing bank accounts and other sensitive information.”

“We hope to start a dialogue on the responsible deployment of synthetic voices and how society can adapt to these new capabilities,” OpenAI added.
