The 2023 State of Speech Engines
Speech engines have come a long way in recent years, thanks in no small part to artificial intelligence and related technologies. That trajectory will undoubtedly continue into 2023 and beyond.
Amey Dharwadker, machine learning tech lead for Meta, says that speech technologies have made tremendous progress, reaching high levels of accuracy in transcription and translation tasks, for example.
“Significant advances have been made in the field of text-to-speech in terms of realism and naturalness of synthesized voices, mainly due to the use of deep learning algorithms,” he says. “This has enabled speech engines to generate more humanlike speech patterns, resulting in more natural and intuitive interactions between humans and voice-enabled devices. There have also been several notable breakthroughs in machine translation that have opened new possibilities for cross-language communication and global collaboration.”
Matt Muldoon, president, North America, at ReadSpeaker, credits innovation in text-to-speech engines with “making it possible to achieve synthetic speech quality that sounds increasingly natural and ever closer to real human speech.”
If voice is to become the new user interface, the technology is finally catching up to make that idea a reality, insists Ian Shakil, founder, director, and chief strategy officer of Augmedix, a provider of AI-powered medical documentation solutions.
“Advances that are allowing this to be possible include the ability to handle diarized or multiparty conversations, word error rates becoming sufficiently low, and the technology performing more accurately in far-field, noisy environments. In addition, as the decrease in word error rates continues, driven in large part by advances in deep neural networks, the bottleneck in speech innovation is no longer on the automatic speech recognition (ASR) side, but rather has shifted focus on the natural language processing aspect of the technology stack,” he explains. “And that’s a good thing.”
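For readers unfamiliar with the metric Shakil cites, word error rate is conventionally computed as the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not taken from any vendor quoted in this article, the function name `word_error_rate` is hypothetical, and the lowercase, whitespace-split normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Simplifying assumption: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("lights" -> "light")
# against a five-word reference gives 2 / 5.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

Because the denominator is the reference length, a transcript with many inserted words can score above 1.0, which is one reason production evaluations usually pair the metric with agreed text normalization rules.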
Nicu Sebe, head of AI at Humans.ai and a professor of computer science at the University of Trento, Italy, points to rapid improvements in speech recognition and speech synthesis.
“They’ve become more and more accurate, particularly with the help of deep learning techniques,” Sebe says. “The technology has reached a level of realism that makes it comparable to human speech. The incorporation of deep learning techniques in the ecosystem, together with the use of large datasets, has enabled more natural-sounding speech and a better reproduction of the nuances of human languages.”
Speech technology is also providing quantifiable insights into patterns within customer conversations, says Cliff Wiser, vice president of software architecture at Voice Foundry.
“These insights inform better automation strategies, which in turn inform better insights. This forms a virtuous cycle, or flywheel effect, which is fueling overall better customer engagement, top to bottom,” he says.
Year in Review
Experts look back on 2022 as a crucial year in the development of speech technology, particularly for voice cloning, enhanced ASR, and real-time translation.
“Voice cloning gained increased attention mostly because a large number of companies aimed at enhancing their user experience by introducing familiar voices in their final products as a boost for profit,” Sebe says. “And the real-time translation of voice messages saw major improvements regarding machine learning algorithms, the integration of multiple languages, and personalization.”
2022 is also remembered for steps forward in large language models (LLMs), as evidenced by the emergence of ChatGPT and the continued influence of earlier models such as GLaM and BERT.
“Some of the most important developments include the launch of new and improved speech recognition models like Google’s BERT-based model, which allows for improved natural language understanding, and the integration of speech recognition and natural language understanding into more industries, such as healthcare and finance,” says Iu Ayala, CEO of Gradient Insight.
Another key story in 2022 was the public release of Whisper, an automatic speech recognition system, by OpenAI.
“We also learned about GPT-4 and a large, complicated deal with Microsoft that could be worth possibly tens or hundreds of billions,” says Liz Brown, senior researcher at Artefact. “Google’s voice assistant on Google Nest Hub Max gained a new natural voice feature, ‘Look and Talk,’ which means the voice assistant no longer waits for its wake phrase ‘OK Google’ because the engine uses computer vision to detect when it receives questions.”
Also, Meta, Facebook’s parent company, made clear it wants all languages to be represented in LLMs; in mid-2022, it open-sourced the new model No Language Left Behind (NLLB-200), which can translate among 200 languages, Brown adds.
Another breakout came with the release of the Sanas.ai conversational AI platform, which can, for example, make a person from Spain sound like a German speaker by altering the mechanics of the voice.
“In the field of natural language generation, there were some breakthroughs in terms of personalization and fluency of the generated speech,” Ayala says. “Last year also saw notable developments in adding support for more languages and dialects, particularly in the field of multilingual speech recognition. Companies such as Google and Amazon have been working on improving their speech recognition models for languages such as Hindi, Mandarin, and Spanish.”
2022 also saw a boom in generative AI.
“In 2023, there will be numerous applications that link generative AI modules with speech solutions,” Shakil points out.
Deep neural networks remain powerful in many areas of AI, “and advances in this field are enabling new capabilities and applications in areas such as computer vision, speech recognition, and natural language processing,” Sebe continues.
In voice biometrics, meanwhile, voice recognition has matured and is now widely used in domains such as banking and personal identification.
A Look Ahead
But multiple challenges still need to be addressed, the pros agree.
“These include sentence context comprehension and handling real-life conditions,” Sebe says. “Context is a key element in understanding the real meaning of communication. Improving the ability of speech engines to take into account the context of a sentence, such as a speaker’s intent and the surrounding conversation, will be crucial.”
Also, as more speech engines are used in real-world environments, they will need to be able to handle background noise and different accents and speech patterns.
“That will require continuous research and development in areas such as natural language processing, deep learning, and machine learning, as well as improvements in data collection, which is critical for training large, complex neural networks,” Sebe adds.
Current speech engines still fall short in understanding and interpreting nonverbal cues, such as tone of voice. Dharwadker notes that this can lead to miscommunication, especially in more complex or nuanced conversations.
Adding to the complexity are the “many different accents, dialects, and slang terms that are difficult to understand for even the most advanced speech engines,” Dharwadker adds.
Systems also need to contend with the thorny issue of bias and discrimination.
“There have been cases where speech recognition systems have performed poorly for certain groups of people, such as those with accents or dialects that differ from the standard language. This has raised concerns about the potential of these technologies to perpetuate or amplify existing inequalities and injustices,” Dharwadker cautions.
Speech engines have undoubtedly come a long way with ASR, but word error rates and diarization scores still need to improve. “Narrowly speaking, we think the ASR models need to do a better job with date/number formatting and punctuation, too. We are starting to see that as models get bigger with more data and more parameters, it becomes more expensive to develop for and scale these models,” Shakil warns.
Indeed, a problem that will increasingly confront smaller, resource-challenged organizations is the rising price tag that comes with rapid technological change.
Nonetheless, with the Internet of Things and the growing number of smart devices in homes and businesses, “speech technology is likely to become increasingly ubiquitous, which means it will be integrated into more devices, platforms, and applications,” Sebe says. “This will enable new use cases and applications, such as controlling smart home devices, cars, and other IoT devices with voice commands, to control multiple devices at once, switch between devices, and even initiate actions across multiple devices based on a single voice command.”
Going forward, expect to see advancements in personalization, multilingual support, and natural language understanding, Ayala predicts.
“Additionally, there will likely be more integration of speech technology in various industries, such as finance and healthcare, and an increased use of speech-enabled devices in everyday life,” he posits.
Specifically in healthcare, “use of speech engines is projected to become as normal as physicians putting on their stethoscopes today. Speech technology will be running ambient services for documentation and spur real-time contextual insights at the point of care more often than not,” Shakil says. “We’ll see this in other domains as well, but healthcare will be a pioneering sector.”
Lastly, Dharwadker is encouraged by the prospect of AI and speech technology being used to improve accessibility for people with disabilities, to power virtual and augmented reality applications, and to develop more advanced voice-enabled assistants and chatbots.
Erik J. Martin is a Chicago area-based freelance writer and public relations expert whose articles have been featured in AARP The Magazine, Reader’s Digest, The Costco Connection, and other publications. He often writes on topics related to real estate, business, technology, healthcare, insurance, and entertainment. He also publishes several blogs, including martinspiration.com and cineversegroup.com.