The State of Speech Engines
In 2019, speech engines became more sophisticated and can now support additional languages and dialects, but more work remains. These solutions, which include technologies for speech-to-text, text-to-speech, speech recognition, voice command and control, voice search, transcription, translation, and related activities, now do a better job of recognizing words, but ironically, word recognition is not what users ultimately want. They demand systems that can respond to them like a person, and for vendors, meeting that goal remains elusive.
The Year in Review
Among the new languages and dialects added in 2019, Amazon’s Alexa now supports Hindi voice interactions. In addition, the vendor enhanced its system to understand local variations of popular languages, such as U.S. Spanish and Brazilian Portuguese, allowing more consumers to check the weather, control smart home devices, and listen to music with Amazon-branded devices. Third parties, such as Bose, LG Electronics, and Sony, access Alexa Voice Service application programming interfaces (APIs) to build Alexa into their own devices.
LumenVox also extended its system’s reach to support local dialects, including U.S., U.K., Australian, and New Zealand English, as well as North American Spanish.
And because many individuals and families speak more than one language, Amazon further introduced a Multilingual Mode that allows Alexa to switch between two languages. The system automatically detects which of the two languages a user is speaking and responds in the same language. The feature is available for three language pairs: English and Spanish in the United States, Indian English and Hindi in India, and English and French in Canada.
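A rough idea of how such a mode behaves can be sketched in a few lines. The snippet below is a minimal illustration, not Amazon’s implementation; it uses the open-source langdetect package, and the hard-coded language pair and canned replies are assumptions made for the example.

```python
# A minimal sketch of the behavior described above, not Amazon's implementation.
# It uses the open-source langdetect package (pip install langdetect) to guess
# the language of an utterance and then answers in that same language, drawing
# on a hypothetical pair of canned replies.
from langdetect import detect

LANGUAGE_PAIR = {"en", "es"}  # e.g., the U.S. English/Spanish pairing

CANNED_REPLIES = {
    "en": "It is 72 degrees and sunny today.",
    "es": "Hoy hace 22 grados y está soleado.",
}

def reply(utterance: str) -> str:
    """Detect the utterance's language and respond in the same language."""
    lang = detect(utterance)
    if lang not in LANGUAGE_PAIR:
        lang = "en"  # fall back to the pair's primary language
    return CANNED_REPLIES[lang]

print(reply("What's the weather like today?"))  # answered in English
print(reply("¿Qué tiempo hace hoy?"))           # answered in Spanish
```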
In a similar move, LumenVox added a new transcription engine aimed exclusively at free-form audio, according to Jeff Hopper, vice president of client services at LumenVox. “It works in real time so [interactive voice response] apps receive not just structured data or natural language input, but they also have raw text that they can manipulate,” he explains.
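To make the pattern Hopper describes concrete, here is a hypothetical sketch of an IVR handler that receives both structured output and the raw transcript; the RecognitionResult type, field names, and sample values are invented for illustration and are not LumenVox APIs.

```python
# A hypothetical illustration of an IVR app consuming both structured data and
# raw transcript text; none of these names come from LumenVox's actual APIs.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RecognitionResult:
    raw_text: str                  # free-form transcript the app can manipulate
    intent: Optional[str] = None   # structured interpretation, when one is found
    slots: dict = field(default_factory=dict)  # extracted values, e.g. {"date": "tomorrow"}

def handle_result(result: RecognitionResult) -> None:
    # Act on the structured data when it is available...
    if result.intent == "book_appointment":
        print("Routing caller to scheduling with:", result.slots)
    # ...while keeping the raw text for logging, search, or later analysis.
    print("Raw transcript:", result.raw_text)

handle_result(RecognitionResult(
    raw_text="i need to see the dentist tomorrow morning",
    intent="book_appointment",
    slots={"date": "tomorrow", "time": "morning"},
))
```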
But what might be more impactful for the speech industry as a whole is the work being done with artificial intelligence (AI) and deep neural networks. AI has quickly been making its way into mainstream speech technologies, allowing for more natural, conversational interactions, with machine learning enabling system accuracy and performance to improve as engines process more and more utterances.
This year, speech engine progress was also seen in the emergence of fourth-generation deep neural networks (DNNs). These networks have multiple layers between the input and output, allowing them to model both linear and nonlinear relationships in the data when drawing conclusions.
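For readers unfamiliar with the structure being described, the sketch below shows the basic shape of a deep network: several layers between input and output, each applying a linear transform followed by a nonlinear activation. The layer sizes, random weights, and feature dimensions are illustrative assumptions, not a real speech model.

```python
# A minimal, illustrative deep network: multiple layers between input and
# output, each a linear transform followed by a nonlinearity. Weights are
# random here; a production speech engine would learn them from data.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # The nonlinearity that lets stacked layers capture nonlinear relationships.
    return np.maximum(0.0, x)

# Three hidden layers between a 40-dimensional input (say, acoustic features)
# and a 10-way output (say, a small word or phoneme inventory).
layer_sizes = [40, 128, 128, 128, 10]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(features: np.ndarray) -> np.ndarray:
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                    # hidden layers
    logits = h @ weights[-1] + biases[-1]      # output layer stays linear
    exp = np.exp(logits - logits.max())        # softmax over the output classes
    return exp / exp.sum()

print(forward(rng.standard_normal(40)).round(3))
```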
A leader in this area has been Nuance Communications, which in July unveiled the Nuance Lightning Engine, a DNN that combines voice biometrics and natural language understanding to deliver personalized, humanlike experiences across voice channels.
A Look Ahead
While speech engines have improved in many ways, the underlying technology still has plenty of shortcomings. Systems are significantly better today at recognizing individual words, but what is needed are solutions that understand words in context, according to Stephen Arnold, a former Booz, Allen & Hamilton professional.
Because of this limitation, when voice systems are deployed for business and consumer use, they sometimes do not function well. The user focuses on the end result, like getting results back from a voice search, but systems often do not deliver the level of understanding needed. As a result, 71 percent of Americans would rather interact with a human than a chatbot or other automated process, according to a recent PwC survey.
Vendors are honing their systems to close this gap. Google developed BERT (Bidirectional Encoder Representations from Transformers), a natural language understanding model built to connect words and better comprehend sentence context. For example, if a person is searching for information about traveling to another country, BERT recognizes that a small word like “to” changes the meaning of the query and should not be ignored.
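A small, hedged demonstration of that context sensitivity is possible with the publicly released BERT checkpoint; the snippet below uses the Hugging Face transformers library rather than Google’s search deployment, and the example sentences are invented for illustration.

```python
# A minimal sketch of bidirectional context: the publicly released
# bert-base-uncased model predicts a masked word by reading the words on both
# sides of it, so the surrounding travel context shapes the guess.
from transformers import pipeline  # pip install transformers

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

queries = [
    "a brazilian traveler flying [MASK] the united states needs a visa.",
    "a brazilian traveler returning home [MASK] the united states needs no visa.",
]
for query in queries:
    best = fill_mask(query)[0]  # highest-scoring filler for the blank
    print(f"{query} -> {best['token_str']} (score {best['score']:.2f})")
```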
On a similar note, Translate Your World has been building speech solutions that recognize tone in conversations, and the vendor found that individuals’ tones vary depending on the situation. “The end goal is to guide AI translations so they deliver the right mode for addressing individuals within the context of the conversation,” explains Sue Reager, president of the company. Context matters at the word level, too: “shingles” typically refers to the overlapping tiles on a roof, but in healthcare it refers to a viral infection.
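The word-sense point can be illustrated with a toy lookup keyed on the conversation’s domain; the table and domain labels below are hypothetical, not Translate Your World’s approach.

```python
# A hypothetical sketch of domain-conditioned word senses; real systems would
# learn this from conversational context rather than use a hand-built table.
SENSE_BY_DOMAIN = {
    ("shingles", "construction"): "overlapping tiles on a roof",
    ("shingles", "healthcare"): "a viral infection (herpes zoster)",
}

def resolve_sense(word: str, domain: str) -> str:
    """Pick a word sense using the conversation's domain as context."""
    return SENSE_BY_DOMAIN.get(
        (word.lower(), domain),
        f"no domain-specific sense recorded for '{word}'",
    )

print(resolve_sense("shingles", "construction"))  # the roofing sense
print(resolve_sense("shingles", "healthcare"))    # the medical sense
```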
Communication also varies by group. “We found that consumers typically do not speak as clearly as business executives,” Reager adds. “Consumers’ pronunciation is sometimes not clear, and they often do not use complete sentences. Their thoughts are less organized, so sometimes, it is difficult to find set patterns.”
Training speech engines to recognize and respond appropriately to such distinctions is one challenge that her company, and many others across industry segments, will tackle in 2020.
Who can take on current speech challenges is also changing. “Trying to translate speech is getting more difficult and very expensive,” Arnold maintains. “In the good-old days, a few folks at an MIT machine learning lab were able to build a commercial system. Not anymore.”
Nowadays, speech engine research requires deep pockets, highly skilled data scientists, and enormous data centers replete with vast computational processing power. Consequently, domestic industry behemoths, like Google, Amazon Web Services, Microsoft, and IBM, are taking on much of the work.
Internationally, Chinese companies, such as Baidu, are also attacking the problems. Chinese vendors are in a good position because they are not weighed down by legacy technology and can take a fresh approach to solving these long-standing problems, according to Arnold.
Vendors are also making progress in extending their product capabilities so they support more types of speech, but work remains to help the technology put individual words into context so systems can respond appropriately. As the market has shifted, the development burden has increased dramatically. Moving forward, it seems likely that only deep-pocketed industry giants will have the resources needed to move speech solutions forward so they become more human.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or on Twitter @PaulKorzeniowski.