The Outlook for Deep Neural Networks and Speech Technology
According to Microsoft, the algorithm initially delivered error-rate reductions of as much as 10 to 20 percent while using 30 percent less processing time than speech recognition algorithms based on Gaussian mixture models, the best-of-breed approach at the time. That success laid the groundwork for further improvement by the developers, and technologies from other companies helped fuel DNN-based speech recognition development.
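To put those relative gains in concrete terms, here is a minimal sketch of the arithmetic; the word error rates below are hypothetical, chosen only to illustrate what a 20 percent relative reduction means:

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in word error rate (WER), as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical figures for illustration only: a GMM-based system at 15% WER
# improved to 12% WER by a DNN is a 20% relative error reduction.
print(relative_error_reduction(0.15, 0.12))  # 0.2
```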
“The tools [from Microsoft and others] are all capable of harnessing the power of GPUs, which makes training models feasible in a reasonable amount of time—something that hampered DNN usage in the past. They are also capable of handling enormous amounts of data, as speech engines now are built using thousands of hours of speech,” Ganapathiraju says.
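As a rough illustration of the GPU point, a minimal PyTorch-style training step is sketched below; the framework, the toy model, and the random stand-in data are assumptions made for illustration, not the specific tools Ganapathiraju describes:

```python
import torch
import torch.nn as nn

# Use the GPU when one is available; training falls back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A toy acoustic-model-like network: 40 filterbank features in, 500 output
# classes. Real speech engines are far larger and train on thousands of hours.
model = nn.Sequential(
    nn.Linear(40, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 500),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in data.
features = torch.randn(32, 40, device=device)         # a batch of audio frames
labels = torch.randint(0, 500, (32,), device=device)  # per-frame targets

optimizer.zero_grad()
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
```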
Thomson also credits continued growth in computing power and speed for the superior accuracy of today’s systems over what was available only a few years ago. Neural networks can now handle thousands of computations in a fraction of a second. Additionally, the more data these systems receive, the more accurately they can recognize speech and provide interactive responses.
“While these tools make experimenting and applying DNNs to new problems quite simple, there still is no substitute for extensive real-world training data and carefully designed performance characterization to build a commercially viable and robust DNN-based system,” Ganapathiraju adds.
The improvement that DNNs provide over previous systems is readily evident, according to Ganapathiraju. “DNNs have increased the accuracy of our ASR system over previous technology by an average of 20 percent. In our internal analysis, the gains we see with DNNs are better than what we could get previously with extensive custom-tuning activities, which is a clear advantage for our customers in terms of ROI. In our real-time keyword and phrase analytics engine, our accuracy has also greatly improved, and false positives have been significantly reduced.”
Looking Ahead
Building on the success of its English-language deployments, Interactions plans to expand aggressively in 2017, adding a number of new languages, with the exact ones to be determined by market demand. Additional languages will follow in later years.
The company also expects to add chat- and text-based communications to the Curo Speech platform in the near future.
Given the success he’s seen with DNNs to date, Ganapathiraju expects the technology to advance further in the next two to three years. “We foresee all things speech moving to DNNs in the next two years, including ASR, TTS, keyword spotting, emotion and sentiment analysis, and voice biometrics.”
Some technologists in the financial services industry see voice and other biometric technologies as important tools for customer verification and fraud prevention, a growing concern as more customers move to mobile banking and mobile payments.
Sentiment analysis will enable systems to recognize when a customer is angry, in a hurry, or frustrated, so that the speech-enabled IVR system can respond in a tone, and with words, designed to empathize with the caller’s mood. This, speech technology experts maintain, will enhance the customer experience and improve customer relationships.
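One way that mood-aware routing might look is sketched below. The emotion labels and response styles are hypothetical stand-ins; in a real system, the detected emotion would come from a DNN classifier over acoustic and lexical features rather than being passed in directly:

```python
# Hypothetical mapping from a detected caller emotion to an IVR response style.
RESPONSE_STYLES = {
    "angry": "apologetic wording, slower pace, offer a live agent",
    "hurried": "concise prompts, skip optional confirmations",
    "frustrated": "empathetic acknowledgment, simplified menu",
    "neutral": "standard prompts",
}

def choose_response_style(detected_emotion: str) -> str:
    """Pick a response style for the detected emotion, defaulting to neutral."""
    return RESPONSE_STYLES.get(detected_emotion, RESPONSE_STYLES["neutral"])

# In production, detected_emotion would come from the sentiment engine.
print(choose_response_style("angry"))
```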
Suleman says he expects DNNs to continue leveraging machine learning and other technology advances to better mimic human reasoning, yielding more human-sounding responses from speech-enabled IVR systems.
Speech technology experts also expect DNNs to expand into multimodal capability: combining vision, speech, other audio, and text inputs to help machines make decisions.
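A common pattern for that kind of fusion is to embed each modality separately and concatenate the embeddings ahead of a shared decision layer. The sketch below assumes PyTorch and illustrative feature sizes; it is not any vendor’s actual architecture:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Embed each modality, concatenate, then make a joint decision."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Illustrative encoders; real systems would use CNNs for vision,
        # acoustic models for speech/audio, and text encoders for language.
        self.vision = nn.Linear(2048, 128)   # e.g., image features
        self.audio = nn.Linear(40, 128)      # e.g., speech/audio features
        self.text = nn.Linear(300, 128)      # e.g., word embeddings
        self.decide = nn.Linear(128 * 3, num_classes)

    def forward(self, image_feats, audio_feats, text_feats):
        fused = torch.cat([
            self.vision(image_feats),
            self.audio(audio_feats),
            self.text(text_feats),
        ], dim=-1)
        return self.decide(fused)

model = MultimodalFusion()
logits = model(torch.randn(1, 2048), torch.randn(1, 40), torch.randn(1, 300))
```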
Phillip Britt is a freelance writer who focuses on high-tech, financial services, and other industries; he can be reached at spenterprises@wowway.com.