The Outlook for Deep Neural Networks and Speech Technology
However, the deployment of DNN-enabled speech technology isn’t a simple process, Thomson cautions. To work right, the different DNN layers must be trained to properly recognize the different elements of speech.
“Just because something is fairly new doesn’t mean that it will be better than what was used before; you have to make sure that it has the right settings and that it is working right,” Thomson says.
DNNs have to be trained, which isn’t a simple process, according to Thomson. “Even the best system has to be trained on a vast amount of speech data.”
Technology Advances Enhance Development
Every day, more of this data becomes available. With enough data, as well as the processing power that has become available in the past few years, systems are able to properly recognize, understand, and respond at speeds that they couldn’t reach previously. Speed in recognition is essential to avoid latency in working with IVR systems. Today a DNN-based system responds with no noticeable delay compared with one using older technologies.
Thomson blames the training challenge for the lack of progress in DNNs until about 2010, when a combination of factors came together to enable these systems to start showing enough promise to encourage accelerated development and some initial usage.
“We had to figure out all the tricks we needed in order to make these work. We moved to some new methods—mathematical methods—in our training,” Thomson says. The mathematical methods, which are also used in video games, are much faster and can handle much more data than earlier models; therefore, there’s faster, more accurate speech recognition. The introduction of ever-faster GPUs further enhanced DNN and speech recognition development.
As a result of enhanced training and development methods, one of the major challenges with speech-based IVR systems, the word error rate, started plummeting, according to Kaheer Suleman, cofounder and chief technology officer for Maluuba.
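For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the correct transcript, divided by the number of words in the correct transcript. The following is a minimal sketch of that standard calculation, not the method used by any vendor quoted here; the function name and the sample phrases are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / words in reference.

    Illustrative helper: splits on whitespace and compares words exactly,
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word was missed
                curr[j - 1] + 1,     # insertion: extra word was recognized
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical caller utterance and recognizer output.
print(word_error_rate("check my account balance", "check the account"))
```

In this made-up example the recognizer substitutes one word and drops another, so two of the four reference words are wrong. A lower figure means the caller is understood more often.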
Prior to the introduction of deep neural networks, speech recognition relied on Gaussian mixture models, a type of probability modeling that for a few decades was state of the art. Gaussian mixture models attempted to determine what someone was saying by using a series of bell curve models to identify each sound (such as the “n” in “nurse”). The more levels of modeling that were added, the better the accuracy of the system. But these models were limited in how accurately they could recognize speech, and their frequent errors frustrated callers and discouraged use of these IVRs until the systems could be improved.
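The bell curve idea can be made concrete with a toy sketch. Each sound is modeled as a weighted sum of Gaussian (bell) curves over an acoustic measurement, and an incoming measurement is assigned to the sound whose mixture gives it the highest likelihood. Everything specific below is an assumption for illustration: the single one-dimensional feature, the two sounds, and all weights, means, and standard deviations are invented, whereas real recognizers work with multi-dimensional features and parameters learned from data.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Height of a single bell curve at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def mixture_likelihood(x: float, components) -> float:
    """Weighted sum of bell curves; components are (weight, mean, std)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


# Hypothetical per-sound mixtures over one made-up acoustic feature.
# The weights within each mixture sum to 1.
SOUND_MODELS = {
    "n": [(0.6, 1.0, 0.5), (0.4, 2.0, 0.8)],
    "m": [(0.5, 4.0, 0.6), (0.5, 5.5, 1.0)],
}


def most_likely_sound(x: float, models=SOUND_MODELS) -> str:
    """Pick the sound whose mixture assigns x the highest likelihood."""
    return max(models, key=lambda sound: mixture_likelihood(x, models[sound]))


print(most_likely_sound(1.2))
print(most_likely_sound(5.0))
```

With these invented numbers, a measurement near 1.2 falls under the curves for "n" and one near 5.0 falls under the curves for "m". Adding more curves per sound lets a mixture fit more varied pronunciations, which is one way to read the article's point that more modeling improved accuracy.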
“The problem is that if the caller is unhappy, then you can lose their business,” Thomson says. These kinds of issues kept some companies from adding speech recognition capabilities and led a few to revert to older touchtone-only IVR models.
Neural networks, particularly deep neural networks, however, are much better at correctly recognizing elements of speech, natural language usage, and even some thick accents and regional dialects, according to Thomson. Suleman adds that DNNs, unlike the Gaussian models, can also work in nonlinear sequences, a capability that improves both speed and accuracy. The result has been the development of the more familiar consumer-based applications, and the beginnings of enterprise-based applications that are far better than the systems of just a couple of years ago.
Much Better Accuracy
Estimates of improvement in accuracy compared to older speech recognition methods range as high as 95 percent, with that figure continuing to improve as these systems continue to evolve.
“The capabilities have been limited so far, but it’s getting more humanlike and taking over roles that you’ve previously needed a human for. The big thing that has changed is the improved accuracy. Now we’re able to do more than just transcribe speech; now we are able to understand deeper meaning,” Thomson says. So the DNN-enabled systems will quickly recognize synonyms, many colloquialisms, and speech patterns that don’t fit into the strict rules-based scripts of older systems.