Why We Don't Have HAL 9000...Yet
In early May, Robert Fortner published a provocative yet poorly informed blog post titled “Rest in Peas: the Unrecognized Death of Speech Recognition,” in which he argues that speech recognition is dead because its accuracy falls far short of HAL-like levels of comprehension. Here is a summary of my response to his claim:
Saying that speech recognition is dead because its accuracy falls far short of HAL-like levels of comprehension is like saying aeronautical engineering is dead because commercial airplanes cannot fly faster than 1,000 miles per hour or get people to the moon. Every technology has limitations, but the major limitations we perceive result from our false assumptions about a technology and our misuse of it. Speech recognition is not about building HAL 9000, the supercomputer in “2001: A Space Odyssey.” Speech recognition is about building tools, and, like all tools, they can be imperfect. Our job is to find a good use for an imperfect, often crummy, tool that can make our lives easier.
Fortner asserts, “The accuracy of speech recognition flatlined in 2001 before reaching human levels…[and the] funding plug was pulled.” It is true that in 2001 some funding plugs were pulled, mainly Defense Advanced Research Projects Agency (DARPA) funds for interactive speech recognition projects devoted to dialogue systems. But what really happened was that 9/11 shifted the attention of the funding agencies. DARPA started quite a large project called Global Autonomous Language Exploitation (GALE), whose main goal is to interpret huge volumes of speech and text in multiple languages, mostly for homeland security purposes. In recent years, several industry events and dozens of specialized workshops have continued to attract thousands of speech recognition researchers around the world. Other nontraditional uses of speech technology have emerged, such as emotion detection, speaker segmentation, summarization, speech-to-speech translation, and even the recognition of deception through speech analysis. All of this conveys the strong message that speech recognition is not dead.
Frustration with speech recognition accuracy, or the lack of it, peaks when we interact with its commercial realizations: dictation and interactive voice response (IVR). Still, many people are happy with automated dictation and have been using it for years. It is also true that many people tried dictation and it did not work for them, but most likely they were not motivated to use it. If you have a physical disability or need to dictate thousands of words every day, then most likely dictation will work for you. Or, better, you will learn to make it work.
Making Life Better
IVR systems, as we all know, have a bad reputation reinforced by “Saturday Night Live” skits and GetHuman.com. But are these systems so bad? After all, thanks to speech recognition, hundreds of thousands of people can get real-time flight information, make train reservations, manage bank accounts, and even fix their Internet or cable TV just by making a call. Yes, occasionally these systems make irritating mistakes, but they are useful. They are tools, and tools can be useful when used properly and when truly needed.
Fortner also states, “In 2001, recognition accuracy topped out at 80 percent, far short of HAL-like levels of comprehension.” Besides the fact that speech recognition accuracy is meaningless outside of a well-defined context, speech recognition can get well above 80 percent in specific contexts. We all know that data-driven tuning of deployed speech recognizers can push accuracy into the 90s on average, and into the high 90s in constrained tasks, with the few remaining mistakes gracefully handled by a well-designed voice user interface. We have also noticed that many of these mistakes are due to users not responding properly to prompts. Again, speech recognition is a tool offered to people to make their lives easier, and tools should be used properly.
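To make the accuracy numbers above concrete: in speech recognition research, accuracy is conventionally reported as word accuracy, one minus the word error rate (WER), computed by aligning the recognizer's hypothesis against a reference transcript with word-level edit distance. A minimal sketch (the function name and the example utterances are illustrative, not taken from the article or from any deployed system):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match or substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five: WER = 0.2, i.e., 80 percent word accuracy.
wer = word_error_rate("call my office at nine", "call my office at five")
print(f"word accuracy: {(1 - wer) * 100:.0f}%")  # prints "word accuracy: 80%"
```

This is why a bare "80 percent" figure is meaningless without context: the same recognizer can score very differently depending on vocabulary size, noise, and how constrained the task is.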
As demonstrated by the popularity of the recent AVIOS conference, speech recognition has moved beyond the traditional telephone to become a smartphone input modality, what we call mobile voice. This new trend is receiving a lot of attention. The co-evolution of the products and their motivated users will make speech recognition a transparent technology: one we use every day without being aware of it, like the telephone, the mouse, the Internet, and everything else that makes our lives easier. And if we don't give up teaching and fostering speech recognition, some smart kid at a university somewhere will figure out a smarter recipe for it. Then maybe we will have a HAL-like speech recognizer, or at least something closer to it than what we have today.
Roberto Pieraccini is chief technology officer at SpeechCycle. Additional responses to the original blog post can be found on his blog at http://robertopieraccini.blogspot.com.