The Holy Grail of Speech
Speech transcription is my passion. I love all of the ways that speech technology plays in the lab and in business, such as speech self-service and mobile search. But speech transcription, for me, is the pinnacle—the Holy Grail that we have aspired to from the beginning. With transcription, anything that is said can be converted to text. Speech self-service and other dialogue systems have an easier task. Users are guided and manipulated to say certain things in certain ways by clever dialogue management designs. The result is a satisfying customer experience, but the speech recognition component is benefiting from a number of crutches, such as limited domain and vocabulary and the guiding hand of good dialogue designers.
Speech self-service enjoys special status in the speech application arena. It is already deployed broadly and successfully and embraced more enthusiastically than touchtone services. Well-designed speech self-service systems seem to understand any speaker, irrespective of dialects, accents, and head colds. Users of these systems have said that they think speech recognition has achieved human accuracy. When speech self-service systems are well-designed, unsophisticated users don’t even know they are being guided, and speech recognition enjoys a boost in user perception.
But the speech recognition problem is not yet solved. According to Victor Zue, a professor at MIT, working in this field probably guarantees lifelong employment as we inch closer and closer to the final destination. It is not yet solved because speech recognition still requires tuning on particular domains to maximize performance. Speech recognition also requires tuning on particular channels, such as broadband, cell phone, and Voice over Internet Protocol. New languages are not easily added to the repertoire, either. Each new dimension—language, domain, or channel—can be created through speech data collection, transcription, and the art and science of cranking the data through the algorithms. The process is time-consuming, tedious, and expensive.
To date, the focus has been on a few key business areas, including digit recognition, broadcast news, and telephone conversation transcription, often thanks to government funding to address specific areas of need. Clearly many other application domains remain unaddressed.
So here lies one of the many Catch-22s of speech recognition. To blanket the world with the wonders of speech recognition, we need time and investment. In an emerging market, the return on investment is still evolving. Who will provide the needed investment? If a small country representing a small language group opts to deploy speech transcription services, who will underwrite the up-front development costs?
A Partner Approach
Our team at IBM Research has adopted an approach to broaden the market through a network of partnerships with industry and universities. This presents a range of advantages and opportunities. Speech technology gets molded, stretched, and used in new application environments. The base technology improves as we develop a deeper understanding of the demands on speech in diverse applications. Users are exposed to an even broader array of speech applications, which increases their familiarity with and, hopefully, their affection for the technology. That, in turn, makes it easier to introduce new speech applications and services that will be enthusiastically adopted.
With thousands of languages in the world, and a constantly evolving array of channels and service needs, we need to adopt a collaborative approach rather than doing it alone. Through open collaboration, partners get early access to the most advanced technology, companies establish footholds in emerging speech markets, and the technology advances more rapidly.
The surface of the transcription opportunity has hardly been scratched. Unmet opportunities for speech transcription await in education, accessibility, the medical and legal fields, media and call center analytics, and other areas. By working with partners, we can customize and tune transcription to better meet these challenges and start to address market demands.
Once we have attained the Holy Grail of transcription, other areas requiring speech recognition will improve as a collateral benefit, as technology becomes better at understanding anything a user says. If we begin a speech application by at least trying to be flexible and to deal with what the user wants to say, rather than imposing an unnatural interaction that we try to control, users will see that spoken language with machines approximates the advantages of spoken language with people. If we attack transcription, the toughest problem, other application challenges will also fall into place.
Sara Basson, Ph.D., is program director of speech transcription strategy at IBM Research. She can be reached at sbasson@us.ibm.com.