Innovations - Speech Technology with Impact: Innovative Research in the Labs Part II - IBM Research
It might come as a surprise that for over half of IBM's 115-year history, they have founded and developed research labs to do basic scientific research into the foundations of information technology. Since 1945, IBM Research has grown to eight labs around the world, has produced five Nobel Prize winners, and currently employs 3,000 people working in computer, mathematical, materials, and services science, chemistry, physics, and electrical engineering. Equally surprising is that they have been doing speech technology research for over 45 years - longer than most of the speech industry has even existed. Today, they employ over 100 people worldwide in speech, with several hundred others complementing their efforts.
Mission and Focus
The mission that IBM Research has laid out is simple: deliver business value around the user experience and automation. They do this by creating technology for use in IBM software products such as WebSphere® Voice Server and Embedded ViaVoice, by building customer solutions, and by working with customers to deliver those solutions. Their five areas of focus are contact centers, mobile (embedded) speech, accessibility, security, and speech analytics. Core technologies include speech recognition (ASR), text-to-speech (TTS), dialog management, semantic interpretation, multimodal interaction, speaker verification, and conversational biometrics.
Innovation
IBM is certainly at the forefront of what this column is about - innovation. Whether research results pay off in the near term or take decades to develop, they innovate in new areas and push the envelope in existing ones. For example, they have done sizeable research in ASR, TTS, machine translation, and semantic interpretation. In their MASTOR speech-to-speech translation project, they combine all four to allow real-time communication between speakers of two different languages. This lets them see how these separate technologies work in concert and optimize the combined system end to end. Of even greater interest is the complexity added beyond simple translation. For example, since the output is spoken rather than written, they have to map the intonation of what the first speaker said in one language onto the intonation the listener hears in the second - not an easy task.
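To make the idea concrete, here is a toy sketch of how such a speech-to-speech pipeline might chain the four components together and carry intonation through to the output. The function names, the Utterance structure, and the canned phrase table are illustrative assumptions for this column, not IBM's MASTOR interfaces.

```python
# Toy sketch of a speech-to-speech translation pipeline in the spirit of the
# project described above: ASR -> semantic interpretation -> machine
# translation -> TTS, with the source intonation carried through to the output.
# Every name here is a hypothetical placeholder, not IBM's MASTOR code.

from dataclasses import dataclass


@dataclass
class Utterance:
    text: str        # recognized or translated words
    intonation: str  # coarse prosody label, e.g. "question" or "statement"


def recognize(audio: bytes) -> Utterance:
    """ASR stand-in: pretend the audio decoded to a fixed English question."""
    return Utterance(text="where is the hospital", intonation="question")


def interpret(utt: Utterance) -> dict:
    """Semantic interpretation stand-in: reduce the text to an intent frame."""
    return {"intent": "request_location", "object": "hospital",
            "intonation": utt.intonation}


def translate(meaning: dict, target_language: str) -> Utterance:
    """MT stand-in: render the intent frame in the target language."""
    phrases = {("request_location", "hospital", "es"): "¿dónde está el hospital?"}
    text = phrases[(meaning["intent"], meaning["object"], target_language)]
    return Utterance(text=text, intonation=meaning["intonation"])


def synthesize(utt: Utterance) -> bytes:
    """TTS stand-in: 'speak' the text, tagged with the mapped intonation."""
    return f"[{utt.intonation}] {utt.text}".encode("utf-8")


def speech_to_speech(audio: bytes, target_language: str) -> bytes:
    """End-to-end pipeline combining all four components."""
    return synthesize(translate(interpret(recognize(audio)), target_language))


print(speech_to_speech(b"<audio>", "es"))  # [question] ¿dónde está el hospital?
```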
IBM has also pushed the envelope in improving ASR accuracy. Their Super Human Speech Recognition Project, launched in 2001, has the goal of making an automated agent's recognition accuracy comparable to a live agent's by the end of the decade. IBM likened this to having the recognizer accurately interact with a person who has an accent and a cold and is talking to a contact center while on a cell phone in a convertible passing a big rig truck. Are these people nuts? Four years into the project, they are on track to reach that goal, but admittedly, not without a lot of effort.
Other innovative areas are emotion detection and expressive speech. For example, they have made vast improvements to their TTS through phrase splicing, whereby they splice together longer segments of recorded speech to create phrases that were never recorded as a whole but still sound natural. They are also working on making TTS more expressive by incorporating the kinds of emotion a real speaker would use - for instance, delivering written news broadcasts with a happy or sad demeanor, something that currently isn't done well in an unlimited domain.
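As a rough illustration of the splicing idea, the toy sketch below assembles a prompt that was never recorded as a whole out of named, pre-recorded pieces. The segments are plain strings purely for illustration, and the segment names are invented for this example; a real system splices audio waveforms and smooths the joins.

```python
# Minimal sketch of phrase splicing for prompts: pre-recorded segments are
# concatenated so that an unrecorded sentence can still be played back from
# natural-sounding pieces. Strings stand in for audio recordings here.

RECORDED_SEGMENTS = {
    "balance_intro": "Your account balance is",
    "dollars": "dollars and",
    "cents": "cents.",
}


def speak_number(number: int) -> str:
    """Stand-in for per-number recordings (or a TTS fallback for the gaps)."""
    return str(number)


def splice_balance_prompt(dollars: int, cents: int) -> str:
    """Build a phrase that was never recorded as a whole from recorded parts."""
    parts = [
        RECORDED_SEGMENTS["balance_intro"],
        speak_number(dollars),
        RECORDED_SEGMENTS["dollars"],
        speak_number(cents),
        RECORDED_SEGMENTS["cents"],
    ]
    return " ".join(parts)


print(splice_balance_prompt(42, 7))
# Your account balance is 42 dollars and 7 cents.
```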
Similarly, IBM's Free Form Interaction dialog management improves on natural language interaction. It lets users of self-service applications provide incomplete information, change their intention mid-stream, get help, and drive the conversation to complete a task. The result is a system that conforms to the user, not the other way around. Additionally, they are working on speaker verification and conversational biometrics, tackling issues such as false acceptance of users in security applications and verifying continuously throughout the interaction to give ongoing assurance that the person speaking is still the initial caller.
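One way to picture that kind of flexibility is a simple mixed-initiative, slot-filling loop that accepts whatever pieces of information the caller volunteers, in any order, and asks only for what is still missing. The slot names, keyword extractor, and dialog flow below are a generic illustration of the technique, not IBM's Free Form Interaction engine.

```python
# Generic slot-filling sketch of a mixed-initiative dialog: the caller can give
# any subset of the required information in one turn, and the system prompts
# only for what is still missing. An illustration of the general technique,
# not IBM's Free Form Interaction implementation.

REQUIRED_SLOTS = ("origin", "destination", "date")


def extract_slots(user_turn: str) -> dict:
    """Toy semantic interpreter: pull slot values out of keyword patterns."""
    slots = {}
    words = user_turn.lower().split()
    for keyword, slot in (("from", "origin"), ("to", "destination"), ("on", "date")):
        if keyword in words:
            slots[slot] = words[words.index(keyword) + 1]
    return slots


def dialog(turns):
    """Run through the caller's turns, prompting only for missing slots."""
    state = {}
    missing = list(REQUIRED_SLOTS)
    for turn in turns:
        state.update(extract_slots(turn))  # caller may fill several slots at once
        missing = [s for s in REQUIRED_SLOTS if s not in state]
        if not missing:
            return (f"Booking a trip from {state['origin']} "
                    f"to {state['destination']} on {state['date']}.")
        print(f"System: What is your {missing[0]}?")
    return "Incomplete: still need " + ", ".join(missing)


# The caller drives the conversation: two slots in the first turn, one later.
print(dialog(["from boston to denver", "on friday please"]))
# System: What is your date?
# Booking a trip from boston to denver on friday.
```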
A Final Word
It is refreshing to see the variety and depth of research being done at IBM. From improving self-service applications in the contact center and advancing machine translation and human communication to strengthening the core technologies themselves, IBM is vastly improving the quality of speech technologies. Interested in more? Go to www.research.ibm.com.
Nancy Jamison is the principal analyst at Jamison Consulting. She can be reached at nsj@jamisons.com.