Innovations: Speech Technology with Impact - What Is Going on in the Labs?
The focus of the next few issues of this column will be less on innovative new products and more on what is going on behind closed (or open) doors in research and development (R&D) labs. Speech technologies in particular have a long academic research history, highlighted by the exceedingly long time, perhaps 30 to 40 years, that it took for initial research to bear fruit as viable commercial deployments. In fact, numerous product prototypes, such as voice-activated telephones and toys, did not become commercially available until decades after they were first invented. One only needs to read about Radio Rex (1911), the voice-command toy dog, or the AT&T "Answer Please Phone" (1984) to see examples of research that worked but, for whatever reason, wasn't deployed right away. However, we will save the history of speech technology development for an upcoming issue of the magazine. In this column we will look at some of the things being worked on in the R&D labs of universities and companies.
Microsoft Corporation
It is probably true that when most people read or hear about something coming out of the R&D group of a company, as opposed to a university, they think of an organization headed up by an engineer with a number of people who work on projects in a lab. Not known for doing things on a small scale, Microsoft is a little different. In 1991, Microsoft created the Microsoft Research Organization, becoming one of the first software companies to create its own computer science research organization, with the goal of supporting long-term research both for its own sake and to feed the development of new products. Currently, Microsoft has over 700 people in its research arm, working in more than 50 areas, many of them linguistically oriented, including speech recognition, natural language processing, text-to-speech, and language conversion. Following an open academic model, many Microsoft research associates maintain their academic ties and continue to collaborate with the research community through conference attendance and participation, committee work, and the publication of papers for peer review. Associates also work closely with product development groups to transfer research technology into Microsoft products. The two research groups focused on speech are based in Redmond, Washington, and Beijing, China.
Speech Projects
The stated goal of the research speech group is "to build applications that make computers available everywhere, and work with the speech platform product group to make this vision a reality. We are interested not only in creating state-of-the-art spoken language components, but also in how these disparate components can come together with other modes of human-computer interaction to form a unified, consistent computing environment." As such, the group covers all areas of speech, including, naturally, core work on improving speech recognition and text-to-speech in areas such as accuracy and grammars, plus a special focus on application areas such as telephony and call centers. Additionally, it has projects that delve deep into natural language recognition, language modeling, and aspects of the technology that drive the usefulness of speech, such as improving user interfaces no matter which device is used. Some of the projects that have arisen out of the research group and been fed through the speech platforms team into products, or into products under development, include:
- Multimodal Interactive Pad (MiPad), a multimodal interactive notepad prototype
- Speech Recognition (Whisper)
- Text-to-Speech (Whistler)
- Speaker Identification (Whisper ID)
- Speech Application Programming Interface (SAPI), an interface and developer toolkit
- Speech Application Language Tags (SALT), a markup language for the multimodal Web
A critical area of research for the speech group involves multimodal user interfaces, research that started with MiPad and continues with projects that include:
- Noise Robustness to improve system accuracy when background noise is present
- Acoustic Modeling to determine how best to model phones and acoustic variations
- Language Modeling to predict which words are likely to be spoken so that the recognizer can make the best choice when the acoustics alone cannot decide
- Automatic Grammar Induction to understand how to create grammars to ease the development of spoken language systems
- Multimodal Conversational User Interface
- Personalized Language Models for improved accuracy
Speech-enabled Agents
An important focus of speech for Microsoft is its incorporation into telephony applications such as phones and call centers, as well as into personal interaction with a user. Speech-enabled agents combine human-like understanding and interaction with the application, so that the application can be queried and used the way you would interact with a live person. The goal of the agent is both to accept speech input and to understand what the person is asking, and then to act upon it. So whether it is a personal virtual assistant looking for information or making transactions for a single person, or an agent acting as a go-between in an application such as instant messaging, the agent will understand the request and act upon it.
Audio Information Management and Extraction (AIME)
One speech research project that moves beyond core speech technologies is AIME. Based on statistical pattern-matching technology, along with a special "phonetic" speech recognition technique that can handle unusual words such as uncommon names or technical terminology, AIME seeks to make computers smart about speech and audio recordings. The end goal is a search engine that can mine recorded conversations, whether as audio or in print, for information. AIME will be able to search through content from voicemails, presentations, lectures, meetings, teleconferences, and broadcast news programs, for example, and gain a better understanding of the structure of conversations. Users will be able to find out, for instance, who was speaking and when, when new topics were introduced, and which topics were important.
Language Modeling
Another interesting area is the development of language models. A language model helps the speech recognizer choose the right word sequence when two different sentences sound the same, a decision that the acoustics alone cannot settle. Microsoft Research has language modeling projects in:
- Language Model Adaptation - Allows for the adaptation of a general-domain statistical language model to a new domain/user despite having limited amounts of sample data from the new domain/user. This allows for more effective use of a language model being transferred from one domain to another.
- Incorporation of Syntactic Constraints in a Statistical Language Model - Adding syntactic constraints to a statistical language model to reduce the word error rate or improve speech and language understanding.
- Speech Utterance Classification - Research into technology that assigns a speech utterance to one of a limited set of classes. For example, this could improve call routing in a call center by allowing more free-form input from callers about why they are calling.
- Language Modeling for Other Applications - Extending language modeling into other fields, such as handwriting recognition or spelling correction, to help resolve ambiguity in the input.
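To make the idea of a language model choosing between sound-alike sentences a little more concrete, here is a minimal Python sketch. It is not Microsoft's method. It assumes a toy add-one smoothed bigram model, two invented corpora (GENERAL and DOMAIN), and a hypothetical interpolation weight of 0.5. Linear interpolation is used here simply as one common, simple way to adapt a general model with a small amount of in-domain text.

```python
import math
from collections import Counter

# Toy corpora, invented for illustration only.
GENERAL = [
    "please write a letter",
    "i want to write a note",
    "write a letter to me",
    "i will write to you",
    "please write here",
]
DOMAIN = [  # a tiny "navigation" sample standing in for limited in-domain data
    "turn right at the light",
    "take a right here",
]


class BigramModel:
    """Add-one smoothed bigram model: a toy stand-in for a real language model."""

    def __init__(self, sentences, vocab):
        self.vocab_size = len(vocab) + 1  # +1 for the end-of-sentence token
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def prob(self, prev, word):
        """P(word | prev) with add-one smoothing."""
        return (self.bigram_counts[(prev, word)] + 1) / (
            self.context_counts[prev] + self.vocab_size
        )


def sentence_logprob(prob_fn, sentence):
    """Sum of log bigram probabilities over the whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(prob_fn(p, w)) for p, w in zip(tokens, tokens[1:]))


def best_hypothesis(prob_fn, hypotheses):
    """Pick the hypothesis the language model finds most likely."""
    return max(hypotheses, key=lambda s: sentence_logprob(prob_fn, s))


def interpolate(general, domain, weight):
    """Linear interpolation; `weight` is the share given to the in-domain model."""
    return lambda p, w: weight * domain.prob(p, w) + (1 - weight) * general.prob(p, w)


vocab = {w for s in GENERAL + DOMAIN for w in s.split()}
general = BigramModel(GENERAL, vocab)
domain = BigramModel(DOMAIN, vocab)
adapted_prob = interpolate(general, domain, weight=0.5)  # 0.5 is an assumption

# Each pair sounds the same; only the language model can separate them.
letter_hyps = ["please right a letter", "please write a letter"]
nav_hyps = ["take a write here", "take a right here"]

print(best_hypothesis(general.prob, letter_hyps))
print(best_hypothesis(general.prob, nav_hyps))
print(best_hypothesis(adapted_prob, nav_hyps))
```

With this toy data, the general model prefers "write" in both cases, which is wrong for the navigation request. The interpolated model picks "take a right here" after seeing only two in-domain sentences. A production recognizer would combine such scores with acoustic scores and use far larger corpora and better smoothing.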
In Closing
With the breadth of its research organization, Microsoft is well suited to cover all aspects of speech technologies, whether that means improving caller interaction with an IVR application, providing service through a virtual agent in a call center, or simply making it easier for computer users to use speech for simple tasks such as dictation or more complex ones such as language learning. However, the more interesting things yet to come will result from Microsoft's longer-term goal of linking together all possible computing environments and applications and using speech to enhance the efficacy and user experience of those applications. I expect we will see some interesting developments from Microsoft in the near term.
Have a cool or noteworthy announcement or special interest story emerging from R&D? Please email me at nsj@jamisons.com