
Michael Cohen, Co-Founder, Nuance Communications

NewsBlast What is the history of speech technology and what brought us to where we are today? What are the biggest bottlenecks companies now struggle with?

Dr. Michael Cohen It's accurate to say it all started with a toy dog named Rex. Radio Rex, developed in 1911, was the first automated speech recognition system - a celluloid dog that jumped out of his doghouse upon hearing the sound of his name. By the 1950s, more ambitious speech recognition research was under way, and by the 1970s two different research paradigms had emerged. One camp focused on building systems that explicitly modeled linguistic knowledge at numerous levels. The other - composed mostly of engineers - focused on applying statistical pattern recognition techniques, building models based on observing many examples of spoken words or sentences. For a while, the two camps were at odds, and throughout much of the 1970s the engineers seemed to be winning the battle. By the early 1980s, the two research communities were working together, creating a statistical framework that mirrored certain well-understood linguistic structures, resulting in dramatic advances. The 1980s were an exciting time in speech research - the state of the art moved from isolated-word recognition to continuous speech recognition, and from speaker-dependent to speaker-independent recognition, thanks to a strong focus across the research community on improving recognition accuracy.

Another important advance began in the late 1980s. DARPA (the Defense Advanced Research Projects Agency), which funded much of the U.S. research in speech recognition and natural language understanding, brought those two research communities together to work on spoken language understanding. Many natural language researchers switched from working on text input to speech input, and a number of research centers built dialog systems for an application called the Air Travel Information System (ATIS). The results were dramatic - after a few years of work, systems could recognize and understand complex requests such as, "Tell me the flights from New York to Boston next Tuesday morning," and return a result after accessing a database. For me personally, my role as a Principal Investigator on the ATIS program at SRI is what ultimately motivated me to help found Nuance with three colleagues - for the first time, I could see the near-term practical value of speech technology.

As we began to deploy commercial applications, the next important challenge became clear - the design of the voice user interface (VUI). The best technology cannot compensate for poorly designed dialog strategies or prompts. That remains the biggest challenge today - we need to advance our basic understanding, train many more practitioners, and make the process more efficient.

NB How is VUI design both an art and a science?


MC Voice User Interface (VUI) design combines elements of art, science, and engineering. Artistically, it is a design activity demanding a sense of aesthetics. For example, designing the persona for an application and crafting the prompts draw on some of the same writing skills a playwright uses. More importantly, the designer must work with the same sense of unity and consistency that drives artistic endeavors.

It's a science in the sense that success requires an understanding of basic human cognitive capabilities and linguistic behaviors. First, you are dealing with a challenging cognitive situation for end-users, so you need some understanding of your users' cognitive limits: What is the maximum amount of information they can process at one time? What concepts are easiest for them to grasp quickly? What is the easiest way to navigate a list of options (e.g., a list of flights when booking travel) so that it corresponds with users' expectations? What will best fit a user's existing underlying "mental model," so he/she does not have to learn or remember rules or functions to use the application? Second, because the application is a spoken language system, it should be built on the existing rules of conversation. When people engage in conversation, all kinds of underlying assumptions and expectations quietly play out: turn-taking, expressions, pausing. By understanding and employing these subtle rules of engagement, you make it easy and intuitive for the end-user to succeed with the system.

It's engineering in the sense that we can use concrete measures of success and a well-defined methodology for creating systems. Each step of the way has appropriate methods for gathering relevant data and testing results and assumptions.


NB When does a persona enter the VUI development process? What are some of the things to consider when designing a persona?

MC Think of persona creation as the opportunity to design the "ideal employee" - with the right voice, the right personality traits, the right mood, and the right way of handling customer needs and problems. That image can be presented reliably, phone call after phone call. It can become the personification of the company's brand.


Some companies have not yet realized the opportunity that explicit persona design can offer them. The truth is, as we listen to someone speak, we make judgments about numerous attributes: the speaker's emotional state, social class, trustworthiness, friendliness, appearance, etc. We've all had the experience of hearing a radio DJ, or conversing with someone over the phone, and forming a mental image of that person. Often, when you finally meet, he/she looks nothing like what you had pictured.


Companies can take advantage of these "voice-based perceptions" to reinforce company brand and create solid relationships with customers. What we see is that some companies care a great deal about branding and invest a lot of time and energy into getting the persona "just right." In other cases, companies just don't want to make the investment. But even if they don't want to commit to a significant persona investment, it's worthwhile for a company to identify the system persona in advance of design, in order to assure consistency throughout the interface.


Aside from branding and image, there are usability issues to consider when designing a persona. For instance, a VAD (voice-activated dialer) persona should be direct and ready to take quick action rather than "chatty." Usability and end-user goals should be aligned; the persona should facilitate and reinforce end-user objectives.

NB Many people think the goal of conversational interfaces and VUI is to lead users into thinking they are dealing with a live person on the other end of the phone. What is your take on this?


MC There has been a lot of discussion suggesting that the goal of both persona design and the use of advanced natural language understanding technologies (i.e., those enabling "open-prompt" systems) is to lead people to think they are actually talking with a real person. We disagree. Current technology is far from having the capabilities of a live human, and you never want to mislead users about the capabilities of a system - that will lead them to form less effective mental models for interacting with the application.

The reason we design personas and use prompting approaches based on human-human conversation is to better meet end-user goals by making the user interface more intuitive, and to better meet business and branding objectives. For example, by using conversational discourse markers (little words at the beginning of a sentence such as "next," "however," and "actually"), we give the caller natural cues and help the interaction flow smoothly. A little word such as "however" gives the caller a preview of what will come next and makes the interaction more comprehensible.
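As a concrete illustration (a sketch for this article, not Nuance's implementation - the prompts and markers are invented, and printing stands in for prompt playback), a sequence of prompts might be prefixed with discourse markers like this:

    # Sketch: prefix consecutive prompts with discourse markers so the
    # caller hears where they are in the task and what comes next.
    # All prompt text is invented for illustration.

    PROMPTS = [
        "please say your account number",
        "say your four-digit PIN",
        "say the amount you want to transfer",
    ]
    MARKERS = ["First,", "Next,", "Finally,"]

    for marker, prompt in zip(MARKERS, PROMPTS):
        # A real system would send this string to a TTS engine or map
        # it to a pre-recorded prompt; print() stands in for playback.
        print(f"{marker} {prompt}.")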


When we choose to use advanced natural language technology, we do it to better meet application needs and user needs. In some cases, advanced natural language technology is appropriate, and in other cases, it is not. The bottom line is: make technology and design choices to maximize end-user and business value. 

NB How will the principles of VUI design change and adapt as technology evolves?

MC There are some fundamental principles that will remain constant. For one, a person's cognitive capabilities will not appreciably change: people will not suddenly be able to handle menus of 50 items if today they can only process three or four. And the basic rules of how people converse and use language will remain relatively stable. On the other hand, other aspects will change. As technology enables more capabilities, features, and functions and allows for greater flexibility, the way we help callers form a mental model must be adjusted to match each system's capabilities. Additionally, new technologies will demand new dialog strategies, help facilities, instruction modes, and error-recovery approaches.

The issue of appropriate mental model creation is one of the most challenging as we move towards more flexible and open interactions. Mental models for traditional directed-dialog systems are less challenging. You can simply present menus ("Do you want red, green, or blue?") or ask directed questions that make the answer set obvious ("How many shares would you like to buy?").
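To make this concrete, here is a minimal sketch of a directed-dialog turn (illustrative only, not Nuance's implementation; say and listen are hypothetical stand-ins for a platform's prompt-playback and speech recognition calls):

    # Sketch of a directed-dialog turn. The prompt enumerates the valid
    # answers, so the caller's mental model of "what can I say?" is
    # given explicitly by the menu itself.

    VALID_COLORS = {"red", "green", "blue"}

    def ask_color(say, listen, max_retries=2):
        """Present a closed menu and collect one of the listed options."""
        say("Do you want red, green, or blue?")
        for attempt in range(max_retries + 1):
            answer = listen().strip().lower()
            if answer in VALID_COLORS:
                return answer
            if attempt < max_retries:
                # Error recovery in a directed dialog can always fall
                # back to restating the explicit menu.
                say("Sorry, I didn't catch that. Please say red, green, or blue.")
        return None  # hand off to an agent or another strategy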

In contrast, when you give callers the flexibility and power offered by advanced natural language technologies, the problem of describing the boundaries around what they can say is much greater. The set of valid utterances is too large and diverse to describe explicitly. At Nuance, a set of experiments with open prompt call routing systems illuminated the methods for creating appropriate mental models for such applications. These approaches have led to improvements in task completion and user satisfaction, as well as reduced disfluency and misrouting rates in deployed call routing systems. 
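By way of contrast, a toy open-prompt router might look like the sketch below (again illustrative; a simple keyword matcher stands in for the statistical routing models deployed systems actually use, and the routes and keywords are invented):

    # Toy open-prompt call router: the caller hears "How may I help
    # you?" and speaks freely; the router maps the utterance to a
    # destination queue.

    ROUTES = {
        "billing": {"bill", "charge", "payment", "invoice"},
        "support": {"broken", "help", "problem", "error"},
        "sales":   {"buy", "order", "upgrade", "price"},
    }

    def route(utterance):
        """Map a free-form caller utterance to a destination."""
        words = set(utterance.lower().split())
        best, best_hits = "agent", 0  # default: hand off to a person
        for dest, keywords in ROUTES.items():
            hits = len(words & keywords)
            if hits > best_hits:
                best, best_hits = dest, hits
        return best

    print(route("I have a problem with a charge on my bill"))  # -> billing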

Beyond technological advancements, changes in usage profiles and application needs will place new demands on VUI designers. For example, if voice access to the Internet while driving becomes an important application area, many issues related to attention, safety, and cognitive load become key. How should we change our rules of thumb about cognitive load to accommodate drivers, whose primary attention should be on their driving? How can we accommodate their need to totally divert their attention from the application from time to time, and then seamlessly resume? What design decisions will maximize driver safety?


Answering these and other questions that we cannot currently predict will require attention to the science, the art, and the engineering and measurement techniques we bring to bear.
