Successful Applications Are the Combination of Technology and Craft
A designer needs to create a user interface (UI) that makes up for deficiencies in technology, and needs to stretch technology to make an application usable. This is especially important in applications that use barge-in. In many ways, barge-in provides a metaphor for many of the issues we encounter in designing an application.
Even the name of the feature shows how little attention we pay to the interaction of behavior and technology. The technical jargon muddies, rather than clarifies, the design problem. People don't barge in. People interrupt. Only rude, obnoxious people barge in. Understanding the difference is part of designing a good UI.
The technical issues in determining when the caller is speaking, turning the prompt off, and recognizing what was said come from echoes in the telephone network. The recognizer might hear a prompt echo as caller speech and either falsely reject it because it heard a word not in the grammar, or falsely accept it as an in-grammar phrase.
The solution is to use an echo canceller that listens to the prompt and subtracts (or cancels) it from the incoming signal. At a technical level, barge-in works very well, but it still takes a good deal of technical expertise. Optimal end-pointer settings differ from one recognizer to another. Some recognizers are affected by the distortions introduced by echo cancellation. We have had experiences in which one recognizer was unaffected by these distortions, but another's accuracy decreased by 5 to 10 percent.
Other, lesser-known aspects of barge-in have more of an impact on application performance than any technical concerns. Polite people don't barge in; they interrupt. So what are the characteristics of interruption? It turns out that the rules for interruptions in human communications are known intuitively by all speakers.
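To make the subtraction idea concrete, here is a deliberately simplified sketch. It assumes the echo is just a delayed, scaled copy of the prompt with a known delay and gain; real echo cancellers have to estimate the echo path adaptively, so the function name and parameters below are illustrative only.

```python
def cancel_echo(incoming, prompt, delay, gain):
    """Subtract a delayed, scaled copy of the prompt from the incoming signal.

    Assumption: the echo path is a single known delay (in samples) and gain.
    A production canceller would estimate these continuously.
    """
    cleaned = []
    for n, sample in enumerate(incoming):
        k = n - delay
        # The prompt only contributes echo where a delayed sample exists.
        echo = gain * prompt[k] if 0 <= k < len(prompt) else 0.0
        cleaned.append(sample - echo)
    return cleaned


# Toy signals: the caller speaks while the prompt echo is still arriving.
prompt = [1.0, -1.0, 0.5, 0.0]
caller = [0.0, 0.0, 0.2, 0.3, 0.0, 0.0]
delay, gain = 1, 0.5

# What the recognizer would hear without cancellation: caller plus echo.
incoming = [
    c + (gain * prompt[n - delay] if 0 <= n - delay < len(prompt) else 0.0)
    for n, c in enumerate(caller)
]

cleaned = cancel_echo(incoming, prompt, delay, gain)
```

With the echo path known exactly, the cleaned signal matches the caller's speech. Any error in the estimated delay or gain leaves residual echo, which is one way the cancellation step can distort what the recognizer hears.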
First, a speaker lowers the level of his speaking voice to indicate a willingness to be interrupted. He pauses at a phrase or sentence boundary to invite interruption. When the second speaker begins to talk, the first one stops and cedes the floor. If the first speaker keeps talking, it's a signal that he's not willing to be interrupted. The second person stops, and waits for his next opportunity.
Because people intuitively follow such rules for interruption, it's critically important that a system turn the prompt off as soon as the person begins speaking. If there is a delay of as little as 100 to 200 milliseconds in turning off the prompt, barge-in won't work, even though the technology may be functioning flawlessly. If the prompt keeps playing, the caller stops the interruption (having gotten a signal that the speaker doesn't want to be interrupted). When the prompt does stop, the caller starts the interruption again. What happens is that the application hears a stutter. Instead of saying "bill payment," the caller says "bill...bill payment." Since that exact expression would not be in the grammar, the application would reject it and re-prompt with some error handling. Applications that behave this way take longer because of the false rejects and extra error prompts, and are more frustrating to the caller because they break the subtle rules of social communication.
For this reason, systems that can't react as fast as human behavior requires cannot effectively implement barge-in. The longer latency doesn't have a technical effect on how the echo canceller or the recognizer works, but it has a hugely disruptive effect in triggering this sort of false stutter.
What else should we do in designing a good application? It's important to leave a pause at phrase boundaries to naturally elicit an interruption. A pause also means there is no echo to be cancelled. A one to one-and-a-half second pause is the right duration to invite someone to interrupt at that point. One strategy is to play a short version of a prompt, pause, then provide examples of what the application grammar expects. For example: "Which department do you want?...pause...You can say shipping, sales.…"
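The stutter effect can be illustrated with a toy model. This is a sketch, not a model of any real recognizer: the grammar contents, the function names, and the single 100-millisecond threshold (taken from the low end of the range above) are all assumptions made for illustration.

```python
# Hypothetical grammar for a banking application.
GRAMMAR = {"bill payment", "account balance"}

# Assumption: above this prompt cutoff latency, the caller stops and restarts.
STUTTER_THRESHOLD_MS = 100


def heard_utterance(intended, cutoff_latency_ms):
    """Return what the application hears for a given prompt cutoff latency."""
    if cutoff_latency_ms < STUTTER_THRESHOLD_MS:
        # The prompt stopped in time, so the caller speaks without restarting.
        return intended
    # The prompt kept playing: the caller abandons the first attempt after
    # one word, then starts over once the prompt finally stops.
    first_word = intended.split()[0]
    return f"{first_word}...{intended}"


def recognize(utterance):
    """Accept only exact in-grammar phrases; None stands for a rejection."""
    return utterance if utterance in GRAMMAR else None


fast = recognize(heard_utterance("bill payment", cutoff_latency_ms=50))
slow = recognize(heard_utterance("bill payment", cutoff_latency_ms=150))
```

In this model the fast system recognizes "bill payment," while the slow one hears "bill...bill payment" and rejects it, even though the recognizer itself behaves identically in both cases.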
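The short-prompt, pause, examples strategy can be written down as a simple playback plan. The sketch below is an assumption about how such a plan might be represented; the `Step` type and `build_prompt` function are hypothetical names, not part of any particular voice platform.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str           # "play" or "pause"
    text: str = ""        # spoken text for "play" steps
    seconds: float = 0.0  # silence duration for "pause" steps


def build_prompt(question, examples, pause_seconds=1.0):
    """Build a plan: short prompt, inviting pause, then example phrases."""
    # Keep the pause inside the one to one-and-a-half second range.
    if not 1.0 <= pause_seconds <= 1.5:
        raise ValueError("pause should be between 1.0 and 1.5 seconds")
    return [
        Step("play", text=question),
        Step("pause", seconds=pause_seconds),
        Step("play", text="You can say " + ", ".join(examples) + "."),
    ]


plan = build_prompt("Which department do you want?", ["shipping", "sales"])
```

Experienced callers can answer during the pause, when no prompt is playing and there is no echo to cancel, while new callers hear the examples that follow.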
Another critical part of making barge-in work, in both technical and psychological ways, is to lower prompt levels in barge-in applications. A standard for voice prompts in DTMF applications is to use a peak level of -8 dB. For barge-in applications, the level should be reduced another four dB. This change will be barely noticeable. It will, however, have important impacts. Psychologically, if the level reaching the earpiece is low, the caller intuitively raises his speaking level.
A louder speaking voice stands out more from any background noise than a softer voice. Technically, lowering the prompt level by four dB greatly reduces the level of any prompt echo. The net result is a weaker echo and a stronger incoming voice, which reduces the complexity of both echo cancellation and end-pointing.
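For readers who want the arithmetic behind the four dB figure, the sketch below converts the level change to an amplitude ratio using the standard relation ratio = 10^(dB/20). The constant names are illustrative, and the reference point for the -8 dB peak level is left as the article states it. The echo figure assumes the echo path is linear, so the echo falls by the same number of dB as the prompt.

```python
def db_to_amplitude_ratio(db):
    """Convert a level change in decibels to a linear amplitude ratio."""
    return 10 ** (db / 20)


DTMF_PEAK_DB = -8.0          # peak prompt level cited for DTMF applications
BARGE_IN_REDUCTION_DB = 4.0  # additional reduction for barge-in applications

# Resulting peak prompt level for a barge-in application.
barge_in_peak_db = DTMF_PEAK_DB - BARGE_IN_REDUCTION_DB

# Amplitude of the quieter prompt relative to the original prompt.
# Under the linear-echo-path assumption, the echo scales by the same ratio.
ratio = db_to_amplitude_ratio(-BARGE_IN_REDUCTION_DB)

print(f"Barge-in peak level: {barge_in_peak_db} dB")
print(f"Prompt (and echo) amplitude ratio: {ratio:.2f}")
```

A four dB cut leaves the prompt at roughly 63 percent of its original amplitude, and the echo the canceller has to remove shrinks by the same proportion.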
As in many aspects of speech application design, it is the combination of technology and good UI design that makes an application successful.
Richard Rosinski is vice president for professional services at VoiceGenie Technologies, and is vice-president of the board of directors of AVIOS, the applied speech technology professional society.