The Art and Science of Error Handling
Are there specific types of interactions where interruptions are more likely to occur?
No, it can occur during any type of question or prompt. The easiest ones to pick up on and watch out for are yes/no questions. You have to pay attention to what happens before and after the prompt. Because the response is so short, the caller will give it quickly, but if the application is not prepared to take back the speaking role and move on to the next question, the caller may think that the system did not hear him and respond again.
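The repeated yes/no answer described above can be illustrated with a small sketch. It is not from the interview: the `Utterance` type, the `drop_repeated_answers` helper, and the two-second window are all hypothetical, and a real platform would expose recognition results in its own way. The idea is only that a quick repeat of the same short answer is treated as the caller repeating himself rather than as an answer to the next question.

```python
from dataclasses import dataclass

# Hypothetical window: the same short answer heard again this soon is
# treated as the caller repeating himself, not as a new answer.
REPEAT_WINDOW_SECONDS = 2.0


@dataclass
class Utterance:
    text: str         # recognized text, e.g. "yes"
    timestamp: float  # seconds since the start of the call


def drop_repeated_answers(utterances, window=REPEAT_WINDOW_SECONDS):
    """Return the utterances with quick repeats of the same answer removed."""
    kept = []
    last = None  # most recent utterance heard, whether kept or dropped
    for utt in utterances:
        is_repeat = (
            last is not None
            and utt.text.strip().lower() == last.text.strip().lower()
            and utt.timestamp - last.timestamp <= window
        )
        if not is_repeat:
            kept.append(utt)
        last = utt
    return kept


heard = [Utterance("yes", 10.0), Utterance("Yes", 11.2), Utterance("no", 15.0)]
answers = drop_repeated_answers(heard)
```

Filtering like this only hides the symptom. The point made in the answer still stands: the application should take back the speaking role quickly enough that the caller does not feel the need to repeat himself.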
Are there specific industries where this is more of a problem?
No, this is something that is universal to all speech recognition applications used over the telephone.
With the phone, it can be an obvious problem, but couldn't this happen with any speech application, whether it involves talking to a car's navigation system or a smart TV remote?
Of course. Any time technology is designed to mimic a conversation, where there is some interaction between a human and some type of a computer, it needs to take into account those unwritten rules, those innate rules that we know exist in a conversation. The reason it's so problematic for the telephone is that callers have no other reference. They don't have any visual cues to work off of to let them know what the system is doing.
As designers, the single best thing we can do is remember that speech system interactions are sequential, audio-only, and linear by nature. Callers have to hear every word before they answer and move on to the next question, but they don't have any of those visual cues. Even when you're interacting with a device or a TV, it may have a light or make a sound or give off some other reference to let you know that it's processing what you said, but on the phone, there is no visual cue to let you know that.
Is it better to build a system with natural language processing, or with a structured dialogue? How do the two compare when it comes to these types of errors?
The problem is universal regardless of the type of underlying technology. You can have a well-crafted natural language system that gets interrupted because you hadn't [planned] for what happens if the caller jumps in too early, and the system doesn't have an easy way to recover, or you can have a well-structured back-and-forth dialogue where you have a very solid recovery strategy and you can handle it. If you ignore the technology underneath, it's more about the dialogue itself and the way the questions are asked and understanding that the underlying technology may need time to process whatever information it was given.
No matter what type of system you use, don't underestimate the importance of transition phrases and discourse markers, the things you can say to the caller to take back the speaking role in the conversation and make it clear when it's his turn to talk again. That's where the fine craft of dialogue design comes in, where you can make up for any system shortcomings and for any interference in the conversation.
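As a rough illustration of the transition phrases and discourse markers mentioned above, a prompt can be assembled from an acknowledgment, a transition, and the question itself. The `build_turn` function and its default phrases are hypothetical examples, not wording recommended in the interview.

```python
def build_turn(question, acknowledgment="Got it.", transition="Next question."):
    """Assemble one system turn from its parts.

    The acknowledgment takes back the speaking role, the transition signals
    that a new question is coming, and the question ends the turn so the
    caller knows it is his turn to talk. Empty parts are skipped.
    """
    parts = [acknowledgment, transition, question]
    return " ".join(part.strip() for part in parts if part and part.strip())


prompt = build_turn("Is this about a recent order?")
```

Keeping the question as the last thing the caller hears is the design choice here: the turn ends exactly where the system wants the caller to start speaking.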
How do most systems today respond to these turn-taking errors?
Several things could occur. Based on what the caller said, the system could match it to something in the grammars and take the caller somewhere [other than where he needs to go]. It could create a simple recognition error. It could create the feedback loop that I talked about earlier, or the next thing the caller hears is "We seem to be having problems here. Let me transfer you to an agent," which is probably the best response it could give.
Is a transfer to a live agent really the best thing a system could do? The whole purpose of the IVR is to take the agent out of the equation.
Ideally, there should have been measures in place to prevent the error in the first place, or at least to mask it from the caller. The caller might be confused, and probably doesn't know what he did wrong. Hopefully, those types of scenarios are rare, because even when callers play along, they lose faith in the speech system's ability to interact in a natural way. It becomes a much more mechanical conversation than it could have been.
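One way to picture masking an error before falling back to an agent is a simple escalation policy: ask the question, retry with a shorter version a limited number of times, and only then transfer. This is a minimal sketch under stated assumptions. The `next_action` function, the `MAX_RETRIES` limit, and the apology wording are hypothetical; only the transfer message is taken from the interview.

```python
# Hypothetical limit: how many recovery attempts to make before handing off.
MAX_RETRIES = 2

# Transfer wording quoted in the interview.
TRANSFER_MESSAGE = (
    "We seem to be having problems here. Let me transfer you to an agent."
)


def next_action(error_count, question, short_question):
    """Choose what the system says next for one question.

    error_count is the number of failed attempts at this question so far.
    Returns an (action, prompt) pair where action is "ask" or "transfer".
    """
    if error_count == 0:
        # First attempt: ask the question as designed.
        return ("ask", question)
    if error_count <= MAX_RETRIES:
        # Recovery attempt: acknowledge the problem and ask a shorter question.
        return ("ask", f"Sorry, I didn't catch that. {short_question}")
    # Too many failures: hand the caller to a live agent.
    return ("transfer", TRANSFER_MESSAGE)


first = next_action(0, "Would you like to hear your balance?", "Balance, yes or no?")
retry = next_action(1, "Would you like to hear your balance?", "Balance, yes or no?")
giving_up = next_action(3, "Would you like to hear your balance?", "Balance, yes or no?")
```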
Is it possible to anticipate these errors when designing a system and to prepare for them in the design itself?
Absolutely. That is what designers should always be doing for every question, and they usually do. They try to anticipate all the things that could go wrong. Unfortunately, in trying to anticipate everything that could go wrong, designers sometimes structure the initial prompts to compensate for all of it at once, which is what usually gets us into this mess to begin with. You get really long messages because you're trying to prevent all these errors.
Instead of designing with a lot of heavy content and pushing so much on callers just to get the responses we need, the better approach may be to ask more questions, or shorter questions, or do whatever else is needed to get a more natural dialogue.
If you want a very specific type of response, ask that question specifically and give [the caller] time to respond. It's really as simple as that.