The Pains of Main Are Plainly VUI's Bane
There are at least two reasons why voice user interfaces (VUIs) use menus. First is the fact that interactive voice response (IVR) inherited menus from graphical user interface (GUI) systems. Way back in the history of IT, menu systems evolved as a reasonable way to organize system features and present them to users. Developers saw menus as an elegant way to wed functionality to usability. The practice is taken for granted today.
In the GUI world of visual presentation, most menu systems work quite well. Most share a starting point called the "Main Menu," to which users navigating the system can typically return if they so choose. But people in the VUI world gravitate towards menus for more than historical and organizational reasons. They have found that, by limiting the size of a system's grammars, menus can stack the cards in favor of accurate speech recognition.
Menus and Human Dialogue
While it may seem odd to think of human-to-human conversation in terms of menu interactions, some human-to-human interactions actually exemplify menu organization. Take, for example, the following imagined form-filling dialogue at a fast-food restaurant:
CLERK: Welcome to the Burger Whopper. Is this for here or to go? (MENU: here, to go)
CUSTOMER: For here.
CLERK: What can I get you? (MENU: hamburger, cheeseburger, chicken sandwich, fish sandwich, hotdog, fries, onion rings, cola, etc.)
CUSTOMER: A cheeseburger, French fries and a cola.
CLERK: What would you like on your cheeseburger? (MENU: mustard, ketchup, lettuce, tomatoes, onions, pickles, chili, etc.)
CUSTOMER: Mustard, lettuce, pickles, and onions.
CLERK: What size fry? (MENU: small, medium, large)
CUSTOMER: Small.
CLERK: What size cola? (MENU: small, medium, large)
CUSTOMER: Large.
Note that all of the menus are extremely small because the domain limits the choices. The size of the grammars alone will contribute to greater recognition rates. And, as a colleague once told me, "ASR works best when you already know what the user is going to say."
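The idea of small, menu-scoped grammars can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the menu names and the `match_in_menu` helper are hypothetical, the phrase lists contain only the items named in the dialogue above (the original lists end in "etc."), and plain text matching stands in for a real speech recognizer.

```python
import re

# Hypothetical menu-scoped grammars modeled on the fast-food dialogue.
# Only the items named in the dialogue are listed; a real menu would be longer.
MENU_GRAMMARS = {
    "dining": ["here", "to go"],
    "order": ["hamburger", "cheeseburger", "chicken sandwich", "fish sandwich",
              "hotdog", "fries", "onion rings", "cola"],
    "toppings": ["mustard", "ketchup", "lettuce", "tomatoes", "onions",
                 "pickles", "chili"],
    "fry_size": ["small", "medium", "large"],
    "cola_size": ["small", "medium", "large"],
}


def match_in_menu(menu, utterance):
    """Return the phrases of the active menu's grammar found in the utterance.

    Text matching is a stand-in for ASR (an assumption of this sketch).
    Only the active menu's phrases are candidates, which is the point:
    the smaller the candidate set, the fewer ways there are to be wrong.
    Results come back in grammar order, not spoken order.
    """
    text = utterance.lower()
    return [
        phrase
        for phrase in MENU_GRAMMARS[menu]
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    ]


# Size of the largest single grammar versus all distinct phrases combined.
largest_grammar = max(len(phrases) for phrases in MENU_GRAMMARS.values())
all_phrases = {p for phrases in MENU_GRAMMARS.values() for p in phrases}

print(match_in_menu("dining", "For here."))
print(match_in_menu("order", "A cheeseburger, French fries and a cola."))
print(largest_grammar, len(all_phrases))
```

At any point in the dialogue, the matcher considers only the handful of phrases in the active menu rather than every phrase the system knows, which is the property the clerk's questions exploit.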
Even in our simplified fast-food example, the Main Menu plays the most important role in the interaction. The Main Menu is where the clerk determines the basic intent of the customer. It answers the question, "Why are you here?" Also note that whatever happens on the Main Menu determines what happens next during the dialogue.
The Main Menu plays the same critical role in VUIs. As in the fast-food example, it is where the system asks the user, "Why are you here?" and, subsequently, determines the dialogue to follow. If it asks the question and then cannot process the response properly, it is, by definition, clueless, and users quickly lose patience with a clueless system.
Overcoming Main Menu Problems
Of course, what is missing in such dialogues is human intelligence. Barring dramatic breakthroughs in artificial intelligence, the situation for pure automation is not likely to change anytime soon. Thus, at present, there are two obvious ways to overcome the usability problems that the Main Menu presents. One is to eliminate the Main Menu entirely. In other words, instead of offering one phone line that people call to perform any task, provide a dedicated line for each specific task. This way, the system knows the user's intent as soon as the call is answered.
Another, and very promising, way to overcome the problems of the Main Menu is to interject human intelligence by using a hybrid system architecture. In a hybrid system, human assistants sit behind the ASR engine. The ASR engine does what it always does, and the dialogue proceeds as long as confidence factors are sufficiently high. Upon ASR failure, however, a human assistant surreptitiously intervenes. The assistant listens to the unrecognized utterance and then directs the interaction accordingly. Users are thus spared the frustrating experience of interacting with a system that is incapable of determining their intent. Interestingly, users are not usually aware of the human intervention, and they therefore tend to think highly of the automated system.
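The hybrid hand-off described above can be sketched as a simple confidence check. Everything named here is an assumption made for illustration: the `Recognition` record, the 0-to-1 confidence scale, the 0.6 threshold, and the `human_assistant` callback are hypothetical and do not describe any particular ASR product.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Recognition:
    """One ASR result (hypothetical shape)."""
    text: str          # the recognizer's best guess at the caller's intent
    confidence: float  # assumed to range from 0.0 to 1.0
    audio_ref: str     # handle to the recorded utterance for a human to hear


# Illustrative value only; a real threshold would be tuned per application.
CONFIDENCE_THRESHOLD = 0.6


def resolve_intent(
    result: Recognition,
    human_assistant: Callable[[Recognition], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[str, str]:
    """Return (intent, source), where source is "asr" or "human".

    When confidence is high enough, the dialogue proceeds on the ASR result.
    Otherwise the utterance goes to a human assistant, who decides the intent.
    The caller sees the same kind of answer either way.
    """
    if result.confidence >= threshold:
        return result.text, "asr"
    return human_assistant(result), "human"


def demo_assistant(result: Recognition) -> str:
    """Stand-in for a person listening to result.audio_ref."""
    return "billing"


print(resolve_intent(Recognition("account balance", 0.92, "utt-001"), demo_assistant))
print(resolve_intent(Recognition("", 0.21, "utt-002"), demo_assistant))
```

Because both paths return the same kind of value, the rest of the dialogue does not need to know whether the machine or a person resolved the Main Menu response.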
Walter Rolandi is founder and owner of The Voice User Interface Co. in Columbia, S.C. He provides consultative services in the design, development, and evaluation of telephony-based voice user interfaces and evaluates ASR, TTS, and conversational dialogue technologies. He can be reached at wrolandi@wrolandi.com.