Is It Stupid to Be Clever?
Grammar writers generally try to anticipate the many ways users will respond when prompted to speak. Many designers believe that by expanding their grammars to permit highly variable user input, they will create a natural, easy-to-use voice user interface. This belief is strongly held by some in the voice application development community, and applications developed by true believers can sport some truly huge grammars. I have seen, for example, a yes/no grammar containing thousands of acceptable utterances!
The idea is to enable the user to speak in ways that come naturally. When writing a grammar, the grammar writer imagines a conversational exchange in which one person says this, to which the other person says that. In a way, this is a rather presumptuous undertaking, because what the grammar writer is ultimately endeavoring to do is predict human behavior. The contents of his grammar are his predictions of what a user will say in response to a given prompt.
This big grammar practice is ill advised. First of all, no one can predict human behavior with unfailing accuracy, so any attempt to do so will be error prone. For the grammar writer, this means that no grammar can ever be complete. Users will invariably find unanticipated ways to express themselves, and these expressions will lead to out-of-grammar errors.
But, one might ask, what's wrong with trying? Why shouldn't users be allowed to express themselves in highly variable ways? So what if you can't anticipate every possible user utterance; won't you still be right 90+ percent of the time? Ironically, a big problem results from getting it right 90+ percent of the time. When an application responds appropriately to many ways of saying, in effect, the same thing, it is reinforcing the user for saying the same thing differently. This encourages the user to experiment with novel ways of saying things, which will eventually lead to out-of-grammar errors.
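As a rough illustration of the out-of-grammar problem, the following Python sketch treats a grammar as a simple lookup of text phrases. This is an assumption made purely for illustration: real recognition grammars match audio against phrase structures rather than strings, and the phrase lists and the `interpret` function here are hypothetical.

```python
# Toy illustration only: a "grammar" is modeled as a dict of text phrases.
# Real speech grammars match audio, not strings; all names here are hypothetical.

# A minimal grammar: the two words the prompt asks for.
SMALL_YES_NO = {"yes": "yes", "no": "no"}

# A "big" grammar: many phrasings mapped onto the same two meanings.
BIG_YES_NO = {
    **SMALL_YES_NO,
    "yeah": "yes",
    "yep": "yes",
    "sure": "yes",
    "that is correct": "yes",
    "nope": "no",
    "no thanks": "no",
    "i do not think so": "no",
}


def interpret(utterance, grammar):
    """Return the meaning of an utterance, or None if it is out of grammar."""
    # Normalize case and whitespace before the lookup.
    normalized = " ".join(utterance.lower().split())
    return grammar.get(normalized)


if __name__ == "__main__":
    for said in ["Yes", "Yeah", "Sure", "You bet"]:
        result = interpret(said, BIG_YES_NO)
        print(said, "->", result if result is not None else "out of grammar")
```

However many phrasings the big grammar accepts, a user who has been encouraged to improvise will sooner or later land on one, such as "You bet" here, that it does not.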
The situation becomes more problematic because when the application does get it right, it fosters an unrealistic expectation on the part of the user. In effect, it is telling the user, "Feel free to say anything you like. I will understand you." At this point, the application no longer has a speech recognition problem. It has an artificial intelligence problem. Big grammar systems seem clever and sexy as long as the user sticks to the script. But again, using clever grammars can have some unforeseen consequences.
For example, I have designed a number of applications that incorporate some form of dial-by-name feature. In its most basic form, the feature allows the user to call someone by speaking his or her name instead of dialing his or her phone number. Applications have included Virtual Assistants, Auto Attendants and voice dial tone systems. In its strictest sense, the recognition goal is to identify a particular person to be called. While recognizing proper names is a difficult problem in and of itself for many speech recognition engines, the effort is made much more difficult by expanding the grammar to include things like titles, nicknames and social amenities. The idea seems simple and reasonable enough: instead of limiting the user by forcing him to say a first name and last name (e.g., "John Doe"), permit him to say things such as:
"Mr. John Doe."
"John Doe, Jr."
"Johnnie Doe."
"I'd like to speak to John Doe."
"Would you dial John Doe for me please?"
Wouldn't such a system be clever and neat? Sure it would, as long as it never fails. But an interesting thing happens the first time it does: the benefit of being clever (itself questionable, I might add) dramatically erodes. The system has just wasted some of the user's time and the user is no longer impressed with its cleverness. The user will abandon more elaborate ways of speaking and quickly find the most effective way to indicate a person's name. Whatever way is most likely to be correctly recognized will be what is selected, and in my experience, this tends to bring us full circle: "Please say the first name and last name of the person you wish to call." Instead of being clever with grammars and setting unrealistic user expectations, it may be much smarter to create designs that implicitly remind the user of the reliable capabilities of existing technologies.
Dr. Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, SC. Dr. Rolandi provides consultive services in the design, development and evaluation of telephony based voice user interfaces (VUI) and evaluates ASR, TTS and conversational dialog technologies. He can be reached at wrolandi@wrolandi.com