The Common Causes of VUI Infirmities
Call the Doctor

While most of us know the various things we should and should not do to maintain a healthy lifestyle, relatively few of us consistently comply. Such is also the case in the Voice User Interface (VUI) design world. Best design practices are, for the most part, publicly available and widely known. Yet, perhaps for the same reason that some people think they are above the rules of diet and exercise, many Interactive Voice Response (IVR) designers seem to see themselves as immune to the illnesses that invariably plague poorly designed voice applications. And just as an unhealthy lifestyle might eventually require the services of a physician, I am increasingly called upon to fix what ails the poorly designed VUI.
Three Major Causes

Most present-day VUI infirmities are largely attributable to three major causes. The single greatest cause is the tendency of designers, developers, and marketing professionals to selectively ignore the tried-and-true best practices of IVR design. Again, many of these practices are publicly available. For example, Bruce Balentine and David P. Morgan's How to Build a Speech Recognition Application is a terrific reference. While specific principles and practices can be found in many other publications, this work is probably the best single source of information currently available. Many of the most pedestrian, commonplace violations of sound IVR design could be avoided entirely if more designers would study this book.

The second and third causes of IVR usability problems are of a different sort. They do not stem from violations of recognized best practices; they are born of unrealistic expectations conveyed to the user about the capabilities of speech technologies. Borrowing some diagnostic labels from the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), one might playfully lump these two causes into two basic categories: Personality Disorders and Mixed Receptive-Expressive Language Disorders.
Personality Disorders

Personality Disorders almost always arise from excesses in the application's persona. While no one would dispute the marketing value of an appropriate and representative corporate-to-customer touch point, that value is often lost in an overly animated, artificially enthused IVR. Far too many designers succumb to the temptation to make their IVRs amusing or delightful. Entertainment itself becomes the primary goal, and attention to the caller's more practical reasons for using the application becomes secondary. An elaborate and animated persona can set expectations on the user's part of human-like social abilities. While these sorts of dialogs can often be quite amusing, their ability to amuse tends to decrease with repeated use and, more importantly, with each recognition error the caller experiences.
Mixed Receptive-Expressive Language Disorder

The psychiatric diagnosis of a Mixed Receptive-Expressive Language Disorder applies in cases where an individual exhibits deficits both in his ability to understand language and in his ability to use language to communicate with others. Such individuals are largely incapable of processing natural language. The analogy to current VUI practice should be obvious: present-day attempts to impart natural language processing (NLP) abilities to speech recognition applications often exhibit the most painful shortcomings.

NLP, as an area of research in artificial intelligence, is far from new. In one form or another, it has been around for many decades. Some of the accomplishments of researchers in the field of text-based NLP are actually quite astonishing. However, the overall consensus seems to be that while NLP applications can be very impressive, they are ultimately brittle, prone to rapid degradation, and likely to provoke user distrust. Within the speech industry, the term NLP can be used to mean many things. But given the well-known limitations of text-based NLP attempts, one wonders why the term is in use at all.

Here is what appears to have happened: many ASR engines are obviously and demonstrably capable of accurately recognizing extremely complex utterances. This is an undisputed fact. But somehow, people came to assume that this impressive ability to recognize the verbal content of an utterance was necessary and sufficient to support a human-like conversational competence. Herein lies the problem: simply having ears does not imply that one has a brain. Being able to hear and recognize something is only one step of many in the process of making an intelligent response. Automatic Speech Recognition (ASR) engines provide truly excellent ears. When appropriately deployed, they can accurately tell us what words are contained in an utterance. They cannot, however, provide any insight into the semantics, or meaning, of those words. That entire aspect of interactive dialog must currently be built ad hoc into an application's design, as the sketch below illustrates.

Just as an exuberant persona can set unrealistic expectations on the part of the user, so can a dialog design that encourages the user to speak naturally and indiscriminately. NLP is not a speech recognition problem; it is an artificial intelligence and social modeling problem. When users are encouraged to think they can say anything, they will understandably act as if they can. This invariably leads to recognition failures and user frustration, both of which can be minimized by setting and managing more realistic user expectations.
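To make the point about "ears without a brain" concrete, here is a minimal, purely illustrative sketch in Python. The transcripts, rule names, and intents are invented for the example and do not come from any particular vendor's API; the point is simply that the recognizer hands the application nothing but words, and whatever "meaning" gets extracted must be hand-built, rule by rule, into the design.

# A minimal, hypothetical illustration: the ASR engine returns only words;
# any "meaning" the application extracts must be hand-built, rule by rule.

# Invented transcripts, as an ASR engine might return them.
TRANSCRIPTS = [
    "i want to pay my bill",
    "uh yeah can you tell me what my balance is",
    "let me speak to a person please",
    "why is the sky blue",
]

# The ad hoc semantic layer: hand-written keyword rules mapping recognized
# words to application intents. None of this comes from the recognizer itself.
INTENT_RULES = {
    "pay_bill": ["pay", "payment", "bill"],
    "get_balance": ["balance", "owe"],
    "transfer_to_agent": ["person", "agent", "representative", "operator"],
}

def interpret(transcript: str) -> str:
    """Map a recognized utterance to an intent using brittle keyword matching."""
    text = transcript.lower()
    for intent, keywords in INTENT_RULES.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "no_match"  # the dialog design must decide how to reprompt

if __name__ == "__main__":
    for transcript in TRANSCRIPTS:
        print(f"{transcript!r:48} -> {interpret(transcript)}")

Note that the fourth utterance, which the engine would recognize perfectly well, falls straight through to "no_match." That brittleness is exactly why inviting callers to "say anything" is so risky: the hand-built semantic layer, like the recognizer's grammar, stays tractable only when the prompts constrain what the caller is encouraged to say.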
Dr. Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, S.C. Dr. Rolandi provides consultative services in the design, development, and evaluation of telephony-based voice user interfaces (VUI) and evaluates ASR, TTS, and conversational dialog technologies. He can be reached at wrolandi@wrolandi.com.