Beyond Speech Recognition
Speech recognition is now widely used in telephone dialogues to convert speech to text. Some telephone applications also use speaker identification to determine who is speaking. But this is only the beginning of the information that recognition systems can extract from a person’s voice.
Experimental systems have demonstrated that many other kinds of user information can be extracted from a person’s speech. These systems, which I will call user property extraction systems (UPESes), analyze speech to detect various properties of the speaker.
A UPES could be used to improve the performance of an interactive voice response (IVR) system, with applications like:
- Speaker language identification: After determining the speaker’s language, a UPES could route the user to an IVR system that converses in the speaker’s language, or to a live agent who speaks that language.
- Speaker accent identification: An IVR system could switch acoustic models so the system can better understand a speaker with a British, Spanish, Chinese, or American accent.
- Speaker emotion identification: An IVR system could detect when a user becomes upset, which unfortunately happens all too frequently and often degrades speech recognition accuracy, and offer to transfer the caller to a live agent.
A UPES could also make IVR dialogues more relevant to the user. This is especially true for marketing applications, as in the following examples:
- Age: An IVR sales application can suggest different styles of clothing based on the speaker’s age.
- Gender: An IVR sales application can suggest perfume or cologne based on the user’s gender. Guys wear cologne, not perfume.
- Poor social behavior: An IVR can identify when someone is acting inappropriately.
- Voice stress: A system could be used as a lie detector.
- Alcohol consumption: A system could identify those who have had too much to drink.
In addition, a UPES could be used for voice mining (e.g., to search a collection of audio files for young adult males speaking English with Scottish accents).
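Voice mining of this kind amounts to filtering a collection by the properties a UPES extracts. A minimal sketch of the idea follows; the property records here are fabricated stand-ins for UPES output, and the file names and property keys are hypothetical:

```python
# Sketch of voice mining: filter a collection of audio files by the
# speaker properties a hypothetical UPES has already extracted.

def mine(records, **criteria):
    """Return the names of files whose extracted properties match all criteria."""
    return [name for name, props in records.items()
            if all(props.get(key) == value for key, value in criteria.items())]

# Stand-in UPES output for three recordings (fabricated for illustration).
catalog = {
    "call_001.wav": {"age_group": "young adult", "gender": "male",
                     "language": "English", "accent": "Scottish"},
    "call_002.wav": {"age_group": "senior", "gender": "female",
                     "language": "English", "accent": "American"},
    "call_003.wav": {"age_group": "young adult", "gender": "male",
                     "language": "English", "accent": "Scottish"},
}

hits = mine(catalog, age_group="young adult", gender="male",
            language="English", accent="Scottish")
# hits -> ["call_001.wav", "call_003.wav"]
```

In a real system the property records would be derived from the audio itself; the search step, however, would look much like this filter.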
Like traditional speech recognition systems, a UPES requires training. A speaker-dependent UPES works reasonably well after a short training session, but even a short session is longer than most IVR callers will tolerate; only the loneliest or most bored user will train a telephone system. Alternatively, a speaker-independent UPES is pretrained on the voices of hundreds of speakers representative of the intended users, although it usually does not reach the accuracy of a speaker-dependent UPES. Pretrained systems like these are also time-consuming and expensive to construct.
A UPES that classifies users into one of many categories (e.g., age in years) is less accurate than a UPES that classifies users into a small number of categories (e.g., male or female). Some classifications may be too subjective to be useful. For example, it may be difficult to distinguish between an angry user and an anxious user.
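The effect of category granularity on accuracy can be illustrated with a toy calculation. The age predictions below are invented for illustration; the point is that collapsing fine-grained labels into a few broad bands turns near misses into hits:

```python
# Toy illustration: the same age predictions scored against exact years
# (many categories) and against three broad bands (few categories).

def accuracy(predicted, actual):
    """Fraction of predictions that exactly match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def band(age):
    """Collapse an age in years into one of three broad categories."""
    return "child" if age < 18 else "adult" if age < 65 else "senior"

true_ages = [25, 34, 70, 12, 45]   # fabricated ground truth
pred_ages = [27, 34, 72, 12, 51]   # fabricated UPES estimates

fine = accuracy(pred_ages, true_ages)                                   # 0.4
coarse = accuracy([band(a) for a in pred_ages],
                  [band(a) for a in true_ages])                         # 1.0
```

Every estimate lands in the right band even though most miss the exact year, which is why a two- or three-way classifier (such as male versus female) can look far more accurate than an age-in-years classifier built on the same signal.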
It may be possible to combine a UPES with a traditional speech recognition system, using information from both to determine the speaker’s properties. For example, the output of an age-recognition UPES could be confirmed by a dialogue that asks, “How old are you?”
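One way to sketch this combination is a simple reconciliation rule: trust the UPES estimate when its confidence is high, and otherwise fall back on the caller’s spoken answer, flagging large disagreements. The function name, thresholds, and inputs below are all assumptions for illustration, not part of any real UPES:

```python
# Sketch of reconciling a hypothetical age-recognition UPES with a
# "How old are you?" confirmation dialogue.

def reconcile_age(estimated_age, confidence, stated_age,
                  tolerance=10, threshold=0.8):
    """Return (age, source): the UPES estimate when confident, otherwise
    the caller's stated age, marking large disagreements as conflicts."""
    if confidence >= threshold:
        return estimated_age, "upes"          # UPES is confident enough
    if abs(estimated_age - stated_age) <= tolerance:
        return stated_age, "confirmed"        # dialogue confirms the estimate
    return stated_age, "conflict"             # flag for a fallback dialogue
```

For instance, a low-confidence estimate of 35 paired with a stated age of 40 would be resolved as a confirmed 40, while a stated age of 60 would be flagged as a conflict for the dialogue to sort out.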
Researchers in Germany reported that a speech-based alcohol-detection system discriminated between individuals with blood-alcohol levels above and below a specific threshold, with a success rate of 69 percent.
Then there is the problem of misclassification. Many men would be insulted if a UPES mistakenly determined them to be female and offered to sell them dresses. Senior citizens would be shocked to be offered tickets to a punk-rock concert.
Traditional speech dialogue systems sometimes confirm the words a person spoke using a confirmation dialogue. This strategy may not work for some UPES applications. Imagine the response to the confirmation question, “Honey, were you unfaithful while you were away at the convention?”
UPESes promise to provide many new types of information about speakers. However, they also present many challenges, including how to use UPES information appropriately, how to detect and resolve incorrect UPES information, how to improve its accuracy, and how to structure the user-computer dialogue. We have solved these challenges for traditional speech recognition systems, but we will need more experience before we can expect satisfactory solutions for UPESes.
James A. Larson, Ph.D., is an independent consultant, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, and author of the home-study guide The VoiceXML Guide. He can be reached at jim@larson-tech.com.