An Emotional Mess
After a customer enters #, which does not correspond to any viable input option, a typical call center application says, "Sorry, you are having trouble." In most cases, that response sounds just like the previous five the person has heard, so the feeling of sorrow is not evident to the frustrated user. Rather than an empathetic human, the system sounds like an unemotional silicon and plastic device.
Traditionally, text-to-speech and interactive voice response systems have not had much personality. While these systems now typically respond in intelligible voices, and can even communicate in a number of languages and voices, they often sound flat, speaking in fragmented monotones with little variation in qualities like speed, tone, and pitch.
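To make the point concrete, the sketch below shows the kind of prosodic controls a synthesis front end would need to expose before any of that variation is possible; the parameter names and neutral values are illustrative assumptions, not any particular product's interface.

```python
from dataclasses import dataclass

@dataclass
class Prosody:
    """Hypothetical prosodic settings a synthesis front end might expose."""
    rate: float = 1.0         # speaking-rate multiplier (1.0 = neutral pace)
    pitch_shift: float = 0.0  # semitones above or below the voice's baseline
    pitch_range: float = 1.0  # widen (>1.0) or flatten (<1.0) the intonation

# A flat, machine-like reading keeps every parameter at its neutral value;
# a livelier reading varies them from phrase to phrase.
FLAT = Prosody()
ANIMATED = Prosody(rate=1.1, pitch_shift=2.0, pitch_range=1.4)
```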
But recently a number of companies have been working on ways to make their systems sound more human by injecting emotional components, such as empathy and joy, into their output.
That goal may seem simple, but the underlying work is exceedingly complex. Vendors need to understand how human speech signifies various emotions, find ways to calculate and quantify such findings, and then develop software that mimics human interactions. Compounding the complexity is the need to devise viable business cases that can justify the significant investments needed in this nascent area. While some progress has been made, a vast amount of work still needs to be completed before emotional utterances are common in speech synthesis systems.
Driving Forces
A number of factors are driving interest in adding emotion to speech system output. Proponents note that there are many potential business benefits. In today's highly competitive, rapidly changing marketplace, companies are looking for ways to interact with customers more effectively. The emerging speech technology features can be integrated into a variety of applications: automated call centers, customer relationship management, news and email reading, self-service applications, live news, business documentation, e-learning, and even entertainment. Theoretically, the change will lead to richer, more effective interactions and result in improved customer satisfaction, increased business efficiency, and, most important, more revenue.
Another factor pushing this change is the natural progression of speech technology. "We have reached the point where we can deliver systems that are quite intelligible, so speech system design goals are now shifting to make voice systems sound more like people and less like machines," states Andy Aaron, a research scientist at the IBM T.J. Watson Research Center.
To add such capabilities, vendors must first clear a number of hurdles. One of the biggest challenges is finding a baseline for comparing emotional and unemotional speech. "As of now, the classifications needed to identify emotional speech are not well known or widely accepted," says Jim Larson, an independent consultant and VoiceXML trainer at Larson Technical Services. In universities and vendor research and development laboratories, scientists have been trying to identify and quantify the metrics needed to correlate emotions to patterns of speech.
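As a rough illustration of what such a baseline comparison involves, the sketch below summarizes frame-level pitch and energy into a few statistics and expresses an emotional sample as ratios against a neutral recording; the chosen features and the ratio-based comparison are assumptions for illustration, not any lab's published method.

```python
import statistics

def acoustic_profile(pitch_hz, energy):
    """Summarize frame-level pitch (Hz) and energy into a few simple statistics."""
    voiced = [p for p in pitch_hz if p > 0]  # skip unvoiced frames (pitch reported as 0)
    return {
        "pitch_mean": statistics.mean(voiced),
        "pitch_spread": statistics.stdev(voiced),  # wider spread suggests livelier intonation
        "energy_mean": statistics.mean(energy),
    }

def compare_to_baseline(sample, neutral):
    """Express each statistic as a ratio against a neutral-speech baseline."""
    return {name: sample[name] / neutral[name] for name in sample}
```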
Challenges have emerged right from the start. "Expressive speech is multidimensional: One can say 'happy' or 'sad,' but those words often have various connotations to different speakers," IBM's Aaron explains. Consequently, companies find themselves needing to quantify traits that lack a consensus set of features.
Another problem is that of variability. Different speakers say things in different ways at both a linguistic and vocal level. Further exacerbating the task, there is also considerable variability even within the speech of a single speaker. A speaker will not necessarily use the same words to say the same thing twice, and even different instances of the same word will not always be acoustically identical.
A number of factors contribute to this variability. Speakers alter their way of talking in response to a number of conditions related to their environments and their status relative to those to whom they are speaking. Such attitudinal conditions include consciously increasing intelligibility (a speaker will alter his speech for a non-native listener or due to increased background noise), familiarity (a speaker will speak more carefully to a listener with whom he is not familiar), and social status (a speaker will speak to a child differently from the way he would speak to a peer, and would speak in a different way again to a listener in a socially dominant position, such as a boss).
A Positive Outlook
Speakers' outlooks can change as well. If a person has been successful (say, getting a big raise at work), that success will affect the way he communicates with others throughout the day. Different emotional states affect the speech production mechanism and lead to acoustical changes in the individual's speech patterns.
Physiological factors also play a role in how we communicate. Stress can have a dramatic impact on how one talks; a person’s tone, emphasis, and even speed can change significantly if she is under a great deal of stress. Other factors, such as fatigue, illness, or the effects of drugs and alcohol, can also alter verbal exchanges between people.
While these factors are essentially independent of each speaker, they all manifest themselves to some degree in every exchange. So researchers first have to figure out what changes each of these factors produces, and then combine them to accurately deduce how speakers convey different emotional states. The challenge is that these speech variations are, for the most part, produced unconsciously. Even when a person consciously adopts a speaking style, the actual vocal changes (varying pitch or pace) are often made unconsciously. Identifying the changes and developing precise descriptions of how they are produced can therefore be hard.
So far the industry has made significant progress in its quest to deliver better speech systems, but these improvements have come in a limited context. Because the problems associated with emotive speech are so intimidating, most speech scientists have focused (at least to date) on dealing with "normal speech," that is, speech that does not display any of the previously mentioned variabilities. The end result is that most present speech synthesis systems do not exhibit emotion and instead produce bland, neutral, machine-like speech.
Vendors have been trying to add emotions to speech systems in a couple of ways, each with varying levels of success. "The industry has made more progress focusing on the linguistic features found when expressing different emotions than with examining the acoustical elements," admits Dan Faulkner, director of product management and offer marketing at Nuance Communications.
Adding Features
Vendors have spent a great deal of time, money, and effort trying to determine how different words impact customer exchanges. They have found that certain words, such as "just" and "simply," trigger various responses in customers. A few suppliers have also been trying to take that knowledge and use it to improve their systems’ effectiveness.
Loquendo has focused on adding emotional features to its text-to-speech systems. Patrizia Pautasso, marketing and business development manager at Loquendo, views the company's work not as pioneering new technology but as an extension of the basic ideas of concatenative synthesis: extracting segments of real human speech and playing them back in different combinations. Rather than concentrating on having short phoneme sequences evoke certain feelings, the vendor has focused on using entire phrases that have expressive power. Certain phrases are chosen to represent "speech acts" (i.e., common linguistic expressions with a strong pragmatic and social intention, such as greetings, requests, thanks, approvals, and apologies). Loquendo's Expressive Cues feature, which works in multiple languages, provides a series of commonly used expressions said by Loquendo's voices in an expressive way.
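In spirit, the approach amounts to keeping a catalogue of prerecorded expressive phrases and splicing the right one in front of ordinary synthesized output. The sketch below illustrates that general idea only; the audio paths and the synthesize() helper are hypothetical and do not reflect Loquendo's actual Expressive Cues interface.

```python
# Illustrative only: the audio paths and synthesize() helper are hypothetical,
# not Loquendo's actual Expressive Cues interface.
EXPRESSIVE_CUES = {
    ("apology", "en"):  "cues/en/sorry_about_that.wav",
    ("greeting", "en"): "cues/en/hello_welcome.wav",
    ("thanks", "en"):   "cues/en/thank_you.wav",
}

def build_prompt(speech_act, language, body_text, synthesize):
    """Play a prerecorded expressive cue, then the neutrally synthesized body text."""
    cue = EXPRESSIVE_CUES.get((speech_act, language))
    body = synthesize(body_text, language)   # ordinary TTS for the variable part
    return ([cue] if cue else []) + [body]   # expressive cue first, then the body
```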
Emotional Bonds
The more difficult task centers on trying to acoustically connect emotions, such as anger, happiness, and sadness, to synthetic speech systems dynamically. At the moment, delivery of such products is theoretical rather than practical, and no vendor offers a system capable of generating dynamic, real-time emotive speech output.
Here, vendors need to figure out how variations in speech patterns correspond to different emotions. Three areas present significant challenges in adding emotion to speech systems: intonation, voice quality, and interaction variability.
Intonation centers on the placement of word-level and utterance-level accents. A lot of work has been done on describing intonation contours, and some rules have been produced for assigning contours to synthetic speech based on parsing its verbal content. While current systems have made progress in this respect, the limitations of word parsing and intonation rules mean that no system can assign the correct contour for every possible utterance a person could make.
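A toy version of such rules, assigning a contour from surface punctuation alone, shows why they break down; the categories and conditions below are purely illustrative.

```python
def assign_contour(utterance):
    """Very rough rule-based contour assignment from surface punctuation alone."""
    text = utterance.strip()
    wh_words = ("who", "what", "where", "when", "why", "how")
    if text.endswith("?") and not text.lower().startswith(wh_words):
        return "rising"     # yes/no questions usually end with a rise
    if text.endswith("!"):
        return "expanded"   # exclamations get a wider pitch range
    return "falling"        # default declarative contour

# Rules like these fail whenever the intent is not visible in the text itself,
# e.g. a sarcastic "Great." versus a genuinely pleased "Great."
```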
The underlying "personality" of a synthesized voice is a major contributor to whether it sounds natural. Systems based on prerecorded speech perform well in this respect because the speaker’s voice quality comes through in the resynthesized speech; however, this option is not available in all cases. Machine voice output has been improving, but still falls short of the vocal granularity found in human speech.
Also, vendors do not want to simply generate an emotive response; they need to produce an appropriate one. "Emotional projection also requires emotion detection," says Valentine Matula, director of multimedia research at Avaya Labs.
Adding expressive speech to voice products' output is based on the premise that these devices can accurately gauge and respond to users' moods from their input. After all, the emotive expressions are not generated in a vacuum; they come in response to something users say or do. To gauge those moods accurately, vendors need to examine a number of different elements and rely on them as cues to guide customer interactions to successful completions.
Small Steps Forward
Suppliers have made some progress in this area. They know that users are usually not in a good mood when they reach out to a contact center. "Customers will not call contact centers in order to thank them for a job well done; they are usually calling because they have some type of problem," Nuance’s Faulkner notes. The company has incorporated features into its system to understand different cues, such as long pauses or a series of incorrect entries, and respond to them appropriately. Unfortunately, that usually means routing callers to live agents rather than having them continue to interact with the voice system.
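A minimal sketch of that kind of cue-driven escalation might look like the following; the event names and thresholds are assumptions for illustration, not Nuance's implementation.

```python
def should_escalate(events, max_errors=3, long_pause_s=4.0):
    """Decide whether to hand the caller to a live agent based on simple interaction cues.

    `events` is a list of (kind, value) tuples such as ("nomatch", None),
    ("noinput", None), or ("pause", seconds_of_silence). Thresholds are illustrative.
    """
    errors = sum(1 for kind, _ in events if kind in ("nomatch", "noinput"))
    long_pauses = sum(1 for kind, value in events if kind == "pause" and value >= long_pause_s)
    return errors >= max_errors or long_pauses >= 2
```

As the article notes, today the usual response to such a signal is a transfer to a live agent rather than a change in the system's own tone.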
Vendors also need to make other determinations, such as identifying target audiences and presenting information to them appropriately. A voice system catering to middle-aged customers should not have a teenager's voice and vocabulary, and vice versa.
In addition, some human interactions are quite difficult to replicate. "There are artificial intelligence elements that need to be integrated into emotive speech systems in order for them to be effective," says Raul Fernandez, another research scientist at IBM’s T.J. Watson Research Center. Agents are trained to apologize even if a problem is not their fault and to try to help even if they cannot, such as when a customer is asking for a refund for a coupon that recently expired. In effect, vendors are being asked to build machines that are smart enough to know that they might have to simply let customers vent for a few moments and then try to address their problems.
The myriad emotional variability factors that vendors have to take into account has stymied their efforts to add emotive features to their systems. "The research from early projects found that emotions were correctly portrayed only 55 percent of the time, which is only slightly better than guesswork," Larson says.
The work has been thwarted by many factors, starting with the challenge of building a common emotive speech frame of reference. It has also run into the ephemeral nature of these concepts, which defies easy description and therefore prevents robust definitions of related terms. Furthermore, although emotion is largely cross-cultural, language translation adds yet another layer of difficulty.
Solving the problem means developing a common reference framework, one that could add structure and definition to a topic that to date has generated incomplete (some would add largely subjective) data. Once that framework is in place, vendors will need to develop algorithms so their systems convey the proper emotions, and then test those algorithms to determine how well they work. Vendors will need to find out which expressive phrases are truly useful in real applications for each language and record them with the right degree of emotion to match their expressive intents while avoiding an unnatural gap with the baseline synthetic voice. Since there are many gradations of happiness, companies will need to present the right one at the right time to customers.
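Once such a framework supplies a detected emotion and an intensity score, selecting among recorded gradations could be as simple as the sketch below; the clip names and thresholds are hypothetical.

```python
# Hypothetical catalogue: several recorded gradations of the same emotion,
# each tagged with the minimum intensity score (0.0-1.0) it is meant to cover.
HAPPINESS_VARIANTS = [
    (0.0, "happy_mild.wav"),
    (0.4, "happy_warm.wav"),
    (0.8, "happy_delighted.wav"),
]

def pick_variant(intensity, variants=HAPPINESS_VARIANTS):
    """Return the recorded gradation whose threshold best matches the detected intensity."""
    chosen = variants[0][1]
    for threshold, clip in variants:
        if intensity >= threshold:
            chosen = clip
    return chosen
```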
An Indirect Path
Consequently, the line from these emotive speech objectives to their eventual end point is crooked, not straight; it will entail a lot more research, experimentation, and usability testing. Therefore, suppliers need to be willing to invest in projects that may go nowhere for a while. "History has shown that all new technologies are misused before they find their niche," Larson says.
Certain vendors are taking portions of their research and development budgets and allocating them to solving the problems. In addition, such initiatives have spread out from vendor labs into standards organizations. The World Wide Web Consortium, for example, has assigned a working group to study the issue of adding emotional components to voice systems.
Yet one important question has remained largely unanswered: "While a company could make a PC sound surprised, is that something that it really should do?" Avaya’s Matula asks. In other words, how will users react to emotive voice systems? Quite frankly, no one really knows—at least not right now.
Because of that uncertainty, not everyone is jumping on the emotive speech bandwagon. Some suppliers and customers are willing to let others be the industry’s guinea pigs. In addition, certain vendors are finding it difficult to build a business case for tackling the vexing questions, and some have moved the work to the back burner. "Adding expressive speaking to our products is not a high priority for us right now," Nuance’s Faulkner admits.
The expressive speech work is relatively new and a great deal more research and testing are needed. There is no guarantee that the tests will provide the desired results. Given that, Nuance is concentrating on features that are simpler to understand and offer a quicker payback, such as improving usability and aligning these systems more closely with a company’s business rules.
Others think a payback from adding emotions to speech systems is self-evident. "By incorporating emotive speech into voice systems, human-machine interactions may become more appealing, more natural, and more effective because users will find it easier to interact with a well-mannered machine," Loquendo’s Pautasso concludes. The more well-mannered a machine is, the less likely it is that the user will opt out to talk with an agent.
Time will tell which position is correct. Estimates of how long it will take the speech synthesis industry to address all the emotive speech challenges range from a couple of years to a handful to forever. So for now, users will have to interact with machines that apologize but do not really sound like they mean it.