NLP on Fertile Ground
Most voice user interfaces (VUIs) require the user to navigate a 20 Questions-style maze of very carefully crafted prompts. Once a caller answers a prompt in a way the system does not expect, it fails to recognize the request. This not only sends the system into error mode but can destroy any chance of the caller completing his task with ease. And that has been the primary reason callers complain that speech technologies don’t work.
Some call centers have tried to reduce the limitations inherent in these types of systems by adding a layer of natural language processing (NLP) that allows an application to understand a broader range of words and phrases. These types of applications can reduce—if not eliminate—many of the out-of-grammar and no-match inputs, but unfortunately, doing so is not as simple as just throwing in a How may I help you? opening prompt and letting the caller answer in whatever manner suits her.
Furthermore, adding such capabilities might not be what the caller needs or the call center requires. If, for example, the application is very simple and has to recognize only a handful of responses, the IVR could best be served with a directed dialogue or touch-tone application, according to Daniel Hong, senior speech industry analyst at Datamonitor.
Paradoxically, though, those setups have been where natural language has had its greatest successes. The best NLP systems so far have been written for very specific domains where there are a limited number of responses that people might give. Conversely, NLP systems have often failed "because the domain was too large and the number of reasons for people to be calling was too great," says Juan Gilbert, an associate professor of computer science and software engineering at Alabama’s Auburn University.
Given this mixed track record, the decision of whether to implement an NLP system—and if so, when—should not be made in haste. "Sure, you can get a lot more capabilities by adding natural language to an application, but there is a cost, and you have to weigh that versus the cost of not doing it," adds Charles Galles, a speech scientist and solutions architect at Intervoice.
Industry consultants and vendors, therefore, suggest that as a first step before rolling in a natural language application, speech user companies should look at their needs and expectations. "You need to look at your own business rules, document what kind of business it is, what you want to share with your customers, and what information you want to capture from them," says Peter Trompetter, vice president of global development at GyrusLogic, a Phoenix-based firm with several patents in NLP development.
Roberto Pieraccini, chief technology officer at SpeechCycle, says the biggest need for natural language from a business perspective might be "when you cannot express your menu choices any other way" or when customers can get confused about their answers to a basic prompt. As an example, he cites a big-box retailer’s call center where a customer calls looking for a car stereo. "He does not know which department he needs. Is it in automotive, electronics, or appliances?" he says. "With natural language, it doesn’t matter."
The caveat here is that businesses also need to determine "is the customer ready for [natural language], and does he really need it?" Galles says. "If he’s doing things effectively without it, it may not be worth the effort."
Not So Easy
In most cases, that effort can be considerable, causing even the most intrepid VUI designers and consultants to wince. That’s because development of such applications is remarkably complex. To allow for a natural language response to a single question, a developer first has to capture thousands of possible utterances that callers might provide so that they can be programmed into the system. Conservative estimates place the number of possible utterances at 20,000, though some have gone as high as 40,000 to 50,000 or more.
"As a designer or programmer, it’s easy to get overwhelmed," says Robin Springer, president of Computer Talk, a consulting firm specializing in the design and implementation of speech recognition and hands-free technology. "You’re playing the role of a mind reader because you’re trying to determine what so many people, with all their different ways of saying things, will mean when they talk to a system."
It’s almost an impossible task, according to Luis Valles, principal, cofounder, and chief scientist at GyrusLogic. "It’s like a chess game because it’s so difficult to anticipate all the possible things that someone might say and do."
Gilbert says the problem is rooted in language itself. "If you look at the scope of the English language, for example, there are a lot of words and a lot of different ways to say the same thing," he says. To be effective, a speech processor must be able to recognize them all and take the appropriate action based on the parameters given.
Gathering data about what people might say in response to a particular prompt typically involves playing simulated prompts for potential customers and recording their answers. Some veteran call centers that have amassed a large library of call recordings can avoid this step but might still need to transcribe all the recordings to come up with actionable call history information.
In either case, compiling all the data needed is a time-consuming task that can involve many people and carry considerable costs. "It’s not that one person can pull together all the utterance data. It’s a collaborative effort," Galles explains.
"It gets big and complex. It takes a lot of effort to get complete coverage of all that people might say, and it’s very expensive, so don’t enter it lightly," advises Jim Larson, an independent consultant and VoiceXML trainer.
Once the data is collected, other steps in the process include transcribing each utterance, determining what each caller could mean with it, and then annotating or tagging each response. Automatic classification systems then use statistical models derived from a corpus of typical responses to map the utterances into several predefined categories. Those categories and subcategories—which can number anywhere from 80 to 100—then form the basis for deciding where calls get routed within the application.
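As a rough illustration of that last step, the sketch below trains a small statistical classifier to map utterances to routing categories. The category names and training examples are invented, and scikit-learn stands in here for the much larger corpora and vendor-specific tooling used in practice.

```python
# Minimal illustration: map annotated caller utterances to routing categories
# with a statistical classifier. Categories and training examples are invented
# for this sketch; real deployments train on tens of thousands of transcribed,
# hand-tagged utterances.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny stand-in for the annotated corpus: (utterance, routing tag).
training = [
    ("i want to pay my bill",               "billing"),
    ("there's a charge i don't recognize",  "billing"),
    ("my internet keeps dropping",          "tech_support"),
    ("the cable box won't turn on",         "tech_support"),
    ("i'd like to order a movie",           "order_ppv"),
    ("can i get the boxing match tonight",  "order_ppv"),
]
texts, labels = zip(*training)

# Bag-of-words features plus a simple statistical model stand in for the
# SLM/SSM machinery described in the list that follows.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["i think my bill is wrong"])[0])   # expected: billing
```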
Several methods exist for categorizing and tagging the utterances. They are:
• Statistical modeling, which takes apart each utterance to identify relevant words or phrases and estimate the probabilities of various linguistic units within those phrases. Though there are a number of different modeling types, the most common are statistical language modeling (SLM) and statistical semantic modeling (SSM). In both, a large training corpus of recorded speech is required, though some systems can model utterances that have not been observed in the training corpus.
• Grammars, which need to be written to look for very specific information presented in a very specific way. A grammar is the predefined set of words and phrases that establishes what the application is listening for when interacting with the caller; together with linguistic and statistical models, it puts boundaries around what the application can understand. Grammars require inputting all the possible caller utterances beforehand; if the caller says something that is not contained within that grammar, the system fails to recognize it.
• Basic keyword spotting, which can be used to pick out single words or phrases within an utterance so that the application can act on them. It is the least precise of these methods, and most agree it should be avoided; Larson calls it "the poor man’s alternative to building grammars or statistical models" because it’s not very accurate. (A toy comparison with grammar matching appears in the sketch after this list.)
• Robust parsing, a supporting technology similar to keyword spotting that automatically pulls out meaningless words, phrases, and pauses to get to the important parts of the utterance. It is a process of examining input strings and identifying and separating their most basic sentence components (i.e., subject, verb, object).
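The toy sketch below contrasts the two extremes: a strict grammar, which fails on anything it did not anticipate, and keyword spotting, which is looser but less precise. The phrase list, keywords, and caller utterance are invented; real grammars are typically written in standard formats such as SRGS and cover far more variation.

```python
# Toy contrast between a strict grammar and keyword spotting.

GRAMMAR = {            # every accepted phrase must be enumerated in advance
    "order a movie",
    "order pay per view",
    "i want to order a movie",
}

KEYWORDS = {"order", "movie", "pay per view"}   # looser, less precise

def grammar_match(utterance: str) -> bool:
    """In-grammar only if the exact phrase was anticipated."""
    return utterance.lower().strip() in GRAMMAR

def keyword_spot(utterance: str) -> set[str]:
    """Return any keywords found anywhere in the utterance."""
    text = utterance.lower()
    return {kw for kw in KEYWORDS if kw in text}

caller = "yeah um could you order me that new movie"
print(grammar_match(caller))   # False -- out-of-grammar, a no-match in the IVR
print(keyword_spot(caller))    # contains 'order' and 'movie' -- enough to guess intent
```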
Opinions differ greatly regarding the best way to tag caller utterances. With each methodology, "you have to look at what trade-offs you’re willing to make," says Deborah Dahl, principal at consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. Grammar-building, she says, is more precise but not as flexible, while the newer statistical modeling approach "is very flexible and gives the user a lot more freedom in saying what the problem is, but doesn’t work as well if you are looking for only very specific information."
To that end, Dahl and many of her colleagues recommend combining approaches to get the benefits of both. Whereas this might not have been possible a few years ago, advances in artificial intelligence programming and the application and platform programming tools supplied by systems vendors have changed that. Many of these tool sets, which had previously been proprietary, are now moving to open, standards-based VoiceXML formats for easier use and integration.
"Now some of the new technologies allow you to run different modeling approaches at once," Galles says. "This makes things easier because you can analyze things from different perspectives at the same time."
But no matter how the data is gathered and tagged, it still remains a very labor-intensive process. And "once everything is set up, people are reluctant to change anything because they’ll have to recollect and retag everything," Larson says. "Any change to the original corpus or the categories derived from them could require repeating the entire process all over again."
Like so many in the industry, Larson welcomes efforts to automate the specifications of grammars, models, and anything else that will allow programmers and developers to standardize and reuse grammars that are frequently repeated in just about every application. "That way, you wouldn’t have to rewrite a grammar each time you wanted to collect a date or a phone number," he says.
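A hypothetical sketch of the kind of reuse Larson has in mind: common grammars such as dates and phone numbers defined once in a shared library and applied by name wherever they are needed, rather than rewritten for every application. The patterns and function names below are illustrative only, not any vendor's API.

```python
# Hypothetical shared-grammar library: define common grammars once and
# reference them by name from any prompt in any application.
import re

SHARED_GRAMMARS = {
    # Crude patterns, for illustration only.
    "phone_number": re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b"),
    "date":         re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def recognize(utterance: str, grammar_name: str) -> str | None:
    """Apply a shared grammar by name and return the matched value, if any."""
    match = SHARED_GRAMMARS[grammar_name].search(utterance)
    return match.group(0) if match else None

# Two different prompts in two different applications reuse the same grammars.
print(recognize("you can reach me at 555-867-5309", "phone_number"))
print(recognize("i'd like to fly on 12/24/2025", "date"))
```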
Given the technology’s current constraints, however, most consultants, analysts, and systems vendors recommend a gradual transition from directed dialogue to NLP. "You do not need to rip [the existing system] out entirely," GyrusLogic’s Trompetter says. "You can grow into it if you want. You may only want to plug it into [frequently asked questions] or numbers collection."
Jeff Foley, senior manager of solutions marketing at Nuance Communications, stresses that directed dialogue and NLP do not have to be mutually exclusive. "In an ideal world, every question would be asked in a directed dialogue format, with NLP in the background for the answer," he says.
Mixed Initiative
That type of system might still be a few years away, but there’s no reason that a system couldn’t allow for a natural language response and still use directed dialogue as a backup. With directed dialogue "as a fallback, you can give people some examples of what they might want to say if they get stuck," Foley adds.
By going with that kind of a system, you can also "ease callers into natural language," Springer says. "Naturally, there would be some hesitation on their part to adopt the technology by itself all at once."
SpeechCycle offers a system that does just that. Its call center solution combines natural language, which lets callers articulate issues in their own words, with directed dialogue for guidance when necessary. This is done through its Dynamically Adaptive Response Technology (DART), which recovers from low-confidence recognition in natural language applications by confirming with callers at least some of what they are trying to convey.
For example: When a caller dials into a cable company to order a pay-per-view movie, he is likely to say something like Order a PPV. Should that generate a low-recognition score, the system’s directed dialogue can kick in with a reprompt that says I understand you would like to place an order. Would you like pay-per-view or on-demand? From there, the dialogue can proceed as normal.
"We know speech recognition can make a mistake, but we can reduce the effect of the mistake by acting on what we do know from the utterance and directing the caller to a more general call path," SpeechCycle’s Pieraccini explains.
Another advantage of a system like this is that it takes some of the added pressure off the developers to plan for every word. "You’ll never get 100 percent, but at least you can get to a balance where the system is usable, where people get some use out of it," Springer says. "You can get a handle on one small part and then build on it."
Making It Easier
Still, most of the developments in natural language are being driven by technology vendors in an effort to make their solutions more palatable to developers and programmers. Intervoice, for example, has been working on a process that can significantly reduce the number of caller utterances that need to be analyzed during the tagging process.
With this, "you can do more and generalize based on a smaller data set," Galles explains. "You can start with 1,000 [utterances] and build patterns. From that small sample set, you can make generalizations about the rest."
A basic premise behind this concept is that in a typical IVR, one can expect 15 percent to 20 percent of callers to be dialing in for the same reason. The problem, though, is that "about half the utterances tend to be unique, and those are the ones that give developers the most heartache," he says.
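One rough way to picture that kind of generalization from a small sample: group utterances that share the same content words, treat the large groups as the common call reasons to generalize from, and flag the one-offs for hand review. The stopword list and sample utterances below are invented and stand in for far more sophisticated vendor tooling.

```python
# Group a small sample of utterances by their content words; big groups are the
# common call reasons, singletons are the unique utterances Galles mentions.
from collections import defaultdict

STOPWORDS = {"i", "to", "my", "the", "is", "a", "please", "need", "want"}

def signature(utterance: str) -> frozenset:
    """Reduce an utterance to its content words."""
    return frozenset(w for w in utterance.lower().split() if w not in STOPWORDS)

sample = [
    "i want to pay my bill",
    "pay my bill please",
    "need to pay the bill",
    "my internet is down",
    "internet down",
    "do you sponsor the little league team",
]

groups = defaultdict(list)
for utt in sample:
    groups[signature(utt)].append(utt)

for sig, utts in groups.items():
    label = "common pattern" if len(utts) > 1 else "unique -- needs hand review"
    print(f"{label:28s} {sorted(sig)}")
```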
GyrusLogic may have found a way around that problem. Its latest U.S. patent, which it received late last year, covers a "conversational dialogue and natural language understanding system and conversational dialogue application development methodology." The methodology, contained within its Platica Conversational Dialog solution, is "dialogue built on the fly," allowing users to interact with the system just as they do during a normal conversation, Valles says.
"With every speech application out there, [developers] are trying to anticipate every possible question and response that people will throw at it, and then branch it to a specific point," he continues. "Most transactions today are so diverse that you can’t prebuild a dialogue. Answers are so specific to the pieces of information that need to be identified and the way that people talk.
"With Platica Version 2.6, there’s no need to build grammars and dialogues to handle it because they get generated automatically," he says.
Trompetter explains further that the application’s artificial intelligence is sophisticated enough to "contextually understand what a caller wants and build the dialogue around that, and to interchange between ad-hoc and transactional questions at any point in time because you never know what the caller’s next move will be."
That means the approach can significantly reduce the amount of time and effort it takes to develop an NLP application. "If you’re not thinking about dialogues and grammars, you take 90 percent of the work out of putting a system together," Valles continues.
In the end, though, with this or any other natural language application, "the challenge is getting people to think away from menus and menu trees," Trompetter adds. "They’re not really important because the system should be able to understand and act as normally as two people would during a conversation."
Nuance’s Foley breaks it down to even simpler terms. The bottom line with any system, he says, is that getting it to work means first "intelligently designing the VUI and then beefing up the system’s memory to handle it."
Serving Customers Beyond the Call Center
While only a fraction of IVRs currently use natural language technology, many in the industry expect user demand for higher customer care standards to spur a proliferation. The use of natural language interfaces is already increasing, and Daniel Hong, senior speech industry analyst at Datamonitor, predicts a compound annual growth rate of 50 percent for the next few years.
And although call centers have been the largest beneficiaries of all the research on natural language so far, analysts, consultants, and vendors all see the technology extending far beyond the customer self-service realm.
"Natural language is the next great frontier for speech recognition as far as accuracy is concerned," says Jeff Foley, senior manager of solutions marketing at Nuance Communications. "Speech systems in general are getting a lot smarter because they are able to understand a lot more of what people say."
Because of that, "natural language will advance more outside the call centers than within," predicts Deborah Dahl, principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. "In a call center, people usually have pretty specific goals and are not looking at a broad range of things."
For Dahl and many others, data mining, voice search, directory assistance, and location-based services are key areas that will really benefit from natural language. "If you have a very broad area that you can search, it encompasses every way that you can ask for a business," she says. "And with GPS, it’s a great opportunity for the flexibility of natural language because there are so many things that people can ask and so many different ways to ask for directions, addresses, etc."
The technology is also being applied to machine translation, text-to-speech, speech-to-text, dictation, and Web chat. "Ultimately, all speech engines will support it," Hong says. "It will be included in all engines as the price comes down."
That will be possible, he says, because many technological hurdles to NLP in the past are being overcome quickly. "The first iterations of NLP weren’t that good," he explains, "because of performance issues, hardware concerns, and [system] memory limitations. But now, it’s been in existence for four or five years, and we’ve reached a balance of user experience."
"Based on the current speed of CPUs and systems, you’re going to see more uses for [NLP] and more that you can do with it," says Charles Galles, a speech scientist and solutions architect at Intervoice. "The technology’s been developing for a while, but now we are able to get it out in unique and exciting ways."
Operational Savings with NLP
According to research by GyrusLogic, call centers can cut average call times by 51 seconds with a natural language system.
The company’s research found that the average call through a standard directed dialogue system lasts 1 minute and 48 seconds; with conversational dialogue systems in place, those same calls were cut to just 57 seconds.
Financially, it costs the call center a lot less "because people don’t have to stay on the phone as long," says Peter Trompetter, vice president of global development at GyrusLogic.
The savings can be great, even in applications that use only a partial natural language solution. For example, a company that receives 50,000 calls a day and initially deploys a conversational dialogue system covering 20 percent of its call volume stands to save about 3.1 million minutes in its first year, GyrusLogic officials concluded.
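Those figures follow from straightforward arithmetic, assuming a 365-day year of call traffic; the short calculation below reproduces the estimate.

```python
# Reproducing the GyrusLogic arithmetic: 20 percent of 50,000 daily calls
# handled conversationally, each shortened by 51 seconds (1:48 down to 0:57).
calls_per_day = 50_000
conversational_share = 0.20
seconds_saved_per_call = 108 - 57          # 1:48 minus 0:57

daily_savings_min = calls_per_day * conversational_share * seconds_saved_per_call / 60
annual_savings_min = daily_savings_min * 365   # assumes a 365-day year

print(f"{daily_savings_min:,.0f} minutes saved per day")     # 8,500
print(f"{annual_savings_min:,.0f} minutes saved per year")   # ~3.1 million
```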