It’s Not the Recognizer
Speech-enabled applications can offer a wide array of benefits to an enterprise’s telephony environment. Reducing the cost of telephony operations, providing a consistent and high-quality caller experience, and enabling today’s mobile workforce are just a few of the ways speech recognition has changed modern business communications for the better.
That said, the success of any speech-enabled call routing solution is directly proportional to its ability to handle callers accurately and consistently. As success rates for servicing callers decline, the aforementioned benefits also begin to drop, effectively negating the value of the original investment.
Traditionally, blame for these declining success rates has been laid at the feet of a single system component: the speech recognizer. When a caller interacts with a speech-enabled call routing system and receives something other than the desired result, the immediate instinct is to assume the speech recognizer misinterpreted what was spoken. When reviewing the subsequent system metrics, system administrators often attribute poor application performance solely to poor recognition accuracy.
This perception, however, ignores the other vital components that comprise the speech application as a whole—components that have the potential to dwarf the impact of raw recognizer accuracy on the application’s performance.
Indeed, successful speech-enabled call routing solutions are made of many separate components that work together to produce the desired result: connecting the caller to the intended destination according to a spoken name or phrase. Components can vary slightly from one vendor’s product to another, but, in general, the five major components of a successful speech solution are:
1. The Speech Recognizer: The speech recognizer acts as the “interpreter” for the speech-enabled call routing system, matching a spoken name or phrase with an entry in the system’s directory. Often disparaged in popular culture (any number of comic strips poke fun at poor speech recognizer performance), the speech recognizer is, along with the dialogues, one of the two caller-facing components of the call routing solution.
2. The Grammar: The application component commonly referred to as the grammar helps the speech solution determine what the caller requested so that the call can be connected accurately. A well-formed grammar defines for the recognizer the expected words (vocabulary), the pronunciation of those words, and the grammatical structure of caller requests. For a call routing solution, the vocabulary includes the names of requested destinations. These are often drawn from a directory listing (the third component, described below).
To address pronunciations of the directory items, today’s speech solutions generally use “dictionaries” of common terms and names. No dictionary, however, can cover 100 percent of the entries for a speech solution, leaving some percentage of names with questionable pronunciations. Consequently, pronunciations need attention so that the recognizer can correctly match what the caller requested. Pronunciation also matters when the system confirms the request back to the caller. If a name is correctly recognized but mispronounced during confirmation, the caller could reject it as an inaccurate match.
Typical caller behavior makes the grammatical structure an important consideration as well. Programming the grammar to account for habits that complicate recognition, such as vocal pauses and common filler words (“like,” “please,” “umm,” and so forth), improves solution performance.
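As a rough illustration of the filler-word problem, the sketch below strips common fillers from an already-transcribed request before it is matched against directory names. It is a simplification: production grammars typically account for fillers inside the grammar itself rather than in a clean-up step afterward, and the FILLERS set and strip_fillers function are hypothetical names chosen for this example.

```python
import re

# Hypothetical filler list; a real deployment would tune this from call data.
FILLERS = {"um", "umm", "uh", "like", "please"}

def strip_fillers(utterance: str) -> str:
    """Lowercase the utterance, drop punctuation, and remove filler words."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return " ".join(w for w in words if w not in FILLERS)

print(strip_fillers("Umm, John Smith, please"))  # prints "john smith"
```

Note that a blunt list like this would also discard a legitimate use of a word such as “like,” which is one reason the handling belongs in a carefully designed grammar.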
3. The Directory: The importance of developing and maintaining a highly accurate directory cannot be overstated. A focused effort must be made to minimize gaps between what a caller might request and what destinations are actually available in the directory. Identifying and including every destination a caller might request (employees, departments, contractors, vendors, etc.) helps close this gap, increasing the likelihood that a caller will reach the intended destination and use the speech solution for future calls.
In addition, the directory’s content needs ongoing attention to combat the effects of churn within any large enterprise. Churn refers to the dynamic changes within an enterprise call directory due to employees joining or leaving the organization, employee name changes (married versus maiden names, etc.), office changes, local area code exchange changes, physical site relocations, etc. It is estimated that the amount of churn within the typical enterprise’s call directory can average as much as 40 percent annually. Without a stringent protocol for addressing churn, performance and usage of the speech solution will quickly begin to wane.
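A little arithmetic puts that estimate in perspective. The sketch below assumes a hypothetical 10,000-entry directory and spreads the 40 percent annual churn evenly across the year, which no real organization will do exactly; the function name and the directory size are illustrative only.

```python
def expected_changes(entries: int, annual_churn: float, periods_per_year: int) -> float:
    """Average number of directory changes per period, assuming churn is
    spread evenly across the year (a simplifying assumption)."""
    return entries * annual_churn / periods_per_year

# Hypothetical 10,000-entry directory at the 40 percent annual estimate.
per_year = expected_changes(10_000, 0.40, 1)
per_week = expected_changes(10_000, 0.40, 52)
print(f"{per_year:.0f} changes per year, about {per_week:.0f} per week")
```

Under those assumptions, a directory left untouched for a single quarter would fall roughly 1,000 changes behind.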
4. The Routing Table: The routing table is key to the successful transfer of a phone call. Even if the dialogue is properly set up, the directory has been scrubbed, and the destination is identified by the speech recognizer, the call will not be routed to the desired destination if the routing table has been set up incorrectly.
Not only does the routing table tell the call routing system which numbers to dial (and in what sequence) when a certain name or department is spoken, but it also ensures the transfer process conforms to the protocol required to correctly use the dialing pattern.
The routing table is also used to direct the system to the least-cost-routing dialing pattern for each destination, providing considerable cost savings to an enterprise’s telecom infrastructure during the year. In addition, the routing table allows different PBXs (or PBXs that don’t have the same software release) that could each require different call transfer protocols to share the same directory content and coexist within the call routing solution.
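One way to picture the routing table is as a lookup from a directory entry to an ordered set of dialing options, each tied to the transfer protocol its PBX requires. The sketch below is only an illustration of that idea, not any vendor’s actual schema: the Route fields, the PBX names, the protocol labels, the costs, and the phone numbers are all invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    dial_string: str  # digits the system sends to reach the destination
    pbx: str          # which PBX carries the call
    cost: float       # relative cost, used for least-cost ordering

# Hypothetical transfer protocol required by each PBX.
TRANSFER_PROTOCOL = {"pbx_hq": "blind", "pbx_branch": "supervised"}

# Hypothetical routing table: one directory entry, two ways to reach it.
ROUTING_TABLE = {
    "john smith": [
        Route("918005550100", "pbx_hq", cost=0.05),
        Route("4100", "pbx_branch", cost=0.00),
    ],
}

def plan_transfer(destination: str) -> list[tuple[str, str]]:
    """Return (dial_string, transfer_protocol) pairs, cheapest route first."""
    routes = ROUTING_TABLE.get(destination, [])
    if not routes:
        raise LookupError(f"no route configured for {destination!r}")
    cheapest_first = sorted(routes, key=lambda r: r.cost)
    return [(r.dial_string, TRANSFER_PROTOCOL[r.pbx]) for r in cheapest_first]

print(plan_transfer("john smith"))
# prints [('4100', 'supervised'), ('918005550100', 'blind')]
```

Even in this toy version, a correctly recognized name still fails if its entry is missing from the table or points to the wrong protocol, which is the failure mode described above.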
5. The Dialogues: The right dialogue is important to the caller experience because it guides the caller through a successful interaction with the system. The dialogue needs to give specific instructions about what the caller should request (first and last name, name of the department, state and ZIP code, etc.) to ensure the request matches the way the destination is listed in the directory. Without these instructions, the caller is more likely to phrase the request in an unexpected way, and the request will not match a destination in the directory.
The dialogue has to be specific and give instructions without being too lengthy; otherwise, the caller will become frustrated and simply zero out to the operator or hang up. This lowers the connection rate, and callers will be less likely to embrace the speech-enabled call routing solution for future calls.
Simply blaming the speech recognizer for poor system performance makes many assumptions concerning the other components that make up the solution as a whole. These assumptions include:
• The directory is 100 percent accurate, containing every possible destination within the enterprise at any given time;
• The routing table is correctly programmed to successfully connect every caller to every destination; and
• The dialogues are brief enough to encourage system use, yet are detailed enough to provide the level of information each caller requires for a successful connection.
Of course, these assumptions are unrealistic in real-world applications. It is nearly impossible to predict and address the various influences that can cause speech-enabled call routing solution errors before they can negatively impact performance. It is possible, however, to understand where these errors might originate, actively monitor the system to correct issues, and thus mitigate their impact on overall system performance.
While the speech recognizer has unfairly borne the brunt of the blame for system errors, it is actually not the most error-prone component of typical speech-enabled call routing solutions. During the past 14 years (and with hundreds of millions of calls connected), Parlance, which provides voice communications solutions, has found that for the majority of installed customers the sources of application errors can be ranked, from most to least frequent, as follows:
• the directory;
• the grammar (structure and pronunciations);
• the speech recognizer;
• the dialogues; and
• the routing table.
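An administrator who wants to see how a particular installation compares could tally reviewed call failures by the component judged to be at fault. The sketch below shows one simple way to do that; the tags and counts are invented for illustration and are not Parlance’s data, and rank_error_sources is a hypothetical helper name.

```python
from collections import Counter

# Invented sample: each failed call tagged with the component judged at fault.
failed_calls = [
    "directory", "grammar", "directory",
    "recognizer", "directory", "grammar",
]

def rank_error_sources(tags: list[str]) -> list[tuple[str, int]]:
    """Return (component, failure_count) pairs, most frequent first."""
    return Counter(tags).most_common()

for component, count in rank_error_sources(failed_calls):
    print(f"{component}: {count}")
```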
Now that we have taken a closer look at the potential error sources in the other components of the speech-enabled call routing solution, the bigger picture should be coming into focus. No single component is responsible for poor system performance. Rather, the lack of continuous maintenance of the system as a whole is what leads to performance degradation. Dialogues, directories, routing tables, grammars, and speech recognizers all exist in a dynamically changing enterprise environment. Without ongoing maintenance and fine-tuning, the accuracy (and thus performance) of the system will erode over time.
Joseph Maxwell is chief operating officer of Parlance.