Passing the Test
Speech applications represent a sizable resource investment for organizations deploying them. The investment is worthwhile because speech holds the promise of handling more calls without the intervention of a live agent, as well as providing a more personable and pleasant interaction for customers. Organizations therefore want and expect to begin reaping these benefits as soon as a speech application is released, but many are disappointed by technical and usability issues that arise after deployment. The way to avoid such disappointment, and the consequent delays in realizing the value of speech, is to test speech applications thoroughly throughout their lifecycles. Testing can be a hard sell for speech project champions who are eager to get new systems in place as quickly as possible. However, the time and effort spent evaluating speech applications are significantly less than the cost of going live with a problematic application and repairing both the system and the damage to customer relationships.
TYPES OF TESTING
There are many different types of testing available to assess speech applications, and it can be difficult to understand the benefits and limitations of each. One helpful way to partition the testing space is to categorize methodologies as user tests or system tests. User tests focus on the behavior and opinions of individual end-users of the application: users' ability to successfully interact with the application in order to accomplish tasks, as well as users' opinions about the interactions. System tests focus on the behavior of the speech application itself. They may require user input, but their goal is not to analyze this input; instead, they focus on the response of the application to that input.

Both system and user testing are critical for every speech application because each provides unique evaluative data. System testing is important because of the complexity of many speech applications. Shortcuts, universal commands and multi-slot recognition are features that both enrich and complicate applications. Understanding how an application will respond under a wide variety of circumstances is vital to realizing its value proposition. Similarly, the significance of user testing cannot be ignored. Remember, speech technology is valuable only if users have an efficient, effective and satisfying experience interacting with the applications. User testing provides a method for evaluating these factors. Let's consider both user and system testing in detail.
PREREQUISITE TO SUCCESS
Consistently high performance is a prerequisite for a successful speech application. There are many varieties of system testing that can verify the soundness of an application throughout the development lifecycle. Specification testing confirms whether an application performs according to plans determined early in the project. Checking for adherence to specifications is useful during development, but can be even more critical once an application is fully coded and ready to be deployed. Many sets of hands will likely have touched the project between specification and implementation, and this testing can ensure that the application is meeting its original goals. Performance testing delivers information on recognition success rate as well as host and database connectivity issues. The two main tasks of a speech recognition application are to decode the incoming speech signal and then perform the function specified in the input, which generally involves interfacing with a backend data system. These tasks are distinct, and it is important to diagnose the source of problems as either misrecognitions or application errors. Stress testing examines the behavior of the system under heavy usage conditions. Speech can be resource-intensive, so it is vital to understand system requirements in order to plan for peak usage. System testing requires participation from both the organization deploying the speech application and the one developing it. It may be conducted internally, but many organizations opt to engage outside firms who specialize in system testing (see sidebar).
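As a concrete illustration of stress testing, the sketch below ramps up concurrent simulated callers and reports response latencies at each load step. It is a minimal, hypothetical harness, not any particular vendor's tool: place_test_call() is a stand-in for whatever actually drives a scripted call into the application (a telephony test rig, for example), and here it simply sleeps for a random interval.

```python
# Minimal stress-test sketch: run batches of concurrent simulated calls
# and summarize response latency at each load step.
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def place_test_call(call_id: int) -> float:
    """Hypothetical stand-in for one scripted call; returns latency in seconds."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.1, 0.5))  # simulates the call's round trip
    return time.perf_counter() - start

def run_load_step(concurrent_calls: int) -> None:
    # Launch all calls at once to approximate peak concurrent usage.
    with ThreadPoolExecutor(max_workers=concurrent_calls) as pool:
        latencies = sorted(pool.map(place_test_call, range(concurrent_calls)))
    print(f"{concurrent_calls:4d} callers: "
          f"mean {statistics.mean(latencies):.3f}s, "
          f"p95 {latencies[int(0.95 * len(latencies))]:.3f}s")

if __name__ == "__main__":
    for load in (10, 50, 100):  # step up toward the expected peak
        run_load_step(load)
```

In a real stress test, the latency distribution at each step would be compared against the response-time targets set for peak usage, and the load would be increased until the system's breaking point is found.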
KNOW THY USER
User testing provides key information on the ultimate success of a speech application, and can be conducted throughout the application lifecycle. Performance, as addressed by system testing, is a baseline measure, revealing simply how the application will respond to various types of user input. Only user testing can reveal what the actual form of user input will be and how users will react to the application. A wide variety of techniques can be classified as user testing, including usability testing and customer satisfaction testing.

User testing yields two basic types of data: behavioral and opinion. Behavioral data reflects how the user responds to the application and can be assessed using objective measurements such as task completion rate, error rate and error severity. These measures reveal which tasks users are able to complete successfully and which are problematic. It is also possible to obtain a measure of how quickly and easily users are able to complete their tasks by looking at time on task or by using a metric that captures the number of attempts required to complete each task (such as Intervoice's Usability Grade). Performance benchmarks, in the form of task completion or error rates for certain critical tasks, can be set in advance; user data can then be collected to determine whether these benchmarks are being met. Behavioral data is gathered by careful analysis of user responses at each state in the application.

Opinion data forms an important counterpart to behavioral data, describing users' reactions to their interaction with the application. Although there is a connection between system performance and opinions, the two are not always tightly coupled. Depending on their previous experiences with speech technologies, users may have varying expectations about the abilities of the application. Thus, users may be pleased with less-than-perfect performance in certain situations and extremely critical of errors under other circumstances. Surveys, interviews and observation can be used to elicit opinion data.
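To make the behavioral measures above concrete, the sketch below scores a handful of logged test sessions for task completion rate and mean errors per task, then checks each task against a pre-set benchmark. The session records, task names and benchmark values are invented for illustration; a real test would draw them from the test plan and the session logs.

```python
# Sketch of behavioral-data scoring: completion rate and mean errors per
# task, compared against benchmarks set before testing began.
from collections import defaultdict

# Each record: (task, completed?, errors observed during the task)
sessions = [
    ("check_balance", True, 0),
    ("check_balance", True, 1),
    ("transfer_funds", False, 3),
    ("transfer_funds", True, 1),
]

# Illustrative completion-rate targets for critical tasks.
benchmarks = {"check_balance": 0.90, "transfer_funds": 0.80}

by_task = defaultdict(list)
for task, completed, errors in sessions:
    by_task[task].append((completed, errors))

for task, results in by_task.items():
    completion = sum(c for c, _ in results) / len(results)
    mean_errors = sum(e for _, e in results) / len(results)
    status = "PASS" if completion >= benchmarks[task] else "FAIL"
    print(f"{task}: completion {completion:.0%}, "
          f"mean errors {mean_errors:.1f} -> {status}")
```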
EARLY AND OFTEN
Early and often is the answer to the question of when to obtain user feedback on a speech application. Unlike system testing, which is limited by the current state of the system, user testing can take place at any time in the design and development process. Using Wizard of Oz techniques, it is possible to gather user data even before there is an actual application. Users are asked to interact with and offer their opinions on a mock-up of the proposed application. These mock-ups can vary significantly in sophistication and in the degree to which they mimic interaction with the actual system. At the lowest level of fidelity, there is no application at all; users hear a series of recorded prompts that are played on cue by a tester. More realistic mock-ups can be enabled with speech recognition capabilities for a truer sense of how the production system will perform. Wizard of Oz tests can be planned and conducted quickly and at low cost, making them particularly valuable early in development.

Formal usability testing is a user testing method that can be conducted on a completed or prototype version of an application. When the full application is used in usability testing, the data collected will be of very high quality; however, it can be time-consuming and costly to make changes to the application based on the results. Prototypes can be coded much more quickly than full applications, and allow testing to occur early, when it is less costly to make changes. Note, however, that prototypes for formal usability testing, unlike mock-ups used in Wizard of Oz testing, must faithfully reproduce the user experience of the full application. This makes prototypes a more significant resource investment up front, but the investment is worthwhile because user data collected in interaction with a realistic prototype tend to be more predictive of users' real-world reactions.

Customer satisfaction testing provides a way to assess whether an application is meeting the expectations of those who use it. It is conducted on a full version of an application, under realistic circumstances. Rather than looking for problems in the application, customer satisfaction testing is focused on gauging customers' reactions to it. Either actual customers or individuals who share relevant background characteristics with the customer base can be used for satisfaction testing. In this method, customers are typically asked to complete one or more calls into the application and then answer a series of questions, either in a written survey or in an interview. These questions may ask users to rate the application itself, or to assess how their interaction with the application affects their perception of the organization deploying it. Customer satisfaction testing is vital because the user experience can have wide-ranging effects on a customer's overall opinion of a company and his or her likelihood to do business with the company in the future. It is generally conducted prior to a full release of an application, but it is also useful to reassess satisfaction periodically once an application is deployed, to get a sense of users' changing perceptions as they become accustomed to the application.

User testing is most valuable when it is built in as a part of the development plan from the start of a project. Therefore user testing is often most convenient and efficient when conducted in-house by the organization developing the speech application. Customer satisfaction testing is an exception, because it occurs toward the end of a project and there are a number of firms expert in this type of testing.
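Opinion data from satisfaction surveys lends itself to equally simple aggregation. The sketch below averages per-question ratings from a post-call survey on a 1-to-5 scale; the questions and responses are invented for illustration, and a real study would also segment results by caller background and track them across releases to watch perceptions change over time.

```python
# Sketch of opinion-data aggregation: mean rating per survey question.
from statistics import mean

# Each dict is one caller's post-call survey (1 = poor, 5 = excellent).
responses = [
    {"ease_of_use": 4, "would_use_again": 5, "company_impression": 4},
    {"ease_of_use": 2, "would_use_again": 3, "company_impression": 3},
    {"ease_of_use": 5, "would_use_again": 4, "company_impression": 5},
]

for question in responses[0]:
    scores = [r[question] for r in responses]
    print(f"{question}: mean {mean(scores):.2f} (n={len(scores)})")
```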
ONCE IS NEVER ENOUGH
However well conceived and expertly coded a speech application may be at the outset, it will always be improved by testing. Speech applications that perform well and provide a positive user experience rarely, if ever, resemble the designers' and developers' initial design. The question then becomes how much testing to include, and at which points in the development cycle. Determining how much testing is enough needs to be addressed early in the project for two reasons. First, it forces the project team to set performance and customer satisfaction goals for the application. Second, it forces the team to allocate the time and resources needed to evaluate whether the application has met these goals. "Are we done?" is a question that can become impossible to answer once a speech project is underway. There will always be tweaks that can boost performance and satisfaction numbers. The gains these incremental improvements provide must be balanced against the costs of additional testing. Committing to a rigorous program of iterative system and user testing has enormous benefits for organizations and their customers. Building system and user testing into speech projects ensures that applications will pass the test with project sponsors and customers.
Dr. Susan L. Hura is senior manager, VUI Services, for Intervoice. She can be reached at susan.hura@intervoice.com.