What's the Use?
It is not within human nature to interface comfortably with machines. Hollywood knows this. That’s why when humans interact with machines in the movies, they’re usually being murdered by them. Consider HAL 9000, the Terminator, or the replicants from Blade Runner. Of course, when humans speak with machines in real life, it’s usually over the phone via a speech-enabled interactive voice response (IVR) system. And the machine, with little opportunity to murder its user, simply irritates him to death.
Thus, the importance of usability testing cannot be overstated. But those tests are often the first thing to go when deadlines and budget constraints loom over a project.
And while there’s no set schedule for when and at what intervals designers should run usability tests, most designers agree that it should be done early and often.
"Do it as soon as you can and get it right as many times as you can," says Susan Hura, founder and principal at SpeechUsability, a voice user interface (VUI) design consulting firm.
It’s a sentiment with which everyone within the design community agrees. "But, when costs are involved, you really do have to apply testing where you get the most bang for the buck," says Dave Pelland, director of Intervoice’s Design Collaborative. Such discretion is important considering some projects run for a month, others for half a year or more.
Peter Leppik, CEO of VocaLabs, recommends clients plan on two or three tests, beginning with smaller studies that incorporate 50 to 100 people testing a mock-up IVR without speech recognition.
Finding Funds
Regardless of when tests are conducted, funding is almost always an issue. In recent years, though, Pelland has seen fewer dust-ups about it. As long as it’s clear what the vendor is doing and why, enterprise customers are more willing to release the money.
Yet others within the design community still find themselves scratching for usability funding and creative ways to loosen the hand that clutches the purse strings.
"I’d like to argue that [usability testing] is simply part of design," Hura says. She suggests instead of segmenting usability testing as a separate line item in the schedule, managers should simply integrate it throughout the design and development process. It’s a sentiment that Melanie Polkosky, an independent human factors psychologist specializing in VUI design, echoes. "I do also consider usability to be part of an iterative design process," she says. "So it’s always in there in that design phase."
VocaLabs provides panel-based research surveys to vendors developing speech IVRs. The vendor attaches its speech system to a phone network on a private number, and VocaLabs draws testers from its 70,000-person consumer panel to call into the system, accomplish a task, and fill out an online survey about the IVR’s functionality and ease of use. A small test of 50 to 100 participants costs roughly $5,000; a larger test of 500 participants costs around $10,000. "It’s not a huge piece of the budget," Leppik says, "but sometimes it’s a difficult sell because people don’t realize if they don’t fix the problem in the development stage, it’s more difficult to fix it later."
It’s also important for enterprises deploying a speech system to demand usability testing if the designer doesn’t bring it up. Occasionally designers themselves forget that they need it. "The biggest misconception is that people who build these things say, ‘I know how to use it so it must be easy to use,’" says Juan Gilbert, an associate professor in the Computer Science and Software Engineering Department at Auburn University and director of the school’s Human-Centered Computing Lab.
But just because a designer feels confident about her work doesn’t necessarily mean it will function properly in a live deployment. "Lots of designers are competent," Hura adds. "But it’s really, really hard to be objective. It’s the ‘my baby isn’t ugly’ phenomenon." Gilbert calls it arrogant design. "When you report to management, you give them a demo," he says. "The honest truth is when you demo something, it always looks easier." And when designers and developers fool themselves, this leads to poorly functioning IVRs, which in turn leads to poor customer service.
A January 2007 Harris Interactive Customer Experience Survey polled 750 North American consumers who had gone through a customer service organization in the past three months. The study discovered that 41 percent of consumers were dissatisfied with their customer service interactions and 72 percent of dissatisfied consumers ceased doing business with the offending enterprise. Yet 86 percent said they’d use an automated system if available, with most customers preferring speech over touchtone.
Ultimately, customers are willing to navigate a speech IVR, but if their experience is bad, they’re also very willing to desert the company that displeases them. For enterprises, it’s a decision: pay now by budgeting for copious testing, or pay later by losing the ROI.
Finding the Time
At first glance, the obvious time to test a speech project is at the end of development, like an author copy-editing the proofs of her novel. "But at that stage of the game it’s too late to make major changes to the system," Leppik warns. "You’ve already invested too much into building it. If you discover a problem and it’s going to take a lot to fix it—if it’s a matter of rerecording a prompt, no big deal—but if it requires you going back to rewrite code, you’re not going to have the time or the budget to do that."
In that regard, designing an IVR is less like writing a novel and more like building a house, which one constructs in phases by first pouring the foundation, then assembling the framework, and gradually adding to that skeleton. It’s important to test the integrity of the design at each stage.
"You want to do a formative evaluation," Gilbert says. "[That] means you develop an iterative approach. You test at the end of an iteration so you can go back and make changes. You can’t be too arrogant and confident. It’s very seldom you get it right the first time. You don’t want to lock yourself out." Finding a critical error after the coding and voice recordings have been layered over the IVR script would be like hammering the final nail into that house, only to realize the framework is rotted.
That’s one reason why Polkosky’s most important test takes place after the first draft of her script. "For me, I’d give up, and I have given up, all other testing to keep that one," she says. Polkosky reads her script aloud to a live audience, which gives her insight into fundamental questions: Is she hitting the important points? Is her prioritization of those points logical, and is the audience responding appropriately? Is the tone proper for the user group?
Polkosky performs all of this before voice work has even been recorded. "I’m usually fairly confident with most of my voice talent," Polkosky says. "But if you start off with a bad script, then most voice talent won’t be able to fix that up."
Last year, she wrote a script for the technical support line of an electronics company. Her opening prompt was simple: What product are you calling about? Yet, she was uncomfortable with its open-endedness. Would callers simply say television or stereo? Would they name a specific brand? Would they say 45-inch flat-screen LCD TV? Polkosky envisioned a nightmare IVR bloated with grammars. So before she wrote anything else she designed a usability test around the opening prompt to determine exactly what customers would say. To her relief, callers tended toward general terms, like TV, radio, or home stereo system.
"When I feel hesitant about a prompt or the sequencing of prompts, I’ll do [usability testing] a couple of times until I feel comfortable with how that sequence is functioning," Polkosky says. "I did another usability test when I had a more complete script finished and I did some monitoring after deployment. It was very consistent what people were saying. We added a couple of synonyms for that grammar here and there, but it really wasn’t this very complex grammar like what I was concerned might happen."
But Hura points out the tradeoff with usability testing: If the test occurs too early, designers will have the benefit of user feedback, but it might not be particularly predictive as the application is likely to change throughout the design process. "I have found in practice that doing one usability test early—meaning before any coding has been done—and using that as an evaluative test can shake out the big problems," Hura says. She refers to a test much like what Polkosky ran for her consumer electronics IVR: a Wizard of Oz (WOz) test in which there’s no speech recognition, just a human playing the part of the IVR.
Two Types of Tests
In addition to WOz testing, there is prototype testing. Both allow designers and developers to evaluate an IVR system in a vacuum, on its own merits. In a WOz test, a human tester plays the recorded prompts to a panel of around 100 people who are asked to navigate the prompts and complete a particular task.
For a banking application, for example, a third of the panelists might be asked to check their balances, another third to transfer money between accounts, and a final third to reconcile a hypothetical problem on their statements. Because there isn’t live speech recognition, the WOz test primarily considers the usability of the prompts. Afterward experimenters question participants on how well they felt the system worked and how they felt about the prompts.
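A rough sketch of how such a panel might be divided and questioned appears below; the task names, panel size, and post-task questions are illustrative assumptions, not any vendor’s actual protocol.

```python
import random

# Illustrative WOz test plan: split a panel evenly across tasks and record
# post-task impressions. Task names and questions are invented for this sketch.
TASKS = ["check_balance", "transfer_funds", "dispute_statement_item"]
POST_TASK_QUESTIONS = [
    "Were the menu options worded clearly?",
    "Did the prompts appear in a sensible order?",
    "How confident are you that you completed the task?",
]

def assign_tasks(panelists, seed=42):
    """Deal panelists into roughly equal thirds, one task per third."""
    rng = random.Random(seed)
    shuffled = panelists[:]
    rng.shuffle(shuffled)
    return {p: TASKS[i % len(TASKS)] for i, p in enumerate(shuffled)}

if __name__ == "__main__":
    panel = [f"participant_{n:03d}" for n in range(1, 101)]  # ~100 people
    plan = assign_tasks(panel)
    counts = {t: sum(1 for v in plan.values() if v == t) for t in TASKS}
    print(counts)  # {'check_balance': 34, 'transfer_funds': 33, 'dispute_statement_item': 33}
```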
"I used to be dead set against Wizard of Oz," Hura says. "The data is very, very, very unlike a real application. With the wizard playing the sound files, it’s very hard to be as unforgiving or stupid—let’s put it that way—as an IVR actually is." Consequently, a WOz doesn’t account for out-of-grammar utterances.
Additionally, Hura’s experience with WOz tests indicates that the human wizard initially responds much more slowly than an IVR. "There’s all sorts of variability that’s introduced that’s not present in an application," she says. "That’s what applications are good for: doing the same thing under the same circumstances. People just aren’t good at that."
However, Hura sees the value in WOz testing as a way to uncover major problems with menu options, terminology, or navigation. And because WOz doesn’t require a live speech back end, it takes less time to prepare.
Pelland emphasizes it’s the designer’s role to consider whether such a test would be beneficial. "If the design is for a new type of application we don’t have experience with or is breaking some new ground," he says, "we like to add WOz testing, paper prototypes, and other low-fidelity methods as needed to get the designer any and all input they can early on."
For Hura, a more predictive test than WOz is a prototype test that incorporates real speech recognition. After the initial design, coders take three to five days to bang out a quick and dirty prototype with real prompts and real speech recognition, but no functioning back end.
Panelists working a prototype banking IVR in this type of test would be given test account numbers and, as in the WOz test, asked to complete a certain task. Their hypothetical account balances can be hard-coded into the prototype’s back end to simulate a live environment. Because a prototype incorporates real prompts with real latencies from real speech processing, the results are more predictive than those derived from a WOz: designers will know if pauses are too long or too short, or if the application cuts off callers before they’ve completed their task. Prototype testing thus provides a more accurate reflection of the user interface quality.
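As a rough illustration of what "hard-coded into the prototype’s back end" can mean in practice, the sketch below fakes the data layer behind a banking prototype; the account numbers, balances, and intent names are invented, and a real prototype would sit behind a speech platform rather than plain Python.

```python
# Minimal sketch of a hard-coded prototype back end: real prompts and real speech
# recognition up front, canned data underneath. All values are invented.
TEST_ACCOUNTS = {
    "1001": {"checking": 1250.00, "savings": 4300.00},
    "1002": {"checking": 75.50, "savings": 980.00},
}

def handle_intent(account, intent, slots=None):
    """Return the text the prototype should speak for a recognized intent."""
    slots = slots or {}
    balances = TEST_ACCOUNTS.get(account)
    if balances is None:
        return "I'm sorry, I don't recognize that account number."
    if intent == "check_balance":
        b = balances.get(slots.get("account_type", "checking"), 0.0)
        return f"Your balance is {b:.2f} dollars."
    if intent == "transfer_funds":
        # No real transaction happens; the stub just confirms the request.
        return (f"Okay, transferring {slots.get('amount', 0):.2f} dollars "
                f"from {slots.get('from', 'checking')} to {slots.get('to', 'savings')}.")
    return "I'm sorry, I didn't get that."  # out-of-grammar fallback

if __name__ == "__main__":
    print(handle_intent("1001", "check_balance", {"account_type": "savings"}))
    print(handle_intent("1002", "transfer_funds", {"amount": 50, "from": "checking", "to": "savings"}))
```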
"I don’t know very many vendors that do that sort of prototype testing," Hura laments.
Pelland notes that prototype tests typically occur only when his team and the customer agree on a need to learn something about the design. "The cost here is often more noticeable on a project, and therefore a bit more collaboration is needed to determine if we should do this," he says.
Hura acknowledges that prototype testing is significantly more expensive, but points to an added benefit: "Coding up this working prototype is not throwaway work," she says. "The parts of the coding that dictate how user input is accepted and how prompts are played back out, that’s reusable. You can use it when you’re coding up the final application. Another nice benefit is that it gets the development staff involved in the project a little bit earlier and helps them be aware of the user interface issues."
Identifying the Where
Of course, speech applications are not predominantly used within the confines of a laboratory, especially with the proliferation of mobile phones. Intervoice, which conducts usability testing internally, adopts a two-phased approach: laboratory testing followed by a pilot system using live callers in their usual environments. Some scenarios, though, like navigating an IVR while hurtling through rush hour traffic, are too difficult or dangerous to test in real life, so testers recreate them in the lab. Gilbert mentions the often-neglected variable of rain on a windshield. "If you’re testing something for a car, you should take a decibel meter and measure the rain in a storm and use that in a lab," he says.
While this control over the environment has its benefits, it can also be costly to reserve the space. "You can’t afford more than a dozen experiments," Leppik says. "It becomes prohibitively expensive. You’re not getting the quantity of data and the diversity of experiences you need to find problems [in a lab]."
Hura disagrees. Remote testing can be a good way to save money, she concedes, especially if there’s no additional rent or travel involved. But unlike Leppik, she feels a lot is missed by not being able to form a rapport with the panelists. A lot can be learned from listening to a person’s voice, but it’s also good to see the user’s gestures and facial expressions as he navigates the IVR. Interactions with the test subjects themselves can also be beneficial. "You’d be surprised how people feel a sense of responsibility to try to give you good information when you bring them in," Hura says. "They’ll tell you their concerns and frustrations [with past customer service experiences], and you can turn them around and say, ‘Does this system make you feel that way too?’"
Hura calls that control her gold standard in usability testing. If two test subjects behave differently in a usability test, designers need to understand and explain those differences or they’ll be left with an unknown, unpredictable variable within the IVR.
Leppik argues that lab tests don’t expose the IVR to enough people. "I like to look at the success of a test in terms of how likely are you to actually discover a problem in the system," he says, noting that a problem that affects one out of every two callers will probably be identified in a lab, while less prominent problems might go undetected.
"If you’re doing a lab test with only a dozen people, the odds of finding a 25 percent problem are less than 15 percent. If we do a test with 100 people, we can reliably find problems that will affect 5 percent of the people. We’re able to find a greater proportion of the problems much earlier on in the development process this way," he says.
But simply knowing that problems exist within a system only reveals the cracks in a complicated design, not their root causes. Because of this, Pelland thinks a common benchmark like success rate isn’t that important. "We want to dig deeper for more honest results," he says. Intervoice maintains guidelines on speed and pacing, but these vary depending on the application. Instead of establishing a common measure, Pelland uses post-task questionnaires with multiple choice answers to consider user perceptions.
"We don’t collect specific statistics on this," he says, "since different interfaces can’t be compared directly." If a user, however, checks "Strongly Agree" next to a statement like "I thought the interface’s voice spoke too fast," the designers know what the problem is and how to fix it.
Gilbert shares this disdain for rigid statistical benchmarks. To counteract that rigidity, he and his colleagues at Auburn University last year created the Holistic Usability Measure (HUM), something he compares to a professor’s syllabus that breaks down the grading categories in a class. Under the HUM, clients are given 100 points to distribute among the metrics—such as user satisfaction, call completion time, and accuracy—they feel are most meaningful for their callers.
"We treat those like grades and come up with a holistic measure in the end that tells us the usability, as defined by [the client’s] requirements," Gilbert says. "The thing that people get confused about when they hear about our Holistic Usability Measure is they think you’re giving the management more power. But if you think about what it does, it empowers the designers."
Gilbert is confident in his measurement. He had experts test IVR applications, then compared their feedback with the HUM’s findings. "In some cases it showed that the experts were confusing each other. They had conflicting viewpoints," he says. "But when [the HUM’s measurements] came back to us and we asked the experts to look at it, they said, ‘Well we agree with that measure.’"
Polkosky’s measure, developed in 2004 as her doctoral dissertation, is more perceptual. Described in the November/December 2005 issue of Speech Technology magazine, the measure focuses on four factors:
• user-goal orientation—how efficiently the interface meets the user’s needs;
• speech characteristics—the pleasantness and naturalness of the voice;
• verbosity—which is inversely related to customer satisfaction; and
• customer service behavior—how polite and friendly the tone is.
Polkosky developed the measure through a statistical method called factor analysis. She identified 72 items, including recognition accuracy, voice, syntax, and vocabulary, that figured into a caller’s impressions of a system, then whittled the list down into a much smaller set of characteristics that make a critical difference in caller perception.
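For a sense of the mechanics (though not of Polkosky’s actual instrument or data), exploratory factor analysis takes a wide matrix of per-item ratings and looks for a small number of underlying factors that explain most of the shared variance. The sketch below uses synthetic ratings and scikit-learn’s FactorAnalysis purely as a stand-in.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in for survey data: 300 callers rating 12 items. Real
# instruments start much wider (Polkosky began with 72 items); everything
# here is invented purely to show the mechanics of the reduction.
rng = np.random.default_rng(0)
n_callers, n_items, n_factors = 300, 12, 4

# Build items that genuinely share 4 latent factors, plus noise.
latent = rng.normal(size=(n_callers, n_factors))
loadings = rng.normal(scale=0.9, size=(n_factors, n_items))
ratings = latent @ loadings + rng.normal(scale=0.5, size=(n_callers, n_items))

fa = FactorAnalysis(n_components=n_factors, random_state=0).fit(ratings)

# Items loading heavily on the same factor are candidates to keep or merge;
# items loading on nothing are candidates to drop from the questionnaire.
for f, row in enumerate(fa.components_):
    top_items = np.argsort(np.abs(row))[::-1][:3]
    print(f"factor {f}: strongest items -> {top_items.tolist()}")
```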
But just because testers have information from human subjects doesn’t necessarily mean it’s reliable. Usability subjects should reflect the target demographic.
Pelland once worked on an IVR designed for recent college graduates and undergraduates. The prompts, based on previous market research, were colloquial and casual. However, usability studies afterward revealed a high percentage of English as a Second Language students in the major markets. The prompts that were designed to assist the students, Pelland explains, ironically added to their confusion. "We ended up keeping the casual tone in initial prompts," Pelland says, "but becoming progressively more formal, literal, and explicit in timeouts and retries." Following this, measured success rates increased by 6 percent.
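The escalation Pelland describes is simple to encode: keep the casual wording on the first attempt and swap in progressively more literal prompts on each timeout or retry. The wording below is invented for illustration and is not drawn from that project.

```python
# Illustrative escalation ladder: casual first, progressively more formal and
# explicit on each no-input or no-match. Prompt wording is invented.
MAIN_MENU_PROMPTS = [
    "Hey there! What can I help you with today?",
    "Please tell me what you need. You can say things like "
    "'financial aid', 'class schedule', or 'housing'.",
    "Please choose one of the following options. Say 'financial aid' for money "
    "questions, 'class schedule' for courses, or 'housing' for dorms and apartments.",
]

def prompt_for_attempt(attempt):
    """Return the prompt for a given attempt number (0 = first try)."""
    return MAIN_MENU_PROMPTS[min(attempt, len(MAIN_MENU_PROMPTS) - 1)]

if __name__ == "__main__":
    for attempt in range(4):
        print(f"attempt {attempt}: {prompt_for_attempt(attempt)}")
```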
But even with the proper test panel, experimenters must still be wary of demand characteristics—test participants telling experimenters not what they truly think about the IVR system but what they think the experimenters want to hear. "It’s really well-known in the usability field that that’s a problem universally with usability testing," Polkosky says. She frequently encounters individuals whose behavior with an IVR system diverges from other testers. "Usually, with folks like that, they’ll tell you how great it was, what a great job they did, but when you look at your measurements, did they complete the task? Well, they didn’t. Did they have usability errors? Yes, they had multiple usability errors and failed the task on top of that. But when you ask for their perceptual ratings, they’ll give you the highest rating possible."
That’s why it’s important to weigh subjective data against hard data. If a tester notices a discrepancy between what a participant said and what that participant actually did, she can throw out the anomalous data.
Ultimately, test engineers should focus primarily on finding errors within a system, not with individual idiosyncrasies. And the best way to do this is to look for problems that crop up consistently. "That’s what you’re really looking for—patterning of usability errors," Polkosky says. "When you run enough participants, you see it."