Assessing IVAs: How Do You Determine Which One Is Right for You?
Maybe you’ve been thinking about using an intelligent virtual assistant (IVA) in your business. You want to automate customer service, help your customers find their way around your website, or provide tools for your employees. A few minutes with any search engine will turn up countless vendors who say that their products are “intelligent,” “natural,” or “just like talking to a person”; others tout themselves as “truly conversational” and “revolutionary,” among many other glowing descriptions. Naturally, you want to use the best technology, but how can you know which one is the best, or even which alternatives are good enough to do the job you have in mind?
Clearly, simply looking at vendor websites is not the best way. Every vendor will claim that their technology is the best. Looking at YouTube demos and talking to salespeople won’t be very helpful, either. Vendors will be biased, and demos are based on very carefully curated interactions. And a few minutes of casually trying out a system can lead to results that are very misleading. Is there a reliable, objective way to measure system accuracy?
Other products can be compared with standard metrics. We have miles per gallon for cars, energy usage for appliances, and screen resolution for monitors. Unfortunately, we don’t have these kinds of metrics (yet) for virtual assistants. Even if we narrow down “best” to “most accurate,” there’s still a lot of room for subjectivity.
How can we measure virtual assistant accuracy in order to reliably compare systems? Unfortunately, we don’t have any official standards, but here are some ideas that seem promising.
Ways to Measure IVAs
Let’s start by saying that any fair comparison has to be based on widely accepted measurements and procedures. A practical evaluation also can’t be too expensive or time-consuming, so we don’t need perfection, just a comparison that’s good enough.
First, here are a few promising strategies.
1. Systems can make mistakes in two different ways, so we have to measure both of them. The system can give a wrong answer, or it can fail to answer a question that it should be able to answer. Technically, giving a wrong answer is a failure of precision; failing to answer a question the system should know is a failure of recall. Over a large set of test questions, overall precision and recall scores give us a good picture of system accuracy (see the first sketch after this list). While precision and recall are not official standards, they are widely accepted by researchers.
2. A newer metric is Sensibleness and Specificity Average (SSA), developed by Google for its chatbot, Meena. Crowd workers look at pairs of user queries and system responses and score each response on how sensible it is and how specific it is. Sensibleness is just what it sounds like: does the response make sense in context? Specificity penalizes generic responses like "that's nice," which are often a sign that the digital assistant is trying to hide its ignorance. The sensibleness and specificity scores are combined to yield an overall SSA score (see the second sketch after this list). An attractive feature of this metric is that the workers scoring the responses don't have to know the right answer; they just have to decide how sensible and specific each response is.
3. Another metric that's worth mentioning is the one used in Amazon's Alexa Prize. It doesn't measure accuracy; rather, it measures how engaging an application is by keeping track of how long users interact with it (see the third sketch after this list). This could be a useful metric for applications like elder companions, whose goal is to keep users involved with the application but where accuracy is not a primary requirement.
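To make the arithmetic concrete, here is a minimal sketch, in Python, of how precision and recall might be tallied over a set of test questions. The test items and outcome labels are invented for illustration; a real evaluation would define its own counting rules for a specific application.

```python
# Minimal precision/recall sketch for an IVA evaluation.
# Each test item records what the system did with one question it
# should have been able to answer. Questions and outcomes are illustrative.

test_results = [
    {"question": "What are your store hours?", "outcome": "correct"},
    {"question": "Do you ship to Canada?",      "outcome": "wrong"},      # answered, but incorrectly
    {"question": "How do I reset my password?", "outcome": "no_answer"},  # should have known, gave no answer
    {"question": "What is your return policy?", "outcome": "correct"},
]

correct    = sum(r["outcome"] == "correct" for r in test_results)
answered   = sum(r["outcome"] in ("correct", "wrong") for r in test_results)
answerable = len(test_results)  # every question here is one the system should know

precision = correct / answered    # of the answers it gave, how many were right
recall    = correct / answerable  # of the questions it should know, how many it answered correctly

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```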
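Here is a similar sketch for SSA, assuming, as in Google's Meena evaluation, that each response gets binary sensibleness and specificity judgments and that a response judged not sensible also counts as not specific. The judgments below are made up.

```python
# Sketch of a Sensibleness and Specificity Average (SSA) tally.
# Each judgment is one crowd worker's binary rating of one system response.
# A response rated not sensible is also counted as not specific.

judgments = [
    {"sensible": True,  "specific": True},
    {"sensible": True,  "specific": False},   # e.g., "That's nice."
    {"sensible": False, "specific": False},
    {"sensible": True,  "specific": True},
]

sensibleness = sum(j["sensible"] for j in judgments) / len(judgments)
specificity  = sum(j["specific"] for j in judgments) / len(judgments)
ssa = (sensibleness + specificity) / 2

print(f"sensibleness = {sensibleness:.2f}, specificity = {specificity:.2f}, SSA = {ssa:.2f}")
```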
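For an engagement metric along the lines of the Alexa Prize's, the bookkeeping is simpler still: log when each conversation starts and ends and average the results. A toy sketch with invented session logs:

```python
# Toy engagement metric: average conversation length in minutes and turns.
# Session logs are invented for illustration.
from datetime import datetime

sessions = [
    {"start": "2024-05-01 10:00", "end": "2024-05-01 10:12", "turns": 18},
    {"start": "2024-05-01 11:30", "end": "2024-05-01 11:34", "turns": 6},
    {"start": "2024-05-02 09:15", "end": "2024-05-02 09:40", "turns": 31},
]

fmt = "%Y-%m-%d %H:%M"
minutes = [
    (datetime.strptime(s["end"], fmt) - datetime.strptime(s["start"], fmt)).total_seconds() / 60
    for s in sessions
]

print(f"average duration: {sum(minutes) / len(sessions):.1f} minutes")
print(f"average turns:    {sum(s['turns'] for s in sessions) / len(sessions):.1f}")
```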
Evaluating IVA Performance
Not only should the measurements be standardized, but evaluations should follow a standard process that (1) has repeatable results; (2) controls for extraneous variables; and (3) guards against gaming the results. A notorious example of gaming an evaluation is the Volkswagen emissions scandal that came to light in 2015: Volkswagen programmed its diesel cars to switch on full emissions controls only when they detected that a test was under way, so the cars appeared far cleaner in the lab than they were on the road. Volkswagen was caught, and the outcome was not good for the company; its CEO resigned as a result.
Some best practices for the evaluation process include:
1. Doing cross-system comparisons with the same application; colloquially, comparing apples to apples. It's unfair to compare systems performing different applications, because one application is likely to be harder than another. For example, one application might have more intents and entities, which would lower that system's scores. The data used to develop the applications could be an open public dataset like the one developed by Clinc (https://github.com/clinc/oos-eval), or it could be internal data for an application in a specific vertical (the sketch after this list shows one way to set up such a comparison). For generic assistants that don't have a specific application (think Alexa or Siri), there is published data like the data used in my Speech Technology article, "Does Your Intelligent Assistant Really Understand You?" (https://www.speechtechmag.com/Articles/Editorial/Industry-Voices/Does-Your-Intelligent-Assistant-Really-Understand-You-143235.aspx).
2. Training and testing systems on non-overlapping data. If a system is tested on the same data it was trained on, the test won't be representative of actual working conditions, where new, previously unseen inputs show up all the time. Overlapping training and test data would also be an easy way to game the evaluation; the sketch after this list keeps the two sets separate.
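To illustrate both practices, here is a sketch that evaluates competing systems on the same public dataset while keeping the training and test portions apart. The file name and JSON layout (lists of [utterance, intent] pairs under "train" and "test" keys) are what I would expect from the Clinc oos-eval repository, but check the repository for the current format; the predict() interface stands in for whichever systems you are comparing.

```python
# Sketch: evaluate two systems on the same public dataset, training and
# testing on non-overlapping splits. The JSON layout assumed here is the
# expected Clinc oos-eval format; verify against the repository.
import json

with open("data_full.json") as f:
    data = json.load(f)

train = data["train"]   # used only to build/train each system
test  = data["test"]    # held out; never shown to a system during training

def evaluate(predict, test_set):
    """predict(utterance) -> intent label; returns accuracy on the held-out test set."""
    correct = sum(predict(utterance) == intent for utterance, intent in test_set)
    return correct / len(test_set)

# Plug each vendor's system in behind the same predict() interface, e.g.:
#   accuracy_a = evaluate(system_a.predict, test)
#   accuracy_b = evaluate(system_b.predict, test)
# Because both systems see identical training and test data, the comparison
# is apples to apples, and because the test items never appear in training,
# neither system can game the result.
```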
Putting It All Together
So going back to the initial question—how can you properly assess intelligent virtual assistants?—here are our general recommendations. First, don’t base an evaluation on a subjective test. An evaluation that consists of a few minutes of trying out a demo can be very misleading. Second, use common measurements, like recall, precision, and SSA. Third, follow a standard process: use the same dataset for all comparisons, and keep the training and test data separate.
Following these guidelines will lead to reliable and meaningful comparisons. Put that information together with other requirements—development tools, runtime cost, ease of maintenance—and you’re well on your way to a successful virtual agent deployment.
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.