Exclusive Book Excerpt: Designing Better Speaker Verification
Building the Short List
Though not a technology requirement, one key element of success is understanding how to create a target list from a set of results. Four strategies are typically used:
• Single Threshold: any response over a threshold is considered a possible match, and any response under that threshold is considered not a likely match;
• Dual Threshold: two thresholds are set, where anything above the top threshold is a likely match, anything below the lower threshold is not a likely match, and anything in the middle is a possible match;
• N-Best: an implementer wants the N closest matches to the target voice; and
• Full List: when all results are returned to an operator.
Each strategy has pros and cons. Threshold-based systems require calibration against live test data to determine where the thresholds should be set. N-Best lists can typically provide short lists that include the target speaker in a high number of cases, but they can be thrown off if there is no match in the system or if there is a cluster of many targets that score similarly (for example, if N is set to 10 but there are 30 results that all score very similarly). Full lists give a trained operator more granular control but can be unwieldy if there are more than 20 possible targets to test against.
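The four strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: the function names, the (candidate, score) tuple layout, the convention that a higher score means a closer match, and the choice to treat a score exactly equal to a threshold as passing are all hypothetical.

```python
from typing import Dict, List, Tuple

# (candidate_id, score); higher score = closer match (assumed convention)
Score = Tuple[str, float]


def full_list(scores: List[Score]) -> List[Score]:
    """Full List: return every result, best score first."""
    return sorted(scores, key=lambda s: s[1], reverse=True)


def single_threshold(scores: List[Score], threshold: float) -> List[Score]:
    """Single Threshold: keep results at or above the threshold (assumed tie rule)."""
    return [s for s in full_list(scores) if s[1] >= threshold]


def dual_threshold(scores: List[Score], lower: float, upper: float) -> Dict[str, List[Score]]:
    """Dual Threshold: split results into likely, possible, and unlikely matches."""
    buckets: Dict[str, List[Score]] = {"likely": [], "possible": [], "unlikely": []}
    for s in full_list(scores):
        if s[1] >= upper:
            buckets["likely"].append(s)
        elif s[1] < lower:
            buckets["unlikely"].append(s)
        else:
            buckets["possible"].append(s)
    return buckets


def n_best(scores: List[Score], n: int) -> List[Score]:
    """N-Best: keep the n highest-scoring results."""
    return full_list(scores)[:n]


def near_cutoff(scores: List[Score], n: int, tolerance: float) -> List[Score]:
    """Results just outside the N-Best list that score within `tolerance`
    of the Nth result, a sign that the cutoff splits a cluster."""
    ordered = full_list(scores)
    if n <= 0 or len(ordered) <= n:
        return []
    cutoff = ordered[n - 1][1]
    return [s for s in ordered[n:] if cutoff - s[1] <= tolerance]
```

A caller might run near_cutoff alongside n_best and, when it returns anything, widen the list or hand the full list to an operator rather than trust an arbitrary cut through a cluster of similar scores.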
It is important to use logic when setting thresholds or N-Best lists. If a target is of very high importance, you might accept more false matches if doing so guards against false rejects. Conversely, for a low-priority target, occasionally failing to identify the speaker may be more acceptable than tracking down possible false matches.
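One way to make that trade-off explicit is to assign a relative cost to each kind of error and pick the threshold that minimizes total cost on a set of labeled calibration trials. The sketch below is a hypothetical illustration: the function name, the cost weights, and the sample trials are invented for the example, and a score equal to the threshold is assumed to count as a match.

```python
from typing import List, Tuple


def pick_threshold(
    trials: List[Tuple[float, bool]],  # (score, is_really_the_target)
    cost_false_match: float,
    cost_false_reject: float,
) -> float:
    """Return the observed score that, used as a threshold, gives the lowest
    weighted error cost on the calibration trials."""
    if not trials:
        raise ValueError("at least one calibration trial is required")
    best_threshold, best_cost = 0.0, float("inf")
    for threshold in sorted({score for score, _ in trials}):
        false_matches = sum(
            1 for score, is_target in trials if score >= threshold and not is_target
        )
        false_rejects = sum(
            1 for score, is_target in trials if score < threshold and is_target
        )
        cost = cost_false_match * false_matches + cost_false_reject * false_rejects
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


# Invented calibration data: two true target trials, three impostor trials.
trials = [(0.9, True), (0.7, False), (0.6, True), (0.4, False), (0.2, False)]

# High-priority target: a false reject is weighted 10x a false match.
high_priority = pick_threshold(trials, cost_false_match=1, cost_false_reject=10)

# Low-priority target: a false match is weighted 10x a false reject.
low_priority = pick_threshold(trials, cost_false_match=10, cost_false_reject=1)
```

With these sample trials the high-priority weighting settles on a lower threshold (0.6) than the low-priority weighting (0.9), which mirrors the reasoning above: the more a miss costs, the more false matches are worth tolerating.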
For integrated solutions, other pieces of information can be fed into a decision engine. For example, law enforcement and intelligence services typically have full dossiers on suspects that list known whereabouts. If you get a voice in Chicago that matches the voice of someone known to be in federal lockdown in Miami, it’s probably safe to exclude that as a match. In a full solution, it can be much more valuable to send all results to an application that can apply these types of rules instead of fixing a threshold in the biometric engine.
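A rule of this kind is simple to express once all results are passed to the application layer. The following is a minimal sketch, assuming a hypothetical Dossier record with a known city and a custody flag; none of these names or fields come from a real system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Score = Tuple[str, float]  # (candidate_id, score)


@dataclass
class Dossier:
    """Hypothetical subset of what an agency might know about a suspect."""
    known_city: Optional[str] = None
    in_custody: bool = False


def apply_location_rule(
    results: List[Score],
    dossiers: Dict[str, Dossier],
    probe_city: str,
) -> Tuple[List[Score], List[Score]]:
    """Split results into (kept, excluded). A candidate is excluded only when
    the dossier confirms custody in a city other than where the probe was
    recorded; candidates with missing or inconclusive data are kept."""
    kept: List[Score] = []
    excluded: List[Score] = []
    for candidate_id, score in results:
        dossier = dossiers.get(candidate_id)
        conflict = (
            dossier is not None
            and dossier.in_custody
            and dossier.known_city is not None
            and dossier.known_city != probe_city
        )
        (excluded if conflict else kept).append((candidate_id, score))
    return kept, excluded


results = [("suspect_1", 0.92), ("suspect_2", 0.88)]
dossiers = {"suspect_1": Dossier(known_city="Miami", in_custody=True)}
kept, excluded = apply_location_rule(results, dossiers, probe_city="Chicago")
```

Returning the excluded candidates alongside the kept ones, rather than dropping them silently, lets an operator review or override a rule when a dossier turns out to be stale.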
If the strategy for rolling out the technology is flawed, even a technically successful deployment could be considered a failure. Though books have been written about how to successfully manage complex technology deployments, it’s worth keeping a simple acronym in mind: SPEC.
Scope: Fully understand how a solution will be used and what will make it successful. It’s almost impossible to spend too much time scoping out a project. Will the voice biometric solution work independently or be integrated into a multi-biometric searching system? What sort of probes will be used, and what could cause new probes to be introduced? What is the overall charter of the project? It’s important to level-set the implementer regarding what can be achieved with the technology and to ensure that they understand what it will take to properly integrate and deploy a voice biometric solution.
Scoping a project is not passive. As much as vendors need to extract information from the implementers, it’s also important for vendors to relay the best way for the technology to be implemented. Often, what seems like an unreasonable technical goal might be a misstatement of a reasonable operational goal.
Prototype: Never deploy a system without building a prototype to evaluate the technology’s performance in a series of near-live environments. Depending on the complexity of the final integrated solution and whether it will be deployed in a classified environment, two or more prototypes are sometimes necessary.
Execute: Building a proper rollout plan is critical for success. Does the system need to be calibrated using live audio? How many units should be deployed as a beta before certifying that everything is working properly? Will you have access to live performance data to ensure the system is working properly? In many use cases, based on clearance levels, the vendor might never be able to gain access to live audio samples or get final specifications on the deployment hardware. In these cases, you need to determine if it is necessary to build an execution plan that includes a pre-deployment simulation for calibration or system training.
Control: When scoping a project, it is important to base the success criteria not solely on technology goals but also on operational system goals. Once the system has been deployed, it is important to set a point in time at which the solution will be evaluated to determine whether those goals have been met.
Worldwide, the case for voice biometrics in investigatory, forensic, and judicial processes has been made. As international precedents have been set, what holds back implementations in the United States is the voice biometric community itself. We speak to customers with the guarded voice of a researcher, not the confident voice of a vendor. Though voice biometrics is not infallible, no statistical identification method is. Fingerprints, DNA, and iris scanning all have acceptable levels of tolerance for errors. These levels are set not by the technologists but by the implementers. As an industry, we can no longer confuse voice biometric accuracy with speaker identification’s utility.
(This excerpt was lightly edited for space reasons.)
ABOUT THE BOOK:
Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism (released September 1) is an anthology of the research findings of 35 speaker recognition experts from around the world. The volume provides a multidimensional view of the science involved in determining whether a suspect’s voice matches forensic speech samples, collected by law enforcement and counter-terrorism agencies, that are associated with the commission of a terrorist act or other crimes. The challenges of forensic casework are explored, along with such issues as handling speech signal degradation, analyzing features of speaker recognition to optimize voice verification system performance, and designing voice applications that meet the practical needs of law enforcement and counter-terrorism agencies. A running theme is how the rigors of forensic utility are demanding new levels of excellence in all aspects of speaker recognition. The contributors are scientists in speech engineering and signal processing, and their work represents such diverse countries as Switzerland, Sweden, Italy, France, Japan, India, and the United States.
The above chapter was written by Avery Glasser, managing partner at Flecture, a provider of management consulting, solution design, and vendor representation for clients with a specialization in surveillance and identification for law enforcement and intelligence agencies.