Model Adaptation for Speaker Verification
Speaker verification allows people to use their voice to gain access to information or services. This technology has matured to the point where successful applications have been deployed in different markets, including corrections, telecommunications, and finance. Important to the success of a speaker verification system is its adaptability to changes in the environment: using a different phone, calling from an office instead of a residence, speaking indoors as opposed to outdoors, and changes in a person's voice over time. Frequently there are subtle changes in a person's voice over time. These aren't necessarily due to colds or temporary illnesses, but rather to natural changes in how one speaks over sizeable periods of time, such as months or years. Studies on this topic have shown sizeable increases in error rates when a large time period has elapsed between the date the enrollment data was collected and the date verification is performed (Furui, 1981; Mistretta & Farrell, 1998). In fact, one study revealed a threefold increase in error rate when there was a six-month gap between enrollment and verification (Mistretta & Farrell, 1998). This effect, known as aging, affects only a subset of the enrolled population, increasing the likelihood that a true user is wrongfully rejected by the verification system, i.e., a false reject. Using model adaptation, we can compensate for aging by adapting the model parameters with new verification data as it becomes available.
Multiple Methods for Model Adaptation

Model adaptation consists of adjusting model parameters based on verification data encountered after enrollment. There are a number of methods for creating models for speaker verification, and in general there are ways to adapt these different models with new data so that they perform better in future verifications. Some methods use statistical models such as Hidden Markov Models (HMMs) (Matsui & Furui, 1994) or Gaussian Mixture Models (GMMs) (Heck & Mirghafori, 2000); here, the statistical parameters, such as the mean and variance components, can be updated with new data. Another approach is a template-matching method known as dynamic time warping, for which adaptation methods have also been considered (Naik & Doddington, 1986). Neural networks are another form of pattern recognizer that has been applied to speaker verification, and these too have more recently been evaluated with model adaptation (Mistretta & Farrell, 1998). Speaker verification has also been performed with combinations of models that use consensus-based information, including a neural network, dynamic time warping, and a Hidden Markov Model; model adaptation was used to adjust the parameters of all three models (Farrell, 2002).

As a very simple example of how a statistical parameter such as a mean can be adapted, consider the following four numbers: 2, 2, 3, and 3. The average of these four numbers is 2.5. Now, let's say that a fifth number becomes available: 5. The brute-force way of computing the new average would be to add the five numbers together and divide by five to yield the new average of 3. This approach requires you to store all the previous data and is not efficient. A simpler approach is to maintain the number of data samples used to create the average (4 in the original example). Then, when the new sample becomes available, you can quickly retrieve the original sum (10) by multiplying the average by the number of samples used to compute it. Now you can just add the new sample to this sum, to yield 15, and then divide by the new number of observations (5), to yield the new average of 3. You only need to store the number of samples used to create the mean, as opposed to requiring all of the original data, which makes this approach much more efficient and storage friendly than the brute-force one. Granted, this example is greatly simplified, but it illustrates the basic concept of how parameter adaptation functions (a short code sketch of this update follows at the end of this section).

The model adaptation algorithm was applied to a number of different databases collected for the purpose of evaluating speaker verification technology. The databases consist of people enrolling with a spoken password and then validating with that password. There are numerous imposter attempts where people try to break in by using the correct password. Approximately 25,000 total verification attempts occurred, and the distribution of these scores is shown in Figure 1. This score plot represents the frequency of scores that fall in groups known as bins; for example, 7,000 imposter scores fell within the range of 0.2 to 0.3 on a scale of 0 to 1. Overall, the imposter scores range from 0 to 0.7 and the true speaker scores range between 0.5 and 1.0. For speaker verification, a threshold is used to determine which scores will be rejected and which scores will be accepted.
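Returning to the running-average example above, here is a minimal sketch of the update in code. Python is my choice of language for illustration, not something prescribed by the article; the point is simply that only the current mean and the sample count need to be stored, exactly as described.

```python
# Incremental mean adaptation: keep only the current mean and the
# number of samples it was computed from; no raw data is retained.

def adapt_mean(mean, count, new_sample):
    """Fold one new observation into a stored mean."""
    total = mean * count         # recover the original sum: 2.5 * 4 = 10
    total += new_sample          # add the new observation:  10 + 5 = 15
    count += 1                   # one more sample has been seen
    return total / count, count  # new mean: 15 / 5 = 3.0

mean, count = 2.5, 4                        # average of 2, 2, 3, and 3
mean, count = adapt_mean(mean, count, 5.0)  # mean is now 3.0, count is 5
```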
The baseline equal error rate (EER), a standard metric for benchmarking speaker verification systems, is 2.7 percent for the data shown in the histogram. The EER corresponds to the operating point where the probability that the wrong person is accepted, i.e., a false accept, is equal to the probability of a false reject.

The model adaptation process was evaluated for the data shown in Figure 1. First, it is important to distinguish between supervised adaptation and unsupervised adaptation. Supervised adaptation corresponds to the case where the data is known to come from the correct user. This is a big assumption, as the point of speaker verification is to determine whether or not the voice sample belongs to the speaker; if the decision were always perfect, there would never be a need for adaptation. This performance metric therefore provides a "best-case" scenario. Unsupervised adaptation corresponds to the real-world situation where it is not known whether the data comes from the correct user or an imposter, and a threshold must be applied. For a given threshold position, only the data corresponding to scores above that threshold were used for adaptation. For unsupervised adaptation, all data with scores above the threshold was used to adapt the model, even if it came from imposters. For supervised adaptation, only the data from the true speaker, and not imposters, was used for scores that exceeded the threshold. It is also important to note that an adapted model will produce different scores, and this needs to be compensated for (Mirghafori & Heck, 2002); otherwise the model can shift its scoring characteristics and a fixed threshold position will start to exhibit different performance.

Adapting on data that belongs to an imposter can be very detrimental to performance. By using a strict threshold criterion, adapting only on data whose scores exceed a certain threshold, the possibility of this occurrence is minimized and unsupervised adaptation starts to approximate the performance of supervised adaptation. Specifically, the best operating point in this example for unsupervised adaptation was obtained using a threshold of 0.65. In this case, the overall error rate was reduced from 2.7 percent to 2.0 percent, a relative improvement of roughly 25 percent. Supervised adaptation provides much more benefit and reduces the error rate by more than 50 percent. However, this again assumes that the true identity is known before applying the data. Certain speaker verification applications may obtain additional information from a user who fails verification, such that high confidence in the identity can still be achieved following a marginal speaker verification score. In these cases, it may be possible to achieve improvements closer to those seen for supervised adaptation.

Model adaptation can be a powerful method for maintaining, if not improving, the performance of models used for speaker verification over time. Research shows that when properly applied, model adaptation can provide relative improvements of 25 to 50 percent in error rate. However, model adaptation should be used carefully, as adapting a model with data from the wrong person can have a dramatically negative impact on the model.
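To make the preceding discussion concrete, the following sketch combines the two ideas above: estimating the equal error rate from labeled score lists, and gating unsupervised adaptation behind a strict threshold so imposter data is unlikely to reach the model. All names here (SpeakerModel, ADAPT_THRESHOLD, maybe_adapt) are illustrative assumptions of mine, not part of any published system, and the single-mean "model" stands in for the full set of statistical parameters.

```python
from dataclasses import dataclass

ADAPT_THRESHOLD = 0.65  # the strict operating point cited in the text

@dataclass
class SpeakerModel:
    mean: float  # stand-in for the model's statistical parameters
    count: int   # number of samples behind the current estimate

def equal_error_rate(true_scores, imposter_scores):
    """Sweep the decision threshold over [0, 1] and return the point
    where the false-accept and false-reject rates are closest to equal."""
    best_eer, best_gap = 1.0, float("inf")
    for step in range(101):
        t = step / 100.0
        far = sum(s >= t for s in imposter_scores) / len(imposter_scores)
        frr = sum(s < t for s in true_scores) / len(true_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer

def maybe_adapt(model, score, sample):
    """Unsupervised adaptation: update the model only when the
    verification score clears the strict threshold, minimizing the
    chance of adapting on an imposter's data."""
    if score < ADAPT_THRESHOLD:
        return False  # too risky; leave the model unchanged
    total = model.mean * model.count + sample  # same running-mean update as above
    model.count += 1
    model.mean = total / model.count
    return True
```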
REFERENCES
1. S. Furui. Comparison of speaker recognition methods using statistical features and dynamic features. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29:342-350, April 1981.
2. W.J. Mistretta and K.R. Farrell. Model adaptation methods for speaker verification. Proceedings ICASSP, 1998.
3. T. Matsui and S. Furui. Speaker adaptation of tied-mixture-based phoneme models for text-prompted speaker recognition. Proceedings ICASSP, pages 1125-1128, 1994.
4. L. Heck and N. Mirghafori. On-line unsupervised adaptation in speaker verification. Proceedings ICSLP, 2000.
5. J.M. Naik and G.R. Doddington. High performance speaker verification using principal spectral components. Proceedings ICASSP, pages 881-884, 1986.
6. K.R. Farrell. Speaker verification with data fusion and model adaptation. Proceedings ICSLP, pages 585-588, 2002.
7. N. Mirghafori and L. Heck. An adaptive speaker verification system with speaker-dependent a priori decision thresholds. Proceedings ICSLP, pages 589-592, 2002.
Dr. Kevin Farrell is a speaker verification engineer at ScanSoft, Inc.