Minorities More Vulnerable to Speech Recognition Inaccuracy
The automatic speech recognition (ASR) models that power voice assistants often have difficulty transcribing speakers of minority English dialects, a recent Georgia Tech and Stanford University study found.
The study compared how accurately ASR models transcribed speakers of Standard American English (SAE) and speakers of three minority dialects: African-American Vernacular English (AAVE), Spanglish, and Chicano English. Study participants who spoke each dialect read from a Spotify podcast dataset, which included podcast audio and metadata. The ASR models used to transcribe the audio were wav2vec 2.0, HuBERT, and Whisper.
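The article does not say which accuracy metric the researchers used. One common measure for ASR output is word error rate (WER): the number of word-level substitutions, insertions, and deletions needed to turn the model's transcript into the reference text, divided by the number of reference words. The following is a minimal sketch of a per-group WER comparison under that assumption. The function names, group labels, and sample sentences are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group from (group, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical data for illustration only, not results from the study.
samples = [
    ("group_a", "the cat sat on the mat", "the cat sat on the mat"),
    ("group_b", "the cat sat on the mat", "the cat sat on a mat"),
]
rates = wer_by_group(samples)
```

A lower WER means a more accurate transcription. Pooling errors and word counts per group, rather than averaging per-sentence rates, keeps short utterances from dominating the result.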
For each model, the researchers found that SAE speech was transcribed significantly more accurately than speech in each of the three minority dialects. The models more accurately transcribed men who spoke SAE than women who spoke SAE. Participants who spoke Spanglish and Chicano English had the least accurate transcriptions of all the test groups.
While the models transcribed SAE-speaking women less accurately than their male counterparts, that did not hold true across minority dialects. Minority men had the most inaccurate transcriptions of all demographics in the study.
"People would expect if women generally perform worse and minority dialects perform worse, then the combination of the two must also perform worse," said Georgia Tech interactive computing Ph.D. student Camille Harris, lead author of the paper. "That's not what we observed.
"Sometimes minority dialect women performed better than Standard American English. We found a consistent pattern that men of color, particularly black and Latino men, could be at the highest risk for these performance errors," she explained.
Harris said the training data used to build these models was at the heart of the discrepancies, noting an underrepresentation of minority dialects in the data sets.
AAVE was transcribed most accurately by the Whisper model, which Harris said had the most inclusive training data for minority dialects.
Harris said designers of ASR tools should look to incorporate minority dialects in their training data.