Accents Still Elude Speech Recognition Systems, But for How Long?
In Boston, people want you to “Pahk yah cah in Hahvahd Yahd.” In Dallas, they want to know, “How y’all dooin?”
Travel from one region of the United States to another and you will quickly find local dialects, spoken and understood by natives but unfamiliar, and often confusing, to outsiders and to automated speech recognition (ASR) systems.
Can technology advance so the machines close the gap and understand what people say, no matter where they were raised? Maybe in the future, but probably not in the near term because the challenge is extremely complex.
The reality is that speech recognition engines have made tremendous progress over the decades. Recent advances in computer technology, like cloud computing, have given artificial intelligence enough processing power to work its magic.
As a result, speech recognition engines are largely accurate. “There have been tons of lab tests for the ASR offerings from companies like Google, Amazon, Deepgram, etc., and I think you’ll find that each reached humanlike quality within the past two years,” notes Dan Miller, lead analyst and founder of Opus Research. “In real life, the expectation now is for ASR resources to achieve word accuracy (the complement of the word error rate, or WER) north of 85 percent out of the box, and to be tuned to exceed 90 percent in fairly short order.”
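For readers unfamiliar with the metric Miller cites, WER is simply the number of word substitutions, deletions, and insertions divided by the length of the reference transcript. A minimal sketch in Python, using made-up example sentences, shows how it is computed:

```python
# Minimal sketch: word error rate (WER) as Levenshtein distance over words,
# i.e., (substitutions + deletions + insertions) / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("park your car in harvard yard",
                      "pack yah cah in harvard yard"))  # 0.5, i.e., 50 percent WER
```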
However, the composite numbers undersell the challenges that suppliers have taken on in understanding what a person says, no matter their background or how they speak. “The reality is speech systems today do a very good job with simple, practical requests, like ‘Find a pizza parlor in Concord,’” explains Stephen Arnold, senior adviser at Arnold Information Technology. “Where they stumble is with more sophisticated applications of the technology.”
Accents represent one of those areas. “If you’re talking about English or any language, regional dialects and the use of multiple languages within a country will always pose a big problem,” Miller points out.
That is because speech recognition systems recognize only the accents they have been trained to understand. To crack the regional code, vendors must create sophisticated data models. For instance, to learn how to interpret the accent of someone from New Jersey, a system needs a large sampling of voice data from locals, which is then used to build a data model that delivers accurate results.
Corporations tend to be cautious in adding such features, though. Since businesses typically roll these systems out to their customers, they want to be all but certain that the technology will streamline interactions rather than hinder them. Too small a sample dataset leads to inaccuracies and customer frustration. Even with large datasets, the systems make mistakes; between 10 percent and 23 percent of words are misidentified in environments like an emergency room, according to a National Library of Medicine study. Because of such inaccuracies, many users become frustrated with ASR solutions and view them negatively, so much so that some will sever all ties with a vendor after just one frustrating exchange with the technology.
The Data Collection Challenge
Accents represent a significant obstacle for a few reasons. The number of people who speak with a particular accent is a subset, often a small one, of everyone using the system. So suppliers need to start the process by limiting input to individuals who speak with an accent. But how to find them becomes a vexing question. Accents arise in specific geographies, but not everyone in an area speaks with the same twang. “We find accent variations as one moves from city to city and even from neighborhood to neighborhood,” explains George Lodge, a computational linguist at Speechmatics.
Speech systems have to be intelligent enough to distinguish speakers who have an accent from those who do not. The problem becomes a chicken-and-egg scenario: vendors need input from individuals with accents to tune the system, but the system is not smart enough to separate those speakers from everyone else.
Another hurdle is language’s dynamic nature. How people speak constantly changes, as Lodge notes. New words emerge and gradually make their way to mainstream acceptance. As a result, the engines need to be constantly tuned.
Collecting the information is a complex task. The models need hundreds of thousands, and in some cases millions, of inputs to be effective.
Recently, the data collection process became more troublesome as governments, such as the European Union and the state of California, enacted privacy laws that give individuals more control over the information that technology companies gather about them and how it is used. As a result, assembling a representative pool of different types of speakers, varied by age, gender, education level, and so on, becomes more complicated.
Therefore, the process of creating a sophisticated ASR is time-consuming, painstaking, and ultimately quite expensive. Tuning the system to account for accents becomes a difficult business case to make because the investments are so high and the gains in accuracy can be so low.
Variations on Traditional Themes Arise
Vendors are trying to close the gap in a few ways. To encourage client participation, ASR suppliers have begun to share the wealth with customers, offering them a percentage of any revenue generated when their input is used in the suppliers’ data models.
Another approach is the emergence of synthetic datasets. Here, vendors develop artificial intelligence models that generate artificial speech samples for training and testing. The process is simpler and less expensive than collecting live data.
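A minimal sketch of the idea, assuming a vendor already has some text-to-speech or voice-conversion model available (the synthesize_audio function below is a hypothetical stand-in, not a real library call), might pair templated transcripts with accent labels:

```python
import itertools
import random

# Sketch of synthetic-data generation: templated transcripts plus accent labels.
TEMPLATES = ["find a {food} place in {city}", "call the nearest {food} shop in {city}"]
FOODS = ["pizza", "taco", "sushi"]
CITIES = ["Concord", "Boston", "Dallas"]
ACCENTS = ["boston", "texan", "midwestern"]  # labels for the voices being simulated

def synthesize_audio(text: str, accent: str) -> bytes:
    """Hypothetical stub: a real system would render a waveform in the given accent."""
    return f"<audio:{accent}:{text}>".encode()

dataset = []
for template, food, city in itertools.product(TEMPLATES, FOODS, CITIES):
    text = template.format(food=food, city=city)
    accent = random.choice(ACCENTS)
    dataset.append({"text": text, "accent": accent, "audio": synthesize_audio(text, accent)})

print(len(dataset), "synthetic utterances generated")
```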
In addition, vendors are developing new types of ASR solutions. In some cases, the products try to put words into context, in a manner like filling out a crossword puzzle. The system determines what an unrecognized word is by examining the surrounding words and contextualizing, or best-guessing, what the person is trying to say.
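A toy illustration of that best-guessing step, with invented candidate words and bigram counts, might score each acoustically plausible candidate against its neighboring words and keep the most likely one:

```python
# Toy sketch of context-based "best guessing": choose among acoustically similar
# candidates by scoring each one against its neighbors. The candidate lists and
# bigram counts here are invented purely for illustration.
BIGRAM_COUNTS = {
    ("pizza", "parlor"): 50, ("pizza", "harbor"): 1,
    ("park", "your"): 40, ("pack", "your"): 5,
}

def pick_best(prev_word: str, candidates: list, next_word: str) -> str:
    def score(word: str) -> int:
        return BIGRAM_COUNTS.get((prev_word, word), 0) + BIGRAM_COUNTS.get((word, next_word), 0)
    return max(candidates, key=score)

# "pizza ??? in" where the acoustic model heard something between "parlor" and "harbor"
print(pick_best("pizza", ["parlor", "harbor"], "in"))  # -> "parlor"
```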
Fluent.ai’s Speech-to-Intent system is based on that method. “The initial speech systems were designed for transcription services,” explains Probal Lala, CEO of Fluent.ai.
The solutions were built to translate input to text that various applications could use in some way. Even then, suppliers want accurate transcriptions of the input, which inevitably leads to the input being checked and rechecked by human beings. Instead, Fluent maps speech to its desired action without speech-to-text transcription.
Speechmatics has taken a similar approach. Traditionally, the machine learning system is given labeled data: an audio file of speech accompanied by a metadata or text file containing what is being said, usually transcribed and checked by humans. This approach is supervised learning, in which a model learns correlations between the two forms of prepared data. Speechmatics instead uses self-supervised learning, in which the ASR figures out what is said largely on its own.
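The general recipe behind self-supervised pretraining, sketched below with a toy PyTorch model rather than Speechmatics’ actual system, is to hide pieces of unlabeled audio and train the network to fill them back in, so no human transcripts are required:

```python
import torch
import torch.nn as nn

# Minimal sketch of self-supervised pretraining on unlabeled audio features:
# mask a few spectrogram frames and train a small encoder to reconstruct them.
torch.manual_seed(0)
frames, feat_dim = 100, 80                      # 100 time steps of 80-dim log-mel features
spectrogram = torch.randn(1, frames, feat_dim)  # stand-in for real unlabeled audio

encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(10):
    masked = spectrogram.clone()
    mask = torch.rand(1, frames, 1) < 0.15      # hide roughly 15 percent of frames
    masked[mask.expand_as(masked)] = 0.0
    reconstructed = encoder(masked)
    # The loss is computed only on the frames the model could not see.
    loss = ((reconstructed - spectrogram)[mask.expand_as(masked)] ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```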
Can Vendors Make Money?
Such work is well under way, but the focus is on applications other than recognizing local accents. Making the business case is tricky. “The challenge here is that the market gets so fragmented that it is hard to build a business case around new technologies to specifically address splintered dialects,” Miller says.
Suppliers spend lots of money on the infrastructure needed to create and deliver speech solutions. They need to recoup those investments, so the focus has been on the simplest, easiest-to-monetize needs, like “Find the local pizza shop,” according to Arnold at Arnold Information Technology. Most people probably use the internet to find local pizza parlors, so the pool of individuals who need accurate answers to such queries is large.
The subset of people speaking in unusual ways is often small. Training the data model is complex and expensive, so suppliers have to be convinced that their investment will have a positive impact on the bottom line, something often not quickly or easily guaranteed. Consequently, what the systems recognize tends to be skewed toward certain types of speakers. If one plots new accent releases on a map, the Global South is not a consideration, despite the number of English speakers there.
Therefore, vendors have largely not built ASRs specifically to account for local dialects. Instead, they have focused on easier ways to monetize their investments.
But the market might be reaching an inflection point. Open-source ASRs are emerging. OpenAI, co-founded by Elon Musk and Y Combinator’s Sam Altman in 2015, has created a general-purpose multilingual speech recognition system, called Whisper, trained on 680,000 hours of audio and corresponding transcripts covering 98 languages.
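A minimal sketch of using Whisper, assuming the open-source openai-whisper Python package is installed and pointing at a placeholder audio file name, shows how little code a general-purpose transcription now takes:

```python
# Minimal sketch using the open-source openai-whisper package (pip install openai-whisper);
# "recording.mp3" is a placeholder for any local audio file.
import whisper

model = whisper.load_model("base")          # smaller checkpoints trade accuracy for speed
result = model.transcribe("recording.mp3")  # the spoken language is detected automatically
print(result["text"])
```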
Given that market change, suppliers want to avoid having their solutions become commoditized and are looking for new ways to differentiate them. “Globalization of automated speech recognition will continue to drive demand for additional languages and dialects,” Miller says.
Niche applications offer another possible way to extend and improve speech recognition rates. Fluent.ai created an autonomous, small-footprint, edge ASR that recognizes tens of thousands of commands. “Our solution works well in noisy factories, where individuals with different accents input commands,” Lala boasts.
Inclusion as a Business Driver
One more way to justify such investments might be emerging. The speech industry, like business overall, is shifting from a traditional focus exclusively on profits to one that also weighs social justice issues.
Discrepancies are evident among ethnic groups in how easily their speech is recognized. A 2019 Stanford study found that popular speech engines exhibited substantial racial bias: an average word error rate of 35 percent for black speakers compared with 19 percent for white speakers.
Inclusion efforts also extend to individuals with physical disabilities. For millions of people around the world, speech impairment is a daily reality; roughly 7.5 million people in the United States have trouble vocalizing words and phrases. Disorders involving pitch, loudness, and quality affect about 5 percent of children by the first grade. There are many types of speech impediments, including disfluency (stuttering), articulation errors, ankyloglossia (tongue tie), dysarthria (slurred speech), and apraxia (problems with lip, jaw, and tongue movement).
Google Casts a Big Shadow
Solutions are emerging that might close that gap as well. Google scientists are investigating ways to minimize word substitution, deletion, and insertion errors in speech models as part of Parrotron, an ongoing research initiative aimed at ensuring that atypical speech becomes better understood. The solution leverages an end-to-end AI system trained to convert speech from a person with an impediment directly into fluent synthesized speech.
The system represents another attempt to skip text generation. The ASR considers only speech signals, not visual cues such as lip movements, and it is trained in two phases using parallel corpora of input/output speech pairs. To fill in the blanks, the Google software turns recorded voice samples into spectrograms, or visual representations of the sound. The computer then uses common, transcribed spectrograms to train the system to better recognize less common types of speech.
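The spectrogram step itself is standard signal processing; a minimal sketch using SciPy, with a synthetic tone standing in for a recorded voice sample, looks like this:

```python
import numpy as np
from scipy import signal

# Minimal sketch of the waveform-to-spectrogram step: a one-second synthetic tone
# stands in for a recorded voice sample.
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)  # 440 Hz tone as a stand-in for speech

frequencies, times, spectrogram = signal.spectrogram(waveform, fs=sample_rate, nperseg=400)
print(spectrogram.shape)                # (frequency bins, time frames)
```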
Researchers at Google trained the company’s ASR neural networks on a dataset of heavily accented speakers and separately on a dataset of speakers with amyotrophic lateral sclerosis (ALS), which causes slurred speech of varying degrees as the disease progresses. Their thesis is that fine-tuning a small number of layers closest to the input of an ASR network improves ASR performance for atypical populations. This approach contrasts with typical transfer learning scenarios, where test and training data are similar but output labels differ; in those scenarios, learning proceeds by fine-tuning the layers closest to the output.
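A sketch of that freezing strategy, using a toy PyTorch encoder rather than Google’s actual network, would leave the deeper, pretrained layers untouched and update only the parameters nearest the input:

```python
import torch
import torch.nn as nn

# Sketch of the strategy described above: keep most of a pretrained ASR encoder
# frozen and fine-tune only the layers closest to the input, where accent- or
# impairment-specific acoustic differences show up. Toy model for illustration only.
encoder = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),   # input-side layer: adapt to atypical acoustics
    nn.Linear(256, 256), nn.ReLU(),  # deeper layer: keep pretrained weights frozen
    nn.Linear(256, 29),              # output layer over characters, also frozen
)

for name, param in encoder.named_parameters():
    param.requires_grad = name.startswith("0.")  # train only the first Linear layer

trainable = [p for p in encoder.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Because only a small fraction of the parameters change, such tuning is comparatively cheap to repeat for each accent or impairment group.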
The initial results have been promising, showing improvements in model performance, and the remaining errors are consistent with those associated with typical speech. The research can be applied to larger groups of people and to different speech impairments.
In the end, the work to make ASRs account for local dialects is dynamic and excruciatingly complex. Vendors are dabbling with new engine designs that could close the gaps one day, taking different approaches to recognizing what individuals say. But the challenges remain significant, and the quest is a bit quixotic given the many variables and market dynamics at play in trying to determine exactly what each person says, regardless of where, how, and when they were raised.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or on Twitter @PaulKorzeniowski.