Get Speech Recognition Right the First Time, but Quickly!
In my October 2023 column I noted the growing importance of real-time transcription, with the focus on text output consumed by humans. Accuracy and speed are very important for those uses. But the demand for accuracy rises sharply when the consumer of the output text is another computer application.
Many automated functions are beginning to rely on generative AI, which itself relies on large language models (LLMs). When speech is the input for LLMs and the recognition isn’t extremely accurate, the results can quickly go astray. A business report that describes variance as “peaks and troughs” suddenly takes on a much more negative tone when the input is transcribed as “pigs in troughs.”
Deep learning models, a class of neural networks, are getting better at processing imprecise speech inputs. But they are, at best, safeguards against poor speech input, whether from human gaffes or from poor speech recognition. Speech recognition works best when it gets it right the first time.
As massive investments in deep learning models ramp up, so do the long-running efforts to improve the audio processing that drives the rest of speech recognition functionality. And there is no new “golden egg” in audio accuracy; most of the effort is still focused on noise reduction and speaker differentiation. As speaker differentiation improves for specific uses (think meeting transcription), it is also being incorporated into noise reduction. One healthcare application I recently analyzed that lacks speaker differentiation is suffering “noise confusion” levels of more than 70 percent, due to background speech being picked up as the primary speaker.
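To make that idea concrete, here is a minimal sketch of how speaker-differentiation (diarization) output can double as noise reduction: keep only the words attributed to the primary speaker and drop the background chatter. The data structures, labels, and example values are illustrative assumptions, not the workings of any particular product.

```python
# Toy sketch: use diarization segments to filter a word-level transcript
# down to the primary speaker. Everything here is hypothetical.
from dataclasses import dataclass
from collections import Counter

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "SPEAKER_00"
    start: float   # seconds
    end: float

@dataclass
class Word:
    text: str
    start: float
    end: float

def primary_speaker(segments: list[Segment]) -> str:
    """Assume the primary speaker is the one with the most talk time."""
    talk_time = Counter()
    for seg in segments:
        talk_time[seg.speaker] += seg.end - seg.start
    return talk_time.most_common(1)[0][0]

def filter_to_primary(words: list[Word], segments: list[Segment]) -> list[Word]:
    """Keep only words whose midpoint falls inside a primary-speaker segment."""
    target = primary_speaker(segments)
    kept = []
    for w in words:
        mid = (w.start + w.end) / 2
        if any(s.speaker == target and s.start <= mid <= s.end for s in segments):
            kept.append(w)
    return kept

# Example: a clinician dictating while a conversation happens in the background.
segments = [Segment("SPEAKER_00", 0.0, 4.0), Segment("SPEAKER_01", 4.0, 5.0)]
words = [Word("patient", 0.5, 1.0), Word("presents", 1.1, 1.6),
         Word("lunch", 4.2, 4.6)]  # background chatter
print(" ".join(w.text for w in filter_to_primary(words, segments)))
# -> "patient presents"
```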
Dialect and Emotion Recognition
Because of growing globalization, the ability to recognize accents and dialects is taking on a higher priority. The ability to quickly gather input audio and corresponding transcripts is driving the addition of more dialects to speech models, along with the varying accents within a single dialect. And most applications are now expected to be multilingual rather than supporting one or perhaps two languages.
As with dialect recognition, specialized models have been developed, and are being extended, for industries that use unique terminology. “Lawyer-speak” and “physician-speak” are different enough from standard language to require specialized models.
Improvements in emotion detection are also driving a deeper understanding of speech input. Tone, pitch, and context can drastically shift the meaning of a simple sentence and, when captured, steer an LLM toward a more appropriate output. Take a sentence with several possible meanings: “That’s just great.” The speaker might be responding to a positive input, expressing happiness, excitement, admiration, or approval, or to a negative one, conveying sarcasm, frustration, or resignation; it’s the difference between night and day. Perhaps enough for you to call your digital assistant a not-very-nice name!
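As a toy illustration of the idea, here is how a coarse tone tag derived from pitch statistics could be attached to that ambiguous transcript before it reaches an LLM. The threshold and the per-frame pitch values are assumptions for the sake of the example, not settings from any shipping system.

```python
# Toy sketch: tag an ambiguous transcript with a rough tone label so the
# downstream LLM can tell lively enthusiasm from a flat, sarcastic delivery.
import statistics

def tone_label(f0_hz: list[float]) -> str:
    """Heuristic: wide pitch movement reads as enthusiastic; a monotone
    delivery of a 'positive' phrase hints at sarcasm or resignation."""
    voiced = [f for f in f0_hz if f > 0]          # drop unvoiced frames
    if not voiced:
        return "neutral"
    spread = statistics.pstdev(voiced)            # pitch variability in Hz
    return "enthusiastic" if spread > 25 else "flat/possibly sarcastic"

def build_llm_input(transcript: str, f0_hz: list[float]) -> str:
    return f"[tone: {tone_label(f0_hz)}] {transcript}"

# Same words, different delivery:
excited_f0 = [180, 220, 260, 210, 240, 190]   # wide pitch swings
deadpan_f0 = [150, 152, 149, 151, 150, 148]   # near-monotone
print(build_llm_input("That's just great.", excited_f0))
print(build_llm_input("That's just great.", deadpan_f0))
```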
Step up to the Mic
Nearly all the improvements over the past decade have been software-driven, be it neural networks or improved language models. But as with Formula One race teams pushing to finish first, every tiny change can make a difference. So in addition to software advancements, improved microphones are becoming part of the accuracy equation, with a focus not seen in more than a decade.
Depending on whether speakers are near or far, microphone advancements have come in the form of adaptive noise reduction, driven by software embedded in the microphone (as in automotive use cases), and beamforming, which uses algorithms to electronically steer an array of microphones toward the desired sound source among many, thereby reducing noise. Far-field speech recognition uses similar technology to accurately capture speech from a distance, allowing for better voice recognition in large spaces.
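For the curious, here is a minimal delay-and-sum beamforming sketch, assuming a simple linear microphone array and a known direction of arrival. Production systems layer adaptive filtering, fractional delays, and calibration on top of this core idea; the array geometry and sample values below are made up for illustration.

```python
# Minimal delay-and-sum beamformer: delay each channel so the target
# direction lines up in time, then average the channels to suppress
# sound arriving from other directions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals: np.ndarray, mic_positions_m: np.ndarray,
                  doa_deg: float, sample_rate: int) -> np.ndarray:
    """signals: (num_mics, num_samples); mic_positions_m: mic x-coordinates."""
    doa = np.deg2rad(doa_deg)
    # Arrival-time offset of the wavefront at each mic, relative to the origin.
    delays_s = mic_positions_m * np.cos(doa) / SPEED_OF_SOUND
    delays_samples = np.round(delays_s * sample_rate).astype(int)
    aligned = np.zeros_like(signals, dtype=float)
    for ch, d in enumerate(delays_samples):
        aligned[ch] = np.roll(signals[ch], -d)    # undo each channel's delay
    return aligned.mean(axis=0)                   # sum/average the steered channels

# Example: 4 mics spaced 5 cm apart, 16 kHz audio, source at 60 degrees.
rng = np.random.default_rng(0)
mics = np.arange(4) * 0.05
captured = rng.normal(size=(4, 16000))            # stand-in for captured audio
enhanced = delay_and_sum(captured, mics, doa_deg=60.0, sample_rate=16000)
print(enhanced.shape)  # (16000,)
```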
When you think about any of the latest and greatest speech-enabled applications, consider that the heavy lifting the LLM does in the middle depends on accurate speech input at the front end. Deep learning models are relatively new to speech recognition and have provided massive improvements, and as the challenges of 25 years ago are overcome, the dependence on those models for correction lessens, freeing them to expand speech recognition’s capabilities.
All these improvements stymie those who wish to play devil’s advocate by saying that “speech recognition isn’t ready for prime time yet.” Apparently, they have been using old technology that hears their statement as “devil’s avocado.”
Kevin Brown is an enterprise architect at Miratech with more than 25 years of experience designing and delivering speech-enabled solutions. He can be reached at kevin.brown@miratechgroup.com.