Eliminate Ambient Noise to Make Speech Recognition More Accurate
Speech recognition continues to grow and improve as more businesses and consumers rely on it regularly. Fortune Business Insights valued the market for these technologies at $12.6 billion in 2023 and expects it to grow to $15.5 billion this year and to $85 billion by 2032, a compound annual growth rate of 23.7 percent. Fueling that growth is an expanding set of use cases, from consumer-oriented voice assistants to contact center analytics.
Ongoing advances in the underlying technologies, such as artificial intelligence and natural language processing (NLP), along with the growing amount of training data, have led to an exponential increase in the ability to process voice at scale, the research firm says.
While the advancement of the underlying technologies has led to increased capacity, the accuracy of those systems has increased as well, says D. Daniel Ziv, vice president of AI and analytics go-to-market strategy at Verint, crediting in part neural networks embedded in large language models.
“We’ve gotten to the point that we have enough data to train the models that we can basically predict the next word,” Ziv explains. “Because of large language models and cloud computing, you can take large volumes of transcripts and do wonderful things with them.”
Despite all these advances, though, ambient noise—nearby conversations, machinery, roadway traffic, dogs barking, babies crying, phones ringing, etc.—can still wreak havoc with speech recognition.
Ambient noise compromises accuracy because it prevents speech recognition systems from capturing speech cleanly, making it difficult for the technology to decipher precisely what is being said and by whom.
“The challenge of distinguishing between speaker commands and background noise often leads to misunderstandings or failures in task execution,” notes Vivoka, a provider of voice-enabled technologies, in a blog post. “This not only causes frustration for users but also limits the clarity, functionality, and reliability of voice-controlled devices. Such obstacles emphasize the urgent need to adapt voice AI to varying noise levels, particularly in work environments where clear communication is crucial for safety and efficiency.”
The challenges of ambient noise differ with the environment and use case, Ziv says. In the contact center environment, in which Verint specializes, ambient noise depends largely on the setting. Traditional contact centers in office buildings are designed to minimize ambient noise issues. But the COVID-19 pandemic saw a shift to more remote work, with many of those home offices not designed to minimize ambient noise. Even with the shift back to more traditional office settings, many contact center workers remain remote.
Contact centers’ use of directional microphones and noise canceling technologies solves most of those problems for agents and automated systems, but it does nothing for the other end of the conversation. Noise on the customer’s side of the call is inevitable and not something contact centers can address at the source.
“A lot of companies don’t invest enough in that,” Ziv says. “Customers call from their cell phones mostly. Nobody’s calling from a phone booth anymore. Someone might have AirPods or something similar, but there’s always ambient noise, whether the caller is in a bus station, an airport, or a car. That all affects the quality of the transcription.”
Technology Solutions for Ambient Noise
There are a variety of technologies designed to minimize ambient noise.
Noise canceling technology registers the sounds happening in the background and produces an opposite, phase-inverted sound wave that cancels them out, keeping speech the center of attention. Audio editing software with noise reduction features, or noise canceling software, enables users to eliminate background noise from live audio (such as virtual meetings) as well as from recorded audio.
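To see that principle at its simplest, the short Python sketch below (using NumPy, with a pure sine wave standing in for real ambient noise) shows how adding a phase-inverted copy of a noise signal cancels it out:

```python
import numpy as np

# Simulated capture: a 200 Hz hum standing in for ambient noise.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
noise = 0.5 * np.sin(2 * np.pi * 200 * t)

# Active noise cancellation emits the inverted waveform...
anti_noise = -noise

# ...so the two waves destructively interfere at the listener's ear.
residual = noise + anti_noise
print(f"Residual noise energy: {np.sum(residual**2):.6f}")  # ~0.0
```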
Cache Merrill, founder of software development company Zibtek, and Hiren Shah, founder of advertising firm Anstrex, recommend the following technologies to help mitigate ambient noise issues:
- Beamforming, a signal processing technique that uses an array of sensors (antennas for radio, microphones for sound) to focus on signals arriving from a specific direction. It improves the signal-to-noise ratio, reduces interference, and concentrates pickup on a specific location, enhancing the desired signal while greatly reducing ambient noise.
- Multi-microphone arrays, in which a group of microphones is arranged in a specific pattern to capture sound from multiple directions. They use beamforming to improve the quality of sound arriving from specific directions while reducing background noise and reverberation (a minimal delay-and-sum sketch follows this list).
- Noise reduction algorithms and deep learning. These include spectral subtraction, a technique that removes noise from an audio signal by estimating the noise spectrum during pauses in the speech and subtracting it from the overall spectrum (see the second sketch after this list); Wiener filtering, a signal processing technique that uses a filter to estimate a desired signal from a noisy observation; and newer deep learning separation models based on recurrent neural networks (RNNs), which usually outperform the traditional methods. The strength of RNNs lies in their ability to model audio as a sequence, separating the noise from the speaker’s voice while improving recognition accuracy.
- Preprocessing with voice activity detection (VAD). This enables systems to determine when speech is present, as opposed to when there is only noise or silence. It focuses processing on the relevant speech segments while discarding audio that might otherwise introduce distortion (a simple energy-based version is sketched below).
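For readers who want to see the mechanics, a minimal delay-and-sum beamformer, the simplest form of the technique, might look like the following Python sketch; the microphone spacing, steering angle, and sample rate are illustrative assumptions, not values from any particular product:

```python
import numpy as np

def delay_and_sum(mics, mic_spacing=0.05, angle_deg=0.0,
                  sample_rate=16000, speed_of_sound=343.0):
    """Steer a linear microphone array toward angle_deg and average.

    mics: array of shape (n_mics, n_samples), one row per microphone.
    Signals arriving from the steering direction add coherently;
    off-axis noise adds incoherently and is attenuated.
    """
    n_mics, _ = mics.shape
    angle = np.deg2rad(angle_deg)
    out = np.zeros(mics.shape[1])
    for m in range(n_mics):
        # Extra distance the wavefront travels to reach microphone m.
        delay_sec = m * mic_spacing * np.sin(angle) / speed_of_sound
        delay_samples = int(round(delay_sec * sample_rate))
        # Advance each channel so the target signal lines up, then sum.
        out += np.roll(mics[m], -delay_samples)
    return out / n_mics
```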
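Spectral subtraction is similarly compact in outline. The sketch below (Python with SciPy) assumes, purely for illustration, that the first few frames of the recording contain noise only:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(audio, sample_rate, noise_frames=10):
    # Short-time Fourier transform: work frame by frame in frequency.
    f, t, spec = stft(audio, fs=sample_rate, nperseg=512)
    magnitude, phase = np.abs(spec), np.angle(spec)

    # Estimate the noise spectrum from frames assumed to be speech-free.
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract it, flooring at zero so no negative magnitudes remain.
    cleaned = np.maximum(magnitude - noise_estimate, 0.0)

    # Rebuild the waveform using the original phase.
    _, enhanced = istft(cleaned * np.exp(1j * phase), fs=sample_rate)
    return enhanced
```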
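And an energy-based VAD, the most basic form of the preprocessing step described above, reduces to thresholding frame energy; the frame length and decibel threshold here are illustrative:

```python
import numpy as np

def voice_activity(audio, frame_len=400, threshold_db=-35.0):
    """Flag frames whose energy exceeds a floor; True means 'speech-like'."""
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Root-mean-square energy per frame, converted to decibels.
    rms_db = 20 * np.log10(np.sqrt((frames ** 2).mean(axis=1)) + 1e-10)
    return rms_db > threshold_db

# Downstream, only frames marked True are passed to the recognizer.
```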
In addition to the above technologies, another ambient noise canceling technology to consider is acoustic echo cancellation (AEC), says Chris Dukich, owner of digital signage provider Display Now. “AEC is critical for conferencing or smart device applications. This method locks and suppresses echoes for better sound quality.”
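Production echo cancellers are far more sophisticated, but the core idea, an adaptive filter that learns the echo path from the far-end (loudspeaker) signal and subtracts its prediction from the microphone feed, can be sketched with a normalized least-mean-squares (NLMS) loop; the filter length and step size are illustrative:

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=256, step=0.5, eps=1e-8):
    """Subtract an adaptively predicted echo of far_end from mic."""
    weights = np.zeros(filter_len)      # learned model of the echo path
    output = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        # Recent far-end samples drive the echo prediction.
        x = far_end[n - filter_len:n][::-1]
        echo_estimate = weights @ x
        # The residual is (ideally) the near-end talker alone.
        error = mic[n] - echo_estimate
        output[n] = error
        # Normalized LMS update keeps adaptation stable.
        weights += step * error * x / (x @ x + eps)
    return output
```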
Many leading speech technology vendors have also been hard at work addressing the ambient noise issue.
Earlier this year, Microsoft upgraded the capabilities of its Windows 11 platform to eliminate ambient noise with Voice Clarity, an AI-powered feature to enhance video calling. This functionality, which was previously exclusive to Surface devices, was made available to all devices running on the Windows 11 operating system. The new Voice Clarity function uses sophisticated AI models to instantly eliminate echoes, reverberation, and background noise.
Other technology vendors, including hyperscalers like Google and Amazon, are also active in this area, though much of their efforts focus on speech recognition-based transcription and translation, according to experts.
For the video creation community, Adobe introduced Project Sound Lift late last year. The technology is designed to separate voices and ambient sounds from daily-life scenarios, splitting speech, applause, laughter, music, and other audio elements into distinct tracks. Each track can be individually controlled to enhance the quality and content of the video.
The readily available technologies for addressing ambient noise are good enough in many cases. But there are other instances (see the sidebar below) when speech recognition must be exacting.
“Not every word matters in the same amount,” Ziv says. “For most use cases, there are business terms that are very important, and you want to get those right. It doesn’t matter if you get the stop words (e.g., ‘um’) right or wrong. But if it’s a medical conversation about a condition you have or medication that you should be taking, it’s very important you get that right.”
Ziv adds that you should be sure to use the right speech recognition (and associated noise canceling technology) for the right purpose. A tool built for dictation might not work well in a contact center environment. Similarly, a tool built for broadcasting from a quiet studio likely will not work well on a busy street corner.
“Then there are a lot of things that are unique to the contact center environment,” Ziv says. “In a contact center, you have different models (all agents in an office, all remote, a hybrid environment). You need a model that self-tunes and improves to reach a very high level of accuracy.”
To help ensure that the transcription of speech recognition is accurate and continues to improve, Verint launched the Verint Exact Transcription Bot within the past year. According to the company, customers have reported greater than 90 percent transcription accuracy and greater than 95 percent categorization accuracy.
“It doesn’t necessarily get 100 percent accuracy, but it constantly learns and improves on the important business terms,” Ziv says. “When you’re summarizing something, you want to be very accurate with things like product names, competitors’ names, etc. You want a special model that knows those terms and transcribes them accurately.”
With the above technologies, increasingly aided by AI for continuous improvement, eliminating ambient noise to improve speech recognition is getting easier all the time, according to Ziv. “We have the ability to self-tune and learn, especially with the important words.”
Further Advances on the Horizon
The ability of AI to fine-tune speech recognition to address ambient noise, accents, and other issues is expected to improve thanks not only to machine learning but also to emerging techniques.
Researchers at Ohio State University have developed a new deep learning model that improves audio quality by combining AI with human perception. They found that incorporating people’s subjective ratings of sound quality into a speech enhancement model led to better speech quality.
This new model reportedly outperformed other standard approaches at minimizing the presence of ambient noise, and the predicted quality scores were very close to the judgments humans would make.
“What distinguishes this is that we’re trying to use perception to train the model to remove unwanted sounds,” says Donald Williamson, co-author of the study and an associate professor of computer science and engineering at Ohio State. “If something about the signal in terms of its quality can be perceived by people, then our model can use that as additional information to learn and better remove noise.”
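The Ohio State paper details the actual architecture; purely as a sketch of the general idea of folding a learned quality predictor into a training objective, the loss might look something like this in PyTorch, where quality_model is a hypothetical stand-in for a frozen network pretrained to predict human quality ratings:

```python
import torch
import torch.nn.functional as F

def perceptual_loss(enhanced, clean, quality_model, alpha=0.1):
    """Blend waveform error with a penalty from a learned quality predictor.

    quality_model is assumed (hypothetically) to be a frozen network,
    pretrained to predict human quality ratings (higher = better).
    """
    reconstruction = F.mse_loss(enhanced, clean)
    predicted_quality = quality_model(enhanced).mean()
    # Minimizing the loss pushes the predicted quality score upward.
    return reconstruction - alpha * predicted_quality
```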
The Massachusetts Institute of Technology’s Lincoln Laboratory is seeking a patent for a technology that uses deep neural networks (DNN) to improve the performance of single-channel speech enhancement systems. Embodiments feature a DNN-trained system capable of predicting the presence of speech in an input signal, along with a framework for tracking ambient noise and estimating the signal-to-noise ratio.
According to MIT, this system offers increased flexibility in its design parameters, like gain estimation, and enables joint suppression of both additive noise and reverberation. The technology is designed to detect speech amid noise and reverberation.
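MIT’s filing spells out its particular design; the basic mechanics of SNR-driven gain estimation, though, are standard in speech enhancement and reduce to a per-frequency suppression gain. The sketch below shows the classic Wiener-style gain, not MIT’s method:

```python
import numpy as np

def snr_gain(signal_power, noise_power, eps=1e-10):
    """Per-frequency suppression gain from an estimated signal-to-noise ratio.

    The Wiener-style gain snr / (1 + snr) approaches 1 where speech
    dominates and 0 where noise dominates, suppressing noisy bins.
    """
    snr = signal_power / (noise_power + eps)
    return snr / (1.0 + snr)

# Applied bin by bin to the noisy spectrum: enhanced = gain * noisy.
```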
Yet as much as a variety of technologies can minimize ambient noise, there can still be issues, Ziv admits. “It’s not perfect, and it probably never will be. Even as humans, there are words we miss when we speak to each other on the phone. We predict and [insert] those missing words, but sometimes we have to go back and rephrase things because there are issues.”
Phillip Britt is a freelance writer based in the Chicago area. He can be reached at spenterprises1@comcast.net.
Technology Is Not Always Enough
Even with all the latest technologies, humans still sometimes need to ensure the accuracy of a speech recognition system—whether ambient noise is present or not—particularly in the sensitive areas of law, financial services, and healthcare.
The use of voice or speech recognition technology for healthcare documentation has put patients at risk of injury and death, the Joint Commission’s Division of Health Care Improvement notes in a blog post.
In one speech recognition case, a 2012 medical malpractice filing, a jury awarded the plaintiff $140 million after it was determined that a transcription error led to a patient’s death. The patient, a lifelong diabetic, was admitted to a hospital when she developed a blood clot in her dialysis port. Upon discharge, she went to a rehabilitation facility. A nurse at the hospital transferred the patient’s information to the rehabilitation hospital. Instead of relying on the full medication reconciliation document and patient transfer order, the nurse obtained the information she needed from a copy of the doctor’s dictated discharge summary, which had been sent out of the United States for editing. The transcription contained errors, including a notation that the patient was to receive 80 units of insulin instead of 8 units. At the rehabilitation hospital, the patient received the much higher dosage, which caused brain damage, cardiopulmonary arrest, and death.
“Text capture by SRT [speech recognition technology] should always be edited for accuracy by either a third-party editor, which is preferred, or in real time by the author,” the Joint Commission recommends. —P.B.