
Speech Technology Is Primed for a Big Leap Forward


The speed of progress in speech technology opens the door to possibilities that would have been relegated to science fiction just a decade ago. The key drivers are ubiquitous fast networks, cloud computing that scales quickly, new language models with significantly expanded capabilities, and, increasingly, the adoption of neural networks.

All of the above provides the underpinnings for leaps in capability that are both broad (think Google's 1,000 Languages Initiative, which supported 243 languages as of July 2024) and targeted, even personalized to a single person.

One of the most exciting advancements in voice recognition is contextual understanding: the ability to integrate context into a conversation. Contextual awareness matters to all types of speech technology because it brings language understanding into the correct focus, overcoming some of the root challenges of speech recognition, such as ambiguity caused by homonyms, pace of speech, and pronunciation.

Coupling context awareness with emotion detection begins to raise speech technologies to the level of human capability: instantly knowing what is being discussed, with situational context enriched by emotion detection.

Personal assistants today generally only take commands (“Google, turn the Nest down to 74 degrees”) or provide information (“Alexa, what is the temperature outside?”). Imagine instead a personalized assistant that listens to everything you hear and offers insights or help whenever that would be valuable to you, tuned to the level you prefer. (Expert business assistant: “Jane seems irritated because you will be late providing the design to her. You might remind her that she also asked you earlier this week to write the analysis describing the customer’s issue, along with suggestions on how to address it, and gave you only one day to deliver with no prior notice. Also, ask her if there is anything on your task list that she could have you put to the side so you can focus on completing the design.”)

A personalized assistant could be your coach for many daily activities, giving you feedback on how to complete them faster or better. A personalized speech coach could provide an overview of your day and, once you have asked for it, real-time notifications on how to improve your communication. (Assistant daily report: “Olivia, you used the filler word ‘like’ 382 times today, including 116 times at work. Click here to see a breakdown by hour or conversation. I can notify you when you use it more than twice per hour. Click here to configure how you would like me to provide those notifications.”)
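To make the daily-report idea concrete, here is a minimal sketch in Python of the kind of filler-word tally it describes, assuming the assistant already has timestamped transcripts; the function name and demo data are hypothetical:

```python
# Minimal sketch of an assistant's filler-word tally (hypothetical names).
# Assumes transcripts arrive as (ISO timestamp, utterance) pairs.
from collections import Counter
from datetime import datetime
import re

FILLERS = {"like", "um", "uh", "basically"}

def tally_fillers(transcript):
    """Count filler words per hour of the day."""
    per_hour = Counter()
    for stamp, text in transcript:
        hour = datetime.fromisoformat(stamp).strftime("%H:00")
        words = re.findall(r"[a-z']+", text.lower())
        per_hour[hour] += sum(1 for w in words if w in FILLERS)
    return per_hour

demo = [
    ("2024-07-15T09:12:00", "So, like, the design is, like, almost done."),
    ("2024-07-15T09:40:00", "Um, I can send it over today."),
    ("2024-07-15T10:05:00", "That is, like, totally fine."),
]
for hour, count in sorted(tally_fillers(demo).items()):
    print(f"{hour}: {count} filler words")
```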

Speaker disambiguation has improved to the level where we will see near-real-time reporting or documentation of events such as legal trials, surgical procedures, and field support activities. As it improves and connects to robust AI-enabled knowledge bases, it could also notify the appropriate people when things are not leading to desired outcomes. For example, a flight checklist for pilots and copilots could catch anything out of the norm and raise an alert if a step were missed. Similar notifications could cover a variety of situations, be they legal, surgical, or business analysis.
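As a toy sketch of that checklist idea, assuming diarization has already produced speaker-labeled utterances; the checklist items and names here are illustrative only:

```python
# Toy sketch of checklist verification over a diarized transcript of
# (speaker, utterance) pairs; items and names are illustrative only.
CHECKLIST = ["flaps set", "altimeter set", "fuel pumps on"]

def missed_items(transcript, checklist=CHECKLIST):
    """Return checklist items that no speaker confirmed aloud."""
    spoken = " ".join(text.lower() for _, text in transcript)
    return [item for item in checklist if item not in spoken]

transcript = [
    ("Pilot", "Flaps set, altimeter set."),
    ("Copilot", "Checked."),
]
for item in missed_items(transcript):
    print(f"ALERT: checklist item not confirmed: {item}")
```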

Hyper-realistic voice synthesis is now becoming a reality and is incorporated in functions such as interactive voice response systems and audiobook narration. Soon we may see it used in moviemaking, either to correct dialogue and reduce the production costs of reshoots or to dub in an entire dialogue track with the actors’ voices generated by deepfake text-to-speech. Whereas today’s CGI-animated movies use computer-generated images with actors dubbing in their human voices in a studio, tomorrow’s movies might have flesh-and-blood actors with an audio track that is computer-generated using their voices. Making films available in languages other than the original will also be possible using the original actors’ voices.

In the short term, little gain will be made in speech recognition and generation “at the edge,” or on the device itself. Increasing capabilities at the device level requires a massive decrease in processing requirements that does not seem to be in the cards for the next decade, and increases in hardware power are still coupled to Moore’s Law. Moreover, most future capabilities will run on portable devices, will be power-hungry, and will require new battery technology.

You likely cringed several times while reading this column, thinking about the legal and ethical concerns of always-on speech capture. And that is good, because you are correct: we will have to consider those concerns and adjust to satisfy them. Leap forward, but safely.

Kevin Brown is an enterprise architect at Miratech with more than 25 years of experience designing and delivering speech-enabled solutions. He can be reached at kevin.brown@miratechgroup.com.
