Let’s Continue to Prioritize Innovation

For decades, the speech technology industry has been searching high and low for low-memory-footprint solutions—ways of reducing the amount of space and computing power that speech recognition, speech synthesis, and related technologies require—and with good reason. The amount of memory consumed by many speech recognition and speech synthesis apps has a direct impact not just on product manufacturing costs but also on the types of devices where they can be installed, the types of applications with which they can be paired, processing speeds, accuracy, and user satisfaction. It is abundantly clear that anything that can be done to save memory should be exploited.

Some of the steps already explored have involved reducing the number of phonemes, compressing waveforms, experimenting with different lexicon and vocabulary sizes, and looking into different deployment options, such as moving processing and storage to the cloud or keeping them on the device. For many companies, re-creating the speech processing stack on the device was a massive undertaking that involved science and engineering teams working together on software and hardware redesign. It meant reengineering system architectures, endpoints, contextual awareness, model training, federated learning, neural networking, and a lot more, but it was worth the effort.
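To make the waveform-compression idea concrete, here is a minimal sketch—ours, not drawn from any particular vendor's stack—of mu-law companding, a long-standing technique that squeezes 16-bit speech samples into 8 bits at the cost of some fidelity:

import numpy as np

MU = 255.0  # standard mu-law parameter used for 8-bit telephony audio

def mu_law_compress(x: np.ndarray) -> np.ndarray:
    """Compand float samples in [-1, 1] and quantize them to one byte each."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mu_law_expand(q: np.ndarray) -> np.ndarray:
    """Invert the companding to recover an approximation of the waveform."""
    y = q.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# One second of a 440 Hz tone at 16 kHz: 16-bit storage would take 32,000 bytes;
# the companded version takes 16,000.
pcm = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
compressed = mu_law_compress(pcm)
restored = mu_law_expand(compressed)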

Still, researchers and practitioners continue to grapple with long-standing challenges that have placed limits on what speech technologies can currently achieve. Much more work needs to be done on real-time capabilities, real-world noisy environments, domain- and industry-specific use cases, ending the trade-off between latency and accuracy, speaker diarization that can separate one speaker from another and identify who said what, and hybrid approaches that combine speech modalities, such as text-to-speech and speech translation.

While it’s true that speech has come a long way, the next waves of innovation in the speech technology space promise to be far more transformative than any of the breakthroughs that brought the industry to where it is right now. Driving that innovation is, and should continue to be, a high priority for everyone in the industry.

As we’ve reported in the past few issues, a lot of research has been happening in academic circles. In this issue’s FYI section, for example, we cover research from a number of universities to improve speech for medical uses like clinical documentation, diagnosing diseases, and controlling medical devices. In this issue’s two features, we also cover recent improvements in speech analytics for identifying unique emotional states and sentiments and for making avatars seem more lifelike by syncing their lip movements, gestures, and facial expressions with the audio.

Other work is under way to treat speech modalities like speech recognition and text-to-speech as unified systems that can act simultaneously, share representations of speech and text, and draw from a single training pipeline and feedback loop. And yet more work is being done to help models better understand and respond to complex audio environments, to enable models to accept natural language instructions, and to consider broader conversational and situational context before creating outputs, locating information, or responding to requests and inquiries.

Going forward, speech will no longer be able to act in isolation, as it has for so many years. The speech technologies of the future will be fully conversational, fully multimodal, fully integrated, and fully able to understand intent, tone, sentiment, and context in real time and respond dynamically as interactions progress and states change.

The challenges aren’t just technological, and they can’t be solved in academia alone. Building the speech systems that will be most in demand—ones that don’t just understand what we say but that understand us as individuals—will require a whole market approach that brings together academics, technology vendors, end users, governing bodies, and industry associations.

As we continue to build technologies that are advancing at an astonishing pace, it’s clear that conversational interfaces will be the way that we communicate with everything and everyone. And within this ecosystem, speech will continue to be one of the most complex and exciting areas of development. We all need to keep supporting this incredible industry and fueling its innovation. Though the technology has been around for decades, its evolution is just beginning.

Leonard Klie is the editor of Speech Technology magazine. He can be reached at lklie@infotoday.com.
