July 2, 2024
By Kevin Brown, enterprise architect, Miratech
Inside Speech

Speaking of Speech Tech’s Future, It Suddenly Arrived

In a world where communication is key, speech technology stands at the forefront of innovation. After nearly a decade of repeating “This is the year of speech!” until it became a joke, the industry has progressed to “What?!? When did they make that available?”

And there is no end in sight; the realm of speech technology continues to evolve at an exponential pace, revolutionizing how we interact with technology and each other.

Think back to when you first heard of an idea and when you first heard that it had been delivered, then contemplate how we are taking it all for granted.

Here is a short inventory of what arrived at the speed of sound and turned the future into the here and now. Even for those of us who spend time thinking about what’s next, these capabilities came rapidly indeed.

One of the most discussed breakthroughs in speech technology was natural language processing (NLP) and natural language understanding (NLU). This advancement led to the proliferation of virtual assistants like Siri, Alexa, and Google Assistant and is directly tied to the rise of cloud computing and ubiquitous network connectivity.

Each of the various players arrived at its tuned NLP/NLU solution differently. A good example is Google, which launched Google Voice in 2009 and used its end-user-ranked voicemail transcriptions to fuel the training of its NLP solutions. Apple launched Siri in October 2011, Amazon introduced Alexa in November 2014, and Google unveiled its Assistant in May 2016.

Speech translation has piggybacked on the great strides in NLP and NLU, which supply the appropriate context, amazing everyone who provides or requires translation services. Though low-cost translation has been available since 2000, the accuracy, speed, and number of languages offered are far greater today. Google’s Universal Speech Model supports over 200 languages, with a goal of supporting 1,000.

With these breakthroughs, speech-to-text transcription has improved in lockstep. Two years ago in this column, we examined real-time speech transcription and the effect it has on aiding service and support roles. It continues to be deployed in medical, legal, and other professions, where it immediately surfaces appropriate documentation without the user having to launch a manual search with a best-guess query. With the advancement of AI, more coaching assistance is being provided by the systems themselves rather than just escalating to human management. This has resulted in positive feedback from employees, who report that AI coaching is more standardized and, in many cases, more user-friendly than human managers’ coaching. And the real-time assistants allow employees to review where they needed help and what that help was, thereby extending the coaching for those interested in improving their performance.

Even lowly text-to-speech (TTS) has taken a rocket ride into the future. With neural networks, the robotic voices of poorly designed interactive voice response (IVR) systems are outdated. With truly human-sounding TTS, rapid changes can be made in IVR systems, supporting the provision of dynamic information. In many cases where a dedicated IVR team previously needed to make changes, interfaces with granular permissions now allow business units to make their own changes to what is spoken to customers, including differing emotional expression. Furthermore, the capabilities of the latest TTS solutions are fueling virtual assistants’ ability to speak useful content in a manner that is easily digestible, which supports their entrance into business use.

Voice biometrics has emerged as a powerful tool for authentication and security. By analyzing unique vocal characteristics such as pitch, tone, and cadence, voice biometric systems can verify a person’s identity with a high degree of accuracy. Recent advancements in voice biometrics focus on improving accuracy, robustness, and anti-spoofing capabilities. Machine learning algorithms are being leveraged to adapt to individual speech patterns and enhance authentication performance over noticeably short periods of time.

The same neural network capabilities used to modernize TTS have also enabled deepfakes; as always, criminals seek to leverage technologies to their advantage. Currently, the major voice biometrics vendors are staying a few steps ahead of criminals through anti-spoofing capabilities that are rightly being held as trade secrets.

In the world of speech technology, the technology is now advancing so fast that it is difficult for people, processes, and regulations to keep up. Careful thought now will be necessary to avoid major course corrections later.

On that note, our next column will discuss what fantastic things speech technology might bring us in the future, which will be here before we can imagine.

Kevin Brown is an enterprise architect at Miratech with over 25 years of experience designing and delivering speech-enabled solutions. He can be reached at kevin.brown@miratechgroup.com.