Speech Engines: Improving Accuracy and Finding New Uses
Improvements in speech engines continue as suppliers use more processing power and more sophisticated software to enhance system performance. These solutions are doing a better job of understanding dialects, blocking out background noises, and supporting more customization options. As a result, the market is growing at a rapid clip. Vendors are trying to expand their systems’ reach by developing vertical market solutions and products that respond to human emotions.
Buoyed by recent technical advances, revenue for the speech and voice recognition market is expected to rise from $6.2 billion in 2017 to $18.3 billion by 2023, growing at a compound annual rate of 19.8 percent, according to research firm MarketsandMarkets. Competition is broad and includes consumer-focused suppliers such as Acapela Group, Amazon, Apple, and Google, as well as enterprise-centered vendors like Hoya (which is looking to unify the NeoSpeech, Voiceware, and ReadSpeaker companies it acquired under its banner), Microsoft, and Nuance Communications.
Traditionally, this market was constrained by low accuracy rates: these solutions usually operated in the 70 percent range, but recent technical improvements have pushed results into the 90 percent range. “The algorithms used in these systems have been getting better and better,” notes Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group.
Support for more languages has been another issue. These products typically recognize English and major European and Asian languages, such as Spanish, French, and Mandarin Chinese. Recently, vendors have been expanding the number of languages supported. Hoya, for example, now supports Cantonese. And Belgium-based Acapela Group in 2015 created two text-to-speech voices in North Sami, part of a family of languages spoken by about 20,000 people in northern Norway, Finland, and Sweden.
Systems are nowhere close to supporting the estimated 7,000 languages and dialects currently spoken worldwide. A couple of factors limit that number, starting with cost. “If there are only 1 million people speaking a language, it becomes difficult for a vendor to justify their investment,” Dahl says.
Another issue is dialects. Vendors have been fine-tuning their engines to support a wider array of language variations (such as U.S., U.K., and Australian English) rather than going after entirely new languages. Hoya’s ReadSpeaker, for example, recently added Pilar, a TTS voice in Castilian Spanish.
Addressing Long-Standing Issues
However, some traditional issues have become more problematic recently. Speech recognition engines now operate in more locations, some of which are not conducive to easily identifying spoken words. Smartphone and automotive applications travel with users into places with a lot of background noise. Recent system upgrades have done a better job of filtering out that noise, as well as addressing echoes, reverberation, and similar anomalies.
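To illustrate the basic principle behind that kind of noise filtering (commercial engines use far more sophisticated methods), the sketch below implements a simple spectral gate in Python: it estimates a per-frequency noise floor from a noise-only clip and zeroes out time-frequency bins that never rise above it. The function name, parameters, and threshold are illustrative assumptions, not any vendor’s algorithm.

```python
import numpy as np

def spectral_gate(signal, noise_clip, n_fft=1024, hop=256, threshold_db=10.0):
    """Suppress stationary background noise with a basic spectral gate.

    signal: 1-D float array containing the noisy recording.
    noise_clip: a short segment (>= n_fft samples) containing only background noise.
    Illustrative sketch only; production engines are far more sophisticated.
    """
    window = np.hanning(n_fft)

    def stft(x):
        frames = [x[i:i + n_fft] * window
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.fft.rfft(np.array(frames), axis=1)

    # Estimate a per-frequency noise floor from the noise-only clip.
    noise_floor = np.abs(stft(noise_clip)).mean(axis=0)
    gate = noise_floor * (10 ** (threshold_db / 20.0))

    # Zero out time-frequency bins that stay below the gate.
    spec = stft(signal)
    mag, phase = np.abs(spec), np.angle(spec)
    mag = np.where(mag > gate, mag, 0.0)

    # Overlap-add the filtered frames back into a waveform.
    frames = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft, axis=1)
    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for k, frame in enumerate(frames):
        out[k * hop:k * hop + n_fft] += frame * window
        norm[k * hop:k * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)
```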
In the enterprise, these solutions often capture customer interactions. Here, multiple speakers can be on the line simultaneously, so the system has to identify who is speaking and break the interaction down into separate streams of input. Emerging use cases have created a need for even further refinements. For instance, in June, Nuance introduced a version of Dragon Drive that allows an in-car assistant to communicate not only with the driver but also with passengers inside the vehicle.
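To make that speaker-separation step concrete, here is a minimal sketch of how a developer might request it from a cloud recognizer, using the google-cloud-speech Python client and its speaker diarization option. The storage URI, sample rate, and speaker counts are placeholder assumptions, and this is not Nuance’s or any particular vendor’s pipeline.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Ask the recognizer to tag each word with the speaker who said it.
diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,   # placeholder: typical two-party call
    max_speaker_count=6,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,          # placeholder telephony sample rate
    language_code="en-US",
    diarization_config=diarization_config,
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/contact-center-call.wav")

response = client.recognize(config=config, audio=audio)

# With diarization enabled, the final result carries every word with a
# speaker_tag, which lets the caller regroup the words into per-speaker turns.
words = response.results[-1].alternatives[0].words
for word in words:
    print(f"speaker {word.speaker_tag}: {word.word}")
```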
Such improvements arise from the coalescence of advances in a number of underlying technologies. First, computer processing power has been growing. In the past, speech recognition applications demanded more computing than could be delivered cost-effectively, which limited their accuracy. But lately, suppliers have built out large data centers capable of delivering oodles of processing power.
Cloud computing is also taking on a bigger role in this market. Suppliers like Amazon and Google have been building businesses based on delivering public cloud speech recognition services.
Google has been moving aggressively in this space. About a year ago, the vendor opened up its Cloud Speech API. The software was already powering speech recognition for Google services such as Google Assistant, Google Search, and Google Now; with this change, it could be used by third-party developers to convert audio to text from input on a wide range of devices, including cars, TVs, speakers, phones, and personal computers. Google also improved its transcription accuracy for long-form audio and sped up speech processing threefold. Finally, the API supports a wider range of audio file formats, including WAV, OPUS, and Speex.
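As a rough sketch of what a third-party developer’s call to the Cloud Speech API can look like, the snippet below uses the google-cloud-speech Python client to transcribe an Opus-encoded clip. The bucket path and sample rate are placeholders, and the client surface shown here may differ from the version available when the API first opened up.

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    # The encoding enum covers the formats mentioned above: LINEAR16 for
    # uncompressed WAV, OGG_OPUS for Opus, SPEEX_WITH_HEADER_BYTE for Speex.
    encoding=speech.RecognitionConfig.AudioEncoding.OGG_OPUS,
    sample_rate_hertz=16000,
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/clip.ogg")  # placeholder URI

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result's top alternative holds the most likely transcript.
    print(result.alternatives[0].transcript)
```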