Speech Goes Mobile
What is the value of speech technologies in mobile devices? On its face the answer is obvious: speech is the most natural of all user interfaces. Humans are wired to talk and listen, so unless human genetics changes sometime soon, speech is and always will be the most natural interface between humans and machines. Add to this the fact that, because of their small size, mobile devices generally lack keyboards, and the case for speech technologies (including speech recognition, text-to-speech, and speaker verification) on mobile devices gets even stronger. So why isn't speech on every device? There are a number of reasons.

Let's start with speech recognition. Large-vocabulary, speaker-independent recognition that requires minimal or no training is a recent phenomenon. Building on their success deploying powerful speech applications in large network-based environments, companies are now migrating those applications into the embedded space. The result is speech recognition with large dynamic vocabularies (up to 100,000 words) that lets users buy a mobile device, synchronize it with their Outlook or Notes contacts, and then immediately access those contacts using only their voice. This is a big step forward from the tedious voice enrollment currently required on most speech-capable mobile phones.

On the speech generation side, new text-to-speech (TTS) technologies have improved the quality of synthetic speech to the point where it is now nearly indistinguishable from recorded prompts. This dramatic improvement in naturalness opens up many new options for phone services. With high-quality TTS, mobile devices can offer an intelligible voice that users actually enjoy, making it possible for the device to read emails, play back a list of contacts in an address book, or read out song titles on an MP3 player, all without annoying the user.

The second barrier to embedded speech technologies has been the limited processing power of mainstream devices. This is now changing with the introduction of platforms such as Texas Instruments' OMAP, Intel's XScale, Hitachi's SH-Mobile, and Motorola's MobileGT. Limited memory and CPU speed are no longer a serious obstacle to getting powerful speech recognition and TTS engines onto these platforms. It may take a few years for these platforms to dominate mainstream mobile devices, since the cost of the chips and related memory is still an issue, but the platforms are on the market today and device manufacturers are beginning to incorporate them.

The third hurdle to getting speech onto mobile devices is customer adoption. Although speech may be the most natural of all interfaces, it is not the only interface. For consumers to really feel the need for speech interfaces, speech must relieve a very real point of pain in an existing user interface. One example is recent hands-free legislation that restricts drivers' use of handheld cell phones in cars. The Gartner Group estimates that 49.5 percent of all cell phone calls are made from the car, and with legislation banning the use of phones in the car under review in over 20 states, speech represents a viable solution. Using speech, drivers can access contacts and dial phone numbers with their voice, allowing them to keep their eyes on the road and their hands on the steering wheel, ultimately reducing driver distraction.
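To make the "large dynamic vocabulary" idea concrete, here is a minimal sketch of a contact-dialing grammar in the W3C's SRGS XML format, one plausible way such a vocabulary could be represented; the contact names and the command word are hypothetical, and the actual grammar format varies from engine to engine:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- A sketch of a dialing grammar the device could regenerate
         each time it synchronizes with the user's address book. -->
    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             version="1.0" xml:lang="en-US" root="dial">
      <rule id="dial" scope="public">
        <item repeat="0-1">call</item>   <!-- optional command word -->
        <ruleref uri="#contact"/>
      </rule>
      <rule id="contact">
        <one-of>
          <item>John Smith</item>        <!-- hypothetical contacts -->
          <item>Maria Garcia</item>
          <!-- ...one item per synchronized contact... -->
        </one-of>
      </rule>
    </grammar>

The key point is that such a grammar is data, not code: the device can rebuild it automatically after every synchronization, which is what eliminates the per-name voice enrollment described above.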
Navigation devices, for all their good intentions, suffer from a similar point of pain: users have trouble entering a destination quickly and easily. One can argue that navigation devices would be more widely used in rental cars if the driver could easily input the destination. Rather than typing a destination on an awkward keypad, the driver could simply say, "New York Marriott Marquis." In fact, the reigning quip about most navigation systems on the market is that users can generally arrive at their destination before they finish entering it on the device. Similar conclusions about the importance of speech can be drawn about accessing the hundreds of songs on an MP3 player or making public kiosks, ATMs, and other devices accessible to the blind. In all of these cases, speech uniquely solves a problem in the user interface that other technologies cannot.

A final point on user adoption is the emergence of a combination of speech and visual display on mobile devices, often referred to as a multimodal user interface. Although multimodal applications are not mainstream today, many platform providers have begun to successfully deploy multimodal solutions on portable devices, and a number of carriers are evaluating the technology. These are important steps in getting users to adopt speech as part of a more complex user interface. Some multimodal applications being evaluated today include the following:

·Motorola has recently completed a successful trial of a multimodal user interface with SpeechWorks on its iDEN phone;
·Kirusa has begun the first GPRS trial of multimodal applications with French mobile operator Bouygues Telecom; and
·Orange Services (Imagineering) is working with Lobby7's multimodal x|mode platform.

The SALT Forum, founded with the goal of creating a standard for multimodal application development, should begin to accelerate the push for multimodal user interfaces. Microsoft has already released the .NET Speech SDK, a SALT-based developer toolkit that integrates into the Microsoft Visual Studio development environment and Microsoft's Web server programming environment, ASP.NET, enabling application developers to incorporate speech functionality into Web applications. Microsoft's ASP.NET for Web development currently has over one million developers.
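To give a flavor of the SALT programming model, here is a minimal sketch of a speech-enabled HTML page; the element names follow the SALT 1.0 specification, while the grammar file, field names, and prompt text are hypothetical:

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <body>
      <!-- an ordinary visual field; speech fills it in via <salt:bind> -->
      <form action="/directions" method="get">
        <input id="txtCity" name="city" type="text"/>
        <button type="button"
                onclick="askCity.Start(); recoCity.Start();">Speak</button>
      </form>

      <!-- TTS prompt played when the button is pressed -->
      <salt:prompt id="askCity">Say a destination city.</salt:prompt>

      <!-- recognition against an SRGS grammar; the result is bound
           into the visual field, so the user can speak or type
           interchangeably -->
      <salt:listen id="recoCity" onreco="document.forms[0].submit()">
        <salt:grammar src="cities.grxml"/>   <!-- hypothetical grammar -->
        <salt:bind targetelement="txtCity" value="//city"/>
      </salt:listen>
    </body>
    </html>

Because SALT is a small set of tags layered onto ordinary HTML, the same page serves both the visual and the voice modality, which is precisely the multimodal pattern the trials listed above are testing.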
The many recent multimodal applications would seem to be a good start for getting speech going on mobile devices.

Alan Schwartz is VP of Business Development at SpeechWorks and heads the Automotive & Mobile Device Unit. He can be reached at alan.schwartz@speechworks.com.