Speech Goes Mobile
What is the value of speech technologies in mobile devices? On its face the answer is obvious: speech is the most natural of all user interfaces. Humans are wired to talk and listen, so unless human genetics changes sometime soon, speech is and always will be the most natural interface between humans and machines. Add to this the fact that, because of their small size, mobile devices generally lack keyboards, and the case for speech technologies (including speech recognition, text-to-speech, and speaker verification) on mobile devices gets even stronger. So why isn't speech on every device? There are a number of reasons.

Let's start with speech recognition. Large-vocabulary, speaker-independent recognition that requires minimal or no training is a recent phenomenon. Building on their success deploying powerful speech applications in large network-based environments, companies are now migrating those applications into the embedded space. The result is speech recognition with large dynamic vocabularies (up to 100,000 words) that lets users buy a mobile device, synchronize it with their Outlook or Notes contacts, and then immediately access those contacts using only their voice. This is a big step forward from the tedious voice enrollment currently required on most speech-capable mobile phones.

On the speech generation side, new text-to-speech (TTS) technologies have improved the quality of synthetic speech to the point where it is now nearly indistinguishable from recorded prompts. This dramatic improvement in naturalness opens up many new options for phone services. With high-quality TTS, mobile devices can offer an intelligible voice that users actually enjoy, making it possible for the device to read emails, play back a list of contacts in an address book, or read out song titles on an MP3 player, all without annoying the user.

The second barrier to embedded speech technologies has been the limited processing power of mainstream devices. This is now changing with the introduction of platforms such as Texas Instruments' OMAP, Intel's XScale, Hitachi's SH-Mobile, and Motorola's MobileGT. Limited memory and CPU speed are no longer a serious obstacle to getting powerful speech recognition and TTS engines onto these platforms. It may take a few years for these platforms to dominate mainstream mobile devices, since the cost of the chips and related memory is still an issue, but the platforms are on the market today and device manufacturers are beginning to incorporate them.

The third hurdle to getting speech onto mobile devices is customer adoption. Although speech may be the most natural of all interfaces, it is not the only interface. For consumers to really feel the need for speech interfaces, speech must relieve a very real point of pain in an existing user interface. One example is recent hands-free legislation that restricts drivers' use of handheld cell phones in cars. The Gartner Group estimates that 49.5 percent of all cell phone calls are made from the car, and with legislation banning the use of phones in the car under review in over 20 states, speech represents a viable solution. Using speech, drivers can access contacts and dial phone numbers with their voice, allowing them to keep their eyes on the road and their hands on the steering wheel, ultimately reducing driver distraction.
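To make the "large dynamic vocabulary" idea concrete, here is a minimal sketch of a contact-dialing grammar in the W3C's SRGS XML format, one plausible way such a vocabulary could be represented; the contact names and the command word are hypothetical, and the actual grammar format varies from engine to engine:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- A sketch of a dialing grammar the device could regenerate
         each time it synchronizes with the user's address book. -->
    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             version="1.0" xml:lang="en-US" root="dial">
      <rule id="dial" scope="public">
        <item repeat="0-1">call</item>   <!-- optional command word -->
        <ruleref uri="#contact"/>
      </rule>
      <rule id="contact">
        <one-of>
          <item>John Smith</item>        <!-- hypothetical contacts -->
          <item>Maria Garcia</item>
          <!-- ...one item per synchronized contact... -->
        </one-of>
      </rule>
    </grammar>

The key point is that such a grammar is data, not code: the device can rebuild it automatically after every synchronization, which is what eliminates the per-name voice enrollment described above.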
Navigation devices, for all their good intentions, suffer from a similar point of pain: users have trouble entering a destination quickly and easily. One can argue that navigation devices would be more widely used in rental cars if the driver could easily input the destination. Rather than typing a destination on an awkward keypad, the driver could simply say, "New York Marriott Marquis." In fact, the reigning quip about most navigation systems on the market is that users can generally arrive at their destination before they finish entering it on the device. Similar conclusions about the importance of speech can be drawn about accessing the hundreds of songs on an MP3 player or making public kiosks, ATMs, and other devices accessible to the blind. In all of these cases, speech uniquely solves a problem in the user interface that other technologies cannot.

A final point on user adoption is the emergence of a combination of speech and visual display on mobile devices, often referred to as a multimodal user interface. Although multimodal applications are not mainstream today, many platform providers have begun to successfully deploy multimodal solutions on portable devices, and a number of carriers are evaluating the technology. These are important steps in getting users to adopt speech as part of a more complex user interface. Some multimodal applications being evaluated today include the following:

·Motorola has recently completed a successful trial of a multimodal user interface with SpeechWorks on its iDEN phone;
·Kirusa has begun the first GPRS trial of multimodal applications with French mobile operator Bouygues Telecom; and
·Orange Services (Imagineering) is working with Lobby7's multimodal x|mode platform.

The SALT Forum, founded with the goal of creating a standard for multimodal application development, should begin to accelerate the push for multimodal user interfaces. Microsoft has already released the .NET Speech SDK, a SALT-based developer toolkit that integrates into the Microsoft Visual Studio development environment and Microsoft's Web server programming environment, ASP.NET, enabling application developers to incorporate speech functionality into Web applications. Microsoft's ASP.NET for Web development currently has over one million developers.
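To give a flavor of the SALT programming model, here is a minimal sketch of a speech-enabled HTML page; the element names follow the SALT 1.0 specification, while the grammar file, field names, and prompt text are hypothetical:

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <body>
      <!-- an ordinary visual field; speech fills it in via <salt:bind> -->
      <form action="/directions" method="get">
        <input id="txtCity" name="city" type="text"/>
        <button type="button"
                onclick="askCity.Start(); recoCity.Start();">Speak</button>
      </form>

      <!-- TTS prompt played when the button is pressed -->
      <salt:prompt id="askCity">Say a destination city.</salt:prompt>

      <!-- recognition against an SRGS grammar; the result is bound
           into the visual field, so the user can speak or type
           interchangeably -->
      <salt:listen id="recoCity" onreco="document.forms[0].submit()">
        <salt:grammar src="cities.grxml"/>   <!-- hypothetical grammar -->
        <salt:bind targetelement="txtCity" value="//city"/>
      </salt:listen>
    </body>
    </html>

Because SALT is a small set of tags layered onto ordinary HTML, the same page serves both the visual and the voice modality, which is precisely the multimodal pattern the trials listed above are testing.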
The many recent multimodal applications would seem to be a good start for getting speech going on mobile devices.

Alan Schwartz is VP of Business Development at SpeechWorks and heads the Automotive & Mobile Device Unit. He can be reached at alan.schwartz@speechworks.com.