Speaking to Web Pages
In an earlier column ("W3C Launches HTML Speech Incubator Group," January/February 2011), I wrote about a new W3C Incubator Group that was starting to work on ideas for integrating speech into Web browsers. A lot has been accomplished since then.
As I write, the HTML Speech Incubator Group has nearly finished its work and is completing a final report that should be available by the time this column is published. With active participation from such major players in the speech and browser spaces as AT&T, Google, Microsoft, Mozilla, Nuance, Openstream, and Voxeo, the ideas being developed by the Incubator Group have a good chance of broad industry acceptance.
The work has two main goals: (1) enabling Web developers without deep speech expertise to use speech in Web browsers for simple or exploratory applications, and (2) providing enough capabilities to support full-scale, professional, enterprise-quality applications. Standards are important, because while proprietary approaches to speech-enabled Web pages are available now, speech applications aren't interoperable across browsers. As we know all too well from dealing with differences among graphical Web browsers, it's very frustrating to go to a Web page and see a message that says "This application doesn't work with browser X." A standard will ensure that this doesn't happen with the speech interface.
Here are a few ideas for how a standard might be used.
Multimodal customer support: Instead of calling a customer support phone number and interacting with a voice-only IVR system, customers could go to the company's Web site on their smartphones, ask for support, and be directed to a multimodal support page. A natural language multimodal support system could let users pose their question in their own words, see the system's top three or four guesses about how the question should be classified, and then select the correct classification by voice or by touch. In addition, a multimodal support system could display images, videos, and diagrams to help address the user's problem.
Natural language interaction with Web pages: Today's interactive Web pages rely heavily on form input. There might be five or six slots, or more, that the user has to navigate to and fill in one by one. In contrast, a user of a speech-enabled Web page could just say, for example, "I'm arriving on January 29 and I need a room for three nights for two adults," and the application could fill all the slots on the page from that single utterance. Multislot filling of this kind was attempted many times in the early days of voice-only applications, but it never became popular there: users had no way of knowing what the slots were, and correcting errors was extremely difficult. A multimodal approach addresses both problems: users can see the available slots, so they know what to say, and they can see the results of the slot filling, so they know whether the system has made an error.
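To make the idea concrete, here is a minimal sketch of how a page script might copy a recognizer's interpretation of that utterance into visible form fields. The interpretation object, slot names, and field ids are invented for illustration; they are not taken from the Incubator Group's proposal.

    // Hypothetical semantic interpretation returned by a recognizer for
    // "I'm arriving on January 29 and I need a room for three nights for two adults."
    var interpretation = { checkin: "January 29", nights: "3", adults: "2" };

    // Copy each recognized slot value into the form field with the matching id,
    // so the user can see, and correct, what the system understood.
    function fillSlots(interp) {
      for (var slot in interp) {
        var field = document.getElementById(slot);
        if (field) {
          field.value = interp[slot];
        }
      }
    }

    fillSlots(interpretation);

Because the filled fields stay visible and editable, a misrecognized value can be corrected by touch or typing rather than by repeating the whole utterance.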
Hands-free text entry on Web pages: Many Web pages have text areas that allow for free text input. A speech application programming interface (API) could be used for dictating open-ended input like product reviews, email messages, blog posts, and comments. In contrast, dictation in a voice-only application is extremely difficult: No one has figured out how to design a good user interface for correcting dictation errors that the user can't see.
How will it work? The Incubator Group's report proposes access to speech services both programmatically, through JavaScript, and declaratively, through HTML markup. The HTML markup might be used by less expert Web developers who want simpler functionality, while the programmatic interface could support more advanced applications.
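As a rough illustration of the two styles only, the contrast might look something like this. The attribute, constructor, and property names below are placeholders, not the names actually defined in the report.

    <!-- Declarative style: a hypothetical attribute that asks the browser
         to offer speech input for this field. -->
    <input type="text" id="question" speech>

    <!-- Programmatic style: a hypothetical JavaScript recognition object. -->
    <script>
      var reco = new SpeechInputRequest();
      reco.onresult = function (event) {
        // Put the top recognition hypothesis into the field.
        document.getElementById("question").value = event.transcript;
      };
      reco.start();
    </script>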
In addition, developers will be able to specify the speech recognition and text-to-speech (TTS) services of their choice or leave it to the browser to use its default speech services. Quick access to simple results will be available, but developers will also have access to a full EMMA (Extensible MultiModal Annotation) result for more advanced applications. So a new developer might use the HTML markup, the default speech services, and simple results, while a more expert developer could take advantage of more powerful features, such as the JavaScript API, developer-specified speech services, and the full EMMA result, to build advanced applications.
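Inside a result handler like the one sketched above, the two levels of access might look roughly like this. The event property names are illustrative assumptions; only the EMMA namespace and its interpretation and confidence annotations come from the W3C EMMA standard.

    // Simple result: a new developer just reads the best transcript string.
    var transcript = event.transcript;   // e.g., "a room for two adults"

    // Full result: a more advanced application can examine the EMMA document,
    // which can carry confidence scores, alternate hypotheses, timestamps,
    // and a semantic interpretation along with the recognized words.
    var EMMA_NS = "http://www.w3.org/2003/04/emma";
    var emmaDoc = event.emmaXML;         // an XML DOM document
    var interpretations = emmaDoc.getElementsByTagNameNS(EMMA_NS, "interpretation");
    var topConfidence = interpretations[0].getAttributeNS(EMMA_NS, "confidence");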
The final report will include detailed proposals for accessing both speech recognition and text-to-speech capabilities from within HTML. The report won't itself be an official standard, but it will serve as input to future standards-track activity, and I believe it will be a very solid basis for a future standard for speech-enabled Web pages.
More information about the HTML Speech Incubator Group is available at www.w3.org/2005/Incubator/htmlspeech.
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium's Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.