W3C Launches HTML Speech Incubator Group
To explore ideas around integrating speech with HTML pages, the World Wide Web Consortium (W3C) recently chartered an incubator group on HTML and speech. W3C incubator groups explore initial ideas, develop use cases, and prepare requirements for work that could later be incorporated into future W3C standards.
The HTML Speech Incubator Group, with founding members AT&T, Google, Microsoft, Mozilla, Openstream, and Voxeo, was initiated to develop ideas that will enable developers to easily integrate basic speech capabilities into HTML5 applications. The ultimate goal of this work is to supply tools for Web developers to provide a high-quality speech and multimodal experience without excessive complexity.
Achieving this goal requires a feature set rich enough to support real applications without being too difficult to use. Among the topics the group will explore are how speech recognition and text-to-speech capabilities can be made available in HTML pages, how privacy concerns will be addressed, how authors can control speech recognition resources such as engine selection, how different types of language models can be supported, and how to provide access to detailed speech recognition results, such as N-best lists, confidence scores, and semantic interpretations. Because the focus is on basic speech services, the group will not be looking at nonspeech input modalities, such as multitouch or handwriting, nor at spoken dialogues or more advanced speech capabilities, such as speaker identification and verification.
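To make these terms concrete, the sketch below shows, in TypeScript, one way an N-best recognition result with confidence scores and semantic interpretations might be represented. The interface and field names are purely illustrative assumptions; no HTML speech API had been standardized at the time of this writing.

// Hypothetical shape of a detailed speech recognition result of the kind
// the incubator group is discussing; names here are illustrative only and
// are not part of any published standard.
interface RecognitionAlternative {
  transcript: string;       // recognized text for this hypothesis
  confidence: number;       // engine confidence, assumed here to be in the range 0..1
  interpretation?: unknown; // semantic interpretation (e.g., as produced via SISR)
}

interface RecognitionResult {
  alternatives: RecognitionAlternative[]; // N-best list, best hypothesis first
}

// Example: accept the top hypothesis only if the engine is reasonably confident.
function bestTranscript(result: RecognitionResult, threshold = 0.5): string | null {
  const top = result.alternatives[0];
  return top && top.confidence >= threshold ? top.transcript : null;
}

// Usage with a made-up N-best result for the utterance "flights to Boston".
const example: RecognitionResult = {
  alternatives: [
    { transcript: "flights to Boston", confidence: 0.82, interpretation: { city: "Boston" } },
    { transcript: "flights to Austin", confidence: 0.11, interpretation: { city: "Austin" } },
  ],
};

console.log(bestTranscript(example)); // "flights to Boston"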
Other ongoing efforts with similar but complementary goals are also under way. One is the Multimodal Architecture under development by the W3C Multimodal Interaction Working Group. This standard is targeted at more general multimodal applications, including those that span multiple devices or that use other input modalities in addition to HTML and speech. Standards developed by the W3C Voice Browser Working Group, such as the Speech Recognition Grammar Specification (SRGS), Semantic Interpretation for Speech Recognition (SISR), and the Speech Synthesis Markup Language (SSML), also complement the HTML speech effort.
Of course, a standard for integrating speech and HTML will provide only the mechanics of using speech with Web pages. The ability to integrate speech with graphics doesn't by itself make applications useful or even usable, just as voice-only or graphical applications aren't automatically useful or usable unless they are well designed. As voice applications have become widespread with the adoption of VoiceXML, we've seen the rise of a new discipline of voice user interface (VUI) designers who have a thorough understanding of the principles of VUI design. As the tools for integrating speech and HTML become easier to use and support more types of applications, we will likely see, just as with VUI design, a new discipline of multimodal design with practitioners who understand and can apply its principles.
Two earlier proposals had goals similar to those of the HTML speech effort: Speech Application Language Tags (SALT), proposed by the SALT Forum, and XHTML + Voice (X+V), proposed by a team from Opera, IBM, and Motorola. Both were developed in the early 2000s, but neither was widely adopted. The primary reason was that, at the time, the infrastructure for mobile applications, where the integration of voice and GUI is most compelling, was much less sophisticated than it is today. Current networks and smartphones far surpass those available when these proposals were developed nearly 10 years ago. Not only is the mobile infrastructure more capable, but speech recognition itself is much faster and more accurate. Together, these factors make the integration of speech and HTML an intriguing direction for future Web applications.
More information about the HTML Speech Incubator Group is available at http://www.w3.org/2005/Incubator/htmlspeech. This group works in public, so anyone who is interested can follow the discussions on the group's public mailing list. The group is chartered to work until August 31.
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.