Standards Need a New Pair of Eyes
Standards greatly simplify building complex systems like speech applications, but what happens when the underlying technologies change, as they inevitably will? Much of today’s speech industry was built on standards that originated in the late 1990s. But speech technologies have changed a lot since then.
The World Wide Web Consortium’s (W3C) Voice Browser Working Group started its work on voice standards for the Web in 1999, following a workshop at MIT the year before. At the time, speech recognition applications were just beginning to appear in call centers, connectivity (Web and cell) was slow and spotty, and mobile phones had limited capabilities. The idea of creating voice standards for the Web was itself new.
The W3C eventually published a variety of voice standards that have made possible a whole industry built on speech recognition in the call center. Since then, the technology has continued to move forward. Mobile networks are significantly faster and offer far greater bandwidth. Mobile devices are smaller, more capable, and have larger screens that support rich visual information. The use of voice-only landlines continues to shrink. Finally, speech recognition itself has made huge strides in speed and accuracy. All of these factors combine to make this a good time to revisit some early voice standards and see how they can be updated to meet the needs of today’s applications.
New Ideas
The W3C periodically holds public workshops to explore new use cases and requirements that could impact future standards. To that end, the W3C in mid-June held a workshop on “Conversational Applications—Use Cases and Requirements for New Models of Human Language to Support Mobile Conversational Systems.” The workshop was held in Somerset, N.J., and was hosted by Openstream. Participants brought ideas for applications that would be impractical, difficult, or impossible to do with the current speech and language-related standards.
Among the interesting ideas discussed was extending the capabilities of speech grammars. Statistical language models (SLMs), used in statistical natural language applications, were in their early stages and not widely used when the W3C Speech Recognition Grammar Specification (SRGS) was originally developed. Since then, they have become much more common, but there is no standard for representing SLMs comparable to SRGS for representing grammars.
Two interesting ideas for SLMs were proposed. One suggestion was to look into standardizing SLM formats so the same SLM could be used with different vendors’ platforms. The other was to develop standards that would allow grammars and SLMs to be used together. For example, sometimes only part of an utterance can be constrained by a grammar. Consider a personal assistant application where the user says, “I’d like to make an appointment to take the dog in for shots on October 25.” A grammar could be used to recognize “October 25” and “I’d like to make an appointment,” but because you can make an appointment for many reasons, recognition of the rest of the utterance has to be open-ended. An SLM would be a good tool for recognizing that portion, but using SLMs today requires a proprietary approach because no standards support them.
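To make the division of labor concrete, here is a minimal Python sketch of the idea, assuming a hypothetical application in which simple patterns stand in for the grammar-constrained slots (the appointment request and the date) and the leftover span is what would be handed to an SLM-based recognizer. The patterns and function names are illustrative only, not part of any standard or vendor API.

```python
import re

# Hypothetical sketch: patterns stand in for the grammar-constrained slots;
# the remaining span is what an SLM-based recognizer would handle.
REQUEST = re.compile(r"^i'?d like to make an appointment\b", re.IGNORECASE)
DATE = re.compile(
    r"\bon\s+((?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2})\b",
    re.IGNORECASE,
)

def parse_utterance(utterance: str) -> dict:
    """Split an utterance into grammar-matched slots plus open-ended text."""
    result = {"intent": None, "date": None, "open_ended": None}
    request = REQUEST.search(utterance)
    date = DATE.search(utterance)
    if request:
        result["intent"] = "make_appointment"
    if date:
        result["date"] = date.group(1)
    if request and date:
        # The span between the two grammar matches is open-ended; in a real
        # system it would be passed to a statistical language model.
        result["open_ended"] = utterance[request.end():date.start()].strip(" ,.")
    return result

print(parse_utterance(
    "I'd like to make an appointment to take the dog in for shots on October 25"))
# {'intent': 'make_appointment', 'date': 'October 25',
#  'open_ended': 'to take the dog in for shots'}
```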
Participants also discussed on-the-fly activation of specific grammar rules, meaning that only part of a larger grammar would be active during recognition, depending on context. For example, a vendor might have developed a very large, comprehensive city grammar for a travel application. If the user has already chosen an airline, recognition of the departure and destination cities will be much more accurate if the active grammar contains only the cities served by that airline.
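A rough sketch of this kind of context-dependent activation, again in Python with made-up airline and city data: the full grammar stays available, but only the subset licensed by the current dialogue context is consulted on the next recognition turn.

```python
from typing import Optional, Set

# Hypothetical data: which cities each airline serves. In a real system this
# context would narrow the active rules of a much larger city grammar.
ALL_CITIES: Set[str] = {"Boston", "Chicago", "Denver",
                        "Philadelphia", "Portland", "Seattle"}
CITIES_SERVED = {
    "Example Air": {"Boston", "Chicago", "Philadelphia"},
    "Sample Jet": {"Denver", "Portland", "Seattle"},
}

def active_city_grammar(airline: Optional[str]) -> Set[str]:
    """Cities active for the next recognition turn, given the context."""
    if airline is None:
        return ALL_CITIES                        # no context yet: whole grammar active
    return CITIES_SERVED.get(airline, ALL_CITIES)

def recognize_city(heard: str, airline: Optional[str]) -> Optional[str]:
    """Toy recognizer: accept the city only if the active subset contains it."""
    return heard if heard in active_city_grammar(airline) else None

print(recognize_city("Denver", "Example Air"))   # None: not served, so not active
print(recognize_city("Chicago", "Example Air"))  # Chicago
```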
Another discussion topic was representing richer semantic information than the common key-value pair format, such as “city-Philadelphia” and “drink size-medium,” can express. A standard that makes it easier to express more complex meanings, such as “I want all the toppings on my pizza except onions,” which does not map directly onto key-value pairs, would be more useful.
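As a rough illustration (the structure below is invented for this column, not a proposed standard), a nested representation can carry the exception that a flat key-value list loses:

```python
import json

# Flat key-value pairs lose the exception entirely:
flat = {"item": "pizza", "toppings": "all"}   # "except onions" has nowhere to go

# A richer, structured representation (illustrative schema only):
structured = {
    "intent": "order",
    "item": "pizza",
    "toppings": {
        "include": "all",
        "exclude": ["onions"],
    },
}

print(json.dumps(structured, indent=2))
```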
Many other thought-provoking ideas were presented at the workshop. A detailed public summary and minutes are available at www.w3.org/2010/02/convapps/summary.html.
The next step is for the Voice Browser and Multimodal Interaction working groups to review the use cases from the workshop and decide whether and how to address them in future versions of the specifications. Comments are welcome and can be sent to the public mailing lists of the Multimodal Interaction Working Group (www-multimodal@w3.org) and the Voice Browser Working Group (www-voice@w3.org).
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.