July 8, 2004
By Deborah Dahl, Principal, Conversational Technologies
Features

Technical Standards Facilitate Innovation

Rarely does a technical standard directly benefit end users. In the world of speech technologies, however, standards do. They facilitate innovation and reduce the total cost of ownership of speech applications, although they have been slow to reach the market. Standards allow programmers to create platform-independent (and ideally vendor-independent) speech applications. Before standards emerged and gained acceptance, developers had to use each speech technology provider’s proprietary development environment to create a new speech application. Each proprietary environment had only a limited number of application specialists, so developers with the right expertise and experience were expensive and hard to find. Speech vendors were also restricted to specific platforms, which limited the market’s ability to create open packaged speech applications that could run on any platform. These technical constraints contributed to the high cost of entry and limited investments in speech applications to enterprises with deep pockets. Many companies that wanted to invest in speech to provide a friendlier and more satisfying service experience for their customers could not afford to do so.

The New Standards

On March 16, 2004, the World Wide Web Consortium (W3C) released two speech “recommendations” as part of its overall Speech Interface Framework. The W3C is the accepted standards body for the Web. It was founded in 1994 to develop common protocols that ensure interoperability, and today it has more than 400 members. A recommendation, in the vocabulary of the W3C, is a fully tested and accepted standard that is ready for market adoption.

The first new recommendation is the Voice Extensible Markup Language Version 2.0. “VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony and mixed initiative conversations,” according to the W3C Recommendation for VoiceXML 2.0.

VoiceXML’s primary purpose is to bring the advantages of Web-based development and content delivery to interactive voice response applications. According to the W3C Recommendation on VoiceXML 2.0, the “main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management.”
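To make the idea of Web-style voice development concrete, here is a minimal sketch of what a VoiceXML 2.0 page can look like, wrapped in a short Python script that checks the markup is well formed and lists the prompt for each field. The form id, the field name, the grammar file city.grxml and the example.com URL are all hypothetical, chosen only for illustration. The helper function field_prompts is likewise not part of any standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the VoiceXML 2.0 Recommendation.
VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical one-field dialog: ask for a city, then submit it to a Web server.
# The form id, field name, grammar file and URL are illustrative assumptions.
VXML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <field name="city">
      <prompt>Which city would you like the weather for?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <noinput>Sorry, I did not hear you. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
    </field>
    <block>
      <submit next="http://example.com/weather" namelist="city"/>
    </block>
  </form>
</vxml>"""


def field_prompts(vxml_text):
    """Parse a VoiceXML document and return {field name: prompt text}."""
    # Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
    root = ET.fromstring(vxml_text.encode("utf-8"))
    ns = {"v": VXML_NS}
    prompts = {}
    for field in root.findall(".//v:field", ns):
        prompt = field.find("v:prompt", ns)
        text = prompt.text if prompt is not None and prompt.text else ""
        prompts[field.get("name")] = text.strip()
    return prompts
```

Calling field_prompts(VXML_DOC) returns a dictionary with one entry, mapping "city" to its prompt. The page itself contains no telephony or audio-resource code; the voice browser that interprets it handles those details, which is the point the Recommendation makes.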

The VoiceXML Forum, a group of vendors including AT&T, IBM, Lucent and Motorola, released the first version of VoiceXML in 2000. After releasing the standard, the Forum submitted it to the W3C, which has managed VoiceXML going forward. While the W3C is slow to release new recommendations, there has been a great deal of innovation since 2000 and it’s estimated that there are between 80 and 100 speech vendors currently using the speech standards.

The second new standard is the Speech Recognition Grammar Specification Version 1.0. This recommendation addresses “the syntax for grammar representation. The grammars are intended for use by speech recognizers and other grammar processors so that developers can specify the words and patterns of words to be listened for by a speech recognizer.” This standard describes “words that may be spoken, patterns in which those words may occur [and the] spoken language of each word that are presented to speech recognizers [speech recognition engines],” according to the W3C Recommendation, Speech Recognition Grammar Specification Version 1.0.
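As an illustration of “the words and patterns of words to be listened for,” the sketch below embeds a small grammar written in the XML form of the Speech Recognition Grammar Specification and uses Python to list the phrases its root rule accepts. The rule name and the three city names are hypothetical. The helper accepted_phrases is an assumption of this example and handles only a flat list of alternatives, not the full specification.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Speech Recognition Grammar Specification 1.0.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar: the recognizer listens for exactly one of three cities.
SRGS_DOC = """<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="city" mode="voice">
  <rule id="city" scope="public">
    <one-of>
      <item>Boston</item>
      <item>Chicago</item>
      <item>New York</item>
    </one-of>
  </rule>
</grammar>"""


def accepted_phrases(grammar_text):
    """Return the alternatives listed in the grammar's root rule.

    Simplification: only a single <one-of> list of plain-text items is
    handled; rule references, repeats and weights are ignored.
    """
    root = ET.fromstring(grammar_text.encode("utf-8"))
    ns = {"g": SRGS_NS}
    root_rule_id = root.get("root")
    rule = root.find(f"g:rule[@id='{root_rule_id}']", ns)
    return [item.text.strip() for item in rule.findall("g:one-of/g:item", ns)]
```

Here accepted_phrases(SRGS_DOC) returns the three city names in document order. A VoiceXML field would typically point at a grammar like this one through its grammar element.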

The W3C is also addressing a number of other speech standards that will help simplify the development and implementation of speech applications. One is the Call Control Extensible Markup Language (CCXML). “CCXML is designed to provide telephony call control support for VoiceXML or other dialog systems,” according to the W3C working draft Voice Browser Call Control: CCXML Version 1.0. CCXML is generally used in conjunction with VoiceXML, as VoiceXML does not address call control.

SALT and X+V Fill a Hole in the W3C Speech Interface Languages

The Speech Application Language Tags Forum was founded in 2001 and in August 2002 the SALT specification was submitted to the W3C. According to the SALT Forum, “The Speech Application Language Tags 1.0 specification enables multimodal and telephony-enabled access to information, applications and Web services from PCs, telephones, tablet PCs and wireless personal digital assistants. The Speech Application Language Tags extend existing mark-up languages such as HTML, XHTML and XML. Multimodal access will enable users to interact with an application in a variety of ways: they will be able to input data using speech, a keyboard, keypad, mouse and/or stylus and produce data as synthesized speech, audio, plain text, motion video and/or graphics. Each of these modes will be able to be used independently or concurrently.”

Early on there were differences between VoiceXML and SALT, and some programming variations remain today. SALT was designed to handle multimodal devices (PCs, phones, wireless PDAs), while VoiceXML was initially accessible only by phone. However, with the addition of X+V (XHTML+Voice), which has also been submitted to the W3C for consideration, VoiceXML-based applications can address multimodal devices. A second difference concerned royalty charges. SALT has been royalty-free from the start, which was not always the case for VoiceXML. Today, VoiceXML can be obtained for free, just like SALT. The only major remaining difference is the style of programming: VoiceXML provides a Form Interpretation Algorithm for sequencing through the fields of a voice form, while programmers must specify this sequence themselves when programming with SALT. However, even this difference will disappear in Version 3, the follow-on to VoiceXML 2.0 (see story Beyond XML 2.0).
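The built-in sequencing that VoiceXML provides can be pictured with a much-simplified Python sketch. It is not the algorithm as specified by the W3C, which also covers guard conditions, events and mixed-initiative grammars. It shows only the core loop: pick the first unfilled field, play its prompt, collect an answer, repeat. The function run_form, the field names and the simulated caller input are all assumptions made for this example.

```python
_HUNG_UP = object()  # sentinel: the simulated caller has no more input


def run_form(fields, utterances):
    """Walk a voice form the way a VoiceXML interpreter roughly would.

    fields     -- ordered list of (name, prompt) pairs, in document order
    utterances -- simulated caller input; an empty string models silence
    Returns (filled values, list of prompts played).
    """
    filled = {name: None for name, _ in fields}
    prompts_played = []
    caller = iter(utterances)
    while True:
        # Select: the first field in document order that is still unfilled.
        pending = next((f for f in fields if filled[f[0]] is None), None)
        if pending is None:
            break  # every field is filled, so the form is complete
        name, prompt = pending
        # Collect: play the prompt and wait for the caller.
        prompts_played.append(prompt)
        heard = next(caller, _HUNG_UP)
        if heard is _HUNG_UP:
            break  # no more input: stop instead of looping forever
        # Process: fill the field; on silence it stays unfilled and is revisited.
        if heard:
            filled[name] = heard
    return filled, prompts_played


# Hypothetical two-field form and a caller who is silent on the first try.
FIELDS = [("city", "Which city?"), ("date", "Which day?")]
RESULT, PROMPTS = run_form(FIELDS, ["", "Boston", "Friday"])
```

With this input the loop asks for the city twice, because the first attempt was silent, and then moves on to the date without the author having written any ordering logic. With SALT, by contrast, the page author scripts that order explicitly.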

Final Thoughts

VoiceXML and its related standards are mature enough to be used by most organizations for everything from basic applications, such as directory assistance, to advanced “natural language-like” applications used for customer service and sales. During the past 18 months, these standards have facilitated the delivery of relatively sophisticated packaged applications, with many more on the way. The standards are bringing down the startup costs and TCO of speech applications, enabling companies large and small to invest in and benefit from these technologies.

Resources

W3C Recommendation, Voice Extensible Markup Language (VoiceXML) Version 2.0, March 16, 2004.

W3C Recommendation, Speech Recognition Grammar Specification Version 1.0, March 16, 2004.

W3C Working draft, Voice Browser Call Control: CCXML Version 1.0, June 12, 2003.

Donna Fluss is the principal of DMG Consulting LLC, delivering customer-focused business strategy, operations and technology for Global 2000 and emerging companies. Ms. Fluss is the author of the industry-leading 2004 Quality Management/Liability Recording Product and Market Report and the 2004 Guide to Successful Contact Center Offshore Outsourcing. Contact her at donna.fluss@dmgconsult.com.
