Developing with VoiceXML: Language and Platforms
THE LANGUAGE
As its name suggests, VoiceXML is a variant of the Extensible Markup Language. Like XML, VoiceXML is basically a set of tags that, when embedded in a text document, describes the contents of the document. A program called an interpreter reads a VoiceXML document, deciphers the content based on the tags, and presents the information to the end user in the manner specified. In XML, tags may indicate which part of the document contains the name or price of products available for sale. In VoiceXML, tags indicate which part of a document contains a prompt and which defines the grammar to be used in recognizing a spoken word.
A simple VoiceXML script is shown in Figure 1. When interpreted by a VoiceXML gateway, this script would be accessible to a caller using a designated telephone number. The gateway system would answer the incoming call and fetch the script from a pre-arranged location on a Web server. Any server on the Web can serve a VoiceXML script like this one.
Upon answering, the system speaks the first prompt, the words "Please say hello." The words are generated using a text-to-speech engine integrated into the VoiceXML gateway. Alternatively, a recorded human voice could have given this greeting if the script had specified the name of an audio file.
The system next executes the "field" tag, which is an instruction to pause and listen for spoken input from the user. When a response is heard, the speech recognition engine attempts to determine whether the spoken utterance is one found in the field grammar. A grammar is a specification of the different ways in which a user might respond to a prompt. Grammars typically use a syntax that allows a fairly compact grammar to specify a wide range of responses. For example, the script in Figure 1 will recognize either "hello" or "hi," optionally followed by "there" and/or "world." This one grammar specifies eight phrases: two greetings, each with or without "there" and with or without "world."
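The count of eight phrases can be checked by enumerating the grammar. The sketch below is only an illustration in Python, not part of any VoiceXML platform; it assumes the optional words, when both are present, occur in the order "there" then "world," and the names `GREETINGS`, `OPTIONAL_WORDS` and `expand_grammar` are made up for this example.

```python
from itertools import product

# Assumed reading of the grammar: (hello | hi) [there] [world]
GREETINGS = ["hello", "hi"]
OPTIONAL_WORDS = ["there", "world"]  # each may be present or absent, in this order


def expand_grammar(greetings, optional_words):
    """Return every phrase the grammar accepts, one string per phrase."""
    phrases = []
    for greeting in greetings:
        # Each optional word is independently kept (True) or dropped (False).
        for flags in product([False, True], repeat=len(optional_words)):
            words = [greeting]
            words += [w for w, keep in zip(optional_words, flags) if keep]
            phrases.append(" ".join(words))
    return phrases


PHRASES = expand_grammar(GREETINGS, OPTIONAL_WORDS)

if __name__ == "__main__":
    for phrase in PHRASES:
        print(phrase)
```

Two greetings times two choices for each of the two optional words gives 2 x 2 x 2 = 8 phrases, which is why a one-line grammar can stand in for a longer list of responses.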
If the system hears any utterance consistent with the grammar (like "hello world"), it responds by speaking a prompt with the recognized words embedded within it. It might respond, "Well, hello world, yourself." If the utterance is not recognized, this script instructs the gateway to respond with the phrase inside the "nomatch" tag.
VoiceXML contains provisions for conditional transfer of control to other applications, event generation and capture, and subroutines. As such, it is essentially a scripting language. Like HTML, it also supports embedded JavaScript, which substantially enhances the power of the language. Information can be exchanged with Web sites (and their underlying databases) through the "submit" tag, which uses the conventional HTTP "get" and "post" methods. Large and complex applications can be built from existing building blocks and standard grammars. While not nearly as versatile as a conventional programming language, VoiceXML's simplicity lends itself to the fast development of voice sites in much the same way as HTML enables Web sites.
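To make the script walk-through above concrete, here is a rough sketch of what such a script could look like, embedded in a few lines of Python that read its tags. It is a reconstruction from the description, not the article's Figure 1: the field name `greeting`, the JSGF-style inline grammar, and the wording of the "nomatch" reply are all assumptions, and the Python helpers are a toy stand-in for an interpreter, with no telephony, speech recognition or text-to-speech.

```python
import xml.etree.ElementTree as ET

# Hypothetical script reconstructed from the description in the text.
# The field name, grammar notation and nomatch wording are assumptions.
VXML_SCRIPT = """<vxml version="1.0">
  <form>
    <field name="greeting">
      <prompt>Please say hello.</prompt>
      <grammar type="application/x-jsgf">(hello | hi) [there] [world]</grammar>
      <nomatch>Sorry, I did not understand that.</nomatch>
      <filled>
        <prompt>Well, <value expr="greeting"/>, yourself.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""

ROOT = ET.fromstring(VXML_SCRIPT)


def first_prompt(root):
    """Return the text the gateway would speak when the call is answered."""
    return root.find("./form/field/prompt").text.strip()


def render_reply(root, utterance):
    """Build the reply by substituting the utterance for the <value> tag."""
    prompt = root.find("./form/field/filled/prompt")
    parts = [prompt.text or ""]
    for child in prompt:
        if child.tag == "value":
            parts.append(utterance)
        parts.append(child.tail or "")
    return "".join(parts).strip()


if __name__ == "__main__":
    print(first_prompt(ROOT))
    print(render_reply(ROOT, "hello world"))
```

A real gateway would speak the prompt, run the recognizer against the grammar, and fall back to the "nomatch" text on failure. The sketch only shows how the tags separate the prompt, the grammar and the two possible replies.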
APPLICATION PLATFORMS
VoiceXML technology was developed to allow relatively unskilled individuals to quickly build speech recognition enabled applications. Great emphasis was placed on making the VoiceXML language easy to learn and use. However, the simplicity in building applications comes at a price when executing them. The design and deployment of VoiceXML gateway systems is complex and - at least so far - expensive.
The VoiceXML gateway sits between the caller and the content provider. As mentioned earlier, any Web server can be the source of VoiceXML scripts. However, the job of fetching and interpreting the scripts, managing telephony services, operating speech recognition engines, generating text-to-speech output, caching audio and text, and functioning as an HTTP client falls to the VoiceXML gateway platform.
As this article is written, commercial-grade VoiceXML gateways are being introduced into the market. Some are "turnkey" systems consisting of integrated hardware and software. Others are primarily software and are intended to run on hardware platforms built and operated by the buyer.
What about cost? Only a handful of VoiceXML gateways have been purchased at this time, so they are almost custom-built. Nevertheless, total costs for hardware and software together seem to be settling around $8,000 per port for small (4-port) systems and $2,000-$3,000 per port for larger systems (48, 96 or more ports).
Cost is not the only barrier to entry. In fact, some aspects of operating a VoiceXML gateway (primarily the telephony services and speech recognition engine) are so complex and require such a high level of skill that only the most technically savvy organizations can reasonably expect to field them in the near future. Even developing a VoiceXML application is not a trivial task. In addition to VoiceXML coding, a good developer needs to understand how speech recognition engines work, how to design a good speech UI, and how to write speech recognition grammars.
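The per-port figures quoted above translate into total system costs with simple multiplication. The short Python sketch below works through that arithmetic using only the numbers given in the text; the function names are invented for this example, and the totals are rough estimates, not vendor quotes.

```python
# Per-port figures taken from the text (US dollars, hardware plus software).
PER_PORT_SMALL = 8_000            # small (4-port) systems
PER_PORT_LARGE = (2_000, 3_000)   # larger systems (48, 96 or more ports)


def small_system_cost(ports=4):
    """Estimated total for a small system at the small-system per-port rate."""
    return ports * PER_PORT_SMALL


def large_system_cost_range(ports):
    """Estimated (low, high) total for a larger system."""
    low, high = PER_PORT_LARGE
    return ports * low, ports * high


if __name__ == "__main__":
    print("4 ports:", small_system_cost())
    for ports in (48, 96):
        print(ports, "ports:", large_system_cost_range(ports))
```

On these figures a 4-port system comes to about $32,000, a 48-port system to $96,000-$144,000, and a 96-port system to $192,000-$288,000.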
HOSTING SERVICES AND VOICE PORTALS
So if operating a VoiceXML gateway or building an application is a technically difficult and expensive undertaking, are there any alternatives? Yes. Companies wanting to operate speech-recognition enabled call centers or other information services can leave the technical details to host services, sometimes called voice service providers (VSPs). VSPs deploy and maintain the necessary hardware and software, usually at the same data centers used to house large Web sites. Customers pay for this service on a per-minute or per-transaction basis, or by leasing port capacity in blocks. Calls come in to a number specified by the customer and are answered with a customer-specific script. The fact that the application is hosted by a VSP is invisible to the caller. Major VSPs include SpeechHost and Voci.
Branded voice portals (VPs) such as TellMe or BeVocal also host applications for third parties. Voice portals differ from VSPs in that all calls come in to a single number and are answered with the brand name of the voice portal. Callers have access to different types of content (typically weather, news, etc.) and can navigate their way to hosted applications. VPs may charge in a manner similar to VSPs or may insert their advertising into hosted content.
The capabilities of individual VSPs and portals vary widely. Some offer a variety of services, including VoiceXML and conventional speech recognition apps, IVR hosting, content, speaker identification, WAP, and professional services to build, maintain and integrate a speech application. Others have more limited offerings, depend on technology from a particular vendor, or host only applications they develop. Beware of VSPs or VPs that claim to offer voice access to an existing Web site without substantial revisions. Existing Web sites are rarely constructed in a way that permits an effective voice interface.
VOICEXML AND THE OPEN SOURCE MOVEMENT
So what does the future hold for VoiceXML technology? VoiceXML may provide the magic key that finally brings speech recognition to the attention of the general public. Or the complexity and expense of operating gateways, combined with failure to achieve a standard (see sidebar), may doom this technology to being the speech equivalent of 8-track tapes.
One promising development is movement toward an open source model. A major step in this direction was taken recently when Speechworks, a major ASR vendor, and Carnegie-Mellon University, a leader in ASR research, announced the Open Source Speech Initiative. This project will include the release of an open source VoiceXML interpreter, a major component of a gateway system. The opportunity for the public at large to experiment with VoiceXML technology can only promote its acceptance. If other open source efforts (such as Apache and Linux) are an indication, the result will be better, cheaper and more standardized software than before.
Steve Ihnen is the vice president of applications development for SpeechHost Inc.