Submitted for Your Approval
At SpeechTEK 2008, held in New York’s Marriott Marquis, several speech technology vendors submitted some of their latest products for interactive, hands-on evaluation as part of a new group of sessions called the SpeechTEK Lab. Attendees and judges tested products in three categories: speech tools, multimodal applications, and voice-activated videogames.
Moshe Yudkowsky, president of Disaggregate Consulting, led the tools evaluation session; Deborah Dahl, principal at Conversational Technologies, spearheaded the evaluations of multimodal applications; and David Thomson, chief technology officer at SpeechPhone, led the evaluations of some of the latest speech-enabled videogames.
Development/Testing Tools
By Moshe Yudkowsky
For those looking to write programs in C or Java, or to manage the content of a Web site, an astonishing range of tools exists. But for speech applications, we still have a limited assortment of tools. Many developers simply do without because they can’t find the right tool or aren’t aware of what’s available.
At SpeechTEK 2008, 10 companies gave hands-on demonstrations of their development tools. Shawn van Early from New York University’s Interactive Telephony Project and I examined each tool based on productivity gain, audience suitability, documentation, and output. Here are our reviews:
VoiceXML/CCXML Development Tools:
Avaya Dialog Designer layers on top of the Eclipse application development platform, a highly developed, extremely popular open-source tool with many features. It allows developers to use its graphical user interface (GUI) to define call flow, interactions with the caller, and grammars. Its output is a set of Java servlets that the developer can tweak; running under a suitable platform, the servlets send the appropriate VoiceXML 2.1 and CCXML scripts for use by any standards-compliant platform. Documentation is available online, within the tool, and as Javadoc.
Cisco Unified Call Studio 7.0 provides a graphical user interface that allows developers to describe the call flow, prompts, grammars, and more. Based on the Eclipse platform, it generates Java servlets that send VoiceXML 2.0 and 2.1 scripts to standards-compliant platforms. Users can assign roles to various individuals who can then collaborate to build the application. Programmers don’t seem to have as much power as they ought to, and they can’t tweak the Java servlets. Wizards provide a useful way to define the behavior of the application. Help files are good, but they are not integrated into the tool.
Envox VoiceXML Studio 7.0 is a GUI-based tool; the output is JavaScript-enabled pages that produce VoiceXML. It includes many predefined building blocks, such as modules that access databases, email, and Web services, as well as graphical error checking and a deployment simulator. Although the tool is designed to integrate with Envox’s other offerings, its VoiceXML output is compatible with platforms from other vendors as well.
Loquendo TTS Director marks up text for Loquendo’s TTS system. This multiplatform, specialized text editor includes syntax highlighting, wizards, and drop-down menus; the developer works directly with marked-up text—no visualization. A nice feature allows developers to highlight a section of text and hear the generated speech. The tool has good help facilities. The output is text files specific to Loquendo TTS.
VoiceObjects is an Eclipse-based GUI tool that lets developers create applications using two dozen built-in building blocks (menu, capture, etc.), as well as building blocks they can define on their own. VUI designers can use an ordinary spreadsheet to define call flows, which can then be imported and implemented by developers. The output is Java servlets that provide VoiceXML 2.1 scripts compatible with a wide range of platforms. In addition to good online documentation, it lets developers create documentation of their own projects.
VoiceXML Orchestrator from Ajax Weaver is a browser-based tool that creates Java servlets to produce VoiceXML 2.1 scripts compatible with a number of VoiceXML platforms. The GUI did not appeal to us; its visualization was far less polished than that of the other tools. In addition, it generated “flat” VoiceXML files without support for subprojects. The only documentation is through a help menu.
Voxeo Designer is a browser-based tool with wizards to solicit input. The intended audience is less-experienced developers. It generates XML, which a Java servlet uses to generate VoiceXML. The VoiceXML can be used with non-Voxeo platforms. One very nice feature is the analytics engine, which uses a database tied to the runtime to produce very detailed usage reports. We could not find context-sensitive help, and could not access the generated code from within the tool.
In summary, we liked the tools from Avaya, Cisco, and Envox for their good graphical interfaces. Avaya earns a special mention for its CCXML output, and Envox for its database and email integration and simulator. Loquendo’s tool helps developers work with its rich set of TTS markup. VoiceObjects also has a good developer interface, and we appreciated its integrated documentation tool and the built-in detailed reports. Voxeo Designer also includes extensive reports and offers simplicity to less-experienced developers. Ajax Weaver’s VoiceXML Orchestrator has the fewest features and least sophisticated visual design of the products in this group.
Tools for Testing and Analysis:
Lumenvox Speech Tuner lets developers compare ASR results with human-generated transcripts. The tool highlights a visual representation of utterances with the recognized words. Developers can modify the grammars or engine parameters and test them by rerunning the original utterances. The tool works with Lumenvox speech engines.
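Comparing recognizer output against a human transcript, as this kind of tuning tool does, is commonly summarized as a word error rate. The following is a minimal sketch of that metric and not Lumenvox's implementation; the function name and the sample utterances are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four.
human_transcript = "check my account balance"
asr_result = "check the account balance"
print(word_error_rate(human_transcript, asr_result))  # prints 0.25
```

Rerunning the same stored utterances after a grammar or parameter change and recomputing this figure gives a simple before-and-after comparison.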
VocaLabs Usability Survey Reports provides a visual, Web-based interface to results from VocaLabs’ huge pool of human assessors. Users can examine several layers down into the data to see complete details. Output can be downloaded as graphs, raw data, and similar formats, but the tool lacks any documentation.
Voiyager from Syntellect accepts VoiceXML 2.0 and 2.1, automatically generates test cases based on the scripts’ branches and grammars, and then exercises the scripts automatically. It has many features, including the ability to bookmark a place in the call flow and to force ASR confidence levels. It presents test results graphically and in complete detail.
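Generating test cases from a script's branches can be pictured as enumerating every path through the call flow. The sketch below illustrates that idea under simplifying assumptions and is not Syntellect's algorithm: the call flow is a hypothetical, loop-free dictionary of states, and each branch label stands in for a caller utterance that the grammar accepts.

```python
from typing import Dict, List, Tuple

# Hypothetical call flow: state -> {caller utterance: next state}.
# States with no branches end the call.
CALL_FLOW: Dict[str, Dict[str, str]] = {
    "main_menu": {"balance": "ask_account", "agent": "transfer"},
    "ask_account": {"checking": "read_balance", "savings": "read_balance"},
    "read_balance": {},
    "transfer": {},
}

Step = Tuple[str, str]  # (state, utterance spoken in that state)


def enumerate_test_cases(flow: Dict[str, Dict[str, str]], start: str) -> List[List[Step]]:
    """Return every path from start to a terminal state (assumes no loops)."""
    cases: List[List[Step]] = []

    def walk(state: str, steps: List[Step]) -> None:
        branches = flow[state]
        if not branches:
            # Terminal state reached: the steps so far form one test case.
            cases.append(steps)
            return
        for utterance, next_state in branches.items():
            walk(next_state, steps + [(state, utterance)])

    walk(start, [])
    return cases


for case in enumerate_test_cases(CALL_FLOW, "main_menu"):
    print(" -> ".join(f"{state}:{utterance}" for state, utterance in case))
```

A real VoiceXML application can loop back to earlier dialogs, so a practical generator would also need a visit limit or a coverage criterion to keep the set of paths finite.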
In summary, we found all of these tools increased productivity. Voiyager is very clever, complete, and useful. VocaLabs’ tool provides access to its assessment service without a lot of pizzazz. Speech Tuner is very useful but does not allow dynamic adjustment of ASR parameters and grammars to test corrections.
Multimodal Applications
By Deborah Dahl
Six vendors submitted products for review in the multimodal lab. The submissions were very innovative, both in their architectures and in the applications themselves. David Thomson of SpeechPhone, Matt Womer of the World Wide Web Consortium, and I looked more closely at the applications with respect to their innovation, usability, and usefulness. Here are our impressions:
Avaya presented an application for buying flowers, an excellent use of multimodality because buying something as visual as flowers is extremely difficult with a voice-only interface. Another factor contributing to usability was that the voice menu prompts were displayed on the screen, making it easy for the user to see what can be said. The fact that the multimodal tools were integrated into Avaya’s standard application building tools was very exciting because this will enable many people to develop multimodal applications.
Intervoice (now part of Convergys) showed an airline ticket-buying application that demonstrated excellent integration of voice and GUI. In particular, it illustrated how multimodality allows the strengths of one modality to compensate for limitations in another. The lab room was fairly noisy, making speech recognition sometimes problematic, but the application was still usable because the user could take advantage of the visual modality instead of relying entirely on voice. Another notable aspect of this integration was that the application did a good job of keeping the voice and visual modalities in sync. For example, when the user visually selected an airplane seat, the voice modality immediately knew about the seat selection. The panel also noted that the application showed an impressive use of standards, with SCXML, CCXML, and VoiceXML as part of the platform.
IQ Services demonstrated an application that wasn’t end-user-facing, but rather a testing service. The platform efficiently supported testing of both voice and Web applications with a sophisticated display of test results.
Loquendo, working with the University of Trento, Italy, demonstrated an innovative research platform that takes advantage of a number of standards, including EMMA, VoiceXML, Ajax, and XML. Their approach to defining the dialogue flow was very interesting—the dialogue flow is authored in a high-level, goal-oriented XML format that is translated to VoiceXML at runtime. Loquendo/Trento demonstrated several other applications; one very good illustration of the power of multimodality was voice navigation of Google Earth, allowing the user to select cities and zoom the display in and out by voice.
Openstream demonstrated its multimodal browser on several handsets. Remote Web and VoiceXML servers supply HTML content and voice services to a browser integrated with SCXML on a local client. This architecture is very innovative and appealing, a good way to integrate the voice and visual experience, and especially appropriate for applications on mobile devices.
VoiceVerified demonstrated a multimodal application with a different twist. VoiceVerified integrated visual Web pages not with speech recognition but with speaker verification, which was used to control access to a restricted Web site. One advantage to multimodality here is that the prompts can be given visually, which speeds up the interaction and ensures that the speaker being verified is not influenced by the voice coming from the application. Also, speaker verification can complement the passwords, images, and security questions that currently control access to Web sites.
All in all, the panel found the applications and platforms showcased to be exciting, innovative, and creative examples of multimodality in action.
Videogames
By David Thomson
The Games Lab featured two videogames that use speech technologies from Nuance Communications. Delegates were given a chance to try voice activation features in live play and then asked to rate them on accuracy, how well speech technology was integrated into them, whether speech gave them new capabilities, and the level of voice interface sophistication.
Flight One Software’s Cockpit Chatter, a flight simulator for the PC, performed well in all categories, but judges were most impressed with its integration.
The voice technology in Sony’s TalkMan, a language translation tutor running on PlayStation Portable, scored flawlessly for its integration and its ability to add new capabilities.
Accuracy ratings for both games were impressive (approximately 4 on a 1–5 scale), especially considering the high noise level in the room. Accuracy in noisy environments is a relevant metric because high noise levels are likely to be the norm wherever videogames are played.