June 25, 2021
By James A. Larson, program co-chair, SpeechTEK 2021
Speech Technology News

Open Voice Network Hosts Interoperable Conversational Agent Workshop

The Open Voice Network last week hosted an online workshop to explore approaches for implementing interoperability of voice applications within and across platforms. The workshop produced a manifesto that sets the tone and general direction for the future of voice processing and outlines the key principles to keep in mind when designing a standards-based approach to voice applications.

Jon Stine, executive director of the Open Voice Network (OVNet), welcomed a small group of experts to the three-day virtual workshop by saying voice is the pathway to the digital future. This small group will drive the evolution of technology, ecosystems, and regulations of voice systems. He encouraged participants to concentrate on the why, what, and how of voice in the digital future.

Shyamala Prayaga, product owner for the autonomous digital assistant at Ford, and I provided motivation for and a glimpse of two major approaches for interoperable agents, in which users (1) directly interact with multiple agents or (2) indirectly interact with them through a butler-like intermediary agent. OVNet is developing guidelines, standards, and prototypes supporting both approaches.

Panelists Bradley Metrock, CEO of Project Voice, and Susan Bearden, director of digital programs at InnovateEDU, described the future of interoperable conversational agents. Bearden explained that the lack of data interoperability is a non-starter for students in K-12. She was especially concerned about the trade-offs between the goals of privacy and interoperability. Metrock emphasized the need for interoperability and data sharing among conversational agents, citing the necessary integration of health and banking agents as an example. Societal and regulatory pressure will encourage industry giants to support standards that move the industry forward.

Dr. Michael McTear, a professor emeritus of computer science at Ulster University in Northern Ireland, reviewed today's chaotic situation with many incompatible devices, platforms, dialogue styles, terminology, and development tools. Standards promoting consistency are clearly needed.

David Attwater, senior scientist at Enterprise Integration Group, described three overlapping goals: (1) write once with code working on multiple platforms, (2) write less with interworking between applications, and (3) be smart, with interworking of dialogue, context, and identity. To achieve these goals, Attwater suggested that we (a) establish standard protocols such as mediated, blind delegate, sub-delegate, round table, etc.; (b) consider levels of collaboration, such as acoustic, audio, and semantics; (c) establish layers of conversation, such as acoustic features, phonemes, words, and meaning; and (d) establish ways for context and history to be passed among agents supporting unified identity and privacy control. Finally, Attwater wove these ideas into a framework of packets of conversational information.

McTear also presented an overview of the Amazon Voice Interoperability Initiative (VII). He discussed several potential user interface issues with agent discovery, privacy and security, multiple agent management, and multiple devices. These and other problems crop up in the forthcoming initial release.

Bernhard Hockhstatter, executive product manager and tribe lead voice platform at Deutsche Telekom, presented an overview of the first multi-agent system: Deutsche Telekom's Magenta and Amazon's Alexa. He advised product developers to think big and start small, pay attention to what is important, and not underestimate technical complexity. The prioritization of agents is complex. Common commands (called universal commands by Alexa) are needed to manage volume and respond to interrupts. The system currently supports limited multimodality and no agent transfer.

Dr. Leigh Clark, a lecturer in human-computer interaction at the Computational Foundry at Swansea University, spoke about the dimensions of trust in interoperable voice agents. A user's judgment of trust level depends on several variables, including the device being used, the company and brand, and the current activity. Clark cautioned that, because different agents might react differently to a message, its meaning might change as it is passed among a sequence of agents.

Chris Dix, head of architecture-product at the BBC, offered advice on five topics:

Topic	Outcome	Consequence	OVNet Intervention
Vocabulary	Invocation and vocabulary are different across voice agent ecosystems.	User experience and lack of consistency make it hard for users to adopt the same agent across ecosystems.	Develop and influence ecosystem owners to adopt consistent vocabulary.
Privacy and data access	Consumer personal data is not captured, processed, and shared appropriately, leading to inappropriate use and use cases outside of interoperability.	Trust by consumers and content owners will be impacted. Further, disaggregation of content results in unfair arbitration of content.	Prominence, arbitration rules, personal data processing, and providing access for continuous improvement.
Attribution	Agents are not given sufficient attribution.	Agents are not recognizable. Consumers are unclear which agent is fulfilling their requirements.	Design a set of conversational design guidelines that address the appropriate level of attribution.
Multimodal	Content owners are maintaining multiple design patterns across ecosystems.	The overhead is too much, and agents fall back to being compatible with one or two ecosystems, reducing choice for the user.	Work with ecosystem owners, OEMs, and content owners to consider context-driven content design guidelines that ensure portability across ecosystems and devices.
Discoverability and cooperation	Agent registration is minimal.	Discovery of agents is difficult for consumers. The level of agent cooperation is basic.	Work across the industry and define design principles for agent registration, attribution, and cooperation.

Dix suggested that OVNet start with registration, transfer, and attribution. OVNet might consider joining VII.

Stine closed the workshop by saying the cross-fertilization of ideas has been especially useful. Using the metaphor, "tall trees will result from planting small acorns," he foresees great things deriving from the suggestions made in this workshop. We must have the courage to disrupt. We can turn our initial intentions into reality.
