
Putting Teams of GenAI Agents to Work


Generative artificial intelligence (genAI) systems like OpenAI’s ChatGPT, Anthropic’s Claude, and Google Gemini, among many others, have led to remarkable advances in conversational AI, with nearly limitless potential for applications. Even within language applications alone, there are well-known uses like interactive customer support with chatbots and voicebots, and many more in the mix: email processing, mining customer reviews, translation, and resume screening, to name a few.

But no one agent can (or should) do everything. For many years, software developers have realized that ordinary programs are hard to develop and maintain if all their functions are combined into one big piece of code. To address this problem, the field of software engineering has developed principles such as separation of concerns and encapsulation that make software more robust, scalable, and maintainable. These principles apply equally to genAI systems; applications are now starting to make use of multiple agents with distinct expertise, collaborating as a team to solve problems together.

As with teams of human workers with different kinds of expertise, it’s clear that many genAI tasks would benefit from being performed by teams of agents rather than a single all-knowing agent. For example, a travel planning assistant could consist of separate agents with specialized knowledge about air travel, hotels, car rental, visas, and local points of interest. This knowledge could all be contained in a single agent, but then the whole system would have to be updated whenever any part of that knowledge needed to be revised. With genAI systems, updating and retraining can be expensive. Even worse, updating can also lead to problems like catastrophic forgetting—losing track of old information when new information is added. Another consideration is that if any of the agents have to deal with sensitive or private information, it’s better to keep that information confined to a single agent on a need-to-know basis.
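To make the idea concrete, here is a minimal sketch, in Python, of how a travel assistant might delegate a user request to specialized agents. The agent names, keyword-based routing, and class structure are purely illustrative assumptions, not taken from any particular framework; a production system would typically use an LLM to decide which agent should handle a request.

```python
# Hypothetical sketch: a travel assistant that routes requests to specialized agents.
# Agent names, keywords, and routing logic are illustrative, not from any real framework.

class Agent:
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = keywords

    def handles(self, request: str) -> bool:
        # Naive keyword match; a real orchestrator would use an LLM or classifier.
        return any(k in request.lower() for k in self.keywords)

    def respond(self, request: str) -> str:
        # In a real system this would call a model or a back-end service.
        return f"[{self.name}] handling: {request}"


AGENTS = [
    Agent("flights", ["flight", "airline", "airport"]),
    Agent("hotels", ["hotel", "room", "check-in"]),
    Agent("visas", ["visa", "passport"]),  # sensitive data stays with this agent only
]


def orchestrate(request: str) -> str:
    """Send the request to the first specialized agent that claims it."""
    for agent in AGENTS:
        if agent.handles(request):
            return agent.respond(request)
    return "[general] no specialist matched; answering directly"


if __name__ == "__main__":
    print(orchestrate("I need a flight to Lisbon in May"))
    print(orchestrate("Do I need a visa for Portugal?"))
```

The point of the sketch is the division of labor: the visa agent can be updated, retrained, or locked down without touching the flight or hotel agents.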

It’s easy to think of other applications that could involve multiple agents: onboarding new employees, managing hospital patients, and servicing customers are a few that come to mind. Some use cases in the enterprise are especially compelling. Consider that a large enterprise can contain dozens, if not hundreds, of departments, such as HR, the help desk, facilities, and training. It would be far more convenient for users if all of these services were available through a single multi-agent system.

For multi-agent systems to address these use cases, the agents need to be able to communicate with one another. In other words, they need to agree on an API for transferring tasks and information between agents. Coordinating the work of multiple conversational agents is called “multi-agent orchestration.” There are now quite a few approaches to multi-agent orchestration: CrewAI, Microsoft AutoGen, Amazon Bedrock, AutoGPT, SmythOS, the Open Voice Interoperability Initiative, the Natural Language Interaction Protocol, the W3C Voice Interaction Community Group’s Intelligent Personal Assistant Architecture, and many others. They all support the coordination of multiple agents, but they differ in important ways.
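As an illustration only, and not the format defined by any of the frameworks or standards named above, an inter-agent hand-off message might carry fields like the ones below. The field names and structure are assumptions made for the sake of the example.

```python
# Illustrative only: a made-up message structure for transferring a task between agents.
# Real protocols (such as the Open Voice Interoperability specifications) define their own formats.
import json
from dataclasses import dataclass, asdict


@dataclass
class AgentMessage:
    sender: str           # agent originating the hand-off
    recipient: str        # agent expected to take over
    conversation_id: str  # lets agents share context across turns
    task: str             # what the recipient is being asked to do
    payload: dict         # any structured data the recipient needs


msg = AgentMessage(
    sender="travel-planner",
    recipient="hotel-agent",
    conversation_id="abc-123",
    task="find_hotel",
    payload={"city": "Lisbon", "nights": 3},
)

# What would be sent over the wire between the two agents
print(json.dumps(asdict(msg), indent=2))
```

Whatever the exact format, the essential requirement is the same: both agents must agree on it in advance, which is exactly what a standard API provides.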

Most significantly, in some approaches, the orchestration happens when the application is developed; others let agents dynamically come and go in a conversation, depending on the user’s interests and goals. Development-time orchestration is suitable for predictable, structured offline processes where the agents perform tasks that don’t change. In contrast, runtime and user-directed orchestration are suitable for systems that need to be assembled in response to user requests, because users don’t always know what they actually want when they start interacting with one of these agents. Users might have several goals at the outset but change their minds as they go along. This kind of use case requires a maximally interactive and dynamic approach to a multi-agent system, one that can be configured on the fly. Then there’s the fact that some approaches are proprietary and require that all the agents be developed and run on the same platform and even use the same models, while others are based on open standards and are more flexible about the implementation of specific agents.
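The contrast between the two styles can be sketched in a few lines of Python. Everything here is hypothetical (the pipeline steps, the registry, and the selection logic are stand-ins); real frameworks differ considerably in the details.

```python
# Hedged sketch contrasting development-time and runtime orchestration.
# All agent names and selection logic are hypothetical.

# Development-time orchestration: the sequence of agents is fixed when the app is built.
FIXED_PIPELINE = ["ingest-email", "classify", "draft-reply", "review"]

def run_fixed_pipeline(document: str) -> str:
    for step in FIXED_PIPELINE:  # the order never changes at runtime
        document = f"{step}({document})"
    return document


# Runtime orchestration: agents register themselves and are chosen per request.
REGISTRY = {}  # agent name -> agent function

def register(name):
    def wrapper(fn):
        REGISTRY[name] = fn
        return fn
    return wrapper

@register("weather")
def weather_agent(request):
    return "Sunny in Lisbon this week."

@register("visa")
def visa_agent(request):
    return "No visa needed for stays under 90 days."

def run_dynamic(request: str) -> str:
    # A real orchestrator might ask an LLM which registered agent best fits the request.
    name = "visa" if "visa" in request.lower() else "weather"
    return REGISTRY[name](request)


print(run_fixed_pipeline("raw email"))
print(run_dynamic("Do I need a visa?"))
```

In the first style, changing the workflow means rebuilding the application; in the second, a new agent can join the registry and start receiving requests without the rest of the system changing.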

As always, support for standard agent APIs should be an essential consideration in any requirements analysis for a multi-agent project. But standards in this area are not yet fully mature or widely adopted. Right now, the Open Voice Interoperability Initiative and the Natural Language Interaction Protocol are the only proposed multi-agent standard APIs that I know of. Nevertheless, it’s a good idea to follow this work and look into these emerging standards to see if they apply to any specific use case.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversationaltechnologies.com.
