Laying out a Vision for Agentic Speech Tech
Agentic artificial intelligence is a hot topic, and it’s easy to see why. An AI agent can leverage its own built-in capabilities or integrate with external tools for additional functionality. It can autonomously decide when to use which tools based on context, and it can either interact dynamically with users or operate in a fully automated, self-contained manner.
Agents, assistants, and bots have existed for years—business process management (BPM) and robotic process automation (RPA) tools are examples of standardized, rules-based automation. What differentiates today’s AI agents is their increased autonomy. Unlike traditional bots, which mostly follow predefined workflows, agentic AI does not require every potential interaction to be scripted in advance, making it far more adaptable and versatile. AI agents achieve this by tapping the capabilities of large language models (LLMs).
This is a rapidly growing area, with new LLM-based bots appearing alongside new and improved versions of existing traditional bots. Examples of AI agents that extend LLM capabilities include OpenAI’s Operator, Anthropic’s Computer Use, and Google’s Project Mariner. These AI agents can use browsers, search the web, run code snippets, and make use of other software tools and utilities. Another example is Salesforce’s Agentforce, which is tailored for customer and employee service scenarios.
Multimodal LLMs
Most enterprise systems traditionally rely on a single mode of input or output—text, images, audio, or video. Multimodal LLMs, however, can process and switch between different input and output modes as needed, allowing for a richer and more seamless user experience.
Speech and voice capabilities are critical components of multimodal LLMs. For instance, an AI customer support agent might use speech recognition to transcribe a caller’s request, process the text using an LLM, and then respond via synthesized voice output. This ability to transition between modalities enhances the AI agent’s usability in real-world scenarios.
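The speech-to-text, LLM, text-to-speech loop described above can be sketched as a simple three-stage pipeline. This is a minimal illustration, not any vendor’s API: the function names (transcribe, generate_reply, synthesize) and their stubbed bodies are placeholders for whatever speech and language services an implementation actually wires in.

```python
# Illustrative voice-agent pipeline: STT -> LLM -> TTS.
# Each stage is a stub standing in for a real speech or language service.

def transcribe(audio: bytes) -> str:
    """Placeholder speech-to-text stage: audio in, transcript out."""
    return "What is my account balance?"  # stubbed transcript

def generate_reply(transcript: str) -> str:
    """Placeholder LLM stage: turn the transcript into a response."""
    return f"You asked: '{transcript}'. Let me check that for you."

def synthesize(text: str) -> bytes:
    """Placeholder text-to-speech stage: text in, audio bytes out."""
    return text.encode("utf-8")  # stand-in for synthesized audio

def handle_call(audio: bytes) -> bytes:
    """Run one caller turn through all three stages."""
    transcript = transcribe(audio)
    reply_text = generate_reply(transcript)
    return synthesize(reply_text)
```

In a production system each stage would call a real service, and the stages would typically stream rather than run turn by turn, but the handoff between modalities follows this shape.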
The Rise of Agentic Speech Tech
Speech technologies will play a crucial role in several AI agent use cases. Let’s coin a new term, “agentic speech tech,” for the seamless blending of AI agents and speech technologies.
Agentic speech tech brims with transformative potential, enabling these developments:
Richer human-agent interaction: In many cases, AI agents serve as intermediaries between humans and autonomous systems, such as AI voicebots handling support calls and providing customer service. Agents can also summarize conversations, mine them for insights and trends, track sentiment, flag compliance risks, and trigger relevant workflows or alerts.
Complex business process automation: Company workflows span multiple steps, integrate with different systems, and can even cross organizational boundaries. Speech technology will be useful at key interaction points, such as enabling workflow kick-offs by voice command or voice-based approvals within automated workflows.
How to Leverage the Agentic Speech Tech Opportunity
The next generation of speech technologies must be able to accommodate both humans and AI agents as users, in a variety of applications. These include B2C ones, such as virtual assistants, chatbots, and voice-activated smart devices; B2B use cases, like back-end integrations where speech tech is an embedded function rather than a stand-alone product; and government ones, like public service hotlines or accessibility services for people with disabilities.
Speech tech products must evolve to integrate seamlessly into multi-agent ecosystems. Speech recognition APIs that enable AI agents to process voice commands, real-time translation that can support multilingual interactions, and voice authentication APIs that verify identity through biometrics will be key. And speaking of security, as AI-generated deepfakes proliferate, speech tech must incorporate robust security features:
- Voice forensics and deepfake detection that ensures the authenticity of voice inputs.
- Provenance verification that tracks the origin of AI-generated content.
- Fraud prevention mechanisms that detect unauthorized voice interactions.
Traditional speech tech licensing is seat-based, charged per human user. But in an agentic AI world, new billing models will be required: consumption-based pricing that charges by usage (e.g., per speech-to-text request) and hybrid licensing models that combine flat fees with metered charges.
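A hybrid model like the one just described can be sketched in a few lines. All of the rates and thresholds below are hypothetical, chosen only to show the mechanics of combining a flat fee with per-request consumption charges.

```python
# Sketch of a hybrid billing model: a flat platform fee that includes
# an allowance of requests, plus metered charges beyond it.
# All rates are illustrative, not real pricing.

PLATFORM_FEE = 99.00          # hypothetical flat monthly fee
RATE_PER_STT_REQUEST = 0.006  # hypothetical per-request rate
INCLUDED_REQUESTS = 10_000    # requests covered by the flat fee

def monthly_invoice(stt_requests: int) -> float:
    """Flat fee plus per-request charges for usage above the allowance."""
    overage = max(0, stt_requests - INCLUDED_REQUESTS)
    return round(PLATFORM_FEE + overage * RATE_PER_STT_REQUEST, 2)

print(monthly_invoice(8_000))   # within the allowance -> 99.0
print(monthly_invoice(25_000))  # 15,000 overage requests -> 189.0
```

The appeal of the hybrid shape is that it works for both user types: human seats map naturally to the flat fee, while high-volume AI agent traffic is captured by the metered component.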
AI agents will increasingly act as consumers of software services, discovering and integrating APIs automatically. To play in this space, speech tech companies should develop AI-agent-compatible marketplaces where speech solutions are discoverable by both humans and AI agents; adopt API-first architectures that make products accessible as both stand-alone apps and modular components; and publish clear API documentation and service level agreements, ensuring AI agents can easily integrate speech tech capabilities.
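One concrete way a speech service could make itself discoverable to AI agents is to publish a machine-readable capability manifest alongside its human-readable documentation. The sketch below is purely illustrative: the field names, service name, and endpoint paths are invented for this example and do not follow any existing standard.

```python
import json

# Hypothetical capability manifest an AI agent could fetch to discover
# what a speech service offers and how to call it. Every field name and
# value here is illustrative, not an established schema.
MANIFEST = {
    "service": "example-speech-api",
    "capabilities": ["speech-to-text", "text-to-speech", "translation"],
    "auth": {"type": "api_key"},
    "pricing": {"model": "per-request"},
    "endpoints": {
        "transcribe": {"method": "POST", "path": "/v1/transcribe"},
        "synthesize": {"method": "POST", "path": "/v1/synthesize"},
    },
}

# An agent (or a marketplace indexer) would parse this to decide
# whether and how to integrate the service.
print(json.dumps(MANIFEST, indent=2))
```

In practice such a manifest would likely build on an established API description format rather than an ad hoc schema, but the principle is the same: the service describes itself in a form both humans and agents can consume.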
Agentic speech tech is an exciting new category that holds much promise, as it can significantly expand adoption of speech technology products. Maybe it’s the mic-drop moment for the industry.
Kashyap Kompella, CFA, is an industry analyst, author, educator, and adviser. He is the founder of the AI advisory outfits RPA2AI Research and AI Profs and is a For Humanity Certified AI Auditor.