Industry-Standard Speech App Building Blocks Take Shape
Speech technology use has been gathering momentum, but adoption has nevertheless been slower than advocates wanted. One reason is that creating these applications is not as simple as designing cloud or mobile software. Speech solutions are hamstrung by a large number of proprietary interfaces. The industry recognizes the need to evolve, and a handful of open standards initiatives have emerged; however, when and how widely they will gain acceptance remain open questions.
Nowadays, consumer and business software alike is large, often involving millions of lines of code and numerous, complex interconnections. Through the years, standard interfaces emerged so third parties could quickly and relatively easily create cloud and mobile applications. Not so with speech software.
“Imagine if every time that a user wanted to go from one website to another, they had to shut down their browser and open up another one,” notes Jon Stine, executive director of the Open Voice Network (OVN). “That is now the case with speech applications.”
Speech applications have carved out a few large, well-established niches. Intelligent virtual assistants (IVAs) are now widely available and have changed how businesses and consumers function. Apple’s Siri, Google Assistant, Microsoft’s Cortana, and Samsung’s Bixby enable individuals to rapidly and intuitively shop, play music, complete work, schedule appointments, send messages, and find answers to questions. Additionally, smart speakers like Amazon’s Echo or Google Home perform such tasks as controlling household devices like thermostats or music systems.
Much more is possible, but only if the development process becomes easier. Without standard interfaces, developers must rebuild their voice applications for each platform, a time-consuming, often frustrating, and expensive process.
Speech Solutions Lack Law and Order
Yet today, little to no interoperability is found among the different systems. Even in cases where solutions can be linked, the user often still takes on the burden of pulling the different pieces together. Typically, users must remember which agents are registered to their devices, each agent’s wake word, and which function each application performs.
The reason? The application development landscape is the Wild West: Anything goes, according to Jim Larson, speech applications consultant at Larson Technical Services. Each speech company creates its own application programming interfaces (APIs). Compounding the problem, these companies have constructed walled gardens around their APIs, so it becomes challenging or impossible to access resources outside of their often tightly controlled ecosystems. In essence, once a company or a third party writes a speech application for one device, it has to begin the process all over again when it wants to port it to another.
Therefore, developers spend a lot of time tying their software to each platform’s software and hardware infrastructure. They would prefer to focus on adding features: providing users with access to more information sources or automating steps in business processes. But they can’t, and the problem gets worse every day. “We’re fast approaching a world of millions, if not billions, of conversational assistants,” Larson says. Each one is an island unto itself.
Businesses, consumers, and third-party application development companies want more freedom and flexibility. While the need to improve interoperability is clear, a lot of work must be done to forge a comprehensive set of standards. The process now is similar to constructing a building not only without blueprints but also without standard room sizes, lumber lengths, or plumbing parts. In sum, just about every piece of the software stack has to be designed from scratch. Despite the daunting challenge, vendors, ad hoc consortiums, and academic institutions are trying to fill the many voids.
Different Entities Become Voice Application Standard Flag Bearers
To date, voice applications have been largely tied to central virtual assistants, such as Amazon’s Alexa and Google Assistant. A first wave of speech applications worked only on the device for which each was written.
Amazon has become a leading speech application development supplier and has been gradually extending its ecosystem. The vendor has been enhancing its architecture so that software runs on Alexa as well as other home assistants. In some cases, clients can use multiple voice services on Alexa. The vendor said it is also investing in machine learning and conversational artificial intelligence (AI) research to improve voice service interoperability.
Founded in 1994, the World Wide Web Consortium (W3C) is an industry consortium that has been a primary force in launching many widely adopted standards, such as HTML and XML. The W3C formed a working group to create a voice software development architecture that comprises three layers, according to Deborah Dahl, principal of Conversational Technologies and chair of that working group. These three layers are the following (a rough code sketch follows the list):
- Client Layer
This area emphasizes the user-facing side of each interaction. It collects and delivers user input and machine output via speakers, headphones, smartphones, desktop devices, and a growing variety of other form factors. This piece supports voice as well as text input and output.
- Dialogue Layer
This piece contains the main components that drive interactions between a speech application and its users. It outlines how capabilities like natural language processing and a growing breadth of AI solutions take user input, understand it, and deliver appropriate output.
- External Data/Services/IVA Providers
The third layer features the data and services on which responses and interactions are built. It outlines how back-end data sources, services, and other IVA providers handle requests coming from the dialogue layer, such as finding or correlating the information needed to service a request.
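To make the division of labor concrete, here is a minimal sketch, in Python, of how one conversational turn might flow through the three layers. The class and method names are illustrative assumptions, not part of the W3C working group's actual specification.

```python
# Illustrative sketch only: these class and method names are hypothetical,
# not drawn from the W3C working group's architecture documents.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Intent:
    """A structured interpretation of a user utterance."""
    name: str
    slots: dict[str, str]


class ClientLayer(Protocol):
    """User-facing side: collects input and delivers output, voice or text."""
    def capture_utterance(self) -> str: ...
    def render_response(self, text: str) -> None: ...


class DialogueLayer(Protocol):
    """Interprets input (e.g., via natural language processing) and decides what to say."""
    def interpret(self, utterance: str) -> Intent: ...
    def respond(self, intent: Intent, data: dict) -> str: ...


class ExternalProvider(Protocol):
    """Back-end data, services, or other IVAs that fulfill a request."""
    def fetch(self, intent: Intent) -> dict: ...


def handle_turn(client: ClientLayer, dialogue: DialogueLayer, provider: ExternalProvider) -> None:
    """One conversational turn handed from layer to layer."""
    utterance = client.capture_utterance()          # client layer gathers input
    intent = dialogue.interpret(utterance)          # dialogue layer understands it
    data = provider.fetch(intent)                   # external layer supplies the facts
    client.render_response(dialogue.respond(intent, data))  # answer goes back to the user
```

The appeal of such a layered design is that each piece could, in principle, be replaced independently, which is exactly where interoperability standards would have to do their work.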
Illustrating the complexity of the challenge, the lines among the layers are murky. In fact, some components might even shift capabilities from one layer to another. Consequently, enhancing voice systems so they interact easily with one another will require a great deal of art as well as science.
The Open Voice Network Sets Lofty Goals
Founded in 2020, the OVN is a nonprofit Linux Foundation community. The consortium’s mission is to develop technical standards and ethical-use guidelines for emerging voice and conversational artificial intelligence (AI) applications. The Linux Foundation played a key role in the widespread acceptance of the popular open-source operating system.
The OVN is creating an intelligent personal assistant architecture and API addressing a handful of voice application development challenges. These include the following:
Component Interchangeability
The goal is to allow users to replace any existing component with another. For example, a business could swap a customer database from vendor A for one from company B without affecting the voice agent.
Long-term, the OVN wants key system components to be replaceable. Companies could, for example, replace an English automatic speech recognition (ASR) engine with a Spanish one or an AI engine with a more powerful one.
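A rough sketch of what such interchangeability could look like in code follows. The ASREngine interface and the engine classes are hypothetical placeholders, not an OVN-defined API; the point is simply that an agent written against a common interface can swap engines without touching anything else.

```python
# Hypothetical sketch of component interchangeability; the real OVN API is
# still being defined, so every name here is an assumption.
from typing import Protocol


class ASREngine(Protocol):
    """Any automatic speech recognition engine the assistant can plug in."""
    language: str
    def transcribe(self, audio: bytes) -> str: ...


class EnglishASR:
    language = "en-US"
    def transcribe(self, audio: bytes) -> str:
        return "recognized English text"  # placeholder for a real engine


class SpanishASR:
    language = "es-ES"
    def transcribe(self, audio: bytes) -> str:
        return "texto reconocido en español"  # placeholder for a real engine


class VoiceAgent:
    """The agent depends only on the interface, so the engine can be swapped."""
    def __init__(self, asr: ASREngine) -> None:
        self.asr = asr

    def handle(self, audio: bytes) -> str:
        return self.asr.transcribe(audio)


# Swapping the English engine for a Spanish one requires no other changes:
agent = VoiceAgent(EnglishASR())
agent = VoiceAgent(SpanishASR())
```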
Switching Among Voice Agents
The goal is to create an architecture in which no single agent has to control every aspect of a voice interaction. Developers can then concentrate on building voice agents that do one or a few things very well and rely on other agents for additional functions, leading to the development of richer sets of voice services.
Specialized solutions can be merged so that users do not have to toggle back and forth. A consumer voice agent, such as Alexa, answers general questions, while a second agent is designed to provide information about a specific domain, say, Pandora for music. Users invoke either agent by speaking its wake word.
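The sketch below illustrates the switching idea with a hypothetical wake-word dispatcher; the agent names and the dispatch function are assumptions made for illustration, not part of any published specification.

```python
# Hypothetical wake-word dispatcher: each registered agent handles a request
# only when its own wake word starts the utterance.
agents = {
    "alexa": lambda text: f"General assistant answers: {text}",
    "pandora": lambda text: f"Music agent plays: {text}",
}


def dispatch(utterance: str) -> str:
    wake_word, _, request = utterance.partition(" ")
    handler = agents.get(wake_word.lower())
    return handler(request) if handler else "No agent registered for that wake word."


print(dispatch("Pandora play some jazz"))
```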
Move Beyond Switching
Developers would like to mix voice software much as they do software containers, plugging components in as needed. Mediation occurs when one conversational assistant acts as a user of another conversational assistant’s features. With instant replay, an assistant can replay the conversation just prior to an interruption to remind the user what he or she was doing.
Channeling occurs when one assistant speaks for another. Some assistants can modify the voice characteristics of another voice assistant. For example, a channeling assistant slows the audio rate for non-native speakers or cognitively impaired speakers or increases the audio volume for the hearing-impaired.
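The following sketch expresses mediation, instant replay, and channeling as hypothetical wrapper classes. None of these names come from the OVN, and a real implementation would sit on top of actual speech and text-to-speech services rather than plain strings.

```python
# Illustrative only: the mediation and channeling patterns described above,
# expressed as hypothetical wrappers (not an OVN-defined API).
from typing import Protocol


class Assistant(Protocol):
    def ask(self, utterance: str) -> str: ...


class MediatingAssistant:
    """Mediation: one assistant acts as a user of another assistant's features."""
    def __init__(self, delegate: Assistant) -> None:
        self.delegate = delegate
        self.transcript: list[tuple[str, str]] = []

    def ask(self, utterance: str) -> str:
        reply = self.delegate.ask(utterance)        # use the other assistant
        self.transcript.append((utterance, reply))  # keep history for instant replay
        return reply

    def instant_replay(self) -> tuple[str, str] | None:
        """Replay the exchange just prior to an interruption."""
        return self.transcript[-1] if self.transcript else None


class ChannelingAssistant:
    """Channeling: speak for another assistant while adjusting voice characteristics."""
    def __init__(self, delegate: Assistant, rate: float = 1.0, volume: float = 1.0) -> None:
        self.delegate = delegate
        self.rate = rate        # < 1.0 slows speech, e.g., for non-native speakers
        self.volume = volume    # > 1.0 raises volume, e.g., for hearing-impaired users

    def ask(self, utterance: str) -> tuple[str, float, float]:
        # Return the delegate's text along with the playback settings a
        # text-to-speech engine would apply.
        return self.delegate.ask(utterance), self.rate, self.volume
```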
Improve Data Sharing
It would be convenient for voice applications to share data when the user switches among them. The OVN wants to relieve users from the burden of reentering the same information into multiple voice agents or from one voice agent into another.
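A minimal sketch of the idea, assuming a simple shared profile that two hypothetical agents read instead of asking the user to repeat the same details:

```python
# Hypothetical example: both agents draw on one shared profile, so the user
# never reenters the same information when switching between them.
shared_profile = {"name": "Alex", "home_city": "Boston"}


def weather_agent(profile: dict) -> str:
    return f"Here is the forecast for {profile['home_city']}."


def travel_agent(profile: dict) -> str:
    # The second agent reuses the data the first agent already collected.
    return f"Booking a trip from {profile['home_city']} for {profile['name']}."
```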
Automate Services
Another potential benefit is automation, with systems gluing voice agent solutions together to deliver new functions that offload traditionally manual steps.
Stanford Creates a Genie
The Stanford Open Virtual Assistant Lab developed the Genie Assistant. The standard voice assistant performs these common skills: plays songs, podcasts, radio programs, and news; helps find restaurants; answers questions; gives weather forecasts; sets timers and reminders; and controls connected smart home devices. It can control appliances, thermostats, lights, fans, door locks, window covers, and vacuum cleaners; it can also manage seven different kinds of sensors: temperature, motion, illuminance, humidity, flood, ultra-light, and battery. The group has a number of projects under way to extend its reach.
Larson notes that once standards like these are put into place, many benefits arise, including the following:
- Users gain access to more sources, more ideas, and more content that goes beyond the realm (or accuracy) of current general-purpose assistants.
- Users do not have to turn off one assistant to pursue knowledge from another.
- Third-party innovators and entrepreneurs have opportunities to develop new products and services—niche-use, add-ons, or industry-specific.
- Companies realize increased investment efficiency, as one conversational assistant communicates with any other assistant, regardless of platform. Build once; use everywhere. No longer will companies need to develop separate applications for separate platforms.
- Vendor choice and the opportunity to rely on best-of-breed or best-partner solutions will increase.
Currently, the standards have been broadly outlined but not fully fleshed out. The W3C, for instance, notes that the following work must be done:
- specify the interfaces among the components;
- suggest new standards where they are missing;
- refer to existing standards where applicable; and
- refer to existing standards as a starting point to be refined for IVAs.
Essentially, the deployments are now in an early stage of development. “I would say we are in early alpha,” Dahl says. In fact, little of the work from the different groups is currently running in a production environment.
Compounding the hurdles, voice technology continues to morph rapidly. As the standard building blocks were being put into place, ChatGPT emerged and changed the landscape significantly. Its long-term impact seems particularly far-reaching. Commercial chatbots today are notoriously brittle, as they are hardcoded to handle a few possible choices of user inputs. Large language models (LLMs), such as GPT-3, are far more fluent.
However, developers need ways to tie its capabilities and output to other voice systems. On the plus side, OpenAI, the organization behind the initiative, has taken a relatively open approach and published a number of interfaces to the AI engine. But to date, such work has focused on voice input and user-system dialogue, according to Dahl. Not much has emerged on the back end. Consequently, integrating it with other voice systems requires a lot of hard work. Ultimately, ChatGPT could evolve into another walled garden, one where connecting it to other LLMs will be a cumbersome exercise.
First Signs of Progress Are Seen
While a lot of work remains, progress is being made. The W3C and OVN have been working together and trying to ensure that their work is consistent and interoperable. The former is developing a high-level speech application architecture. The latter is creating specific APIs and eventually conformance testing mechanisms.
Customers are starting to step forward. The Estonian government wants to provide citizens with a common digital, text, or voice interface to any department. It turned to the OVN for guidance and is working to transform the emerging standards into APIs and software that will drive its citizen voice interactions, according to Dahl.
The Estonian government expects to begin testing such features by the end of this year. The voice interface could help individuals gain access to needed information, like how to obtain a fishing license. Eventually, it expects to leverage the open interfaces to build more sophisticated applications. One is diagnosing over the telephone whether a person has COVID-19. A second is streamlining the food recall notification process.
Creating standard voice application interfaces is extremely complex work. Potential specifications are starting to take shape and are expected to bear their first fruit by the end of this year. Mixing and matching voice functions among conversational assistants and smart speakers appears to be the first place where interoperability will sprout. However, the point where connectivity is as simple and widespread as browsers using the internet appears to be at least a few years away.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or @PaulKorzeniowski.