The Path Ahead (and Behind) for Conversational AI
For half a century, humanlike voice interaction has seemed like it was just around the bend. Along the way, the voice industry has produced and refined many significant technologies. Having arrived at this point, however, it’s not quite as magical as we’d imagined or hoped … and that’s OK.
Here’s an incredibly short list of the steps that got us here:
- recognize 10 isolated digits (voice “dialing”);
- recognize many isolated words (navigation);
- synthesized speech from text (unlimited novel responses);
- recognize long utterances via grammars (more natural input);
- dialogue script engines, e.g., VoiceXML (standard methods for flows);
- natural-sounding text-to-speech (good enough for “canned” prompts);
- intent extraction tools/services (easy natural language processing on input); and
- emotion detection (Confused? Pleased? Angry?).
With the technology we have today, it is possible to build quite successful applications. They are useful. Thanks to UX experts, they are usable. But they still require time and artistry to get right.
And we still fall short of the unspoken goal: conversation. We program a simulation of a constrained human interaction that attempts to anticipate the twists and turns of an encounter. We encode it as: If X is heard, then say Y, and do Z. Not surprisingly, the earliest applications were written in actual programming languages (like C). Today we have a range of voice-specific coding platforms to choose from, but they still involve procedural programming. We call these schemes dialogue (not conversation) managers, and they all trace their lineage back to something like VoiceXML.
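To make that “If X is heard, then say Y, and do Z” pattern concrete, here is a minimal, hypothetical sketch (in Python) of the kind of hand-authored logic a dialogue manager encodes; the states, intent names, and prompts are invented for illustration, not taken from any particular platform:

```python
# A minimal, hypothetical dialogue-manager fragment: every turn is a
# hand-authored "if X is heard, say Y, do Z" rule keyed to a dialogue state.
# The states, intents, and actions below are invented examples.

def handle_turn(state, intent, slots):
    """Return (prompt_to_speak, action_to_perform, next_state)."""
    if state == "ASK_DESTINATION":
        if intent == "give_city":
            return (f"Flying to {slots['city']}. What day?", None, "ASK_DATE")
        if intent == "help":
            return ("Please tell me the city you want to fly to.", None, "ASK_DESTINATION")
        return ("Sorry, which city was that?", None, "ASK_DESTINATION")

    if state == "ASK_DATE":
        if intent == "give_date":
            return ("Searching for flights.", "search_flights", "CONFIRM")
        return ("What day would you like to travel?", None, "ASK_DATE")

    # Every unanticipated twist needs another explicit branch like these.
    return ("Sorry, I didn't understand.", None, state)
```

Every path the user might take has to be spelled out as another branch, which is exactly why these applications are built from flowcharts and state diagrams.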
The tech innovations listed above point in the direction of “conversation”; all of the steps stand as solid advancements (stepping stones?). But we still construct our applications with flowcharts and state diagrams. We define or declare (program) every twist and turn we can anticipate, and then debug these “interactions” to insert logic for the twists and turns we failed to anticipate. We continue to put a lot of time and effort into tuning prompts just to avoid simple (but numerous) corrections that a human could handle with ease. (Additionally, a human would learn what works better and use that knowledge in future encounters.)
How can we advance the technology of conversational design? To be honest, I’m not sure. But there are people with some ideas.
Lately there has been a lot of interest in training neural networks for conversation. This could be promising as long as there is a large enough corpus of transcribed interactions in a small enough domain. (Humans can learn from a few examples, but machine learning at present requires many.)
However, even if we do find a large enough corpus in a sufficiently constrained domain, there is another problem. Current deep neural networks do not lend themselves to incremental learning. You can’t augment them the way you would update a programmatic flowchart. They need to be retrained: Everything is forgotten, and everything (plus the new examples) is relearned. So the completely retrained model will need to be thoroughly retested. This raises the question: How do you meaningfully test a conversation that does not follow a predetermined branching structure? Currently, automated testing uses scripts: They direct the application to a specific point in the flowchart and score whether it does what is predicted. With natural conversation, this scheme doesn’t work.
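For contrast, here is a rough sketch of how that script-based testing works today, reusing the hypothetical handle_turn() fragment from the earlier sketch; it is an illustration of the general scheme, not any specific testing tool:

```python
# A rough sketch of script-based dialogue testing: each test case replays a
# fixed sequence of user turns along one flowchart path and asserts the
# predicted system response. handle_turn() is the hypothetical sketch above.

def run_script(script, start_state="ASK_DESTINATION"):
    """Drive the dialogue along one predetermined path and score it."""
    state = start_state
    for intent, slots, expected_prompt in script:
        prompt, _action, state = handle_turn(state, intent, slots)
        if prompt != expected_prompt:
            return False  # the app wandered off the predicted branch
    return True

# One scripted path through the flowchart. A free-form conversation has no
# single predicted path to assert against, which is the testing problem.
booking_script = [
    ("give_city", {"city": "Boston"}, "Flying to Boston. What day?"),
    ("give_date", {"date": "Friday"}, "Searching for flights."),
]
print(run_script(booking_script))  # True only if every turn matches the script
```

The whole scheme depends on knowing, in advance, the exact branch the conversation should take; once the model can improvise, there is no single expected response to score against.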
Our technological achievements and progress toward the future are impressive. But we can all agree that there is still a rich path ahead. I propose that it is time to raise our expectations for the capability and functionality of voice applications.
Not too long ago we marveled at natural “language.” Now we are poised to build natural “conversation.” This will require new technology and new ways of thinking about conversation. This next step won’t be an uber-AI, but it might approximate the conversational toolkit of a preschooler, who can apply generalized conversational skills to talk about anything. For one simple example, the preschooler effortlessly and unconsciously knows how to insert a confirmation phrase to let the other participant know that the conversation is on track. A suite of such basic skills would dramatically reduce the amount of dialogue design required. And as an added bonus, these standardized basic skills would ensure conversational consistency within and across applications.
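As a toy illustration of what one such packaged, reusable skill might look like, here is a hypothetical confirmation helper that any dialogue could call rather than each designer re-authoring the behavior; the phrasing choices and the option to echo the key fact are invented examples:

```python
# A toy sketch of one reusable conversational skill: inserting a brief
# confirmation phrase so the other participant knows the conversation is
# on track. The phrasings and the echo rule are invented for illustration.

import random

CONFIRMATIONS = ["Got it.", "OK.", "Right.", "Sure."]

def add_confirmation(heard_summary, next_prompt, echo_key_fact=False):
    """Prepend a confirmation, optionally echoing the key fact just heard."""
    if echo_key_fact and heard_summary:
        opener = f"OK, {heard_summary}."
    else:
        opener = random.choice(CONFIRMATIONS)
    return f"{opener} {next_prompt}"

# Example:
# add_confirmation("Friday to Boston", "What time of day?", echo_key_fact=True)
# -> "OK, Friday to Boston. What time of day?"
```

The point is not this particular helper but the packaging: if such basic skills were standardized and shared, designers would not rebuild them for every application, and users would meet the same conversational behavior everywhere.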
This will not be the last step, but it is a necessary and inevitable step.
Conversational AI is one of our goals at AVIOS. Our upcoming “Conversational Interaction Conference,” April 12 and 13 (https://www.conversationalinteraction.com/) will have experts who would like to share their ideas with you. See you there!
Emmett Coin is the founder of ejTalk, where he researches and creates engines for human-computer conversation systems. Coin has been working in speech technology since the 1960s, when he investigated early ASR and TTS with Dennis Klatt at MIT.