Boring Is OK, but Exciting Is Better
I happen to think we're doing a fine job of making speech recognition work. After a long, hard slog, speech technology has become mainstream and commonplace. And that's a problem.
Problem? If you're in the business of selling speech technology to the general public, being mainstream isn't much of a problem. Your sales force no longer has to work against the perception that speech technology is equal parts snake oil, science fiction, and marketing hype. Given all the trouble we've had over the years trying to break into the marketplace, placid acceptance feels just fine. I make a living in speech technology, and I'm delighted.
But as an innovation consultant, I wonder about what happens next. After more than two decades, Version 1 of speech technology is up and running. What about the next 20 years? What about Speech 2.0?
Let's start with our competition, and I mean our real competition. We've positioned ourselves as a better-than-DTMF (dual-tone multifrequency) method to replace humans, but we're not alone in that space the way we were in the 1980s. We've already taken a big hit to our business prospects from the World Wide Web, which offers superior presentation, richer and more varied user interfaces, and terrific accuracy. Sure, sometimes people need to use their phones and not the Web, but that won't be true for long. How can speech technology compete with the ever-slicker cell phones, PDAs, and simplified user interfaces (UIs) that come with Web 2.0? Apple's iPhone introduces an entirely new user interface for mobile devices; Microsoft may even fix its dreadful mobile-device UI one day, although I wouldn't hold my breath. The Wii's gesture-driven controller for playing video games scratches the surface of an entirely new paradigm. Every month, every week, every day a new UI concept comes along and chews away at our technological niche.
To strike back, we need a different shtick. We need to move beyond the mindset of replacing Mable, the telephone operator. We need Speech 2.0, a new approach to the value of the speech UI.
Web's Already at 2.0
What makes Web 2.0 tick? Here are a few examples: Wikipedia has become the world's premier repository of information; Digg provides the hottest news stories; and everyone has an account on MySpace. What do these Web 2.0 phenomena all have in common? Disaggregation, or breaking things apart. Wikipedia disaggregates ownership; its content belongs to no one, which is to say that it is owned by everyone in the entire world. Digg's news-rating system disaggregates authority; readers vote on what's important and what's not, and they determine what appears on the front page. MySpace users create content for their own pages, but other users share authority and ownership as they offer links, comments, and ratings.
I don't claim to have all the answers for how this applies to Speech 2.0, but I can point the way. Take sharing authority, for example. Who has control over the prompts in your application, you or your customer? How about sharing some of that authority and catching up to Yahoo and other Web sites that let users lay out their home pages? If I always select just one option on your speech menu, you can give me the authority to turn off options I never use; better yet, learn my preferences and put me through to the menu choice I pick 99 times out of 100 instead of running me through the maze each time. Let me select the gender, cadence, and degree of formality of the prompts.
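For the technically inclined, here is a minimal sketch of the "learn my preferences" idea, written in Python. Everything in it is an assumption made for illustration: the class name `MenuPreferences`, its methods, the 99 percent threshold, and the 20-call minimum are hypothetical and not taken from any real speech platform.

```python
from collections import Counter, defaultdict


class MenuPreferences:
    """Hypothetical per-caller tally of speech-menu choices (illustration only)."""

    def __init__(self, threshold=0.99, min_calls=20):
        # threshold: share of calls one option must reach before skipping the menu.
        # min_calls: how much history we want before trusting the pattern.
        # Both defaults are assumed values, not recommendations.
        self.threshold = threshold
        self.min_calls = min_calls
        self._history = defaultdict(Counter)

    def record_choice(self, caller_id, option):
        """Remember which menu option this caller picked on this call."""
        self._history[caller_id][option] += 1

    def shortcut_for(self, caller_id):
        """Return the caller's habitual option, or None to play the full menu."""
        counts = self._history.get(caller_id)
        if not counts:
            return None
        total = sum(counts.values())
        if total < self.min_calls:
            return None
        option, hits = counts.most_common(1)[0]
        return option if hits / total >= self.threshold else None
```

An application built along these lines would call `record_choice` at the end of each call and check `shortcut_for` at the start of the next one. It should probably still let the caller say something like "main menu" to get back to the full list, so that the shortcut shares authority with the caller rather than taking it away.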
For real Speech 2.0, figure out a way to let customers have some ownership over your applications and let them share with others. Maybe a group of friends will record prompts for each other; maybe your application lets people record tips and tricks to create a speech recognition-driven, customer-generated FAQ. Or maybe your law firm lets customers record the world's largest repository of lawyer jokes, and promotes the top jokes to the music-on-hold queue.
Other collaborative, shared-ownership services also beckon. Anti-spam companies share data from email to find patterns of spam and viruses. A similar service using speech recognition could use common data gleaned from recorded telemarketing calls to screen out that recurring annoyance. Even on the smallest scale, on a single device, multimodal systems require the speech UI to share data ownership with other UIs.
Coming up with new ideas isn't always easy, and sorting brilliant concepts from the nitwit ideas can be surprisingly hard, but if we want speech technology to remain a viable interface, it's time for Speech 2.0.
Moshe Yudkowsky, Ph.D., is president of Disaggregate Consulting in Chicago. He can be contacted at 1-773-764-8727, or by email at speech@pobox.com.