CAN U W8 for Multimodal: SMS and Clunky Heuristics
We all take the browser on our computer for granted. As I write this, my co-worker is watching the new (scary, animated, dark) Eminem video on her IE with a QuickTime helper. Lately, I've been looking at another form of browser, one not really designed as a browser and, indeed, not even capable of interpreting a markup language: SMS. That stands for Short Message Service, the ubiquitous text messaging service that connects tens of millions of mobile phones, carrying billions of messages across every geography that has wireless - including the United States.
It's a striking example of how presentation has become more and more separated from application content and logic. SMS is moving beyond its people-to-people basis into a conversational platform for human-to-machine interactions. The easiest example to look at is Google SMS. While some pundits anticipate that Google will use its huge war chest to compete with Microsoft and build its own browser, it already has a browser for mobile phones - it's called SMS. A single common short code, 46645 (yes, that spells GOOGL), is all you need to turn just about any mobile phone into an online data access device - without paying more than you would for a text message. No WAP, no wireless internet, no GPRS or 1xEVDO or any other wireless gobbledygook.
What I mean by separating presentation from logic is this: by setting up an SMS gateway into the Googlesphere, you and I can get directory assistance (white pages and yellow pages), access pricing information via Froogle, or look up area codes and zip codes. Heck, you can even do Google searches.
What does this have to do with speech technology? Well, the two technologies actually have some important commonalities. Before I explore these, remember that the underlying theme is the separation of presentation from application logic and business rules. That theme is a powerful and pervasive architectural principle underlying current IT planning: embodied in concepts such as Service Oriented Architecture (SOA), the objective is to have most computing occur as messages exchanged across a distributed architecture.
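To make that separation concrete, here's a minimal sketch in Python (the function names and data are invented for this column, not any vendor's actual API): one piece of business logic sits behind two thin presentation layers, one spoken and one SMS.

    def find_listing(name: str, locality: str) -> dict:
        """Business logic: look up a directory listing. Hypothetical stub for illustration."""
        return {"name": name, "locality": locality, "phone": "555-0123"}

    def speech_present(listing: dict) -> str:
        """Speech presentation: verbose and conversational, meant to be read aloud."""
        return (f"The number for {listing['name']} in {listing['locality']} "
                f"is {listing['phone']}. Have a pencil ready.")

    def sms_present(listing: dict) -> str:
        """SMS presentation: terse, built to fit inside one 160-character message."""
        return f"{listing['name']} {listing['locality']}: {listing['phone']}"

    listing = find_listing("KFC", "Paramus NJ")
    print(speech_present(listing))
    print(sms_present(listing))

The point is that the lookup itself neither knows nor cares which front end asked; only the last few lines change.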
But let's come up nearer to the surface, where the presentation layer is. Both speech and SMS place unique constraints on the user interface. Speech systems, and it seems SMS browsers as well, rely heavily on being able to interpret the user's request. In speech systems, we spend lots of money and time on vendor solutions to understand the utterance. In SMS, it's the same problem, attacked differently. The problem in speech is extracting the essential meaning from a string of conversational input ("I want to find the nearest Kentucky Fried Chicken to Exit 95"), while in SMS the user is trying to reduce the input to the absolute minimum number of keystrokes ("KFC ex 95 rt 80").
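As a toy illustration (again in Python, again invented here - this is not how Google SMS or any speech engine actually works), both surface forms ultimately boil down to the same two slots; the difference is how much machinery it takes to fill them.

    import re

    def parse_speech(utterance: str) -> dict:
        """Pull the business name and location out of a conversational utterance."""
        match = re.search(r"nearest (.+?) to (.+)", utterance, re.IGNORECASE)
        return {"what": match.group(1), "where": match.group(2)} if match else {}

    def parse_sms(message: str) -> dict:
        """Treat the first token as 'what' and the rest as 'where' shorthand."""
        tokens = message.split()
        return {"what": tokens[0], "where": " ".join(tokens[1:])}

    print(parse_speech("I want to find the nearest Kentucky Fried Chicken to Exit 95"))
    print(parse_sms("KFC ex 95 rt 80"))

In the speech case the system does the work of stripping away the conversational wrapper; in the SMS case the user has already done it.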
Another example is directory assistance (DA) - one of the most challenging speech applications there is. In automated speech implementations being deployed across the Verizon territory, much has been made of the repeated prompts for disambiguation, making for a difficult user interface. In SMS, developers are using what we can call 'clunky heuristics' - basically textual shortcuts - to make a query as flexible as possible. This allows a zip code or an area code to take the place of a city and state (way too much to type). Even if the resulting solution is still a little clunky, you're not paying $1.25 anymore, so you invest the effort. By the way, how likely is an automated DA system to ask for something besides city and state to make your life easier? And isn't that automated DA system still costing you $1.25, just like it did with a real person?
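A rough sketch of that kind of heuristic might look like this (assumed behavior, not Google's actual rules): a five-digit number is read as a zip code, a three-digit number as an area code, and anything else as a typed-out city and state.

    import re

    def interpret_location(text: str) -> dict:
        """Let a zip code or an area code stand in for a typed-out city and state."""
        if re.fullmatch(r"\d{5}", text):
            return {"kind": "zip", "value": text}
        if re.fullmatch(r"\d{3}", text):
            return {"kind": "area_code", "value": text}
        return {"kind": "city_state", "value": text}

    print(interpret_location("07652"))       # five digits: treated as a zip code
    print(interpret_location("201"))         # three digits: treated as an area code
    print(interpret_location("paramus nj"))  # anything else: the long-form city and state

Clunky, yes - but five keystrokes beat typing a city and state on a numeric keypad.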
The contrast is also interesting on the output side. From a speech perspective, audio is limited by the user's ability to remember what she just heard, hence the perennial prompt from speech systems to "have a pencil ready." In SMS we have the same problem for a different reason - text messages are limited to 160 characters. Hence the presentation must be as succinct as possible, or spill across several messages sent in sequence - hardly an elegant solution.
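The arithmetic of that constraint is easy to sketch (a minimal example, assuming a plain 160-character limit and ignoring carrier-specific message concatenation): any reply longer than one message gets chopped into a sequence of shorter ones.

    SMS_LIMIT = 160

    def to_sms_segments(text: str, limit: int = SMS_LIMIT) -> list:
        """Split a reply into SMS-sized chunks, breaking on spaces where possible."""
        segments, current = [], ""
        for word in text.split():
            candidate = (current + " " + word).strip()
            if len(candidate) <= limit:
                current = candidate
            else:
                segments.append(current)
                current = word
        if current:
            segments.append(current)
        return segments

    reply = "Pizza near 07652: " + ", ".join(f"Pizzeria {i} 555-010{i}" for i in range(10))
    for i, segment in enumerate(to_sms_segments(reply), 1):
        print(f"({i}) {segment}")

Ten listings already spill past one message, which is exactly why SMS answers have to be ruthlessly terse.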
Any reader who has grappled with a multimodal implementation is highly familiar with these issues - but you know how few of you there are. For the rest of us, facing our speech or our wireless data issues, we rarely connect the two unless, as is sometimes the case, the speech application offers to send a confirmation via SMS.
If we think of speech as a presentation layer with a uniquely conversational yet constrained information bandwidth, SMS is even more highly constrained. Yet it is used by hundreds of millions of people. More importantly, SMS is admittedly clunky, but its utility and cheap price outweigh its limitations, so users adapt their behavior. They grasp the heuristics - typing shortcuts in many cases, but, as we have discussed, logical heuristics as well.
In speech, the most radical heuristic being employed today is 'keywords.' But this 'open sesame' approach may need revisiting - is that all there is? Maybe there's something to be learned from the short, simple world of SMS in exploring ways to connect users to their goals more efficiently.
Mark Plakias is principal researcher at OPUS Research. You can reach him at mplakias@opusresearch.net , or better yet, connect to the resources available at www.opusresearch.net.