Multimodality: The Next Wave of Mobile Interaction
Multimodality is a new technology that promises to enhance the mobile user experience by enabling network operators to combine speech, touch and onscreen displays into intuitive and powerful mobile applications.

What is Multimodality?

Multimodality combines voice and touch (via a keypad or stylus) with relevant onscreen displays to enhance the mobile user experience and expand network operator service offerings. Blending multiple access channels gives users new avenues of interaction. For example, a mobile user could call a friend by saying "Call Rick" to his or her mobile phone. The multimodal-enabled device could respond, "Select which Rick you want to call using your keypad." In turn, the phone could display everyone named Rick in the user's directory and place the call after the user selects the appropriate person from the mobile device's display. This interchange between voice and visuals opens many opportunities for mobile network operators.

Although the preceding example is a basic demonstration of multimodality, it has significant implications. For users, multimodality represents an efficient way to interact with a mobile device. There is no need to listen to or scroll through a long list of names to make a selection; instead, users quickly make visual choices from options they requested verbally. For network operators, the combination of audible and visual functions represents the future of mobile communications. Soon, applications such as mobile commerce will take advantage of multiple simultaneous channels of communication, providing operators with a new wave of service offerings.
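To make the interchange concrete, the sketch below models the "Call Rick" exchange as a simple dialog flow. It is purely illustrative: the device, its speech and keypad functions, and the directory are hypothetical placeholders, not part of any particular handset API or of the standards discussed later in this article.

# Illustrative sketch of a multimodal dialing flow (hypothetical API).
from dataclasses import dataclass

@dataclass
class Contact:
    name: str
    number: str

DIRECTORY = [
    Contact("Rick Adams", "+1-555-0101"),
    Contact("Rick Jones", "+1-555-0102"),
]

def call_by_name(spoken_name: str, device) -> None:
    """Voice input narrows the search; the screen and keypad finish it."""
    matches = [c for c in DIRECTORY if spoken_name.lower() in c.name.lower()]
    if not matches:
        device.speak(f"No one named {spoken_name} was found.")
        return
    if len(matches) == 1:
        choice = matches[0]
    else:
        # Switch modes: prompt by voice, list on screen, select by keypad.
        device.speak(f"Select which {spoken_name} you want to call using your keypad.")
        device.display([c.name for c in matches])
        choice = matches[device.read_keypad_selection()]
    device.place_call(choice.number)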
The Stages of Multimodality

The comprehensive functionality of multimodality will be brought to market in manageable stages. First, users will have the choice to use one interface or another. In fact, users have this capability today, when they choose to place a call (voice) or send an SMS (text). Today, these interface choices are mutually exclusive. In the future, they'll be inclusive.

Next, users will have the ability to switch from one interface to another, multiple times during the same session. To some extent, users have this capability today, when they browse a visual display to find a phone number in their mobile address book and then switch to voice interaction as the number is dialed. Today, the ability to switch is limited to certain interactions. In the future, users will be able to switch their mode of interaction spontaneously as their situation changes.

Multimodality will reach its functional apex in the third phase, when interaction modes can be used in parallel without limitations. As different technologies converge, users will find it very natural to look at a menu, point to an item and say, "I'd like that one." Using speech and touch-screen inputs to get a variety of outputs is just one of the many multimodal experiences in store for mobile phone users. Multimodal applications have the potential to:

·Combine visuals, voice and touch for powerful mobile applications
·Provide users the freedom to choose from multiple modes of interaction
·Break down barriers to the mass adoption of value-added services
·Enable spontaneous and intuitive communication
·Create new and richer services for mobile users
·Attract new customers and deepen loyalty of existing users
·Generate increased ARPU via value-added services
·Offer richer experiences to mobile users
·Simplify the use of value-added services

Getting from Here to There

First, Choose a Mode
Multimodality may seem like the stuff of future networks, but its first phase is here today. Multimodal applications are already in use when users choose between voice and visual interaction. For example, subscribers can place a call by selecting a number from their text-based address book, or they can use spoken commands to tell a voice-activated dialer to initiate the call. Mobile users can get email on their mobile phones: they can read a message on screen, or have it read to them by a text-to-speech application. Both modes of interaction may be available, but the user has to choose one mode over the other.

Then Switch
Initially, the choice between visual and voice interaction will be mutually exclusive. As multimodal applications progress, both choices will not only be available, they'll be accessible at all times, and users will be able to switch naturally from one mode to another. For example, picking two friends from a mobile chat buddy list and requesting a conference call will switch the user from a text-based interaction to a three-way voice conversation. And when the conference is over, the user will return to the visual buddy list. From exclusive choices, multimodality will progress to seamless switching between modes.
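One way to picture this kind of switching is as a single session whose active mode changes as the interaction demands. The sketch below is a conceptual illustration only; the session object, device methods and mode names are hypothetical, not drawn from any shipping platform.

# Illustrative sketch of switching modes within one session (hypothetical API).
from enum import Enum

class Mode(Enum):
    VISUAL = "visual"
    VOICE = "voice"

class MultimodalSession:
    def __init__(self, device):
        self.device = device
        self.mode = Mode.VISUAL  # start in the text-based buddy list

    def start_conference(self, buddies: list[str]) -> None:
        """Selecting buddies on screen hands the session over to voice."""
        self.mode = Mode.VOICE
        self.device.dial_conference(buddies)

    def end_conference(self) -> None:
        """When the call ends, drop back to the visual buddy list."""
        self.mode = Mode.VISUAL
        self.device.show_buddy_list()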
And Then Do Both

Ultimately, multimodality promises to let users interact the way they would naturally: by looking, talking and touching all at the same time. First, users had to choose. Then they could switch. In the end, multimodal applications will let them use multiple modes of interaction in parallel. Perhaps the clearest example is one in which a mobile user points to an entry in an on-screen address book and says, "Send email." Multimodality lets the user see the information in a visual interface, select an item by touching it, and act upon the selection by speaking a command. As the functionality gets more sophisticated, the user experience gets simpler and more intuitive. That's the promise of multimodality.

Convergence and Multimodality

Today, the trend toward converging many different technologies has produced powerful new mobile devices. Many of today's mobile personal digital assistants (PDAs) provide the benefits of both mobile phones and computers, such as larger, color displays. These devices point toward a future in which mobile devices will deliver powerful network services. This will be further enhanced by new, soon-to-be-released Class A mobile devices, which support simultaneous voice and data channels and will help bring multimodality into the mainstream.

A royalty-free, platform-independent standard is being developed that will allow multimodal, telephony-enabled access to information, applications and web services from mobile devices and wireless PDAs. The Speech Application Language Tags (SALT) Forum, founded by Comverse, Microsoft, Intel, Philips, Cisco and SpeechWorks, is developing the SALT standard, which extends existing mark-up languages such as HTML, XHTML and XML. SALT will let users interact with an application using speech, a keypad, a mouse or a stylus, and it will produce output as synthesized speech, audio, plain text, motion video or graphics. Each of these modes will be able to be used independently or concurrently. Additionally, XHTML+Voice, an initiative founded by IBM, Opera and Motorola, provides a set of technologies that let web developers add voice interaction to web content by exposing appropriate features of VoiceXML in an XHTML context.
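The value of using modes concurrently, as in the point-and-say "Send email" example above, is that inputs arriving on different channels can be fused into a single action. The sketch below shows one simple way such fusion could work; it is a conceptual illustration only, with hypothetical event names and helpers, and does not reproduce SALT or XHTML+Voice syntax.

# Illustrative sketch of fusing concurrent touch and speech input (hypothetical API).
import queue

def run_parallel_input(event_queue: "queue.Queue") -> None:
    """Combine whatever was last touched with whatever was last spoken."""
    selected_entry = None
    while True:
        kind, payload = event_queue.get()  # e.g. ("touch", "Rick Adams")
        if kind == "touch":
            selected_entry = payload              # user points at an address-book entry
        elif kind == "speech" and payload == "send email":
            if selected_entry is not None:
                compose_email(to=selected_entry)  # hypothetical helper
                selected_entry = None
        elif kind == "hangup":
            break

def compose_email(to: str) -> None:
    print(f"Opening a new email addressed to {to}")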
Multimodality in Action

Multimodal applications will combine powerful forms of user input and data output to help network operators deliver new services. The key options are:

·Speech Input: Using speech recognition technology, users will be able to search for information (such as the name of an airport), navigate within and between applications, fill in data fields and perform other hands-free functions.
·Keypad Input: Using the navigation capabilities of their mobile devices (arrows, joysticks, stylus, keypad, touch screen, etc.), users will be able to make selections, enter numbers such as a password or PIN, and perform a wide range of other functions.
·Spoken Output: By simply listening, users will be able to hear synthesized, prerecorded, streaming or live instructions, feedback, requests, sounds and music.
·Visual Output: Using the mobile device's display, lists of options, pictures, maps, graphics and endless other possibilities can be presented to users.

Multimedia Message Example

Multimedia messaging is expected to be a popular multimodal application. With a multimodal-enabled mobile device, users will be able to combine verbal commands and onscreen visuals to send and receive multimedia messages. For example, a user will be able to ask her phone to display new mail simply by saying "open mail." She specifies a message to open by touching it and saying something such as "open it." If the multimedia message has a picture attached, she simply says, "view picture." The entire interchange combines voice recognition with onscreen graphics and text displays, and much of it can be carried out hands-free.
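One way to picture this exchange is as a small set of spoken commands, each interpreted against whatever is currently shown or selected on screen. The sketch below is purely illustrative; the screen and mailbox objects and their methods are hypothetical stand-ins, not a real messaging API.

# Illustrative sketch of voice commands acting on on-screen context (hypothetical API).
def handle_command(command: str, screen, mailbox) -> None:
    """Interpret a spoken command against the current on-screen selection."""
    if command == "open mail":
        screen.show_list(mailbox.new_messages())  # "open mail" lists new messages
    elif command == "open it":
        message = screen.touched_item()            # the message the user touched
        screen.show_message(message)
    elif command == "view picture":
        message = screen.current_message()
        if message.has_picture():
            screen.show_picture(message.picture())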
Steps Necessary for Multimodal Implementations

As we have seen, multimodality is not a far-off dream. Its initial phases, choice and switching, can be implemented today and set the stage for full multimodal services as technology platforms converge and mobile devices offer more diverse functionality. First, mobile devices must be capable of processing multiple channels of communication in order to enable truly parallel interactions; such devices are expected with the introduction of Class A handsets. In addition, existing applications, including voicemail systems and voice-activated dialing, will have to be enhanced to take advantage of multimodality, and future data applications should take voice capabilities into account. Finally, a robust and secure platform is necessary to control all aspects of multimodality.

James Colby is the assistant vice president of marketing, Voice Solutions, Comverse.