Singing the Praise of Speech Recognition
At last census, I owned 15 or so types of musical instruments, though I'm not yet proficient on all of them. And I may purchase a set of bagpipes, which, surprisingly, is still legal here in Chicago.
Every Tuesday, I'm at the Celtic Knot in Evanston, playing in an Irish session (more properly known as a "seisiún"). I also sing, and after extensive research into folk songs just prior to February 14 last year, I discovered a half-dozen love songs in which nobody dies; now when someone calls for a love song, I can offer a range of body counts.
The other night, one of the singers was two-thirds of the way through a song and stumbled over a verse. So I had to ask myself: Wouldn't it be nice to have an application that could follow along as we sang and whisper the next verse in our ear? This would make for a fine mobile application, using speech recognition and text-to-speech.
I admit that I don't believe it's possible to write this application for a mobile device; the task would be difficult enough for a desktop or server. Unlike ordinary dictation on my mobile device, in which I speak a sentence or two and then wait for and check the transcription, this task requires real-time tracking of each utterance. But if I cue the device in advance with the text I'm supposed to sing, the recognition might be possible in real time; I sketch the idea below. I also admit that I find it difficult to believe that I could get text-to-speech to sing in any manner other than hideously annoying. But problems like these are what the fun of research and development is all about.
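To convince myself the cueing idea holds water, here's a minimal sketch, in the Java I'd need for Android anyway, of how a tracker might match a recognized phrase against lyrics loaded in advance. Everything here, the class, the crude word-overlap scoring, is my own illustration rather than code from any real app:

import java.util.List;

// Hypothetical sketch: given the full lyric (cued in advance) and the words
// the recognizer just returned, guess which line the singer is on by simple
// word overlap. A real tracker would need timing and fuzzy matching.
public class LyricTracker {
    private final List<String> lines;

    public LyricTracker(List<String> lyricLines) {
        this.lines = lyricLines;
    }

    /** Returns the index of the lyric line that best matches the utterance. */
    public int locate(String recognizedText) {
        String[] heard = recognizedText.toLowerCase().split("\\s+");
        int bestLine = 0;
        int bestScore = -1;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).toLowerCase();
            int score = 0;
            for (String word : heard) {
                if (line.contains(word)) score++;  // crude overlap count
            }
            if (score > bestScore) {
                bestScore = score;
                bestLine = i;
            }
        }
        return bestLine;
    }

    /** The verse to whisper next: the line after the best match, if any. */
    public String nextLine(String recognizedText) {
        int at = locate(recognizedText);
        return at + 1 < lines.size() ? lines.get(at + 1) : "";
    }
}

Even something this naive suggests why cueing helps: the recognizer only has to get a few words right for the tracker to find its place in a text it already knows.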
With this article coming due, I thought I had my topic in hand: the difficulty of incorporating speech technology into mobile applications. As proof, I had the paucity of such applications on my phone. I have perhaps two dozen applications on my Android phone, not including the ones forced on me by Google and my service provider. The Google search and mapping apps are speech-enabled, and I've got the Swype beta keyboard from Nuance, which accepts dictation as well as touch input; but none of my other apps accept speech input directly, and just one provides TTS output. If useful applications don't integrate speech technologies, there must be a reason. My preliminary conclusion: Voice is just too difficult.
My current framework of choice for development on mobile devices is PhoneGap, which lets me create code in JavaScript and HTML5 and works cross-platform. The code supports access to the core functions of mobile devices, but not speech recognition. So I thought I'd found the gloom and doom I was looking for…but not quite. Android provides a decent set of application programming interfaces to let developers access its capabilities, and if you're willing to program in Java, there are a number of simple tutorials that explain how to use speech recognition. And since PhoneGap encourages the development of plug-ins to extend its basic capabilities, I tapped out a few more searches and found an extensive official list of plug-ins, including one that extends PhoneGap to use speech recognition on Android and another that enables text-to-speech.
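For the Java route, the intent-based API really is as simple as the tutorials promise. Here's a minimal, self-contained activity, my own sketch rather than any particular tutorial's code, that launches the stock recognition dialog and speaks the top result back through the platform's text-to-speech engine:

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import android.speech.tts.TextToSpeech;
import java.util.ArrayList;
import java.util.Locale;

// Minimal Android activity: fire the stock speech-recognition dialog,
// then read the top hypothesis back with text-to-speech.
public class SpeechDemoActivity extends Activity
        implements TextToSpeech.OnInitListener {

    private static final int RECOGNIZE_REQUEST = 1;
    private TextToSpeech tts;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        tts = new TextToSpeech(this, this);  // initializes asynchronously

        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Sing a verse...");
        startActivityForResult(intent, RECOGNIZE_REQUEST);
    }

    @Override
    public void onInit(int status) {
        // In a real app you'd hold off speaking until this callback fires.
        if (status == TextToSpeech.SUCCESS) {
            tts.setLanguage(Locale.US);
        }
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == RECOGNIZE_REQUEST && resultCode == RESULT_OK) {
            // The recognizer returns an n-best list; take the top hypothesis.
            ArrayList<String> results =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            if (results != null && !results.isEmpty()) {
                tts.speak(results.get(0), TextToSpeech.QUEUE_FLUSH, null);
            }
        }
    }

    @Override
    protected void onDestroy() {
        if (tts != null) tts.shutdown();
        super.onDestroy();
    }
}

This round trip, fire an intent and collect an n-best list, is essentially what the PhoneGap plug-ins wrap up for JavaScript callers.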
A few test sentences with Android speech recognition or Swype's dictation will convince you that recognition takes place elsewhere; it's simply too good to be running on the handset. And that is indeed the case: Absent a network connection, recognition fails. Probably as a consequence of this client-server relationship, my ability to set parameters for recognition appears to be very limited.
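To make "very limited" concrete: as far as I can tell from the documentation, the intent carries only a handful of hints, a language model, a locale, and the size of the n-best list, and everything deeper is decided on the server. A sketch of a builder that sets the whole lot:

import android.content.Intent;
import android.speech.RecognizerIntent;

// Roughly the full extent of the tuning available through the intent-based
// recognizer. Acoustic models, endpointing, and confidence thresholds all
// live on the server, out of the developer's hands.
public final class RecognizerSettings {

    public static Intent buildIntent() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_WEB_SEARCH);  // or FREE_FORM
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US");  // request a locale
        intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 5);  // n-best list size
        return intent;
    }

    private RecognizerSettings() {}
}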
So I'm afraid this article is entirely free of gloom and doom. All the building blocks for speech technology are available, some free and some for a fee; recognition continues to improve; and more powerful phones can't help but improve recognition speed and possibly accuracy.
The missing ingredient seems to be a reason to use speech technology on the phone—does anyone really use Siri or its relatives? I've never seen anyone use these personal assistants outside of a demo. But today I'm optimistic: This is not a failure of speech technology, but an opportunity waiting to be exploited.
Moshe Yudkowsky, Ph.D., is president of Disaggregate Consulting and author of The Pebble and the Avalanche: How Taking Things Apart Creates Revolution. He can be reached at speech@pobox.com.