World of Fully Human Robots Still Pretty Far Off: Thoughts from SpeechTEK's Final Day

NEW YORK—Automated speech recognition that's on par with a human transcriptionist is still years away, says Roberto Pieraccini, speech scientist and chief executive officer of The International Computer Science Institute.

Speaking in his keynote address on the past and future of speech technology at SpeechTEK 2013, Pieraccini noted that the advances in speech technology over the last 30 years, writ large, have relied on the statistical approach created at IBM's labs in the 1970s. Many of the gains in accuracy have come not from increasingly refined theoretical frameworks but from increases in raw computing power and storage, and from the ability to leverage enormous data sets to improve recognition results.

"Today, [with] Siri and Google Research, we have no idea how much data they have, but it's an incredible amount," Pieraccini says. "Which shows that you could have all the data you want and you don't reach a human level of performance."

The third day of SpeechTEK 2013 took a broadly philosophical tone, exploring some of the promises and assumptions that underlie the research and marketing of speech technology today.

Looking to 3-D animation, Bruce Balentine, executive vice president of Enterprise Integration Group, finds parallels for text-to-speech (TTS) solutions.

"Graphics people in 1998 thought computer graphics were going to put actors out of business," he says.

Similarly, TTS firms and many enterprises believed, and continue to believe, that TTS technology will eventually become so "natural" or human-like that firms will no longer have to hire actors. They'll be able to just punch in a text and get back a deceptively realistic recording for use in interactive voice response (IVR) systems. Balentine believes that pursuing that end is misguided. For one, human voice actors are cheaper than most firms believe, and for another, people don't necessarily want to feel as if they're talking to a person when they're, in fact, talking to a machine.

The fully 3-D animated films being made today are not, in fact, attempting to look realistic. They're broadly cartoonish, like Pixar's work. Animated films of this sort, in their very style, acknowledge their artificiality and pursue it as an aesthetic end. 3-D animation is meant to look realistic only when it works in concert with filmed human actors, as in almost any recent sci-fi movie you care to name.

Balentine believes that the best approach is to use TTS where it's appropriate. TTS doesn't sound natural, he says; it sounds like a machine, and that's fine. It makes sense to use it in an IVR, for instance, for the changing and unpredictable pieces of a prompt, such as a user's name or number, or to read back what is on someone's mobile screen. A static prompt in an IVR, on the other hand, would probably be better served by an actor if something "natural" sounding is the desired effect. The two can and should also be used in concert, much as 3-D animation is used to augment a scene in live-action films.

Like a Pixar film, TTS can also be used for its own aesthetic ends. Stephen Hawking, for instance, uses his famous monotone TTS communication precisely because it is uninflected by emotion. Sometimes that's an appropriate choice.

Moderating a discussion on the promises of natural language, James R. Lewis, senior human factors engineer for IBM Software Group, openly posited, "Maybe talking to something like it's a human doesn't make sense in the service space."

The dream of a machine that would be just like a human being is the dream of researchers and scientists, but perhaps it isn't commercially necessary or even desirable at the moment.

"The closer we try to get machines to be human-like without fully getting there, the creepier they become. It's the uncanny valley," says David Attwater, senior scientist for the Enterprise Integration Group. "I think it would be a grave mistake for our machines to start adopting social and emotional attributes. I don't think it's appropriate for machines to become those kind of actors."

Chiming in, Jim Milroy, director of user experience for West Interactive, adds, "Most people want customer service. They want to get where they're going fast. If it's through an IVR, fine...I don't know that anyone is looking for a conversation with their phone."

Something of a consensus seemed to emerge on SpeechTEK's final day of panels and discussions: the existing models of speech have started to slow in their delivery of dramatic advances. Progress will continue steadily, but as long as it rests on the same underlying assumptions, it will center on best practices, implementation, and connection to broader unified databases of information. Substantial progress, in this view, will come from figuring out how best to use the tools to predict and serve rather than from sharpening and refining the tools themselves.