Attitude Correction: Deconstructing Six Myths About TTS

Ever want to stop and say, hey, let’s give the kid some respect? That’s how I feel about TTS. Here’s my contribution to giving the technology its due.

Myth #1: TTS is a subordinate technology to ASR.
ASR vendors have trained us to think you can build a speech app without TTS and just use prerecorded prompts (PRPs). That argument cuts both ways: TTS does not need a speech recognition system in order to deliver value. Depending on the application, TTS may be deployed with DTMF alone, behind a mix of DTMF and ASR, or alongside visual output (multimodal).

Myth #2: PRP is better than TTS.
"If it’s a real voice it must sound better than synthesis" is the reflexive perception. Well, as we will see in Myth #3, that isn’t necessarily so. But even if it is for a large percentage of instances, is it more costeffective, does it scale and is it as flexible? Of course not, because PRPs are locked up in human beings. You’ve got the voice talent (who if they’re any good has an agent), and then you’ve got the VUI designer (who may be working for an ASR company), engineers and, of course, the client, who is entranced at the idea of casting. They all need attention, and you need about $1,800 an hour for recording it all. Remember too, humans don’t scale — if you have a mission-critical update and need the talent back in the studio and the agent says sorry, he’s in Cancun, you’ve got a problem. At 16-20 hours of studio time the client could be looking at upwards of $35,000 in prompt recording costs. At SpeechTEK this year the talk about taking speech mainstream and cracking small-to-medium size enterprises and departmental Web applications is absurd if we are still using $1,800-an-hour PRPs to voice the output. As for getting the most caffeinated member of the development team to record prompts into his laptop, let’s not even go there.

Myth #3: The key to Conversational Voice Response (CVR) is better Natural Language & ASR.
Statistical Language Model (SLM) approaches to conversational voice interfaces are what most of the ASR vendors pass off as natural language. Beyond them lie knowledge-based systems that can delve deep into the application domain, which in turn can be filled with strange pronunciations and unanticipated juxtapositions. Indeed, these knowledge-based conversational solutions work better the more domain knowledge is incorporated. Word to the PRPetrator: spoken output from these deep domains will be hard to record and deliver using voice talent. Even once it’s recorded, making smooth transitions between prerecorded phrases, which calls for that wonderful component called ‘prosody,’ is almost impossible for a string of PRPs that can’t be tweaked in real time. We would submit that flexible delivery of content relevant to the conversation, with appropriate prosody, is a key enabler of a ‘conversational’ VUI.

Myth #4: ASR controls navigation.
Directed-dialog speech applications are straitjackets designed to keep the caller on course in the session. For many, design consists of eliciting the correct response by minimizing choice. In mixed-initiative dialogs, however, callers are able to jump around a lot, moving from restaurant finders to taxi locators to train timetables. In a truly robust mixed-initiative environment, contexts change. This is where TTS can help. Context-specific voices, meaning a different TTS voice depending on where you are in the app, are the best way to help callers keep track of their place in the application. Mixing voices within your application(s) becomes expensive and complex with PRPs; not so with TTS.
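
To make the context-specific voice idea concrete, here is a rough sketch of how an application might pick a voice by dialog context. Everything in it is an assumption for illustration: the context names echo the examples above, the voice identifiers are placeholders, and no particular TTS engine or API is implied.

```python
# Hypothetical mapping from dialog context to a TTS voice identifier.
# A real TTS engine would define its own voice names.
VOICE_BY_CONTEXT = {
    "restaurant_finder": "voice_a",
    "taxi_locator": "voice_b",
    "train_timetable": "voice_c",
}
DEFAULT_VOICE = "voice_default"


def voice_for(context):
    """Return the voice assigned to a context, or the default voice."""
    return VOICE_BY_CONTEXT.get(context, DEFAULT_VOICE)


def render_prompt(context, text):
    """Pair prompt text with the voice for the current context.

    Returns a plain dict; handing it to a synthesizer is left out.
    """
    return {"voice": voice_for(context), "text": text}


# The caller jumps from one context to another and the voice follows.
first = render_prompt("restaurant_finder", "I found three places nearby.")
second = render_prompt("train_timetable", "The next train leaves at 5:40.")
```

With TTS, adding a context is one more line in the table. With PRPs, it is another voice talent and another round in the studio.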

Myth #5: The ideal solution is sole-sourcing TTS and ASR.
This is where I will get some real hate mail, I think. This myth turns on the fact that there are multiple vendors (IBM, Loquendo, Nuance, ScanSoft) that can supply both ASR and TTS. One-stop shopping is so much more convenient, and the integration will be so much tighter, or so goes the conventional wisdom. Not if you have a long-term perspective and don’t want to defend against charges of lock-in. In a consolidating market, which arguably exists here, larger platform players (e.g., Cisco, HP, Microsoft) turn to best-of-breed partners. Why? Because large platform suppliers treat speech technologies as components that plug in, not ones that dominate. Pure-play TTS vendors have a role in such an environment.

Myth #6: Growth in TTS is indexed to ASR.
While this is somewhat true in a command-and-control application environment, where ROI comes from understanding the ‘problem’ or ‘command,’ it will be less so as we move into a content-rich environment. Content-rich apps are driven by mass media, messaging and e-learning, stuff that will sell more TTS than ASR. I could go on, but hopefully you get the point. Recently, a large German bank deployed a sophisticated Conversational Speech Platform with no PRPs at all, just pure TTS. That’s worthy of some attention.

Mark Plakias is a partner and senior consultant for The Zelos Group. He can be reached at (212) 366-0895.
