Listen Up: How Natural-sounding Speech Engines are Changing the Industry
After a 15-year adolescence, text-to-speech (TTS) technology is coming of age. Every TTS vendor's goal - a truly natural-sounding, voice-activated computer interface that can read text aloud like a human being - is now within reach of the development community. Industry observers have said all along that TTS would have to make a quantum leap before it could achieve anything near the natural-sounding speech necessary for broad market acceptance. Today's synthesizers make that leap possible by using new processing and linguistic models to convert computer text into speech that is nearly indistinguishable from recorded human speech. TTS is speaking, and the market is finally taking note.
Better databases, less expensive computer memory and more processing power have enabled linguists and phoneticists to implement more advanced solutions than were possible with traditional TTS technology. Armed with an unprecedented level of voice quality, developers are already using next-generation speech engines to create voice interfaces where none was practical before, laying the foundation for new applications and products, such as e-mail and Web readers and advanced interactive voice response systems. These speech engines generate words by phonetic rules, so vocabularies are unlimited.
The achievement of a truly natural-sounding human voice is already making current TTS applications much more compelling. But the future of the voice interface in general hinges on the ability of the computer to interact with the user as a human would. And TTS is about to cross that threshold.
TALK ABOUT PROGRESS
The maturation of the TTS interface means, for one thing, that computers must be able to generate questions to clarify what they've heard, not heard or thought they've heard, just as humans do. Until recently, all computer responses were pre-recorded. While pre-recording solved the problem of a realistic voice interface, it also restricted the computer to answering only the questions the application developer anticipated when creating the system. The newest synthesizers enable the computer to generate any question necessary to clarify any spoken input. That's because automatic speech recognition (ASR) is evolving as well. Next-generation ASR uses natural language understanding, an artificial intelligence-based technology, not only to recognize words but to understand their meaning in context. The new TTS synthesizers allow developers to create natural language dialogue systems that combine TTS with natural language speech recognition. A natural language dialogue system enables a computer to behave like "Human 2" in the following human dialogue:
Human 1: "I would like a ticket to (Mumble) on Friday the seventeenth."
Human 2: "What was the destination?"
Human 1: "Boston." (muffled by cell phone interference)
Human 2: "Was that Austin with an 'A' or Boston with a 'B'?"
Human 1: "Boston with a 'B.'"
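The clarifying turns in the dialogue above can be sketched in code. The following Python is a minimal illustration, not a description of any vendor's engine. It assumes the recognizer hands back candidate words with confidence scores; the function names, the thresholds and the prompt wording are all hypothetical.

```python
# Hypothetical sketch: choose a clarifying question from recognizer output.
# Letters whose spoken names begin with a vowel sound take "an" ("an 'A'").
AN_LETTERS = set("AEFHILMNORSX")


def spell_hint(word):
    """Return a phrase such as "Boston with a 'B'"."""
    letter = word[0].upper()
    article = "an" if letter in AN_LETTERS else "a"
    return f"{word} with {article} '{letter}'"


def clarification_prompt(hypotheses, confident=0.8, margin=0.15):
    """Pick the next question for a list of (word, score) candidates.

    Returns None when the top candidate is clear enough to accept.
    The threshold values are illustrative assumptions.
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    if not ranked:
        # Nothing was recognized, so ask an open question.
        return "What was the destination?"
    best, best_score = ranked[0]
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        # Two candidates are too close to call, so ask the caller to choose.
        return f"Was that {spell_hint(best)} or {spell_hint(ranked[1][0])}?"
    if best_score >= confident:
        return None
    return f"Did you say {best}?"


print(clarification_prompt([("Boston", 0.48), ("Austin", 0.45)]))
```

Because the question text is assembled at run time, it cannot be pre-recorded in advance for every pair of confusable words, which is why this kind of dialogue depends on synthesized speech.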
Though the technology model driving these new applications is not groundbreaking, what developers are doing with it is. Basic speech synthesis is a two-step process. First, standard text is converted into a phonetic representation with markers for stress and other pronunciation guides. Then, the voice is created through a synthesis process, via a digital signal processor, a microprocessor or both, and the phonetic representation becomes spoken sound. Of the two main TTS technologies - formant and concatenative synthesis - it's the latter, with its process of splicing processed speech fragments into recognizable human speech, that is leading the way.
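The two-step process can be pictured with a short Python sketch. It is a simplification under stated assumptions: the tiny lookup table stands in for the phonetic rules a real engine would apply, the phoneme symbols (with "1" marking a stressed vowel and "0" an unstressed one) are chosen for illustration, and the second step returns placeholder labels instead of audio.

```python
# Hypothetical toy lexicon: word -> phonemes, digits mark vowel stress.
LEXICON = {
    "boston": ["B", "AO1", "S", "T", "AH0", "N"],
    "austin": ["AO1", "S", "T", "AH0", "N"],
}


def text_to_phonetic(text):
    """Step 1: convert text into a phonetic representation with stress marks."""
    phonemes = []
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes


def synthesize(phonemes):
    """Step 2 (placeholder): a real synthesizer would produce sound here."""
    return [f"<sound:{p}>" for p in phonemes]


phonetic = text_to_phonetic("Boston.")
print(phonetic)
print(synthesize(phonetic))
```

Keeping the two steps separate means the same phonetic representation could feed either a formant or a concatenative back end.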
EASY AS A-B-C
Formant synthesis, dominant in production TTS applications in the 1980s and early 1990s, models speech on the way humans produce sound using their lungs and vocal cords and by modifying the size and shape of the mouth cavity. Formant synthesizers generate a waveform, then run that waveform through a variety of filters that shape it into a speech wave. Typically, formant systems required less memory but more processing power than concatenative systems. Despite the ability to vary word pitch and duration, the sound was decidedly synthetic and applications remained limited.
Concatenative systems, in contrast, use chips to store segments of actual recorded human speech in the form of phonemes, diphones and triphones, which are fragments and combinations of the smallest units of speech that distinguish one utterance from another. The challenge to this model was twofold. First, speech quality had to be balanced against the limitations of computer memory. Developers realized that the larger the segments of speech they used, the more natural the voice would sound. However, more memory was needed to store and access these segments. Second, because of the nature of phonetic speech, joining the speech segments together in a natural way was problematic. Developers refer to the fluid contours of continuous human speech as intonation, melody and prosody. Without them, concatenative speech sounds uneven, disjointed and obviously artificial - the major shortfalls of the previous generation of TTS engines.
MORE PROCESSING POWER ENABLES MORE SOPHISTICATED SOLUTIONS
As processors and memory continue to grow in capability and drop in price, it has become possible for developers to use larger voice segments that make it easier to develop more natural-sounding speech. At the same time, developers have broken new ground in the ability to join these voice segments effectively. The solution lay not in how best to join phonemes - the tiniest snippets of human speech - but in choosing a different speech segment with which to build speech, namely, diphones. A diphone is a phoneme-size unit that contains the second half of one phoneme and the first half of the next, created by cutting two adjacent phonemes down the middle. This is done because phonemes are relatively constant in the middle but change at the edges to meet up with the next or previous phoneme. Linking two diphones seamlessly in new combinations is easier than joining two phonemes.
The linking process is analogous to cursive handwriting, where a person writing script changes the shape of a letter's ending to match the start of the next letter. Most people form the center of each letter the same way every time so that readers will recognize it. Cutting a line of cursive letters where the letters join one another would make it difficult to rearrange them so that the lines still flow. Cutting the letters in the center would make it easier to match them up. The same holds for phonemes and diphones.
The new concatenative speech synthesizers join even larger segments, such as syllables, words and even entire phrases, where there are several hundred thousand possible segment combinations for each unit. Using larger segments is nothing new, but making it work well has until now confounded developers. The challenge is to achieve the highest quality speech with the smallest possible database and the least amount of processing. The computer must be able to quickly find the best segment to use and then glue these segments together in such a way that end users don't hear the concatenation points.
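To make the diphone idea concrete, the Python sketch below lists the diphones needed for a short phoneme sequence. It only names the units; it does not cut or join any audio. The function name, the phoneme symbols and the "_" silence marker are assumptions made for illustration.

```python
def to_diphones(phonemes, silence="_"):
    """Name the diphones that cover a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of
    the next, so it is labeled by a pair of adjacent phonemes. Silence
    is added at both ends so the first and last phonemes are covered.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["B", "AO", "S"]))
# prints ['_-B', 'B-AO', 'AO-S', 'S-_']
```

Every join between two of these units falls in the steady middle of a phoneme, which is the point of the cursive-handwriting analogy: the cut is made where the shape varies least.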
TTS COMES OF AGE
The use of large units of human speech with enhanced methods of selecting, gluing and modifying these segments, along with refinements to the basic TTS algorithms, has helped TTS through its final growing pains to become a mature technology. TTS developers liken the current state of the technology to that of the Internet once it became possible to establish the first World Wide Web sites on the network. These synthesizers promise to change the way people interact with call centers and voice-mail systems, and will likely engender a host of applications we have not imagined. In the past, speech companies focused on systems with a relatively small CPU and memory footprint, trading off speech quality for space. However, for many current applications, CPU and memory limitations are no longer restrictions. It was a logical next step to focus on other issues, such as improving speech quality.
In large-segment concatenation, segments covering each possible sound might be recorded several times, each time with a different prosodic contour in terms of pitch, speed or emphasis. The application searches the database for the closest match and then modifies it to fit the needed linguistic context. However, with the larger database comes the need for more complex methods of locating the appropriate segments and storing them compactly.
Places to look for TTS in the near future include e-mail and unified messaging systems, data access, security systems, text-based sales and services of all kinds, navigation systems, personal computer-based agents, server-based telephony, voice-mail systems and new telephone directory services, where actual dialogue will replace cumbersome keypad menus. The new synthesizers also promise to enable telecommunication companies to take interactive voice response systems to the next level. Consumers will be able to easily retrieve information from automated systems, where a perfectly natural-sounding voice will read them their e-mail, account information, news headlines, stock quotes or Web content.
Beyond these applications, a natural-sounding computer/voice interface is one of the key ingredients for interacting with devices in a natural dialogue. Some technology watchers predict the future will be filled with devices that converse with us, from our houses and cars to our wristwatches and cellular phones. The growth of computer processing power will enable developers to go beyond the natural-sounding voice itself, to create applications that speak as naturally as any expressive and perceptive reader. A person reading aloud can appreciate tone and meaning, and express humor, irony or the contextual meaning of a narrative's elements, assuming voices for the two sides of a dialogue or anticipating the causes and effects of various events. Computers will eventually have the intelligence to add that level of understanding and contextualization to the prosody of synthetic speech, and to formulate and ask any question in response.
Consumer demand and post-deregulation telecommunications competition will continue to be major factors in driving the popularity of TTS. Increased computing power has made it easier for developers to create TTS-based applications. These and other factors have made TTS systems much less expensive on a per-port basis. As a result, TTS is emerging as a major feature in applications in a wide variety of industries. Listen for it in a phone system near you.