Navigating TTS's Pitfalls
Text-to-speech (TTS) technology has experienced exceptional advances in the past few years. It has improved to the point that, when used properly, it is difficult to differentiate TTS output from a recorded prompt.
While there are definite benefits to using a TTS engine, such as reduced operating costs and increased speed to market, the technology is complex and therefore requires additional care and feeding. One essential element of that care and feeding, and one that is frequently overlooked, is the original input data. Bad data remains the Achilles’ heel of TTS intelligibility. Inconsistent or poor data is one of the largest contributing factors to failed TTS implementations: it can lead to unintelligible word pronunciations, which in turn hurt self-service throughput rates.
Some of the potential data issues you may encounter include:
• Misspellings: Misspelled words are among the most common causes of problems with TTS engines. The best way to head off this issue is to correct spelling errors during data entry using spell-checking software. Unfortunately, by the time the issue is discovered, some, if not all, of the data has usually already been entered. In that case, the issue can be resolved in several ways. A common technique is to create an alternate spelling in the user dictionary: each misspelled word is remapped to its correct spelling, so the engine speaks it correctly. Another, more complex solution is to create an on-the-fly editor that reviews text before it is read and replaces misspelled words with the correct spelling.
• Irregular and inconsistent abbreviations: Irregular or inconsistent abbreviations also trip up TTS engines. For example, the word “hospital” is frequently abbreviated as hos, hosp, or hsptl. Again, an alternate spelling can be provided in the user dictionary for each of these abbreviations, but a better method is to create a standard for how data is entered into the system in the first place. This standard needs to be socialized throughout the data entry process to ensure that no deviations exist and that all data going forward follows the standard format. If the data set is older and fairly static, analysis will be needed to identify the inconsistencies in the data. Once that analysis is done, or once you have developed standards, the inconsistencies can be handled either by the user dictionary or by a program that translates the data before the TTS engine encounters it.
• Capitalizations: Capitalization can also trigger mispronunciations by the TTS engine. For example, a Roman numeral such as XXI, if entered in lowercase, might be read as “x x i” instead of “21.” Be aware of this potential hazard when reviewing your data.
• Foreign words and unique spellings: Foreign-language words can often cause problems for a TTS engine. They are best handled by adding phonetic pronunciations to the user dictionary. The same goes for words with unique spellings. Alternatively, the method used for misspelled words can be applied here: the standard spelling can be added to the user dictionary as an alternate.
• Addresses and phone numbers: Standard types of data, like addresses and phone numbers, are often passed to the TTS engine from a variable in an application and therefore arrive out of context. Using Speech Synthesis Markup Language (SSML) is an effective way to compensate. Wrapping the data in an SSML element such as <say-as> tells the TTS engine what kind of data it is, which helps the engine play it back in an understandable manner. For example, using SSML helps ensure that phone numbers are read as phone numbers instead of as integers.
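As a rough illustration of the <say-as> approach, the Python sketch below wraps a phone number held in an application variable before it is handed to the engine. The function names and prompt wording are hypothetical, and the interpret-as value "telephone" is an assumption: it is widely recognized, but the supported values vary by engine, so check your vendor's documentation.

```python
from xml.sax.saxutils import escape

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def say_as(value: str, interpret_as: str) -> str:
    """Wrap a value in <say-as>, escaping XML special characters."""
    return f'<say-as interpret-as="{interpret_as}">{escape(value)}</say-as>'


def callback_prompt(phone_number: str) -> str:
    """Build a complete SSML prompt (hypothetical wording) around a phone number."""
    # "telephone" is an assumed interpret-as value; engines differ.
    body = f"Your callback number is {say_as(phone_number, 'telephone')}."
    return f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">{body}</speak>'


print(callback_prompt("402-555-0123"))
```

Escaping the value matters because variable data can contain characters such as "&" that would otherwise make the SSML document invalid.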
Another issue around such data is that, at the default speed, it can be difficult for a caller to comprehend an address, especially if it is a location other than the caller’s home address. To counteract this, use the <prosody> element and its rate attribute to slow the rate of playback. But be warned that slowing the playback too much can have a negative effect on the sound of the TTS, so slow it down only a small amount.
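A minimal sketch of that idea in Python follows. The helper name is hypothetical, and the default rate of "slow" is only a placeholder: it is one of the labels defined by SSML, but it may be a larger change than you want. Numeric and percentage rate values allow finer adjustment, but their interpretation differs between SSML versions and engines, so treat any specific value as something to test by ear.

```python
from xml.sax.saxutils import escape, quoteattr


def slow_playback(text: str, rate: str = "slow") -> str:
    """Wrap text in <prosody> so the engine speaks it at a reduced rate.

    The rate string is passed through unchanged; which values are
    accepted, and how they sound, depends on the engine (assumption).
    """
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


# Hypothetical address used only for illustration.
print(slow_playback("123 Main Street, Springfield"))
```

Keeping the rate as a parameter makes it easy to tune the slowdown per prompt instead of hard-coding a single value across the application.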
To have a successful TTS implementation, it is essential to have data that is clean and structured. Cleaning and tuning the data is time-consuming and can be expensive; however, the benefits are invaluable to the overall intelligibility and success of your application.
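For teams that choose a translation program over the user dictionary, the clean-up steps described above might be sketched as a small pre-processing pass like the one below. Everything here is illustrative: the lookup tables are hypothetical stand-ins for the results of your own data analysis, and the Roman numeral handling is deliberately limited to an explicit list, because uppercasing anything that merely looks like a numeral would damage ordinary words.

```python
import re

# Hypothetical lookup tables; real ones would come from analyzing your data.
SPELLING_FIXES = {"recieve": "receive", "pharmcy": "pharmacy"}
ABBREVIATIONS = {"hos": "hospital", "hosp": "hospital", "hsptl": "hospital"}
ROMAN_NUMERALS = {"ii", "iii", "iv", "vi", "vii", "viii", "ix", "xi", "xxi"}

WORD = re.compile(r"[A-Za-z]+")


def normalize_token(match: re.Match) -> str:
    """Return the cleaned-up form of one alphabetic token."""
    word = match.group(0)
    key = word.lower()
    if key in ROMAN_NUMERALS:
        # Restore the uppercase form the engine expects for numerals.
        return key.upper()
    replacement = SPELLING_FIXES.get(key) or ABBREVIATIONS.get(key)
    if replacement is None:
        return word  # Unknown tokens pass through untouched.
    # Keep a leading capital if the original token had one.
    return replacement.capitalize() if word[0].isupper() else replacement


def normalize(text: str) -> str:
    """Clean a string before it is handed to the TTS engine."""
    return WORD.sub(normalize_token, text)


print(normalize("St. Mary's Hosp, wing xxi, pharmcy desk"))
# prints: St. Mary's Hospital, wing XXI, pharmacy desk
```

Matching whole tokens, not substrings, keeps an abbreviation such as "hos" from being expanded inside longer words like "host".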
Aaron Fisher is director of speech services at West Interactive, overseeing the design, development, and implementation of speech applications for the company. He can be reached at asfisher@west.com.