The Difficulties with Names
If you have an unusual name because it's rare, ethnic, or because your parents got creative with its spelling, you're probably used to people butchering it when they address or introduce you. Some people admit it's useful to have an uncommon name, since they use it to screen out telemarketers ("Hello, er, um, Mr. 'Bucket'?" "No. Goodbye!" answers Mr. Buquet). While few of us want to help telemarketers (at least when they call us), we all want the best voice dialing, directory assistance, reverse directories, security, access to email, directions and ordering systems; in other words, we want improvements in the full range of voice-activated services, to make them fast, accurate and personalized. To that end, it helps to know the difference between Bucket and Buquet.

Having one's name mispronounced is a common malady, more common than many people realize. It's hard to pronounce names. As difficult as it is for people, imagine how hard it is for computers! Achieving better performance has been the quest of several research teams, some working on the problem for over 20 years.

Excuse Me: Was That Cook or Koch?

Names are hard to pronounce for several reasons. Foremost is the sheer quantity of them and the preponderance of rare names. Smith, the most common US surname, accounts for only 1% of the country's population.
Figure 1 shows how the rank order of surnames relates to the share of the population covered. The graph is an analysis of databases obtained from Donnelley Publishing; it corresponds well with Social Security information. The 50% mark is near the 2,000th name; that is, about 2,000 names (Smith through Arroyo, Gannon, Worthington) cover half of the population. There are a total of 2 million surnames. The data show graphically how many rare names there are: 20% of us have names rarer than the top 50,000 (Boetcher, Marchioni, Yehle). As many as one in 100 households has a unique family name, such as the single households of Adeyooye, Caioppo, Xoumphayvient, and Zabdy. Paradoxically, it's quite common to have a rare name!
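To make the coverage numbers behind Figure 1 concrete, here is a minimal sketch (not the original Donnelley analysis) that computes, from a hypothetical file of "surname,count" pairs, how many top-ranked names cover half the population and what share of people fall outside the top 50,000.

    # Minimal sketch with a hypothetical surname-count file; the real analysis
    # used proprietary Donnelley data, not this code.
    import csv

    def coverage_stats(path, target=0.50, cutoff_rank=50000):
        # Read "surname,count" rows and sort the counts, most common first.
        with open(path, newline="") as f:
            counts = sorted((int(row[1]) for row in csv.reader(f)), reverse=True)
        total = sum(counts)
        running, rank_at_target = 0, None
        for rank, count in enumerate(counts, start=1):
            running += count
            if rank_at_target is None and running / total >= target:
                rank_at_target = rank        # article: roughly the 2,000th name for 50%
        tail_share = sum(counts[cutoff_rank:]) / total   # article: ~20% beyond the top 50,000
        return rank_at_target, tail_share

    # Example (hypothetical file): rank_50, rare_share = coverage_stats("surnames.csv")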
The second factor is that our names derive from dozens of languages and nearly every country in the world, resulting in an unusually high level of ethnic diversity. Perhaps the most important factor is that some names have many pronunciation variants. One person's "wrong" rendition may be another person's "correct" pronunciation, because of the influence of large ethnic populations in a region, how much assimilation has occurred over time, and even personal preferences. (Given names [first names] and business names show similar variation.) Let's explore one of these factors: regional variation. Regional variants of name pronunciations are influenced by immigrant settlement patterns. It's common knowledge that some cities are strongholds for certain ethnic populations; however, few people realize the extent to which name distributions vary from place to place.

Figure 2 shows the most common names in some cities. Few have surname rankings resembling the national ranking. (Chicago is the exception here: 18 of its top 20 names are in the national top 20 list; in the other cities, only 50-75% of their top 20 names are in the national top 20.) In NYC, the most common names include many Hispanic, Jewish and Asian names. In San Francisco, the top names include Asian and Irish names not commonly found elsewhere. In Boston, Irish names predominate; in Shreveport, French names; and in Milwaukee, German names. The implication is that Germanic names (such as Greulich, Doetsch, Wendschlag) are more likely to retain pronunciations close to the original German in Milwaukee than elsewhere. Elsewhere, an Anglicization/assimilation process occurs. Quite often the Anglicization is only partial; that is, the name is not pronounced exactly as English spelling rules would suggest. Instead, the original pronunciation and ethnic origin "flavor" the Anglicized version.

Whether the general public uses an ethnic pronunciation is determined by a complex interplay of many processes. It is influenced by the majority culture's interest in a minority's culture, values and even its foods. The details of this influence are studied in the field of sociolinguistics and are beyond the scope of this article. Needless to say, the full range of accommodation is seen in different areas of the country and at different time periods. Some immigrants seemingly reject their native culture and, to "blend in," accept full assimilation of their names, pronounced by "standard" English rules. Some even adopt common US names, knowing the general populace can't easily render "correct" pronunciations of Chandrashekhar, Nahekeaopono or Kaweiuokalani. Other immigrant populations succeed in interesting and educating the public around them in many aspects of their culture, and that interest spills over into the adoption of more native pronunciations of their names.

Some names have a great number of variations. Koch, for instance, has at least six distinctly different pronunciations that vary regionally. The challenge for TTS systems is to use the best pronunciation for names like Koch, which has no majority pronunciation: only 30 or 40% of the population uses the most common pronunciation. The ASR implications are even more important: the programs must recognize most of those variant pronunciations, otherwise recognition fails when these variations are encountered. The regional variations mentioned are, at this point, too complex for most TTS systems, but ASR systems are beginning to account for regional variability.

HOW DO THESE PROGRAMS WORK and WHY DO THEY FAIL?

Understanding how these systems work will help explain some of the "failures" or errors one occasionally sees. For pronunciations, all ASR and TTS systems work roughly the same way. Most rely on a large dictionary; many also contain rules for "out of vocabulary" words and names. (Even the base dictionaries in most systems were originally generated back in the lab by running a rule-based system; a dictionary is used in deployed systems to save run time.)
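As a rough illustration of that dictionary-plus-rules architecture, here is a minimal sketch, not any vendor's actual implementation: a name is first looked up in a small exception dictionary, and letter-to-sound rules are used as the fallback for out-of-vocabulary names. The dictionary entries and phone strings below are invented for the example.

    # Toy illustration only: the exception entries and phone strings are invented.
    EXCEPTIONS = {"buquet": "b ow k ey", "koch": "k aa k"}   # hand-verified pronunciations

    def hard_or_soft_c(name, i):
        """Toy rule: 'c' is soft (s) before e, i, y (center, city, cycle),
        otherwise hard (k), as in cat, cot."""
        following = name[i + 1] if i + 1 < len(name) else ""
        return "s" if following in ("e", "i", "y") else "k"

    def pronounce(name):
        name = name.lower()
        if name in EXCEPTIONS:            # deployed systems consult a dictionary first
            return EXCEPTIONS[name]
        phones = []                       # fall back to letter-to-sound rules
        for i, letter in enumerate(name):
            phones.append(hard_or_soft_c(name, i) if letter == "c" else letter)
        return " ".join(phones)

    print(pronounce("Koch"), "|", pronounce("Cycle"))   # dictionary hit | rule fallback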
Rules: Rule-based systems embody basic knowledge of how words and names are pronounced. The rules distinguishing most "hard" from "soft" c's might be written as:

    c[eiy] > s    # soft c when followed by e, i, y, as in center, city, cycle
    c[ao]  > k    # hard c when followed by a or o, as in cat, cot

The rules can become quite complex, with context sensitivity determined by a rough ethnographic classification of the name. Rule systems vary in their ability to predict pronunciations. The best actually pronounce names better than humans (1). In a lifetime, most people become acquainted with only a few thousand names, a small fraction of the number researched for the best programs.

Dictionaries: It is impossible for most companies to authenticate the pronunciations of tens or hundreds of thousands of names; therefore, linguists created many dictionary entries using their intuition of how names might be pronounced. Errors arise because no person's intuition is fully accurate. Is Kreamer a homonym for Kramer? Should Cremer also be "Kremmer"? Furthermore, dictionaries used in many systems contain inconsistencies because multiple people edited them.

Some recognition errors reveal mistakes in the underlying rule-based system. A VP noticed one ASR system wouldn't recognize his name, Wagener, until he said it as "wage ner." This was caused by an incorrect analysis of compound words, or their morphology. Just as humans implicitly see the compounds in Wineberg, Winegarden and Winemiller, so too must pronunciation systems, in order to properly recognize the silent "e," the long "i" vowel, etc. An inappropriate context for compound-name rules would pronounce Winegrad as "wine grad" (and was the likely reason for the system mistaking Wagener as "wage ner"). Another revealing error was the TTS system that pronounced Malone as "mal wun." Here the pronunciation for the numeral 1 ("one") was matching the letters in Malone, possibly after an inappropriate morphological analysis.

EVALUATION METRICS

How can one be assured a system performs well? What are the metrics for evaluation? With most features of a system (TTS: intelligibility and naturalness; ASR: word accuracy, task completion rates, noise immunity), one can be confident the results from a small, established test are representative of the whole system (limited, of course, by the variability associated with small sample-set size). However, there is a problem with establishing a standard test set of names (for example, one including Keogh, Riordan, and O'Shaughnesy): all developers can place the troublesome names from the test in their system dictionaries, and the system immediately appears to have better test performance (recognizing or synthesizing more accurately) than it does in actual service. (By the time you have reached this point in the article, every developer has already corrected Buquet and the other names mentioned, but they probably won't admit it!) Thus, fixed, published vocabulary tests will yield inaccurate results. Better metrics derive from actively updated, private lists. It is certainly time-consuming and expensive for each company to develop, verify and refine lists on its own, but I argue there are few shortcuts to honest results. If your company is interested in developing tests for recognition or synthesis of names, the following are guidelines to consider.

Test set: A key requirement is a wide sampling of names, both common and uncommon. The US Census Bureau publishes lists of names, but without any rare names. Don't rely on published sets of name "zingers": most software fixed those particular items long before you read about them. This doesn't imply there are no more problematic names; there are many more! You can also mine your contact lists and company directories for unusual names.
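One way to act on that advice, sketched below under the assumption that you maintain your own ranked surname list: draw test names from several frequency bands rather than only from the top of the list, so common, mid-frequency and rare names are all represented.

    # Sketch of sampling a name test set across frequency bands; assumes you
    # have your own list of surnames ranked from most to least common.
    import random

    def build_test_set(ranked_names, per_band=50, seed=0):
        rng = random.Random(seed)
        bands = {
            "common": ranked_names[:2000],       # roughly half the population (Figure 1)
            "mid":    ranked_names[2000:50000],
            "rare":   ranked_names[50000:],      # the ~20% of people with rarer names
        }
        return {band: rng.sample(names, min(per_band, len(names)))
                for band, names in bands.items()}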
Pronunciations: The best test is one that gathers judgments from many people of which pronunciations are possibly right and which are definitely wrong. Find out how the names are actually pronounced; do not guess or use your intuition, because many people are surprised how often their intuitions are wrong. When obtaining representative pronunciations, be aware that multiple pronunciations may exist. Your experience with Devon or Gautier may not match the most common pronunciations nationally, or within your community of interest. (An anecdote from the deployment of a Reverse Directory service reinforces this point: a Chicago subscriber complained the system mispronounced his name, Koch. When told his pronunciation, "Cook," was less common in Chicago than our system default, he was surprised to learn there were other pronunciations! As per-line customization wasn't possible, Mr. "Cook" accepted that our system chose the best single default.)

Scoring: For ASR purposes, an obvious scoring metric is recognition success or a low recognition "distance." Error sensitivity grows with the number of items in the dictionary: for name dialing, a contact list of 100 people is less demanding than an auto-attendant choosing among a 5,000-name company directory, and Automated Directory Assistance is several orders of magnitude harder still. For TTS applications, most researchers agree a three-tiered scoring scheme is adequate. While category labels may differ ("excruciatingly correct," "sensible," "outright wrong" vs. "clearly acceptable," "somewhere in between," "clearly bad," etc.), most capture the differences among a) how a person says their own name, b) how other people say the name, and c) a pronunciation no one uses. The pronunciation accuracy of the best systems typically exceeds 99% for common names and is better than 92-94% for very rare names. Depending on the exact mix of common and uncommon names in your test, frequency-weighted results can be better than 96-97%. This is not typical performance, but the best systems are better than humans. Accuracies this high could mean your ASR/TTS system's performance would be limited by other factors, and not by the difficult area of proper-name pronunciation.
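As a concrete but hypothetical illustration of how those tier judgments and name frequencies might be combined into a single figure, the sketch below counts tiers a) and b) as acceptable and weights each name by how often it occurs; the particular weighting is an assumption, not a published standard.

    # Hypothetical frequency-weighted scoring: tiers "a" and "b" are acceptable,
    # tier "c" (a pronunciation no one uses) counts as an error.
    def frequency_weighted_accuracy(results):
        """results: (tier, name_frequency) pairs, tier in {"a", "b", "c"}."""
        total = sum(freq for _, freq in results)
        acceptable = sum(freq for tier, freq in results if tier in ("a", "b"))
        return acceptable / total

    # Invented outcomes: (tier assigned by judges, how common the name is)
    scores = [("a", 100000), ("b", 5000), ("c", 40)]
    print(f"Frequency-weighted accuracy: {frequency_weighted_accuracy(scores):.3f}")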
PROGRAMS IN ACTUAL USE

The best programs are used in a variety of applications. Some involve a map company's need to provide accurate pronunciations of street and town names to its customers; others improve the accuracy of voice dialing on mobile phones (the rules were modeled by a company providing software for embedded chips). Targus Information Systems uses the software within its SpeechCapture Express application, increasing the capture rate of names and addresses obtained in speech recognition transactions. Sprint's Voice Command uses the software for high-accuracy TTS playback of the names of people and businesses voice-dialed by its customers. Applications of the software for company auto-attendants and for automating Directory Assistance are currently planned; as such software continues to be refined, more applications and widespread use are likely.

WHAT WILL THE FUTURE BRING?

How are these systems being improved? What is at the forefront of research? There are three major thrusts in the field.

The first is improving the dictionaries in scope (number of entries), breadth (number of pronunciation variants) and overall accuracy (eliminating spurious guesses). As companies gain experience with pronunciations via auto-attendants and automation trials for Directory Assistance, the best dictionaries will begin to reflect, for each name, its range of pronunciations and their relative frequencies.

The second is continuing to improve the accuracy of handcrafted rule-based systems, augmented by small exception dictionaries. These are currently capable of achieving the highest accuracy when they properly imitate the underlying rules that people use.

A third approach involves automated learning theory: let computer algorithms analyze the underlying relationships in a pronunciation dictionary and devise the most efficient and predictive rule set. This promising line of inquiry, automatically developing rule sets without human intervention, is very attractive. The current state of the art obtains 20-30% errors, which is still too high. Automated learning algorithms are likely hampered by the inconsistencies and errors still found in most public research dictionaries (2). It remains to be seen whether automated learning can achieve higher accuracies than earlier attempts (neural networks, analogy systems and the like). So far, the handcrafted systems have a definite edge.

Footnotes:

1. M. F. Spiegel, "Proper Name Pronunciations for Speech Technology Applications," International Journal of Speech Technology, in press, 2003.

2. A. F. Llitjós, "Improving Pronunciation Accuracy of Proper Names with Language Origin Classes," Master's thesis, Dept. of Computer Science, Carnegie Mellon University, 2001. Available at http://www.cs.cmu.edu/~aria/papers/mthesis-cmu.pdf
Murray Spiegel has been active in the speech community for 20 years and directs speech application research at Telcordia Technologies. He can be reached at
spiegel@research.telcordia.com.