Bin Laden Speaking
In November, an international stir followed Al-Jazeera's release of a tape purported to be of Osama bin Laden. Al-Jazeera is the satellite television news network based in the Persian Gulf nation of Qatar. The measures required to authenticate a recording as politically charged and important to international security as the bin Laden tape are significantly different from those used in most everyday access-security systems. Most access-security applications have multiple users whose voices must be processed within seconds after they present themselves for authentication. Their voices are transmitted over a single channel (e.g., wireline/wireless telephone) and they generally know they are interacting with an automated system. The system may restrict what they say for authentication (e.g., a password), invoke backup measures (e.g., secondary passwords), and test for liveness (vs. tape recordings). In contrast, the bin Laden tape contained the recorded voice of an individual speaking freely to a human audience. The recording itself was of poor quality. It had apparently been transmitted by phone at one point and re-recorded many times before it was played by Al-Jazeera. Consequently, the tape required slow, careful analysis by human experts assisted by computers.
Content And Quality
The speaker in the bin Laden tape made references to recent events, including the bombing in Bali, the hostage siege in Moscow, the killing of a U.S. soldier in Kuwait, the assassination of an American diplomat in Jordan, and the bomb attack against a French oil tanker off the coast of Yemen. The analysis, therefore, included attempts to determine whether the words and phrases concerning those events had been interpolated into older bin Laden recordings. Linguistic analysis of such a recording includes verifying that the speaker is using the correct Arabic dialect, employing bin Laden's style of oratory, and exhibiting acoustic patterns that match other bin Laden recordings.
Stylistic elements include preference for certain words, speed of articulation, dynamics, idiosyncratic articulation and/or intonation patterns, and even characteristic fillers (e.g., "uh," "see"). A speaker may, for example, routinely pronounce the word "didn't" as "dint," "didint," or "din." Some speakers frequently end sentences on a rising pitch, making statements sound like questions. Such patterns are compared with authenticated recordings of bin Laden. Good mimics can imitate the style of an individual but they don't have the physiology of that person. Consequently, authentication, whether by human experts or automated tools, examines acoustic patterns that contain information about the size and shape of the speaker's throat, mouth, nose, etc. The use of such features makes it difficult for professional mimics to fool speaker-authentication systems. The noise and distortion of the bin Laden tape made analysis difficult because they affected those features. The challenge in such cases is to eliminate as much noise as possible without removing or further distorting the acoustic patterns needed for authentication.
Live or TTS
Could the bin Laden tape have been created using concatenative text-to-speech synthesis (TTS) or voice conversion technology? Voice conversion transforms the voice of one person into someone else's voice. For example, it would make Judith Markowitz's voice sound like the voice of Humphrey Bogart. Today, conversions produced by such systems may be recognizable as the target speaker's voice but they often sound stilted and unnatural. "They sound artificial," says Dr. Caroline Henton, president of Talknowledgy (see 'The State of TTS,' this issue). "The problem is that many so-called voice conversion systems are based on the same limited rules as parametric TTS systems such as DECTalk use." Bin Laden would get better results using commercial concatenative TTS.
In order to generate flexible, natural-sounding TTS, though, he'd have to spend a minimum of ten hours in a professional recording studio providing high-quality samples of his speech. The recorded material would be segmented into labeled units and stored in a large database. It might be possible to use existing tapes of bin Laden's voice for this purpose but they would lack necessary acoustic variants. They also wouldn't have sufficient consistency in quality, volume, and the other factors necessary to produce units that, when concatenated, sound as if they were spoken naturally and at the same time. According to Henton, mismatches of this sort could be covered up. You could hide any acoustic artifacts of the concatenation process with a sufficiently noisy channel, which is typical of bin Laden's speeches. Unfortunately, the resulting speech would fail to reproduce the emotional nature of bin Laden's speeches, which are designed to stir followers into taking violent action. "Current high-end TTS systems are good, but I haven't heard any synthetic speech system that could reproduce the hectoring and invective in his speeches," says Henton. "Besides that, human intervention is needed to tweak the occasional artificial-sounding bubble in a synthetic utterance." It's unlikely that these technologies were used in the November 2002 bin Laden tape but we shouldn't eliminate them from consideration in the future.
Dr. Judith Markowitz is the associate editor of Speech Technology Magazine and is a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or jmarkowitz@pobox.com.