March 8, 2004
By Bill Byrne, Senior Voice Interface Engineer, Google, Inc.
Features

In the Studio: Setting High Standards for Prerecorded Audio

Both touch-tone and speech recognition-based telephone solutions typically depend on prerecorded audio for prompt delivery, especially in customer-facing call center applications. With the growing demand for these applications, more and more consultants and professional services organizations are trying their hand at prerecorded audio production. However, many have found the prompt recording process to be deceptively complex. After all, voice acting, directing and post-production are specialized skills that require specific technical and design expertise as well as artistic talent.

The Right Voice Talent
Voice talent selection is perhaps the most important step in creating quality prompts. After all, not only do good voice talents deliver technically and artistically sound results, they do it in much less time than their less experienced counterparts. For a given set of prompts, it’s not uncommon for a less skilled talent to take up to three times longer than a better one. For this reason, paying more for the better talent not only improves the user experience, it can also save money.

However, picking the right talent requires a thorough understanding of what typically makes a talent “good” for this particular kind of work in the first place.

Studio Experience – There are many people who initially seem like they might have the perfect voice for a telephone application. They may be stage actors, singers, teachers, preachers or even salespeople with full, clear, resonating voices. But more often than not, when these same people are asked to perform in a studio, they cannot consistently produce the quality that was expected of them. This shouldn’t be surprising.

First of all, just because someone seems to have a nice sounding voice in an environment to which they’re accustomed doesn’t mean he or she will be able to reproduce the same quality on the spot in a recording studio, reading lines from a script. Getting prompts to sound natural on the first or second take comes from a lot of practice. Most professional voice actors learn techniques to control their breathing and posture to make sure none of the natural quality of their voice is lost. Other techniques are used to focus on the copy itself, which allows them to deliver the lines efficiently and accurately.

Secondly, while experienced voice actors find the studio setting to be completely familiar and comfortable, it presents a host of distractions for those not used to it. The sound booth can be very cramped and is enclosed behind a thick window. The sound engineer controls communication, and the talent cannot move around while performing and must speak into a microphone at a particular angle in order to reduce extra mouth noise. It takes time to get used to this working environment, and even stage actors, who otherwise might have an advantage over less experienced candidates, can have a difficult time in what they see as an unnatural and restrictive environment.

Industry Experience – Prompt recording for telephone applications is in many ways different from most types of voice work. In particular, many of the phrases heard in these applications are built up by joining several prompts. Recording prompts in smaller fragments, as shown below, gives us a way to speak dynamic data such as dates and dollar amounts using prerecorded audio instead of text-to-speech. It also allows us to reuse common prompts in different parts of an application or across several applications that use the same voice talent. The example below shows how this is done to deliver the caller’s account status in a customer service application for a utilities company (individual prompts are enclosed in square brackets):

(1) [Your total account balance is] [two hundred] [fifty-five] [dollars and] [forty-two] [cents]. [This includes a past due amount of] [forty] [dollars (a)]. [Your next payment of] [eighty-three] [dollars (b)] [is due on] [April] [second].

In order to get this example to sound seamless and natural when it’s played to the customer over the telephone, these fragments must be carefully recorded with particular intonation contours and stress patterns. In fact, some phrases are recorded with two or three distinct prosodic templates in order to be usable in different contexts. For example, the first prompt for “dollars”, labeled (a), is used at the end of the phrase and requires a falling intonation. In contrast, its second use, labeled (b), falls in the middle of the phrase and therefore ends with a slightly rising intonation.
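As a rough illustration of how an application might assemble such a phrase, the Python sketch below turns a dollar amount into a sequence of prompt names and picks a different recording of “dollars” depending on whether the word ends the phrase. The prompt names (`dollars_final`, `dollars_medial`), the function names and the 0 to 999 range are assumptions made for this example, not part of any particular platform.

```python
# Hypothetical prompt names; a real application would map these to audio files.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_prompts(n):
    """Split a number from 0 to 999 into fragments such as 'two hundred', 'fifty-five'."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0 to 999")
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            word = TENS[tens]
            if ones:
                word += "-" + ONES[ones]
            parts.append(word)
    return parts


def amount_prompts(dollars, cents, phrase_final):
    """Return the prompt sequence for an amount.

    phrase_final selects the falling (end of phrase) or the slightly
    rising (middle of phrase) recording of 'dollars'.
    """
    prompts = number_prompts(dollars)
    if cents:
        prompts += ["dollars and"] + number_prompts(cents) + ["cents"]
    else:
        prompts.append("dollars_final" if phrase_final else "dollars_medial")
    return prompts


print(amount_prompts(40, 0, phrase_final=True))
print(amount_prompts(83, 0, phrase_final=False))
```

The point of the design is that the calling code, not the voice talent, decides which prosodic variant is played, so every variant has to exist in the recording script before the session.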

Recording prompts effectively with this degree of detail is much easier if the voice talent has experience with this type of material, since it takes practice and a fair amount of coaching to get it right. The problem is that only a small percentage of voice actors have that experience. The majority of work is in other areas, such as radio and TV ads, narration and voice-over. Therefore, even voice actors with a lot of studio experience cannot be expected to master prompt recording immediately.

Voice Quality – The most popular voice actors typically have resonant voices whose strong, full and smooth tone stands out from the rest. This quality, which can be empirically measured, is especially critical when it comes to telephone applications because only a portion of the quality that is originally recorded in the studio comes through over the phone. In more specific terms, prompts are typically recorded at a sampling rate of 44.1 kHz with 16 or 24 bits per sample, but audio on a telephone line is limited to a sampling rate of 8 kHz at 8 bits per sample. In short, if what you start with in the studio isn’t very robust, then what you end up with over the phone will most likely be unacceptable.
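To put rough numbers on that loss, the short Python sketch below compares the raw data rate and the highest representable frequency of a studio recording with those of telephone audio. It assumes the studio figure refers to the common 44.1 kHz sampling rate, that the telephone figure is an 8 kHz sampling rate at 8 bits per sample, and that both signals are mono; the function names are illustrative only.

```python
def bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


studio = bit_rate(44_100, 16)   # assumed studio master: 44.1 kHz, 16-bit, mono
phone = bit_rate(8_000, 8)      # assumed telephone audio: 8 kHz, 8-bit, mono

print(studio, phone, studio / phone)
print(nyquist_hz(44_100), nyquist_hz(8_000))
```

Under these assumptions the telephone channel carries roughly one eleventh of the studio data and nothing above 4 kHz, which is why a thin or weak voice in the studio tends to sound worse still over the phone.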

The Knack – Despite all the important criteria noted above, some voice actors just seem to “get it” when it comes to prompt recording for speech applications, even if they don’t have much experience in this particular area. If you find a person with both an intuitive grasp and experience, then you’ve really found a prize. There are many intangible qualities that contribute to this special knack, but the one that stands out the most is an ear for spoken discourse. That is to say, some voice talents know exactly how a prompt ought to sound with respect to stress, intonation and style even without direction. It’s similar to performers who do impressions of famous people. It’s not only the voice that they’re impersonating, it’s also the mannerisms, speech style, expressions, etc. And the performers can’t really explain exactly how they do it. It just comes to them naturally. The point is that, in some cases, a person’s individual talent will make him or her a quick study in this business, which can make up for some of the other important skills or experience that might be missing.

The Right Director
The director is responsible for the overall recording project, including both the creative and technical aspects. However, given that there is usually a sound engineer involved, the director focuses on prompt accuracy as well as style and persona. For this reason, the director must understand the context in which each prompt will be played, as discussed earlier. For example, the following prompt (2) has at least two readings, if not more, depending on the stress and intonation it is given.

(2) a. Is that the right one? (first time the question is asked)
    b. Is that the right one? (second time the question is asked)

Ideally, the director will be involved with the design process from the beginning, ensuring a good understanding of the prompts if any questions come up during the session. This points to a potential problem for application development teams that outsource their entire prompt recording to third party studios. After all, if the studio simply receives the recording script and some notes, the results will be less than adequate. One might argue that prompt recording scripts can include explanatory remarks to make the recording choice unambiguous. While this can help, it would take an unreasonable amount of time on the designer’s part to add enough detailed notes to adequately describe prompt requirements. Furthermore, even this cannot take the place of a director who’s familiar with the application. Such arrangements require close contact and communication from both sides.

Directors will use different techniques to get the talent to deliver the right results. Some talents do well when they are able to hear an example, in which case the director should be able to give one on the spot. Others need alternative methods and may take longer to get it right. In any case, the director must be able to motivate a talent to get results without being critical. Any degradation in the trust or communication between director and voice talent can easily ruin a session.

Finally, in addition to skill and experience, a good director will provide an organized recording script to work from. Although this may seem straightforward, it is often overlooked, and a poorly organized script can easily disrupt the session. A well-designed recording script will be clear and consistent in its presentation. It will have the prompt name, text, notes to help the director get the correct read and perhaps a place to tally the number of takes and record the best one. In the best-case scenario, the script will be generated directly from a prompt database where all the records are stored. Experienced voice talents are used to all kinds of scripts, from last-minute, handwritten disasters to color-coordinated masterpieces. The key is to limit distractions so talents can focus on what they do best.
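As a minimal sketch of what generating the script from stored records could look like, the Python example below formats a list of prompt records into a consistent layout with the prompt name, text, notes and a place to tally takes. The `PromptRecord` fields, the layout and the sample prompt names are assumptions for illustration, and an in-memory list stands in for the prompt database.

```python
from dataclasses import dataclass


@dataclass
class PromptRecord:
    """One row of a hypothetical prompt database."""
    name: str
    text: str
    notes: str = ""


def format_script(records):
    """Render prompt records as a plain-text recording script."""
    lines = []
    for number, rec in enumerate(records, start=1):
        lines.append(f"{number}. {rec.name}")
        lines.append(f"   Text:  {rec.text}")
        lines.append(f"   Notes: {rec.notes or '(none)'}")
        lines.append("   Takes: ____   Best take: ____")
        lines.append("")  # blank line between prompts
    return "\n".join(lines)


records = [
    PromptRecord("confirm_first", "Is that the right one?",
                 "first time the question is asked"),
    PromptRecord("confirm_second", "Is that the right one?",
                 "second time the question is asked"),
]
print(format_script(records))
```

Because every entry is produced by the same function, the layout stays identical from prompt to prompt, and the notes travel with the text instead of living in a separate document.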

Unfortunately, directors with a lot of experience in prompt recording of this kind are relatively hard to find. As with voice talents, there is little formal training for studio direction as it pertains to telephone applications. One can be sure, however, that this problem will begin to disappear as the industry grows.

The Right Audio Engineer
In addition to a good voice talent and director, a successful prompt recording session requires technical expertise in order to guarantee the best sound quality and the most appropriate use of time and finances. For this, a good sound engineer is invaluable.

There are several things the engineer can do in preparation for the session as well as during and afterward. If the session uses a voice talent who has recorded before, the engineer will find the previous recordings ahead of time and match the levels accordingly. This is especially important if the new prompts are going to be used with others that were previously recorded, for example, in a new feature of an existing application. Even if it’s a new talent, the engineer will spend some time checking levels and adjusting settings to fit the talent’s voice. Once the session has started, a good engineer will be able to anticipate the director’s needs and help with certain tasks. For example, if a certain take needs to be played back, the engineer will be able to locate and replay the prompt quickly. Additionally, the engineer can help listen for audio flaws, either in the talent’s performance or in the equipment, that will require a second take. Once the session is over, audio prompts need to be normalized, edited, and then converted to the proper format for telephony applications. The best engineers can do this both accurately and quickly, thereby enhancing the quality while saving both time and money.
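Two of these steps, matching the level of earlier recordings and normalizing the finished prompts, come down to simple gain arithmetic. The Python sketch below shows one possible interpretation: RMS level matching against a reference and peak normalization to a target. It assumes samples are floats between -1.0 and 1.0, and the function names and the 0.9 target peak are choices made for this example; studio tools may use other measures, such as perceived loudness.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_level(samples, reference_rms):
    """Scale samples so their RMS level equals that of an earlier recording."""
    current = rms(samples)
    if current == 0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = reference_rms / current
    return [s * gain for s in samples]


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


new_take = [0.1, -0.5, 0.25]
old_take = [0.2, -0.4, 0.3]
matched = match_level(new_take, rms(old_take))
print(rms(matched), rms(old_take))
print(peak_normalize(new_take))
```

Applying the same gain to every sample changes only the loudness, not the character of the voice, which is why level problems are cheap to fix after the session while performance problems are not.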

Conclusion
Prompt recording is an integral part of the speech application development process. As with anything, the best results come from using the best talent and techniques available. In addition, the more communication there is between designers, directors, voice talents and audio engineers, the better the results will be. We’ve covered many different criteria that can be used to gauge the effectiveness of these roles. However, when an experienced individual cannot be found, it’s important to keep our eyes open for professionals in related fields whose raw talent may just be enough to effectively do the job.

Dr. Bill Byrne is manager of the Voice Center at SAP Labs and consulting assistant professor of Symbolic Systems at Stanford University. He can be reached at william.byrne@sap.com.
