The Rise (and Risks) of Speech Synthesis Applications
Text-to-speech (TTS) is a decades-old field, but adoption was long limited to a few areas because synthetic voices sounded unnatural and robotic. In the past five years, however, deep learning has made synthetic voices (a.k.a. neural TTS) sound far more natural and pleasant. Pitch, pace, pronunciation, accent, emotion, and speaking style can all be tuned as needed. Large cloud vendors such as Amazon, Google, IBM, and Microsoft offer APIs that let developers easily add speech capabilities to a variety of applications. Beyond the big vendors, a number of innovative startups and specialists are imagining new possibilities with synthetic speech.
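To give a sense of how accessible these APIs have become, here is a minimal sketch that calls Amazon Polly through the boto3 SDK. It assumes AWS credentials are already configured; the voice name and sample text are placeholders, and the other vendors' APIs follow a broadly similar pattern.

```python
import boto3

# Minimal sketch: synthesize a short utterance with Amazon Polly via boto3.
# Assumes AWS credentials and a default region are already configured.
polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Your order has shipped and should arrive on Thursday.",
    VoiceId="Joanna",     # one of Polly's stock voices; pick any voice available to you
    Engine="neural",      # request the neural TTS engine for more natural-sounding speech
    OutputFormat="mp3",
)

# The audio comes back as a stream; write it to a file.
with open("greeting.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```

Attributes such as pitch, pace, and pronunciation are typically tuned by wrapping the input text in SSML markup rather than by changing the code itself.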
These use cases fall into two broad categories.
Read-Out-Loud Use Cases
Companies can use stock voices or create customized synthetic voices (including celebrity voices) for these use cases.
Customer service. Interactive voice response (IVR) is one of the oldest use cases, but conversational paths and responses had to be carefully pre-recorded and fully scripted. Open-ended conversations were limited because it's not possible to pre-record every potential response. Now, by training AI on a limited corpus of pre-recorded audio, a synthetic voice can be created and used in open-ended conversational applications.
News reading. Many publications (for example, The Washington Post, BBC, The Wall Street Journal) use TTS so readers can listen to articles. Some media sites offer “listen to stories” as a premium feature for paid subscribers.
Emails. TTS can read your email aloud (for example, in Microsoft Outlook), enabling a hands-free experience when, say, you're driving.
Assistive technologies. Voice banking can help people with motor neuron diseases generate their own synthetic voice for use on assistive speech devices. Some apps enable users with speech difficulties to speak via TTS interface devices. To help users with vision challenges, there are apps that read out prescription labels and product tags, as well as apps that provide audio cues and descriptions of a user's surroundings.
Rich Content Use Cases
This set of use cases often involves both audio and video content.
Dubbing and voiceovers. Dubbing and voiceovers for videos are not new, but the rise of streaming platforms such as Netflix has created a global audience and new demand for dubbing content into multiple languages. With a mix of speech recognition, machine translation, and synthetic voices, audio can now be dubbed into different languages in the original actors' voices. Lip-syncing used to be an issue with dubbed content, but AI now helps create synthetic lip movements that match the spoken words.
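As a rough illustration of that pipeline, the sketch below chains off-the-shelf components: OpenAI's Whisper for speech recognition, a Hugging Face translation model, and Amazon Polly for synthesis. It is an assumption-laden outline rather than a production dubbing system; in particular, it substitutes a stock French voice for the voice-cloning step a real workflow would use to preserve the original actor's voice, and it ignores timing and lip-sync entirely.

```python
import boto3
import whisper                      # pip install openai-whisper
from transformers import pipeline   # pip install transformers

# 1. Speech recognition: transcribe the original English audio.
asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("scene_english.wav")["text"]

# 2. Machine translation: English to French.
translator = pipeline("translation_en_to_fr")
french_text = translator(transcript)[0]["translation_text"]

# 3. Speech synthesis: render the French line with a stock voice.
#    (A real dubbing workflow would use a cloned voice of the original actor
#    and align the timing with the on-screen performance.)
polly = boto3.client("polly")
audio = polly.synthesize_speech(
    Text=french_text,
    VoiceId="Lea",        # a stock French voice; stand-in for a cloned voice
    Engine="neural",
    OutputFormat="mp3",
)
with open("scene_french.mp3", "wb") as f:
    f.write(audio["AudioStream"].read())
```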
Audio editing. This is an innovative use case that helps reduce the barriers to audio editing. Using an auto-generated transcript or text, you can remove filler words, add new audio, or remove snippets, all by modifying the corresponding text. This has the potential to reduce the cost and time of editing considerably.
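A toy version of the idea, assuming you already have a transcript with word-level timestamps (which most speech-to-text services can produce), might look like the following. The transcript data and file names are made up, and real editing products handle crossfades, overdubs, and much more.

```python
from pydub import AudioSegment   # pip install pydub (requires ffmpeg)

# Hypothetical transcript: each word carries start/end offsets in milliseconds,
# as returned by a typical speech-to-text service.
transcript = [
    {"word": "So",        "start": 0,    "end": 180},
    {"word": "um",        "start": 180,  "end": 430},   # filler deleted in the text editor
    {"word": "the",       "start": 430,  "end": 560},
    {"word": "quarterly", "start": 560,  "end": 1100},
    {"word": "numbers",   "start": 1100, "end": 1620},
    {"word": "look",      "start": 1620, "end": 1850},
    {"word": "strong",    "start": 1850, "end": 2400},
]

FILLER_WORDS = {"um", "uh"}

audio = AudioSegment.from_file("raw_take.wav")
edited = AudioSegment.empty()

# Keep only the audio spans whose words survive the text edit.
for item in transcript:
    if item["word"].lower() not in FILLER_WORDS:
        edited += audio[item["start"]:item["end"]]

edited.export("clean_take.wav", format="wav")
```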
Online and metaverse safety. Using AI, voices can be transformed while retaining their emotional and expressive qualities. Just as gamers apply visual skins to their avatars, a voice skin can be applied to protect privacy and reduce harassment in gaming environments or audio-based social media (for example, Twitter Spaces or Clubhouse).
Ethical Concerns and Risks
Alas, as legitimate use cases increase, so does the potential for misuse and fraud.
User consent to use synthetic voices. In a documentary about the late celebrity chef Anthony Bourdain, a synthetic version of his voice was used to make him "speak" a few lines that he never actually said. Such examples raise the question of consent: what is permissible and what is not?
Deepfakes. It's not hard to imagine how we might be flooded with sophisticated deepfakes of public figures (and even private citizens) as synthetic audio and video capabilities get even better. The result could be a misinformation minefield, with consequences for public trust and the reliability of information sources.
Voice phishing and fraud. The Wall Street Journal reported that a CEO's voice was spoofed to make phone calls to his colleagues with instructions to transfer funds. This is an entirely new category of cybercrime, enabled by a combination of synthetic voices and social engineering.
Companies will need to be adept at deploying synthetic voices responsibly to improve user experience, deliver better customer service, and create new products and services. But they'll also need to guard against adversarial attacks by malicious actors. It's a brave new world for speech applications.
Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and co-author of Practical Artificial Intelligence: An Enterprise Playbook.