AI Rapidly Automating Audio Content Generation
Artificial intelligence is stepping further into audio content generation and dramatically altering how such content is developed. Established vendors like Google, Meta, and Microsoft, as well as startups such as Revoicer and WellSaid, are leveraging generative AI to deliver more flexible, capable, and realistic audio content. These solutions help companies, content creators, podcasters, and entrepreneurs create audio for a growing array of applications. The advances do come with a few caveats, such as a need for specialized skills, data privacy concerns, and high costs, but the market is poised to grow significantly in the next few years.
Nowadays, content is king, and audio content has become a popular tool that businesses use to connect with customers, partners, and suppliers. Traditionally, creating such material was a time-consuming, manually intensive, expensive process.
Next-generation generative AI solutions simplify audio data collection, generation, and distribution. Consequently, they have a bright future: the global AI voice generator market reached $3.6 billion in 2023 and is expected to rise to $10.6 billion by 2032, a compound annual growth rate (CAGR) of about 12.7 percent, according to Zion Market Research.
Fellow research firm Market.us has released slightly lower numbers, but its broader industry forecast is essentially the same: robust growth.
The AI voice generator market is experiencing rapid growth, driven by technological innovations, particularly in deep learning and natural language processing, that have significantly improved the quality and accuracy of AI voice generation, the firm notes in its latest report.
These innovations are likely to expand the potential use cases and increase demand across sectors such as entertainment, healthcare, and education, it adds.
And “the opportunities for further advancements and applications are vast, ensuring a positive market outlook,” Market.us analysts conclude in the report. In particular, “integrating AI voice generators with augmented reality, virtual reality, and the Internet of Things opens up new avenues for growth.”
Already, though, a growing number of applications leverage the technology. Here are a few examples.
- Text-to-speech. Advanced TTS systems use AI to convert written text into spoken audio. The tools are becoming more sophisticated and can generate natural-sounding voices with human-like intonation. Increasingly, they can evoke different emotions. These applications are found in contact centers, websites, and a growing number of intelligent devices, such as smartphones.
- Voiceover automation. Many organizations have relied on professional studios to produce high-quality audio content. The new systems automate a larger portion of the process. Companies use the verbal content in advertisements, tutorials, and sales collateral, and AI is becoming powerful enough to generate entire synthetic podcasts.
- Speech synthesis. As organizations build large data models, companies are using speech synthesis to voice virtual assistants, produce marketing content, and reproduce wording and dialects from around the globe.
- Audio editing and enhancement. Automation is a major attraction with the new tools. AI solutions automate sound mixing by performing tasks like reducing background noise and even creating dynamic, adaptive soundtracks. (A brief sketch of automated noise reduction follows this list.)
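To make that last item concrete, here is a minimal sketch of AI-assisted noise reduction built on the open-source noisereduce Python library; the file names are placeholders, and commercial tools layer far more processing on top of this basic step.

```python
# Minimal noise-reduction sketch using the open-source noisereduce library
# (pip install noisereduce soundfile); the file names are placeholders.
import noisereduce as nr
import soundfile as sf

# Load a noisy mono recording: audio is a NumPy array, rate its sample rate.
audio, rate = sf.read("raw_voiceover.wav")

# Estimate the noise profile from the clip itself and suppress it.
cleaned = nr.reduce_noise(y=audio, sr=rate)

# Save the enhanced track.
sf.write("clean_voiceover.wav", cleaned, rate)
```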
Generative AI-powered audio content creation tools have the potential to enhance performance in many ways. They include the following:
- Improved audio quality. AI models analyze audio recordings and remove unwanted gaps and noises, resulting in better-sounding audio content.
- Time savings. AI generates audio content rapidly, significantly faster than manual processes. “AI voice generators allow for the rapid creation of audio content, which can be particularly beneficial for time-sensitive projects or campaigns,” says Jack Stratford, a customer support agent at Revoicer. Another benefit is that the tools enable organizations to create much larger volumes of content than they could in the past.
- Reduced expenses. The technology is popular in part because machines cost less than humans. Automating audio content creation lowers labor costs and streamlines production. The new products reduce the need for human voice actors, sound engineers, and studio time, changing audio cost metrics dramatically and opening the market to smaller organizations. Companies no longer need expensive audio studios or have to haul equipment from place to place. One ripple effect: the pool of potential creators grows because the infrastructure needed to produce quality audio becomes more accessible and affordable.
- Improved consistency. Human beings make mistakes. AI removes emotion, fatigue, and mood changes from the production process. Consequently, these solutions produce audio with consistent quality, tone, and style, which helps to improve brand perception.
- Productivity boosts. Content creators can spend more time developing quality content and less time fine-tuning audio production equipment.
- More personalized content. The automation capabilities make it simpler for organizations to tailor content to individual preferences. They can adjust volume for different demographics, lowering it for younger listeners and raising it for older ones. They can also change inflections to reach individuals who speak with distinct dialects or accents in different parts of the world. Content can also be customized with industry-specific jargon, slang, colloquialisms, and more. (A small markup sketch of this kind of tailoring follows this list.)
- Wider accessibility to content. Many individuals have disabilities that make it difficult for them to work with various types of media. These products convert written material into speech, making it accessible to visually impaired individuals and those who have trouble reading. One interesting use case centers on individuals with amyotrophic lateral sclerosis, better known as Lou Gehrig’s disease. “Companies train AI models with a person’s voice when they realize they have ALS,” explained Brian Cook, CEO of WellSaid. “When they lose the ability to speak, the system sounds like them talking.”
- Multilingual output. The world is becoming smaller, so companies want to create content that can be distributed in more locations. These solutions enable them to translate information from one language to another quickly and easily.
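As a rough illustration of the personalization item above, the snippet below builds standard SSML (Speech Synthesis Markup Language), which the major TTS engines accept as input; the helper function and the specific prosody values are illustrative assumptions, not any particular vendor's API.

```python
# Sketch: tailoring delivery with standard SSML prosody controls.
# Most major TTS engines accept SSML; the rate, pitch, and volume
# values here are illustrative only.
def build_ssml(text: str, rate: str = "medium",
               pitch: str = "default", volume: str = "medium") -> str:
    """Wrap plain text in SSML with the requested prosody settings."""
    return (
        "<speak>"
        f'<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">'
        f"{text}"
        "</prosody>"
        "</speak>"
    )

# Softer, slower delivery for one audience segment...
print(build_ssml("Welcome back to the show.", rate="slow", volume="soft"))
# ...and a brighter, faster read for another.
print(build_ssml("Welcome back to the show!", rate="fast", pitch="+10%"))
```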
The Breadth of AI Tools Expands
For all its potential, the market is still in the earliest stages of development. Nonetheless, a number of companies, both startups and established players, have been pushing the boundaries of what is possible. Among the most active are IBM, Google, Amazon Web Services, Microsoft, Baidu, Samsung, Synthesio, Speechify, Speechelo, Wondercraft AI, ElevenLabs, OpenAI, Cerence, WellSaid Labs, CereProc (recently acquired by Capacity), Listnr AI, and Respeecher.
Google, for example, has pioneered advancements in AI-driven voice synthesis, particularly through its Google Cloud Text-to-Speech and Google Assistant. Recent updates to its Google Cloud Text-to-Speech API allow developers to build more lifelike and expressive voices into their applications. The API now offers more than 220 voices across more than 40 languages. Its deep learning technology powers a variety of devices and applications, further expanding AI voice technology in both consumer and enterprise sectors. And its AudioPaLM model combines audio generation models with language models to assist with speech recognition and speech-to-speech translation. It can be fine-tuned to consume and produce tokenized audio as needed and translate the content into different languages.
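For a sense of how developers call the API, here is a minimal sketch using the google-cloud-texttospeech Python client; the voice name is one example from Google's catalog, and authentication setup is omitted.

```python
# Minimal sketch of a Google Cloud Text-to-Speech call
# (pip install google-cloud-texttospeech; credentials setup omitted).
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Welcome to our podcast."),
    # One of the 220-plus voices; swap in any name from the catalog.
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Neural2-C"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

# Write the returned audio bytes to an MP3 file.
with open("welcome.mp3", "wb") as out:
    out.write(response.audio_content)
```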
Amazon continues to dominate the smart speaker industry with Alexa and its Amazon Polly service on AWS, which helps companies integrate voice capabilities into their applications and devices. Its most recent advances have brought more sophisticated AI-driven conversational abilities to Alexa, allowing it to generate more dynamic and context-aware responses.
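A similarly minimal sketch of a Polly request through the boto3 SDK; the voice choice and file name are illustrative, and AWS credentials are assumed to be configured.

```python
# Minimal sketch of an Amazon Polly request via boto3
# (pip install boto3; AWS credentials assumed to be configured).
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Thanks for calling. How can I help you today?",
    OutputFormat="mp3",
    VoiceId="Joanna",   # one of Polly's stock voices
    Engine="neural",    # higher-quality neural engine
)

# The audio arrives as a stream; save it to disk.
with open("greeting.mp3", "wb") as out:
    out.write(response["AudioStream"].read())
```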
Microsoft, through its Azure AI Speech platform, has also made significant strides with AI, offering high-quality speech-to-text and text-to-speech solutions for a variety of industries, including healthcare, retail, and customer service.
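A comparable sketch with the Azure AI Speech SDK; the key, region, and voice name are placeholders.

```python
# Minimal sketch using the Azure AI Speech SDK
# (pip install azure-cognitiveservices-speech; key and region are placeholders).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_KEY", region="YOUR_REGION"
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Route the synthesized speech to a WAV file.
audio_config = speechsdk.audio.AudioOutputConfig(filename="note.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

# Synthesize a short line of text to note.wav.
result = synthesizer.speak_text_async("Your order has shipped.").get()
```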
IBM, meanwhile, focuses on integrating AI voice technology into enterprise solutions through IBM Watson Text-to-Speech, which enables businesses to create custom, scalable voice applications.
And though not necessarily considered a speech technology powerhouse, Meta Platforms, the parent company of Facebook, Instagram, WhatsApp, and several other social media and communications apps, has also been active in voice AI development. Its Voicebox generative AI model specializes in creating audio from existing clips. The software also includes audio editing, sampling, and stylizing features and performs tasks like removing background noises, which improves audio quality.
A second Meta solution, Audiobox, generates voices and sound effects from voice input and natural language text prompts: users simply describe the sound or type of audio they want.
Some other lesser-known players have also made a splash in voice AI. TikTok parent company ByteDance developed Make-An-Audio, which can generate personalized audio snippets from natural language inputs and existing audio. Murf.ai provides text-to-audio tools for corporate and entertainment purposes like advertisements, educational lessons, and presentations. WellSaid Labs offers a studio platform that allows users to craft and curate custom voices for specific use cases. ElevenLabs' solution is used to voice audiobooks and news articles, animate video game characters, support film pre-production, localize entertainment media, create dynamic audio content for social media and advertising, and train medical professionals in up to 32 languages. And Revoicer focuses on AI-generated voiceovers, having already created roughly 100,000 voiceovers and 1 million minutes of audio.
A Slew of Challenges Arise
While interest in these products is growing, corporations must clear several noteworthy hurdles to deploy them. These include the following:
- Technical immaturity. These solutions are new and therefore often push organizations out of their comfort zones because they have little to no experience with them. “The next 12 months will be spent figuring out how to bulletproof specific use cases and applications, plus prepping both contact center administrators and IT professionals to support approaches to digital customer care that include voice AI,” explains Dan Miller, founder of Opus Research. Because staff lacks expertise, enterprises need to turn to their suppliers or third-party specialists for help.
- Dependence on data. Training AI audio generation models is laborious. Massive volumes of audio are needed to make sure that the model understands and accommodates the many nuances found in human speech.
- Significant infrastructure investments. “Training these models can be quite complex and resource-intensive,” states Revoicer’s Stratford. “It requires specialized hardware and software, and it can take a lot of time and effort to get the models to perform well.”
Use cases also impact how much processing power is needed to deliver quality results. “Suppliers face a constant balance between speed and quality,” explains WellSaid’s Cook. “High quality requires a lot of processing and compute power. Delivering a fast, high-quality call center IVR response becomes challenging.”
- Technical limitations. These interactions take place between a machine and a person. Consequently, suppliers must deal with issues that arise when people pause to gather their thoughts or systems encounter latency as queries are sent to the cloud for processing, according to Miller.
- Good but not perfect. Collecting enough high-quality data to build models represents a significant investment. Then, corporations must constantly prune the model to improve its accuracy. If the data is biased, outdated, or insufficient, the results become flawed. The reality is that systems never reach 100 percent accuracy. So what is good enough? Eighty-five percent? Ninety percent? Ninety-five percent? Making the business case to justify the substantial investments needed to increase the accuracy numbers is a matter with which management constantly grapples.
- A lack of emotion. Traditionally, the solutions often sounded like machines—awkward and artificial—making them less engaging. Improvements have occurred, but the systems can have difficulty understanding and responding to complex intonations, like humor and anger.
- Ethical considerations. AI generation technology finds itself in the middle of ethical debates. Questions arise about the process of collecting data and building models. Also, these systems’ ability to mimic individual voices without a person’s consent raises questions about proper usage.
- The data collection conundrum. Data models often rely on individual interactions, a process that raises questions about data ownership. The challenges start with consent. Users are not always fully aware of how their interactions, both verbal and textual, are stored and used. Suppliers often outline their intentions in complex legal documents that individuals sign when they access the system, but the wording can be hard to decipher and the implications unclear. Governments, notably in the European Union, have been crafting laws designed to add more transparency to the process. In addition, suppliers are coming out with new usage models; in some cases, they share revenue with participants whose input builds their data models.
- Unintended monitoring. Devices that use voice AI often listen for trigger words to wake themselves up and provide customers with needed information. Sometimes, the customers do not know that the system is on and data is being collected even though they have not explicitly authorized such a process.
- Inherent bias. Ultimately, human beings write the code used by the data models and AI content generation solutions. Every individual is a product of his or her environment and carries preconceived notions about the world, which can be reflected in the solutions they create. The industry has been trying to identify and eliminate bias, but again, the foundations for these systems are constructed by humans who are imperfect.
- Copyright infringement. In many cases, individuals input information that does not legally belong to them into data models, leading to potential abuse or misuse of that data. As a result, copyright and ownership questions arise. This area has been evolving quickly and remains difficult to navigate.
- Voice cloning misuse. Voice cloning technology has become quite sophisticated, allowing content creators to mimic individual voices and then use text-to-speech to generate virtual audio content. Potential misuses include fraud, spreading misinformation, and market manipulation.
- A lack of trust. The ability to create realistic audio deepfakes can lead to a general skepticism about the authenticity of audio content, making it harder for people to trust what they hear. Companies that deploy the technology could find that customers reject rather than embrace the new solutions.
How to Pick the Right Use Case
Given the myriad capabilities and open deployment questions, companies struggle to determine where to deploy AI content generation. These products seem best suited to routine, high-volume, or personalization-heavy tasks, like voice assistants and transcription services. For now, the tools might not fit well with applications requiring deep emotion or complex audio delivery.
Despite the limitations, adoption has been moving at a rapid pace. Most companies, 74 percent, now use AI to generate content, according to eMarketer. That number is expected to grow: “The use of AI to generate content is inevitable,” concludes Opus Research’s Miller.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or on Twitter @PaulKorzeniowski.
5 Companies That Matter
- ElevenLabs. A startup that specializes in developing AI-powered natural-sounding speech synthesis software using deep learning.
- Meta Platforms. The parent company of Facebook, Instagram, Threads, and WhatsApp and developer of a number of generative AI models.
- OpenAI. The AI research organization largely credited with fathering the generative AI sector with its launch of ChatGPT.
- Speechify. Makers of an AI-powered text-to-speech voiceover generator tool.
- Wondercraft AI. Makers of free AI audio editing software for podcasts, ads, audiobooks, narrations, and more.