AI Teaches Avatars How to Talk
Avatars have been a popular way for individuals and companies to express themselves digitally for more than 30 years. Technology upgrades over the years have taken this virtual content from static to dynamic, with speaking avatars becoming more common.
But for all the innovation, a long-standing challenge has been coordinating avatars’ words and body movements. New artificial intelligence tools are emerging that simplify such work, but integrating them smoothly into business workflows could be just as big a challenge.
Avatars are virtual representations of human beings. They have been quite popular in the consumer market. Their lifelike nature creates appealing video game characters, and individuals have been replacing their profile photos with anime-style cartoon and magic avatars.
Recently, avatars have begun carving out growing niches in the business market. Companies use them in their customer service systems, onboarding documentation, marketing promotions, and training applications.
Initially, the depictions were based on real people, with individual selfies a popular design foundation. As technology has advanced, avatar construction has evolved: AI avatar generators can now create the virtual images from scratch and deliver unique depictions.
With other advances, pictures are becoming richer and more interactive, serving as foundational parts of cartoonlike and augmented reality/virtual reality presentations. In sum, the personas are evolving from static, mute images to interactive, talking characters.
Avatar Speech Building Blocks
Adding speech to avatars is a complex process that marries three separate development activities: creating human facial movements, generating spoken words, and combining the two. Building such software has historically been exceedingly difficult, in part because speech application development tools have shortcomings. The software needed to build avatars often offered limited functionality and was hard to manipulate, and the sector lacked the low-code/no-code capabilities that are becoming increasingly common in other application development platforms. The few programming tools that did exist demanded highly specialized skills. Even experienced app developers found the creation process manually intensive, time-consuming, and frustrating. The end result? Developers often could produce only static characters and simple movements, content that was not very engaging.
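Conceptually, the three activities chain into a pipeline: generate the audio, render the face, then align the two. The following is a minimal sketch of that flow in Python; the helper names (synthesize_speech, render_face_frames, align_lip_movements) are hypothetical stand-ins, not any vendor’s actual API.

```python
# Hypothetical three-stage avatar speech pipeline: speech generation,
# facial animation, and lip sync. All helpers are illustrative stubs,
# not a real vendor API.
from dataclasses import dataclass

@dataclass
class AudioTrack:
    samples: list       # raw waveform samples
    sample_rate: int    # samples per second

@dataclass
class FaceFrames:
    frames: list        # one rendered face image per video frame
    fps: int            # video frame rate

def synthesize_speech(script: str) -> AudioTrack:
    """Stage 1: turn the script into spoken audio (stubbed)."""
    return AudioTrack(samples=[], sample_rate=16_000)

def render_face_frames(avatar_id: str, duration_s: float, fps: int = 30) -> FaceFrames:
    """Stage 2: render the avatar's face for the clip's duration (stubbed)."""
    return FaceFrames(frames=[None] * int(duration_s * fps), fps=fps)

def align_lip_movements(audio: AudioTrack, face: FaceFrames) -> FaceFrames:
    """Stage 3: adjust mouth shapes frame by frame to match the audio (stubbed)."""
    return face

def build_talking_avatar(avatar_id: str, script: str) -> FaceFrames:
    audio = synthesize_speech(script)
    # Fall back to a fixed duration because the stub returns no samples.
    duration_s = len(audio.samples) / audio.sample_rate if audio.samples else 2.0
    face = render_face_frames(avatar_id, duration_s)
    return align_lip_movements(audio, face)

clip = build_talking_avatar("demo-avatar", "Welcome to our onboarding portal.")
print(f"Rendered {len(clip.frames)} frames at {clip.fps} fps")
```

The point of the sketch is the hand-off: the lip sync stage consumes both the audio track and the rendered frames, which is why all three classes of tools have to improve in tandem.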
In response, vendors have been creating new AI-based solutions built on a more modern foundation, leveraging the dramatic capability gains evident in generative AI systems like ChatGPT.
Consequently, the industry stands at a crossroads, one where new personalized, talking, dynamic avatars are emerging. To create these animations, vendors needed to improve the three types of development tools.
Computer vision systems became more adept at tracking facial features. Increasingly, they create lifelike replicas in a growing variety of formats, including 2-D and 3-D animation and AR/VR.
Speech generation solutions have also become richer, evolving from simple answers and robotic responses to more lifelike interactions.
Lip sync technology is becoming more granular, delivering increasingly precise lower-face, facial-hair, and neck movements.
Because of the changes, the quality of the virtual depictions has been improving. The latest generation of animated avatars is more lifelike and provides a more immersive alternative to the static representations of yesteryear.
New AI Lip Sync Tools Bring Advantages
These new AI solutions have the potential to deliver a wide and growing range of business benefits. They can do the following:
- Streamline development. Developers want to spend their time enhancing their avatars’ appearance, speech, and delivery. They don’t want to fret about how to tie different software components together. Luckily, AI tools can automate a growing number of mundane software infrastructure programming tasks.
- Increase flexibility. AI tools now offer options that let programmers adjust lip movements, fine-tune synchronization, review synced avatars, and make adjustments automatically when necessary. The tools also support a wider range of characters, allowing users to seamlessly customize clothing, head movements, gestures, and more.
- Enhance quality. New tools allow users to upscale their work to support rich media and create higher-quality, higher-resolution avatars. They also offer a range of styles, from anime to lifelike, making it simpler to create unique depictions.
- Support multiple languages. AI-powered tools translate and lip-sync content into multiple languages, dialects, and accents, enabling localization for global audiences (see the sketch after this list).
- Improve marketing. Maintaining an original speaker’s appearance helps companies connect with local audiences. Sounding authentic creates a tight bond. Multilingual support expands the number of potential customers.
- Boost productivity. AI reduces the time and cost associated with traditional development methods. AI lip sync tools facilitate seamless dialogue replacement and editing, making content updates and multiple versions easier to deliver. They also provide real-time previews, enhancing quality and shrinking testing time.
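As one illustration of the localization workflow mentioned above, the sketch below fans a single source clip out into several language versions. The translate_script and resync_lips helpers are hypothetical placeholders; commercial tools wrap these steps behind their own interfaces.

```python
# Hypothetical localization fan-out: one source clip, many language versions.
# translate_script() and resync_lips() are illustrative stubs, not a real API.

def translate_script(script: str, language: str) -> str:
    """Translate the narration into the target language (stubbed)."""
    return f"[{language}] {script}"

def resync_lips(video_path: str, translated_script: str, language: str) -> str:
    """Regenerate the audio track and lip movements for the new script (stubbed)."""
    base = video_path.rsplit(".", 1)[0]
    return f"{base}_{language}.mp4"

source_video = "product_demo.mp4"
script = "Thanks for watching our product demo."

# Each pass yields a version whose mouth movements match the new language,
# rather than simply dubbing foreign audio over the original footage.
for language in ("es", "de", "ja"):
    new_script = translate_script(script, language)
    localized = resync_lips(source_video, new_script, language)
    print("created", localized)
```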
These new AI-based tools also automate more of the lip sync process. They enable programmers to capture wording and tone more accurately and deliver what they envision more easily.
Because of the improvements, demand for such tools has been growing. Research firm Global Market Insights valued the AI avatar market at $5.9 billion in 2023 and expects it to reach $57.9 billion by 2032, growing at a compound annual rate of 30 percent.
And though the gaming and entertainment sector is projected to hold a significant share of the market, business uses are growing. Virtual agents and assistants make up 33.5 percent of all avatars, according to GMI, which noted that they are becoming very popular in the retail and e-commerce, financial services, telecommunications, healthcare, education, and automotive sectors.
AI Lip Sync Tool Limitations
The new tools are very helpful, but deployment impediments remain. These include the following:
Data collection. AI relies on very large volumes of voice and video data to train the models that synchronize lip and body movements. Human speech is complex. “You need a massive amount of audio to train AI data models to understand the nuances of human speech,” explains Jack Stratford, a customer support agent at Revoicer. The same holds for video and for lip-synching itself. So one impediment is identifying where the data will come from and putting processes in place to collect it. In addition, companies need to safeguard the information.
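A small piece of that collection process can be automated with ordinary scripting. The sketch below, which uses only Python’s standard library, validates a hypothetical CSV manifest of training clips, checking that each entry has its paired audio and video files on disk and a recorded consent flag. The manifest name and columns are illustrative assumptions, not a standard format.

```python
# Sketch of a training-data intake check, assuming a hypothetical CSV manifest
# with columns: clip_id, audio_path, video_path, consent. Standard library only.
import csv
import os

def check_manifest(manifest_path: str) -> list:
    """Return a list of problems found in the collected clips."""
    problems = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            clip = row["clip_id"]
            if not os.path.exists(row["audio_path"]):
                problems.append(f"{clip}: missing audio file")
            if not os.path.exists(row["video_path"]):
                problems.append(f"{clip}: missing video file")
            # Consent tracking matters because the clips contain
            # identifiable voices and faces.
            if row.get("consent", "").lower() != "yes":
                problems.append(f"{clip}: no recorded consent")
    return problems

if __name__ == "__main__":
    for issue in check_manifest("training_clips.csv"):
        print(issue)
```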
Infrastructure investments. AI avatar software requires special-purpose processors and systems, and garnering the required computing resources is difficult: the solutions are expensive and processing- and resource-intensive. Compounding the issue, very few IT technicians are familiar with these systems, so competition for such individuals is intense, again raising costs.
Audio quality issues. AI tools build data models with voice samples collected in a wide range of ways. Each item is fed into a common data model, like adding pieces to a puzzle. Quality input leads to quality output, so high-quality audio is essential for accurate lip syncing. Companies, therefore, should use high-quality microphones for recording. Furthermore, voice sample providers need to enunciate clearly to help the AI map lip movements accurately, and they should reduce background noise in the audio. Increasingly, echo-canceling features are being integrated into the AI systems.
Video issues. High-quality video is equally crucial. Users need to ensure that the subject and background are well lit, and higher-definition video improves lip movement precision.
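Both of those input-quality concerns lend themselves to automated preflight checks. Here is a minimal sketch using NumPy and OpenCV that rejects near-silent audio and low-resolution or dim video before they reach a lip sync model. The thresholds and file names are illustrative assumptions, and the audio check assumes 16-bit PCM WAV input.

```python
# Preflight checks for lip sync source material, using NumPy and OpenCV.
# Thresholds below are illustrative assumptions, not vendor requirements.
import wave
import numpy as np
import cv2

def audio_loud_enough(path: str, min_rms: float = 500.0) -> bool:
    """Reject near-silent recordings, which give the model little to map.

    Assumes 16-bit PCM WAV input.
    """
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64)
    rms = np.sqrt(np.mean(samples ** 2)) if samples.size else 0.0
    return rms >= min_rms

def video_quality_ok(path: str, min_width: int = 1280,
                     min_brightness: float = 60.0) -> bool:
    """Check resolution and rough lighting on the first frame."""
    cap = cv2.VideoCapture(path)
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    ok, frame = cap.read()
    cap.release()
    if not ok or width < min_width:
        return False
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return float(gray.mean()) >= min_brightness  # 0-255 scale; dim footage fails

print("audio ok:", audio_loud_enough("voice_sample.wav"))
print("video ok:", video_quality_ok("avatar_take.mp4"))
```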
Situational awareness. AI solutions struggle with visual obstructions, such as a person talking while covering their mouth with a hand or holding a phone. These scenarios lead to misaligned lip movements.
Not seeing the whole picture. Lip sync’s goal is to have the audio correctly correspond to mouth and body movements. However, the technology’s increasing flexibility creates new barriers. With face swapping capabilities, one face is superimposed onto another body. In those cases, developers need to be cognizant of not only the voice and body movements but also background resolution and color.
Robotic responses. No matter how lifelike they might be, renderings are still machine-made products. An over-reliance on technology eliminates human creativity and the personal touch found with individually designed avatars.
The need for constant tuning. AI creates probabilistic rather than deterministic models. No matter how much information is available and how strong the algorithms are, the system will never be totally flawless. Consequently, companies must constantly update their data models, work that can become tedious, time-consuming, and costly.
Ethical Considerations
Vendors create the development tools, but they have little to no control over who uses them and how. Consequently, legal and ethical issues have been arising. A big one is the unauthorized use of the likenesses of real people. The new solutions enable third parties to replicate any individual’s voice and facial movements without consent and use them in nefarious ways. These deepfakes might exploit the images for false advertising, making it seem like a well-known person endorses a product, or for spreading misinformation. Nowadays, news travels fast, and false narratives can take flight quickly, particularly as it becomes harder to distinguish the real from the simulated.
There’s also the ever-present risk of copyright infringement. Voice cloning could expose companies to legal ramifications for inappropriate use of intellectual property. Businesses have to be sure that they use AI lip sync responsibly, respecting content ownership and ensuring transparency.
Security is a significant consideration because AI relies on personal information, according to Dan Miller, founder of Opus Research. Realistic replicas can trick security systems that rely on biometric verification. Once in, the bad guys can impersonate individuals and carry out identity theft, financial fraud, and ransomware schemes.
Equally damaging is the erosion of trust. The ability to create realistic audio deepfakes can lead to a general skepticism about the authenticity of audio content, making it harder for people to trust what they see and hear.
And finally, these technologies are likely to displace some developers. Automated systems could replace the individuals involved in creating avatars, not only the artists but also people who ushered them through the creative process. Questions arise about whether such practices are ethical and how those people should be protected.
Business Use Cases Expand
Despite the challenges, AI tools are gaining traction. Avatars have been associated with the consumer market, but commercial acceptance is rising in different sectors. Chief among them are the following:
- Media. Nowadays, there are more media producers than ever. SMBs and individuals have created viable businesses by becoming content creators on platforms like YouTube, Instagram, and TikTok. Using AI avatar lip sync tools enables them to repurpose content and adjust dialogue more easily.
- Entertainment. Personalization is important in this space, and the new tools enable companies to create a wider range of avatars quickly. Producers edit dialogue in post-production, synchronizing new audio with existing video, allowing for script changes and versions suited to different audiences.
- Education. Schools use AI avatar lip sync to enhance learning content. Avatars are being embedded into educational videos and can even act as tutors. They help students walk through various lesson plans, explaining content and providing an environment where students feel more confident in their abilities.
- Marketing. Avatars become part of a company’s brand identity and enhance storytelling by developing a wider range of characters, personalizing content for specific audiences and improving engagement and effectiveness. Customers are more interested in engaging video content, and companies can provide avatars that make a good impression and increase the level of engagement.
- Customer service. Companies are looking to use virtual technologies to automate and improve customer service. AI has the potential to help companies move away from traditional, often ineffective interactive voice response systems to more immersive voice interactions, according to Opus Research’s Miller. In fact, customer service already represented more than 33.5 percent of worldwide AI avatar revenue in 2023, according to Global Market Insights.
- Collaboration. Businesses are increasingly adopting remote operations. AI avatars enrich dialogue during company and department meetings. They can also strengthen bonds among remote workers who do not interact face to face.
- Internal training. Companies provide training to individuals for different positions. Avatars remove the intimidation that users might feel and make the content more accessible.
- Employee onboarding. When employees start at a new job, they need to wade through a lot of often cumbersome information and fill out numerous forms. These tools make the content simple and more engaging. Newbies are up and running faster, improving the company’s productivity.
Avatar technology continues to advance at an incredible pace. Vendors are adding speech to traditionally mute characters. The change delivers richer, more engaging content, but its application development environment lacks the features found in some other digital tools. To take advantage of the functionality, companies often need to connect many of the development infrastructure pieces themselves. Lip sync vendors are moving to address the limitations and extend the reach of their virtual characters.
Paul Korzeniowski is a freelance writer who specializes in technology issues. He has been covering speech technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at paulkorzen@aol.com or on Twitter @PaulKorzeniowski.
5 Companies That Matter
- Colossyan, providers of an artificial intelligence video creation platform that uses AI avatars and text-to-speech narration.
- Speechify, makers of an artificial intelligence-powered text-to-speech voiceover generator tool.
- Synthesia, a synthetic media generation company that develops software used to create artificial intelligence-generated video content.
- VEED, provider of a free artificial intelligence video editor with text-to-video, avatars, auto-subtitles, voice translations, and more.
- WellSaid Labs, provider of an advanced artificial intelligence voice platform and text-to-speech technology.