Meta, Google Develop Their Own AI Speech Models
Facebook parent company Meta last month unveiled Voicebox, an advanced artificial intelligence tool for generating speech from text and related tasks, such as editing, sampling, and stylizing. Google at the same time introduced AudioPaLM, its own large language model that can tackle speech understanding and generation tasks.
Meta’s Voicebox lets users create audio clips and edit prerecorded audio while maintaining the original content and style. It can generate speech in six languages (English, French, Spanish, German, Polish, and Portuguese), even when the sample speech and text are in different languages.
Voicebox uses machine learning not only to remove background noise but also to fill in gaps created by external noises, like car horns or barking dogs, in the original audio sample. It can also replace misspoken words without requiring users to completely re-record the original audio clip.
And in a recent blog post, Meta developers claimed that Voicebox could generate speech that closely mirrors how people naturally converse in the real world. Voicebox was reportedly trained with more than 50,000 hours of recorded speech and transcripts from public domain audiobooks.
Google’s AudioPaLM, meanwhile, combines two other proprietary Google models, PaLM-2 and AudioLM, to produce a unified multimodal architecture that can process and produce both text and speech. This allows AudioPaLM to handle a variety of applications, ranging from voice recognition to voice-to-text conversion, Google said.
AudioPaLM uses a joint vocabulary that can represent both speech and text with a limited number of discrete tokens. As a result, tasks like speech recognition, text-to-speech synthesis, and speech-to-speech translation can be unified into a single architecture and training process.
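One way to picture a joint vocabulary is as a single ID space in which audio tokens sit after the text tokens. The sketch below is only an illustration of that idea, not Google's implementation: the vocabulary sizes and the function names (`audio_to_joint`, `is_audio`, `build_sequence`) are hypothetical.

```python
# Hypothetical sizes, chosen only for illustration; not AudioPaLM's real values.
TEXT_VOCAB_SIZE = 32_000
AUDIO_VOCAB_SIZE = 1_024
JOINT_VOCAB_SIZE = TEXT_VOCAB_SIZE + AUDIO_VOCAB_SIZE


def audio_to_joint(audio_token: int) -> int:
    """Map a discrete audio token into the joint ID space, after all text IDs."""
    if not 0 <= audio_token < AUDIO_VOCAB_SIZE:
        raise ValueError(f"audio token out of range: {audio_token}")
    return TEXT_VOCAB_SIZE + audio_token


def is_audio(joint_id: int) -> bool:
    """Return True if a joint ID falls in the audio part of the vocabulary."""
    return TEXT_VOCAB_SIZE <= joint_id < JOINT_VOCAB_SIZE


def build_sequence(text_ids, audio_tokens):
    """Concatenate text IDs and audio tokens into one joint-ID sequence."""
    return list(text_ids) + [audio_to_joint(a) for a in audio_tokens]


if __name__ == "__main__":
    sequence = build_sequence([5, 17], [0, 3])
    print(sequence)
    print([is_audio(token) for token in sequence])
```

Because every token, spoken or written, is just an integer in the same range, a single sequence model can in principle be trained on mixed text and audio sequences.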
In addition to speech generation, AudioPaLM can generate transcripts, either in the original language or directly as a translation, and generate speech in the original speaker’s voice.
Microsoft Comparisons
Meta claimed that Voicebox can produce audio clips using just a two-second audio sample. The previous gold standard had been from Microsoft, which claimed that its VALL-E model needed just three seconds of audio.
In a research paper, Meta developers working on Voicebox claimed it could generate audio samples 20 times faster than Microsoft’s VALL-E. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models,” they wrote.
Voicebox, they said further, outperforms VALL-E on zero-shot text-to-speech in both intelligibility (a 1.9 percent word error rate vs. VALL-E’s 5.9 percent) and audio similarity (0.681 vs. 0.580). For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9 percent to 5.2 percent and improving audio similarity from 0.335 to 0.481.
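Word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below shows that standard calculation. It is not Meta's evaluation code, and the function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower value means the generated speech was transcribed more accurately, which is why a drop from 5.9 percent to 1.9 percent counts as an improvement in intelligibility.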
Google also touted its AudioPaLM as superior to VALL-E and similar products, asserting that AudioPaLM “significantly” outperformed other systems in speech translation. It said AudioPaLM can perform zero-shot speech-to-text translation, meaning it can accurately translate speech into text for languages it has never encountered before, can transfer voices across languages based on short spoken prompts, and can capture and reproduce distinct voices in different languages.
Both Meta and Google also claimed that their respective Voicebox and AudioPaLM technologies can preserve paralinguistic information, such as speaker identity and intonation.
Both companies have also maintained that the potential applications for their technologies are many.
Google sees AudioPaLM being used in multilingual voice assistants, automated transcription services, and other systems that need to understand or generate written or spoken language. These could include video subtitling or dubbing in multiple languages without losing the original speaker’s voice.
Voicebox developers called their system “the first versatile, efficient model that successfully performs task generalization,” adding further that it “could usher in a new era of generative AI for speech,” and “represents an important step forward in generative AI research.”
Meta acknowledged the potential for misuse and unintended negative consequences with Voicebox. To address these concerns, the company is working on a classifier to differentiate between authentic speech and audio generated by Voicebox. It is also not making the Voicebox model or code publicly available just yet.
“We look forward to continuing our exploration in the audio domain and seeing how other researchers build on our work,” its researchers said.
Google’s researchers pointed to several other areas for future research, including understanding audio tokens and how to measure and optimize them. They also emphasized the need for established benchmarks and metrics for generative audio tasks.