The 2017 Speech Industry Star Performers: Acapela
Acapela Puts a Creative Spin on Voice Creation
Creating synthetic text-to-speech (TTS) voices typically involves rich audio material recorded by professional voice actors in professional studios under the supervision of linguistic experts. It can be expensive and time-consuming.
But Acapela Group has broken that mold. Based on its work with deep learning and deep neural networks during the past year, the Belgian company recently launched Acapela DNN, which can create synthetic versions of any voice using just a few minutes of speech recordings and the associated text transcripts.
When the TTS voice will be used by people with disabilities or by those who have lost the ability to speak due to injury, surgery, or disease, Acapela DNN can work with 10 to 15 minutes of speech. For more professional applications, such as video games or passenger information systems, Acapela DNN might need an hour’s worth of recordings or more. The more data it has to work with, the more closely the synthesized voice will match the original.
The secret is using deep neural networks to learn the relationship between input texts and their acoustic realizations by different speakers. The algorithms use Acapela’s Voice ID technology to define the digital signatures of speakers’ vocal tracts; additional training helps match those voice imprints with more granular details, such as speaker accents and speech patterns.
Acapela DNN is trained offline with Acapela’s TTS portfolio of 100 speakers in 34 languages and accents, including a unique repertoire of 20 children’s voices.
“Acapela DNN represents Acapela’s ultimate talking machine, benefiting from our speech expertise and learning from our vast voice and language databases to model voice identities and reproduce speech, in many languages. This is much more than concatenating speech recordings from the studio,” says Vincent Pagel, manager of Acapela’s R&D and Linguistics groups. “We are talking about creating a voice signal and persona from scratch in many languages, and it is happening now. We need only one week to release a new voice based on a few minutes of speech recordings.”
To further simplify the creation of TTS applications, Acapela released version 9.4 of its TTS software development kit in October. The release includes new voices, audio boost, an enhanced spelling mode, an updated part-of-speech tagger, a new breathing model, enhanced phonetizer and stress restoration systems, phrasing/chunker alignment, and improved morphological analysis, leading to more accurate pronunciations. Acapela also optimized memory usage and improved word-level synchronization.
Acapela also expanded the use cases for its voices. Quickomat, a Swedish self-service vending machine provider, deployed a system in Norway for buying bus tickets. Acapela voices provide the talking interface for users with disabilities, relaying information usually displayed on machine screens.
In addition, Acapela is providing its U.S. English synthetic voice Sharon to Kokoro, a humanoid robot that was tested earlier this year at Japan’s Narita Airport. Kokoro, speaking both English and Japanese, provided assistance at the airport’s insurance counter, guiding passengers through the terminal and informing them about overseas insurance procedures and registration, important notices, medical and health information, flight and departure gates, and more.
In another unusual use case, Acapela was tapped this year to supply the voices for a-ACADEMY, a platform designed by Avallain to give children in Kenya access to interactive educational content. Avallain is using Acapela’s Box, an online service that allows users to create voice samples in any of the 34 languages Acapela supports. The a-ACADEMY platform currently uses Acapela’s U.K. English voices to provide accurate pronunciation for English lessons.
Acapela says the applications it can create are endless. The company clearly is on a mission to push the boundaries of technology, allowing everyone—and everything—to have a voice.