
aiOla Releases Whisper-Medusa AI Model

aiOla, a speech recognition technology provider, has released Whisper-Medusa, an open-source artificial intelligence model based on a multi-head attention architecture.

aiOla's Whisper-Medusa greatly improves speed compared to OpenAI's Whisper by altering how the model predicts tokens. While Whisper predicts one token at a time, Whisper-Medusa can predict 10 at a time, resulting in a 50 percent improvement in speech prediction speed and generation runtime.
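The runtime gain comes from spending fewer forward passes per emitted token. The following toy sketch (illustrative only, not aiOla's implementation) contrasts one-token-at-a-time decoding with a Medusa-style decoder that proposes 10 tokens per pass and keeps a verified prefix of them; the function names and the fixed acceptance count are assumptions for the example.

```python
# Toy comparison: decoding steps needed to emit the same sequence.
# The "model" is a stand-in that deterministically knows the target tokens.

TARGET = list(range(30))  # the token sequence the model would emit


def decode_one_at_a_time(target):
    """Baseline Whisper-style loop: one forward pass per emitted token."""
    out, steps = [], 0
    while len(out) < len(target):
        steps += 1                    # one forward pass
        out.append(target[len(out)])  # predicts exactly one token
    return out, steps


def decode_multi_head(target, k=10, accept=5):
    """Medusa-style loop: each pass proposes k tokens; a verification
    step keeps a prefix of them (fixed at `accept` here for the toy)."""
    out, steps = [], 0
    while len(out) < len(target):
        steps += 1                                # one forward pass, k proposals
        proposal = target[len(out):len(out) + k]
        out.extend(proposal[:accept])             # keep the verified prefix
    return out, steps


seq_base, steps_base = decode_one_at_a_time(TARGET)
seq_fast, steps_fast = decode_multi_head(TARGET)
print(steps_base, steps_fast)  # the multi-head loop needs far fewer passes
```

Even when only some of the proposed tokens survive verification, each accepted token beyond the first is a forward pass saved, which is where the reported speedup comes from.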

Whisper-Medusa, based on multi-head attention, is trained using weak supervision. In this process, the main components of Whisper are initially frozen while additional parameters are trained. The training process involves using Whisper to transcribe audio datasets and employing those transcriptions as labels for training Medusa's additional token prediction modules. aiOla currently offers Whisper-Medusa as a 10-head model, with plans to release a 20-head version with equivalent accuracy.
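The weak-supervision recipe above can be sketched as a small pipeline: a frozen base model transcribes unlabeled audio, and those transcripts become pseudo-labels for training only the new head parameters. Everything here is illustrative; the function names and the toy "transcription" are assumptions, not aiOla's code.

```python
# Hedged sketch of weak supervision with a frozen backbone.

def frozen_whisper_transcribe(audio: str) -> str:
    """Stand-in for the frozen Whisper backbone: audio -> pseudo-label.
    (Here the 'audio' is just a string and 'transcription' is uppercasing.)"""
    return audio.upper()


def make_pseudo_labeled_dataset(unlabeled_audio):
    """Use the frozen model's own output as training labels."""
    return [(clip, frozen_whisper_transcribe(clip)) for clip in unlabeled_audio]


def train_medusa_heads(dataset, head_params):
    """Only the head parameters are updated; the backbone stays frozen.
    The counter stands in for a gradient step on the heads alone."""
    for audio, label in dataset:
        head_params["updates"] += 1  # placeholder for optimizing head weights
    return head_params


audio_clips = ["hello world", "speech to text"]
dataset = make_pseudo_labeled_dataset(audio_clips)
head_params = train_medusa_heads(dataset, {"updates": 0})
```

The design point is that the expensive, already-accurate backbone supplies the supervision signal for free, so only the lightweight prediction heads need new training.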

"Creating Whisper-Medusa was not an easy task, but its significance to the community is profound," said Gill Hetz, vice president of research at aiOla, in a statement. "Improving the speed and latency of LLMs is much easier to do than with automatic speech recognition systems. The encoder and decoder architectures present unique challenges due to the complexity of processing continuous audio signals and handling noise or accents. We addressed these challenges by employing our novel multi-head attention approach, which resulted in a model with nearly double the prediction speed while maintaining Whisper's high levels of accuracy. It's a major feat, and we are very proud to be the first in the industry to successfully leverage multi-head attention architecture for automatic speech recognition systems and bring it to the public."
