-->
  • June 12, 2024
  • FYI

Amazon Proposes the SpeechVerse Framework


Researchers from Amazon’s AWS AI Labs have introduced SpeechVerse, a multimodal framework enabling large language models (LLMs) to execute a variety of speech tasks through natural language instructions.

SpeechVerse integrates textual LLMs with speech encoders in one supervised training setup for a more comprehensive understanding of both speech and text. The multitask learning leverages shared representations across diverse tasks to enhance generalization and efficiency.

SpeechVerse’s multimodal model architecture includes an audio encoder, a convolution downsampling module that shortens the audio feature sequence, and a pretrained LLM that uses these audio features and textual instructions to perform the requested tasks.

The audio encoder extracts semantic features from audio using a pretrained model. The downsampling module shortens and adapts these features for compatibility with the LLM, which then processes the text and audio input together. During fine-tuning, the pretrained components can be kept frozen, which lets the model efficiently learn diverse speech tasks, such as automatic speech recognition, speech translation, and natural language processing.
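As a rough illustration of the downsampling step, the Python sketch below shortens a sequence of audio feature frames with a strided one-dimensional convolution and then joins the result with instruction embeddings. The function names, the uniform kernel weights, the kernel size and stride, and the ordering of audio before text are all assumptions made for illustration; the article does not specify SpeechVerse's actual layer configuration.

```python
from typing import List

Frame = List[float]  # one feature vector per audio frame


def downsample(frames: List[Frame], kernel: int = 3, stride: int = 2) -> List[Frame]:
    """Shorten a feature sequence with a strided 1-D averaging convolution.

    A trained module would use learned kernel weights; uniform weights
    (a moving average) are an assumption that keeps this sketch small.
    """
    if kernel < 1 or stride < 1:
        raise ValueError("kernel and stride must be positive")
    out: List[Frame] = []
    # Slide the window over the frames, jumping `stride` frames each step.
    for start in range(0, len(frames) - kernel + 1, stride):
        window = frames[start:start + kernel]
        dim = len(window[0])
        out.append([sum(f[d] for f in window) / kernel for d in range(dim)])
    return out


def build_llm_input(audio: List[Frame], instruction: List[Frame]) -> List[Frame]:
    """Concatenate downsampled audio features with instruction embeddings.

    Placing audio before text is an assumption, not a documented detail.
    """
    return downsample(audio) + instruction


if __name__ == "__main__":
    audio_features = [[float(i), float(i) * 2.0] for i in range(9)]  # 9 frames, 2 dims
    instruction_embeddings = [[0.0, 0.0], [1.0, 1.0]]                # 2 toy "tokens"
    print(len(audio_features), "->", len(downsample(audio_features)))
    print(len(build_llm_input(audio_features, instruction_embeddings)))
```

With a stride of 2, the sequence handed to the LLM is roughly half as long as the encoder output, which is the point of the module: fewer audio positions for the language model to attend over.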

The SpeechVerse model leverages the robust language understanding of its LLM backbone to adapt to open-ended tasks that were not included during initial training or multimodal fine-tuning. This involves unique prompting and decoding strategies, including constrained and joint decoding, which enhance the model’s ability to generalize to completely unseen tasks.
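One common way to constrain decoding, sketched below in Python, is to let the model choose only tokens that keep the output inside a fixed set of permitted answers, such as class labels. This is a generic illustration rather than SpeechVerse's documented procedure; `constrained_decode`, `toy_scores`, and the `<eos>` marker are hypothetical names, and the scores stand in for real LLM outputs.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence marker


def constrained_decode(
    score_next: Callable[[List[str]], Dict[str, float]],
    allowed: Sequence[Sequence[str]],
) -> List[str]:
    """Greedy decoding restricted to the token sequences in `allowed`."""
    if not allowed:
        raise ValueError("at least one allowed sequence is required")
    prefix: List[str] = []
    while True:
        # Keep only the allowed sequences that still match what was generated.
        live = [seq for seq in allowed if list(seq[:len(prefix)]) == prefix]
        options = {seq[len(prefix)] for seq in live if len(seq) > len(prefix)}
        if any(len(seq) == len(prefix) for seq in live):
            options.add(EOS)  # the prefix is already a complete answer
        scores = score_next(prefix)
        # Tokens the model did not score are treated as least likely.
        best = max(sorted(options), key=lambda tok: scores.get(tok, float("-inf")))
        if best == EOS:
            return prefix
        prefix.append(best)


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for an LLM's next-token scores (made-up numbers)."""
    if not prefix:
        return {"maybe": 2.0, "positive": 1.5, "negative": 0.3}
    return {EOS: 1.0}


if __name__ == "__main__":
    labels = [["positive"], ["negative"], ["neutral"]]
    print(constrained_decode(toy_scores, labels))
```

In this toy run the unconstrained top choice would be "maybe", which is not a valid label; restricting the options forces the decoder to pick the best-scoring permitted answer instead.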

Using SpeechVerse, the Amazon researchers developed three variants of the multimodal models: Task-FT, where each model is trained individually for a specific task; Multitask-WLM, a single multitask model trained by pooling datasets for all tasks together; and Multitask-BRQ, which uses the BERT-based speech pretraining with random-projection quantizer (BEST-RQ) architecture for the audio encoder. BEST-RQ has reportedly performed well on automatic speech recognition while being simpler than other self-supervised learning methods, such as wav2vec 2.0.
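The idea of pooling datasets for multitask training can be sketched in a few lines of Python: every example is tagged with a natural language instruction for its task, and the tagged examples are shuffled into one training set. The task names, instruction wording, and file names below are invented for illustration and are not taken from the SpeechVerse work.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]      # (audio id, target text)
Pooled = Tuple[str, str, str]  # (instruction, audio id, target text)


def pool_datasets(
    datasets: Dict[str, List[Example]],
    instructions: Dict[str, str],
    seed: int = 0,
) -> List[Pooled]:
    """Merge per-task datasets into one shuffled, instruction-tagged set."""
    pooled = [
        (instructions[task], audio, target)
        for task, examples in datasets.items()
        for audio, target in examples
    ]
    # A seeded shuffle mixes the tasks while keeping runs reproducible.
    random.Random(seed).shuffle(pooled)
    return pooled


if __name__ == "__main__":
    datasets = {
        "asr": [("clip1.wav", "hello world"), ("clip2.wav", "good morning")],
        "translation": [("clip3.wav", "bonjour")],
    }
    instructions = {
        "asr": "Transcribe the audio.",
        "translation": "Translate the audio into French.",
    }
    for row in pool_datasets(datasets, instructions):
        print(row)
```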

SpeechVerse has also performed well in preliminary testing. In comparative analysis against conventional baselines, SpeechVerse performed better on 9 of 11 tasks, showcasing its robust instruction-following capability. ASR benchmarks showed that SpeechVerse was effective at core speech understanding and spoken language understanding, with task-specific pretrained speech recognition models showing promising results. SpeechVerse models also proved equal to or better than other state-of-the-art models across diverse tasks like speech recognition and speech translation. The models were also shown to be more resilient across out-of-domain datasets, unseen prompts, and novel tasks.

By decoupling task specification from model design, SpeechVerse represents a versatile framework capable of dynamically adapting to new tasks through natural language without the need for retraining, the researchers concluded.

Based on the positive results so far, the researchers reportedly plan to enhance SpeechVerse’s capabilities to follow more complex instructions and generalize to new domains.

Congresswoman Uses TTS to Deliver a House Speech

Rep. Jennifer Wexton, a Democratic congresswoman from Virginia, in early May delivered a speech on the floor of the U.S. House of Representatives using a text-to-speech application on an iPad.

Wexton, 55, was diagnosed a year ago with a degenerative brain condition called progressive supranuclear palsy, or PSP, which makes it difficult to speak. She was first elected to the House in 2018 and announced in September that she would not seek reelection because of her condition.

“PSP makes it very difficult for me to speak, and I use an assistive app so that you and our colleagues can understand me,” Wexton said in the speech.

Wexton has been using the app for the past two months or so. With it, she can load a prewritten speech into the app and have it read aloud. She has also used it in real time on a few occasions.

With the voice app, Wexton spoke about legislation she introduced to rename the post office in Purcellville, Va., after the late Madeleine Albright, a former secretary of state under President Bill Clinton.
