Why Speech Researchers Need Better Benchmarks
The speech-to-text industry has grown by leaps and bounds in recent years. Word error rates that once seemed remarkable are now table stakes, thanks to better training and evaluation techniques and increasingly powerful automatic speech recognition (ASR) models.
However, as real-world speech recognition has expanded to cover a wide spectrum of long-form audio uses, the most popular publicly available English datasets used to train and test ASR models have not kept pace. This has led to a mismatch between research and real-world application: people increasingly use speech recognition for long-form audio, such as transcribing and captioning content, yet the breakthrough model architectures coming out of the research community are often evaluated only on short-form use cases.
Put simply, as the speech-to-text industry invests in building the future of long-form ASR, neglecting to update these legacy datasets runs the risk of slowing us down.
The problem lies in how publicly available datasets are typically segmented. One standard research setup, for example, segments an audiobook into chunks at the end of every sentence, a useful approach when speech recognition is limited to short-form audio, such as giving commands to a digital assistant or navigating an automated customer service line. As ASR models and compute power have improved, however, the market has sprinted far ahead of those research standards with new techniques that leverage the added context in long-form audio to improve accuracy. Today, real-world audio increasingly arrives as unsegmented, long-form recordings rather than short, discrete segments, and the legacy datasets weren't built to train models for this growing use case.
While researchers have tackled this mismatch with techniques like better segmentation, large-context acoustic modeling, and rescoring with appropriate language models, there is still little consensus on best practices for training long-form ASR models, in part because the research community doesn't have common benchmarks for measuring performance. To build better, faster, and more accurate long-form ASR, we first need to agree on and adopt those benchmarks. That is how we can ensure we're heading in the right direction.
As much as innovation is sparked by healthy competition, it also depends on open collaboration across the research community. While published studies backed by private data have certainly pushed advanced ASR capabilities forward, they cannot be replicated by other researchers, which underscores the importance of publicly available datasets. We need to develop new techniques and public benchmarks, drawn from standardized data that reflects how people actually use speech recognition today.
Our team at Rev and the Johns Hopkins University Center for Language and Speech Processing hopes to aid in that process. In our latest research, titled "Updated Corpora and Benchmarks for Long-Form Speech Recognition," which we presented at the IEEE International Conference on Acoustics, Speech and Signal Processing, we released updated versions of the TED-LIUM 3, GigaSpeech, and VoxPopuli datasets to encourage more robust research into long-form ASR. Our initial tests show that a simple strategy of combining original and long-form segments for training is effective at reducing the performance gap: for TED-LIUM and GigaSpeech in particular, long-form training significantly reduced deletion rates, leading to better overall performance. Perhaps more importantly, these updated datasets can now serve as public benchmarks for future research.
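To make that strategy concrete, here is a minimal sketch of pooling an original, short-segment training manifest with a long-form manifest into a single training list. The JSON-lines format and the "audio_path"/"text" field names are illustrative assumptions for the example, not the exact recipe or schema from our paper.

```python
import json
import random

def combine_segmentations(original_manifest, long_form_manifest, seed=0):
    """Pool original short segments with long-form segments into one training list.

    Each manifest is assumed to be a JSON-lines file whose entries contain
    "audio_path" and "text" fields (an illustrative schema, not the corpus format).
    """
    def load(path):
        with open(path) as f:
            return [json.loads(line) for line in f]

    combined = load(original_manifest) + load(long_form_manifest)
    random.Random(seed).shuffle(combined)  # interleave short and long examples
    return combined
```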
Unlike pre-segmented datasets, these updated datasets include word-level timestamps obtained through forced alignment, giving researchers more control over how they segment the audio. For training and testing, this means audio can be split at any point to match the conditions expected at inference time.
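As a rough illustration, the sketch below regroups word-level alignments into segments of a chosen maximum duration. The field names ("word", "start", "end") and the 30-second budget are assumptions for the example, not the released corpus format.

```python
def resegment(words, max_duration=30.0):
    """Group word-level alignments into segments of at most max_duration seconds.

    Each item in `words` is assumed to look like
    {"word": str, "start": float, "end": float} with times in seconds;
    this schema is an assumption for the example, not the corpus format.
    """
    segments, current = [], []
    for w in words:
        # Start a new segment once adding this word would exceed the budget.
        if current and w["end"] - current[0]["start"] > max_duration:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return [
        {
            "start": seg[0]["start"],
            "end": seg[-1]["end"],
            "text": " ".join(w["word"] for w in seg),
        }
        for seg in segments
    ]
```

Sweeping the duration budget, or swapping in a smarter boundary rule, is then a straightforward way to study how segment length affects long-form performance.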
With this new variable in play, researchers can experiment with other segment lengths or dynamic segmentation to study long-form ASR performance. Now, the challenge is catching up to the market: The rise of ASR for long-form audio reminds us how rapidly the speech-to-text industry has improved, and it also sets a course for the research ahead as we build tools to match real-world use cases.
By releasing the updated TED-LIUM 3, GigaSpeech, and VoxPopuli datasets publicly, we hope to help researchers looking for ways to lower word error rates and improve long-form ASR performance. Building the future of long-form ASR is a responsibility we all share, and we believe these benchmarks can play an impactful role on that journey.