
New Trends in Speech Technology: A Report from the Cutting Edge


It’s always exciting to hear splashy announcements of new artificial intelligence capabilities and technologies, like OpenAI’s GPT-4o or the Sora video generator. Did you ever wonder how these dramatic innovations come to be? In fact, they don’t just come out of nowhere; they are based on a tremendous amount of basic research performed by a community of experts at universities and companies.

These experts focus on extending the underlying technologies of image processing, speech recognition, natural language understanding, and dialogue processing. They spend most of their time working on research projects in their labs, but they also meet on a regular basis at technical conferences. These conferences are where they present their research results and exchange ideas, either in formal sessions or informally in the halls and over coffee. This kind of lively interchange lays the groundwork for future advances.

Two of the most prominent conversational AI technical conferences, the Language Resources and Evaluation Conference (LREC) and the International Conference on Computational Linguistics (COLING), met jointly in Turin, Italy, in May as LREC-COLING-2024.

Over 2,000 international experts in language technology from dozens of countries gathered to share their research. More than 1,500 papers were presented, representing the very newest advances in natural language and speech technology. These papers were selected, after a thorough peer-review process, from more than 4,000 submissions. Conferences like LREC-COLING-2024 are where these findings are first made available outside of the researchers’ labs for discussion and consideration by their technical communities, a vital part of the process of developing new technologies.

I attended, and I thought it would be interesting to summarize some broad themes of the cutting-edge research presented there. By looking at the research presentations, we can get a glimpse of things to come. You will be hearing about this research being put into practice as product announcements are made in the coming months. Here are a few examples of common themes from the conference.

A lot of work was presented on trying to understand meanings based on how you say something instead of what you say. This kind of research has applications in several areas (a short sketch of the acoustic features involved follows these examples):

Emotion recognition. It’s important to recognize the emotions conveyed in the tone of voice, emphasis, and pacing of utterances, independent of the spoken words. Try saying something like “I don’t know” in a sad, happy, angry, or excited tone of voice, and you can easily hear how different the emotions sound. In an area like customer service, it’s crucial to understand users’ emotions as well as their words.

Medical applications. How something is said is also critical in medical applications. Speech patterns can provide information useful for diagnosing the many diseases that affect a person’s speech, like Parkinson’s and Alzheimer’s. Speech patterns can also reflect conditions like depression.

Automatic detection of hate speech and other offensive language. Unfortunately, online text can contain hate speech, which not only upsets users but reflects badly on any platform or website that delivers it. The volume of online text makes relying on human moderation impractical; the automatic moderation techniques described at the conference could prove very helpful.
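To make the idea of understanding how something is said (rather than what is said) concrete, here is a minimal sketch, not any system presented at the conference, of extracting a few prosodic features (pitch, loudness, and a rough pacing measure) that an emotion recognizer might use alongside the words. It assumes the open-source librosa audio library and a hypothetical recording named utterance.wav.

```python
# Minimal, illustrative sketch: compute simple prosodic features that could
# feed an emotion classifier. Assumes librosa is installed and that
# "utterance.wav" (a hypothetical file) contains one spoken utterance.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# Pitch (fundamental frequency) contour; its level and range carry emotional cues.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Loudness (RMS energy) over time; emphasis shows up as energy peaks.
rms = librosa.feature.rms(y=y)[0]

# A rough pacing proxy: acoustic onsets per second of audio.
onsets = librosa.onset.onset_detect(y=y, sr=sr)
duration_s = len(y) / sr

features = {
    "mean_pitch_hz": float(np.nanmean(f0)),
    "pitch_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
    "mean_energy": float(rms.mean()),
    "onsets_per_second": len(onsets) / duration_s,
}
print(features)  # these numbers, not the words, are the input to the classifier
```

The same sentence spoken sadly or excitedly produces very different values for these features, which is what lets a classifier recognize emotion independent of the transcript.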

More languages. Much research focused on extending the techniques used with languages like English to less common languages. Google is embarking on an ambitious project to provide computational resources to help with the analysis of many more of the world’s 7,500 languages, starting with the publication of a common format for describing languages. Many papers described resources for processing less common but still widely spoken languages.

Finding more precise meanings. Research was presented on systems that extract more detailed meanings from complex texts. For example, some systems can make multistep inferences, such as answering the question “Who is the mayor of the city where the Louvre is located?” The system first has to identify Paris as the city where the Louvre is located and then correctly identify its mayor.
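As a toy illustration of that two-step inference, and not a description of any system presented at the conference, the sketch below chains two lookups over a tiny hand-built fact table: it first resolves the Louvre to Paris and then resolves Paris to its mayor.

```python
# Toy illustration of two-hop question answering over a hand-built fact table.
# Real research systems operate over large knowledge bases or raw text, but
# chaining an intermediate answer into a second query is the same basic step.
facts = {
    ("Louvre", "located_in"): "Paris",
    ("Paris", "mayor"): "Anne Hidalgo",
}

def answer_two_hop(entity: str, first_relation: str, second_relation: str) -> str:
    """Chain two lookups: entity -> bridge entity -> final answer."""
    bridge = facts[(entity, first_relation)]   # hop 1: Louvre -> Paris
    return facts[(bridge, second_relation)]    # hop 2: Paris -> its mayor

print(answer_two_hop("Louvre", "located_in", "mayor"))  # -> "Anne Hidalgo"
```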

Data and evaluation. Finally, many papers covered these unsung heroes of language technology. While LLMs and genAI are glamorous technologies that everyone is familiar with, they couldn’t exist without the vast amounts of data that they’re trained on, which is how they learn to understand and generate language. Over 140 papers described data covering a wide range of common and uncommon languages as well as specialized domains like polymer science or cancer drugs. Evaluating conversational AI technologies is also a critical component of scientific research—it’s hard to improve any technology without understanding how well it works. Over 70 papers covered various aspects of evaluation.

Look for future applications that use the trends I’ve reviewed here. 

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversationaltechnologies.com.
