Building the Next Generation of ASR: Speech Emotion Recognition Apps
In the winter 2021 issue of Speech Technology, I discussed the potential use cases of speech emotion recognition (SER) and its ability to enhance the customer experience (“Speech Emotion Recognition: The Next Step in the User Experience”). Now let’s move on to the challenges of SER and how we can build next-generation SER applications.
Limited Availability of Realistic Speech Emotion Datasets
As a field, SER is more than two decades old, but it is relatively new compared to automatic speech recognition (ASR). ASR has taken off in recent years by leveraging artificial intelligence. SER, on the other hand, has been slow off the blocks because, unlike in ASR, the data available to train AI models has been rather limited.
Traditional SER datasets were either acted or induced. Acted datasets were created by paid actors speaking set phrases with specified emotions. Induced datasets were a slight improvement, with certain emotions elicited by having speakers watch particular clips or imagine specific situations. These datasets are sparse, and the SER use cases we are envisaging today require automated emotion detection in interactive conversations; AI models trained on such datasets won't work well in the real world. SER systems trained and tested on speech segments with predefined or limited emotions won't be able to handle spontaneous speech in actual usage.
Note that the constraint is not the availability of real-world emotion-laden speech but annotating/labeling the data to create standardized datasets. Compared to other types of data (images, for example), tagging speech’s emotional content can be more subjective. That brings us to the next issue: speech emotion modeling.
Modeling Emotion Is Complicated
Speech emotion modeling, or how to represent the emotions embedded in speech, is both complex and critical. One traditional approach has been to model speech emotion as belonging to one of a few main categories: anger, disgust, fear, happiness, sadness, or neutral. Machine learning favors a dimensional approach over discrete categories, representing emotion along continuous scales such as valence (positive to negative) and arousal (calm to excited). In this approach, acoustic features of speech, both linguistic and non-linguistic, are used. It is possible to employ a mix of technical features of sound (spectral information, energy), prosody (intonation, intensity, rhythm), and more to train SER models.
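To make the feature side concrete, here is a minimal sketch of extracting spectral, energy, and prosodic features from a single utterance, assuming the librosa library; the file name and the choice of features are illustrative, not a prescribed recipe.

```python
# A minimal sketch of utterance-level feature extraction for an SER model,
# assuming librosa is installed; "utterance.wav" is a hypothetical clip.
import numpy as np
import librosa

def extract_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)

    # Spectral information: MFCCs summarize the short-term spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Energy: root-mean-square amplitude per frame (a rough intensity proxy).
    rms = librosa.feature.rms(y=y)

    # Prosody: frame-level pitch (F0) estimated with the pYIN algorithm.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)

    # Pool frame-level features into one fixed-length vector per utterance.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        rms.mean(axis=1),
        [np.nanmean(f0), np.nanstd(f0)],
    ])

features = extract_features("utterance.wav")  # input to a classifier or regressor
```

The pooled vector can then be fed to any classifier (for discrete categories) or regressor (for valence/arousal dimensions).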
Nonverbal vocalizations, such as laughter, sighing, breathing, and hesitations/pauses, contain useful signals for emotion detection. We also need to account for non-emotional conditions that bear on how the voice sounds—being tired, having a cold, or consuming alcohol or other substances, for example. Consumer-facing SER applications in the wild have to deal with multiple languages, cross-cultural speech patterns, far-field acoustics, speaker identification, group dynamics, speech turns, and so on.
Though we are discussing SER here, non-speech cues, such as visual information, can also be inputs to the model when available. For example, in some scenarios both audio and video content may be available. The text of the speech itself can be analyzed using natural language processing (NLP). Going beyond literal interpretations, NLP can potentially help detect irony, sarcasm, or humor.
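As an illustration of combining the transcript with the acoustic signal, here is a minimal sketch, assuming the Hugging Face transformers library; the default sentiment model is only a stand-in for richer NLP (detecting irony or sarcasm would need more than this), and the fusion weight is purely illustrative.

```python
# A minimal sketch of late fusion between an acoustic valence estimate and a
# text-based sentiment score, assuming Hugging Face transformers is installed.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def text_emotion_score(transcript):
    # Map the sentiment output to a score in [-1, 1].
    result = sentiment(transcript)[0]
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

def fuse(audio_valence, transcript, text_weight=0.4):
    # Late fusion: blend the audio-derived valence with the text-derived score.
    return (1 - text_weight) * audio_valence + text_weight * text_emotion_score(transcript)

print(fuse(audio_valence=-0.2, transcript="The agent resolved my issue quickly."))
```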
All this points to the importance of having high-quality data—the richness of the datasets will determine the performance of SER. And machine learning techniques are playing a big role here:
- Semi-supervised learning techniques can be applied to label data. Here, human annotators label a small subset of the data, and the algorithm labels the rest of the corpus.
- An extension of this approach is active learning, where there is a human in the loop to improve the quality of automatic labeling. In active learning, if the algorithm has low confidence in its classification, it routes that speech data to a human annotator (see the sketch after this list).
- Synthetic speech data can be generated from small amounts of real speech, and techniques such as generative adversarial networks (GANs) can be used to bring it close to realistic speech quality.
- Transfer learning, which refers to applying knowledge from one context to another, can be useful. Examples include leveraging adult emotion models for child emotion recognition training or using non-speech audio (such as music) to train SER models.
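To make the active-learning idea concrete, here is a minimal sketch, assuming scikit-learn and utterance-level feature vectors; the classifier choice and the confidence threshold are illustrative assumptions rather than a reference implementation.

```python
# A minimal sketch of one active-learning round: confident predictions are
# auto-labeled, low-confidence ones are routed to a human annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(model, X_labeled, y_labeled, X_unlabeled, threshold=0.8):
    model.fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)

    # Split the unlabeled pool by model confidence.
    auto_idx = np.where(confidence >= threshold)[0]
    human_idx = np.where(confidence < threshold)[0]
    auto_labels = model.classes_[probs[auto_idx].argmax(axis=1)]
    return auto_idx, auto_labels, human_idx

# Usage with a simple classifier over utterance-level feature vectors:
# auto_idx, auto_labels, human_idx = active_learning_round(
#     LogisticRegression(max_iter=1000), X_seed, y_seed, X_pool)
```

Each round grows the labeled set, and human effort is spent only where the model is least certain.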
In sum, speech emotion recognition is a complex field with many moving parts—linguistic and non-linguistic, contextual, even visual. Machine learning, along with human assistance, will have a large role to play in getting to next-gen SER applications.
Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and is the co-author of Practical Artificial Intelligence: An Enterprise Playbook.