NLP and Data Statements: Let’s Speak the Right Language(s)
The field of natural language processing (NLP) has progressed by leaps and bounds in the past few years. But the spectacular progress masks a discomfiting truth: the “language” part of natural language processing is assumed to be English by default. I like English and it’s my go-to language, but there are more than 7,000 languages in our world. Yet most NLP research happens in a handful of languages, only about 20 or so, which are referred to as high-resource languages. Note that even among these high-resource languages, research and development are predominantly English language processing (ELP).
There is a long tail of languages, the so-called low-resource languages, where little NLP research happens because the required datasets are unavailable. The high-resource or low-resource categorization says nothing about the languages themselves; the labels simply denote whether enough data exists for machine learning or statistical methods to be directly applied to a given language.
The false equivalence of NLP to ELP is not a new finding per se, but I am glad there is more attention being drawn to the issue now. A recommendation that’s gaining increasing support is the “Bender Rule,” proposed by NLP researcher Emily M. Bender at the University of Washington. The Bender Rule—“Always state the language being worked on up front”—is deceptively simple but contains multitudes. It’s a recognition that “English” and “natural language” are not synonymous. From a language processing perspective, English is not (and cannot be) the proxy for all languages. As someone who understands five languages (and took classes to learn another three), I can certainly vouch that English has too many idiosyncrasies to be a good representative language.
The crux is that when the language of an NLP project or application is not explicitly stated (as is most often the case when that language is English), English applications are deemed the mainstream default and work in other languages is considered language-specific. We implicitly risk relegating important work in non-English or low-resource languages to the sidelines. This only serves to worsen the already wide gap between high-resource and low-resource languages in applied language technology. Use cases for speech recognition, language aids, translation, synthetic voice banks, search, and discovery will be English-centric and remain inaccessible to non-English speakers. A vicious cycle sets in: the usage and utility of languages without technology applications will decrease even further. As with animal species, languages are vulnerable to extinction. Many of today’s languages will vanish, and humanity will be poorer as a result.
The creation and onboarding of new datasets in low-resource languages is an ongoing, herculean effort. There is also some very interesting (and fruitful) research into new and better AI techniques to overcome the dataset deficit. But those are topics for a different day.
When we use AI in facial recognition systems, we acknowledge the urgency of cataloging the data used to train the machine learning models so that we understand their limitations. Along similar lines, while stating the language up front is a start, “data statements” should become par for the course for NLP applications. As Professor Bender recommends, data statements should provide information on why and how the dataset was created and curated. They should also cover speaker demographics, annotator demographics, the speech situation, text characteristics, recording quality, and provenance. I have barely scratched the surface here, but I urge you to read the paper “Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science,” by Professor Bender and Batya Friedman, for a more detailed exposition of these important ideas and the suggested form and format of data statements. Make them an integral part of your NLP and speech technology application development process.
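To make the idea concrete, here is a minimal sketch in Python of what a data statement might look like as a structured record. The field names paraphrase the schema elements listed above from Bender and Friedman’s paper; the DataStatement class itself and the Telugu speech corpus in the example are purely hypothetical illustrations, not a standard schema or API.

```python
from dataclasses import dataclass

@dataclass
class DataStatement:
    """Illustrative container for a data statement, loosely following the
    schema elements proposed by Bender and Friedman (2018). This is a
    sketch for discussion, not a standard library or official format."""
    curation_rationale: str        # why and how the dataset was created/curated
    language_variety: str          # e.g., a BCP-47 tag plus a prose description
    speaker_demographics: str      # who produced the language data
    annotator_demographics: str    # who labeled the data and their backgrounds
    speech_situation: str          # time, place, modality, intended audience
    text_characteristics: str      # genre, topic, structure of the text
    recording_quality: str         # for speech data; "N/A" for text-only corpora
    provenance: str                # upstream datasets, if any, and their statements

# Hypothetical example for an imaginary Telugu speech corpus:
statement = DataStatement(
    curation_rationale="Crowdsourced to improve speech recognition for Telugu.",
    language_variety="te-IN; Telugu as spoken in coastal Andhra Pradesh",
    speaker_demographics="120 adult volunteers, ages 18-65, mixed gender",
    annotator_demographics="4 native Telugu speakers with linguistics training",
    speech_situation="Read speech recorded in quiet rooms on smartphones",
    text_characteristics="News sentences and everyday conversational prompts",
    recording_quality="16 kHz mono WAV, occasional background noise",
    provenance="Original collection; no upstream dataset",
)
print(statement.language_variety)
```

However you choose to represent it, the point is that the statement travels with the dataset, so anyone building on the data can see at a glance which language and which speakers it actually represents.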
Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and co-author of Practical Artificial Intelligence: An Enterprise Playbook.