Dr Vukosi Marivate, an associate professor in the University of Pretoria’s Department of Computer Science and the Absa Chair of Data Science in the Faculty of Engineering, Built Environment and Information Technology, is a recipient of the Google AI Research Scholar Award for 2022.
This award supports early-career researchers who are pursuing cutting-edge research in fields relevant to Google. These include machine learning, data mining and machine translation, among other computer science-related fields. The award will provide financial support for Dr Marivate’s research team as it investigates ways to consolidate learnings on language models and language tools for South African languages and beyond.
According to Dr Marivate, recent advances in natural language processing (NLP) have only benefitted well-represented languages, leaving research into lesser-known global languages behind. This is, in part, due to the availability of curated data and research resources for well-represented languages, as well as NLP algorithms that can exploit this abundance of data. Languages with fewer resources face the double challenge of small amounts of data and algorithms that do not cater for this paucity of data.
“Over the last few years, there has been an increase in grassroots organisations involved in NLP in the Global South. They have brought with them renewed energy and a focus on low-resource languages,” says Dr Marivate. “We propose consolidating our work at UP, which has focused on creating NLP resources and new tools for South African languages. Our focus is on exploring approaches for efficient and effective language models and tools for South African languages.”
Questions that Dr Marivate aims to address include the following:
- How do we reduce the impact of data paucity for African NLP tasks?
- How do we use the similarities within languages to improve models for languages such as Sesotho and siSwati, which have very few resources?
- What are the main lessons to document for other researchers with similar challenges?
Over the last five years, Dr Marivate’s research team has investigated ways to improve the tools and resources available for resource-poor languages. “Given our location, we focused on South African languages as a base for our research. We have investigated augmentation methodologies for short text, developing word embeddings to assist with augmentation methods for low-resource languages, and curated new word nets for South African languages and cross-lingual models,” he says.
The research has the following goals:
- Consolidate our learning on low-resource NLP in South African languages
- Engage with linguists to audit the models from a linguistic perspective
- Release new or updated models and data sets
- Release documentation on the processes followed
A challenge faced by researchers working on African languages is the lack of a full end-to-end guide on how to approach an NLP task, curate the correct data, choose the best models, and train and evaluate those models. “To this end, our work aims to document and create a reusable template for tackling low-resource language tasks through an African language lens.” Dr Marivate’s team will focus on nine South African languages and three NLP tasks (news and document classification, named entity recognition and translation).