Posted on September 11, 2025
Beyond the symbols: Natural language processing as an adaptive problem
Prof Vukosi Marivate, Chairholder of the Absa-UP Chair of Data Science at the University of Pretoria (UP) and Professor of Computer Science in the Faculty of Engineering, Built Environment and Information Technology, delivered his inaugural address on 26 August 2024. The title of his presentation was “Beyond the symbols: Natural language processing as an adaptive problem”.
Prof Marivate has worked on projects related to science, energy, public safety and utilities. He is a co-founder of Lelapa AI, an African start-up focused on AI for Africans by Africans, and leads the African Institute for Data Science and Artificial Intelligence (AfriDSAI), a transdisciplinary research institute hosted at UP. He specialises in developing machine learning (ML) and artificial intelligence (AI) methods to extract insights from data, with a particular focus on the intersection of ML/AI and natural language processing (NLP). His research is dedicated to improving the methods, tools and availability of data for local or low-resourced languages. As the leader of the Data Science for Social Impact Research Group in the Department of Computer Science, he uses data science to solve social challenges.
His global recognition is exemplified by his recent receipt of a $1-million (R18-million) donation from Google to boost African-led AI research. It will directly support catalytic activities at AfriDSAI, including fellowships for master’s and doctoral students and for postdoctoral researchers conducting research in priority sectors such as health care, NLP and climate resilience, as well as the promotion of ethical and inclusive AI practices across the continent.
Since 2015, he has worked to address the limitations of NLP for African and other low-resourced languages. He explains that, despite the global growth of AI, most NLP systems remain rooted in data-rich languages, excluding linguistic and cultural contexts around the world. “This disconnect is not only technical, but also historical.” African languages have long been under-represented in digital infrastructure, resulting in systemic gaps in research, resources and applications.
In his address, he reflected on his research journey, which includes developing techniques for dealing with data scarcity, improving models to adapt rather than dominate, and working to expand available datasets through collaborative data creation. In many African countries, local languages have historically been looked down upon as a consequence of colonialism. As a result, one of the largest gaps in the availability of data is in language. English accounts for 53% of internet content, whereas local African languages account for only 0.02%. He says that the challenges faced by low-resourced languages are analogous to a high-interest credit card: “The more we wait to resolve the challenges, the more expensive it becomes to pay off the debt.”
He has found that global AI/NLP trends are inclined to overlook African contexts. Yet language is a critical entry point to equity and relevance in technology. He believes that, by working with what we have, increasing and protecting our data, assembling our armies and redefining our science, the mountain can be moved with collective involvement. This forms part of the research focus of the Data Science for Social Impact Research Group. Here, researchers engage in multidisciplinary research, establish resources to develop language models, develop AI policies and engage in partnerships with industry.
With an understanding of the state of our languages, the first step is to “work with what we have”. Research aimed at augmenting data has included improving short text classification through global augmentation methods; tackling Afro-centric code-mixed data scarcity; examining the impact of speaker diversity, sentence distribution and augmentative strategies on automatic speech recognition performance, and the implications for future data collection in low-resourced settings; and investigating the efficacy of large language models in reflective assessment methods through chain-of-thought prompting.
The next step is to expand the data that has been collected. One of the many outputs of this is the African Next Voices Project, which entails collecting 3 000 hours of speech data across seven South African languages (isiZulu, isiXhosa, Xitsonga, Setswana, Sesotho, Tshivenda and isiNdebele) to enhance AI models and language technologies. This effort will support the development of robust speech recognition and synthesis systems, improving access to digital services and educational resources in these languages. Several large language models are also being developed for story generation, machine translation, diagnostics and disease prediction, teaching and learning, and agricultural extension in African languages.
“In addition to increasing data in African languages, it is important to protect our data, people and languages,” says Prof Marivate. “Data is about people. It is important to engage in discussions about equitable licensing if we are to protect sources of data, protect our heritage, enable an ecosystem and balance the power dynamics.” In this regard, the Data Science for Social Impact Research Group collaborates with the University of Pretoria’s Data Science Law Lab, under the supervision of Prof Chijioke Okorie. A project related to the development of open-source tools entailed addressing inequitable openness in licences for sharing African data and datasets through the Nwulite Obodo open data licence.
Prof Marivate calls the next step “assembling our armies”. This takes place by establishing interdisciplinary research communities, such as the Masakhane Research Foundation. This is a grassroots research organisation that aims to strengthen and spur NLP research in African languages, for Africans, by Africans. Its goal is for Africans to shape and own these technological advances towards human dignity, wellbeing and equity through inclusive community building, and open, participatory and multidisciplinary research. It has a community of more than 3 000 members, representing more than 150 African languages and more than 200 datasets.
“We also need to get the tools into people’s hands,” he says. Several inclusive large language models have been developed that understand and respond to African languages. This is a significant step forward in promoting linguistic diversity and improving the accessibility of AI to different linguistic communities across the continent.
What Prof Marivate calls “moving the mountain” is encapsulated in the growth of African NLP research. Such research is heavily concentrated on the most widely spoken African languages, as well as African accents of widely spoken non-indigenous languages and code-mixing efforts, particularly in North African Arabic dialects and South African languages.
The largest platform for sharing insights into ML/AI and NLP for low-resourced languages is the Deep Learning Indaba. Prof Marivate co-founded this platform, the leading grassroots ML and AI conference on the African continent, which aims to strengthen African machine learning. Its vision is for Africans to become critical contributors, owners and shapers of coming advances in AI and ML. “The 2025 Indaba included 150 communities, with 1 294 registered participants, of which 40.3% were women. Almost 400 participants were supported in their attendance through travel grants to capacitate the youth.”
With the establishment of AfriDSAI at the University of Pretoria, he believes that the powerful tools of AI and data science can be leveraged to better utilise African talent and data through UP’s cutting-edge research, capacity building and convening power. Prof Marivate summarises the future of NLP as follows: “If we fund and support diverse African students, early-career faculty members and ecosystem builders, we can build enduring AI capacity, locally relevant research and equitable technological outcomes, because talent and ideas in Africa are constrained by access, mentorship and institutional support.”
Looking ahead, he believes his work will focus on designing adaptable NLP systems that can evolve with the languages and communities they serve. “This includes advancing model architectures that require minimal data, creating evaluation frameworks that prioritise linguistic diversity, and incorporating ethical, community-led principles into NLP research and deployment.” By treating NLP as an adaptive problem rather than just a technical one, he argues for a future where language technology is shaped not just by scale, but by equity, participation and relevance.
Copyright © University of Pretoria 2025. All rights reserved.