Posted on May 28, 2021
The Masakhane Natural Language Processing (NLP) research project has been named a joint winner of the inaugural 2021 Wikimedia Foundation Research Award of the Year. Dr Vukosi Marivate, ABSA Chair of Data Science at UP’s Faculty of Engineering, Built Environment and Information Technology, led the team, with members of UP’s Department of Computer Science working in collaboration with researchers from the African Master’s in Machine Intelligence programme of the African Institute for Mathematical Sciences in Ghana. They received the award for their research paper titled ‘Participatory research for low-resourced machine translation: A case study in African languages’.
The Wikimedia Foundation Research Award was established by the Wikimedia Foundation in 2021 to recognise recent research that has the potential to radically impact Wikimedia projects or research. According to the foundation, this paper along with the Masakhane community have attempted to fundamentally change the approach taken towards “low-resourced languages” in Africa. The foundation calls this research an inspiring example of working towards knowledge equity.
“This is a great honour, and it shows that we can innovate even in the ways we conduct research,” Dr Marivate said. “There will be more coming out of the project shortly, including a new African language translation service in the next few weeks, new datasets and new machine learning models.”
Although 2 000 of the world’s languages are African, African languages are barely represented in technology. This is exacerbated by the continent’s colonial past, which has been devastating for African languages in terms of their support, preservation and integration, and has resulted in a technological space that does not understand African names, cultures, places or history.
“This is important, beyond just the impact on Wikimedia, as language matters,” said Dr Marivate. “Being understood by other people is one of the things that unites families, tribes, nations and the African diaspora. It is essential when it comes to being included in both digital and societal environments. For example, there may be 11 official languages in South Africa, but only if a government minister makes a one-line statement in one of those languages, will it be translated, and even then, it is translated into English. Anyone speaking one of the other official languages will not have access to the speech or to government directives. They would not have translations of any value for education and medical scenarios either, and would have no way of making their voice heard or understood.”
The research paper outlines a novel approach to participatory research on machine translation for African languages. It describes how this approach can overcome the challenges these languages face in joining the web and in gaining access to some of the technologies from which other languages benefit today.
As part of the research, Dr Marivate and his team are working on methods that make it easier to build automated tools for processing local-language data, for tasks such as understanding communication in chat groups, automatically labelling local-language data and discovering patterns in local-language texts.
The project also gave rise to the establishment of the Masakhane community, a grassroots community that aims to develop NLP systems for Africa by Africans. Masakhane, which roughly translates as “we build together” in isiZulu, focuses on getting Africans to shape and own technological advances towards human dignity, wellbeing and equity through inclusive community building, open participatory research and multi-disciplinarity.
Members of the community provided data for the research project, and assisted in building models and testing sample translations to improve the accuracy of the machine translation tool. They represent the African countries and languages that form part of the research, and comprise experts from a range of relevant professions; these individuals include data scientists, researchers, language practitioners, translators and software developers.
As Dr Marivate points out, artificial intelligence needs data to function correctly. Humans who speak the language make a unique scientific contribution at the beginning, when data sets are collected. This means going beyond the walls of the institutions where the research is being done and taking it into the field. Artificial intelligence also requires learning opportunities to successfully translate text from one language to another. When it comes to translating French to English, there are huge data sets available for artificial intelligence to use. This is not true of African languages. Zulu, for example, is spoken by 0,16% of people on the planet.
“This has been a major roadblock for breakthroughs based on natural language processing,” Dr Marivate said. “It is not only translation that is affected; text to speech and vice versa is also a non-starter. All of this limits learning and inclusion across the continent within the broader global environment.”
Wikipedia founder Jimmy Wales acknowledged the essential and urgent need for the research when he announced the introduction of the Wikimedia Foundation Research Award of the Year at the online 2021 Wikimedia Workshop.
Dr Marivate described how a language’s prevalence within a society is ultimately attached to the people who speak this language, and where they live. Part of ensuring that the artificial intelligence is functioning correctly involves a human evaluation of the result. The next step is the release of the beta version of the machine translation model for more human feedback. “This research is one way in which Africa can increase its contribution across the globe,” he said.