The Masakhane Natural Language Processing (NLP) research project has been named a joint winner of the inaugural 2021 Wikimedia Foundation Research Award of the Year. Dr Vukosi Marivate, ABSA Chair of Data Science in the Faculty of Engineering, Built Environment and Information Technology (EBIT) at the University of Pretoria (UP), led the team, with members of UP’s Department of Computer Science working in collaboration with researchers from the African Master’s in Machine Intelligence programme of the African Institute for Mathematical Sciences in Ghana. They received the award for their research paper titled ‘Participatory research for low-resourced machine translation: A case study in African languages.’
The Wikimedia Foundation Research Award of the Year was established by the Wikimedia Foundation in 2021 to recognise recent research with the potential to radically impact Wikimedia projects or research. According to the Foundation, this paper, and the Masakhane community, have attempted to fundamentally change the approach taken towards ‘low-resourced languages’ in Africa. The Foundation calls this research an inspiring example of work towards knowledge equity.
“This is a great honour, and it shows that we can innovate even in the ways we conduct research. There will be more coming out of the project shortly, including a new African language translation service in the next few weeks, new datasets, and new machine learning models,” Dr Marivate said.
Despite the fact that 2 000 of the world’s languages are African, African languages are barely represented in technology. This is further exacerbated by the continent’s colonialist past, which has been devastating for African languages in terms of their support, preservation and integration, and has resulted in a technological space that does not understand African names, cultures, places or history.
“This is important, beyond just the impact on Wikimedia, as language matters. Being understood by other people is one of the things that unites families, tribes, nations, and the African diaspora. It is essential when it comes to being included in both the digital and societal environments,” said Dr Marivate. “For example, there may be eleven official languages in South Africa, but only if a government minister makes a one-line statement in one of those languages, will it be translated, and even then, it is translated into English. Anyone speaking one of the other official languages will not have access to the speech, or to government directives. They would not have translations of any value for education and medical scenarios either, and would have no way of making their voice heard or understood.”
The research paper describes a novel approach to participatory research on machine translation for African languages. It shows how this approach can overcome the challenges these languages face in joining the web and in gaining access to some of the technologies from which other languages benefit today.
As part of the research, Dr Marivate and his team are working on methods to more easily build automated tools to process local language data for tasks such as understanding communication on chat groups, automated labelling of local language data, and discovering patterns in local language texts.
Going beyond the walls of institutions
The project also gave rise to the establishment of the Masakhane community, a grassroots community that aims to develop NLP systems for Africa by Africans. Masakhane, which roughly translates as “we build together” in isiZulu, has as its goal for Africans to shape and own technological advances towards human dignity, wellbeing and equity through inclusive community building, open participatory research and multidisciplinarity.
Members of the Masakhane community provided data for the research project, and also assisted in building models and testing sample translations to improve the accuracy of the machine translation tool. They represent the African countries and languages that form part of the research, and comprise individuals from a range of relevant professions, including data scientists, researchers, language practitioners, translators, and software developers.
As Dr Marivate points out, artificial intelligence (AI) needs data to function correctly. Humans who speak the language make a unique scientific contribution at the beginning, when the data sets are collected. This means going beyond the walls of institutions, where the research is being done, and taking it into the field. AI also requires learning opportunities to successfully translate text from one language to another. When it comes to translating French to English, there are huge data sets available for the AI to use. This is not true of African languages. Zulu, for example, is spoken by 0,16% of people on the planet.
“This has been a major roadblock for breakthroughs based on natural language processing. It is not only translation that is affected. Text to speech and vice versa is also a non-starter. All of which limits learning and inclusion across the continent within the broader global environment,” Dr Marivate said.
Wikipedia founder Jimmy Wales acknowledged the essential and urgent need for the research when he announced the Wikimedia Foundation Research Award of the Year at the online 2021 Wikimedia Workshop.
Dr Marivate described how a language’s prevalence within a society is ultimately attached to the people who speak this language, and where they live. Part of ensuring that the AI is functioning correctly is a human evaluation of the result. The next step is the release of the beta version of the machine translation model for more human feedback. “This research is one way in which Africa can increase its contribution across the globe,” he said.
As a chief investigator, Dr Marivate has co-authored several papers with the team on machine learning, natural language processing, social media, society, and web technologies.