NLP for Ghanaian Languages
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
NLP for Ghanaian Languages Paul Azunre1∗ , Salomey Osei2∗ , Salomey Afua Addo3∗ , Lawrence Asamoah Adu-Gyamfi∗ , Stephen Moore4∗ , Bernard Adabankah5∗ , Bernard Opoku6∗ , Clara Asare-Nyarko4∗ Samuel Nyarko15∗ , Cynthia Amoaba∗ , Esther Dansoa Appiah8∗ , Felix Akwerh2∗ Richard Nii Lante Lawson9∗ , Joel Budu10∗ , Emmanuel Debrah4∗ , Nana Boateng1∗ Wisdom Ofori∗, Edwin Buabeng-Munkoh∗ , Franklin Adjei11∗ , Isaac K. E. Ampomah12∗ Joseph Otoo13∗ , Reindorf Borkor2∗ , Standylove Birago Mensah2∗ , Lucien Mensah7∗ Mark Amoako Marcel∗ , Anokye Acheampong Amponsah14∗ , and James Ben Hayfron-Acquah2∗ . ∗ NLP Ghana,1 Algorine,2 Kwame Nkrumah University of Science and Technology, 3 Leuphana University Luneburg,4University of Cape Coast, 5 Edinburgh Napier University, arXiv:2103.15475v2 [cs.CL] 1 Apr 2021 6 Accra Institute of Technology,7 Tulane University, 8 University of Tromso, 9 AiMlCamp, 10 University of Strathclyde, 11 Azubi Africa, 12 Ulster University, 13 Centre for Research, Data Science and IT Solutions, 14 University of Energy and Natural Resources, 15 Integrated Geospatial Intelligence Application Centre, SRH Berlin University of Applied Science. Abstract This is not the case for Ghanaian languages and other languages across Africa, although the con- NLP Ghana is an open-source non-profit or- tinent has over 2000 languages and the highest ganisation aiming to advance the development density of languages in the world (Tiayon, 2005). and adoption of state-of-the-art NLP tech- niques and digital language tools to Ghanaian Many of these languages in Africa are yet to languages and problems. In this paper, we first evolve from simple online existence to optimal on- present the motivation and necessity for the ef- line presence (Tiayon, 2013). Though there seems forts of the organisation; by introducing some to be progress in the development of applications popular Ghanaian languages while presenting as far as Ghanaian languages are concerned, their the state of NLP in Ghana. We then present the usability is limited, leaving much room for im- NLP Ghana organisation and outline its aims, provement in their operationalisation and adoption scope of work, some of the methods employed into the Ghanaian tonal, multilingual and digitisa- and contributions made thus far in the NLP community in Ghana. tion systems (Martinus and Abbott, 2019; Kügler, 2016). 1 Introduction In the much-applauded interventions by Google The advancement in machine learning compu- and Microsoft through their translation services, tational power coupled with the recent invest- quite a number of African languages have been ment within the domain by technological com- integrated, but Ghanaian languages are excluded panies has stimulated considerable interest and (Google, 2020; Microsoft, 2021). A historic move brought about a legion of applications in nat- worth mentioning is Baidu Translate’s incorpora- ural language digitisation in developed coun- tion of the Twi language in their translation ser- tries, with much focus on the English language vice. Notwithstanding, its output in the Twi lan- (Martinus and Abbott, 2019). In fact, English is guage semantically is questionable, as it is of- the most computerised language in the world and ten does not make sense and truncates Twi words corpora in the language are best documented, due (Baidu, 2021). In fact, nothing can be more de- to availability of authentic electronic texts, and motivating than situations where professional writ- its usage as both mother tongue and second lan- ers who work with African languages, students, guage (Leech and Fligelstone, 1992; Kenny, 2014; tourists, among others, cannot use otherwise com- Martinus and Abbott, 2019; Varab and Schluter, monly available translation technology to perform 2020). simple tasks (Tiayon, 2005). Major challenges
boil down to lack of good (indeed often any) train- 3 Background Information on Ghanaian ing data, as well as lack of adoption of major Languages language technologies for local problems. This makes it difficult for the writing systems of many Ghana is a multilingual country with at least 75 African languages to effectively pass the tests of local languages (Eth, 2021). Gur languages are user-friendliness and internet visibility (Tiayon, spoken in the northern part by about 24% of the to- 2013). tal population while Kwa languages are spoken in the southern part by about 75% of the population Akorbi (2017) sufficiently underscores that not (Schneider et al., 2004). Research suggests that much funding is available for translation from about 51% of adults in Ghana are literate in both and into African languages. Moreover, there has English and an indigenous language, while smaller been general failure to use Ghanaian languages portions of the population are literate in either together with other African languages in various English only or Ghanaian language only (Adika, specialised fields. This has equally hindered their 2012). There are nine (9) government-sponsored development in the areas of electronic and online Ghanaian languages so far studied in Ghanaian resources, as well as human language technology educational institutions, namely, Akan (Akuapem (Shoba, 2018). Twi, Asante Twi and Fante dialects only), Da- NLP Ghana seeks to close some of the gaps that gaare, Dagbani, Dangme, Ewe, Ga, Gonja, Kasem, were just identified. While the focus is on Ghana- Mfantse and Nzema (Abokyi et al., 2018). ian languages, the tools and techniques are devel- Akan is the most commonly spoken Ghanaian oped with an additional goal of generalisability to language and the most used after English in the other low resource language scenarios. electronic public media. In some cases, it is used more than English (Brown; Adika, 2012). Lan- guages that belong to the same ethnic group are reciprocally understandable (Chen, 2014; Noels, 2 State of NLP in Ghana 2014). For instance, languages such as Dagbani and Mampelle, popularly spoken in the north- Ghanaian languages are yet to evolve to optimal ern part of Ghana are mutually intelligible with online presence and internet visibility. To date, the Frafra and Waali languages of the Upper there is no reliable machine translation system Regions of Ghana. These four languages are for any Ghanaian language (Google, 2021). This of Mole-Dagbani ethnicity. (Abokyi et al., 2018; makes it harder for the global Ghanaian diaspora Wikipedia, 2021). The chart in Table 1 shows to learn their own languages. Ghanaian languages language speaker data provided by Ethnologue and culture also risk not being preserved electron- (Campbell, 2008). ically in an increasingly digitised future. Con- Ghanaian Pidgin is noteworthy in that it is not sequently, service providers and health workers a government-sponsored language, and is an En- trying to reach remote areas hit by emergencies, glish Creole that is a variant of the West African disasters, etc. face needless additional obstacles Pidgin English. It is spoken heavily in the southern to providing life-saving care. Beyond transla- parts of the country and used heavily by the youth tion, fundamental tools for computational analy- on social media and online in general (Deumert, sis such as corpus-processing and analysis tools 2020; Suglo, 2015). (Schneider et al., 2004) iden- are lacking. Tools for summarisation, classifica- tifies two varieties which he describes as uned- tion, language detection, voice-to-text transcrip- ucated and educated/student varieties of Ghana- tion are limited, and in most cases completely non- ian Pidgin English. The former is usually used existent (Varab and Schluter, 2020). This is a ma- as lingua franca in highly multilingual contexts jor risk to Ghanaian national security. Availabil- while the latter is often used as in-group language ity of these tools is directly correlated with the so- to express solidarity. The main differences be- phistication and efficiency of cyber-security solu- tween them are lexical rather than structural. NLP tions that can be deployed to defend critical social, Ghana hopes to add Ghanaian Pidgin English to its cultural, and cyber infrastructure from both inter- projects based on the crucial role it plays in some nal and external threats (Chambers et al., 2018; of these Ghanaian communities. Siracusano et al., 2019). With respect to selecting languages for NLP
Languages Number of Speakers attributed to the region being far away from the Akan (Fante/Twi) 9,100,000 capital and therefore further away from the most Ghanaian Pidgin English 5,000,000 lucrative economic activities. Including these lan- Ewe 3,820,000 guages in NLP Ghana’s projects ensures better rep- Abron 1,170,000 resentation for equitable and sustainable develop- Dagbani 1,160,000 ment. Dangme 1,020,000 Dagaare 924,000 4 NLP Ghana Agenda Konkomba 831,000 NLP Ghana is an open-source movement of like- Ga 745,000 minded volunteers who have dedicated their skills Farefare 638,000 and time to building an ecosystem of: Kusaal 535,000 Mampruli 316,000 1. Open-source data sets. Gonja 310,000 2. Open-source computational methods. Sehwi 305,000 Nzema 299,000 3. NLP researchers, scientists and practitioners Wasa 273,000 excited about revolutionising and improving every aspect of Ghanaian life through this in- Table 1: Common government sponsored languages in creasingly powerful and influential technol- Ghana, (Wikipedia contributors, 2021) ogy. Although NLP Ghana is currently working on Ghana projects in general, representative lan- Ghanaian languages, particularly Akan due to guages from both the southern and northern parts number of speakers, the ultimate goal is to de- of the country are considered for full represen- velop language tools applicable throughout the tation. Several Ghanaian languages are impor- West African sub-region and beyond, complement- tant not only for Ghana, but also its surround- ing efforts such as (∀ et al., 2020). ing West African countries. Since enhanced re- NLP Ghana seeks to develop better data sources gional trade is an important target societal bene- to train state-of-the-art (SOTA) NLP techniques fit for the advances in language technologies we for Ghanaian languages and to contribute to adapt- are discussing, this is also taken into account. For ing SOTA techniques to work better in a lower re- instance, Akan is spoken in Côte d’Ivoire while source setting. In other words, it aims to build Ewe is spoken in Togo, Benin and Nigeria. More- functional systems for local applications such as over, Gurune or Frafra is equally spoken in Burk- a “Google Translate for the Ghanaian languages”. ina Faso (Deumert, 2020). The group equally seeks to train and benchmark Languages in the southern part of Ghana se- algorithms for a number of crucial tasks in these lected for initial exploration are Akan (Asante Twi, languages – translation, named entity recognition Akuapem Twi and Fante dialects), Ewe, Ga and (NER), POS tagging and sentiment analysis, train- Pidgin. Akan, Ewe and Pidgin are among the ing classical text embeddings such as word2vec most widely-spoken languages in Ghana, making and FastText, as well as fine-tuning contextual them straight-forward additions to the NLP Ghana embeddings such as BERT (Devlin et al., 2018), target languages (Deumert, 2020). Ga is spoken DistilBERT (Sanh et al., 2019) and RoBERTA as native language in the capital Accra and may (Liu et al., 2019). NLP Ghana is therefore open to be particularly valuable to international travelers. both local and international entity collaborations This makes Ga another straight-forward addition in the pursuit of its mission. to the initial target language set. 5 NLP Ghana Contributions Northern Ghanaian languages initially selected for NLP Ghana projects include Dagaare, Dag- NLP Ghana has developed translators between En- bani, Gonja and Gurune (Frafra). Northern Ghana glish and some of the most-widely spoken Ghana- is typically perceived as being poorer, with less lo- ian languages – Twi, Ewe and Ga. Our transla- cal language digitisation, education and resources tors are already available to the general Ghana- (Yaro and Hesselberg, 2010). This is sometimes ian public as a Web Application, as well as mo-
bile applications via the Google Play Store and the this by liaising with stakeholders, interacting with Apple Store. Response from the public has been product users and relaying feedback to teams to largely positive, suggesting the crucial need for ensure continuous improvement of products. such services. Two main streams have been used to collect Embeddings have also been developed for Akan data: crowd-sourcing and human-correction of as the most widely spoken Ghanaian language machine translated data from in-house translator (Azunre et al., 2021a). These include both static codes. The former is acquired by collecting vol- embeddings such as fastText (Bojanowski et al., untary responses of people through Google Form 2017) and contextual embeddings such as BERT surveys. This exercise allows people to translate (Devlin et al., 2018). Both models have been open- a set of randomly drawn English sentences into sourced and made available to the public via a few Akan. In total, about 697 sentence pairs have been lines of Python code. generated with this method. The latter generated As indicated earlier, NLP Ghana also aims to about 50, 000 preliminary translations which were produce large training data sets for Akan, Ewe, Ga distributed to well-qualified native speakers to re- and other Ghanaian languages – starting with the vise. This has yielded approximately 25, 000 trans- first three since a functional translator has been lations into Akuapem Twi which is inclined to- developed for these languages. Work on train- wards Asante Twi in terms of tone and vocabulary ing data for the Akuapem dialect of Akan was re- (Azunre et al., 2021b). cently completed as part of a collaboration with The process of collecting data and processing Zindi Africa. One of the goals of NLP Ghana is them is not without challenges. One of the major to create at least 50, 000 sentences for each target challenges has been the inability to employ pro- language, providing a reasonably sized data set for fessional translators to verify and review machine fine-tuning modern neural network architectures translations due to financial constraints. Moving on the data. forward, NLP Ghana hopes to extend data collec- Data used to augment internally-created data tion capabilities beyond text data to other forms for these projects include, but is not limited to, of data such as audio data (oral corpora) on sev- the JW300 multilingual data set (Agic and Vulic, eral Ghanaian languages. The group also aims to 2020) and the Bible (Adjeisaha1 et al., 2020). annotate data sets to enhance the works of NLP re- searchers in carrying out downstream NLP tasks 6 Participation and Methodology such as Named Entity Recognition (NER), Part of Speech (POS) tagging etc. This effort will re- Our volunteers mainly identify as students (both quire significant funding by various stakeholders graduate and undergraduate), ML researchers, to yield a good quantity of quality data. data scientists, mathematicians, engineers, lectur- ers, programmers and local language instructors. 7 Conclusion At the moment, member count exceeds one hun- dred (100). The combined skills of members are Research works on NLP have provided several in- utilised in various teams – Data, Engineering, Re- dispensable tools useful in this modern internet search and Communications – to shape and exe- age. This paper presented the state of NLP appli- cute the broad NLP Ghana agenda. cations to Ghanaian languages such as Akan, Ga, Specifically, the Data Team is responsible for and Ewe. One of the major challenges has been the data collection and storage while the Engineering lack of evaluation data sets to efficiently develop Team manages data, networks and platforms. The machine learning models for NLP tasks including Research Team leads the way directing technical machine translation, named entity recognition and agenda and devising strategies in an effort to op- document classification for Ghanaian languages. timise the execution of research programmes and NLP Ghana has built an open-source community meeting set Key Performance Indicators (KPIs). of researchers with different levels of expertise, The Research Team is further divided into Unsu- working together to develop data sources, tech- pervised Methods, Supervised Methods, and Eval- niques and models for Ghanaian and other low- uation, by corresponding technical area. The Com- resource languages. To this end, the group has al- munications Team is also responsible for internal ready open-sourced some models and data sets to and external information flow. The team does further research activities for Ghanaian languages.
Acknowledgments Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard We are grateful to the Microsoft for Startups Social Nii Lante Lawson, Joel Budu, Emmanuel De- Impact Program – for supporting this research ef- brah, Nana Boateng, Wisdom Ofori, Edwin fort via providing GPU compute through Algorine Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Es- sel Ampomah, Joseph Otoo, Reindorf Borkor, Research. We would like to thank Julia Kreutzer, Standylove Birago Mensah, Lucien Mensah, Jade Abbot and Emmanuel Agbeli for their con- Mark Amoako Marcel, Anokye Acheampong structive feedback. We are grateful to the review- Amponsah, and James Ben Hayfron-Acquah. 2021b. ers for their valuable comments. We would also English–twi parallel corpus for machine translation. 2nd AfricaNLP Workshop at EACL. like to thank the Ghanaian American Journal for their work in sharing our vision with the Ghanaian Baidu. 2021. Baidu Translate. Baidu Translate. [On- public. line; accessed 4-February-2021]. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with References subword information. Transactions of the Associa- tion for Computational Linguistics, 5:135–146. 2021. Ethnic and linguistic groups. https://www.britannica.com/place/Ghana/Plant-and-animal-life#ref55175. K-Ogilvie Brown. S.(2009). concise encyclopedia of Accessed: 2021-03-29. languages of the world. Samuel Nana Abokyi et al. 2018. The interface of Lyle Campbell. 2008. Ethnologue: Languages of the modern partisan politics and community conflicts in world. africa: the case of northern ghana conflicts. Nathanael Chambers, Ben Fry, and James McMasters. Gordon Senanu Kwame Adika. 2012. English in 2018. Detecting denial-of-service attacks from so- ghana: Growth, tensions, and trends. International cial media text: Applying nlp to computer security. Journal of Language, Translation and Intercultural In Proceedings of the 2018 Conference of the North Communication, 1:151–166. American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Michael Adjeisaha1, Guohua Liua, Volume 1 (Long Papers), pages 1626–1635. and Richard Nuetey Nortey. 2020. English twi parallel-aligned bible corpus for encoder-decoder based Lijiang machine Chen. 2014. translation. Machine translation model based Academia Journal of Scientific Research, 8(12). on non-parallel corpus and semi-supervised trans- ductive learning. arXiv preprint arXiv:1405.5654. Željko Agic and Ivan Vulic. 2020. Jw300: A wide- coverage parallel corpus for low-resource languages. Ana Deumert. 2020. Sub-saharan africa. The Rout- In Proceedings of the 57th Annual Meeting of the ledge Handbook of Pidgin and Creole Languages, Association for Computational Linguistics, pages page 15. 3204—-3210. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Akorbi. 2017. Demystifying the African Localiza- Kristina Toutanova. 2018. Bert: Pre-training of deep tion Industry – Challenges and Opportunities. bidirectional transformers for language understand- Demystifying the African Localization Industry. ing. arXiv:1810.04805. [Online; accessed 4-February-2021]. ∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Paul Azunre, Salomey Osei, Salomey Afua Addo, Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Fagbohungbe, Solomon Oluwole Akinola, Sham- Bernard Adabankah, Bernard Opoku, Clara suddee Hassan Muhammad, Salomon Kabongo, Sa- Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, lomey Osei, et al. 2020. Participatory research for Esther Dansoa Appiah, Felix Akwerh, Richard low-resourced machine translation: A case study in Nii Lante Lawson, Joel Budu, Emmanuel De- african languages. Findings of EMNLP. brah, Nana Boateng, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Es- Google. 2020. Google Language support. sel Ampomah, Joseph Otoo, Reindorf Borkor, Google Language Support. [Online; accessed Standylove Birago Mensah, Lucien Mensah, 1-February-2021]. Mark Amoako Marcel, Anokye Acheampong Google. 2021. Google Translate. Google Translate. Amponsah, and James Ben Hayfron-Acquah. 2021a. [Online; accessed 1-February-2021]. Contextual text embeddings for twi. 2nd AfricaNLP Workshop at EACL. Dorothy Kenny. 2014. Lexis and creativity in transla- tion: A corpus based approach. routledge. Paul Azunre, Salomey Osei, Salomey Afua Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Frank Kügler. 2016. Tone and intonation in akan. Into- Bernard Adabankah, Bernard Opoku, Clara nation in African tone languages, 24:89–129.
G Leech and S Fligelstone. 1992. Computers and cor- Wikipedia contributors. 2021. pus analysis. in: Cs butler (ed.), computers and writ- Languages of ghana — Wikipedia, the free encyclopedia. ten texts. [Online; accessed 15-January-2021]. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Joseph A Yaro and Jan Hesselberg. 2010. The contours dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, of poverty in northern ghana: policy implications for Luke Zettlemoyer, and Veselin Stoyanov. 2019. combating food insecurity. Institute of African Stud- Roberta: A robustly optimized bert pretraining ap- ies Research Review, 26(1):81–112. proach. arXiv:1907.11692. Laura Martinus and Jade Z Abbott. 2019. A focus on neural machine translation for african languages. arXiv preprint arXiv:1906.05685. Microsoft. 2021. Microsoft Translator. Microsoft Translator. [Online; accessed 4-February- 2021]. Kimberly A Noels. 2014. Language variation and ethnic identity: A social psychological perspective. Language & Communication, 35:88–96. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled ver- sion of bert: smaller, faster, cheaper and lighter. arXiv:1910.01108. Edgar W Schneider, Kate Burridge, Bernd Kortmann, Rajend Mesthrie, and Clive Upton. 2004. A hand- book of varieties of english: A multimedia reference tool two volumes plus CD-ROM. De Gruyter Mou- ton. Feziwe Martha Shoba. 2018. Exploring the use of par- allel corpora in the complilation of specialised bilin- gual dictionaries of technical terms: a case study of English and isiXhosa. Ph.D. thesis. Giuseppe Siracusano, Martino Trevisan, Roberto Gon- zalez, and Roberto Bifulco. 2019. Poster: on the application of nlp to discover relationships between malicious network entities. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 2641–2643. Ignatius Suglo. 2015. Language Attitudes towards Ghanaian Pidgin English among Students in Ghana. GRIN Verlag. Charles Tiayon. 2005. Community interpreting: an african perspective. Hermeneus: Revista de la Facultad de Traducción e Interpretación de Soria, (7):175–192. Charles Tiayon. 2013. About professional writ- ing and translation in African languages. Meta Glossia21 Blogpsot. [Online; accessed 1-February-2021]. Daniel Varab and Natalie Schluter. 2020. Danewsroom: A large-scale danish summarisation dataset. In Pro- ceedings of The 12th Language Resources and Eval- uation Conference, pages 6731–6739. Wikipedia. 2021. Language and Religion. http://www.ghanaembassy.org/index.php?page=language-and-religion. [Online; accessed 1-February-2021].
This figure "image.PNG" is available in "PNG" format from: http://arxiv.org/ps/2103.15475v2
You can also read