WIKIMEDIA FOUNDATION RESEARCH AWARD OF THE YEAR IN 2021 - PARTICIPATORY RESEARCH FOR LOW-RESOURCED MACHINE TRANSLATION: A CASE STUDY IN AFRICAN LANGUAGES
Kathleen Siminyu

© 2021, KATHLEEN SIMINYU

This work is licensed under the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/legalcode), which permits unrestricted use, distribution, and reproduction, provided the original work is properly credited.

IDRC Grant / Subvention du CRDI: 109187-002 - Laying the foundations for artificial intelligence for development (AI4D) in Africa

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

arXiv:2010.02353v2 [cs.CL] 6 Nov 2020

∀∗, Wilhelmina Nekoto1, Vukosi Marivate2, Tshinondiwa Matsila1, Timi Fasubaa3, Tajudeen Kolawole4, Taiwo Fagbohungbe5, Solomon Oluwole Akinola6, Shamsuddeen Hassan Muhammad7,39, Salomon Kabongo4, Salomey Osei4, Sackey Freshia8, Rubungo Andre Niyongabo9, Ricky Macharm10, Perez Ogayo11, Orevaoghene Ahia12, Musie Meressa13, Mofe Adeyemi14, Masabata Mokgesi-Selinga15, Lawrence Okegbemi5, Laura Jane Martinus16, Kolawole Tajudeen4, Kevin Degila17, Kelechi Ogueji12, Kathleen Siminyu18, Julia Kreutzer19, Jason Webster20, Jamiil Toure Ali1, Jade Abbott21, Iroro Orife3, Ignatius Ezeani38, Idris Abdulkabir Dangana23,7, Herman Kamper24, Hady Elsahar25, Goodness Duru26, Ghollah Kioko27, Espoir Murhabazi1, Elan van Biljon12,24, Daniel Whitenack28, Christopher Onyefuluchi29, Chris Emezue40, Bonaventure Dossou31, Blessing Sibanda32, Blessing Itoro Bassey4, Ayodele Olabiyi33, Arshath Ramkilowan34, Alp Öktem35, Adewale Akinfaderin36, Abdallah Bashir37

∗Masakhane, Africa; 1 Independent, 2 University of Pretoria, 3 Niger-Volta LTI, 4 African Masters in Machine Intelligence, 5 Federal University of Technology, Akure, 6 University of Johannesburg, 7 Bayero University, Kano, 8 Jomo Kenyatta University of Agriculture and Technology, 9 UESTC, 10 Siseng Consulting Ltd, 11 African Leadership University, 12 InstaDeep Ltd, 13 Sapienza University of Rome, 14 Udacity, 15 Parliament, Republic of South Africa, 16 Explore Data Science Academy, 17 UCD, Konta, 18 AI4Dev, 19 Google Research, 20 Percept, 21 Retro Rabbit, 23 Di-Hub, 24 Stellenbosch University, 25 Naver Labs Europe, 26 Retina, AI, 27 Lori Systems, 28 SIL International, 29 Federal College of Dental Technology and Therapy, Enugu, 31 Jacobs University, 32 Namibia University of Science and Technology, 33 Data Science Nigeria, 34 Praekelt Consulting, 35 Translators without Borders, 36 Amazon, 37 Max Planck Institute for Informatics, University of Saarland, 38 Lancaster University, 39 University of Porto, 40 Technical University of Munich

masakhane-nlp@googlegroups.com
Abstract

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. “Low-resourced”-ness is a complex problem going beyond data availability and reflects systemic problems in society.

In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

∗∀ to represent the whole Masakhane community.
1   Introduction

Language prevalence in societies is directly bound to the people and places that speak this language. Consequently, resource-scarce languages in an NLP context reflect the resource scarcity in the society from which the speakers originate (McCarthy, 2017). Through the lens of a machine learning researcher, “low-resourced” identifies languages for which few digital or computational data resources exist, often classified in comparison to another language (Gu et al., 2018; Zoph et al., 2016). However, to the sociolinguist, “low-resourced” can be broken down into many categories: low density, less commonly taught, or endangered, each carrying slightly different meanings (Cieri et al., 2016). In this complex definition, the “low-resourced”-ness of a language is a symptom of a range of societal problems, e.g. authors oppressed by colonial governments have been imprisoned for writing novels in their languages, impacting the publications in those languages (Wa Thiong’o, 1992), or that fewer PhD candidates come from oppressed societies due to low access to tertiary education (Jowi et al., 2018). This results in fewer linguistic resources and researchers from those regions to work on NLP for their language. Therefore, the problem of “low-resourced”-ness relates not only to the available resources for a language, but also to the lack of geographic and language diversity of NLP researchers themselves.

The NLP community has awakened to the fact that it has a diversity crisis in terms of limited geographies and languages (Caines, 2019; Joshi et al., 2020): Research groups are extending NLP research to low-resourced languages (Guzmán et al., 2019; Hu et al., 2020; Wu and Dredze, 2020), and workshops have been established (Haffari et al., 2018; Axelrod et al., 2019; Cherry et al., 2019).

We scope the rest of this study to machine translation (MT) using parallel corpora only, and refer the reader to Joshi et al. (2019) for an assessment of low-resourced NLP in general.

Language          Articles     Speakers         Category
English           6,087,118    1,268,100,000    Winner
Egyptian Arabic   573,355      64,600,000       Hopeful
Afrikaans         91,002       17,500,000       Rising Star
Kiswahili         59,038       98,300,000       Rising Star
Yoruba            32,572       39,800,000       Rising Star
Shona             5,505        9,000,000        Scraping by
Zulu              2,219        27,800,000       Hopeful
Igbo              1,487        27,000,000       Scraping by
Luo               0            4,200,000        Left-behind
Fon               0            2,200,000        Left-behind
Dendi             0            257,000          Left-behind
Damara            0            200,000          Left-behind

Table 1: Sizes of a subset of African language Wikipedias1, speaker populations2, and categories according to Joshi et al. (2020) (28 May 2020).

Contributions. We diagnose the problems of MT systems for low-resourced languages by reflecting on what agents and interactions are necessary for a sustainable MT research process. We identify which agents and interactions are commonly omitted from existing low-resourced MT research, and assess the impact that their exclusion has on the research. To involve the necessary agents and facilitate required interactions, we propose participatory research to build sustainable MT research communities for low-resourced languages. The feasibility and scalability of this method is demonstrated with a case study on MT for African languages, where we present its implementation and outcomes, including novel translation datasets, benchmarks for over 30 target languages contributed and evaluated by language speakers, and publications authored by participants without formal training as scientists.

2   Background

Cross-lingual Transfer. With the success of deep learning in NLP, language-specific feature design has become rare, and cross-lingual
transfer methods have come into bloom (Upadhyay et al., 2016; Ruder et al., 2019) to transfer progress from high-resourced to low-resourced languages (Adams et al., 2017; Wang et al., 2019; Kim et al., 2019). The most diverse benchmark for multilingual transfer by Hu et al. (2020) allows measurement of the success of such transfer approaches across 40 languages from 12 language families. However, the inclusion of languages in the set of benchmarks is dependent on the availability of monolingual data for representation learning with previously annotated resources. The content of the benchmark tasks is English-sourced, and human performance estimates are taken from English. Most cross-lingual representation learning techniques are Anglo-centric in their design (Anastasopoulos and Neubig, 2019).

Multilingual Approaches. Multilingual MT (Dong et al., 2015; Firat et al., 2016a,b; Wang et al., 2020) addresses the transfer of MT from high-resourced to low-resourced languages by training multilingual models for all languages at once. Aharoni et al. (2019) and Arivazhagan et al. (2019) train models to translate between English and 102 languages, for the 10 most high-resourced African languages on private data, and otherwise on public TED talks (Qi et al., 2018). Multilingual training often outperforms bilingual training, especially for low-resourced languages. However, with multilingual parallel data being also Anglo-centric, the capabilities to translate from English versus into English vastly diverge (Zhang et al., 2020).

Another recent approach, mBART (Liu et al., 2020), leverages both monolingual and parallel data and also yields improvements in translation quality for lower-resource languages such as Nepali, Sinhala and Gujarati.3 While this provides a solution for small quantities of training data or monolingual resources, the extent to which standard BLEU evaluations reflect translation quality is not clear yet, since human evaluation studies are missing.

3 Note that these languages have more digital resources available and a longer history of written texts than the low-resourced languages we are addressing here.

Targeted Resource Creation. Guzmán et al. (2019) develop evaluation datasets for low-resourced MT between English and Nepali, Sinhala, Khmer and Pashto. They highlight many problems with low-resourced translation: tokenization, content selection, and translation verification, illustrating increased difficulty translating from English into low-resourced languages, and highlight the ineffectiveness of accepted state-of-the-art techniques on morphologically-rich languages. Despite involving all agents of the MT process (Section 3), the study does not involve data curators or evaluators that understood the languages involved, and resorts to standard MT evaluation metrics. Additionally, how this effort-intensive approach would scale to more than a handful of languages remains an open question.

3   The Machine Translation Process

We reflect on the process enabling sustainable MT research on parallel corpora, in terms of the required agents and interactions, visualized in Figure 1. Content creators, translators, and curators form the dataset creation process, while the language technologists and evaluators are part of the model creation process. Stakeholders (not displayed) create demand for both processes.

Stakeholders are people impacted by the artifacts generated by each agent in the MT process, and can typically speak and read the source or the target languages. To benefit from MT systems, the stakeholders need access to technology and electricity.

Content Creators produce content in a language, where content is any digital or non-digital representation of language. For digital content, content creators require keyboards, and access to technology.
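To make the dependency structure concrete, the per-agent resource requirements described in this section (and for the remaining agents defined below) can be tabulated and checked against what a given community actually has. The following is our own illustrative sketch, with invented identifiers; it is not an artifact released by the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """One agent in the MT process and the resources it requires (Section 3)."""
    name: str
    requires: frozenset


# Agents and the resources the text names for each of them.
AGENTS = [
    Agent("stakeholder", frozenset({"technology", "electricity"})),
    Agent("content creator", frozenset({"keyboards", "technology"})),
    Agent("translator", frozenset({"content", "keyboards", "technology"})),
    Agent("curator", frozenset({"content", "translations"})),
    Agent("language technologist",
          frozenset({"datasets", "preprocessors", "MT toolkits", "compute"})),
    Agent("evaluator", frozenset({"quality metrics", "evaluation datasets"})),
]


def resource_gaps(available):
    """Map each agent to the required resources missing from `available`."""
    return {a.name: set(a.requires - frozenset(available)) for a in AGENTS}


# Example: a community with device access and compute, but no curated data yet.
gaps = resource_gaps({"technology", "electricity", "keyboards", "compute"})
```

Such a tabulation makes visible, for instance, that without existing content the translator role cannot function, which is exactly the kind of broken interaction Section 3.1 discusses.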
Translators translate the original content, including crowd-workers, researchers, or translation professionals. They must understand the language of the content creator and the target language. A translator needs content to translate, provided by content creators. For digital content, the translator requires keyboards and technology access.

Curators are defined as individuals involved in the content selection for a dataset (Bender and Friedman, 2018), requiring access to content and translations. They should understand the languages in question for quality control and encoding information.

Language Technologists are defined as individuals using datasets and computational linguistic techniques to produce MT models between language pairs. Language technologists require language preprocessors, MT toolkits, and access to compute resources.

Evaluators are individuals who measure and analyse the performance of an MT model, and therefore need knowledge of both source and target languages. To report on the performance of models, evaluators require quality metrics, as well as evaluation datasets. Evaluators provide feedback to the Language Technologists for improvement.

Figure 1: The MT Process, in terms of the necessary agents, interactions and external constraints and demand (excluding stakeholders).

3.1   Limitations of Existing Approaches

If we place a high-resource MT pair such as English-to-French into the process defined above, we observe that each agent nowadays has the necessary resources and historical stakeholder demand to perform their role effectively. A “virtuous cycle” emerged where available content enabled the development of MT systems that in turn drove more translations, more tools, more evaluation and more content, which cycled back to improving MT systems.

By contrast, parts of the process for existing low-resourced MT are constrained. Historically, many low-resourced languages had low demand from stakeholders for content creation and translation (Wa Thiong’o, 1992). Due to missing keyboards or limited access to technology, content creators were not empowered to write digital content (Adam, 1997; van Esch et al., 2019). This is a chicken-or-egg problem, where existing digital content in a language would attract more stakeholders, which would incentivize content creators (Kaffee et al., 2018). As a result, primary data sources for NLP research, such as Wikipedia, often have only a few hundred articles for low-resourced languages despite large speaker populations, see Table 1. Due to limited demand, existing translations are often domain-specific and small in size, such as the JW300 corpus (Agić and Vulić, 2019), whose content was created for missionary purposes.

When data curators are not part of the societies from where these languages originate, they are often unable to identify data sources or translators for languages, prohibiting them from checking the validity of the created resource. This creates problems in encoding, orthography or alignment, resulting in noisy or incorrect translation pairs (Taylor et al., 2015). This is aggravated by the fact that many low-resourced languages do not have a long written history to draw from and therefore might be less standardized and use multiple scripts. In collaboration with content creators, data curators can contribute to standardization or at least recognize potential issues for data processing further down the line.

As discussed in Section 1, language technologists are fewer in low-resourced societies. Furthermore, the techniques developed in high-resourced societies might be inapplicable due to compute, infrastructure or time constraints. Aside from the problem of education and complexity, existing techniques may not apply due to linguistic and morphological differences in the languages, or the scale, domain, or quality of the data (Hu et al., 2020; Pires et al., 2019).

Evaluators usually resort to potentially unsuitable automatic metrics due to time constraints or missing connections to stakeholders (Guzmán et al., 2019). The main evaluators of low-resourced NLP that is developed today typically cannot use human metrics due to the inability to speak the languages, or the lack of reliable crowdsourcing infrastructure, identified as one of the core weaknesses of previous approaches (in Section 2).

In summary, many agents in the MT process for low-resourced languages are either missing invaluable language and societal knowledge, or the necessary technical resources, knowledge, connections, and incentives to form interactions with other agents in the process.

3.2   Participatory Research Approach

We propose one way to overcome the limitations in Section 3.1: ensuring that the agents in the MT process originate from the countries where the low-resourced languages are spoken or can speak the low-resourced languages. Where this condition cannot be satisfied, at least a knowledge transfer between agents should be enabled. We hypothesize that using a participatory approach will allow researchers to improve the MT process by iterating faster and more effectively.

Participatory research, unlike conventional research, emphasizes the value of research partners in the knowledge-production process, where the research process itself is defined collaboratively and iteratively. The “participants” are individuals involved in conducting research without formal training as researchers. Participatory research describes a broad set of methodologies, organised in terms of the level of participation. At the lowest level is crowd-sourcing, where participants are involved solely in data collection. The highest level, extreme citizen science, involves participation in the problem definition, data collection, analysis and interpretation (English et al., 2018).

Crowd-sourcing has been applied to low-resourced language data collection (Ambati et al., 2010; Guevara-Rukoz et al., 2020; Millour and Fort, 2018), but existing studies highlight how the disconnect between the data creation process and model creation process causes challenges. In seeking to create cross-disciplinary teams that emphasize the values in a societal context, a participatory approach which involves participants in every part of the scientific process appears pertinent to solving the problems for low-resourced languages highlighted in Section 3.1.

To show how more involved participatory research can benefit low-resource language translation, we present a case study in MT for African languages.

4   Case Study: Masakhane

Africa currently has 2144 living languages (Eberhard et al., 2019). Despite this,
African languages account for a small fraction of available language resources, and NLP research rarely considers African languages. In the taxonomy of Joshi et al. (2020), African languages are assigned categories ranging from “The Left Behinds” to “The Rising Stars”, with most languages not having any annotated data. Even monolingual resources are sparse, as shown in Table 1.

In addition to a lack of NLP datasets, the African continent lacks NLP researchers. In 2018, only five of the 2695 affiliations at the five major NLP conferences were from African institutions (Caines, 2019). ∀ et al. (2020) attribute this to a culmination of circumstances, in particular their societal embedding (Alexander, 2009) and socio-economic factors, hindering participation in research activities and events, leaving researchers disconnected and distributed across the continent. Consequently, existing data resources are harder to discover, especially since these are often published in closed journals or are not digitized (Mesthrie, 1995).

For African languages, the implementation of a standard crowd-sourcing pipeline, as for example used for collecting task annotations for English, is at the current stage infeasible, due to the challenges outlined in Section 3 and above. Additionally, no standard MT evaluation set for all of the languages in focus exists, nor are there prior published systems that we could compare all models against for a more insightful human evaluation. We therefore resort to intrinsic evaluation, and rely on this work becoming the first benchmark for future evaluations.

We invite the reader to adopt a meta-perspective of this case study as an empirical experiment: the hypothesis is that participatory research can facilitate low-resourced MT development; the experimental methodology is the strategies and tools employed to bring together distributed participants, enabling each language speaker to train, contribute, and evaluate their models. The experiment is evaluated in terms of the quantity and diversity of participants and languages, and the variety of research artifacts, in terms of benchmarks, human evaluations, publications, and the overall health of the community. While a set of novel human evaluation results are presented, they serve as demonstration of the value of a participatory approach, rather than the empirical focus of the paper.

4.1   Methodology

To overcome the challenge of recruiting participants, a number of strategies were employed. Starting from local demand at a machine learning school (Deep Learning Indaba (Engelbrecht, 2018)), meetups and universities, distant connections were made through Twitter, conference workshops,4 and eventually press coverage5 and research publications.6 To overcome the limited tertiary education enrollments in Sub-Saharan Africa (Jowi et al., 2018), no prerequisites were placed on researchers joining the project. For the agents outlined in Section 3, no fixed roles are imposed onto participants. Instead, they join with a specific interest, background, or skill aligning them best to one or more of the agents. To obtain cross-disciplinarity, we focus on the communication and interaction between participants to enable knowledge transfer between missing connections (identified in Section 3.1), allowing a fluidity of agent roles.

4 ICLR AfricaNLP 2020: https://africanlp-workshop.github.io/
5 https://venturebeat.com/2019/11/27/the-masakhane-project-wants-machine-translation-and-ai-to-transform-africa/
6 https://github.com/masakhane-io/masakhane-community/blob/master/publications.md

For example, someone who initially joined with the interest of using
machine translation for their local language (as a stakeholder) to translate education material, might turn into a junior language technologist when equipped with tools and introductory material and mentoring, and guide content creation more specifically for resources needed for MT.

To bridge large geographical divides, the community lives online. Communication occurs on GitHub and Slack, with weekly video conference meetings and reading groups. Meeting notes are shared openly so that continuous participation is not required and time commitment can be organized individually. Sub-interest groups have emerged in Slack channels to allow focused discussions. Agendas for meetings and reading groups are public and democratically voted upon. In this way, the research questions evolve based on stakeholder demands, rather than being imposed upon by external forces.

The lack of compute resources and prior exposure to NLP is overcome by providing tutorials for training a custom-size Transformer model with JoeyNMT (Kreutzer et al., 2019) on Google Colab.7 International researchers were not prohibited from joining. As a result, mutual mentorship relations emerged, whereby international researchers with more language technology experience guided research efforts and enabled data curators or translators to become language technologists. In return, African researchers introduced the international language technologists to African stakeholders, languages and context.

4.2   Research Outcomes

Participants. A growth to over 400 participants of diverse disciplines, from at least 20 countries, has been achieved within the past

detailed demographics of a subset of participants from a voluntary survey in February 2020. 86.5% of participants responded positively when asked if the community helped them find mentors or collaborators, indicating that the health of the community is positive. This is also reflected in joint research publications of new groups of collaborators.

Research Artifacts. As a result of mentorship and knowledge exchange between agents of the translation process, our implementation of participatory research has produced artifacts for NLP research, namely datasets, benchmarks and models, which are publicly available online.8 Additionally, over 10 participants have gone on to publish works addressing language-specific challenges at conference workshops, such as (Dossou and Emezue, 2020; Orife, 2020; Orife et al., 2020; Öktem et al., 2020; Van Biljon et al., 2020; Martinus et al., 2020; Marivate et al., 2020).

Dataset Creation. The dataset creation process is ongoing, with new initiatives still emerging. We showcase a few initiatives below to demonstrate how bridging connections between agents facilitates the MT process.

  1. A team of Nigerian participants, driven by the internal demand to ensure that accessible and representative data of their culture is used to train models, are translating their own writings including personal religious stories and undergraduate theses into Yoruba and Igbo.9

  2. A Namibian participant, driven by a passion to preserve the culture of the Damara, is hosting collaborative sessions with Damara speakers, to collect and trans-
                                                        late phrases that reflect Damara culture
year, suggesting the participant recruitment
                                                      8
process was effective. Appendix A contains            https://github.com/masakhane-io
                                                      9
                                                      https://github.com/masakhane-
   7
       https://colab.research.google.com           io/masakhane-wazobia-dataset
around traditional clothing, songs, and prayers.10

  3. Creating a connection between a translator in South Africa's parliament and a language technologist has enabled a data curation process, allowing access to data from the parliament in South Africa's languages (which are public but obfuscated behind internal tools).11

These stories demonstrate the value of including curators, content creators, and translators as participants.

Benchmarks. We publish 46 benchmarks for neural translation models from English into 39 distinct African languages, from French into two additional languages, and from three further languages into English.12 Most were trained on the JW300 corpus (Agić and Vulić, 2019). From this corpus, we select the English sentences most commonly found across all languages (and longer than 4 tokens) as a global set of test sources. For individual languages, test splits are composed by selecting the translations available for this subset. While this biases the test set towards frequent segments, it prevents cross-lingual overlap between training and test data, which has to be ensured for cross-lingual transfer learning. For training data, other sources such as Autshumato (McKellar, 2014), TED (Cettolo et al., 2012), SAWA (De Pauw et al., 2009), Tatoeba,13 Opus (Tiedemann, 2012), and data translated or curated by participants were added. Language pairs were selected based on the individual demands of each of the 33 participants, who voluntarily contributed the benchmarks they valued most. 16 of the selected target languages are categorized as "Left-behind" and 11 as "Scraping by" in the taxonomy of Joshi et al. (2020). The benchmarks are hosted publicly, including model weights, configurations and preprocessing pipelines, for full reproducibility. Benchmarks are submitted by individual participants or groups of participants in the form of a GitHub pull request. This ensures that benchmark contributors can be contacted and retain a sense of ownership.

4.3 Human MT Evaluation

To our knowledge, there is no prior research on human evaluation specifically for machine translations of low-resourced languages. Until now, NLP practitioners were left to hope that evaluation methodologies successful for high-resource languages would transfer well to low-resourced languages. This lack of study is due to the missing connections between the community of speakers (content creators and translators) and the language technologists. Human MT evaluations are often done either within a group of researchers from the same lab or field (e.g. for WMT evaluations14) or via crowdsourcing platforms (Ambati and Vogel, 2010; Post et al., 2012). Speakers of low-resourced languages are traditionally underrepresented in these groups, which makes such studies even harder (Joshi et al., 2019; Guzmán et al., 2019).

One might argue that human evaluation should not be attempted before reaching a viable state of quality, but we found that early evaluation improves understanding of the individual challenges of the target languages, strengthens the network of the community and, most importantly, improves the connection and knowledge transfer between language technologists, content creators and

   10 https://github.com/masakhane-io/masakhane-khoekhoegowab
   11 http://bit.ly/raw-parliamentary-translations
   12 Benchmark scores can be found in Appendix C.
   13 https://tatoeba.org/
   14 http://www.statmt.org/wmt19/
curators.

The "low-resourced"-ness of the addressed languages poses challenges for evaluation beyond interface design or the recruitment of evaluators proficient in the target language. In the case of Igbo, for example, evaluators had to find solutions for typing diacritics without a suitable keyboard. In addition, Igbo has many dialects and variations of which the MT model is uninformed. Medical or technical terminology (e.g., "data") is difficult to translate, and whether to use loan words required discussion. Target-language news websites were found to be useful for resolving standardization or terminology questions. Solutions for each language were shared and were often also applicable to other languages.

Data. The models are trained on JW300 data.15 To gain real-world quality estimates beyond the religious context, we assess the models' out-of-domain generalization by translating an English survey with 39 questions and statements regarding COVID-19,16 where the human-corrected and approved translations can directly serve the purpose of gathering responses. The domain is challenging, as it contains medical terms and new vocabulary. Furthermore, we evaluate a subset of the Multitarget TED test data (Duh, 2018).17 The obtained translations enrich the TED datasets, adding new languages for which no prior translations exist. The size of the TED evaluations varies from 30 to 120 sentences. Details are given in Table 3, Appendix B.

Evaluators. 11 participants of the community volunteered to evaluate translations in their language(s), often involving family or friends to determine the most correct translations. The evaluator role is therefore taken by both stakeholders and language technologists. Within only 10 days, we gathered a total of 707 evaluated translations covering Igbo (ig), Nigerian Pidgin (pcm), Shona (sn), Luo (luo), Hausa (ha; evaluated twice, by two different annotators), Kiswahili (sw), Yoruba (yo), Fon (fon) and Dendi (ddn). We did not impose prescriptions on the number of sentences to evaluate or the time to spend, since this was voluntary work and guidelines or estimates for the evaluation of translations into these languages are non-existent.

Evaluation Technique. Instead of the direct assessment (Graham et al., 2013) often used in benchmark MT evaluations (Barrault et al., 2019; Guzmán et al., 2019), we opt for post-editing. Post-edits are grounded in actions that can be analyzed further, e.g. in terms of error types, while direct assessments require expensive calibration (Bentivogli et al., 2018). Embedded in the community, these post-edit evaluations create an asset for the interaction of the various agents: for language technologists, domain adaptation; for content creators, curators, or translators, guidance in standardization or domain choice.

Results. Table 2 reports evaluation results in terms of BLEU evaluated on the benchmark test set from JW300, and human-targeted TER (HTER) (Snover et al., 2006), BLEU (Papineni et al., 2002) and ChrF (Popović, 2015) against human-corrected model translations. For ha we find modest agreement between the two evaluators: Spearman's ρ = 0.56 for sentence-BLEU measurements of the post-edits compared to the original hypotheses. Generally, we observe that the JW300 score is misleading, overestimating model quality (except for yo). Training data size appears to be a more reliable predictor of generalization ability, illustrating the danger of chasing a single benchmark. However, ig and yo both have comparable amounts

   15 Except for Hausa: multiple domains, see Table 4.
   16 https://coronasurveys.org/
   17 http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/
 Trg.    Train.    Autom.: JW300    Human: COVID                     Human: TED
 lang.   size      BLEU ↑           HTER ↓   HBLEU ↑   HCHRF ↑      HTER ↓   HBLEU ↑   HCHRF ↑
 ddn       6,937   22.30            1.11       0.27      0.08         -         -         -
 pcm      20,214   23.29            0.98       3.03      0.19        0.84      9.76     25.16
 fon      27,510   31.07            0.92      15.43     23.22         -         -         -
 luo     136,459   34.33             -          -         -          1.26      7.90     20.88
 ha      333,845   41.11            0.71      26.96     43.97        0.73     20.42     39.31
                                    0.64      26.56     46.71         -         -         -
 ig      414,467   34.85            0.85      11.94     29.86        0.55     33.74     49.67
 yo      415,100   38.62            0.09      85.92     89.90        0.51     49.22     58.41
 sn      712,455   30.84            0.53      31.31     54.04         -         -         -
 sw      875,558   48.94             -          -         -          0.32     60.47     78.67

Table 2: Evaluation results for translations from English. Metrics are computed based on Polyglot-tokenized translations. HTER scores are mean sentence-level TER scores computed with the Pyter Python package. BLEU and ChrF are computed with SacreBLEU with tokenize "none" (Post, 2018). For ha, the two rows correspond to the two annotators.
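To make the table's HTER column concrete, the following stdlib-only Python sketch approximates it: word-level edit distance between each model output and its human post-edit, normalized by post-edit length, averaged over sentences. This is an illustrative simplification, not the paper's pipeline: true TER (and the Pyter package used for Table 2) also counts block shifts, a plain whitespace split stands in for Polyglot tokenization, and the sentence pairs below are invented.

```python
# Approximate HTER: mean sentence-level TER of model outputs against
# their human post-edits. True TER also counts block shifts; this
# sketch uses plain token edit distance as a stand-in.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (h != r)))   # substitution
        prev = curr
    return prev[-1]

def sentence_ter(hypothesis, post_edit):
    """Edits needed to turn the hypothesis into its post-edit,
    normalized by post-edit length (the human-targeted reference)."""
    hyp, ref = hypothesis.split(), post_edit.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

def hter(pairs):
    """Mean sentence-level TER over (hypothesis, post_edit) pairs."""
    return sum(sentence_ter(h, p) for h, p in pairs) / len(pairs)

# Invented example pairs: model output vs. human-corrected output.
pairs = [
    ("the virus spreads fast", "the virus spreads very fast"),
    ("wash you hands often", "wash your hands often"),
]
print(round(hter(pairs), 3))  # → 0.225
```

Lower is better: an HTER of 0 means the evaluator left the output unchanged, while values near 1 (as for ddn and luo) mean roughly one edit per reference word.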

of training data and similar JW300 scores, and both carry diacritics, yet they exhibit very different evaluation performance, in particular on COVID. This can be explained by the large variation within ig discussed above: training data and model output are not consistent with respect to one dialect, while the evaluator had to decide on one. We also find differences in performance across domains, with the TED domain appearing easier for pcm and ig, while the yo model performs better on COVID.

5 Conclusion

We proposed a participatory approach as a solution for sustainably scaling NLP research to low-resourced languages. Having identified key agents and interactions in the MT development process, we implemented a participatory approach to build a community for African MT. In the process, we discovered successful strategies for distributed growth and communication, knowledge sharing and model building. In addition to publishing benchmarks and datasets for previously understudied languages, we show how the participatory design of the community enables us to conduct a human evaluation study of model outputs, which has been one of the limitations of previous approaches to low-resourced NLP. The sheer volume and diversity of participants, languages and outcomes, and the fact that for many of the featured languages this paper constitutes the first human evaluation of an MT system, are evidence of the value of participatory approaches for low-resourced MT. For future work, we will (1) continue to iterate on, analyze and widen our benchmarks and evaluations, (2) build richer and more meaningful datasets that reflect the priorities of the stakeholders, (3) expand the focus of the existing community for African languages to other NLP tasks, and (4) help implement similar communities for other geographic regions with low-resourced languages.

Acknowledgements

We would like to thank Benjamin Rosman and Richard Klein for their invaluable feedback, as well as the anonymous EMNLP reviewers, and George Foster and Daan van Esch. We would also like to thank Google Cloud for the grant that enabled us to build baselines for languages with larger datasets.
References

Lishan Adam. 1997. Content and the web for African development. Journal of Information Science, 23(1):91–97.

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947, Valencia, Spain. Association for Computational Linguistics.

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics.

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Neville Alexander. 2009. Evolving African approaches to the management of linguistic diversity: The ACALAN project. Language Matters, 40(2):117–132.

Vamshi Ambati and Stephan Vogel. 2010. Can crowds build parallel corpora for machine translation systems? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 62–65, Los Angeles. Association for Computational Linguistics.

Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. 2010. Active learning-based elicitation for semi-supervised word alignment. In Proceedings of the ACL 2010 Conference Short Papers, pages 365–370, Uppsala, Sweden. Association for Computational Linguistics.

Antonios Anastasopoulos and Graham Neubig. 2019. Should all cross-lingual embeddings speak English?

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George F. Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.

Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, and Zeerak Waseem, editors. 2019. Proceedings of the 2019 Workshop on Widening NLP. Association for Computational Linguistics, Florence, Italy.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Luisa Bentivogli, Mauro Cettolo, Marcello Federico, and Christian Federmann. 2018. Machine translation human evaluation: An investigation of evaluation based on post-editing and its relation with direct assessment. In 15th International Workshop on Spoken Language Translation 2018, pages 62–69.

Andrew Caines. 2019. The geographic diversity of NLP conferences.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.

Colin Cherry, Greg Durrett, George Foster, Reza Haffari, Shahram Khadivi, Nanyun Peng, Xiang Ren, and Swabha Swayamdipta, editors. 2019. Proceedings of the 2nd Workshop on
Deep Learning Approaches for Low-Resource NLP (DeepLo 2019). Association for Computational Linguistics, Hong Kong, China.

Christopher Cieri, Mike Maxwell, Stephanie Strassel, and Jennifer Tracey. 2016. Selection criteria for low resource language programs. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4543–4549, Portorož, Slovenia. European Language Resources Association (ELRA).

Guy De Pauw, Peter Waiganjo Wagacha, and Gilles-Maurice de Schryver. 2009. The SAWA corpus: A parallel corpus English–Swahili. In Proceedings of the First Workshop on Language Technologies for African Languages, pages 9–16, Athens, Greece. Association for Computational Linguistics.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.

Bonaventure F. P. Dossou and Chris C. Emezue. 2020. FFR v1.0: Fon–French neural machine translation. arXiv preprint arXiv:2003.12111.

Kevin Duh. 2018. The Multitarget TED Talks Task. http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2019. Ethnologue: Languages of the World. Twenty-second edition.

Herman Engelbrecht. 2018. The Deep Learning Indaba report. ACM SIGMultimedia Records, 9(3):5–5.

P. B. English, M. J. Richardson, and Catalina Garzón-Galvis. 2018. From crowdsourcing to extreme citizen science: Participatory research for environmental health. Annual Review of Public Health, 39:335–350.

Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O'Brien, Theresa Breiner, Manasa
Prasad, Evan Crew, Chieu Nguyen, and Françoise Beaufays. 2019. Writing across the world's languages: Deep internationalization for Gboard, the Google keyboard. arXiv preprint arXiv:1912.01218.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas. Association for Computational Linguistics.

∀, Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, and Abdallah Bashir. 2020. Masakhane – Machine translation for Africa. In "AfricaNLP" Workshop at the 8th International Conference on Learning Representations.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana. Association for Computational Linguistics.

Adriana Guevara-Rukoz, Isin Demirsahin, Fei He, Shan-Hui Cathy Chu, Supheakmungkol Sarin, Knot Pipatsrisawat, Alexander Gutkin, Alena Butryna, and Oddur Kjartansson. 2020. Crowdsourcing Latin American Spanish for low-resource text-to-speech. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6504–6513, Marseille, France. European Language Resources Association.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLoRes evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113.

Reza Haffari, Colin Cherry, George Foster, Shahram Khadivi, and Bahar Salehi, editors. 2018. Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP. Association for Computational Linguistics, Melbourne.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2019. Unsung challenges of building and deploying language technologies for low resource language communities. In Sixteenth International Conference on Natural Language Processing (ICON), Hyderabad, India.

Pratik M. Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. ArXiv, abs/2004.09095.

James Jowi, Charles Ochieng Ong'ondo, and Mulu Nega. 2018. Building PhD capacity in sub-Saharan Africa.

Lucie-Aimée Kaffee, Hady ElSahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, and Elena Simperl. 2018. Mind the (language) gap: Generation of multilingual Wikipedia summaries from Wikidata for ArticlePlaceholders. In European Semantic Web Conference, pages 319–334. Springer.

Yunsu Kim, Yingbo Gao, and Hermann Ney. 2019. Effective cross-lingual transfer of neural machine translation models without shared vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1246–1257, Florence, Italy. Association for Computational Linguistics.

Julia Kreutzer, Jasmijn Bastings, and Stefan Riezler. 2019. Joey NMT: A minimalist NMT toolkit for novices. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Hong Kong, China.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala, Keamogetswe Makhaya, Tumisho Mokgonyane, Rethabile Mokoena, and Abiodun Modupe. 2020. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. arXiv preprint arXiv:2003.04986.

Laura Martinus, Jason Webster, Joanne Moonsamy, Moses Shaba Jnr, Ridha Moosa, and Robert Fairon. 2020. Neural machine translation for South Africa's official languages. arXiv preprint arXiv:2005.06609.

Arya McCarthy. 2017. The new digital divide: Language is the impediment to information access. https://hilltopicssmu.
wordpress.com/2017/04/08/the-new-digital-divide-language-is-the-impediment-to-information-access/. Accessed: 2020-05-30.

Cindy A. McKellar. 2014. An English to Xitsonga statistical machine translation system for the government domain. In Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, pages 229–233.

Rajend Mesthrie. 1995. Language and Social History: Studies in South African Sociolinguistics. New Africa Books.

Alice Millour and Karën Fort. 2018. Toward a lightweight solution for less-resourced languages: Creating a POS tagger for Alsatian using voluntary crowdsourcing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Alp Öktem, Mirko Plitt, and Grace Tang. 2020. Tigrinya neural machine translation with transfer learning for humanitarian response. arXiv preprint arXiv:2003.11523.

Iroro Orife. 2020. Towards neural machine translation for Edoid languages. arXiv preprint arXiv:2003.10704.

Iroro Orife, David I. Adelani, Timi Fasubaa, Victor Williamson, Wuraola Fisayo Oyewusi, Olamilekan Wahab, and Kola Tubosun. 2020. Improving Yorùbá diacritic restoration.

Maja Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Matt Post, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 401–409, Montréal, Canada. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631.

Matthew Snover, Bonnie Dorr, Richard Schwartz,
                                                      Linnea Micciulla, and John Makhoul. 2006. A
   arXiv preprint arXiv:2003.10564.
                                                      study of translation edit rate with targeted hu-
                                                      man annotation. In Proceedings of association
Kishore Papineni, Salim Roukos, Todd Ward, and        for machine translation in the Americas, vol-
  Wei-Jing Zhu. 2002. Bleu: a method for auto-        ume 200.
  matic evaluation of machine translation. In Pro-
  ceedings of the 40th Annual Meeting of the As-     Alex S Taylor, Siân Lindley, Tim Regan,
  sociation for Computational Linguistics, pages       David Sweeney, Vasillis Vlachokyriakos, Lillie
  311–318, Philadelphia, Pennsylvania, USA. As-        Grainger, and Jessica Lingel. 2015. Data-in-
  sociation for Computational Linguistics.             place: Thinking through the relations between
                                                       data and community. In Proceedings of the
Telmo Pires, Eva Schlinger, and Dan Garrette.          33rd Annual ACM Conference on Human Fac-
  2019. How multilingual is multilingual BERT?         tors in Computing Systems, pages 2863–2872.
  In Proceedings of the 57th Annual Meeting of
  the Association for Computational Linguistics,     Jörg Tiedemann. 2012. Parallel data, tools and in-
  pages 4996–5001, Florence, Italy. Association          terfaces in opus. In Lrec, volume 2012, pages
  for Computational Linguistics.                         2214–2218.
Shyam Upadhyay, Manaal Faruqui, Chris Dyer,
  and Dan Roth. 2016. Cross-lingual models
  of word embeddings: An empirical compari-
  son. In Proceedings of the 54th Annual Meet-
  ing of the Association for Computational Lin-
  guistics (Volume 1: Long Papers), pages 1661–
  1670, Berlin, Germany. Association for Compu-
  tational Linguistics.

Elan Van Biljon, Arnu Pretorius, and Julia
  Kreutzer. 2020. On optimal transformer depth
  for low-resource language translation. arXiv
  preprint arXiv:2004.04418.

Ngugi Wa Thiong’o. 1992. Decolonising the
  mind: The politics of language in African lit-
  erature. East African Publishers.

Xinyi Wang, Yulia Tsvetkov, and Graham Neubig.
  2020. Balancing training for multilingual neu-
  ral machine translation.

Yuxuan Wang, Wanxiang Che, Jiang Guo, Yi-
  jia Liu, and Ting Liu. 2019. Cross-lingual
  BERT transformation for zero-shot dependency
  parsing. In Proceedings of the 2019 Confer-
  ence on Empirical Methods in Natural Lan-
  guage Processing and the 9th International
  Joint Conference on Natural Language Pro-
  cessing (EMNLP-IJCNLP), pages 5721–5727,
  Hong Kong, China. Association for Computa-
  tional Linguistics.
Shijie Wu and Mark Dredze. 2020. Are all
  languages created equal in multilingual BERT?
  arXiv preprint arXiv:2005.09093.
Biao Zhang, Philip Williams, Ivan Titov, and Rico
  Sennrich. 2020. Improving massively multilin-
  gual neural machine translation and zero-shot
  translation. In Proceedings of the 58th Annual
  Meeting of the Association for Computational
  Linguistics (ACL), Online.
Barret Zoph, Deniz Yuret, Jonathan May, and
  Kevin Knight. 2016. Transfer learning for low-
  resource neural machine translation. In Pro-
  ceedings of the 2016 Conference on Empiri-
  cal Methods in Natural Language Processing,
  pages 1568–1575, Austin, Texas. Association
  for Computational Linguistics.
[Figure 2: Education (a) and occupation (b) of a subset of 37 participants as indicated in a voluntary survey in February 2020. Panel (a), Highest Level of Education: Secondary School, Undergraduate, Masters, PhD, Online Courses. Panel (b), Occupation: Student, Researcher, Data Scientist, Academic, Software Engineer, Linguist, Other.]

        Language          Domain    Size
        Nigerian Pidgin   COVID     39
                          TED       100
        Luo               TED       30
        Yoruba            COVID     39
                          TED       80
        Hausa             COVID     78
                          TED       120
        Igbo              COVID     39
                          TED       50
        Fon               COVID     39
        Swahili           TED       55
        Shona             COVID     39
        Dendi             COVID     39

Table 3: Number of sentences for collected post-edits for TED talks and COVID surveys.

A    Demographics

Figure 2 shows the demographics for a subset of participants from a voluntary survey conducted in February 2020. Between then and now (May 2020), the community has grown by 30%, so these figures should be read as a snapshot. Nevertheless, we can see that both educational background and occupation are fairly diverse, with a majority of undergraduate students (not necessarily in Computer Science).

B    Evaluation Data

Table 3 reports the number of sentences that were post-edited in the human evaluation study reported in Section 4.

C    Benchmark Scores

Table 4 contains BLEU scores on the JW300 test set for all benchmark models. BLEU scores are computed with Sacrebleu (Post, 2018) with tokenizer 'none', since the JW300 data comes tokenized with Polyglot.18 The table also features the target categories according to (Joshi et al., 2020) as of 28 May 2020.

   18 https://polyglot.readthedocs.io/en/latest/index.html
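For intuition about what these scores measure, the following is a minimal pure-Python sketch of sentence-level BLEU on whitespace tokens, which mirrors the pre-tokenized, tokenize='none' setting. The example sentences are illustrative only, and this simplified sketch is not a substitute for SacreBLEU, which should be used for any reported scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU on whitespace tokens:
    clipped n-gram precisions (n = 1..max_n), uniform weights,
    and the standard brevity penalty. Returns a 0-100 score."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection implements "clipping" against the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # a zero precision zeroes the geometric mean
        precisions.append(overlap / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp))) if hyp else 0.0
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Illustrative sentences (not from the evaluation data):
print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # identical -> 100.0
```

Real toolkits add corpus-level aggregation and smoothing; the point here is only that the score combines clipped n-gram overlap with a length penalty, and that tokenization choices change it, hence fixing tokenize='none' on pre-tokenized JW300 data.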
Source    Target                          Best Test BLEU   Category
               English   Afrikaans (Autshumato)          19.56            Rising Star
               English   Afrikaans (JW300)               45.48            Rising Star
               English   Amharic                         2.03             Rising Star
               English   Arabic (TED, custom)            9.28             Underdog
               English   Cinyanja                        30.00            Scraping by
               English   Dendi                           22.30            Left Behind
               English   Efik                            33.48            Left Behind
               English   Ẹ̀dó                             12.49            Left Behind
               English   Ẹ̀sán                            6.20             Left Behind
               English   Fon                             31.07            Left Behind
               English   Hausa (JW300+Tatoeba+more)      41.11            Hopeful
               English   Igbo                            34.85            Scraping by
               English   Isoko                           38.91            Left Behind
               English   Kamba                           27.90            Left Behind
               English   Kimbundu                        32.76            Left Behind
               English   Kikuyu                          37.85            Scraping by
               English   Lingala                         48.64            Scraping by
               English   Luo                             34.33            Left Behind
               English   Nigerian Pidgin                 23.29            Left Behind
               English   Northern Sotho (Autshumato)     19.56            Scraping by
               English   Northern Sotho (JW300)          15.40            Scraping by
               English   Sesotho                         41.23            Scraping by
               English   Setswana                        19.66            Hopeful
               English   Shona                           30.84            Scraping by
               English   Southern Ndebele (I)            4.01             Left Behind
               English   Southern Ndebele (II)           26.61            Left Behind
               English   kiSwahili (JW300)               48.94            Rising Star
               English   kiSwahili (SAWA)                3.60             Rising Star
               English   Tigrigna (JW300)                4.02             Hopeful
               English   Tigrigna (JW300+Tatoeba+more)   14.88            Hopeful
               English   Tiv                             44.70            Left Behind
               English   Tshiluba                        42.52            Left Behind
               English   Tshivenda                       49.57            Scraping by
               English   Urhobo                          28.82            Left Behind
               English   isiXhosa (Autshumato)           13.32            Hopeful
               English   isiXhosa (JW300)                6.00             Hopeful
               English   Xitsonga (JW300)                4.44             Scraping by
               English   Xitsonga (Autshumato)           13.54            Scraping by
               English   Yoruba                          38.62            Rising Star
               English   isiZulu (Autshumato)            1.96             Hopeful
               English   isiZulu (JW300)                 4.87             Hopeful
               Efik      English                         33.68            Winner
               French    Lingala                         39.81            Scraping by
               French    Swahili Congo                   33.73            Left Behind
               Hausa     English                         25.27            Winner
               Yoruba    English                         39.44            Winner

Table 4: Benchmarks as of Nov 6, 2020. If not indicated, training domain is JW300. BLEU scores are
computed with Sacrebleu (tokenize=’none’) on the JW300 test sets. Target languages are categorized
according to (Joshi et al., 2020) as of 28 May 2020.