Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Gowtham Ramesh1∗ Sumanth Doddapaneni1∗ Aravinth Bheemaraj2,5 Mayank Jobanputra3 Raghavan AK4 Ajitesh Sharma2,5 Sujit Sahoo2,5 Harshita Diddee4 Mahalakshmi J4 Divyanshu Kakwani3,4 Navneet Kumar2,5 Aswin Pradeep2,5 Kumar Deepak2,5 Vivek Raghavan5 Anoop Kunchukuttan4,6 Pratyush Kumar1,3,4 Mitesh Shantadevi† Khapra1,3,4‡

1 Robert Bosch Center for Data Science and Artificial Intelligence, 2 Tarento Technologies, 3 Indian Institute of Technology Madras, 4 AI4Bharat, 5 EkStep Foundation, 6 Microsoft

arXiv:2104.05596v2 [cs.CL] 29 Apr 2021

Abstract

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 46.9 million sentence pairs between English and 11 Indic languages (from two language families). In particular, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and we additionally mine 34.6 million sentence pairs from the web, resulting in a 2.8× increase in publicly available sentence pairs. We mine the parallel sentences from the web by combining many corpora, tools, and methods. In particular, we use (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 language pairs. Further, we extracted 82.7 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar and compared them with other baselines and previously reported results on publicly available benchmarks. Our models outperform existing models on these benchmarks, establishing the utility of Samanantar. Our data and models will be available publicly1 and we hope they will help advance research in Indic NMT and multilingual NLP for Indic languages.

1 Introduction

Deep Learning (DL) has revolutionized the field of Natural Language Processing, establishing new state of the art results on a wide variety of NLU (Wang et al., 2018, 2019) and NLG tasks (Gehrmann et al., 2021). Across languages and tasks, a proven recipe for high performance is to pretrain and/or finetune large models on massive amounts of data. In particular, significant progress has been made in machine translation due to encoder-decoder based models (Bahdanau et al., 2015; Wu et al., 2016; Sennrich et al., 2016b,a; Vaswani et al., 2017). While this has been favorable for resource-rich languages, there has been limited benefit for resource-poor languages which lack parallel corpora, monolingual corpora and evaluation benchmarks (Koehn and Knowles, 2017). One effort to close this gap is training multilingual models with the hope that performance on resource-poor languages improves from supervision on resource-rich languages (Firat et al., 2016; Johnson et al., 2017b). Such transfer learning works best when the resource-rich languages are related to the low-resource languages (Nguyen and Chiang, 2017; Dabre et al., 2017), and it is difficult to achieve high quality translation with limited in-language data (Guzmán et al., 2019). The situation is particularly dire when an entire group of related languages is low-resource, making transfer learning infeasible.

This disparity across languages is exemplified by the limited progress made in translation involving Indic languages. Given the very large collective speaker base of over 1 billion speakers, the preference for Indic languages and increasing digital penetration, a good translation system is a necessity to provide equitable access to information and content.

∗ The first two authors have contributed equally.
† Dedicated to the loving memory of my grandmother.
‡ Corresponding author: miteshk@cse.iitm.ac.in
1 https://indicnlp.ai4bharat.org/samanantar
For example, educational videos for primary, secondary and higher education should be available in different Indic languages. Similarly, various government advisories, policy announcements, high court judgments, etc. should be disseminated in all major regional languages. Despite this fundamental need, the accuracy of machine translation (MT) systems to and from Indic languages is poorer compared to that for several European languages (Bojar et al., 2014; Barrault et al., 2019, 2020). The primary reason for this is the lack of large-scale parallel data between Indic languages and English. Consequently, Indic languages have a poor representation in WMT shared tasks on translation and allied problems (post-editing, MT evaluation, etc.), further affecting attention by researchers. Thus, despite the huge practical need, Indic MT has significantly lagged while other resource-rich languages have made rapid advances with deep learning.

Figure 1: Total number of En-X parallel sentences in Samanantar for different Indic languages that are compiled from existing sources and newly mined. With the newly mined data, the number of parallel sentences across all En-X pairs increases by a ratio of 2.8×.

What does it take to improve MT on the large set of related low-resource Indic languages? The answer is straightforward: create large parallel datasets and then train proven DL models. However, collecting new data with manual translations at the scale necessary to train large DL models would be slow and expensive. Instead, several recent works have proposed mining parallel sentences from the web (Schwenk et al., 2019a, 2020; El-Kishky et al., 2020). The representation of Indic languages in these works is however poor (e.g., CCMatrix contains parallel data for only 2 Indic languages). In this work, we aim to significantly increase the amount of parallel data on Indic languages by combining the benefits of many recent contributions: large Indic monolingual corpora (Kakwani et al., 2020; Ortiz Suarez et al., 2019), accurate multilingual representation learning (Feng et al., 2020; Artetxe and Schwenk, 2019), scalable approximate nearest neighbor search (Johnson et al., 2017a; Subramanya et al., 2019; Guo et al., 2020), and open-source tools for optical character recognition of Indic scripts in rich text documents2. By combining these methods, we propose different pipelines to collect parallel data from three different types of sources: (a) scanned parallel documents, which require Optical Character Recognition followed by sentence alignment using a multilingual representation model such as LaBSE (Feng et al., 2020), (b) news websites with multilingual content, which require a crawler and an article aligner based on date ranges followed by a sentence aligner, and (c) IndicCorp (Kakwani et al., 2020), the largest corpus of monolingual data for Indic languages, which requires approximate nearest neighbour search using FAISS followed by a more accurate alignment using LaBSE. In summary, we propose a series of pipelines to collect parallel data from publicly available data sources.

Combining existing datasets and the new datasets that we collect from different sources, we present Samanantar3 - the largest publicly available parallel corpora collection for Indic languages. Samanantar contains ∼46.9M parallel sentences between English and 11 Indic languages, ranging from 142K pairs between English-Assamese to 8.6M pairs between English-Hindi. Of these, 34.6M pairs are newly mined as a part of this work whereas 12.4M are compiled from existing sources. The language-wise statistics are shown in Figure 1. In addition, we mine 82.7M parallel sentences between the 55 Indic language pairs using English as the pivot. To evaluate the quality of the mined sentences, we collect human judgments from 38 annotators for a total of about 10,000 sentence pairs across the 11 language pairs. The results show that the parallel sentences mined from the corpus are of high quality and validate the adopted thresholds for alignment using LaBSE representations. The results also show the potential for further improving LaBSE-based alignment, especially for low-resource languages and for longer sentences.

2 https://anuvaad.org
3 Samanantar in Sanskrit means semantically similar
This parallel data along with the human judgments will be made publicly available as a benchmark on cross-lingual semantic similarity.

To evaluate if Samanantar advances the state of the art for Indic NMT, we train a model using Samanantar and compare it with existing models. We make a practical choice of training a joint model which can leverage lexical and syntactic similarities between Indic languages. We compare our joint model, called IndicTrans, trained on Samanantar with (a) commercial translation systems (Google, Microsoft), (b) publicly available translation systems (OPUS-MT (Tiedemann and Thottingal, 2020), mBART50 (Tang et al., 2020), and CVIT-ILMulti (Philip et al., 2020)), and (c) models trained on all existing sources of parallel data between Indic languages. Across 44 publicly available test sets spanning 10 Indic languages, we observe that IndicTrans performs better than all existing models on 37 datasets. On several benchmarks, IndicTrans trained on Samanantar outperforms all existing models by a significant margin, establishing the utility of our corpus.

The three main contributions of this work, viz. (i) Samanantar, the largest collection of parallel corpora for Indic languages, (ii) IndicTrans, a joint model for translating from En-Indic and Indic-En, and (iii) human judgments on cross-lingual textual similarity for about 10,000 sentence pairs, will be made publicly available.

2 Samanantar: A Parallel Corpus for Indic Languages

In this section, we describe Samanantar, the largest publicly available parallel corpora collection for Indic languages. It contains parallel sentences between English and 11 Indic languages, viz., Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Odia (or), Punjabi (pa), Tamil (ta) and Telugu (te). In addition, it also contains parallel sentences between the 55 Indic language pairs obtained by pivoting through English (en). To build this corpus, we first collated all existing public sources of parallel data for Indic languages that have been released over the years, as described in Section 2.1. We then expand this corpus further by mining parallel sentences from three types of sources from the web. First, we consider sources which contain (almost) parallel or comparable documents available in machine readable format. Examples of such sources include some news websites which publish articles in multiple languages. Next, we consider sources which contain (almost) parallel documents which are not in machine readable format. Examples of such sources include PDF documents such as Indian parliamentary proceedings. The text in these documents is not always machine readable as it may have been encoded using legacy proprietary encodings (not UTF-8). Lastly, we consider IndicCorp, which is the largest collection of monolingual sentences for 11 Indic languages mined from the web. These sentences are collected from monolingual documents in multiple languages but may still contain parallel sentences as the content is India-centric. The pipelines required for mining each of these sources may have some common components (e.g., a sentence pair scorer) and some unique components (e.g., OCR for non-machine readable documents, an annoy index4 or FAISS index5 for web scale monolingual documents, etc.). We describe these pipelines in detail in the following subsections.

2.1 Existing sources

We first briefly describe the existing sources of parallel sentences for Indic languages, which are enumerated in Table 1. The Indic NLP Catalog6 helped identify many of these sources. Recently, the WAT 2021 shared task also compiled many existing Indic language parallel corpora.

The following sentence aligned corpora were available from OPUS (Tiedemann, 2012). We downloaded the latest available versions of these corpora on 21 March 2021:
ELRC_29227: Parallel text between English and 5 Indic languages collected from Wikipedia on the health and COVID-19 domain.
GNOME8, KDE49, Ubuntu10: Parallel text between English and 11 Indic languages from the localization files of GNOME, KDE4 and Ubuntu.
Global Voices11: Parallel text between English and 4 Indic languages extracted from news articles published on Global Voices, which is an international, multilingual community of writers, translators, academics, and digital rights activists.

4 https://github.com/spotify/annoy
5 https://github.com/facebookresearch/faiss
6 https://github.com/AI4Bharat/indicnlp_catalog
7 https://elrc-share.eu
8 https://l10n.gnome.org
9 https://l10n.kde.org
10 https://translations.launchpad.net
11 https://globalvoices.org
                en-as  en-bn  en-gu  en-hi  en-kn  en-ml  en-mr  en-or  en-pa  en-ta  en-te  Total
JW300              46    269    305    510    316    371    289      -    374    718    203   3400
banglanmt           -   2380      -      -      -      -      -      -      -      -      -   2380
iitb                -      -      -   1603      -      -      -      -      -      -      -   1603
cvit-pib            -     92     58    267      -     43    114     94    101    116     45    930
wikimatrix          -    281      -    231      -     72    124      -      -     95     92    895
OpenSubtitles       -    372      -     81      -    357      -      -      -     28     23    862
Tanzil              -    185      -    185      -    185      -      -      -     92      -    647
KDE4                6     35     31     85     13     39     12      8     78     79     14    402
PMIndia V1          7     23     42     50     29     27     29     32     28     33     33    333
GNOME              29     40     38     30     24     23     26     21     33     31     37    332
bible-uedin         -      -     16     62     61     61     60      -      -      -     62    321
Ubuntu             21     28     27     25     22     22     26     20     29     25     24    269
ufal                -      -      -      -      -      -      -      -      -    167      -    167
sipc                -     21      -     38      -     30      -      -      -     35     43    166
GlobalVoices        -    138      -      2      -      -      -    326      1      -      -    142
TED2020           ...

Table 1: Number of parallel sentences (in thousands) between English and 11 Indic languages compiled from existing sources.
Wiki-Matrix (Schwenk et al., 2019a): This corpus contains parallel text from Wikimedia. We download this corpus from OPUS, where it is filtered with a LASER (Artetxe and Schwenk, 2019) margin score of 1.04 for all language pairs, as recommended in the original Wiki-Matrix paper. It has parallel text between English and 6 Indic languages.

The following datasets are collected from sources which are not included in OPUS:
ALT (Riza et al., 2016): Parallel text between English and 2 Indic languages created by manually translating sentences from English Wikinews.
BanglaNMT (Hasan et al., 2020): Parallel text between English and Bengali created by collating and mining data from various English-Bengali parallel and comparable sources.
CVIT-PIB (Philip et al., 2020): Parallel text between English and 9 Indic languages extracted by aligning and mining parallel sentences from press releases of the Press Information Bureau17 of India.
IITB (Kunchukuttan et al., 2018): Parallel text between English and Hindi mined from various English-Hindi parallel sources.
MTEnglish2Odia18: Parallel text between English and Odia created by collating various sources like Wikipedia, TDIL19, Global Voices, etc.
NLPC20 (Fernando et al., 2021): Parallel text between English and Tamil extracted from publicly available government resources such as annual reports, procurement reports, circulars and websites by the National Languages Processing Center, University of Moratuwa.
OdiEnCorp 2.0 (Parida et al., 2020): Parallel text between English and Odia mined from Wikipedia, online websites and non-machine readable documents.
PMIndia V1 (Haddow and Kirefu, 2020): Parallel text between English and 11 Indic languages collected by crawling and extracting the PMIndia website21.
SIPC22 (Post et al., 2012): Crowdsourced translations of English Wikipedia documents into 5 Indic languages covering a diverse set of topics.
TICO19 (Anastasopoulos et al., 2020): Parallel text between English and 9 Indic languages containing CoViD-19 related information from Pubmed, Wikipedia, Wikinews, etc. The dataset provides CoViD-19 terminologies and a benchmark (dev and test sets) to train and benchmark translation systems.
UFAL (Ramasamy et al., 2012): Parallel text between English and Tamil collected from news, cinema and Bible websites.
URST (Shah and Bakrola, 2019): Parallel text obtained by translating English sentences from the MSCOCO captioning dataset (Lin et al., 2015) to Gujarati.
WMT-2019-wiki23, WMT-2019-govin24: Parallel text between English and Gujarati provided as training data for the WMT-2019 Gujarati-English news translation shared task.

As shown in Table 1, these sources25 collated together result in a total of 12.4M parallel sentences (after removing duplicates) between English and 11 Indic languages. It is interesting to note that there is no publicly available system which has been trained using parallel data from all these existing sources.

2.2 Mining parallel sentences from machine readable comparable corpora

We identified 12 news websites which publish articles in multiple Indic languages. For a given website, the articles across languages are not necessarily translations of each other. However, content within a given date range is often similar as the sources are India-centric with a focus on local events, personalities, advisories, etc. For example, news about guidelines for CoViD-19 vaccination gets published in multiple Indic languages. Even if such a news article in Hindi is not a sentence-by-sentence translation, it may contain some sentences which are accidentally or intentionally parallel to sentences from a corresponding English article.

17 https://pib.gov.in
18 https://soumendrak.github.io/MTEnglish2Odia
19 http://tdil-dc.in/index.php
20 https://github.com/nlpcuom/English-Tamil-Parallel-Corpus
21 https://www.pmindia.gov.in/en/news-updates
22 https://github.com/joshua-decoder/indian-parallel-corpora
23 http://data.statmt.org/wmt19/translation-task/wikipedia.gu-en.tsv.gz
24 http://data.statmt.org/wmt19/translation-task/govinraw.gu-en.tsv.gz
25 We have not included CCMatrix (Schwenk et al., 2020) and CCAligned (El-Kishky et al., 2020) in the current version of Samanantar. CCMatrix is not publicly available at the time of writing. Some initial models trained with CCAligned showed performance degradation on some benchmarks, and we will include it in a subsequent version if further analysis and cleanup shows it is beneficial for training MT models.
                   en-as  en-bn  en-gu  en-hi  en-kn  en-ml  en-mr  en-or  en-pa  en-ta  en-te  Total
IndicParCorp          55   4885   2424   4846   3507   4590   2600    835   1819   3403   4119  33081
Wikipedia              4    331     50    222     89    102     24      -     70    162     84   1138
PIB                    -     74     74    402     51     28     74      -    205    105     66   1078
PIB_archives           6     27     29    289     21     13     29      -     31     23     16    484
Nouns_dictionary       -      -      -     72     54     66     57      -     54     63     64    430
Prothomalo             -    284      -      -      -      -      -      -      -      -      -    284
Drivespark             -      -      -     40     57     50      -      -      -     66     68    280
General_corpus         -    224      -      -      -      -      -      -      -      -      -    224
Oneindia               -      5     12     91     14     10      -      -      -     38     32    203
NPTEL                  -     24     21     73      8      5     15      -      -     18     22    187
OCR                    -     14      -      -      -      -      -      -      -    169      2    185
Nativeplanet           -      -      -     32     32     27      -      -      -     25     41    156
Mykhel                 -     24      -     16     30     27      -      -      -     35     21    153
Newsonair              -      -      -    111      -      -      -      -      -      -      -    111
DW                     -     23      -     56      -      -      -      -      -      -      -     79
Timesofindia           -      -      3     31      -      -     25      -      -      -      -     59
Indianexpress          -      -      -     41      -     13      -      -      -      -      -     55
Goodreturns            -      -      -     13      8     11      -      -      -      7      9     47
Catchnews              -      -      -     36      -      -      -      -      -      -      -     36
DD_National            -      -      -     33      -      -      -      -      -      -      -     33
Khan_academy           -      4      2      6    ...

Table 2: Number of newly mined parallel sentences (in thousands) between English and 11 Indic languages from each source.
Hence, we consider such news websites to be a good source of parallel sentences.

We also identified two sources from the education domain - NPTEL26 and Khan Academy27 - which provide educational videos (also available on YouTube) with parallel human translated subtitles in different languages, including English and Indic languages.

We use the following steps to extract parallel sentences from the above sources:

Article Extraction. For every news website, we build custom extractors using BeautifulSoup28 or Selenium29. BeautifulSoup is a Python library for parsing HTML/XML documents and is suitable for websites where the content is largely static. However, many websites have dynamic content which gets loaded from a data source (a database or file) and requires additional action events to be triggered by the user. This requires the extractor to interact with the browser and perform repetitive tasks such as scrolling down, clicking, etc. Selenium allows automation of such web browser interactions and is used for scraping content from such dynamic sites. For NPTEL, we programmatically gather all the YouTube video links of courses mentioned on the NPTEL translation page30. We then collect Indic and English subtitles for every video using youtube-dl31. For Khan Academy, we use youtube-dl to search the entire channel for videos containing subtitles for English and any of the 11 Indic languages and download them. We skip the auto-generated YouTube captions to ensure that we only get high quality translations. We collected subtitles for all available courses/videos on March 7th, 2021.

Tokenisation. Once the main content of the article is extracted in the above step, we split it into sentences and tokenize the sentences. We used the tokenisers available in the Indic NLP Library32 (Kunchukuttan, 2020) and added some more heuristics to account for Indic punctuation characters and sentence delimiters. For example, we ensured that a sentence break is not inserted when we encounter common Indian titles such as Shri. (equivalent to Mr. in English) which are followed by a period.

Parallel Sentence Extraction. At the end of the above step we have sentence tokenised articles in English and a target language (say, Hindi). Further, all these websites contain metadata based on which we can cluster the articles according to the month in which they were published (say, January 2021). We assume that to find a match for a given Hindi sentence we only need to consider all English sentences which belong to articles published in the same month as the article containing the Hindi sentence. This is a reasonable assumption as the content of news articles is temporal in nature. Let S = {s1, s2, ..., sm} be the set of all sentences across all English articles in a particular month. Similarly, let T = {t1, t2, ..., tn} be the set of all sentences across all Hindi articles in that same month. Let f(s, t) be a scoring function which assigns a score indicating how likely it is that s ∈ S, t ∈ T form a translation pair. For a given Hindi sentence ti ∈ T, the matching English sentence can be found as:

    s∗ = arg max_{s ∈ S} f(s, ti)

We chose f to be the cosine similarity function on vectorial embeddings of s and t. We compute these embeddings using LaBSE (Feng et al., 2020), which is a multilingual embedding model that encodes text from different languages into a shared embedding space. LaBSE is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs using the Masked Language Modeling (Devlin et al., 2019) and Translation Language Modeling objectives (Conneau and Lample, 2019). The authors have shown that it produces state of the art results on multiple parallel text retrieval tasks and is effective even for low-resource languages. Hereon, we refer to the cosine similarity between the LaBSE embeddings of s and t as the LaBSE Alignment Score (LAS).

Post Processing. Using the above described process, we find the top matching English sentence, s∗, for every Hindi sentence, ti. We then apply a threshold and select only those pairs for which the cosine similarity is greater than a threshold t. Across different sources we found 0.75 to be a good threshold. We refer to this as the LAS threshold. Next, we remove duplicates in the data. We consider two pairs (si, ti) and (sj, tj) to be duplicates if si = sj and ti = tj. We also remove any sentence pair where the English sentence is less than 4 words long. Lastly, we use a language identifier33 and eliminate pairs where the language identified for si or ti does not match the intended language.

26 https://nptel.ac.in
27 https://www.khanacademy.org
28 https://www.crummy.com/software/BeautifulSoup
29 https://pypi.org/project/selenium
30 https://nptel.ac.in/Translation
31 https://github.com/tpikonen/youtube-dl: We use a fork of youtube-dl that lets us download multiple subtitles per language if available. This was necessary for some NPTEL YouTube videos which had one erroneous and one correct subtitle file for English. We heuristically remove the incorrect one after download.
32 https://github.com/anoopkunchukuttan/indic_nlp_library
33 https://github.com/aboSamoor/polyglot
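To make the alignment step concrete, the following is a minimal sketch of LAS-based matching for one month of articles. It assumes the LaBSE checkpoint released through the sentence-transformers library (the paper does not tie itself to a particular implementation), and the helper name mine_pairs is illustrative.

```python
# Minimal sketch of LAS-based sentence matching; assumes the
# sentence-transformers release of LaBSE, which the original pipeline
# may or may not have used.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(english_sents, hindi_sents, las_threshold=0.75):
    """For each Hindi sentence t_i, find s* = argmax_s f(s, t_i) by LAS."""
    # With unit-normalised embeddings, the dot product equals cosine similarity.
    en_emb = model.encode(english_sents, normalize_embeddings=True)
    hi_emb = model.encode(hindi_sents, normalize_embeddings=True)
    scores = hi_emb @ en_emb.T            # (n_hindi, n_english) LAS matrix
    best = scores.argmax(axis=1)          # index of s* for every t_i
    return [
        (english_sents[j], hindi_sents[i], float(scores[i, j]))
        for i, j in enumerate(best)
        if scores[i, j] >= las_threshold  # keep only confident alignments
    ]
```

The brute-force score matrix is workable here only because S is restricted to one month of articles; Section 2.4 describes how the search is made scalable when no such restriction exists.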
2.3 Mining parallel sentences from non-machine readable comparable corpora

While web sources are machine readable, there are official documents being generated which are not always machine readable. This includes proceedings of the legislative assemblies of different states in India that are published in English as well as the local language of the state. For example, in the state of Tamil Nadu, the proceedings get published in English as well as Tamil. These documents are often translated sentence-by-sentence by human translators and thus the translation pairs are of high quality. In this work, we considered 3 such sources: (a) documents from the Tamil Nadu government34 which get published in English and Tamil, (b) speeches from the Bangladesh Parliament35 and West Bengal Legislative Assembly36 which get published in English and Bengali, and (c) speeches from the Andhra Pradesh37 and Telangana Legislative Assemblies38 which get published in English and Telugu. The documents available on these sources are public and have a regular URL format. Most of these documents either contain scanned images of the original document or contain proprietary encodings (non-UTF8) due to legacy issues. As a result, standard PDF parsers cannot be used to extract text from them. We use the following pipeline for extracting parallel sentences from such sources.

Optical Character Recognition (OCR). We used Google's Vision API, which supports OCR in English as well as all the 11 Indic languages that we consider in this work. Specifically, we pass each document through Google's OCR service, which returns the text contained in the document.

Tokenisation. Once the text is extracted from the PDFs, we use the same tokenisation process as described in the previous section. Here, we apply extra heuristics to handle sentences at the bottom of a page which may overflow on to the next page. Since each page is independently passed to Google's Vision API, we ensure that an incomplete sentence at the bottom of one page is merged with an incomplete sentence at the top of the next page.

Parallel Sentence Extraction. Unlike the previous section, here we have exact information about which documents are parallel. This information is typically encoded in the URL of the document itself (e.g., https://www.tn.gov.in/en/budget2020.pdf and https://www.tn.gov.in/ta/budget2020.pdf). Hence, for a given Tamil sentence, ti, we only need to consider the sentences S = {s1, s2, ..., sm} which appear in the corresponding English article. The search space is thus much smaller than that in the previous section, where we considered S to be a collection of all sentences from all articles published in the same month. Once this set S has been identified, for a given ti, we identify the matching sentence, s∗, using LAS as described in the previous subsection.

Post-Processing. We use the same post-processing as described in the previous subsection, viz., (a) filtering based on a threshold on LAS, (b) removing duplicates, (c) filtering short sentences, and (d) filtering sentence pairs where either the source or target text is not in the desired language.

    Language    Number of sentences
    as            2.38
    bn           77.7
    en          100.6
    gu           46.6
    hi           77.3
    kn           56.5
    ml           67.9
    mr           41.6
    or           10.1
    pa           35.3
    ta           47.8
    te           60.5

Table 4: Number of sentences (in millions) in the monolingual corpora from IndicCorp for English and 11 Indic languages. IndicCorp is the largest available such corpus and contributes the largest fraction of parallel sentences (IndicParCorp) to Samanantar.

34 https://www.tn.gov.in/documents/deptname
35 http://www.parliament.gov.bd
36 http://www.wbassembly.gov.in
37 https://www.aplegislature.org, https://www.apfinance.gov.in
38 https://finance.telangana.gov.in
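A minimal sketch of the OCR step is shown below. It assumes each scanned page has already been rendered to an image and that the google-cloud-vision Python client is configured with credentials; the page-merging heuristic is a simplified stand-in for the heuristics described above.

```python
# Sketch of per-page OCR with Google's Vision API, plus a simplified
# version of the page-boundary merging heuristic described above.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_page(image_path, language_hints=("en", "ta")):
    """Run document-level OCR on one page image and return the raw text."""
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.document_text_detection(
        image=image, image_context={"language_hints": list(language_hints)}
    )
    return response.full_text_annotation.text

def merge_pages(page_texts):
    """Stitch sentences cut by a page break: if a page does not end in a
    sentence terminator, glue the next page onto it."""
    merged = []
    for text in (t.strip() for t in page_texts):
        if merged and not merged[-1].endswith((".", "?", "!", "\u0964")):
            merged[-1] = merged[-1] + " " + text  # \u0964 is the danda '।'
        else:
            merged.append(text)
    return "\n".join(merged)
```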
2.4 Mining parallel sentences from web scale monolingual corpora

Recent works have shown that it is possible to mine parallel sentences from web scale monolingual corpora. For example, both Schwenk et al. (2019b) and Feng et al. (2020) align sentences in large monolingual corpora (e.g., CommonCrawl) by computing the similarity between them in a shared multilingual embedding space. In this work, we consider IndicCorp, the largest collection of monolingual corpora for Indic languages. Table 4 shows the number of sentences in version 2.0 of IndicCorp for each of the languages that we considered. It should be obvious that a brute force search, where every sentence in the source language is compared to every sentence in the target language to find the best matching sentence, is infeasible. However, unlike the previous two subsections where we could restrict the set of sentences, S, to include only those sentences which were published in a given month or belonged to a known parallel document, there is no easy way of restricting the search space for IndicCorp. In particular, IndicCorp only contains a list of sentences with no meta-data about the month or article to which each sentence belongs. The only option then is to iterate over all target sentences to find a match. To do this efficiently, we use FAISS39 (Johnson et al., 2017a), which does efficient indexing, clustering, semantic matching, and retrieval of dense vectors, as explained below.

Indexing. We first compute the sentence embedding using LaBSE for all English sentences in IndicCorp. Note that IndicCorp is a sentence level corpus (one sentence per line), so we do not need any pre-processing or sentence splitting to extract sentences. Once the embeddings are computed, we create a FAISS index where these embeddings are stored in 100k clusters. Since LaBSE embeddings are very high dimensional, we use the Product Quantisation of FAISS to reduce the amount of space required to store these embeddings. In particular, each 768 dimensional LaBSE embedding is quantized into an m dimensional vector (m = 64), where each dimension is represented using an 8-bit integer value.

Retrieval. For every Indic sentence (say, a Hindi sentence) we first compute the LaBSE embedding and then query the FAISS index for its nearest neighbor based on inner product. Note that we normalise both the English index and the Hindi embeddings so that computing the inner product is the same as computing the cosine similarity. FAISS first finds the top-p clusters by computing the distance between each of the cluster centroids and the given Hindi sentence. We set the value of p to 1024. Within each of these clusters, FAISS then searches for the nearest neighbors. This retrieval is highly optimized to scale: in our implementation, on average we were able to retrieve the nearest neighbors for 1,100 sentences per second when mining from the index of the entire IndicCorp.

Recomputing cosine similarity. Notice that FAISS computes cosine similarity on the quantized vectors, in our case of dimension 64. For a given pair of sentences, how do these approximate similarity scores compare with the cosine similarity on the full LaBSE embeddings (LAS)? We found that while the relative ranking produced by FAISS is good, the quantized similarity scores vary widely and are unreliable as absolute estimates; i.e., FAISS identifies sentence pairs which are likely to have large LAS, but the quantized scores themselves cannot be thresholded directly. To study this better, we collected 100 gold standard en-hi sentence pairs and computed the similarity scores on the quantized vectors - the scores that would be used by FAISS. The histogram of these scores is shown in Figure 2. We found that the scores vary widely from 0.42 to 0.78, and hence it is difficult to choose an appropriate threshold on the similarity of the quantized vectors. However, the relative ranking provided by FAISS is still good, i.e., for all the 100 query Hindi sentences that we analysed, FAISS retrieved the correct matching English sentence from an index of 100.6M sentences at the top-1 position.

Figure 2: Histogram of the cosine similarity between the product quantised representations in FAISS of sentence pairs known to be parallel. The wide distribution implies that applying a threshold on the approximated similarity would not be effective.

39 https://github.com/facebookresearch/faiss
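Concretely, the indexing and retrieval described above, together with the full-embedding re-scoring that this observation motivates (the two-step approach described next), can be sketched as follows. The parameters mirror those reported (100k clusters, 64 PQ codes of 8 bits, p = 1024, LAS threshold 0.80); function names are illustrative.

```python
# Sketch of the IVF-PQ index and the two-step search; a minimal
# illustration, not the exact mining code used for Samanantar.
import faiss
import numpy as np

D = 768  # LaBSE embedding dimension

def build_index(en_emb):
    """en_emb: float32 array of shape (N, 768), one row per English sentence."""
    faiss.normalize_L2(en_emb)            # unit norm: inner product == cosine
    quantizer = faiss.IndexFlatIP(D)
    # 100k IVF clusters; each vector compressed to 64 PQ codes of 8 bits each.
    index = faiss.IndexIVFPQ(quantizer, D, 100_000, 64, 8,
                             faiss.METRIC_INNER_PRODUCT)
    index.train(en_emb)                   # learn the clustering and PQ codebooks
    index.add(en_emb)
    return index

def mine(index, indic_emb, en_emb, las_threshold=0.80, nprobe=1024):
    """Two-step search: approximate top-1 on PQ codes, then exact LAS rescoring."""
    faiss.normalize_L2(indic_emb)
    index.nprobe = nprobe                 # number of IVF clusters visited per query
    _, ids = index.search(indic_emb, 1)   # approximate top-1 neighbour
    nn = ids[:, 0]
    # Re-score each candidate pair on the full embeddings to get the true LAS.
    las = np.einsum("ij,ij->i", indic_emb, en_emb[nn])
    keep = las >= las_threshold
    return nn[keep], las[keep]
```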
Based on this observation, we follow a two-step approach: First, we retrieve the top-1 matching sentence from FAISS using the quantized vector. Then, we compute the LAS between the full LaBSE embeddings of the retrieved sentence pair. On this computed LAS, we apply a LAS threshold of 0.80 (slightly higher than that described in the previous subsection) for filtering. This modified FAISS mining, combining quantized vectors for efficient searching and full embeddings from LaBSE for accurate thresholding, was an essential innovation to mine a large number of parallel sentences.

Post-processing. We follow the same post-processing steps as described in Section 2.2.

We used the same process as described above to extract parallel sentences from Wikipedia. More specifically, we treated Wikipedia as a collection of monolingual sentences in different languages. We then created a FAISS index of all sentences from English Wikipedia. Next, for every source sentence from the corresponding Indic language Wikipedia (say, Hindi Wikipedia) we retrieved the nearest neighbor from this index. We followed the exact same steps as above, with the only difference that the sentences from IndicCorp were replaced by sentences from Wikipedia. We found that we were able to mine more parallel sentences using this approach as opposed to aligning bilingual articles using Wikipedia's interlanguage links and then mining parallel sentences only from these aligned articles.

2.5 Mining corpora between Indic languages

So far, we have discussed mining parallel corpora between English and Indic languages. To support translation between Indic languages, we also need direct parallel corpora between these languages, since zero-shot translation is insufficient to address translation between non-English languages (Arivazhagan et al., 2019a). Following Freitag and Firat (2020) and Rios et al. (2020), we use English as a pivot to mine parallel sentences between Indic languages from all the English-centric corpora described earlier in this section. For example, let (sen, thi) and (ŝen, tta) be mined parallel sentences between en-hi and en-ta respectively. If sen = ŝen, then we extract (thi, tta) as a Hindi-Tamil parallel sentence pair. Further, we use a very strict de-duplication criterion to avoid the creation of very similar parallel sentences. For example, if an en sentence is aligned to m hi sentences and n ta sentences, then we would get mn hi-ta pairs. However, these pairs would be very similar and not contribute much to the training process. Hence, we retain only 1 randomly chosen pair out of these mn pairs.

2.6 Statistics of Samanantar

Table 2 summarizes the number of parallel sentences obtained from each of these sources between English and the 11 Indic languages that we considered. Overall, we mined 34.6M parallel sentences in addition to the 12.4M parallel sentences already available between English and Indic languages. We thus contribute 2.8× more data over existing sources. Figure 3 shows the relative contribution of different sources in mining new parallel sentences. The dominant source is IndicCorp, contributing over 85% of the sentence pairs. However, we do note that there is the possibility of carefully collating more high quality machine-readable and non-machine-readable sources. Table 3 summarises the number of parallel sentences that we mined between the 55 Indic language pairs. We mined 82.7M parallel sentences, resulting in a 5.29× increase in publicly available sentence pairs between these languages.

Figure 3: Total number of newly mined En-X parallel sentences in Samanantar from different sources (Monolingual Corpus - IndicCorp, Non Machine Readable Sources, Machine Readable Sources). Across languages, mining from IndicCorp is the dominant source, accounting for 85% of all newly identified parallel sentences.
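To make the pivoting step of Section 2.5 concrete, here is a minimal sketch of English-pivot mining with the strict de-duplication rule (one randomly retained pair out of the mn candidates). The data structures and names are illustrative.

```python
# Sketch of English-pivot mining between two Indic languages, with the
# strict de-duplication rule described in Section 2.5.
import random
from collections import defaultdict

def pivot_mine(en_hi_pairs, en_ta_pairs, seed=0):
    """en_hi_pairs / en_ta_pairs: lists of (english, indic) sentence pairs."""
    hi_by_en = defaultdict(list)
    for en, hi in en_hi_pairs:
        hi_by_en[en].append(hi)
    ta_by_en = defaultdict(list)
    for en, ta in en_ta_pairs:
        ta_by_en[en].append(ta)

    rng = random.Random(seed)
    hi_ta_pairs = []
    for en, hi_sents in hi_by_en.items():
        ta_sents = ta_by_en.get(en)
        if not ta_sents:
            continue
        # The m x n candidate pairs sharing this pivot are near-duplicates;
        # retain exactly one randomly chosen pair.
        hi_ta_pairs.append((rng.choice(hi_sents), rng.choice(ta_sents)))
    return hi_ta_pairs
```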
3 Analysis of the Quality of the Mined Parallel Corpus

In Samanantar, parallel sentences have been mined using a series of pipelines at different content scales with multiple tools and models such as LaBSE, FAISS, and Anuvaad. Further, the parallel sentences have been mined for different languages from different sources. It is thus important to characterize the quality of these mined sentences across pipelines, languages, and sources. One way to do so is to quantify the improvement in the performance of NMT models when trained with the additional mined parallel sentences, which we discuss in the next section. However, to inform further research and development, it is also valuable to intrinsically and manually evaluate the cross-lingual Semantic Textual Similarity (STS) of pairs of sentences. In this section, we describe the task and results of such an evaluation for the English-Indic parallel sentences in Samanantar.

3.1 Annotation Task and Setup

We sampled 10,069 sentence pairs (English and Indic sentences) across the 11 Indic languages and sources. To understand the sensitivity of the STS scores assigned by the annotators to the alignment quality as estimated with LAS, we sampled sentences equally in three sets:

• sentences which were definitely accepted (LAS above the chosen threshold by more than 0.1),
• sentences which were marginally accepted (LAS greater than but within 0.1 of the chosen threshold), and
• sentences which were rejected (LAS lower than but within 0.1 of the chosen threshold).

Thus, for every source and language pair, we perform a stratified sampling with an equal number of sentences in each of the above three sets. After all sentences are sampled, we randomly shuffled the language-wise sentence pairs such that there is no ordering preserved across sources or LAS. We then divided the language-wise sentence pairs into annotation batches of 30 parallel sentences each.

For defining the annotation scores, we base our work on the SemEval-2016 Task 1 (Agirre et al., 2016), wherein semantic textual similarity was studied for mono-lingual and cross-lingual tasks. There, the STS of two given sentences is characterized on an ordinal scale of six levels ranging from complete semantic equivalence (5) to complete semantic dissimilarity (0). We follow the same guidelines in defining six ordinal levels as exemplified in Table 1 of Agirre et al. (2016). These guidelines were explained to 38 annotators across 11 Indic languages, with a minimum of 3 annotators per language. Each of the annotators was a native speaker of the language assigned to them and was also fluent in English. The annotators have experience ranging from 1-20 years in working on language and related tasks, with a mean of 5 years. The annotation task was performed on Google Forms in the following manner: Each form consisted of 30 parallel sentences coming from an annotation batch as defined earlier. Annotators were shown one pair of sentences at a time and were asked to score it in the range of 0 to 5. The SemEval-2016 guidelines for each ordinal value were visible to the annotators at all times. After annotating 30 parallel sentences, the annotators submitted the form and then resumed with a new form. The annotators were compensated for the work at the rate of Rs 100 to Rs 150 (1.38 to 2.06 USD) per 100 words read.

3.2 Annotation Results and Discussion

The results of the annotation of semantic textual similarity (STS) of over 9,500 sentence pairs with over 30,000 annotations are shown language-wise in Table 5. We make the following key observations from the data.

Sentence pairs included in Samanantar have high semantic similarity. Overall, the mined parallel sentences (the 'All accept' column) have a mean STS score of 4.17 and a median STS score of 5. On a scale of 0 to 5, where 5 represents perfect semantic similarity, a mean score of 4.17 indicates that the annotators rate the parallel sentences to be of high quality. Furthermore, the chosen LAS thresholds sensitively filter out sentence pairs - the definitely accept sentence pairs have a high average STS score of 4.53, which reduces to 3.8 for marginally accept, and falls significantly to 2.9 for the reject sets.
             # Sentence  # Anno-   STS score                                      Spearman correlation
Language     pairs       tations   All accept  Definitely  Marginally  Reject     LAS,STS  LAS,len  STS,len
Assamese        689       1,973      3.48        3.83        3.06       2.14       0.25    -0.39     0.15
Bengali         957       3,814      4.53        4.82        4.23       3.51       0.41    -0.42    -0.14
Gujarati        779       2,333      3.94        4.41        3.46       2.56       0.44    -0.30    -0.07
Hindi         1,277       4,679      4.38        4.75        3.99       3.03       0.44    -0.18    -0.12
Kannada         957       2,839      4.08        4.51        3.66       2.62       0.37    -0.38    -0.08
Malayalam       917       2,781      3.94        4.40        3.49       2.30       0.36    -0.33     0.02
Marathi         779       2,324      4.14        4.56        3.66       2.76       0.39    -0.37    -0.01
Odia            500       1,497      3.97        4.07        3.86       4.18       0.08    -0.41    -0.02
Punjabi         689       2,265      4.16        4.58        3.71       2.27       0.34    -0.25     0.13
Tamil         1,044       3,123      4.11        4.48        3.74       2.42       0.36    -0.40    -0.17
Telugu          951       2,968      4.51        4.76        4.25       3.60       0.32    -0.40    -0.08
Overall       9,570      30,596      4.17        4.53        3.8        2.9        0.33    -0.35    -0.03

Table 5: Results of the annotation task to evaluate the semantic similarity between sentence pairs across 11 languages ("len" denotes the English sentence length in a pair). Human judgments confirm that the mined sentences (All accept) have high semantic similarity, with a moderately high correlation between the human judgments and LAS.

Mean STS scores depend on the resource-size of the corresponding Indic language. The mean STS scores are a function of the resource-size of the corresponding Indic language, at least at the extreme ends. The two languages with the smallest resource sizes (As, Or) have the lowest mean STS scores, while the two languages with the highest resource sizes (Hi, Bn) are in the top-3 mean STS scores. This indicates that the mining of parallel sentences with multilingual representation models such as LaBSE is more accurate for resource-rich languages.

LAS and STS are moderately correlated across languages. The Spearman correlation coefficient between LAS and STS is a moderately positive value of 0.33. This suggests that sentence pairs which have a higher LAS are likely to be rated as semantically similar by annotators. However, the correlation coefficient is also not very high (say > 0.5), indicating potential for further improvement in learning multilingual representations with LaBSE-like models. Further, for low-resource languages such as Assamese and Odia the correlation values are lower, indicating potential for improvement in alignment.

LAS is negatively correlated with sentence length, while STS is not. In the above two points we have highlighted the potential for improving the LaBSE model in aligning sentences. We identified one specific opportunity to do this by analyzing the correlation between sentence length and LAS, and between sentence length and STS. To be consistent across languages, the sentence length is computed for the English sentence in each pair. We find that sentence length is negatively correlated with LAS, with a Spearman correlation coefficient of -0.35, while sentence length is almost uncorrelated with STS, with a Spearman correlation coefficient of -0.03. In other words, sentence pairs with longer sentences are unlikely to have high alignment on LaBSE representations and thus be included in Samanantar. This is also evidenced by the average sentence lengths in the three sets of definitely accept, marginally accept, and reject: 11.36, 13.4, and 12.9, respectively. However, this preference for shorter sentences seems to be an accident of the LaBSE representation rather than reflecting semantic similarity, as shown by the lack of any correlation between sentence length and STS.

In summary, the annotation task established that the parallel sentences in Samanantar are of high quality and the chosen thresholds are validated. The task also established that LaBSE-based alignment should be further improved for low-resource languages (such as As and Or) and for longer sentences. We will release this parallel dataset and the human judgments on the over 9,500 sentence pairs as a dataset for evaluating cross-lingual semantic similarity between English and Indic languages.
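The correlations reported in Table 5 are standard Spearman rank correlations and could be reproduced from the released judgments along the following lines; the record fields used here (las, sts, en_length) are illustrative, not the released schema.

```python
# Sketch of the Table 5 correlation computation from per-pair records;
# field names are hypothetical placeholders for the released data format.
from scipy.stats import spearmanr

def table5_correlations(records):
    """records: list of dicts with LAS, mean STS, and English sentence length."""
    las = [r["las"] for r in records]
    sts = [r["sts"] for r in records]
    length = [r["en_length"] for r in records]
    rho_las_sts, _ = spearmanr(las, sts)     # reported overall: 0.33
    rho_las_len, _ = spearmanr(las, length)  # reported overall: -0.35
    rho_sts_len, _ = spearmanr(sts, length)  # reported overall: -0.03
    return rho_las_sts, rho_las_len, rho_sts_len
```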
4 IndicTrans: Multilingual, single Indic script models

The languages of the Indian subcontinent exhibit many lexical and syntactic similarities on account of genetic and contact relatedness (Abbi, 2012; Subbārāo, 2012). Genetic relatedness manifests in the two major language groups considered in this work: the Indo-Aryan branch of the Indo-European family and the Dravidian family. Owing to the long history of contact between these language groups, the Indian subcontinent is a linguistic area (Emeneau, 1956) exhibiting convergence of many linguistic properties between languages of these groups. Hence, we explore multilingual models spanning all these Indic languages to enable transfer from high resource languages to low resource languages on account of genetic relatedness (Nguyen and Chiang, 2017) or contact relatedness (Goyal et al., 2020). More specifically, we explored 2 types of multilingual models for translation involving Indic languages: (i) One to Many for English to Indic language translation (O2M: 11 pairs), and (ii) Many to One for Indic language to English translation (M2O: 11 pairs).

Data Representation. The first major design choice we made was to represent all the Indic language data in a single script. The scripts for these languages are all derived from the ancient Brahmi script. Though each of these scripts has its own Unicode codepoint range, it is possible to get a 1-1 mapping between characters in these different scripts, since the Unicode standard takes into account the similarities between these scripts. Hence, we convert all the Indic data to the Devanagari script (we could have chosen any of the other scripts as the common script, except Tamil). This allows better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages and allows using a smaller subword vocabulary.

The first token of the source sentence is a special token indicating the source language (Tan et al., 2019; Tang et al., 2020). The model can make a decision on the transfer learning between these languages based on both the source language tag and the similarity of representations. When multiple target languages are involved, we follow the standard approach of using a special token in the input sequence to indicate the target language to generate (Johnson et al., 2017b). Other standard pre-processing done on the data comprises Unicode normalization and tokenization. When the target language is Indic, the output in Devanagari script is converted back to the corresponding Indic script. All the text processing is done using the Indic NLP library.

Training Data. For all models, we use all the available parallel data between English and Indic languages. We then remove any overlaps with any test or validation data that we use. We use very strict criteria for identifying such overlaps. In particular, while finding overlaps, we remove all punctuation characters and lowercase all strings in the training and validation/test data. We then remove any translation pair, (en, t), from the training data if (i) the English sentence en appears in the validation/test data of any En-X language pair, or (ii) the Indic sentence t appears in the validation/test data of the corresponding En-X language pair. Note that, since we train a joint model, it is important to ensure that no en sentence in the test/validation data appears in any of the En-X training sets. In particular, if an en sentence is in the En-Hi validation/test data, then any pair containing this sentence should not be in any of the En-X training sets. We do not use any data sampling while training and leave the exploration of these strategies for future work (Arivazhagan et al., 2019b).

Validation Data. For both the models, we used all the validation data from the benchmarks described in Section 5.1.

Vocabulary. We use a subword vocabulary learnt using subword-nmt (Sennrich et al., 2016b) for building our models. For both the models, we learn separate vocabularies for the English and Indic languages from the English-centric training data, using 32k BPE merge operations each.
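A minimal sketch of the single-script preprocessing described under Data Representation above is shown below, using the rule-based transliterator of the Indic NLP Library. The tag format (__src__/__tgt__) is illustrative - the paper only states that special source/target language tokens are prepended.

```python
# Sketch of the common-script preprocessing: transliterate Indic text to
# Devanagari via the 1-1 Unicode offset mapping, then prepend language tags.
# The actual IndicTrans preprocessing scripts may differ in detail.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(sentence, lang):
    """Map Indic text to Devanagari using the Unicode offset mapping."""
    return UnicodeIndicTransliterator.transliterate(sentence, lang, "hi")

def preprocess(sentence, src_lang, tgt_lang):
    """Convert to the common script and prepend language tag tokens."""
    if src_lang != "en":
        sentence = to_devanagari(sentence, src_lang)
    return f"__src__{src_lang} __tgt__{tgt_lang} {sentence}"

# e.g., a Tamil source sentence for a ta -> en model:
print(preprocess("இது ஒரு சோதனை", "ta", "en"))
```

At decoding time, the inverse mapping (transliterating Devanagari output back to the target script) uses the same transliterator with the language arguments reversed.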
Network & Training. We use transformer-based models (Vaswani et al., 2017) for training our NMT models. Table 6 shows the model configuration details. The models were trained using fairseq (Ott et al., 2019) on 8 V100 GPUs. We optimized the cross entropy loss using the Adam optimizer with a label-smoothing of 0.1 and gradient clipping of 1.0. We use mixed precision training with Nvidia Apex40. We use an initial learning rate of 5e-4, 4000 warmup steps and the same learning rate annealing schedule as proposed in (Vaswani et al., 2017). We use a global batch size of 64k tokens. We train each model on 8 V100 GPUs and use early stopping with the patience set to 5 epochs.

    Attribute                        Value
    Encoder layers                   6
    Decoder layers                   6
    Embedding size                   1536
    Number of heads                  16
    Feed-forward dim                 4096
    Model parameters (in million)    400

Table 6: Model configuration for the IndicTrans model.

Decoding. We use beam search with a beam size of 5 and the length penalty set to 1.

5 Experimental Setup

We evaluate the usefulness of the parallel data released as a part of this work by comparing the performance of a translation system trained using this data with existing state of the art models on a wide variety of benchmarks. Below, we describe the models and benchmarks used for this comparison.

5.1 Benchmarks

We use the following publicly available benchmarks for evaluating all the models:
WAT2020 (Nakazawa et al., 2020): This benchmark is part of the "Indic tasks" track of WAT2020. The dev and test sets are a subset of the Mann Ki Baat test set (Siripragada et al., 2020), which consists of the Indian Prime Minister's speeches translated to 8 Indic languages.
WAT2021: This benchmark is part of the MultiIndicMT track of WAT2021. The dataset is sourced from the PMIndia corpus (Haddow and Kirefu, 2020). It is a multi-parallel test set for 11 Indic languages containing sentences from the news domain.
WMT test sets: Dev and test sets from the News track of the WMT 2014 English-Hindi shared task (Bojar et al., 2014), the WMT 2019 English-Gujarati shared task (Barrault et al., 2019) and the WMT 2020 English-Tamil shared task (Barrault et al., 2020). All these benchmarks consist of human translations of news articles.
UFAL EnTam: This benchmark is part of the UFAL EnTam corpus (Ramasamy et al., 2012). The dataset contains parallel sentences in English-Tamil with sentences from the Bible, cinema and news domains.
PMI: We create this benchmark from the PMIndia corpus (Haddow and Kirefu, 2020) to test English-Assamese systems. The dataset consists of 1000 validation and 2000 test samples, and we ensure that the dataset does not have very short sentences (80 words).

5.2 Evaluation Metrics

We use BLEU scores for the evaluation of the models. To ensure consistency and reproducibility across the models, we provide SacreBLEU signatures in the footnotes for Indic-English41 and English-Indic42 evaluations. For Indic-English, we use the in-built, default mteval-v13a tokenizer. For En-Indic, we first tokenize using the IndicNLP tokenizer before running sacreBLEU. The evaluation script will be made available for reproducibility.

40 https://github.com/NVIDIA/apex
41 BLEU+1+smooth.exp+tok.13a+version.1.5.1
42 BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.5.1
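The scoring protocol of Section 5.2 can be sketched as follows, assuming the sacrebleu and indic_nlp_library Python packages; the to-be-released evaluation script may differ in its exact settings.

```python
# Sketch of the BLEU protocol: mteval-v13a tokenization for Indic-En,
# IndicNLP tokenization followed by tokenize="none" for En-Indic.
import sacrebleu
from indicnlp.tokenize import indic_tokenize

def bleu_indic_to_en(hypotheses, references):
    """Indic-En: sacreBLEU with the default mteval-v13a tokenizer."""
    return sacrebleu.corpus_bleu(hypotheses, [references], tokenize="13a").score

def bleu_en_to_indic(hypotheses, references, lang):
    """En-Indic: IndicNLP-tokenize both sides, then score untokenized."""
    tok = lambda s: " ".join(indic_tokenize.trivial_tokenize(s, lang))
    hyps = [tok(h) for h in hypotheses]
    refs = [tok(r) for r in references]
    return sacrebleu.corpus_bleu(hyps, [refs], tokenize="none").score
```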
5.3 Models

We compare the performance of the following models:

Commercial MT systems. We use the translation APIs provided by Google Cloud Platform (v2) and Microsoft Azure Cognitive Services (v3) to translate all the sentences in the test sets of the benchmarks described above.

Publicly available MT systems. We consider the following publicly available NMT systems:
OPUS-MT43: As a part of their ongoing work on NLP for morphologically rich languages, the Helsinki-NLP group has released translation models for bn-en, hi-en, ml-en, mr-en, pa-en, en-hi, en-ml and en-mr. These models were trained using all parallel sources available from OPUS, as described in Section 2.1. We refer the readers to the URL mentioned in the footnote for further details about the training data. For now, it suffices to know that these models were trained using a lesser amount of data as compared to the total data released in this work.
mBART5044: This is a multilingual many-to-many model which can translate between any pair of 50 languages. In particular, it supports the following language pairs which are relevant for our work: en-bn, en-gu, en-hi, en-ml, en-mr, en-ta, en-te and the reverse directions. This model is first pre-trained on large amounts of monolingual data from all the 50 languages and then jointly fine-tuned using parallel data between multiple language pairs. We refer the readers to the original paper for details of the monolingual pre-training data and the bilingual fine-tuning data (Tang et al., 2020). Once again, we note that these models use much less bilingual data in Indic languages as compared to the amount of data released in this work.

         PMI          UFAL EnTam   WAT2020      WAT2021      WMT News
en-as    1000 / 2000  -            -            -            -
en-bn    -            -            2000 / 3522  1000 / 2390  -
en-gu    -            -            2000 / 4463  1000 / 2390  - / 1016
en-hi    -            -            2000 / 3169  1000 / 2390  520 / 2506
en-kn    -            -            -            1000 / 2390  -
en-ml    -            -            2000 / 2886  1000 / 2390  -
en-mr    -            -            2000 / 3760  1000 / 2390  -
en-or    -            -            -            1000 / 2390  -
en-pa    -            -            -            1000 / 2390  -
en-ta    -            1000 / 2000  2000 / 1000  1000 / 2390  1989 / 1000
en-te    -            -            2000 / 3049  1000 / 2390  -

Table 7: List of all available benchmarks with sizes of validation / test sets across languages.

               Commercial MT       Public MT Systems             Trained on existing   Trained on
               Google  Microsoft   CVIT   OPUS-MT  mBART50       Transformer  mT5      Samanantar: IndicTrans
WAT2021  bn    20.6    21.8        -      11.4     4.7           24.2         24.8     28.4
         gu    32.9    34.5        -      -        6.0           33.1         34.6     39.5
         hi    36.7    38.0        -      13.3     33.1          38.8         39.2     43.2
         kn    24.6    23.4        -      -        -             23.5         27.8     34.9
         ml    27.2    27.4        -      5.7      19.1          26.3         26.8     33.4
         mr    26.1    27.7        -      0.4      11.7          26.7         27.6     32.4
         or    23.7    27.4        -      -        -             23.7         -        33.4
         pa    35.9    35.9        -      8.6      -             36.0         37.1     42.0
         ta    23.5    24.8        -      -        26.8          28.4         26.8     32.0
         te    25.9    25.4        -      -        4.3           26.8         28.5     35.1
WAT2020  bn    17.0    17.2        18.1   9.0      6.2           16.3         16.4     19.2
         gu    21.0    22.0        23.4   -        3.0           16.6         18.9     23.0
         hi    22.6    21.3        23.0   8.6      19.0          21.7         21.5     23.5
         ml    17.3    16.5        18.9   5.8      13.5          14.4         15.4     19.6
         mr    18.1    18.6        19.5   0.5      9.2           15.3         16.8     19.6
         ta    14.6    15.4        17.1   -        16.1          15.3         14.9     17.9
         te    15.6    15.1        13.7   -        5.1           12.1         14.2     17.8
WMT      hi    31.3    30.1        24.6   13.1     25.7          25.3         26.0     29.4
         gu    30.4    29.9        24.2   -        5.6           16.8         21.9     23.4
         ta    27.5    27.4        17.1   -        20.7          16.6         17.5     24.3
UFAL     ta    25.1    25.5        19.9   -        24.7          26.3         25.6     30.1
PMI      as    -       16.7        -      -        -             7.4          -        28.7

Table 8: BLEU scores for translation from Indian languages to English across different available benchmarks. Except for the WMT benchmark, IndicTrans trained on Samanantar outperforms all other models (including commercial MT systems).

43 https://huggingface.co/Helsinki-NLP
44 https://huggingface.co/transformers/model_doc/mbart.html
Models trained on all existing parallel data. To evaluate the usefulness of the parallel sentences in Samanantar, we train a few well studied models using all parallel data available prior to this work.
Transformer: We train one transformer model for every En-Indic language pair and one for every Indic-En language pair (22 models in all). Each model contains 6 encoder layers, 6 decoder layers and 8 attention heads per layer. The input embeddings are of size 512, the output of each attention head is of size 64 and the feedforward layer in the transformer block has 2048 neurons. The overall architecture is the same as the TransformerBASE model described in (Vaswani et al., 2017). We use byte pair encoding (BPE) with a vocabulary size of ≈32K for every language. We use Adam as the optimizer, an initial learning rate of 5e-4 and the same learning rate annealing schedule as proposed in (Vaswani et al., 2017). We train each model on 8 V100 GPUs and use early stopping with the patience set to 5 epochs.
mT5: We use Multilingual T5 (mT5), which is a massively multilingual pretrained text-to-text transformer model for NLG (Xue et al., 2021). This model supports 9 out of the 11 Indic languages that we consider in this work. However, it is not a translation model but is pre-trained using monolingual data in multiple languages to predict a corrupted/dropped span in the input sequence. We take this pre-trained model and finetune it for the translation task using all existing sources of parallel data. We finetune one model for every language pair of interest (18 pairs). We use the mT5BASE model, which has 6 encoder layers, 6 decoder layers and 12 attention heads per layer. The input embeddings are of size 768, the output of each attention head is of size 64 and the feedforward layer in the transformer block has 2048 neurons. We use AdaFactor as the optimizer, with a constant learning rate of 1e-3. We train each model on 1 v3 TPU and use early stopping with the patience set to 25K steps.
Models trained using Samanantar. We train the proposed IndicTrans model using the entire Samanantar corpus.

Note that for all the models that we have trained or finetuned as a part of this work, we have ensured that there is no overlap between the training set and any of the test sets or development sets that we have used.

6 Results and Discussion

The results of our experiments on Indic-En and En-Indic translation are reported in Tables 8 and 9 respectively. Similarly, Figures 4a and 4b give a quick comparison of the performance of different models averaged over all the benchmarks that we used per language. In particular, they compare (i) the current best existing MT system for every language pair (this system could be different for different language pairs), (ii) the current best model trained on all existing data, and (iii) IndicTrans trained on Samanantar. Below, we list the main observations from our experiments.

Compilation of existing resources was a fruitful exercise. From Figures 4a and 4b we observe that current state-of-the-art models trained on all existing parallel data (curated as a subset of Samanantar) perform competitively with commercial MT systems. In particular, in 11 of the 22 bars (11 languages in either direction) shown in the two figures, models trained on existing data outperform commercial MT systems.

IndicTrans trained on Samanantar leads to state of the art performance. Again, referring to Figures 4a and 4b, we observe that IndicTrans trained on Samanantar outperforms all existing models for all the languages in both directions (except Gujarati-English). The absolute gain in BLEU score is higher for the Indic-En direction as compared to the En-Indic direction. This is on account of better transfer in many-to-one settings compared to one-to-many settings (Aharoni et al., 2019) and a better language model on the target side.

Performance gains are higher for low resource languages. We observe significant gains for existing low resource languages such as or, as, and kn, especially in the Indic-En direction. We hypothesise that these languages benefit from other related languages with more resources during the joint training on a common script.

Pre-training needs further investigation. mT5, which is pre-trained on large amounts of monolingual corpora from multiple languages, does not always outperform a TransformerBASE model which is just trained on existing parallel data without any pre-training. While this does not invalidate the value of pre-training, it does suggest that pre-training needs to be optimized for the specific languages. As future work, we would like to explore pre-training using the monolingual corpora on Indic languages available from IndicCorp. Further,