Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. arXiv:1906.02358v10 [cs.CL] 3 Sep 2021 However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field. Index Terms—Sinhala, Natural Language Processing, Resource Poor Language F 1 I NTRODUCTION one hand, according to Liddy [24], these systems can be 1 categorized into the following layers; phonological, morpho- Sinhala language, being the native language of the Sin- logical, lexical, syntactic, semantic, discourse, and pragmatic. halese people [2–4], who make up the largest ethnic group The phonological layer deals with the interpretation of lan- of the island country of Sri Lanka, enjoys being reported as guage sounds. As such, it consists of mainly speech-to-text the mother tongue of approximately 16 million people [5, 6]. and text-to-speech systems. In cases where one is working To give a brief linguistic background for the purpose of with written text of the language rather than speech, it is aligning the Sinhala language with the baseline of English, possible to replace this layer with tools which handle Op- primarily it should be noted that Sinhala language belongs tical Character Recognition (OCR) and language rendering to the same the Indo-European language tree [7, 8]. How- standards (such as Unicode [26]). The morphological layer ever, unlike English, which is part of the Germanic branch, analyses words at their smallest units of meaning. As such, Sinhala belongs to the Indo-Aryan branch. Further, Sinhala, analysis on word lemmas and prefix-suffix-based inflection unlike English, which borrowed the Latin alphabet, has its are handled in this layer. Lexical layer handles individual own writing system, which is a descendant of the Indian words. Therefore tasks such as Part of Speech (PoS) tagging Brahmi script [9–15]. By extension, this makes Sinhala Script happens here. The next layer, syntactic, takes place at the a member of the Aramaic family of scripts [16, 17]. Thus phrase and sentence level where grammatical structures by inheritance, Sinhala writing system is abugida (alphasyl- are utilized to obtain meaning. Semantic layer attempts to labary), which to say that consonant-vowel sequences are derive the meanings from the word level to the sentence written as single units [18]. It should be noted that the mod- level. Starting with Named Entity Recognition (NER) at ern Sinhala language have loanwords from languages such the word level and working its way up by identifying the as Tamil, English, Portuguese, and Dutch due to various contexts they are set in until arriving at overall meaning. The historical reasons [19]. Regardless of the rich historical array discourse layer handles meaning in textual units larger than a of literature spanning several millennia (starting between sentence. In this, the function of a particular sentence maybe 3rd to 2nd century BCE [20, 21]), modern natural language contextualized within the document it is set in. Finally, the processing tools for the Sinhala language are scarce [22]. pragmatic layer handles contexts read into contents without Natural Language Processing (NLP) is a broad area cov- having to be explicitly mentioned [23, 24]. Some forms ering all computational processing and analysis of human of anaphora (coreference) resolution [27–31] fall into this languages. To achieve this end, NLP systems operate at application. different levels [23–25]. A graphical representation of NLP On the other hand, Wimalasuriya and Dou [25] catego- layers and application domains are shown in Figure 1. On rize NLP tools and research by utility. They introduce three categories with increasing complexity; Information Retrieval • Nisansa de Silva is with the Department of Computer Science & Engi- (IR), Information Extraction (IE), and Natural Language Un- neering, University of Moratuwa. E-mail: nisansa@cse.mrt.ac.lk derstanding (NLU). Information Retrieval covers applications, Manuscript revised: September 7, 2021. which search and retrieve information which are relevant 1. Englebretson and Genetti [1] observe that in some contexts the to a given query. For pure IR, tools and methods up-to Sinhala language is also referred as Sinhalese, Singhala, and Singhalese and including the syntactic layer in the above analysis are
2 Natural Language Understanding Domain knowledge Pragmatic Layer Application reasoning and execution Contextual Utterance Reasoning Discourse context planning Discourse Layer Information Extraction NER and Semantic tagging Semantic model Semantic realisation Semantic Layer Information Retrieval Parsing Syntactic realisation Syntactic Layer Grammar PoS tagging Lexicon Lexical realisation Lexical Layer Morphological Analysis Morphological Rules Morphological realisation Morphological Layer Speech Analysis Pronunciation model Speech synthesis Phonological Layer Input text Output text Fig. 1: NLP layers and tasks [23] used. Information Extraction, on the other hand, extracts this survey, we shall also consider the Sri Lankan NLP tools structured information. The difference between IR and IE repository, lknlp3 . This manuscript is at version 4.0.0. The is the fact that IR does not change the structure of the latest version of the manuscript can be obtained from arXiv4 documents in question. Be them structured, semi-structured, or ResearchGate5 . or unstructured, all IR does is fetching them as they are. In Figure 3 in Appendix A shows the most prolific re- comparison, IE, takes semi-structured or unstructured text searchers in the domain of Sinhala NLP. The nodes contain and puts them in a machine readable structure. For this, IE the name of the researcher along with the total number of utilizes all the layers used by IR and the semantic layer. Nat- Sinhala NLP papers that researcher has authored. The edges ural Language Understanding is purely the idea of cognition. between two researchers are labeled with the number of Most NLU tasks fall under AI-hard category and remain Sinhala NLP papers the relevant pair of researchers have co- unsolved [23]. However, with varying accuracy, some NLU authored. Given that the objective of the visualization is to tasks such as machine translation2 are being attempted. portray the co-operation between researchers, the threshold The pragmatic layer of the above analysis belongs to the used here is to have at least 3 papers on Sinhala NLP NLU tasks while the discourse layer straddles information with on another researcher. We have also added labels to extraction and natural language understanding [23]. clusters in cases where all or the majority of researchers The objective of this paper is to serve as a comprehensive in those clusters have the same affiliation. It is observable survey on the state of natural language processing resources that the cluster from the Department of Computer Science & for the Sinhala language. The initial structure and content Engineering, University of Moratuwa is the most prolific in of this survey are heavily influenced by the preliminary Sinhala NLP research. surveys carried out by de Silva [22] and Wijeratne et al. The remainder of this survey is organized as follows; [23]. However, our hope is to host this survey at arXiv Section 2 introduces some important properties and con- as a perpetually evolving work which continuously gets ventions of the Sinhala language which are important for updated as new research and tools for Sinhala language the development and understanding of Sinhala NLP. Sec- are created and made publicly available. Hence, it is our tion 3 discusses the various tools and research available for hope that this work will help future researchers who are Sinhala NLP. In this section we would discuss both pure engaged in Sinhala NLP research to conduct their literature Sinhala NLP tools and research as well as hybrid Sinhala- surveys efficiently and comprehensively. For the success of 3. https://github.com/lknlp/lknlp.github.io 2. This is, however, not without the criticism of being nothing more 4. https://arxiv.org/abs/1906.02358 than a Chinese room [32] rather than true NLU. 5. http://bit.ly/31AhvvR
3 English work. We will also discuss research and tools which Herath et al. [49] categorize Sinhala suffixes along the contributes to Sinhala NLP either along with or by the help attributes of: gender, number, definiteness, case, and con- of Tamil, the other official language of Sri Lanka. Section 4 junctive. They further claim that there are 3 types of suffixes: gives a brief introduction to the primary language sources Suf1 adds gender, number, and definiteness; Suf2 adds case; used by the studies discussed in this work. Finally, Section 5, and Suf3 adds conjunctive. Conjunctive is claimed to be concludes the survey. equivalent to too and and in English. We show an extension of the suffix structure proposed by Herath et al. [49] in 2 P ROPERTIES OF THE S INHALA L ANGUAGE Table 4. Before moving on to discussing Sinhala NLP resources, we shall give a brief introduction to some of the important 3 S INHALA NLP RESOURCES properties of Sinhala language, which impact the develop- ment of Sinhala NLP resources. Sinhala grammar has two In this section we generally follow the structure shown in forms: written (literary) and spoken. These forms differ from Figure 1 for sectioning. However, in addition to that, we each-other in their core grammatical structures [1, 33, 34]. also discuss topics such as available corpora, other data The written form strictly adheres to the SOV (Subject, sets, dictionaries, and WordNets. We focus on NLP tools Object, and Verb) configuration [35, 36]. Further, in the and research rather than the mechanics of language script written form, subject-verb agreement is enforced [37] such handling [50–55]. One of the earliest attempts on Sinhala that, in order to be grammatically correct, the subject and the NLP was done by Herath et al. [56]. However, progress verb must agree in terms of: gender (male/female), num- on that project has been minimal due to the limitations of ber (singular/plural) and person (1st/2nd/3rd). However, their time. The later work by Nandasara [57] has not caught in spoken Sinhala, the SOV order can be neglected [38] much of the advances done up to the time of its publication. and male singular 3rd person verb can be used for all Given that it was a decade old by the time the first edition nouns [37]. Sinhala is also a head-final language, where of this survey was compiled, we observe the existence of the complements and modifiers would appear before their many new discoveries in Sinhala NLP which have not been heads [39] this is similar to that of English and dissimilar taken into account by it. A review on some challenges and to that of French. In total, according to Abhayasinghe [40], opportunities of using Sinhala in computer science was there are 25 types of simple sentence structures in Sinhala. done by Nandasara and Mikami [58]. At this point, it is Similar to many Indo Aryan languages, animacy plays a worthy to note that the largest number of studies in Sinhala major role in Sinhala grammar in syntactic and semantic NLP has been on optical character recognition (OCR) rather roles [41–43]. Comparative studies done by Noguchi [44] than on higher levels of the hierarchy shown in Figure 1. On and by Miyagishi [34, 45] have found that animacy extends the other hand, the most prolific single project of Sinhala its influence from phrase level to sentence level in Sinhala NLP we have observed so far is an attempt to create an (e.g., Usage of post-positions [35, 46]). On this matter, Ta- end-to-end Sinhala-to-English translator [18, 59–76]. Tamil, ble 1 explains grammatical cases and inflections of animate the other official language of Sri Lanka is also a resource common nouns while Table 2 explains grammatical cases poor language. However, due to the existence of larger and inflections of inanimate common nouns. We provide populations of Tamil speakers worldwide, including but not a comparative analysis of parsing the very simple English limited to economic powerhouses such as India, there are sentence “I eat a red apple” and its Sinhala, Hindi, and French more research and tools available for Tamil NLP tasks [23]. translations in Fig 2. English and French parsing was done Therefore, it is rational to notice that Sinhala and Tamil NLP using the Stanford Parser6 . Hindi parsing was done using the endeavours can help each other. Especially, given the above IIIT-Hyderabad Parser7 and the study by Singh et al. [47]. fact, that these are official languages of Sri Lanka, results Herath et al. [48, 49] argue that pure Sinhala words did in the generation of parallel data sets in the form of official not have suffixes and that adding suffixes was incorporated government documents and local news items. A number to Sinhala after 12th century BC with the influx of Sanskrit of researchers make use of this opportunity. We shall be words. With this, they declare Sinhala to have to following discussing those applications in this paper as well. Further, types of words: there have been some fringe implementations, which bridge Sinhala with other languages such as Japanese [8, 21, 77–79]. 1) Suffixes 2) Nouns 3) Cases 3.1 Corpora 4) Verbs For any language, the key for NLP applications and im- 5) Conjunctions and articles plementations is the existence of adequate corpora. On this 6) Adjectives matter, a relatively substantial Sinhala text corpus8 was 7) Demonstratives, Interrogatives, and negatives created by Upeksha et al. [80, 81] by web crawling. Later a 8) Particles and prefixes smaller Sinhala newes corpus9 was created by de Silva [22]. They further divide nouns into five groups: material, agen- Both of the above corpora are publicly available. However, tive, common, abstract, and proper. In addition to these, none of these come close to the massive capacity and range they also introduce compound nouns.We show the noun of the existing English corpora. A word corpus of ap- categorization proposed by Herath et al. [48] in Table 3. proximately 35,000 entries was developed by Weerasinghe 6. http://nlp.stanford.edu:8080/parser/ 8. https://osf.io/a5quv/ 7. http://ltrc.iiit.ac.in/analyzer/ 9. https://osf.io/tdb84/
4 TABLE 1: Examples for grammatical cases and inflection of animate common nouns Singular Plural Form Case Masculine Feminine Masculine Feminine Definite Indefinite Definite Indefinite ගැහැ ය 1 Nominative සා ෙස ගැහැ ය ස් ගැහැ ගැහැ ෙය 2 Accusative සා ෙස ගැහැ ය ගැහැ යක ගැහැ 3 Auxiliary ගැහැ ය ට 4 Dative සාට ෙස ට ගැහැ යට ට ගැහැ ට ගැහැ යකට 5 Genitive සාෙ ෙස ෙ ගැහැ යෙ ගැහැ යකෙ ෙ ගැහැ ෙ 6 Locative 7 Instrumental සාෙග ෙස ෙග ගැහැ යෙග ගැහැ යකෙග ෙග ගැහැ ෙග 8 Ablative 9 Vocative ස ස ගැහැ ය ගැහැ ය ගැහැ TABLE 2: Examples for grammatical cases and inflection released16 a massive corpus of text and stop words taken of inanimate common nouns Note that the grammatical cases of Auxiliary and Vocative do not exist from a decade of Sinhala Facebook posts. A parallel corpus for inanimate nouns in Sinhala. of Sinhala and English was collected by Banón et al. [86] containing 217,407 sentences and available to download Singular from their website17 . However the later audit by Caswell Plural Form Case Definite Indefinite et al. [87] raised issues on the quality of that data set. 1 Nominative A parallel corpus18 of aligned Sinhala-English documents ෙප ත ෙප ත ෙප 2 Accusative and sentences obtained from crawling the web was released 4 Dative ෙප තට ෙප තකට ෙප වලට by Sachintha et al. [88]. 5 Genitive ෙප ෙ As for Sinhala-Tamil corpora, Hameed et al. [89] claim ෙප තක ෙප වල 6 Locative ෙප ෙත to have built a sentence aligned Sinhala-Tamil parallel cor- 7 Instrumental ෙප ෙත pus and Mohamed et al. [90] claim to have built a word ෙප ත ෙප ව 8 Ablative ෙප aligned Sinhala-Tamil parallel corpus. However, at the time of writing this paper, neither of them was publicly available. TABLE 3: Noun categorization by Herath et al. [48] A very small Sinhala-Tamil aligned parallel corpus created by Farhath et al. [91] using order papers of government Type Examples of Sri Lanka is available to download19 . A Sinhala and Tamil annotated corpus for emotion analysis was created Material , ෙගය by Jenarthanan et al. [92]. Agentive ව නා, ම Common ෙග යා, සා 3.2 Data Sets Abstract , උස Specific data sets for Sinhala, as expected, is scarce. How- Proper ෙක ළඹ, අමර ever, a Sinhala PoS tagged data set [93–95] is available to Compound බ ( +බ ), ම ( +ම ) download from github20 . Further, a Sinhala NER data set created by Manamini et al. [96] is also available to download from github21 . et al. [82]. But it does not seem to be online anymore. A Facebook has released FastText [97–99] models for the number of Sinhala-English parallel corpora were introduced Sinhala language trained using the Wikipedia corpus. They by Guzmán et al. [83]. This includes a 600k+ Sinhala- are available as both text models22 and binary files23 . Using English subtitle pairs10 initially collected by [84], 45k+ the above models by Facebook, Lakmal et al. [100] have Sinhala-English sentence pairs from GNOME11 , KDE12 , and created an extended FastText model trained on Wikipedia, Ubuntu13 . Guzmán et al. [83] further provided two mono- News, and official government documents. The binary file24 lingual corpora for Sinhala. Those were a 155k+ sentences of of the trained model is available to be downloaded. Herath filtered Sinhala Wikipedia14 and 5178k+ sentences of Sinhala 16. https://bit.ly/2GEI4d6 common crawl15 . Wijeratne and de Silva [85] have publicly 17. https://www.paracrawl.eu/ 18. https://github.com/kdissa/comparable-corpus 10. http://bit.ly/2KsFQxm 19. http://bit.ly/2HTMEme 11. http://bit.ly/2Z8q0fo 20. http://bit.ly/2Krhrbv 12. http://bit.ly/2WLY6bI 21. http://bit.ly/2XrwCoK 13. http://bit.ly/2wLVZGt 22. http://bit.ly/2JXAyL8 14. http://bit.ly/2EQZ7oM 23. http://bit.ly/2JY5J9c 15. http://bit.ly/2ZaQFZo 24. http://bit.ly/2WowH0h
5 S S NP VP NP VP NP V VBP NP JJ NP PRP DT JJ NN PRP JJ NN I eat a red apple මම ර ඇප ෙග ය ක (a) English (b) Sinhala S S NP VP VN NP NP VGF CLS V DET MWN PRP JJ NN VM VP N ADJ म लाल सेब खाता ँ Je mange une pomme rouge (c) Hindi (d) French Fig. 2: Parse trees for the sentence “I eat a red apple” in four languages. et al. [101] has compiled a report on the Sinhala lexicon for earliest and most extensive Sinhala-English dictionary avail- the purpose of establishing a basis for NLP applications. A able for consumption was by Malalasekera [103]. However, comparative analysis of Sinhala word embedding has been this dictionary is locked behind copyright laws and is not conducted by Lakmal et al. [100]. A dataset25 consisting available for public research and development. This copy- of 3576 Sinhala documents drawn from Sri Lankan news right issue is shared with other printed dictionaries [104– websites and tagged (CREDIBLE, FALSE, PARTIAL or UN- 109] as well. The dictionary by Kulatunga [110] is publicly CERTAIN) was published by Jayawickrama et al. [102] available for usage through an online web interface but does not provide API access or means to directly access the data set. The largest publicly available English-Sinhala dictionary 3.3 Dictionaries data set is from a discontinued FireFox plug-in EnSiTip [111] A necessary component for the purpose of bridging Sinhala which bears a more than passing resemblance to the above and English resources are English-Sinhala dictionaries. The dictionary by Kulatunga [110]. Hettige and Karunananda [62] claim to to have created a lexicon to help in their 25. https://github.com/LIRNEasia/MisinformationCorpusSinhala attempt to create a system capable of English-to-Sinhala
6 TABLE 4: Extension of the suffix structure proposed by Herath et al. [49] Number = Singular (S) / Plural (P) Definite = Definite (D) / Indefinite (I) / Undecided (U) Case = Nominative (N) / Accusative (A) / Dative (Da) / Genitive (G) / Instrumentive (In) / Auxiliary (Au) / Locative (L) / Ablative (Ab) Conjun = with suffix th (Y) / without suffix th (N) Attributes Suf1 Suf2 Suf3 Noun -tree- Number Definite Case Conjun - - - ගස් (-s) P U N, A, V N අ - - ගස (the-) S D N, A, V N අ - - ගස (a-) S I N, A N අක - - ගසක (of a-) S I G N අ ට - ගසට (to the-) S D Da N - ඒ - ගෙස (in the-) S D Au N - වල - ගස්වල (on -s) P U L N - වලට - ගස්වලට (to -s) P U Da N - එ - ගෙස (of the-) S D G N අ ඉ - ගස (from a-) S I Au N - - උ ග (-s and) P U N, A Y අ - ගස (-and) S D N, A Y අ - උ ගස (a- too) S I N, A Y අක - ගසක (of a -, too) S U N, A Y අ ට ගසට (to the- too) S D Da Y - ඒ ගෙස (in -s too) S D Au Y - වල ගස්වල (on -s too) P U L Y - වලට ගස්වලට (to -s too) P U Da Y - එ ගෙස (of the- too) S D G Y අ ඉ උ ගස (from a- too) S I Au Y machine translation. A review on the requirements for what Arukgoda et al. [121] have cloned to use in their appli- English-Sinhala smart bilingual dictionary was conducted cation uploaded on github26 . However, even at its peak, due by Samarawickrama and Hettige [112]. to the lack of volunteers for the crowd-soured methodology There exists the government sponsored trilingual dic- of populating the WordNet, it was at best an incomplete tionary [113], which matches Sinhala, English, and Tamil. product. Another effort to build a Sinhala Wordnet was However, other than a crude web interface on the ministry initiated by Welgama et al. [122] independently from above; website, there is no efficient API or any other way for a but it too have stopped progression even before achieving researcher to access the data on this dictionary. Weerasinghe the completion level of Wijesiri et al. [119]. and Dias [114] have created a multilingual place name database for Sri Lanka which may function both as a dic- 3.5 Morphological Analyzers tionary and a resource for certain NER tasks. As shown in Fig 1, morphological analysis is a ground level necessary component for natural language processing. 3.4 WordNets Given that Sinhala is a highly inflected language [22, 37, 38], WordNets [115] are extremely powerful and act as a versatile a proper morphological analysis process is vital. The ear- component of many NLP applications. They encompass a liest attempt on Sinhala morphological analysis we have number of linguistic properties which exist between the observed are the studies by Herath et al. [48, 49]. They are words in the lexicon of the language including but not more of an analysis of Sinhala morphology rather than a limited to: hyponymy, hypernymy, synonymy, and meronymy. working tool. As such we discussed the observations and Their uses range from simple gazetteer listing applica- conclusions of these works at Section 2. It is also worth to tions [25] to information extraction based on semantic simi- note that these works predates the introduction of Sinhala larity [116, 117] or semantic oppositeness [118]. An attempt unicode and thus use a transliteration of Sinhala in the Latin has been made to build a Sinhala Wordnet by Wijesiri et al. alphabet. [119]. For a time it was hosted on [120] but it too is now defunct and all the data and applications are lost other than 26. https://github.com/jseanm1/aruthSWSD
7 The next attempt by Herath et al. [123] creates a modular and Silva [140] proposed a stochastic POS tagger based on a unit structure for morphological analysis of Sinhala. Much small 10,000 word corpus drawn from Facebook and Twitter. later, as a step on their efforts to create a system with the ability to do English-to-Sinhala machine translation, Hettige 3.7 Parsers and Karunananda [59] claim to have created a morphologi- The PoS tagged data then needs to be handed over to cal analyzer (void of any public data or code), which links a parser. This is an area which is not completely solved to their studies of a Sinhala parser [60] and computational even in English due to various inherent ambiguities in grammar [18]. Hettige et al. [71] further propose a multi- natural languages. However, in the case of English, there agent System for morphological analysis. Welgama et al. are systems which provide adequate results [141] even if not [124] attempted to evaluate machine learning approaches perfect yet. The Sinhala state of affairs, is that, the first parser for Sinhala morphological analysis. Yet another indepen- for the Sinhala language was proposed by Hettige and dent attempt to create a morphological parser for Sinhala Karunananda [60] with a model for grammar [18]. The study verbs was carried out by Fernando and Weerasinghe [125]. by Liyanage et al. [38] is concentrated on the same given that Later, another study, which was restricted to morphological they have worked on formalizing a computational grammar analysis of Sinhala verbs was conducted by Dilshani and for Sinhala. While they do report reasonable results, yet Dias [126]. There was no indication on whether this work again, do not provide any means for the public to access the was continued to cover other types of words. Further, other data or the tools that they have developed. Kanduboda [37] than this singular publication, no data or tools were made have worked on Sinhala differential object markers relevant publicly accessible. Nandathilaka et al. [127] proposed a rule for parsing. based approach for Sinhala lemmatizing. The work by Wel- The first attempt at a Sinhala parser, as mentioned above, gama et al. [128] claim to have set a set of gold standard was by Hettige and Karunananda [60] where they created definitions for the morphology of Sinhala Words; but given prototype Sinhala morphological analyzer and a parser as that their results are not publicly available, further usage or part of their larger project to build an end-to-end transla- confirmation of these claims cannot not be done. The table 5 tor system. The function of the parser is based on three provides a comparative summery of the discussion above. dictionaries: Base Dictionary, Rule Dictionary, and Concept Dictionary. They are built as follows: 3.6 Part of Speech Taggers The next step after morphological analysis is Part of Speech • The Base Dictionary: prakurthi (base words), nipatha (PoS) tagging. The PoS tags differ in number and function- (prepositions), upasarga (prefixes), and vibakthi (Irreg- ality from language to language. Therefore, the first step in ular Verbs). creating an effective PoS tagger is to identifying the PoS • The Rule Dictionary: inflection rules used to gener- tag set for the language. This work has been accomplished ate various forms of verbs and nouns from the base by Fernando et al. [93] and Dilshani et al. [94]. Expanding words. on that, Fernando et al. [93] has introduced an SVM Based • The Concept Dictionary: synonyms and antonyms PoS Tagger for Sinhala and then Fernando and Ranathunga for the words found in the base dictionary. [95] give an evaluation of different classifiers for the task Parsers are, in essence, a computational representation of Sinhala PoS tagging. While here it is obvious that there of the grammar of a natural language. As such, in building has been some follow up work after the initial foundation, Sinhala parsers, it is crucial to create a computational model it seems, all of that has been internal to one research group for Sinhala grammar. The first such attempt was taken at one institution as neither the data nor the tools of any by Hettige and Karunananda [18] with special consideration of these findings have been made available for the use of given to Morphology and the Syntax of the Sinhala language external researchers. Several attempts to create a stochastic as an extension to their earlier work [60]. Here, it is worthy PoS tagger for Sinhala has been done with the studies to note that, unlike in their earlier attempt [60], where they by Herath and Weerasinghe [129], Jayaweera and Dias [130], explicitly mentioned that they are building a parser, in this and Jayasuriya and Weerasinghe [131] being the most no- study [18], they use the much conservative claim of building table. Within a single group which did one of the above a computational grammar. Under Morphology, they again stochastic studies [130], yet another set of studies was car- handled Sinhala inflection. Their system is based on a Finite ried out to create a Sinhala PoS tagger starting with the foun- State Transducer (FST) and Context-Free Grammar (CFG) dation of Jayaweera and Dias [132] which then extended where they they modeled 85 rules for nouns and 18 rules to a Hidden Markov Model (HMM) based approach [133] for verbs. The specific implementation is more partial to a and an analysis of unknown words [134, 135]. Further, this rule-based composer rather than parser. It is also worthy to group presented a comparison of few Sinhala PoS taggers note that this system could only handle simple sentences that are available to them [136]. A RESTFul PoS tagging which only contained the following 8 constituents: Attribu- web service created by Jayaweera and Dias [137] using the tive Adjunct of Subject, Subject, Attributive Adjunct of Object, above research can still be accessed27 via POST and GET. Object, Attributive Adjunct of Predicate, Attributive Adjunct A hybrid PoS tagger for Sinhala language was proposed of the Complement of Predicate, Complement of Predicate, and by Gunasekara et al. [138]. The study by Kothalawala et al. Predicate. With these, they propose the following grammar [139] discussed the data availability problem in NLP with a rules for Sinhala: Sinhala POS tagging experiment among others. Withanage S = Subject Akkyanaya 27. http://bit.ly/2F0jKid Subject = SimpleSubject | ComplexSubject
8 TABLE 5: Morphological Analyzers comparison Base: Rule-based (RB) / Machine Learning (ML) Able to Handle Part of Speech (Handles): Yes (Y) / No (N) Outputs: Yes (Y) / No (N) / No Information (O) Abbreviations: Nouns (Nu), Verbs (Ve), Adjectives (Aj), Adverbs (Av), Function Words (Fn), Root (R), Person (P), Number (Nb), Gender (G), Article (A), Case (C) Handles Output Base Modus Operandi Nu Ve Aj Av Fn R P Nb G A C Hettige and Karunananda [59] RB Finite State Automata Y Y N N N Y Y Y Y Y Y Hettige et al. [71] RB Agent-based Y Y Y Y Y N N N N N N Nandathilaka et al. [127] RB N/A Y N N N N Y N N N N N Welgama et al. [124] ML Morfessor algorithm Y Y Y Y Y Y N N N N N Fernando and Weerasinghe [125] RB Finite State Transducer N Y N N N Y Y Y Y N Y Dilshani and Dias [126] RB N/A N Y N N N O O O O O O ComplexSubject = SimpleSubject ConSub Similar to Hettige and Karunananda [18], the work SimpleSubject = Noun | Adjective Noun by Liyanage et al. [38] also builds a CFG for Sinhala covering ConSub = Conjunction SimpleSubject 10 out of the 25 types of simple sentence structures in Sin- Akkyanaya = VerbP | Object VerbP hala reported by Abhayasinghe [40]. This parser is unable Object = SimpleObject | ComplexObject to parse sentences where inanimate subjects do not consider ComplexObject = Conjunction SimpleObject the number. Further, sentences which contain, compound SimpleObject = Noun | Adjective Noun verbs, auxiliary verbs, present participles, or past participles VerbP = Verb | Adverb Verb cannot be handled by this parser. If the verbs have imperative mood or negation those too cannot be handled by this. Non- The later work by Liyanage et al. [38] also involves verbal sentences which end with adjectives, oblique nominals, formalizing a computational grammar for Sinhala. They locative predicates, adverbials, or any other language entity claim that Sinhala can have any order of words in practice. which is not a verb cannot be handled by this parser. However, they do not note that this is happening because The study by Kanduboda [37] covers not the whole of practices of the spoken language, which does not share Sinhala parsing but analyzes a very specific property of the strong SOV conventions of the written language, are Sinhala observed by Aissen [143] which states that it is pos- slowly seeping into written text. However, they do make sible to notice Differential Object Marking (DOM) in Sinhala note of how Sinhala grammar is modeled as a head-final active sentences. Kanduboda [37] define this as the choice language [39]. They propose the Sinhala Noun Phrase (N P ) of /wa/ and /ta/ object markers. They further observe three to be defined as shown in equation 1 where N N is a noun unique aspects of DOM in Sinhala: (a) it is only observed which can be of types: common noun (N ), pronoun (P rN ) in active sentences which contain transitive verbs, (b) it can or proper noun (P ropN ). The adjectival phrase (ADJP ) occur with accusative marked nouns but not with any other is then defined as as shown in equation 2 where: Det is cases, (c) it exists only if the sentence has placed an animate a Determiner, Adj is the adjective, and Deg is an optional noun in the accusative position. They do a statistical analysis operator Degrees which can be used to intensify the meaning and provide a number of short gazetteer lists as appendixes. of the adjective in cases where the adjective is qualitative. However, they observe that further work has to be done While they note that according to Gunasekara [142], there for this particular language rule in Sinhala given that they has to three classes of adjectives (qualitative, quantitative, found some examples which proved to be exceptions to the and demonstrative), they do not implement this distinction general model which they proposed. in their system. Similarly, they propose Sinhala Verb Phrase (V P ) to be defined as shown in equation 3 where V is a single verb. They here note that they are ignoring compound 3.8 Named Entity Recognition Tools verbs and auxiliary verbs in their grammar. The adverbial phrases (ADV P ) are then recursively defined as as shown As shown in Fig 1, once the text is properly parsed, it in equation 4. has to be processed using a Named Entity Recognition (NER) system. The first attempt of Sinahla NER was done N P = [ADJP ][N N ] (1) by Dahanayaka and Weerasinghe [144]. Given that they were conducting the first study for Sinhala NER, they based their approach on NER research done for other languages. In i this, they gave prominent notice to that of Indic languages. h ADJP = [Det] [Deg][Adj] (2) On that matter, they were the first to make the interesting observation that NER for Indic languages (including, but V P = [ADV P ][V ] (3) not limited to Sinhala) is more difficult than that of English by the virtue of the absence of a capitalization mechanic. Following prior work done on other languages, they used " # Conditional Random Fields (CRF) as their main model and i compared it against a baseline of a Maximum Entropy (ME) h ADV P = [N P ] [ADV P ] [Deg][ADV ] (4) model. However, they only use the candidate word, Context
9 TABLE 6: NER system comparison * Denotes a baseline. F1 to F11 denotes Context Words, Word Prefixes and Suffixes, Length of the Word, Frequency of the Word, First Word/ Last Word of a Sentence, (POS) Tags, Gazetteer Lists, Clue Words, Outcome Prior, Previous Map, and Cutoff Value Features CRF ME F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 Dahanayaka and Weerasinghe [144] Yes Yes* Yes Yes No No No No No No No No No Senevirathne et al. [145] Yes No Yes Yes Yes No Yes No No Yes No Yes No Manamini et al. [96] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Words around the candidate word, and a simple analysis of work on NER [96] and PoS tagging [95]. Sinhala suffixes as their features. The follow up work by Senevirathne et al. [145] kept the 3.9 Semantic Tools CRF model with all the previous features but did not report Applications of the semantic layer are more advanced than comparative analysis with an ME model. The innovation in- the ones below it in Figure 1. But even with the obvious troduced by this work is a richer set of features. In addition lack of resources and tools, a number of attempts have to the features used by Dahanayaka and Weerasinghe [144], been made on semantic level applications for the Sinhala they introduced, Length of the Word as a threshold feature. Language. The earliest attempt on semantic analysis was They also introduced First Word feature after observing done by Herath et al. [147] using their earlier work which certain rigid grammatical rules of Sinhala. A feature of clue dealt with Sinhala morphological analysis [48]. A Sinhala Words in the form of a subset of Context Words feature semantic similarity measure has been developed for short was first proposed by this work. Finally, they introduced sentences by Kadupitiya et al. [148]. This work has been a feature for Previous Map which is essentially the NE value then extended by Kadupitiya et al. [149] for the application of the preceding word. Some of these feature extractions use case of short answer grading. Data and tools for these are done with the help of a rule-based post-processor which projects are not publicly available. A deterministic process utilizes context-based word lists. flow for automatic Sinhala text summarizing was proposed The third attempt at Sinhala NER was by Manamini et al. by Welgama [150]. The study by Wimalasuriya [151], which [96] who dubbed their system Ananya. They inherit the CRF has the same name as the above work by Welgama [150], model and ME baseline from the work of Dahanayaka and uses graph based TextRank algorithm for automatic Sinhala Weerasinghe [144]. In addition to that, they take the en- text summarizing. hanced feature list of Senevirathne et al. [145] and enrich it There have been multiple attempts to do word sense further more. They introduce a Frequency of the Word feature disambiguation (WSD) [152–156] for Sinhala. For this, Aruk- based on the assumption that most commonly occurring goda et al. [121] have proposed a system named Aruth words are not NEs. Thus, they model this as a Boolean based on the Lesk Algorithm[157]. An online tool29 , an API30 value with a threshold applied on the word frequency. They of the algorithm, and code along with data on github31 extend the First Word feature proposed by Senevirathne are available. For the same task, Marasinghe et al. [158] et al. [145] to a First Word/ Last Word of a Sentence feature have proposed a system based on probabilistic modeling. noting that Sinhala grammar is of SOV configuration. They A dialogue act recognition system which utilizes simple introduce a (PoS) Tag feature and a gazetteer lists based classification algorithms has been proposed by Palihakkara feature keeping in line with research done on NER in et al. [159]. other languages. They formally introduce clue Words, which Text classification is a popular application on the seman- was initially proposed as a sub-feature by Dahanayaka tic layer of the NLP stack. A very basic Sinhala text classi- and Weerasinghe [144], as an independent feature. Utilizing fication using Naı̈ve Bayes Classifier, Zipf’s Law Behavior, the fact that they have the ME model unlike Dahanayaka and SVMs was attempted by Gallege [160]. A smaller imple- and Weerasinghe [144], they introduce a complementary mentation of Sinhala news classification has been attempted feature to Previous Map named Outcome Prior, which uses the by de Silva [22]. As mentioned in Section 3.2, their news underlying distribution of the outcomes of the ME model. corpus is publicly available32 . Another attempt on Sinhala Finally, they introduce a Cutoff Value feature to handle the text classification using six popular rule based algorithms over-fitting problem. was done by Lakmali and Haddela [161]. Even-though they The table 6 provides comparative summery of the dis- talk about building a corpus named SinNG5, they do not cussion above. It should be noted that all three of these mod- indicate of means for others to obtain the said corpus. els only tag NEs of types: person names, location names and Another study by Kumari and Haddela [162] utilizes the organization names. The Ananya system by Manamini et al. SinNG5 corpus as the data set for their attempt to use [96] is available to download at GitHub 28 . The data and LIME [163] for human interpretability of Sinhala document code for the approaches by Dahanayaka and Weerasinghe classification. However, they too do not provide access cor- [144] and by Senevirathne et al. [145] are not accessible to pus. Nanayakkara and Ranathunga [164] have implemented the public. Azeez and Ranathunga [146] proposed a fine- 29. http://aruth.herokuapp.com/ grained NER model for Sinhala building on their earlier 30. https://bit.ly/3sJEYbS 31. https://github.com/jseanm1/aruthSWSD 28. http://bit.ly/2XrwCoK 32. https://osf.io/tdb84/
10 a system which uses corpus-based similarity measures for created Sinhala document readers for visually impaired per- Sinhala text classification. Gunasekara and Haddela [165] sons to be used on Android devices. An OCR and Text-to- claim to have created a context aware stop word extraction Speech system for Sinhala named Bhashitha was proposed method for Sinhala text classification based on simple TF- by De Zoysa et al. [194]. Anuradha and Thelijjagoda [195] IDF. An LSTM based textual entailment system for Sinhala proposed a machine translation system to convert Sinhala was proposed by Jayasinghe and Sirts [166]. and English Braille documents into voice. A separate group A simple MLP based method to classify sentiments has done work on Sinhala text-to-speech systems indepen- in Sinhala text was initially proposed by Medagoda [167] dent to above [196]. based on their prior work [168]. A word2vec based tool33 On the converse, Nadungodage et al. [197] have done a for sentiment analysis of Sinhala news comments is avail- series of work on Sinhala speech recognition with special able. A methodology for constructing a sentiment lexicon notice given to Sinhala being a resource poor language. This for Sinhala Language in a semi-automated manner based project divides its focus on: continuity [198], active learn- on a given corpus was proposed by Chathuranga et al. ing [199], and speaker adaptation [200]. A Sinhala speech [169]. Demotte et al. [170] proposed a sentiment analysis recognition for voice dialing which is speaker independent system based on sentence-state LSTM Networks for Sinhala was proposed by Amarasingha and Gamini [201] and on news comments. In the subsequent work [171], they used the other end, a Sinhala speech recognition methodology word similarity to generate a Sinhala semantic lexicon. They for interactive voice response systems, which are accessed followed this up with a further study [172] which discussed through mobile phones was proposed by Manamperi et al. a number of other deep learning models such as RNN and [202]. Priyadarshani [203] proposes a method for speaker Bi-LSTM in the domain of Sinhala sentiment analysis. Jaya- dependant speech recognition based on their previous work suriya et al. [173] proposed a method to classify Sinhala on: dynamic time warping for recognizing isolated Sin- posts in the domain of sports into positive and negative hala words [204], genetic algorithms [205], and syllable class sentiments. Ranathunga and Liyanage [174] claimed segmentation method utilizing acoustic envelopes [206]. that using word embedding models as semantic features can The method proposed by Gunasekara and Meegama [207] compensate for the lack of well developed language-specific utilizes an HMM model for Sinhala speech-to-text. A Sin- linguistic or language resources in the case of analysing hala speech recognizer supporting bi-directional conver- sentiment of Sinhala news comments. sion between Unicode Sinhala and phonetic English was A model for detecting grammatical mistakes in Sinhala proposed by Punchimudiyanse and Meegama [208]. The was developed by Pabasara and Jayalal [175]. They followed work by Karunanayake et al. [209] transfer learns CNNs for this up with an grammatical error detection and correc- transcribing free-form Sinhala and Tamil speech data sets tion model [176]. Gunasekara et al. [177] used annotation for the purpose of classification. Dilshan [210] conducted projection for semantic role labeling for Sinhala. A cross- a study for the specific use case of transcribing number lingual document similarity measurement using the use- sequences in continuous Sinhala speech. The work by Di- case of Sinhala and English was developed by Isuranga et al. nushika et al. [211] uses automatic speech recognition of [178]. Sinhala for speech command classification. Gamage et al. [212] explored the use of combinational acoustic models 3.10 Phonological Tools such as Deep Neural Network - Hidden Markov Model On the case of phonological layer, a report on Sinhala pho- (DNN-HMM) [213] and Subspace Gaussian Mixture Model netics and phonology was published by Wasala and Gamage (SGMM) [213] in Sinhala speech recognition. Karunathilaka [179]. Wickramasinghe et al. [180] discussed the practical et al. [214] explore Sinhala speech recognition using deep issues in developing Sinhala Text-to-Speech and Speech learning models such as: pre-trained DNN, DNN, TDNN, Recognition systems. Based on the earlier work by Weeras- TDNN+LSTM. inghe et al. [181], Wasala et al. [182] have developed meth- The Sinhala speech classification system proposed ods for Sinhala grapheme-to-phoneme conversion along by Buddhika et al. [215] does so without converting the with a set of rules for schwa epenthesis. This work was then speech-to-text. However, they report that this approach extended by Nadungodage et al. [183]. Weerasinghe et al. only works for specific domains with well-defined limited [184] developed a Sinhala text-to-speech system. However, vocabularies. Kavmini et al. [216] presented a Sinhala speech it is not publicly accessible. They internally extended it to command classification system which can be used for down- create a system capable of helping a mute person achieve stream applications. synthesized real-time interactive voice communication in Sinhala [185]. A rule based approach for automatic segmen- 3.11 Optical Character Recognition Applications tation of a small set of Sinhala text into syllables was pro- posed by Kumara et al. [186]. An ew prosodic phrasing method While it is not necessarily a component of the NLP stack to help with Sinhala Text-to-Speech process was proposed shown in Fig 1, which follows the definition by Liddy [24], by Bandara et al. [187, 188, 189]. Sodimana et al. [190] it is possible to swap out the bottom most phonological layer proposed a text normalization methodology for Sinhala text- of the stack in favour of an Optical Character Recognition to-speech systems. Further, Sodimana et al. [191] formalized (OCR) and text rendering layer. a step-by-step process for building text-to-speech voices for An attempt for Sinhala OCR system has been taken Sinhala. Both Jayamanna [192] and Mishangi [193] have by Rajapakse et al. [217] before any other work has been done on the topic. Much later, a linear symmetry based 33. http://bit.ly/2QKI9Np approach was proposed by Premaratne and Bigun [218, 219].
11 They then used hidden Markov model-based optimization has been proposed by Dharmapala et al. [248]. The study on the recognized Sinhala script [220]. Similarly, Hewavitha- by Walawage and Ranathunga [249] specifically focuses on rana et al. [221] used hidden Markov models for off-line segmentation of overlapping and touching Sinhala hand- Sinhala character recognition. Statistical approaches with written characters. Silva and Jayasundere [250] focused on histogram projections for Sinhala character recognition is recognizing character modifiers in Sinhala handwriting. proposed by Hewavitharana and Kodikara [222], by Ajward Summarizing on optically recognized old Sinhala text et al. [223], and by Madushanka et al. [224]. Karunanayaka for the purpose of archival search and preservation was et al. [225] also did off-line Sinhala character recognition explored by Rathnasena et al. [251]. The work of Peiris with a use case for postal city name recognition. A separate [252] also focused on OCR for ancient Sinhala inscriptions. group had attempted Sinhala OCR [226] mainly involving A neural network based method for recognizing ancient the nearest-neighbor method [227]. A study by Ediriweera Sinhala inscriptions was proposed by Karunarathne et al. [228] uses dictionaries to correct errors in Sinhala OCR. [253]. Chanda et al. [254] proposed a Gaussian kernel SVM An early attempt for Sinhala OCR by Dias et al. [229] has based method for word-wise Sinhala, Tamil, and English been extended to be online and made available to use via script identification. desktops [230] and hand-held devices [231] with the ability to recognize handwriting. A simple neural network based 3.12 Translators approach for Sinhala OCR was utilized by Rimas et al. [232]. A series of work has been done by a group towards English A fuzzy-based model for identifying printed Sinhala charac- to Sinhala translation as mentioned in some of the above ters was proposed by Gunarathna et al. [233]. Premachandra subsections. This work includes; building a morphological et al. [234] proposes a simple back-propagation artificial analyzer [59], lexicon databases [62], a transliteration sys- neural network with hand crafted features for Sinhala char- tem [63], an evaluation model [68], a computational model acter recognition. Another neural network with specialized of grammar [18], and a multi-agent solution [74]. After feature extraction for Sinhala character recognition was working on human-assisted machine translation [64], Het- proposed by Jayamaha and Naleer [235]. On the matter of tige and Karunananda [67, 69] have attempted to establish a neural networks and feature extraction, a feature selection theoretical basics for English to Sinhala machine translation. process for Sinhala OCR was proposed by Kumara and A very simplistic web based translator was proposed [65, Ragel [236]. Jayawickrama et al. [237] worked on Sinhala 66]. The same group have worked on a Sinhala ontology printed characters with special focus on handling diacritic generator for the purpose of machine translation [73] and vowels. However, they opted to refer to diacritic vowels a phrase level translator [75] based on the previous work as modifiers in their work. Gunawardhana and Ranathunga on a multi-agent system for translation [72]. Further, an [238] proposed a limited approach to recognize Sinhala application of the English to Sinhala translator on the use letters on Facebook images. A CNN-based methodology case of selected text for reading was implemented [70]. They to improve printed Sinhala character OCR was proposed later continued their work on multi agent English to Sinhala by Liyanage [239]. A meta-study on the effects of text translation with the AGR organizational model [76]. genre, image resolution, and algorithmic complexity needed Another group independently attempted English-to- for Sinhala OCR from books and newspapers was con- Sinhala machine translation [255] with a statistical ap- ducted by Anuradha et al. [240]. An OCR and Text-to- proach [256]. Wijerathna et al. [257] and De Silva et al. Speech system for Sinhala named Bhashitha was proposed [258] have proposed simple rule based translators. An by De Zoysa et al. [194]. Anuradha and Thelijjagoda [195] example-based method applied on the English-Sinhala sen- uses OCR on Sinhala and English Braille documents on tence aligned government domain corpus was proposed which they then run a machine translation system in order by Silva and Weerasinghe [259]. A translator based on a to convert them into voice. look-up system was proposed by Vidanaralage et al. [260]. Fernando et al. [241] claim to have created a database In a preprint, Joseph et al. [261] proposes an evolutionary for handwriting recognition research in Sinhala language algorithm for Sinhala to English translation with a basis of and further claims that the data set is available at National Point-wise Mutual Information (PMI) and claims that the Science Foundation (NSF) of Sri Lanka. However, the paper code will be shared once the paper is accepted. However, provides no URLs and we were not able to find the data they do not report any quantitative results to be compared set on the NSF website either. The work by Karunanayaka and the reported qualitative results are also superficial. Fer- et al. [242] is focused on noise reduction and skew correction nando et al. [262] tries to solve the Out of vocabulary (OOV) of Sinhala handwritten words. A genetic algorithm-based problem for Sinhala in the context of Sinhala-English-Tamil approach for non-cursive Sinhala handwritten script recog- statistical machine translation. Fonseka et al. [263] used Byte nition was proposed by Jayasekara and Udawatta [243]. Ni- Pair Encoding (BPE) for English to Sinhala neural machine laweera et al. [244] compare projection and wavelet-based translation. As an another solution to the OOV problem, techniques for recognizing handwritten Sinhala script. Silva an analysis of subword techniques to improve English to and Kariyawasam [245] worked on segmenting Sinhala Sinhala Neural Machine Translation (NMT) was conducted handwritten characters with special focus on handling dia- by Naranpanawa et al. [264]. A data augmentation method critic vowels. A comparative study of few available Sinhala to expand bilingual lexicon terms based on case-markers handwriting recognition methods was done by Silva et al. for the purpose of solving the OOV problem in the domain [246]. Silva et al. [247] uses contour tracing for isolated char- of NMT was proposed by Fernando et al. [265]. A Singlish acters in handwritten Sinhala text. A Sinhala handwriting to Sinhala converter which uses an LSTM was proposed OCR system which utilizes zone-based feature extraction by De Silva [266].
12 Most of the cross Sinhala and Tamil work has been done ten Sinhala [294] and animation of finger-spelled words and in the domain of machine translation. A neural machine number signs [295]. A simple Sinhala chat bot which utilizes translation for Sinhala and Tamil languages was initiated a small knowledge base has been proposed by Hettige and by Tennage et al. [267, 268]. Then they further enhanced Karunananda [61]. A study on the effect of word embed- it with transliteration and byte pair encoding [269] and dings on a Sinhala chat bot was conducted by Gamage used synthetic training data to handle the rare word prob- et al. [296] where they used, the fasttext model trained by lem [270]. This project produced Si-Ta [271] a machine Facebook [97–99], on a RASA36 chat bot. Dissanayake and translation system of Sinhala and Tamil official documents. Hettige [297] implemented a question and answer generator In the statistical machine translation front, Farhath et al. for Sinhala with the limited PoS: pronouns, adjectives, verbs, [272] worked on integrating bilingual lists. The attempts and adverbs. by Weerasinghe [273] and Sripirakas et al. [274] were also fo- Fernando [298] proposed a method for inexact matching cused on statistical machine translation while Jeyakaran and of Sinhala proper names. A study on determining canonical Weerasinghe [275] attempted a kernel regression method. word order of colloquial Sinhala sentences using priority A yet another attempt was made by Pushpananda et al. information was conducted by Kanduboda and Tamaoka [276] which they later extended with some quality im- [299] which they later extended [300–302]. Jayakody et al. provements [277]. An attempt on real-time direct translation [303] uses simple KNN and SVM methods on a PoS tagged between Sinhala and Tamil was done by Rajpirathap et al. Sinhala corpus to create a question-answering system which [278]. Dilshani et al. [279] have done a study on the linguistic they name Mahoshadha. Sandaruwan et al. [304] have at- divergence of Sinhala and Tamil languages in respect to tempted to identify abusive Sinhala comments in social machine translation. Mokanarangan [280] claims to have media using text mining and machine learning techniques. built a named entity translator between Sinhala and Tamil An extremely simple plagiarism detection tool which only for official government documents. But this work is locked uses n-grams of simply tokenized text was proposed by Bas- behind an institutional repository wall and thus is not ac- nayake et al. [305]. Another simple plagiarism detection tool cessible by other researchers. Arukgoda et al. [281] studied that uses synonymy and Hyponymy-Hypernymy (which the possibility of using deep learning techniques to improve they call Generalization in the paper) was attempted by Ra- Sinhala-Tamil translation. Pramodya et al. [282] compared jamanthri and Thelijjagoda [306]. Transformers, Recurrent Neural Networks, and Statistical A dataset consisting of Sinhala documents drawn from Machine Translation (SMT) in the context of Tamil to Sinhala Sri Lankan news websites was published by Jayawickrama machine translation. The work by Nissanka et al. [283] used et al. [102] along with the benchmark misinformation clas- monolingual word embedding to improve NMT between sification models. A trending topic detection model for Sinhala and Tamil. While not related to Tamil, there have Sinhala tweets using simple clustering and ranking algo- been attempts to link Sinhala NLP with Japanese by Herath rithms was proposed by Jayasekara and Ahangama [307]. A et al. [21, 77, 78], Thelijjagoda [79], and Kanduboda [8]. machine learning approach to detect hate speech in Sinhala There has been an attempt to use dictionary-based machine was proposed by De Silva [308]. The work by Liyanage translation [284] between Sinhala and the liturgical language and Ranathunga [309, 310] attempt to use LSTMs for math- of Buddhism, Pali [285–287]. The study by Anuradha and ematical word problem generation in Sinhala and other Thelijjagoda [195] uses machine translation on the unique languages. The problem of recognizing Sinhala and English application of converting Sinhala and English Braille docu- code-mixed data where the Sinhala text is written in Singlish ments, which they have run OCR on, into voice. was explored by Smith and Thayasivam [311] using an XGB classifier and a CRF model building on their previous 3.13 Miscellaneous Applications work [312], which analysed such data. Sandathara et al. In this section, we discuss NLP tools and research which [313] proposed a system which they named Arunalu that are either hard to categorize under above sections or are they claimed to use Voice recognition, Natural Language reasonably involving multiples of them. The first miscel- Processing, Machine Learning, and Deep Learning concepts laneous application of Sinhala NLP is spell checking. The to help individuals with dyslexia overcome problems of open-source data driven approach proposed by Wasala et al. reading Sinhala. Rajitha et al. [314] used the data set that [288, 289] claims to be able to check and correct spelling they introduced in their previous work [88] to prove that errors in Sinhala. The approach by Jayalatharachchi et al. task specific supervised distance learning metrics outper- [290] attempts to obtain synergy between two algorithms form their unsupervised counterparts, for document align- for the same purpose. These efforts [288, 290] were then ment. extended by Subhagya et al. [291]. A rule-based Sinhala spell checker named SinSpell based on Hunspell34 was introduced 4 P RIMARY S OURCES by Liyanapathirana et al. [292]. They have also made the tool available35 online for use. The study by Sithamparanathan Even though the main objective of this survey is to cover and Uthayasanker [293] extended the Generic Environment NLP tools and research, we noticed that much of these NLP for context-aware spell correction to handle Sinhala and Tamil. tools and research depend on primary sources of Sinhala On the matter of Sinhala sign language, strides have language such as printed books in the role of knowledge been made in the domains of computer interpreting for writ- sources and ground truth. Therefore, for the benefit of other researchers who venture into Sinhala NLP, we decided to 34. http://hunspell.github.io/ 35. http://nlp-tools.uom.lk/sinspell/ 36. https://rasa.com/
You can also read