Identifying main obstacles for statistical machine translation of morphologically rich South Slavic languages
Maja Popović (DFKI – Language Technology Lab, Berlin, Germany), maja.popovic@dfki.de
Mihael Arčan (Insight Centre for Data Analytics, National University of Galway, Ireland), mihael.arcan@insight-centre.org

Abstract

The best way to improve a statistical machine translation system is to identify concrete problems causing translation errors and address them. Many of these problems are related to the characteristics of the involved languages and the differences between them. This work explores the main obstacles for statistical machine translation systems involving two morphologically rich and under-resourced languages, namely Serbian and Slovenian. Systems are trained for translation from and into English and German using parallel texts from different domains, including both written and spoken language. It is shown that for all translation directions, structural properties concerning multi-noun collocations and exact phrase boundaries are the most difficult for the systems, followed by negation, prepositions and local word order differences. For translation into English and German, articles and pronouns are the most problematic, as well as the disambiguation of certain frequent functional words. For translation into Serbian and Slovenian, cases and verb inflections are the most difficult. In addition, local word order involving verbs is often incorrect, and verb parts are often missing, especially when translating from German.

1 Introduction

The statistical approach to machine translation (SMT), in particular phrase-based SMT, has become widely used in recent years: open source tools such as Moses (Koehn et al., 2007) have made it possible to build translation systems for any language pair, domain or text type within days. Despite the fact that for certain language pairs, e.g. English-French, high quality SMT systems have been developed, a large number of languages and language pairs have not been (systematically) investigated. The largest study of European languages in the digital age, the META-NET Language White Paper Series [1], showed in 2012 that only English has good machine translation support, followed by moderately supported French and Spanish. A few more languages are only fragmentarily supported (such as German), while the majority of languages are weakly supported. Many of those languages are also morphologically rich, which makes the SMT task more complex, especially if they are the target language.

A large part of the weakly supported languages consists of Slavic languages, to which both Serbian and Slovenian belong. Both languages are part of the South Slavic language branch, Slovenian [2] being the third official South Slavic language in the EU and Serbian [3] the official language of a candidate member state. For all these reasons, a systematic investigation of SMT systems involving these two languages, defining the most important errors and problems, can be very beneficial for further studies.

[1] http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison
[2] together with Croatian and Bulgarian
[3] together with Bosnian and Montenegrin

© 2015 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

In the last decade, several SMT systems have been built for various South Slavic languages and English, and for some systems certain morpho-syntactic transformations have been applied more
or less successfully. However, all these experiments are rather scattered, and no systematic or focused research has been carried out to define the actual translation errors and their causes.

This paper reports the results of an extensive error analysis for four language pairs: Serbian and Slovenian paired with English as well as with German, which is also a challenging language – complex (both as a source and as a target language) – and still not widely supported. SMT systems have been built for all translation directions using publicly available parallel texts originating from several different domains, including both written and spoken language.

2 Related work

The first publications dealing with SMT systems for Serbian-English (Popović et al., 2005) and Slovenian-English (Sepesy Maučec et al., 2006) report initial results using small bilingual corpora. Improvements for translation into English are also reported by reducing morpho-syntactic information in the morphologically rich source language. Using morpho-syntactic knowledge for the Slovenian-English language pair was shown to be useful for both translation directions in (Žganec Gros and Gruden, 2007). However, no analysis of the results has been carried out in terms of which problems were actually caused by the rich morphology and which of those were solved by the morphological preprocessing.

Through the transLectures project [4], the Slovenian-English language pair became available in the 2013 evaluation campaign of IWSLT [5]. They report the BLEU scores of TED talks translated by several systems; however, a deeper error analysis is missing (Cettolo et al., 2013).

Recent work in SMT also deals with the Croatian language, which is very closely related to Serbian. First results for Croatian-English are reported in (Ljubešić et al., 2010) on a small weather forecast corpus, and an SMT system for the tourist domain is presented in (Toral et al., 2014). Furthermore, SMT systems for both Serbian and Croatian are described in (Popović and Ljubešić, 2014); however, only translation errors caused by language mixing are analysed, not the problems related to the languages themselves.

Different SMT systems for subtitles were developed in the framework of the SUMAT project [6], including Serbian and Slovenian (Etchegoyhen et al., 2014). However, only the translations between the two have been carried out, as an example of closely related and highly inflected languages.

3 Language characteristics

Serbian (referred to as "sr") and Slovenian ("sl"), as Slavic languages, have quite free word order and are highly inflected. The inflectional morphology is very rich for all word classes. There are six distinct cases affecting not only common nouns, but also proper nouns as well as pronouns, adjectives and some numbers. Some nouns and adjectives have two distinct plural forms depending on the number (less than five or not). There are also three genders for nouns, pronouns, adjectives and some numbers, leading to differences between the cases and also between the verb participles for past tense and passive voice.

As for verbs, person and many tenses are expressed by the suffix, and, similarly to Spanish and Italian, the subject pronoun (e.g. I, we, it) is often omitted. In addition, the negation of three quite important verbs, "biti (sr/sl)" (to be), "imati (sr) / imeti (sl)" (to have) and "hteti (sr) / hoteti (sl)" (to want), is formed by adding the negative particle to the verb as a prefix. Furthermore, there are two verb aspects: many verbs have perfective and imperfective form(s) depending on the duration of the described action. These forms are either different although very similar, or are distinguished only by a prefix.

As for syntax, both languages have a quite free word order, and there are no articles, neither indefinite nor definite.

Although the two languages share a large degree of morpho-syntactic properties and mutual intelligibility, a speaker of one language might have difficulties with the other. The language differences are both lexical (including a number of false friends) and grammatical (such as local word order, verb mood and/or tense formation, question structure, the dual in Slovenian, and the usage of some cases).

4 SMT systems

In order to systematically explore SMT issues related to the targeted languages, five different domains were used in total. However, not all domains were used for all language pairs due to unavailability. It should be noted that, according to the META-NET White Papers, both languages have minimal support, with only fragmentary text and speech resources. For the Slovenian-English and Slovenian-German language pairs, four domains were investigated: the DGT translation memories provided by the JRC (Steinberger et al., 2012), Europarl (Koehn, 2005), the European Medicines Agency corpus (EMEA) in the pharmaceutical domain, as well as the OpenSubtitles [7] corpus. All corpora were downloaded from the OPUS web site [8] (Tiedemann, 2012). For the Serbian language, only two domains were available: the enhanced version of the SEtimes corpus [9] (Tyers and Alperen, 2010) containing "news and views from South-East Europe" for Serbian-English, and OpenSubtitles for the Serbian-English and Serbian-German language pairs. It should be noted that all the corpora contain written texts except OpenSubtitles, which contains transcriptions and translations of spoken language, thus being slightly peculiar for machine translation. On the other hand, it is the only corpus containing all language pairs of interest.

Table 1 shows the number of parallel sentences for each language pair and domain (a), as well as the average sentence length for each language and domain (b).

(a) number of sentences
# of Sentences     sl-en   sl-de   sr-en   sr-de
DGT                3.2M    3M      /       /
Europarl           600k    500k    /       /
EMEA               1M      1M      /       /
OpenSubtitles      1.8M    1.8M    1.8M    1.8M
SEtimes            /       /       200k    /

(b) average sentence length
Avg. Sent. Length  sl      sr      en      de
DGT                16.0    /       17.3    16.6
Europarl           23.4    /       27.0    25.4
EMEA               12.7    /       12.3    11.8
OpenSubtitles      7.7     7.6     9.2     8.9
SEtimes            /       22.4    23.8    /

Table 1: Corpora characteristics.

For each domain, a separate system has been trained and tuned on an unseen portion of in-domain data. Since the sentences in OpenSubtitles are significantly shorter than in the other texts, the tuning and test sets for this domain contain 3000 sentences, whereas all other sets contain 1000 sentences. Another remark regarding the OpenSubtitles corpus is that we trained our systems only on those sentence pairs which were available in English as well as in German, in order to have exactly the same conditions for all systems.

All systems have been trained using phrase-based Moses (Koehn et al., 2007), where the word alignments were built with GIZA++ (Och and Ney, 2003). The 5-gram language model was built with the SRILM toolkit (Stolcke, 2002).

5 Evaluation and error analysis

The evaluation has been carried out in three steps: first, the BLEU scores were calculated for each of the systems. Then, automatic error classification was applied in order to estimate the actual translation errors. After that, manual inspection of the language-related phenomena leading to particular errors was carried out, in order to define the most important issues which should be addressed for building better systems and/or developing better models.

5.1 BLEU scores

As a first evaluation step, the BLEU scores (Papineni et al., 2002) have been calculated for each of the translation outputs in order to get a rough idea about the performance for different domains and translation directions. The scores are presented in Table 2:

• the highest scores are obtained for translations into English;
• the scores for translations into German are similar to those for translations into Slovenian and Serbian;
• the scores for Serbian and Slovenian are better when translating from English than when translating from German;
• the best scores are obtained for DGT (which contains a large number of repetitions), followed by EMEA (which is a very specific domain); the worst scores are obtained for the spoken-language OpenSubtitles texts.

[4] https://www.translectures.eu/
[5] International Workshop on Spoken Language Translation, http://workshop2013.iwslt.org/
[6] http://www.sumat-project.eu
[7] http://www.opensubtitles.org/
[8] http://opus.lingfil.uu.se/
[9] http://nlp.ffzg.hr/resources/corpora/setimes/
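As an illustration of this first evaluation step, the core of the BLEU metric (clipped n-gram precision combined with a brevity penalty) can be sketched in a few lines of Python. This is a minimal single-reference sketch for illustration only, not the implementation used to produce the scores reported here; the example sentence is borrowed from the negation examples in Section 5.3.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as a Counter."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty.
    `hypotheses` and `references` are lists of token lists
    (one reference per segment, as in this evaluation setting)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngrams(hyp, n)
            ref_ngrams = ngrams(ref, n)
            # clip each hypothesis n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, ref_ngrams[g])
                                  for g, c in hyp_ngrams.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

hyp = ["the prosecution has done nothing so far".split()]
ref = ["the prosecution has done nothing so far".split()]
print(round(corpus_bleu(hyp, ref) * 100, 1))  # prints 100.0 for an identical hypothesis
```

A real evaluation would additionally apply tokenization to the system output and reference, and typically smooths zero n-gram counts when scoring individual sentences.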
Domain/Lang. pair  sl-en   sr-en   sl-de   sr-de   en-sl   de-sl   en-sr   de-sr
DGT                77.3    /       59.3    /       72.1    58.6    /       /
Europarl           58.9    /       33.8    /       56.0    36.5    /       /
EMEA               69.7    /       53.8    /       66.0    56.2    /       /
OpenSubtitles      38.4    33.2    21.5    22.4    26.2    19.6    22.8    18.4
SEtimes            /       43.8    /       /       /       /       35.8    /

Table 2: BLEU scores for all translation outputs.

In addition, all the BLEU scores are compared with those of Google Translate [10] on all test sets. All systems built in this work outperform the Google translation system, with absolute differences ranging from 1 to 10%, confirming that the languages are weakly supported for machine translation.

[10] http://translate.google.com, performed on 27th February 2015

5.2 Automatic error classification

Automatic error classification has been performed using Hjerson (Popović, 2011), and the error distributions are presented in Figure 1. For the sake of brevity and clarity, as well as to avoid redundancy, the error distributions are not presented for all translation outputs but have been averaged in the following way: since no important differences were observed either between domains (except that OpenSubtitles translations exhibit more inflectional errors than the others) or between Serbian and Slovenian (either as the source or as the target language), the errors are averaged over domains and over the two languages, denoted "x". Thus, four error distributions are shown: translation from and into English, and translation from and into German.

The following can be observed:

• translations into English are "the easiest", mostly due to the small number of morphological errors; however, the English translation outputs contain more word order errors than the Serbian and Slovenian ones;
• all error rates are higher in German translations than in English ones, and the mistranslation rate is especially high;
• German translation outputs have fewer morphological errors than Serbian and Slovenian translations from German; on the other hand, more reordering errors can be observed in German outputs;
• all error rates are higher in translations from German than from English, except inflections.

The results of the automatic error analysis are valuable and already indicate some promising directions for improving the systems, such as word order treatment and handling morphological generation. Nevertheless, more improvement could be obtained if more precise guidance were available about problems and obstacles related to the language properties and differences (apart from the general ones already partly investigated in related work).

5.3 Identifying language-related issues

Automatic error analysis has already shown that different language combinations exhibit different error distributions. This often relates to the linguistic characteristics of the involved languages as well as to the divergences between them. In order to explore those relations, manual inspection of about 200 sentences from each domain and language pair, annotated by Hjerson together with their corresponding source sentences, has been carried out.

As the result of this analysis, the following has been discovered:

• there is a number of frequent error patterns, i.e. obstacles (issues) for SMT systems;
• the nature and frequency of many issues depend on the language combination and translation direction;
• some translation errors depend on the domain and text type, mostly differing between written and spoken language;
• the issues concerning Slovenian and Serbian, both as source and as target languages, are almost
the same – there are only a few language-specific phenomena.

[Figure 1 (bar chart over the four averaged systems x-en, en-x, x-de, de-x): Error rates for five error classes: word form, word order, omission, addition and mistranslation; each error rate represents the percentage of a particular (word-level) error type normalised over the total number of words.]

The most frequent general issues [11], i.e. those relevant for all translation directions, are:

• noun collocations (written language)
Multi-word expressions consisting of a head noun and additional nouns and adjectives pose large problems for all translation directions, especially from and into English. Their formation is different in the other languages, and often the boundaries are not well detected, so that they are translated into unintelligible phrases.

    source     12th "Tirana Fall" art and music festival
    output*    12th "Tirana collection fall of art and a music festival

    reference  the ratings agency's first Emerging Europe Sensitivity Index (EESI)
    output     the first Index sensitivity Europe in the development of (EESI) this agency

• phrase boundaries and order
Phrase boundaries are not detected properly, so that group(s) of words are misplaced, often accompanied by morphological errors and mistranslations.

    reference  of which about a fifth is used for wheat production
    output     of which used to produce about one fifth of wheat

    reference  But why have I brought this thing here?
    output     This thing, but why am I here?

    reference  The US ambassador to Sofia said on Wednesday.
    output     Said on Wednesday, US ambassador to Sofia.

• negation
Due to the distinct negation forming, mainly concerning multiple negations in Serbian and Slovenian, negation structures are translated incorrectly.

    reference  the prosecution has done nothing so far
    source*    the prosecution has not done nothing so far
    output     the prosecution is not so far had done nothing

    source     Being a rector does not give someone the freedom
    reference* Being a rector does not give nobody the freedom
    output*    Being a rector does not give some freedom

• prepositions
Prepositions are mostly mistranslated, sometimes also omitted or added.

• word order
Local word permutations, mostly affecting verbs and the surrounding auxiliary verbs, pronouns, nouns and adverbs.

[11] Non-English parts of examples are literally translated into English and marked with *.

Some of the frequent issues are strongly dependent on the translation direction. For translation into English and German, the issues of interest are:

• articles
Due to the absence of articles in Slavic languages, a number of articles in the English and German translation outputs are missing, and a certain number are mistranslated or added.

• pronouns
The source languages are pro-drop; therefore a number of pronouns are missing in the English and German translation outputs.

• possessive pronoun "svoj"
This possessive pronoun can be used for all persons ("my", "your", "her", "his", "our", "their") and is difficult to disambiguate.

• verb tense
Due to the different usage of verb tenses in some cases, the wrong verb tense from the source language is preserved, or sometimes mistranslated. The problem is more prominent for translation into English.

• agreement (German target)
A number of case and gender agreements in the German output are incorrect.

• missing verbs (German target)
Verbs or verb parts are missing in the German output, especially when they are supposed to appear at the end of the clause.

• conjunction "i" (and) (Serbian source)
The main meaning of this conjunction is "and", but another meaning, "also, too, as well", is often used as well; however, it is usually translated as "and".

• adverb "lahko" (Slovenian source)
This word is used for Slovenian conditional phrases which correspond to English constructions with the modal verbs "can", "might", "shall", or adverbs such as "possibly"; the entire clause is often not translated correctly due to incorrect disambiguation.

For translation into Serbian and Slovenian, the most important obstacles are:

• case
An incorrect case is used, often in combination with incorrect singular/plural and/or gender agreement.

• verb inflection
The verb inflection does not correspond to the person and/or the tense; a number of past participles also have incorrect singular/plural and/or gender agreement.

• missing verbs
Verbs or verb parts are missing, especially in constructions using auxiliary and/or modal verbs. The problem is more frequent when translating from German.

• question structure (spoken language)
The question structure is incorrect; the problem is more frequent in Serbian, where additional particles ("li" and "da li") should be used but are often omitted.

• conjunction "a" (Serbian target)
This conjunction does not exist in the other languages; it can be translated as "and" or "but", and its exact meaning is something in between. Therefore it is difficult to disambiguate the corresponding source conjunction.
• "-ing" forms (English source)
The English present continuous tense does not exist in the other languages, and, in addition, it is often difficult to determine whether a word with the suffix "-ing" is a verb or a noun. Therefore words with the "-ing" suffix are difficult for machine translation.

• noun-verb disambiguation (English source)
Apart from the words ending with the suffix "-ing", there is a number of other English words which can be used both as a noun and as a verb, such as "offer", "search", etc.

• modal verb "sollen" (German source)
This German modal verb can have different meanings, such as "should", "might" and "will", which are often difficult to disambiguate.

It has to be noted that some linguistic phenomena known to be difficult are not listed – the reason is that their overall number of occurrences in the analysed texts is low, and therefore so is the number of related errors. For example, German compounds are well known for posing problems to natural language processing tasks, among them machine translation – however, in the given texts only a few errors related to compounds were observed, as well as a low total number of compounds. Another similar case is the verb aspect in Serbian and Slovenian – some related errors were detected, but their count, as well as the overall count of such verbs in the data, is very small.

Therefore the structure and nature of the texts for a concrete task should always be taken into account. For example, for improvements of spoken language translation, more effort should be put into question treatment than into noun collocations, whereas in technical texts the compound problem would probably be significant.

6 Conclusions and future work

In this work, we have examined several SMT systems involving two morphologically rich and under-resourced languages in order to identify the most important language-related issues which should be dealt with in order to build better systems and models. The BLEU scores are reported as a first evaluation step, followed by automatic error classification, which has captured interesting language-related patterns in the distributions of error types. The main part of the evaluation consisted of a (manual) analysis of errors taking into account the linguistic properties of both the target and the source language. This analysis has defined a set of general issues which cause errors for all translation directions, as well as sets of language-dependent issues. Although many of these issues are already known to be difficult, they can be addressed only with the precise identification of concrete examples.

The main general issues are shown to be structural properties concerning multi-noun collocations and exact phrase boundaries, followed by negation formation, wrong, missing or added prepositions, as well as local word order differences. For translation into English and German, article and pronoun omissions are the most problematic, as well as the disambiguation of certain frequent functional words. For translation into Serbian and Slovenian, cases and verb inflections are the most difficult to handle. In addition, other problems concerning verbs are frequent as well, such as local word order involving verbs and missing verb parts (which is especially difficult when translating from German).

In future work we plan to address some of the presented issues practically and analyse the effects. An important point concerning system improvement is that, although most of the described issues are common across domains, the exact nature of the texts required for the task at hand should always be kept in mind. Analysis of issues for domains and text types not covered by this paper should also be part of future work.

Acknowledgments

This publication has emanated from research supported by the QT21 project (European Union's Horizon 2020 research and innovation programme under grant number 645452) as well as by a research grant from Science Foundation Ireland (SFI) under grant number SFI/12/RC/2289. We are grateful to the anonymous reviewers for their valuable feedback.
References

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2013. Report on the 10th IWSLT evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, December.

Etchegoyhen, Thierry, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard Van Loenhout, Arantza Del Pozo, Mirjam Sepesy Maučec, Anja Turner, and Martin Volk. 2014. Machine Translation for Subtitling: A Large-Scale Evaluation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC14), Reykjavik, Iceland, May.

Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket, Thailand, September.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Stroudsburg, PA, USA.

Ljubešić, Nikola, Petra Bago, and Damir Boras. 2010. Statistical machine translation of Croatian weather forecast: How much data do we need? In Lužar-Stiffler, Vesna, Iva Jarec, and Zoran Bekić, editors, Proceedings of the ITI 2010 32nd International Conference on Information Technology Interfaces, pages 91–96, Zagreb. SRCE University Computing Centre.

Och, Franz Josef and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wie-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. pages 311–318, Philadelphia, PA, July.

Popović, Maja. 2011. Hjerson: An Open Source Tool for Automatic Error Classification of Machine Translation Output. The Prague Bulletin of Mathematical Linguistics, (96):59–68, October.

Popović, Maja and Nikola Ljubešić. 2014. Exploring cross-language statistical machine translation for closely related South Slavic languages. In Proceedings of the EMNLP14 Workshop on Language Technology for Closely Related Languages and Language Variants, pages 76–84, Doha, Qatar, October.

Popović, Maja, David Vilar, Hermann Ney, Slobodan Jovičić, and Zoran Šarić. 2005. Augmenting a Small Parallel Text with Morpho-syntactic Language Resources for Serbian–English Statistical Machine Translation. pages 41–48, Ann Arbor, MI, June.

Sepesy Maučec, Mirjam, Janez Brest, and Zdravko Kačič. 2006. Slovenian to English Machine Translation using Corpora of Different Sizes and Morpho-syntactic Information. In Proceedings of the 5th Language Technologies Conference, pages 222–225, Ljubljana, Slovenia, October.

Steinberger, Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2012. DGT-TM: A freely available Translation Memory in 22 languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC12), pages 454–459, Istanbul, Turkey, May.

Stolcke, Andreas. 2002. SRILM – an extensible language modeling toolkit. volume 2, pages 901–904, Denver, CO, September.

Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC12), pages 2214–2218, May.

Toral, Antonio, Raphael Rubino, Miquel Esplà-Gomis, Tommi Pirinen, Andy Way, and Gema Ramirez-Sanchez. 2014. Extrinsic Evaluation of Web-Crawlers in Machine Translation: a Case Study on Croatian–English for the Tourism Domain. In Proceedings of the 17th Conference of the European Association for Machine Translation (EAMT), pages 221–224, Dubrovnik, Croatia, June.

Tyers, Francis M. and Murat Alperen. 2010. South-East European Times: A parallel corpus of the Balkan languages. In Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pages 49–53, Valetta, Malta, May.

Žganec Gros, Jerneja and Stanislav Gruden. 2007. The voiceTRAN machine translation system. In Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH 07), pages 1521–1524, Antwerp, Belgium, August. ISCA.