A Neural Machine Translation Approach for Translating Malay Parliament Hansard to English Text
IALP 2020, Kuala Lumpur, Dec 4-6, 2020

Yu-Zane Low, School of Information Technology, Monash University Malaysia, ylow0012@student.monash.edu
Lay-Ki Soon, School of Information Technology, Monash University Malaysia, soon.layki@monash.edu
Shageenderan Sapai, School of Information Technology, Monash University Malaysia, ssap0002@student.monash.edu

Abstract— Parliament Hansard is one of the most precious texts made available to the public. In Malaysia, the parliament Hansard records the debates and discussions in the Malay language. Topic modelling, sentiment analysis, relation extraction, trend prediction and temporal analyses are frequently applied to parliament Hansard to discover interesting patterns. However, most of the mature tools for such processing tasks work on English text. As such, before the Malaysian parliament Hansard can be further processed, it is essential to translate the Malay text into English. Several machine translation approaches are surveyed in this paper. From the literature review, neural machine translation, particularly the Transformer model, has been proven to provide promising results in translating different languages. In this paper, we present our implementation of neural machine translation for Malay to English text. The experiments show that with a good parallel corpus and minimal fine-tuning, neural MT can achieve a BLEU score as high as 35.42.

Keywords—Machine translation, neural machine translation, parliament hansard

I. INTRODUCTION

Machine translation (MT) is a sub-field of computational linguistics that involves the use of computers to translate text or speech from one natural language to another [16] and is said to be as old as the modern digital computer [6]. The Globalization and Localization Association (GALA), a global, non-profit trade organization for the translation and localization industry, states that there are three approaches to MT: rule-based systems (RbMT), statistical systems (SMT) and neural MT (NMT). In this study, we examine the feasibility, efficiency and accuracy of the three approaches and determine which is most suitable to be implemented for this project.

This paper is organised as follows: Section 2 details our literature review in machine translation. Section 3 outlines the adopted Transformer model for our Malay to English translation. We then present the experimental setup in Section 4. Section 5 discusses the experimental results. Lastly, the paper is concluded in Section 6.

II. LITERATURE REVIEW

Rule-based Machine Translation (RbMT) is one of the earliest translation techniques [1]. According to Aasha and Ganesh, RbMT exploits linguistic rules and language structure to perform translation into the destination language. The strength of RbMT systems is their ability to analyse text at both the syntactic and semantic levels. However, the disadvantage of this approach is that it requires a lot of linguistic knowledge, and it is not possible to enumerate all the rules of a language [7].

In 2015, Aasha and Ganesh [1] analysed English-to-Malayalam RbMT systems and found a system that "works up to 6-word simple sentences" made in 2009 and a system that achieved 53.63% accuracy made in 2011. In the conclusion of their study, they mentioned that their paramount objective was to push the limits of RbMT systems by including as many rules as possible to produce an error-free system, which is in line with what Charoenpornsawat and colleagues observed about the weaknesses of RbMT systems: the requirement of linguistic knowledge and the importance of covering all the rules of a language (the more rules, the better). Charoenpornsawat and colleagues carried out a case study on ParSit, an RbMT system using an interlingua-based approach, and discovered that ParSit generates many errors in "choosing incorrect meaning", at a rate of 81.74%. They further improved ParSit by adding the C4.5 and RIPPER rule learners [7]. The accuracy was boosted to above 70% with 2.5k training sentences. They concluded that their method is an adaptive model: it can be applied to other languages and does not require linguistic knowledge.

In 2007, Simard et al. [18] and Terumasa [20] were able to improve RbMT systems by combining RbMT with a statistical phrase-based post-editing system and a statistical post-editor respectively. Terumasa's system provided a statistically significant improvement in translation accuracy over RbMT at the 0.01 level in both open and closed tests. As for Simard et al., their RbMT plus statistical phrase-based post-editing system (SYSTRAN + PORTAGE) was able to outperform both the RbMT (SYSTRAN) and the statistical machine translation system (PORTAGE), especially in the Europarl domain. Note that PORTAGE is a statistical phrase-based MT system, but it is configured and trained for post-editing.

In 1949, Warren Weaver suggested that the problems of MT could be "attacked" with statistical methods, but the approach was quickly abandoned by researchers due to the lack of processing power and the scarcity of machine-readable texts [6]. That changed as time passed, and Brown et al. proposed a statistical approach to MT in 1990. Before NMT was introduced to the field, statistical phrase-based (PbMT) approaches were the long-standing dominant method in MT [4, 12]. Because of this dominance and popularity, phrase-based statistical machine translation approaches are the focus of this study, at the expense of the word-based approach.
The phrase-based statistical MT approach, or the alignment template approach, allows for general many-to-many relations between words [15]. According to Och and Ney (2004), the term phrase refers to a consecutive sequence of words in a text and is distinct from the usage of the term in a linguistic sense [15]. The advantages of SMT include the flexibility of the system, as it is not built specially for one language [16], and the fact that the translation is learned automatically from a bilingual training corpus [15]. By its nature, SMT requires large bilingual corpora for training MT models [21]; corpus creation can be expensive, especially for users with limited resources, and the results of SMT can be unexpected, as superficial fluency is deceiving [16].

In low-resource settings, i.e. with a small bilingual corpus, SMT is shown to have better results and performance than NMT models [12, 21]. Trieu et al. (2017) showed that with 800k parallel sentences for training, they were able to produce an NMT that had better BLEU scores (28.93 & 26.81) than their SMT (27.28 & 26.36). On the other hand, with 130k and 456k parallel sentences for training, the SMT was able to outperform the NMT by BLEU margins of +0.48 and +3.11 for 456k parallel sentences and +2.95 and +7.15 for 130k parallel sentences. Trieu et al. (2017) concluded that for Asian language pairs (Japanese-English, Indonesian-Vietnamese, and English-Vietnamese) SMT performs better than NMT, but NMT's performance grows faster with more resources.

In a study analysing SMT output for six different language-English translation pairs conducted by Wu et al. in 2011, a BLEU score of 50.84 was attained for English-to-Spanish translation, along with scores ranging from 31.44 to 48.63 for seven other translations [23]. For comparison, Wu et al. accomplished a BLEU score of 47.24, while Vaswani et al. could obtain a score of 41.8 with their NMT system based on self-attention, considered "the current state-of-the-art" of its time by Artetxe et al. [2], for English-to-French translation. However, Wu et al. did not specify the number of training sentences used to achieve their results, and the size of their corpus is reported only as "corpus size". It is important to note that Wu et al. trained their SMT on the medical domain, while Vaswani et al. trained their NMT on the WMT 2014 English-French dataset. This information is significant because Koehn & Knowles (2017) [10] showed that both SMT and NMT perform better when trained and tested on the medical domain (BLEU scores of 43.5 & 39.4 respectively) than on other domains. All in all, Wu et al.'s achievement is very impressive and shows that SMT is a solid option for MT, but its high BLEU score does not necessarily mean that SMT has better performance than Vaswani et al.'s NMT [22].

In recent years, NMT has emerged as the rising-star MT approach: it displays formidable performance on public benchmarks [5], has the potential to address the weaknesses of older MT systems [24], and is currently a commonly used method for machine translation [9]. NMT strives to build and train a single, large neural network that takes a sentence as input and outputs its translation, unlike the long-established PbMT, which is composed of many sub-components that are tuned separately [3]. The strengths of NMT include the avoidance of fragile design choices found in PbMT [24] and its compactness compared to PbMT; for example, the Lua OpenNMT system consists of 4K lines of code while the Moses SMT system has over 100K lines with language modelling included [9]. NMT has also been reported to outperform SMT in both automatic metrics (BLEU) and human evaluation [24]. In the paper Six Challenges for Neural Machine Translation by Koehn and Knowles in 2017 [10], the weaknesses of NMT include lower quality for out-of-domain translation, poor performance in low-resource settings, etc.

In a case study comparing the translation quality of NMT and PbMT by Bentivogli and others in 2016 [4], three state-of-the-art PbMT systems were pitted against an NMT system on the same data, and high-quality post-edits of the outputs were used to identify and measure post-editing effort and translation error types such as morphological errors, lexical errors and word order errors. The results showed that NMT was superior to PbMT in terms of all the error types that were measured. Their analysis also highlighted some weaknesses in NMT that can be improved. Based on our research, we found three approaches/architectures in NMT, namely RNN, Transformer and DNN. Nonetheless, we will not explore the DNN architecture in NMT, as it is still far from perfection [19] and there is not a lot of research utilizing DNN for MT.

Two state-of-the-art NMT systems were recently developed by Vaswani et al. in 2017 [22] and Wu et al. in 2016 [24]. Vaswani et al.'s architecture is a newer form of NMT as it does not rely on recurrent neural networks [9, 24] or deep neural networks [19]. They claimed that their Transformer architecture can be trained significantly faster than architectures based on recurrent or convolutional layers, and stated that they reached a new "state-of-the-art". As mentioned earlier, they were able to acquire a BLEU score of 41.8, while Wu et al.'s Google Neural Machine Translation (GNMT) achieved 41.16 on the English-to-French newstest2014 test, with the Transformer requiring only a fraction of the training cost [24]. Other than that, the Transformer model was able to outperform other state-of-the-art NMTs.

Even though Vaswani et al. proposed a superior system, GNMT remains a formidable force in the field of NMT. GNMT is implemented using Long Short-Term Memory (LSTM) RNNs. It was reported to achieve levels of accuracy on par with the average bilingual human translator on some test sets. Compared to previous PbMT systems, GNMT was able to deliver an approximately 60% reduction in translation errors on some language pairs.

III. TRANSFORMER MODEL FOR MALAY TO ENGLISH TRANSLATION

As we wanted the best MT system for performing the Malay to English translation, we opted to use Vaswani and colleagues' solution for NMT, namely the Transformer model [22]. Fortunately, OpenNMT, an open-source toolkit for NMT, supports training the Transformer model as proposed by Vaswani et al. in 2017. The Transformer model renounces recurrence and depends entirely on the attention mechanism to extract global dependencies between input and output. This contrasts with the popular approach of using recurrent models such as GNMT, which consists of two recurrent neural networks in its architecture. The other beauty of the Transformer model is its training cost: on the WMT 2014 English-to-German (4.5M sentence pairs) benchmark dataset, the Transformer model took only 3.5 days to train on 8 P100 GPUs. This factor sealed our decision.

Figures 1 and 2 outline the architecture of the model proposed by Vaswani et al. [22]. The left side and right side of Figure 1 are the encoder and decoder of the neural network. Given an input, the encoder maps the input sequence of symbol representations to a sequence of continuous representations. The continuous representations are then passed to the decoder, which produces a sequence of outputs one element at a time. The main aspect of the model is the attention function. As described in Vaswani et al.'s paper, the attention function takes a query and a set of key-value pairs and maps them to an output. The attention function in this architecture is applied in the "Scaled Dot-Product Attention" shown in Figure 2, where Q represents the Query; K, the Key; and V, the Value. The function is named so because the dot-product attention applied in the algorithm is used with a scaling factor of 1/√d_k, where d_k is the dimension of the keys. The model computes the attention function on a set of queries, keys and values that are packed together into the matrices Q, K and V simultaneously. The computation of the matrix of outputs can be described by the equation below:

Attention(Q, K, V) = SoftMax(QK^T / √d_k) V    (1)
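To make Eq. (1) concrete, the following is a minimal PyTorch sketch of scaled dot-product attention, written for illustration only; the tensor shapes, variable names and toy example are our own and are not taken from OpenNMT's implementation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Compute Eq. (1): SoftMax(QK^T / sqrt(d_k)) V.

    Q: (batch, n_queries, d_k), K: (batch, n_keys, d_k), V: (batch, n_keys, d_v)
    """
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, n_queries, n_keys)
    weights = torch.softmax(scores, dim=-1)   # attention weights sum to 1 over the keys
    return torch.matmul(weights, V)           # (batch, n_queries, d_v)

# Toy example: a batch of 2 "sentences", 5 tokens each, d_k = d_v = 64
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([2, 5, 64])
```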
Fig. 1. The Transformer model architecture [22].

Fig. 2. (left) Architecture of Scaled Dot-Product Attention. (right) Multi-Head Attention in the Transformer model comprises several attention layers running in parallel [22].

As illustrated in Figure 2, the Scaled Dot-Product Attention is replicated across h layers, or heads. Each head produces d_v-dimensional output values; these outputs are then concatenated (Concat) and projected (Linear) to produce the final output.

Due to the architecture and input format of the Transformer model available in the OpenNMT toolkit, the words in the source and target training data and in any input sentence need to be preprocessed and tokenized. Words with capital letters, or words followed by a punctuation mark without a space, might be treated as unique words, e.g. "So," versus "So ," (note the space between "So" and the comma). Thus, the overall program must preprocess the data that will be fed into the NMT.
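As a rough illustration of this preprocessing step (a simplified sketch under our own assumptions, not the exact tokenizer used for the experiments), punctuation can be split from words and casing normalised so that variants such as "So," and "So ," produce the same tokens:

```python
import re

def simple_tokenize(sentence: str) -> list:
    """Lowercase and separate punctuation so "So," and "So ," yield identical tokens."""
    sentence = sentence.lower()
    # Put spaces around every non-alphanumeric, non-space character, then split.
    sentence = re.sub(r"([^\w\s])", r" \1 ", sentence)
    return sentence.split()

print(simple_tokenize("So, dia pergi ke Parlimen."))
# ['so', ',', 'dia', 'pergi', 'ke', 'parlimen', '.']
```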
IV. EXPERIMENTS

A. Experimental Design

The specification of the system is detailed in Table I.

TABLE I. SYSTEM SPECIFICATIONS
Component     Specification
GPU           NVIDIA GeForce RTX 2060
CPU           AMD Ryzen 5 3600 6-Core Processor
RAM           16GB
Python        3.7.8
PyTorch       1.4.0
TorchVision   0.5.0

As the training was executed on the GPU, a high-end GPU is recommended for faster training. Due to the limitations of our system, we were forced to include the setting "-valid_batch_size 16" when calling the training function for the Transformer model. The default maximum validation batch size in OpenNMT-py is 32, and training consistently stopped abruptly because the GPU was unable to accommodate the NMT's needs.

Fig. 3. OpenNMT recommended settings for training the Transformer model.

To compensate for the relatively low amount of GPU memory, the maximum batch size used for training was also halved from the recommended value (4096 to 2048) that replicates the model produced in the Google setup. According to the OpenNMT website, the Transformer model is said to be very sensitive to hyperparameters; hence, the settings that we changed may very well affect our trained model. Apart from the hyperparameter changes made to adapt to the GPU, the other relevant parameters were not changed. Figure 3 shows the recommended settings for training the Transformer model.

B. Parallel Corpus

The parallel corpus is obtained from the following website (https://github.com/huseinzol05/Malay-Dataset). Under the Translation section, the author Zolkepli (2018) [25] amassed a large parallel corpus consisting of 3 million sentences under the "malay-english" directory. The nature of the dataset is not stated explicitly by the author, but based on observation it seems to be extracted from Wikipedia, news websites, etc. The website http://opus.nlpl.eu/ also provided us access to OpenSubtitles, a database containing more than 3 million subtitles in over 60 languages, including Malay [13]. We were able to obtain 2 million sentence pairs from OpenSubtitles. Before the model was trained, 40,000 sentences were extracted from Zolkepli's dataset to be used as validation data for the NMT, and 20% of the remaining sentences were partitioned as our test dataset to be used for BLEU evaluation once the training was complete. A total of 4,382,200 sentences was used to train the NMT.
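A minimal sketch of this kind of split is shown below; the function name, shuffling and random seed are our own assumptions for illustration, not the exact procedure used in the experiments.

```python
import random

def split_corpus(pairs, n_valid=40_000, test_fraction=0.20, seed=0):
    """Hold out a fixed validation set, then carve a test set out of the remainder."""
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    remaining = pairs[n_valid:]
    n_test = int(len(remaining) * test_fraction)
    test, train = remaining[:n_test], remaining[n_test:]
    return train, valid, test

# `pairs` would be a list of (malay_sentence, english_sentence) tuples
# loaded from Zolkepli's dataset and OpenSubtitles.
# train, valid, test = split_corpus(pairs)
```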
V. RESULTS AND DISCUSSION

After 5 days of training, the model was trained and ready to translate. The model can translate short sentences relatively well, such as "Aku pasti merasa sangat malu" in Figure 4, which roughly translates to "I will feel very shy", depending on the context.

Fig. 4. Resultant model translation of an example of a short sentence.

Its translation is arguably correct. The model also performed very well for some longer sentences, such as the one in Figure 5. However, without the additional parameters "-replace_unk" and "-report_align", and in the presence of "rarer" words, the translated sentence can be inaccurate or unsatisfactory. Without the "-replace_unk" parameter, some translated sentences contain the string "<unk>" to represent words for which the model is not sure of the target word. If "-report_align" is not set but "-replace_unk" is set, some words simply translate to ".". These occurrences are especially common with names such as "Ocasio-Cortez" and "Zane". Only with both parameters set can we obtain properly translated sentences.

Fig. 5. Resultant model translation of an example of a long sentence.

Despite that, "-report_align" is not a miracle cure. In one particular instance, the model failed to translate the term "Arab-Israeli War"; instead, it came out as "War War". This is most likely related to the functionality of "-report_align", as the word "Arab-Israeli" would have been a "." without the setting.

Apart from the occasional "bad" selection of target words for some names and terms, and incorrect translation of informal sentences, the model performed much better than we expected. As the Malay language contains words that share the same substring but have different prefixes, suffixes and meanings (e.g. guna, mengguna, menggunakan, etc.), we were worried that the MT model would be unable to accommodate the many variants of a word. The worry was for naught, as the model managed to capture most of the common variants of most words, such as "jelas", "guna" and "diskriminasi". Even complex sentences like "Cuma saya rasa kebimbangan, takut-takut istilah sebegini ia bercanggah dengan undang-undang ataupun tidak bercanggah dengan undang-undang." were properly translated into "I just feel worried, this term is either against the law or not in conflict with the law.". Whether or not the model simplified the sentence coincidentally, it returned a very satisfactory translation of such an unnecessarily complex sentence.

By running the BLEU evaluation on the test dataset, we were able to obtain a BLEU score of 35.42, which is very promising. With Zolkepli's publicly available dataset, anyone with a GPU with at least 6GB of RAM can achieve machine translation of this level. However, as we do not know the exact nature of the training dataset, the NMT could be challenged by domain mismatch. Domain mismatch is a known challenge in machine translation [10]: differences in meaning, expression and style could be misinterpreted by the NMT. In other words, we do not know the true and effective BLEU score of the NMT on Malaysian parliament Hansard. The lack, or even absence, of such a dataset/parallel corpus in this domain means that there is no way for us to measure the BLEU score of the NMT on a Malaysian parliament Hansard dataset.
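For reference, corpus-level BLEU of this kind can be computed with an off-the-shelf implementation such as sacrebleu; the snippet below is an illustrative sketch, and the file names are placeholders (the paper does not state which BLEU implementation was used).

```python
import sacrebleu  # pip install sacrebleu

# Placeholder file names: one sentence per line, aligned by line number.
with open("test.translated.en") as f:
    hypotheses = [line.strip() for line in f]
with open("test.reference.en") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects the hypotheses plus a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")  # corpus-level BLEU on the held-out test set
```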
For comparison, we trained another NMT using the Transformer model but using only the dataset from OpenSubtitles. Even though the OpenSubtitles dataset had 2M sentence pairs, the BLEU score it obtained is an abysmal 0.22 when tested on the same test dataset. This finding demonstrates the importance of the nature and content of the training dataset. Apart from that, the model was not practical for translating parliament Hansard, as it was only able to translate short sentences (5 to 7 words) with mediocre results at best. It consistently fails on longer sentences, producing meaningless target sentences which usually contain repetitions of a certain word. The extremely poor performance is expected, as the OpenSubtitles dataset does not contain many sentence pairs with more than 10 words.

In other words, Zolkepli's publicly available dataset is a game-changer for training Malay to English neural machine translation. The results are degrees apart. Even though there is the possibility of domain mismatch, the model trained with Zolkepli's dataset and OpenSubtitles is still capable of producing logical and proper translations, with some errors when facing the problems stated earlier in this section (rare words or informal sentences).

VI. CONCLUSION

A neural machine translation approach has been proposed to translate Malay Parliament Hansard to English text. With the publicly available NMT tool OpenNMT and Zolkepli's Malay dataset, we managed to train a Transformer model for neural machine translation as proposed by Vaswani et al. [22]. This paper shows that the tools for automatically translating Malay Parliament Hansard to English are readily available, and that their strength is both formidable and astonishing. Hopefully, the Transformer model can soon be improved further to adapt to more uncommon words, names and terms. For our future work, we intend to experiment with MOSES [11] and compare its effectiveness and efficiency against NMT.

REFERENCES
[1] Aasha, V. C., & Ganesh, A. (2015). Rule Based Machine Translation: English to Malayalam: A Survey. https://www.researchgate.net/publication/291947299_Rule_Based_Machine_Translation_English_t
[2] Artetxe, M., Labaka, G., & Agirre, E. (2018). Unsupervised Statistical Machine Translation. https://arxiv.org/pdf/1809.01272.pdf
[3] Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/pdf/1409.0473.pdf
[4] Bentivogli, L., Bisazza, A., Cettolo, M., & Federico, M. (2016). Neural versus Phrase-Based Machine Translation Quality: a Case Study. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 257–267. https://www.aclweb.org/anthology/D16-1025.pdf
[5] Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., Logacheva, V., Monz, C., Negri, M., Névéol, A., Neves, M., Popel, M., Post, M., Rubino, R., Scarton, C., Specia, L., Turchi, M., … Zampieri, M. (2016). Findings of the 2016 Conference on Machine Translation (WMT16). Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, 2, 131–198. http://statmt.org/wmt16/results.html
[6] Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), 79–85. https://www.aclweb.org/anthology/J90-2002.pdf
[7] Charoenpornsawat, P., Sornlertlamvanich, V., & Charoenporn, T. (2002). Improving Translation Quality of Rule-based Machine Translation. http://www.mt-archive.info/Coling-2002-Charoenpornsawat.pdf
[8] Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., & Chen, Y. (2008). Using Moses to Integrate Multiple Rule-Based Machine Translation Engines into a Hybrid System. http://www.statmt.org/moses/
[9] Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. https://arxiv.org/pdf/1701.02810.pdf
[10] Koehn, P., & Knowles, R. (2017). Six Challenges for Neural Machine Translation. https://arxiv.org/pdf/1706.03872.pdf
[11] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. http://www.statmt.org/moses/
[12] Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003 Main Papers, 48–54. https://www.aclweb.org/anthology/N03-1017.pdf
[13] Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
[14] Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation.
[15] Och, F. J., & Ney, H. (2004). The Alignment Template Approach to Statistical Machine Translation. https://dl.acm.org/doi/pdf/10.1162/0891201042544884
[16] Saini, S., & Sahula, V. (2015). A Survey of Machine Translation Techniques and Systems for Indian Languages. Proceedings - 2015 IEEE International Conference on Computational Intelligence and Communication Technology, CICT 2015, 676–681. https://doi.org/10.1109/CICT.2015.123
[17] Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Barone, A. V. M., Mokry, J., & Nădejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. https://arxiv.org/pdf/1703.04357.pdf
[18] Simard, M., Ueffing, N., Isabelle, P., & Kuhn, R. (2007). Rule-based Translation With Statistical Phrase-based Post-editing. https://dl.acm.org/doi/pdf/10.5555/1626355.1626383
[19] Singh, S. P., Kumar, A., Darbari, H., Singh, L., Rastogi, A., & Jain, S. (2017). Machine Translation Using Deep Learning: An Overview. 2017 International Conference on Computer, Communications and Electronics, COMPTELIX 2017, 162–167. https://doi.org/10.1109/COMPTELIX.2017.8003957
[20] Terumasa, E. (2007). Rule Based Machine Translation Combined with Statistical Post Editor for Japanese to English Patent Translation. http://www.ipdl.inpit.go.jp/homepg_e.ipdl
[21] Trieu, H.-L., Tran, D.-V., & Nguyen, L.-M. (2017). Investigating Phrase-Based and Neural-Based Machine Translation on Low-Resource Settings. 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 384–391. http://www.statmt.org/wmt16/
[22] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf
[23] Wu, C., Xia, F., Deleger, L., & Solti, I. (2011). Statistical Machine Translation for Biomedical Text: Are We There Yet? AMIA Annual Symposium Proceedings, 2011, 1290–1299.
[24] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. http://arxiv.org/abs/1609.08144
[25] Zolkepli, H. (2018). Malay-Dataset: We gather Bahasa Malaysia corpus! https://github.com/huseinzol05/Malay-Dataset