Context-aware Decoder for Neural Machine Translation using a Target-side Document-Level Language Model
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Context-aware Decoder for Neural Machine Translation using a Target-side Document-Level Language Model Amane Sugiyama Naoki Yoshinaga The University of Tokyo∗ Institute of Industrial Science, sugi@tkl.iis.u-tokyo.ac.jp The University of Tokyo ynaga@iis.u-tokyo.ac.jp Abstract Kang et al., 2020; Zhang et al., 2020). Most of these models are end-to-end models that require Although many end-to-end context-aware neu- document-level parallel data with sentential align- ral machine translation models have been pro- ments for training. However, this data is available posed to incorporate inter-sentential contexts in translation, these models can be trained only in only a few domains (Sugiyama and Yoshinaga, in domains where parallel documents with sen- 2019). Researchers have therefore started to utilize tential alignments exist. We therefore present target-side monolingual data to construct auxiliary a simple method to perform context-aware models which help a sentence-level NMT model per- decoding with any pre-trained sentence-level form context-aware translation (Voita et al., 2019a; translation model by using a document-level Stahlberg et al., 2019; Yu et al., 2020). language model. Our context-aware decoder In this study, we propose a simple yet effective is built upon sentence-level parallel data and target-side document-level monolingual data. approach to context-aware NMT using two primi- From a theoretical viewpoint, our core contri- tive components, a sentence-level NMT model and bution is the novel representation of contex- a document-level language model (LM). We can tual information using point-wise mutual in- independently train the two components on com- formation between context and the current sen- mon sentence-level parallel data and document- tence. We demonstrate the effectiveness of our level monolingual data, respectively, without us- method on English to Russian translation, by ing document-level parallel data. Our approach evaluating with BLEU and contrastive tests for context-aware translation. thereby makes it possible to perform context-aware translation with any pre-trained sentence-level NMT 1 Introduction model, using a pre-trained document-level LM. To give a probabilistic foundation to this combi- Neural machine translation (NMT) has typically nation of two independent models, we exploit the been explored in sentence-level translation settings. probabilistic nature of NMT decoding. When gener- Such sentence-level NMT models inevitably suf- ating a sequence, a left-to-right decoder outputs a fer from ambiguities when a source sentence has categorical probability distribution over the vocabu- multiple plausible interpretations. Examples of lary at every time step. The decoder assigns higher such ambiguities include anaphora, ellipsis, and probabilities to the tokens that would be more suit- lexical coherence (Voita et al., 2019b); although able at that step. Therefore, when multiple valid resolving these ambiguities has only a minor im- translations are possible for the source sentence, pact on the translation performance measured by the decoder just gives a higher probability to the BLEU scores (Papineni et al., 2002), they are vital translation that is plausible without considering in smoothly reading the translated documents. contexts. We thus adjust the probability distribu- To address this issue, context-aware NMT models tions in a context-aware manner using a target-side which incorporate document-level information in document-level LM which models inter-sentential translation have recently been explored (Jean et al., dependencies in the target-side document. 2017; Wang et al., 2017; Tiedemann and Scherrer, We evaluate our methods on English to Rus- 2017; Maruf and Haffari, 2018; Voita et al., 2018; sian translations with the OpenSubtitles2018 cor- Bawden et al., 2018; Miculicich et al., 2018; Maruf pus (Lison et al., 2018) in terms of the BLEU et al., 2019; Voita et al., 2019b; Yu et al., 2020; scores and contrastive discourse test sets (Voita ∗ Currently at Mitsubishi UFJ Morgan Stanley Securities et al., 2019b). Experimental results confirm that 5781 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5781–5791 June 6–11, 2021. ©2021 Association for Computational Linguistics
our method achieved comparable performance with From Eq. 1 and Eq. 2, we obtain existing context-aware NMT models that require either document-level parallel data (Zhang et al., ŷ ≈ arg max log p(c(y) |y)p(y|x) y 2018; Sugiyama and Yoshinaga, 2019) or more p(c(y) , y) than one additional model (Voita et al., 2019a; Yu = arg max log p(y|x) y p(c(y) )p(y) et al., 2020) for capturing contexts in translation. The contributions of this paper are as follows: = arg max C - SCORE(y; x, c(y) ) y • We theoretically derived C - SCORE, a score to where qualify context-aware translation without (y) the need for document-level parallel data. C - SCORE(y; x, c ) = log p(y|x) + PMI(c(y) , y) • Two formulations with C - SCORE turn any (3) pre-trained sentence-level NMT model into p(c(y) , y) p(y|c(y) ) (y) a context-aware model, if it generates n-best PMI (c , y) = log = log p(c(y) )p(y) p(y) outputs or performs left-to-right decoding. (4) • A comparison between our approach and shal- PMI (c(y) , y)is the point-wise mutual information low fusion (Gulcehre et al., 2015) reveals that of c(y) and y which represents the degree of co- our approach reformulates shallow fusion occurrence of y and c(y) . Given x, y and c(y) , we while adding a probabilistic foundation. can evaluate the C - SCORE by computing the two 2 Context-aware Decoding using terms in Eq. 3 using a sentence-level NMT (S - NMT) Document-level Language Model and a document-level LM (D - LM), respectively. In this section, assuming a sentence-level encoder- Notations We first introduce some notation to decoder model (Bahdanau et al., 2015; Vaswani explain the computation in Eq. 3 and Eq. 4 using et al., 2017), we first derive context-aware score (C - (auto-regressive) neural sequence generation mod- SCORE for short), a context-aware objective func- els in NMT and LM. For a sequence s (|s| ≥ 0) tion of outputs to be maximized in decoding. We and token w, a neural sequence generation model then describe how to compute the C - SCORE using parameterized by θ can compute the log probability the decoder with a document-level language model that w follows s, which we denote by log pθ (w|s)): (D - LM) (§ 2.1). We finally detail how to perform pθ (s · w) context-aware decoding based on C - SCORE (§ 2.2). log pθ (w folows s) = log = log pθ (w|s) pθ (s) 2.1 C - SCORE : objective function for where “·” denotes sequence concatenation. Apply- context-aware NMT decoding ing this auto-regressively, for any sequence s(1) Let us consider the problem of finding a transla- (|s(1) | ≥ 0) and s(2) (|s(2) | ≥ 1), the probability tion y of a source sentence x in a document. The that s(2) follows s(1) is thereby computed as: target-side context sentence(s) preceding y, c(y) , are to be given by the past translations. We formu- log pθ (s(2) follows s(1) ) late context-aware translation conditioned on c(y) |s(2) | (2) (2) X (2) (1) as the maximization of the conditional probability = log pθ (s |s )= log pθ (st |s(1) · s
PMI computed by document-level LM To com- fluent as a sequence. This behavior is unsuitable pute the components of PMI(c(y) , y), p(y) and for y,2 since “’s father” is not a complete sentence. p(y|c(y) ), we use a document-level language model (D - LM) which can handle long text spans 2.2 Searching for the optimal solution containing multiple sentences. Searching for the optimal output y that maximizes We generate training examples for D - LM from a the C - SCORE is not trivial since there are O(V T ) document as follows. We assume D - LM explicitly candidate sequences where V is the vocabulary models sentence boundaries. We first insert the size and T is the maximum length of sequences special token into every sentence boundary to be searched. We investigate two approaches to including the start and end of the document. With obtain approximate solutions: reranking (§ 2.2.1) this preprocessing, all the sentences start imme- and context-aware beam search (§ 2.2.2). diately after an token and end immediately 2.2.1 Reranking with C - SCORE before an token. We then sample text spans from the document using a sliding window, where We first generate B hypotheses of the translation the start and end of the span do not have to match HB = {y 1 , . . . , y B } with beam search of beam sentence boundaries. The sliding window’s size is size B using the sentence-level NMT model. We larger than the stride size, so adjacent spans may then choose the one that maximizes the C - SCORE. overlap. The resulting sequence is fed to the D - LM ŷ = arg max C - SCORE(y; x, c(y) ) (9) for training. Note that for D - LM indicates y∈HB sentence boundaries, in other words, both the start An issue with reranking is that we need to set B and end of the sequence. to a large value when the diversity of models’ out- Using D - LM, p(y) is computed by puts is limited (Yu et al., 2020), which increases the p(y) = pD - LM (ỹ|). (7) cost of decoding. We therefore attempt to integrate C - SCORE into the decoding with beam search. where ỹ = [y1 , . . . , yT , ]. 2.2.2 Context-aware beam search To compute p(y|c(y) ), we first obtain the context Context-aware beam search (C - AWARE beam) is sequence c̃(y) by concatenating all the sentences in beam search that is extended to work with C - c(y) with . We then compute the conditional SCORE. C - SCORE (Eq. 3) can be decomposed into probability p(y|c(y) ) by token-wise C - SCOREs (Eq. 5 through Eq. 8). p(y|c(y) ) = pD - LM (ỹ|c̃(y) ) (8) (y) C - SCORE(y; x, c ) = log p(y|x) + PMI(c(y) , y) where ỹ = [y1 , . . . , yT , ]. T X +1 Let us explain why we use the boundary-aware = C - SCOREw (ỹt |ỹ
be interpreted as PMI between the t-th token and where β is a hyper-parameter. In our notation with the context, that is, how consistent the t-th token the document-level LM, this is written as is with the context. Compared to the reranking approach, C - AWARE beam can be considered to log p(yt ) = log pS - NMT (ỹt |ỹ
Train Dev. Test then performs document-level beam search of beam src trg (mono) src trg (mono) src trg size B 0 on the lattice using the following score: # sentences 5.8M 30M 6.0k 23k 15.5k avg. # tokens 9.9 9.4 8.5 10.1 9.6 8.9 9.8 9.1 Score(yi ; y
Models para monolingual data Models deixis lex.c ell.infl ell.vp only 6M 15M 30M SentTransformer 50.0 45.9 53.2 27.0 SentTransformer (w/ BT) 32.36 32.32 32.40 32.40 w/ BT 50.0 45.9 51.6 26.8 Shallow Fusion n/a 32.39 32.56 32.52 baselines baselines Doc-Transformer 50.0 45.9 56.0 57.2 DocTransformer (w/ BT) 32.50 32.36 31.88 31.59 w/ BT 50.0 45.9 64.4 68.2 DocRepair n/a 32.13 32.36 32.35 DocRepair 89.1 75.8 82.2 67.2 Bayes DocReranker n/a 32.80∗ 33.58∗∗ 33.75∗∗ Bayes DocReranker 65.2 72.2 59.6 44.6 w/o context n/a 32.53 33.44∗∗ 33.67∗∗ proposed proposed C - SCORE 86.9 94.9 78.2 77.0 C - AWARE Rerank n/a 32.74∗ 33.01∗∗ 32.93∗ Cond. Shallow Fusion 54.7 55.3 53.4 32.4 C - AWARE Beam n/a 32.26 32.28 32.27 (y) Cond. Shallow Fusion n/a 32.38 32.55 32.55 D - LM PMI (c , y) 96.8 97.8 75.8 90.6 p(y|c(y) ) 89.7 95.7 77.4 81.6 Table 2: Test set BLEU scores. ‘*’ and ‘**’ indicate that gains from SentTransformer in the same column Table 3: Results on contrastive test sets. are statistically significant (p < 0.05 and p < 0.01) by bootstrap resampling with 1000 samples, respectively. sentation of the models’ outputs into words using SentencePiece. use a beam size of B = 20. For document-level The contrastive test set consists of contrastive beam search of Bayes DocReranker, we use a beam questions for context-aware NMT models to answer. size B 0 = 5. For beam search of SentTransformer, Each question has a source sentence x, a source DocTransformer, C - AWARE beam, and shallow fu- context c(x) , a target context c(y) , and translation sion, we use a beam size of B = 4. candidates Y = {y 1 , . . . , y M }. Models must an- swer with a candidate ŷ ∈ Y which would be the 3.3 Document-level Language models most appropriate translation of x, i.e. The architecture of the document-level LM is the ŷ = arg max p(y|x, c(x) , c(y) ) decoder part of a Transformer. The number of y∈Y decoder blocks is 12. The model size is 768 with 12 attention heads, and the inner layer of the feed- The test sets consist of 6000 examples in total. forward networks has 3072 units. We use position 4 Results and Analysis embeddings to represent position information. As described in § 2.1, when training the lan- 4.1 General translation performance guage models, a special control symbol is measured by BLEU scores inserted at every sentence boundary. Each training Table 2 lists the performance of the models in mini-batch contains text spans each of which is a terms of BLEU scores. Bayes DocReranker and randomly sampled fragment of a document with our C - AWARE Rerank consistently outperformed a maximum span length of W = 384. Text spans the baseline SentTransformer, even when it used are batched such that about 32,000 tokens are in a data augmentation by back-translation, while the training batch. other methods are just comparable to the baseline. Althogh Bayes DocReranker performed the best 3.4 Evaluation methods among all the models, the comparison to Bayes The existing automatic metrics are not adequate DocReranker without context information (using to evaluate gains from additional contexts (Baw- pS - LM (yi ) instead of pD - LM (yi |y
(a) PMI (b) PMI with rand. context (c) Cond. prob. (d) Cond. prob. (rand. context) Figure 1: Source-target correlation of contextual PMI (a, b) and conditional probability (c, d), calculated based on the correct context (a, c) and wrong context that is randomly chosen from the dataset (b, d). The dataset is a subset of the training data from the English-Russian parallel corpus. Plots are for 4166 sentence pairs in the dataset. are in bold, and additionally, the higher one of two passes. C - AWARE Rerank and Bayes DocRe- the two D - LM-based scores is shown in bold. The ranker were about 4.3 sents/sec and 7.7 sents/sec contrastive test include four test sets: deixis is for respectively. We expect that these models would be person deixis, lex.c is for lexical cohesion, ell.infl accelerated by using a language model with a better is for inflection of Russian nouns caused by ellipsis cache mechanism (e.g. TransformerXL (Dai et al., in the source sentence, and ell.vp is for verb ellipsis 2019)). C - AWARE Beam ran in about 13 sents/sec.6 in English text which is not allowed in Russian. We leave thorough analysis on speed/performance Although the contrastive test is targeted at context- trade-offs to future work. aware NMT models, it is possible to answer the contrastive questions by arg maxy PMI(c(y) , y) or 4.4 PMI correlation analysis arg maxy p(y|c(y) ). Scores obtained by these two In § 4.2 we have confirmed the effectiveness of PMI objectives are also reported in the table in addition as a measure of a valid translation given context to the scores obtained by SentTransformer. using contrastive tests. To gain a deeper insight Our C - SCORE outperforms all the context-aware into how well PMI conveys semantic connections models other than DocRepair. The performance between the current sentence and its context, we of C - SCORE is slightly worse than DocRepair analyze the correlation of PMI between source and for deixis (2.2 points) and ell.infl (4.0 points), target sentences. while achieving large improvements for lex.c (19.1 points) and ell.vp (9.8 points) over DocRepair. PMI correlation between source and target D - LM only objectives achieve higher scores than The main result we show in this section is that the C - SCORE, except for ell.infl. This is not surprising PMI of the source and target correlate well. This is because the choices in the tests are guaranteed to important because this supports the idea that PMI is be valid as translation for the source sentences if a language-independent measure of the connection given some appropriate context, so the questions between the current sentence and its context. can be solved without translation. This result still Although we have discussed only target-side indicates that the D - LM scores give good hints for PMI (c(y) , y) defined by Eq. 4, we can compute the tackling contextual ambiguities. The advantage source-side PMI(c(x) , x) in the same way. Given of C - SCORE over the SentTransformer is demon- a document-level parallel corpus, we measure a strated by the excellent performance of D - LM in correlation between PMI(c(x) , x) and PMI(c(y) , y) capturing contexts in translation. for each sentence pair (x, y) in the corpus. Figure 1a shows the PMI correlation for about 4.3 On translation efficiency 6 Note that the running time of NMT decoding also depends The inference speed depends mainly on the model on the degree of parallelism, and for C - AWARE Beam, de- coding multiple sentences in parallel is less trivial since it size and beam size. In our experiments on a sin- demands that all the previous sentences in the document are gle TITAN Xp GPU, SentTransformer decoded the translated by the time it starts to translate the current one. In fastest at 66 sents/sec, followed by DocTransformer our experiments, assuming a practical scenario where a large number of users input their documents for translation, we that ran in 40 sents/sec. DocRepair ran in about translate multiple documents in parallel so that multiple sen- 28 sents/sec, slightly slower because it decodes in tences from different documents can be translated in parallel. 5787
Figure 2: Correlation of contextual PMI between the source sentences (from the training data) and the outputs of some models (SentTransformer, C - AWARE beam without T -scaling, and C - AWARE beam with T -scaling of T = 4). 4000 sentence pairs taken from the dev data. The (Figure 2). We can find some outliers in the bottom pairs of PMI values are computed using English right area of the plot for C - AWARE beam without and Russian language models trained on the train- T -scaling, which is the cause of the low correla- ing data. We observe a clear correlation between tion coefficient R = 0.610 < Rsrc−ref = 0.695. source and target, which agrees with the intuition This result suggests that C - AWARE beam without that if the target sentence matches well in the con- T -scaling chooses some tokens based on exces- text, so does the source sentence. What is also sively high token-wise PMI, which breaks some obvious in Figure 1a is that most of the points lay translations resulting in the low BLEU. Translation in the first quadrant where both the source and of the SentTransformer shows a higher correlation target contextual PMI is greater than 0, which is with the source texts than the reference translation explained by the simple intuition that most sen- (Figure 1a). One possible explanation for this is tences should have positive co-occurrence relation alignment errors in the corpus: although worse than with their contexts. This behavior is lost when the reference translations in quality, outputs of Sent- computing the contextual PMI using an incorrect Transformer are considered to be perfectly aligned context c̃ randomly chosen in the dataset as shown to the source sentences. C - AWARE beam with T - in Figure 1b. scaling (T = 4) seems to solve this issue and The effectiveness of PMI as a measure of the achieves the highest PMI correlation R = 0.740. valid translation of the current sentence given con- text is further emphasized when compared to the 5 Related Work conditional probability p(y|c(y) ), which could be an alternative measure of how suitable y is in the The effectiveness of incorporating context into context as described in § 2.2.4. Figure 1c and 1d translation was shown in earlier literature on are the conditional probability version of Figure 1a document-level NMT (Tiedemann and Scherrer, and 1b: (p(x|c(x) ), p(y|c(y) )) for each sentence 2017; Bawden et al., 2018) using the single en- pair (x, y) in the same dataset are plotted in Fig- coder architecture. Multi-encoder architectures ure 1c and the same tuples but with random con- were explored to better capture contextual infor- texts are plotted in Figure 1d. Unlike the contextualmation (Wang et al., 2017; Tu et al., 2018; Jean PMI correlation, conditional probability correlation et al., 2017; Miculicich et al., 2018; Voita et al., remains high even when we give wrong contexts. 2018; Bawden et al., 2018; Maruf and Haffari, This is because the conditional probability of a 2018; Maruf et al., 2019; Kang et al., 2020; Zhang sentence is highly affected by how frequently the et al., 2020). However, since parallel data is often sentence is observed regardless of context; if the constructed by picking up reliable sentential align- source sentence is written with common expres- ments from comparable documents, document- sions, then so is the target sentence and they are level sentence-aligned parallel data for training likely to be observed regardless of the context. these document-level NMT models are expensive to obtain and available in only a few domains and Analysis of the model outputs language pairs (Sugiyama and Yoshinaga, 2019). PMI correlation gives us a good explanation of how Recent studies have therefore started to focus C - AWARE beam without T -scaling fails. We plot on modeling contexts using document-level mono- the PMI correlation between the source sentences lingual data. The current approaches are grouped and their translations obtained with NMT models into three categories: data augmentation via back- 5788
translation (Sugiyama and Yoshinaga, 2019), a translation; the comparable BLEU score of the no- post-editing model (Voita et al., 2019a), and mod- context version of the method (Table 2) and the re- eling document-level fluency via document-level sults of the contrastive tests (Table 3) reveal that the LM s (Stahlberg et al., 2019; Yu et al., 2020; Jean improvement is mostly due to the context-agnostic and Cho, 2020). In what follows, we review these language model prior and the backward translation approaches in detail. model. As we have discussed in § 2.2.4, document- Sugiyama and Yoshinaga (2019) reported that level LM scores prefer tokens which frequently ap- the data augmentation by back-translation (Sen- pear regardless of context and are unlikely to lead nrich et al., 2016) enhances a document-level NMT to better document-level translation. Moreover, model with a single encoder architecture in low- their method requires training a back-translation resource settings. However, we have obtained lim- model corresponding to the target sentence-level ited improvements in our settings (Table 2 and Ta- NMT model. ble 3). Moreover, this approach is expensive since Finally, we noticed that Jean and Cho (2020) it learns a document-level NMT model from a mas- (which appeared after the preprint version of this sive amount of pseudo parallel data. paper (Sugiyama and Yoshinaga, 2020)8 had been Voita et al. (2019a) proposed DocRepair, a submitted) have reached a formulation that is very context-aware post-editing model that corrects out- similar to the one presented in this paper by refor- puts of a sentence-level NMT model. Because mulating a noisy channel model of Bayes DocRe- DocRepair ignores the confidence of the first- ranker (Yu et al., 2020). Concrete differences be- stage sentence-level translation and possible alter- tween our work and theirs include the fact that we native translations, it can miscorrect outputs of the conducted thorough analysis on the performance of sentence-level NMT model when they are irregular different decoding strategies (not only beam search but correct. Moreover, when we change the tar- but also reranking). We also interpreted the sub- get sentence-level NMT model, the accompanying traction of LM scores as point-wise mutual informa- post-editing model must be trained from its outputs. tion and analyzed it by observing PMI correlation Our approaches, on the other hand, attempt a more between source and target PMI to deepen the under- “soft” revision, taking into account the output prob- standing of the formulation. abilities, i.e., confidence of the sentence-level NMT, and can perform context-aware decoding with any 6 Conclusions sentence-level NMT model, reusing a pre-trained document-level LM. We present an approach to context-aware NMT Stahlberg et al. (2019) and Yu et al. (2020) uti- based on PMI between the context and the cur- lize a document-level LM to model document-level rent sentence. We first provide the formulation of fluency of outputs; these approaches are similar the objective, C - SCORE, and the computation pro- to shallow fusion (Gulcehre et al., 2015)7 with cess of the C - SCORE using a sentence-level transla- document-level LM (§ 2.2.4), although they per- tion model and a document-level language model. form a document-level reranking of translation We investigate two search methods, reranking and hypotheses generated for individual source sen- beam search, and evaluate the methods for English- tences by using sentence-level NMT. In particular, Russian translation. We also provide some analysis Yu’s formulation has a probabilistic foundation like and visualization to better understand the nature of our approaches, and additionally utilizes a back- PMI between the context and the current sentence. ward translation model. Although their formulation We plan to design context-aware BLEU using brings a significant improvement in BLEU (Table 2), PMI for evaluating context-aware NMT models. We the score is not obtained by better document-level will evaluate our method on non-autoregressive NMT (Gu et al., 2017). We will release all code and 7 Our work is also related to shallow fusion (Gulcehre et al., data to promote the reproducibility of results.9 2015), in which token-wise probabilities output by an NMT model and a sentence-level LM are combined to be used as 8 translation scores in decoding. The theoretical background This preprint is submitted to and rejected from EMNLP of shallow fusion and our C - SCORE are different: in shallow 2020; the interested reader may refer to this paper for experi- fusion, the LM is intended to promote fluency of translations, ments on other language pairs such as English to French and whereas in our C - SCORE, we use the probability ratio of two English to Japanese translation. 9 LM probabilities which only provides contextual difference http://www.tkl.iis.u-tokyo.ac.jp/ and fluency is still left to the translation model. ~sugi/NAACL2021/ 5789
Acknowledgements Xiaomian Kang, Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2020. Dynamic context selection We thank anonymous reviewers for their valuable for document-level neural machine translation via re- comments. We also thank Joshua Tanner for proof- inforcement learning. In Proceedings of the 2020 reading this paper. We also thank Masato Neishi Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 2242–2254, On- for technical advice on implementations of neural line. machine translation. The research was supported by NII CRIS collaborative research program oper- Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? a ated by NII CRIS and LINE Corporation. case for document-level evaluation. In Proceed- ings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796, References Brussels, Belgium. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. gio. 2015. Neural machine translation by jointly 2018. OpenSubtitles2018: Statistical rescoring of learning to align and translate. In Proceedings of sentence alignments in large, noisy parallel corpora. the third International Conference on Learning Rep- In Proceedings of the Eleventh International Confer- resentations (ICLR). ence on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Re- Rachel Bawden, Rico Sennrich, Alexandra Birch, and sources Association (ELRA). Barry Haddow. 2018. Evaluating discourse phenom- ena in neural machine translation. In Proceedings of Sameen Maruf and Gholamreza Haffari. 2018. Docu- the 2018 Conference of the North American Chap- ment context neural machine translation with mem- ter of the Association for Computational Linguistics: ory networks. In Proceedings of the 56th Annual Human Language Technologies, Volume 1 (Long Pa- Meeting of the Association for Computational Lin- pers), pages 1304–1313, New Orleans, Louisiana. guistics (Volume 1: Long Papers), pages 1275–1284, Melbourne, Australia. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Car- bonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Sameen Maruf, André F. T. Martins, and Gholamreza Transformer-XL: Attentive language models beyond Haffari. 2019. Selective attention for context-aware a fixed-length context. In Proceedings of the 57th neural machine translation. In Proceedings of the Annual Meeting of the Association for Computa- 2019 Conference of the North American Chapter of tional Linguistics, pages 2978–2988, Florence, Italy. the Association for Computational Linguistics: Hu- man Language Technologies, Volume 1 (Long and Jiatao Gu, James Bradbury, Caiming Xiong, Vic- Short Papers), pages 3092–3102, Minneapolis, Min- tor O.K. Li, and Richard Socher. 2017. Non- nesota. autoregressive neural machine translation. In Pro- ceedings of the the fifth International Conference for Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, Learning Representations (ICLR). and James Henderson. 2018. Document-level neural machine translation with hierarchical attention net- Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun works. In Proceedings of the 2018 Conference on Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Empirical Methods in Natural Language Processing, Holger Schwenk, and Yoshua Bengio. 2015. On pages 2947–2954, Brussels, Belgium. using monolingual corpora in neural machine translation. Computing Research Repository, Mathias Müller, Annette Rios, Elena Voita, and Rico arXiv:1503.03535. Sennrich. 2018. A large-scale test set for the eval- uation of context-aware pronoun translation in neu- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Wein- ral machine translation. In Proceedings of the Third berger. 2017. On calibration of modern neural net- Conference on Machine Translation: Research Pa- works. In Proceedings of the 34th International pers, pages 61–72, Brussels, Belgium. Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages Kishore Papineni, Salim Roukos, Todd Ward, and Wei- 1321–1330. PMLR. Jing Zhu. 2002. Bleu: a method for automatic eval- uation of machine translation. In Proceedings of Sébastien Jean and Kyunghyun Cho. 2020. Log- the 40th Annual Meeting of the Association for Com- linear reformulation of the noisy channel model for putational Linguistics, pages 311–318, Philadelphia, document-level neural machine translation. In Pro- Pennsylvania, USA. ceedings of the Fourth Workshop on Structured Pre- diction for NLP, pages 95–101, Online. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation mod- Sebastien Jean, Stanislas Lauly, Orhan Firat, and els with monolingual data. In Proceedings of the Kyunghyun Cho. 2017. Does neural machine trans- 54th Annual Meeting of the Association for Compu- lation benefit from larger context? arXiv preprint tational Linguistics (Volume 1: Long Papers), pages arXiv:1704.05135. 86–96, Berlin, Germany. 5790
Felix Stahlberg, Danielle Saunders, Adrià Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang de Gispert, and Bill Byrne. 2019. Ling, Lingpeng Kong, Phil Blunsom, and Chris CUED@WMT19:EWC&LMs. In Proceedings Dyer. 2020. Better document-level machine trans- of the Fourth Conference on Machine Translation lation with Bayes’ rule. Transactions of the Associ- (Volume 2: Shared Task Papers, Day 1), pages ation for Computational Linguistics, 8:346–360. 364–373, Florence, Italy. Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Amane Sugiyama and Naoki Yoshinaga. 2019. Data Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. augmentation using back-translation for context- Improving the transformer translation model with aware neural machine translation. In Proceedings document-level context. In Proceedings of the 2018 of the Fourth Workshop on Discourse in Machine Conference on Empirical Methods in Natural Lan- Translation (DiscoMT 2019), pages 35–44, Hong guage Processing, pages 533–542, Brussels, Bel- Kong, China. gium. Amane Sugiyama and Naoki Yoshinaga. 2020. Pei Zhang, Boxing Chen, Niyu Ge, and Kai Fan. 2020. Context-aware decoder for neural machine transla- Long-short term masking transformer: A simple tion using a target-side document-level lan- but effective baseline for document-level neural ma- guage model. Computing Research Repository, chine translation. In Proceedings of the 2020 Con- arXiv:2010.12827. ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1081–1087, Online. Jörg Tiedemann and Yves Scherrer. 2017. Neural ma- chine translation with extended context. In Proceed- ings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems, volume 30, page 6000–6010. Cur- ran Associates, Inc. Elena Voita, Rico Sennrich, and Ivan Titov. 2019a. Context-aware monolingual repair for neural ma- chine translation. In Proceedings of the 2019 Con- ference on Empirical Methods in Natural Language Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP- IJCNLP), pages 877–886, Hong Kong, China. Elena Voita, Rico Sennrich, and Ivan Titov. 2019b. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1198–1212, Flo- rence, Italy. Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine trans- lation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Lan- guage Processing, pages 2826–2831, Copenhagen, Denmark. 5791
You can also read