Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation
Kellie Webster and Emily Pitler
Google Research
{websterk|epitler}@google.com

arXiv:2006.08881v1 [cs.CL] 16 Jun 2020

Abstract

Machine translation systems with inadequate document understanding can make errors when translating dropped or neutral pronouns into languages with gendered pronouns (e.g., English). Predicting the underlying gender of these pronouns is difficult since it is not marked textually and must instead be inferred from coreferent mentions in the context. We propose a novel cross-lingual pivoting technique for automatically producing high-quality gender labels, and show that this data can be used to fine-tune a BERT classifier with 92% F1 for Spanish dropped feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for a non-fine-tuned BERT model. We augment a neural machine translation model with labels from our classifier to improve pronoun translation, while still having parallelizable translation models that translate a sentence at a time.

  Spanish: Britney Jean Spears... ∅ Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).

  Spanish→English: He gained fame during his childhood by participating in the television program The Mickey Mouse Club (1992).

Figure 1: Typical example of pronoun gender errors when translating from pronouns without textual gender marking (e.g., dropped or possessive pronouns in Spanish) to a language with gendered pronouns. (Source: https://es.wikipedia.org/wiki/Britney_Spears; translation retrieved from translate.google.com, 27 Feb 2019.)

1 Introduction

While neural modeling has solved many challenges in machine translation, a number remain, notably document-level consistency. Since standard architectures are sentence-based, they can misrepresent pronoun gender since cues for this are often extra-sentential (Läubli et al., 2018). For example, in Figure 1, there is a dropped subject and a neutral possessive pronoun in the Spanish, both of which are translated with masculine pronouns in English despite context indicating they refer to Britney Jean Spears. Recommendations for the use of discourse in translation are developed in Sim Smith (2017).

Such errors are worrying for their prevalence. An oracle study found that augmenting an NMT system with perfect pronoun information could give up to 6 BLEU points improvement (Stojanovski and Fraser, 2018). Further, the DiscoMT shared task on pronoun prediction tested multiple language pairs and found that the most difficult language pair by far was Spanish→English and that "[t]his result reflects the difficulty of translating null subjects" (Loáiciga et al., 2017).

We address document-level modeling for dropped and neutral personal pronouns in Spanish. We significantly improve the ability of a strong sentence-level translator to produce correctly gendered pronouns in three stages. First, we automatically collect high-quality pronoun gender-tagged examples at scale using a novel cross-lingual pivoting technique (Section 3). This technique easily scales to produce large amounts of data without explicit human labeling, while remaining largely language agnostic. This means our technique may also be easily scaled to produce data for languages other than Spanish; we present results on Spanish as a proof of concept.

Next, we show how to fine-tune BERT (Devlin et al., 2019), a state-of-the-art pretrained language model, to classify pronoun gender using our new dataset (Section 4). Finally, we add context tags from our BERT model into standard MT sentence input, yielding an 8.8% F1 improvement in feminine pronoun generation (Section 5).
2 Background

In this section, we overview particular nuances associated with pronoun understanding, resulting challenges for translation, and existing relevant datasets.

Dropped (Implicit) Pronouns   Spanish personal pronouns have less textual marking of gender than their English counterparts. Where they may be inferred from context, Spanish drops the subject pronouns él/ella (he/she). Several other languages such as Chinese, Japanese, Greek, and Italian also have dropped pronouns.

Possessive Pronouns   Additionally, Spanish possessive pronouns do not mark the gender of the possessor, with su(s) used where English uses his and her. In other Romance languages such as French, the possessive pronoun agrees with the gender of the possessed object, rather than the possessor as in English. These usages are summarized for Spanish in Table 1 and illustrated in Figure 1.

  Properties          Surface Form (En)   Surface Form (Es)
  Nominative, Masc.   he                  él or ∅
  Nominative, Fem.    she                 ella or ∅
  Possessive, Masc.   his                 su
  Possessive, Fem.    her                 su

Table 1: Masculine and feminine pronouns cannot be distinguished by their surface forms in Spanish in both the dropped and possessive case.

MT of Ambiguous Pronouns   Document-level consistency is an outstanding challenge for machine translation (Läubli et al., 2018). One reason is that standard techniques translate the sentences independently of one another, meaning that context outside a given sentence cannot be used to make translation decisions. In a language like Spanish, this means that critical cues for pronoun gender are simply not present at decoding time.

There have been a number of directions exploring how to introduce general contextual awareness into translation. Tiedemann and Scherrer (2017) investigated a simple technique of providing multiple sentences, concatenated together, to a translator and learning a model to translate the final sentence in the stream using the preceding one(s) for contextualization. Unfortunately, no substantial performance gain was seen. Next, the Transformer model (Vaswani et al., 2017) was extended by introducing a second, context decoder (Wang et al., 2017; Zhang et al., 2018). Complementary to this, Kuang et al. (2018) introduce two caches, one lexical and the other topical, to store information from context pertinent to translating future sentences.

Specific to the translation of implicit or dropped pronouns, Wang et al. (2018) show that modeling pronoun prediction jointly with translation improves quality, while Miculicich Werlen and Popescu-Belis (2017a) present a proof of concept where coreference links are available at decoding time. Most similar to our current work, Habash et al. (2019), Moryossef et al. (2019), and Vanmassenhove et al. (2018) improve gender fidelity in translations of first-person singular pronouns in languages with grammatical gender inflection in this position. They show promising results from either injecting explicit information about speaker gender, or predicting it with a classifier. We extend this line of work without requiring human labels to train our classifier.
Prior Work on Aligned Corpora   Translation of sentences has been a task at the Workshop on Machine Translation (WMT) since 2006. Over this period, more and larger datasets have been released for training statistical models. Typically, released data is given in sentence-pair form, meaning that for each source language sentence (e.g., Spanish), there is a target language sentence (e.g., English) which may be considered its true translation. For a discussion of the adequacy of this approach, see Guillou and Hardmeier (2018).

Translation of Spanish sentences to English was most recently contested at WMT'13 (Bojar et al., 2013) (http://www.statmt.org/wmt13/translation-task.html), which offered participants 14,980,513 sentence pairs from Europarl (http://www.statmt.org/europarl/) (Habash et al., 2017), Common Crawl (Smith et al., 2013), the United Nations corpus (https://cms.unov.org/UNCorpus/) (Ziemski et al., 2016), and news commentary. While large, this dataset wasn't designed for document-level understanding. Document boundaries are available, but correspond loosely to what a user of a translation system might consider a document: whole days of European parliament, for instance.

DiscoMT was introduced as a workshop in 2013 (http://aclweb.org/anthology/W13-3300) to target the translation of pronouns in document context. As well as Europarl, TED talk subtitles (Tiedemann, 2012) (http://opus.nlpl.eu/TED2013.php) were included for both training and evaluation. Also available for the translation of extended speech is OpenSubtitles (Lison and Tiedemann, 2016; Müller et al., 2018) (http://opus.nlpl.eu/OpenSubtitles.php), which contains utterance pairs which are aligned and contextualized according to their temporal appearance in a film. Although subtitle datasets make available very wide context, they are not ideal in this study due to their heavy use of first and second person pronouns and noisy alignment. High-quality targeted datasets for discourse phenomena were explored in Bawden et al. (2018); however, the released corpus was a small French→English test set. In this paper, we introduce a novel cross-lingual technique for gathering a large-scale targeted dataset without explicit human annotation.

3 Cross-Lingual Pivoting for Pronoun Gender Data Creation

Despite document-level context being critical for pronoun translation quality, little of the available aligned data is directly suitable for training models. In this section, we exploit the special properties of Wikipedia to automate the collection of high-quality gender-tagged pronouns in document context.

3.1 Cross-Lingual Pivoting

Wikipedia has a number of desirable properties which we exploit to automatically tag pronoun gender. Firstly, it is very large for head languages like English and Spanish.
More importantly, many non-English Wikipedia pages express similar content to their English counterpart, given the resource's style and formatting requirements, and the ease of using translation systems to generate base text for a missing non-English page.
Based on these properties, we develop the following three-stage extraction pipeline, illustrated in Table 2, with pipeline yield summarized in Table 3.

(A) Page Alignment
  Title: "Mitsuko Shiga"
  https://en.wikipedia.org/wiki/Mitsuko_Shiga
  https://es.wikipedia.org/wiki/Mitsuko_Shiga

(B) Sentence Alignment
  Spanish: Publicó numerosas antologías de su poesía durante su vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.
  Spanish→English: He published numerous anthologies of his poetry during his life, including Fuji no Mi, Asa Tsuki, Asa Ginu, and Kamakura Zakki.
  English: She published numerous anthologies of her poetry during her lifetime, including Fuji no Mi ("Wisteria Beans"), Asa Tsuki ("Morning Moon"), Asa Ginu ("Linen Silk"), and Kamakura Zakki ("Kamakura Miscellany").

(C) Pronoun Tagging
  ∅/She Publicó numerosas antologías de su/her poesía durante su/her vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.

Table 2: Example of how pronoun translations are extracted from comparable texts. The English and Spanish sentences were extracted from Wikipedia articles about the same subject, the sentences were aligned due to the similarities in their translations, and the labels were extracted by matching the Spanish dropped and ambiguous pronouns with the English pronouns.

Page Alignment   Identify as pairs pages with the same title in English and Spanish. We prioritize precision and require exact match.

Sentence Alignment   Identify sentence pairs in aligned pages which express nearly the same meaning. To do this, we compare English translations of sentences from the Spanish page (from an in-house tool) to sentences from the English page. We do bipartite matching over these English sentence pairs, selecting pairings with the smallest edit distance. We require that edit distance is maximally one half sentence length, that paired sentences share either a noun or verb, and that the mapping is at most one-to-one. Table 2, step (B) shows an aligned sentence pair.

Pronoun Tagging   Perform alignment over the tokens in detected sentence pairs (we use an implementation of Vogel et al. (1996)), identifying where English uses a gendered pronoun and Spanish has either a dropped or possessive pronoun. Copy the gender of the English pronoun as a label onto the ambiguous target pronoun: masculine for he and his, feminine for she and her. Step (C) in Table 2 shows the resulting pronoun tagging in the example.

                      Prodrop                        Possessive
  Stage               Articles    he        she      Articles    his       her
  Spanish Wikipedia   1,326,469   -         -        1,326,469   -         -
  Page Alignment      420,850     -         -        420,850     -         -
  Pronoun Tagging     45,210      118,911   41,524   52,559      266,886   111,411

Table 3: Extraction yield in each stage of the multi-lingual pivot extraction pipeline.

3.2 Final Dataset

Gender   We can see from Table 3 that approximately three times as many masculine examples are extracted as feminine examples. This is not surprising given the disparity in representation by gender found in Wikipedia (Wagner et al., 2016). For the final dataset, we down-sample masculine examples to achieve a 1:1 gender balance, yielding a final dataset with 79,240 prodrop and 187,224 possessive examples. We split the 79,240 prodrop examples into training, development, and test sets of size 72,120, 2,968, and 4,152, respectively; the 187,224 possessive examples are similarly split into sets of size 167,222, 8,862, and 11,140.

Quality   To assess the quality of our examples, we sent 993 examples for human rating. In-house bilingual speakers were shown a paragraph of Spanish text containing a labeled pronoun and asked whether, given the context, the pronoun referred to a masculine person, a feminine person, or whether it was unclear. Agreement between these human labels and those produced by our automatic pipeline is high: 84% of prodrop and 80% of possessive pronouns could be disambiguated with the provided context and, of those, 98% and 92% of automatically generated gender tags agreed with human judgment. Where the provided paragraph was not enough, it was typically the case that relevant context was in the directly preceding paragraph. Confusion matrices are given in Tables 4 and 5.

              Human
  Pipeline    he     she    unclear
  he          221    3      36
  she         7      179    43

Table 4: Agreement between automatic labels and human judgment for prodrop examples.

              Human
  Pipeline    his    her    unclear
  his         198    6      48
  her         25     179    55

Table 5: Agreement between automatic labels and human judgment for possessive pronoun examples.
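To make the alignment constraints concrete, the following is a minimal sketch of the sentence alignment and pronoun tagging stages under stated assumptions: shares_noun_or_verb and word_align are hypothetical stand-ins for the paper's in-house part-of-speech check and Vogel-style word aligner (with the convention that a dropped Spanish subject aligns to position None), and the greedy smallest-distance-first loop only approximates true bipartite matching.

```python
GENDER = {"he": "masc", "his": "masc", "she": "fem", "her": "fem"}

def edit_distance(a, b):
    """Token-level Levenshtein distance, single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tok_b in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,                 # deletion
                                     dp[j - 1] + 1,             # insertion
                                     prev + (tok_a != tok_b))   # substitution
    return dp[len(b)]

def align_sentences(mt_sentences, en_sentences, shares_noun_or_verb):
    """Match machine-translated Spanish sentences (token lists) to English
    sentences, applying the three constraints from Section 3.1."""
    candidates = []
    for i, mt in enumerate(mt_sentences):
        for j, en in enumerate(en_sentences):
            d = edit_distance(mt, en)
            # Edit distance at most one half sentence length, and the
            # pair must share at least one noun or verb.
            if d <= len(en) / 2 and shares_noun_or_verb(mt, en):
                candidates.append((d, i, j))
    used_mt, used_en, pairs = set(), set(), []
    for d, i, j in sorted(candidates):      # smallest edit distance first
        if i not in used_mt and j not in used_en:   # keep mapping one-to-one
            used_mt.add(i); used_en.add(j)
            pairs.append((i, j))
    return pairs

def tag_pronouns(es_tokens, en_tokens, word_align):
    """Copy gender from an aligned English pronoun onto the ambiguous
    Spanish position: su(s), or None for a dropped subject slot."""
    labels = {}
    for es_pos, en_pos in word_align(es_tokens, en_tokens):
        gender = GENDER.get(en_tokens[en_pos].lower())
        if gender and (es_pos is None or
                       es_tokens[es_pos].lower() in ("su", "sus")):
            labels[es_pos] = gender
    return labels
```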
4 Classifying Pronoun Gender

Language model (LM) pretraining (Peters et al., 2018; Devlin et al., 2019) is emerging as an essential technique for human-quality language understanding. However, LMs are limited to learning from what is explicitly expressed in text, which seems insufficient for Spanish given that gender is often not textually realized for pronouns. We show that BERT, a state-of-the-art pretrained LM (Devlin et al., 2019), already models some aspects of pronoun gender, achieving scores above a strong NMT baseline. We then show that fine-tuning this LM using our new data improves pronoun gender discrimination by over 20% absolute F1 for both genders.

4.1 Neural Machine Translation

We take as our NMT baseline a Transformer (Vaswani et al., 2017) with vocabulary size 32,000, trained for up to 250k steps. We set the input dimension to 512 and use 6 layers with 8 attention heads, hidden dimension 2048, dropout probabilities of 0.1, and learning rate 2.0. We train this model on WMT'13 Spanish→English data (http://www.statmt.org/wmt13/translation-task.html), preprocessed using the Moses toolkit (https://github.com/moses-smt/mosesdecoder). Rows designated Baseline in Table 6 are trained with the standard sentence-level formulation of the task and yield a best test set BLEU score (Papineni et al., 2002) of 34.02. Context MT is trained using the 2+1 strategy of Tiedemann and Scherrer (2017), where an extended context, in our case full sentences up to 128 subtokens, is used to predict the translation of the contextualized target sentence. (Given this very local context, we make the assumption that adjacent sentences in the pre-shuffled dataset are document-continuous. This assumption introduces noise at document boundaries, which occur much less frequently than document-continuation transitions.) This modeling yields 33.99 BLEU, which is not substantially different from Baseline MT, consistent with Tiedemann and Scherrer.
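The construction of these contextualized inputs is not specified beyond the 128-subtoken budget; below is a minimal sketch under the paper's document-continuity assumption. subtoken_len is an assumed helper that counts subtokens in a sentence.

```python
def make_context_examples(doc_src, doc_tgt, subtoken_len, max_len=128):
    """Pair each target-language sentence with its source sentence plus as
    many immediately preceding source sentences as fit within max_len
    subtokens; only the final sentence's translation is predicted."""
    examples = []
    for i, (src, tgt) in enumerate(zip(doc_src, doc_tgt)):
        context, total = [src], subtoken_len(src)
        for prev in reversed(doc_src[:i]):   # walk backwards through context
            total += subtoken_len(prev)
            if total > max_len:
                break
            context.insert(0, prev)
        examples.append((" ".join(context), tgt))
    return examples
```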
To evaluate how well these NMT systems model pronoun gender, we use them to translate the Spanish text from our new dataset into English. For each target Spanish pronoun, we use an implementation of Vogel et al. (1996) to find the corresponding English pronoun, and count the translation as correct if the gender tag on the Spanish agrees with the English pronoun gender. Table 6 shows the results when the input units are single sentences (Sentences) or full sentences up to 128 subtokens (Contexts). All context sentences precede the one containing the pronoun (no lookahead).

                             Masculine             Feminine
                             P      R      F1      P      R      F1
  Prodrop
  Baseline MT (Sentences)    56.2   81.3   66.5    91.3   18.6   30.9
  Baseline MT (Contexts)     62.4   60.5   61.4    93.0   25.1   39.5
  Context MT (Contexts)      65.1   65.3   65.2    88.9   35.8   51.0
  BERT (Sentences)           66.6   61.6   64.0    87.0   40.2   54.9
  BERT (Contexts)            81.6   58.5   68.1    92.6   58.5   71.7
  BERT (Contexts) + Data     90.4   95.2   92.7    95.0   89.8   92.3
  Possessives
  Baseline MT (Sentences)    58.2   77.8   66.6    86.7   30.4   45.0
  Baseline MT (Contexts)     63.6   63.4   63.5    87.2   35.0   50.0
  Context MT (Contexts)      64.2   67.6   65.9    86.4   38.0   52.8
  BERT (Contexts) + Data     87.4   91.3   89.3    90.9   86.8   88.8

Table 6: Intrinsic evaluation of pronoun gender classification over our new test sets.
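A sketch of the evaluation just described, disaggregating precision, recall, and F1 by gender; translate and aligned_pronoun are assumed hooks into the NMT system and the word aligner, and a pronoun with no aligned English counterpart counts against recall.

```python
from collections import Counter

def gender_prf(examples, translate, aligned_pronoun):
    """Per-gender precision/recall/F1 for pronoun translation."""
    counts = Counter()
    for es_text, pronoun_pos, gold in examples:     # gold in {"masc", "fem"}
        en_text = translate(es_text)
        # Map the aligned English pronoun to a gender, e.g. "he" -> "masc";
        # a missing alignment becomes "none", lowering recall not precision.
        pred = aligned_pronoun(es_text, pronoun_pos, en_text) or "none"
        counts[(gold, pred)] += 1
    scores = {}
    for g in ("masc", "fem"):
        tp = counts[(g, g)]
        fp = sum(v for (gold, pred), v in counts.items()
                 if pred == g and gold != g)
        fn = sum(v for (gold, pred), v in counts.items()
                 if gold == g and pred != g)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        scores[g] = (p, r, 2 * p * r / (p + r) if p + r else 0.0)
    return scores
```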
Performance is weak overall, but especially on feminine examples of prodrop. The tendency is for masculine pronouns to be overused: feminine pronouns are used by Baseline MT (Sentences) less than one time in five when they should be. To understand why this effect is so strong, we analyzed the development set of WMT'13 (Table 7). Specifically, we counted the number of times the English pronouns he and she occurred (first column). Not surprisingly, there was an over-representation of masculine pronouns; for each feminine example in the data, there were four masculine examples. Next, we looked in the aligned Spanish sentences for the corresponding pronouns él and ella, and counted a case of prodrop each time they were missing (second column). Even accounting for differences in raw frequency by gender, prodrop is more likely for masculine entities than for feminine entities. This is the first time this difference has been reported, and future work should control for the prior it induces.

         Count   Prodrop   Rate
  he     1,220   1,106     90.7%
  she    314     251       79.9%

Table 7: Even accounting for difference in the raw number of pronouns by gender, prodrop is more likely for masculine entities than feminine entities in WMT'13 (development).

Introducing context in the input to an MT model, whether trained on sentences or contexts, improves overall performance by enhancing feminine recall and masculine precision at the expense of masculine recall. One possible reason is that the added context includes the relevant cues for breaking out of the prior for masculine pronouns. The magnitude of change in F1 score is perhaps beyond expected given that Baseline MT and Context MT achieve similar BLEU scores. However, this finding builds on the arguments in Cífka and Bojar (2018), which show that BLEU score does not correlate perfectly with semantics.

4.2 Language Model Pretraining

We initialize our LM experiments from a BERT_BASE model (110M parameters; see Devlin et al. (2019)) trained over Wikipedia from 102 languages including Spanish. To assess how much this model knows about pronoun gender out of the box, we extend the masked LM approach from the original BERT work, inserting a mask token in each dropped subject position (see Figure 2 for an example). (This probing is applied to prodrop only; no analogous probing may be done for the neutral Spanish possessive pronouns without potentially semantics-breaking rewriting.) To generate gender predictions, we deduce masculine if él appears before ella in the list of mask fills and vice versa; we make no gender prediction where neither pronoun appears in the top ten mask fills. The results of this probing are shown in the BERT rows of Table 6, which use input feeds of one sentence (Sentences) and up to 128 subtokens (Contexts).

  Input: Britney Jean Spears (McComb, Misisipi; 2 de diciembre de 1981), conocida como Britney Spears, es una cantante, bailarina, compositora, modelo, actriz, diseñadora de modas y empresaria estadounidense. Comenzó a actuar desde niña, a través de papeles en producciones teatrales. *** adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).
  MaskLM Output: spears (0.61), britney (0.23), tambien (0.06), ella (0.03), rapidamente (0.01)...
  Prediction: Feminine

Figure 2: Example of how the BERT pretrained model (Devlin et al., 2019) is evaluated on gendered pronoun prediction. The MaskLM predicts a word for the wildcard position. The feminine ella is found near the top of the k-best list, so the prediction is Feminine.

Despite scanning the full top ten mask fills, we can see that the strength of BERT is its strong precision rather than boosts to recall. While not directly comparable to MT values, which are generated single-shot, we can see that BERT values are stronger, particularly for feminine examples, which are handled poorly by machine translation.
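A hedged re-creation of this probing with the public Hugging Face fill-mask pipeline; the paper used an internal 102-language BERT, so the public multilingual checkpoint below is only an approximation, and exact scores and tokenizations will differ.

```python
from transformers import pipeline

# Public stand-in for the paper's multilingual BERT pretrained model.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def prodrop_gender(text_with_mask, top_k=10):
    """Deduce gender from whichever of él/ella ranks first among the
    top-k fills for the dropped-subject [MASK] slot; abstain otherwise."""
    for fill in fill_mask(text_with_mask, top_k=top_k):
        token = fill["token_str"].strip().lower()
        if token == "él":
            return "masculine"
        if token == "ella":
            return "feminine"
    return None  # neither pronoun in the top ten: no prediction

print(prodrop_gender(
    "Comenzó a actuar desde niña, a través de papeles en producciones "
    "teatrales. [MASK] adquirió fama durante su niñez."))
```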
4.3 BERT Classification Task

We extend this understanding of pronoun gender using the fine-tuning recipe given in the original BERT paper. Specifically, we generate classification examples where the output is a gender tag (masculine or feminine) and the input text is up to 128 subtokens, including special tokens to mark the target pronoun position (as in Table 8). We train over both prodrop and possessive examples from the training portion of our new dataset.

Table 6 shows that this approach is extraordinarily powerful for modeling pronoun gender in our dataset. On all metrics and in both genders, we see leaps of 20 points and higher. There is minimal performance difference across the two genders; precision and recall do not need to trade off against one another. We note that the performance we see here is remarkably similar to human ratings of quality for the dataset. Our novel method, which allows implicit pronoun gender to be made explicit and amenable to text-level modeling, is very well suited to pronoun gender classification.
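A minimal sketch of the example construction; the <p>/</p> marker spellings are assumptions (the paper only says special tokens mark the target pronoun position), and subtokenize stands in for the BERT subtokenizer. Truncating from the left keeps the pronoun and its preceding context inside the 128-subtoken window.

```python
def build_example(tokens, pronoun_index, gender, subtokenize, max_len=128):
    """Mark the target pronoun position with (assumed) special tokens and
    keep the last max_len subtokens. For prodrop, tokens[pronoun_index]
    is the verb whose subject is dropped; for possessives, it is su/sus."""
    marked = (tokens[:pronoun_index] + ["<p>", tokens[pronoun_index], "</p>"]
              + tokens[pronoun_index + 1:])
    subtokens = subtokenize(" ".join(marked))
    return {"input": subtokens[-max_len:],   # context precedes the pronoun
            "label": 0 if gender == "masculine" else 1}
```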
4.4 Analysis

We use the LIME toolkit (https://github.com/marcotcr/lime) (Ribeiro et al., 2016) to understand how the fine-tuning recipe yields such a strong model of pronoun gender. LIME is built for interpretability, learning linear models which approximate heavy-weight architectures over features which are intuitive to humans. By analyzing the highly-weighted features in these approximations, we can learn what factors are important in the classification decisions of the heavy-weight models.

  (1) Comenzó a actuar desde niña, a través de papeles en producciones teatrales.
      [S/he started to act as a child.FEM, through roles in theatrical productions.]
      Adquirió fama durante su niñez.
      [S/he acquired fame in his/her childhood.]

  Before Fine-tuning             After Fine-tuning
  Class    Token     Weight      Class   Token     Weight
  *Masc.   durante   0.01        Fem.    niña      0.29
           traves    -0.01               Comenzo   -0.10
           actuar    -0.01               niñez     0.09
           Comenzo   0.01                papeles   -0.01

Table 8: LIME analysis for a representative Spanish text in which all third-person singular pronouns are either dropped or genderless. An English gloss, including grammatical gender marking, is given above.

Table 8 shows the top tokens in the LIME analysis of our gender classifier, both at the start and end of fine-tuning, for a representative Spanish text, shown with its English gloss. Before fine-tuning, the classification decision is incorrect, and importance is low and spread fairly evenly across all terms in context. After fine-tuning, a correct classification is made, primarily with reference to the niña token, which is the feminine form of child and refers to a young Britney Spears, the dropped subject of Adquirió. That is, just as the original Transformer model implicitly models coreference for translation (Voita et al., 2018; Webster et al., 2018), a fine-tuned BERT classifier appears to implicitly learn coreference to make gender decisions here.

5 Improving Machine Translation with Classification Tags

Given the strong intrinsic evaluation, we use the gender classification tags directly in a standard NMT architecture to introduce implicit contextual awareness while still keeping translation parallelizable across sentences. This follows precedent set in Johnson et al. (2017) and Zhou et al. (2019) for controlling translations via tag-based annotations. Our simple approach yields substantial improvements in pronoun translation quality.

Baseline MT + Gender Tags   We use the Transformer model specification above, but train with WMT'13 sentence pairs augmented with gender tags. For each Spanish sentence, we classify whether it contains a third-person singular dropped subject or possessive pronoun using heuristics over morphological analysis, part-of-speech tags, and syntactic dependency structure (Andor et al., 2016). (Dropped pronouns occur where a ROOT verb has no nsubj dependent and no noun to its left; possessive pronouns are tagged with a label ending in $.) For those sentences containing a target pronoun, we concatenate the gender tag predicted by BERT (Contexts) + Data after the sentence, following a special context token (placed at the end to preserve positional encoding). For example:

  (2) Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).
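A sketch of this augmentation; the <ctx> token and the tag spellings are assumptions (the paper does not give exact strings), and classify_gender stands in for the BERT (Contexts) + Data classifier. The 5% random-tag noise anticipates the robustness trick described next.

```python
import random

CONTEXT_TOKEN = "<ctx>"  # assumed spelling of the special context token

def augment_source(es_sentence, has_target_pronoun, classify_gender,
                   tag_noise=0.05):
    """Append the classifier's gender tag after a special context token at
    the end of the sentence, leaving the positional encoding of the
    sentence itself untouched."""
    if not has_target_pronoun:
        return es_sentence                   # no special token is added
    tag = classify_gender(es_sentence)       # "masculine" or "feminine"
    if random.random() < tag_noise:          # random tags for robustness
        tag = random.choice(["masculine", "feminine"])
    return f"{es_sentence} {CONTEXT_TOKEN} {tag}"
```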
For robustness, we introduce noise by randomly flipping whether the sentence contains a target pronoun 2% of the time (in which case the input contains no special token) and generating a random gender tag 5% of the time. We choose this strategy as a simple extension of WMT which is fit for our purposes. Specifically, we seek to show that our gender classification model captures implicit pronoun gender well, and that its understanding is useful for the translation of ambiguous pronouns.

Results   We evaluate translation performance using BLEU score, and pronoun generation quality using F1 score disaggregated by gender over gendered pronouns. We do not report APT score (https://github.com/idiap/APT) (Miculicich Werlen and Popescu-Belis, 2017b), since we target few pronouns and make conclusions at the level of gender, not pronoun.

                  Translation   Classifier           Masculine             Feminine
  Model           Input         Input       BLEU     P      R      F1     P      R      F1
  Baseline MT     Sentences     -           34.02    95.2   97.1   96.2   69.7   57.5   63.0
  + Gender Tags   Sentences     Contexts    34.12    96.8   96.2   96.5   70.0   73.7   71.8
  Context MT      Contexts      -           33.99    97.6   94.7   96.1   63.2   80.0   70.6

Table 9: BLEU score and pronoun accuracy by gender on the WMT'13 Spanish-to-English test set.

Table 9 shows that incorporating contextual awareness, either from concatenating full sentences or from gender tags in translation input, improves pronoun quality markedly. Masculine performance is uniformly strong, and our new model, Baseline MT + Gender Tags, gives the best performance on feminine pronouns, achieving an 8.8% F1 score improvement compared to Baseline MT. As for the contextual MT models analyzed in the previous section, the improvement comes from addressing the low recall over feminine pronouns in sentence-level translation (while retaining precision). We manually reviewed error cases and noted that the major class came from domain shift: where antecedents were close to pronouns in Wikipedia, there were more frequent cases in WMT of antecedents being outside the BERT context window, and these were misclassified. Future work could look at methods for learning pretrained language models from extended document context.

6 Conclusion

In this paper, we introduce a cross-lingual pivoting technique which generates large volumes of labeled data for pronoun classification without explicit human annotation. We demonstrate improvements on translating pronouns from Spanish to English using a BERT-based classifier fine-tuned on this data, which achieves F1 of 92% for feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for non-fine-tuned BERT. We show that our classifier can be incorporated into a standard sentence-level neural machine translation system to yield an 8.8% F1 score improvement in feminine pronoun generation.

Our data creation method is largely language-agnostic and may be extended in future work to new language pairs, or even different grammatical features. Further in this direction, we are excited by the prospect of zero-shot pronoun gender classification, particularly in low-resource languages.

Acknowledgements

We would like to thank Melvin Johnson, Romina Stella, and Macduff Hughes for assistance with machine translation work, as well as Mike Collins, Slav Petrov, and other members of the Google Research team for their feedback and suggestions on earlier versions of this manuscript.

References
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442-2452, Berlin, Germany. Association for Computational Linguistics.

Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304-1313, New Orleans, Louisiana. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1-44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Cífka and Ondřej Bojar. 2018. Are BLEU and meaning representation in opposition? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1362-1371, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Liane Guillou and Christian Hardmeier. 2018. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797-4802, Brussels, Belgium. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155-165, Florence, Italy. Association for Computational Linguistics.

Nizar Habash, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. 2017. A parallel corpus for evaluating machine translation between Arabic and European languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 235-241, Valencia, Spain. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339-351.

Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596-606, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791-4796.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16).

Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. 2017. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In The Third Workshop on Discourse in Machine Translation.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017a. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30-40, Valencia, Spain. Association for Computational Linguistics.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017b. Validation of an automatic metric for the accuracy of pronoun translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation (DiscoMT). Association for Computational Linguistics.

Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49-54, Florence, Italy. Association for Computational Linguistics.

Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 61-72, Brussels, Belgium. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144, San Francisco, CA, USA.

Karin Sim Smith. 2017. On integrating discourse in machine translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 110-121, Copenhagen, Denmark. Association for Computational Linguistics.

Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1374-1383, Sofia, Bulgaria. Association for Computational Linguistics.

Dario Stojanovski and Alexander Fraser. 2018. Coreference and coherence in neural machine translation: A study using oracle experiments. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 49-60.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82-92, Copenhagen, Denmark. Association for Computational Linguistics.

Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003-3008, Brussels, Belgium. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000-6010, Long Beach, California.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics. Association for Computational Linguistics.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264-1274, Melbourne, Australia. Association for Computational Linguistics.

Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science, 5.

Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826-2831, Copenhagen, Denmark. Association for Computational Linguistics.

Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2018. Learning to jointly translate and predict dropped pronouns with a shared reconstruction mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2997-3002, Brussels, Belgium. Association for Computational Linguistics.

Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. Transactions of the Association for Computational Linguistics.

Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the Transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533-542, Brussels, Belgium. Association for Computational Linguistics.

Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5276-5284, Hong Kong, China. Association for Computational Linguistics.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference.