First the worst: Finding better gender translations during beam search
Danielle Saunders, Rosie Sallis and Bill Byrne
Department of Engineering, University of Cambridge, UK
{ds636, rs965, wjb31}@cam.ac.uk

arXiv:2104.07429v1 [cs.CL] 15 Apr 2021

Abstract

Neural machine translation inference procedures like beam search generate the most likely output under the model. This can exacerbate any demographic biases exhibited by the model. We focus on gender bias resulting from systematic errors in grammatical gender translation, which can lead to human referents being misrepresented or misgendered.

Most approaches to this problem adjust the training data or the model. By contrast, we experiment with simply adjusting the inference procedure. We experiment with reranking nbest lists using gender features obtained automatically from the source sentence, and with applying gender constraints while decoding to improve nbest list gender diversity. We find that a combination of these techniques allows large gains in WinoMT accuracy without requiring additional bilingual data or an additional NMT model.

1 Introduction

Neural language generation models optimized by likelihood have a tendency towards 'safe' word choice. This lack of output diversity has been noted in machine translation, both statistical (Gimpel et al., 2013) and neural (Vanmassenhove et al., 2019), as well as in other areas of language generation such as dialog modelling (Li et al., 2016) and question generation (Sultan et al., 2020). Model-generated language may be predictable, repetitive or stilted. More insidiously, generating the most likely output based on corpus statistics can amplify any existing biases in the corpus (Zhao et al., 2017). Potential harms arise when these biases around e.g. word choice or grammatical gender inflections reflect demographic or social biases (Sun et al., 2019). In the context of machine translation, the resulting gender mistranslations could involve implicit misgendering of a user or other human referent, or perpetuation of social stereotypes about the 'typical' gender of a referent in a given context.

The current dominant approaches to the problem involve model retraining (Vanmassenhove et al., 2018; Escudé Font and Costa-jussà, 2019; Stafanovičs et al., 2020) or tuning on targeted gendered bilingual data (Saunders and Byrne, 2020; Basta et al., 2020), all of which risk introducing new sources of bias (Shah et al., 2020).

By contrast, we seek to find more diverse and correct translations from existing models, biased as they may be. Recently, Roberts et al. (2020) have implicated beam search as a driver towards gender mistranslation and gender bias amplification; we aim to mitigate this by guiding NMT inference towards finding better gender translations.

Our contributions are as follows: first, we analyze the nbest lists of NMT models exhibiting gender bias to determine whether more gender-diverse hypotheses can be found in the beam, and conclude that they often cannot. We then generate new nbest lists subject to gendered inflection constraints, and show that in this case the nbest lists have greater gender diversity. We demonstrate that it is possible to extract translations with the correct gendered entities from the original model's beam.

Our approach involves no changes to the model or training data, and requires only monolingual resources for the source language (identifying gendered information in the source sentence) and the target language (gendered inflections).

1.1 Related work

Various approaches have been explored for mitigating gender bias in natural language processing generally, and machine translation specifically. Most involve some adjustment of the training data, either directly (Rudinger et al., 2018) or by controlling word embeddings (Bolukbasi et al., 2016). We view such approaches as complementary to our own investigation, which seeks a better approach to generating diverse translations during machine translation inference. The approaches most relevant to this work are gender control, and treatment of gender errors as a correction problem.

Controlling gender translation generally involves introducing external information into the model. Miculicich Werlen and Popescu-Belis (2017) integrate cross-sentence coreference information into hypothesis reranking to improve pronoun translation. Vanmassenhove et al. (2018) incorporate sentence-level 'speaker gender' tags into training data, while Moryossef et al. (2019) incorporate a sentence-level gender feature during inference. Stafanovičs et al. (2020) include factored annotations of target grammatical gender for all source words, while Saunders et al. (2020) incorporate inline source sentence gender tags for specific entities. As for much of this prior work, our focus is not on the discovery of the correct gender tag, but on possible applications for this information once it is obtained.

A separate line of work treats the presence of gender-related inconsistencies or inaccuracies as a search and correction problem. Roberts et al. (2020) find that a sampling search procedure results in less misgendering than beam search. Saunders and Byrne (2020) rescore translations with a fine-tuned model that has incorporated additional gender sensitivity, while constraining rescoring to a gender-inflection lattice. A related line of work less directly applicable to translation focuses on reinflecting monolingual sentences to reduce the likelihood of bias (Habash et al., 2019; Alhafni et al., 2020; Sun et al., 2021). An important distinction between our work and earlier reinflection approaches is that we only use a single model for each language pair, significantly reducing computation and complexity.

2 Finding consistent gender in the beam

There are two elements to our proposed approach. First, we produce an nbest list using our baseline model. We either use simple beam search or a two-pass approach where the second pass searches for differently-gendered versions of the best first-pass hypothesis. Next, once we have our set of translation hypotheses, we rerank them according to gender criteria and extract the translation which best fulfils our requirements.

2.1 Gender-constrained beam search

We follow the gendered-inflection lattice search scheme described in Saunders and Byrne (2020). To summarize briefly, a 'gender inflection lattice' is constructed for the target language, defining grammatically gendered reinflection options for words in a large vocabulary list. For example, in the Spanish lattice, the definite articles la and el can optionally be interchanged, as can the profession nouns médica and médico. Composing the inflection lattice with the original translation provides gendered vocabulary constraints that can be applied during a second-pass beam search, as illustrated in Figure 1. In practice the lattices are constructed using heuristics that could result in spurious arcs like 'médic', but these are expected to have low likelihood under a translation model.

[Figure 1: Toy illustration of gender constraints for first-pass translation 'el médico' or 'la médica'.]

Unlike the original authors, we do not rescore the translations with a separate gender-sensitive translation model. Instead, we use the baseline model to produce gendered variations of its own biased hypotheses. This removes the additional experimental complexity and risk of bias amplification resulting from tuning a model for effective gender translation, as well as the computational load of storing the separate model.

While we carry out two full inference passes for simplicity of implementation, we note that further efficiency improvements are possible. For example, the source sentence encoding could be saved and reused for the reinflection pass. In principle, a more complex beam search implementation could apply constraints during the first inference pass, negating the need for two passes. These efficiency gains would not be possible if using a separate NMT model to rescore or reinflect the translation.

2.2 Reranking gendered translations

We focus on the translation of sentences containing entities that are coreferent with gendered pronouns. Our method to rerank gendered translations uses information about the gender of entities in the source sentence. In the WinoMT challenge set, each sentence's primary entity is labelled; we refer to use of this information as 'oracle' reranking. Similar known-gender information could be defined by a system user.
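As a toy sketch of the inflection-lattice idea from Section 2.1: each word in a hypothesis either stays fixed or may be swapped for a gendered alternative, and the set of reinflections forms the constraint set for the second search pass. The swap table and example sentence below are illustrative assumptions, not the paper's actual lattice data, and exhaustive enumeration is only viable for toy cases; the real approach composes the lattice with the translation and lets beam search explore it.

```python
from itertools import product

# Hypothetical swap table standing in for the gender inflection lattice:
# each entry lists interchangeable grammatically gendered forms.
SWAPS = {
    "el": ("el", "la"), "la": ("el", "la"),
    "médico": ("médico", "médica"), "médica": ("médico", "médica"),
}

def gendered_variants(tokens):
    """Enumerate gendered reinflections of a tokenized hypothesis.

    Each token is either fixed (no SWAPS entry) or may be replaced by
    any of its gendered alternatives, mirroring the optional arcs of
    the inflection lattice."""
    options = [SWAPS.get(tok, (tok,)) for tok in tokens]
    return [" ".join(combo) for combo in product(*options)]

# Constraint set for a second-pass search over this first-pass output:
variants = gendered_variants("el médico llegó".split())
print(variants)
# → ['el médico llegó', 'el médica llegó', 'la médico llegó', 'la médica llegó']
```

In the actual scheme, spurious combinations such as 'la médico' are not filtered out explicitly; they simply receive low likelihood under the translation model during the constrained search.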
As an alternative, we can use an off-the-shelf coreference resolution tool for the source language designed for pronoun disambiguation challenges (e.g. Rahman and Ng (2012)).

Once the pronoun-coreferent source word is identified, we can find the aligned target language hypothesis word, similar to the approach in Stafanovičs et al. (2020). We determine the grammatical gender of the hypothesis by morphological analysis, either using hard-coded rules or off-the-shelf tools depending on language, as in Stanovsky et al. (2019). We proceed to rerank the list of hypotheses first by article gender agreement with the source pronoun, and then by log likelihood. If there are no gender-agreeing hypotheses, we default to log likelihood alone (code to be made available).

3 Experimental setup

3.1 Baseline models and data

We assess translation from English into German and Spanish. All models are Transformer 'base' models with 6 layers (Vaswani et al., 2017), trained using Tensor2Tensor with batch size 4K (Vaswani et al., 2018). The en-de model is trained on 17.6M sentences from WMT news tasks for 300K iterations (Barrault et al., 2020), and the en-es model on 10M sentences from the UN Corpus for 150K iterations (Ziemski et al., 2016).

We assess translation quality in terms of BLEU on a separate, untagged general test set. We report cased BLEU on the test sets from WMT18 (en-de), WMT13 (en-es) and IWSLT14 (en-he) using SacreBLEU (Post, 2018; signature BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+v.1.4.8).

3.2 Gender constrained search and reranking

At inference time we perform beam search with and without reranking and gender-inflected lattice constraints. We construct gender reinflection lattices for rescoring as proposed by Saunders and Byrne (2020) [3]. For reranking, we use fast_align (Dyer et al., 2013) to link the source sentence gender information to the target sentence. When analysing the gender of target language words, we use the same morphological analysis as for the corresponding languages in WinoMT evaluation.

When obtaining gender information automatically for reranking, we use the RoBERTa model (Liu et al., 2019) tuned for pronoun disambiguation on Winograd Schema Challenge data [4].

We assess gender translation on WinoMT (Stanovsky et al., 2019), the only gender challenge set we are currently aware of for both language pairs. Our metrics are overall accuracy (higher is better) and ∆G, a metric measuring the F1 score difference between male- and female-labelled sentences (closer to 0 is better).

4 Results and discussion

Beam | Gender search | en-de BLEU | en-es BLEU
 4   | no            | 42.7       | 27.5
 4   | yes           | 42.7       | 27.8
20   | no            | 42.3       | 27.3
20   | yes           | 42.7       | 27.8

Table 1: Cased BLEU on the generic test sets under different beam size and search procedures. Gender-constrained rescoring by the original model does not hurt BLEU, and can improve it.

4.1 Normal beam search has some gender diversity

We first evaluate beam search without any gender reranking. In Table 1 we assess BLEU for beam search with beam sizes 4 and 20, with and without gender constraints. Beam search with the larger beam of 20 and without gender constraints degrades BLEU, in line with the findings of Stahlberg and Byrne (2019). In Table 2 we see that WinoMT gender accuracy also degrades with a larger beam. However, our reranking procedure is able to find better gender translations for WinoMT sentences in the unconstrained beam. With beam size 4, WinoMT accuracy improves by 10% relative to the original scores, and ∆G improves by an even larger margin. Improvements are much greater when reranking even an unconstrained 20-best list, suggesting that standard beam search may exhibit gender diversity, if not necessarily in the highest likelihood hypotheses.

4.2 We can find better gender translations in the constrained beam

We next experiment with applying gender constraints to a second-pass beam search. We note that constrained beam search mitigates BLEU degradation (Table 1), a useful side-effect.
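As a sketch of how the ∆G metric described in Section 3.2 might be computed: take F1 over male-labelled sentences minus F1 over female-labelled sentences. This is a simplified reading that assumes binary male/female gender predictions per sentence; the WinoMT evaluation scripts handle further cases, and the counts below are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def delta_g(examples):
    """examples: (gold_gender, predicted_gender) pairs, 'male'/'female'.

    Returns F1(male) - F1(female); values closer to 0 indicate more
    balanced performance across genders."""
    counts = {"male": [0, 0, 0], "female": [0, 0, 0]}  # tp, fp, fn
    for gold, pred in examples:
        if gold == pred:
            counts[gold][0] += 1
        else:
            counts[pred][1] += 1  # false positive for the predicted gender
            counts[gold][2] += 1  # false negative for the gold gender
    return f1(*counts["male"]) - f1(*counts["female"])

# A system that translates male entities well but misgenders many
# female entities shows a large positive gap:
ex = [("male", "male")] * 9 + [("male", "female")] + \
     [("female", "male")] * 6 + [("female", "female")] * 4
print(round(delta_g(ex), 3))  # → 0.187
```

A perfectly balanced system gives ∆G of 0; the negative ∆G values in Table 2 correspond to slightly better female than male F1.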
For gender translation, Table 2 shows WinoMT accuracy after gender-constrained search and reranking improving dramatically relative to the baseline scores. In the appendix we give examples of 4-best lists generated with and without gender constraints, highlighting why this is the case.

Beam | Oracle reranking | Gender search | en-de Acc | en-de ∆G | en-es Acc | en-es ∆G
 4   | no  | no  | 60.1 | 18.6 | 47.8 | 38.4
 4   | yes | no  | 66.8 | 10.0 | 53.9 | 27.8
 4   | yes | yes | 77.4 |  0.0 | 56.2 | 22.8
20   | no  | no  | 59.0 | 20.1 | 46.4 | 40.7
20   | yes | no  | 75.2 |  1.8 | 63.6 | 13.8
20   | yes | yes | 84.5 | -3.1 | 68.0 |  7.6

Table 2: WinoMT primary-entity accuracy (Acc) and male/female F1 difference ∆G under various search and ranking procedures. Gender search without reranking would be similar to the baseline, as gender constraints are constructed on the highest likelihood hypothesis from normal beam-4 search. Table 1 demonstrates BLEU under constrained search without reranking: the scores are extremely close to the baselines.

Beam | RoBERTa reranking | Gender search | en-de Acc | en-de ∆G | en-es Acc | en-es ∆G
 4   | no  | no  | 60.1 | 18.6 | 47.8 | 38.4
 4   | yes | no  | 66.4 | 10.4 | 53.0 | 28.6
 4   | yes | yes | 76.1 |  1.0 | 55.2 | 24.1
20   | no  | no  | 59.0 | 20.1 | 46.4 | 40.7
20   | yes | no  | 73.9 |  2.7 | 60.9 | 15.6
20   | yes | yes | 82.3 | -2.2 | 65.6 |  9.8

Table 3: WinoMT primary-entity accuracy (Acc) and male/female F1 difference ∆G under various search and ranking procedures, reranking with gender labels determined automatically from the source sentence by a RoBERTa model. Baseline scores with no reranking or gender search are duplicated from Table 2.

4.3 Controlling gender output with RoBERTa

So far we have shown accuracy improvements using oracle gender, which indicate that the nbest lists often contain a gender-accurate hypothesis. In Table 3, we explore reranking with source gender determined automatically by a RoBERTa model. There is surprisingly little degradation in WinoMT accuracy when using RoBERTa coreference resolution compared to the oracle (Table 2). We hypothesise that sentences which are challenging for pronoun disambiguation will also be among the most challenging to translate, so the sentences where RoBERTa disambiguates wrongly are already 'wrong' even with oracle information.

We note that Saunders and Byrne (2020) achieved 81.7% and 68.4% WinoMT accuracy on German and Spanish respectively after rescoring with a separate model adapted to a synthetic occupations dataset. With beam size 20 we are able to achieve very similar results without any change to our single translation model, even using automatic coreference resolution.

One benefit of this approach that we have not explored is its application to incorporating context: the pronoun disambiguation information could potentially come from a previous or subsequent sentence. Another benefit is the approach's flexibility in introducing new gendered vocabulary: anything that can be both defined for reranking and disambiguated in the source sentence can be searched for in the beam without requiring retraining.

5 Conclusions

We have explored improving gender translation by correcting beam search. Our experiments demonstrate that applying target language constraints at inference time can encourage models to produce more gender-diverse hypotheses. Moreover, we show that simple reranking heuristics can extract more accurate gender translations from these nbest lists using either oracle information or automatic source coreference resolution.

Unlike other approaches to this problem, we do not attempt to counter unidentified and potentially intractable sources of bias in the training data, or produce new models. However, we view our schemes as orthogonal to model or data bias mitigation. If a model ranks diverse gender translations higher in the beam initially, finding better gender translations during beam search becomes simpler.

[3] Scripts and data for lattice construction made available by the authors of Saunders and Byrne (2020) at https://github.com/DCSaunders/gender-debias
[4] Model from https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc
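The reranking heuristic of Section 2.2, gender agreement with the source pronoun first and log likelihood second with a likelihood-only fallback, can be sketched as follows. The gender_of function stands in for the alignment and morphological-analysis steps, and the example hypotheses and scores are invented.

```python
def rerank(nbest, source_gender, hyp_gender):
    """Pick a translation from an nbest list of (hypothesis, log_likelihood).

    hyp_gender(hypothesis) returns the grammatical gender of the aligned
    entity ('male'/'female'/None). Hypotheses agreeing with the source
    gender are preferred, ties broken by log likelihood; if nothing
    agrees, fall back to log likelihood alone."""
    agreeing = [h for h in nbest if hyp_gender(h[0]) == source_gender]
    pool = agreeing if agreeing else nbest
    return max(pool, key=lambda h: h[1])

nbest = [("Der Arzt kam an.", -1.2), ("Die Ärztin kam an.", -1.9)]
gender_of = lambda hyp: "female" if "Ärztin" in hyp else "male"

best, _ = rerank(nbest, "female", gender_of)
print(best)  # → Die Ärztin kam an.
```

The point of the heuristic is visible in the example: the gender-agreeing hypothesis is selected even though it has lower likelihood than the model's default, masculine, translation.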
Impact statement

Where machine translation is used in people's lives, mistranslations have the potential to misrepresent people. This is the case when personal characteristics like social gender conflict with model biases towards certain forms of grammatical gender. As mentioned in the introduction, the result can involve implicit misgendering of a user or other human referent, or perpetuation of social biases about gender roles as represented in the translation. A user whose words are translated with gender defaults that imply they hold such biased views will also be misrepresented.

We attempt to avoid these failure modes by identifying translations which are at least consistent within the translation and with the source sentence. This depends on identifying grammatically gendered terms in the target language; however, this element is very flexible and can be updated for new gendered terminology. We note that models which do not account for variety in gender expression, such as neopronoun use, may not be capable of generating appropriate gender translations, but in principle a variety of gender translations could be extracted from the beam.

By removing the data augmentation, tuning and retraining elements from approaches that address gender translation, we seek to simplify the process and remove additional stages during which bias could be introduced or amplified (Shah et al., 2020).

References

Bashar Alhafni, Nizar Habash, and Houda Bouamor. 2020. Gender-aware reinflection using linguistically enhanced neural models. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 139–150, Barcelona, Spain (Online). Association for Computational Linguistics.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Christine Basta, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 99–102, Seattle, USA. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 147–154, Florence, Italy. Association for Computational Linguistics.

Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1100–1111, Seattle, Washington, USA. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30–40, Valencia, Spain. Association for Computational Linguistics.

Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49–54, Florence, Italy. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789, Jeju Island, Korea. Association for Computational Linguistics.

Nicholas Roberts, Davis Liang, Graham Neubig, and Zachary C. Lipton. 2020. Decoding and diversity in machine translation. arXiv preprint arXiv:2011.13477.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.

Danielle Saunders and Bill Byrne. 2020. Reducing gender bias in neural machine translation as a domain adaptation problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7724–7736, Online. Association for Computational Linguistics.

Danielle Saunders, Rosie Sallis, and Bill Byrne. 2020. Neural machine translation doesn't translate gender coreference right unless you make it. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 35–43, Barcelona, Spain (Online). Association for Computational Linguistics.

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Artūrs Stafanovičs, Toms Bergmanis, and Mārcis Pinnis. 2020. Mitigating gender bias in machine translation with target gender annotations. In Proceedings of the Fifth Conference on Machine Translation (WMT).

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics.

Md Arafat Sultan, Shubham Chandel, Ramón Fernandez Astudillo, and Vittorio Castelli. 2020. On the importance of diversity in question generation for QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5651–5656, Online. Association for Computational Linguistics.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, Florence, Italy. Association for Computational Linguistics.

Tony Sun, Kellie Webster, Apu Shah, William Yang Wang, and Melvin Johnson. 2021. They, them, theirs: Rewriting with gender-neutral English. arXiv preprint arXiv:2102.06788.

Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003–3008, Brussels, Belgium. Association for Computational Linguistics.

Eva Vanmassenhove, Dimitar Shterionov, and Andy Way. 2019. Lost in translation: Loss and decay of linguistic richness in machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 222–232, Dublin, Ireland. European Association for Machine Translation.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199, Boston, MA. Association for Machine Translation in the Americas.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark. Association for Computational Linguistics.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3530–3534, Portorož, Slovenia. European Language Resources Association (ELRA).

A Constrained vs unconstrained beams

The difference in gender diversity between constrained and unconstrained beam search can be observed by examining the beams. Table 4 (next page) gives examples of 4-best lists for two WinoMT sentences translated into German. The lists are produced with and without gender constraints, and are ordered by model likelihood in both cases.

In both cases the best-scoring hypothesis fails to resolve gender coreference correctly, motivating the reranking scheme we describe in this paper. However, the effects of gender-constrained beam search become evident if we compare the 4-best lists. With regular unconstrained beam search, the model simply switches out more common lexical alternatives, such as different forms of the past tense. However, the grammatical genders of both entities in both examples are unchanged throughout the beam.

By contrast, gender-constrained beam search restricts changes in the nbest list to gendered inflections of the original best-scoring translation, as we expect. The changes made under beam search vary the grammatical genders of articles and entities, resulting in more gender-diverse beams.
We note that in the second example, the definite articles vary in terms of gender but the translated profession terms only vary in one case out of eight. This behaviour is mitigated with larger beam sizes.
English source: The cook prepared a dish for the teacher because she just learned a new dish.

Unconstrained 4-best:
  Der Koch bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht gelernt hatte.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht gelernt hat.
  Der Koch bereitete ein Gericht für die Lehrerin vor, weil sie gerade ein neues Gericht erlernte.

Constrained 4-best:
  Der Koch bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Der Koch bereitete ein Gericht für den Lehrer, weil sie gerade ein neues Gericht gelernt.
  Die Köchin bereitete ein Gericht für die Lehrerin, weil sie gerade ein neues Gericht gelernt.
  Die Köchin bereitete ein Gericht für die Lehrerin, weil sie gerade eine neues Gericht gelernt.

English source: The cook prepared a dish for the teacher because he is hungry.

Unconstrained 4-best:
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Der Koch bereitete dem Lehrer ein Gericht vor, weil er hungrig ist.
  Der Koch bereitete ein Gericht für den Lehrer vor, weil er hungrig ist.
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig.

Constrained 4-best:
  Der Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Die Koch bereitete ein Gericht für den Lehrer, weil er hungrig ist.
  Der Koch bereitete ein Gericht für die Lehrer, weil er hungrig ist.
  Die Köchin bereitete ein Gericht für den Lehrer, weil er hungrig ist.

Table 4: Results of beam-4 search for English-German translation of two WinoMT sentences, with and without gender constraints. 4-best lists are ordered by likelihood under the NMT model.