XPersona: Evaluating Multilingual Personalized Chatbot
Zhaojiang Lin∗, Zihan Liu∗, Genta Indra Winata∗, Samuel Cahyawijaya∗, Andrea Madotto∗, Yejin Bang, Etsuko Ishii, Pascale Fung
Center for Artificial Intelligence Research (CAiRE)
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{zlinao,zliucr,giwinata,scahyawijaya,amadotto}@connect.ust.hk, pascale@ece.ust.hk

(∗ Equal contributions. Listing order is random.)

Abstract

Personalized dialogue systems are an essential step toward better human-machine interaction. Existing personalized dialogue agents rely on properly designed conversational datasets, which are mostly monolingual (e.g., English); this greatly limits the usage of conversational agents in other languages. In this paper, we propose a multilingual extension of Persona-Chat (Zhang et al., 2018), namely XPersona. Our dataset includes persona conversations in six languages other than English for building and evaluating multilingual personalized agents. We experiment with both multilingual and cross-lingual trained baselines and evaluate them against monolingual and translation-pipeline models using both automatic and human evaluation. Experimental results show that the multilingual trained models outperform the translation pipeline and are on par with the monolingual models, with the advantage of having a single model across multiple languages. On the other hand, the state-of-the-art cross-lingual trained models achieve inferior performance compared to the other models, showing that cross-lingual conversation modeling is a challenging task. We hope that our dataset and baselines (1) will accelerate research in multilingual dialogue systems.

(1) Datasets and all the baselines are available at https://github.com/HLTCHKUST/Xpersona

1 Introduction

Personalized dialogue agents have been shown to be efficient in conducting human-like conversations. This progress has been catalyzed by existing conversational datasets such as Persona-Chat (Zhang et al., 2018; Dinan et al., 2019a). However, the training data are provided in a single language (e.g., English), and thus the resulting systems can perform conversations only in the training language. For wide usage, commercial dialogue systems are required to handle a large number of languages, since the smart home devices market is increasingly international (Etherington, 2019). Therefore, creating multilingual conversational benchmarks is essential, yet challenging, since it is costly to perform human annotation of data in all languages.

A possible solution is to use translation systems before and after the model inference: a two-step translation from any language to English and from English back to that language. This comes with three major problems: 1) amplification of translation errors, since current dialogue systems are far from perfect, especially with noisy input; 2) the three-stage pipeline is significantly slower in terms of inference speed; and 3) high translation costs, since the current state-of-the-art translation models, especially for low-resource languages, are only available through costly APIs.

In this paper, we analyze two possible workarounds to alleviate the aforementioned challenges. The first is to build a cross-lingual transferable system by aligning cross-lingual representations, as in Conneau et al. (2018), in which the system is trained on one language and zero-shot transferred to another. The second is to learn a multilingual system directly from noisy multilingual data (e.g., translated data), thus removing the dependence on a translation system at inference time.

To evaluate the aforementioned systems, we propose a dataset called Multilingual Persona-Chat, or XPersona, built by extending the Persona-Chat corpora (Dinan et al., 2019a) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. In XPersona, the training sets are automatically translated using translation APIs with several human-in-the-loop passes of mistake correction. In contrast, the validation and test sets
are annotated by human experts to facilitate both automatic and human evaluations in multiple languages.

Persona: I helped design the game starcraft. I am a famous twitch streamer.

En  Sys: do you know the game starcraft ? i helped designing it !
    Usr: that s awesome , i m sure my ex boyfriend would like that game
    Sys: i usually stream the game on my twitch channel
    Usr: that s great , i also love twitch as well

It  Sys: conosci il gioco starcraft? ho aiutato a progettarlo!
    Usr: è fantastico, sono sicuro che al mio ex ragazzo gli piacerebbe
    Sys: di solito faccio streaming mentre gioco sul mio canale Twitch

Fr  Sys: connaissez-vous le jeu starcraft? j’ai aidé à le concevoir!
    Usr: c’est génial, je suis sûr que mon ex petit ami aimerait ce jeu
    Sys: Je diffuse généralement le jeu sur ma chaîne Twitch

Id  Sys: apakah anda tahu game starcraft? saya yang mendesainnya!
    Usr: itu luar biasa, saya yakin mantan pacar saya suka game itu.
    Sys: saya biasanya memainkan game itu di channel twitch saya.

Zh  Sys: 你知道游戏《星际争霸》吗?我帮忙设计了它!
    Usr: 好厉害,我觉得我的前男友会喜欢那个游戏
    Sys: 我经常在我的直播频道上直播游戏

Ko  Sys: 너 게임 스타크래프트를 아니? 나는 그것을 디자인하는 것을 도왔어!
    Usr: 멋진데, 내 전 남자친구가 그 게임을 좋아할 거라고 확신해.
    Sys: 나는 보통 내 트위치 채널로 그 게임을 스트리밍해.

Jp  Sys: ゲームのスタークラフトを知っていますか?私はそれを設計するのを助けました!
    Usr: それはすごいです、私は私の元彼がそのゲームを好きになると確信しています
    Sys: 私は通常、twitchチャンネルでゲームをストリーミングします

Table 1: Multi-turn annotated dialogue samples from the test set in seven languages. For simplicity, we only show three turns for each dialogue and the persona in English.

Furthermore, we propose competitive baselines in two training settings, namely cross-lingual and multilingual, and compare them with translation-pipeline models. Our baselines leverage pre-trained cross-lingual (Chi et al., 2019) and multilingual (Devlin et al., 2018) models.

An extensive automatic and human evaluation (Li et al., 2019) of our models shows that a multilingual system is able to outperform strong translation-based models and is on par with, or even improves on, the monolingual model. The cross-lingual performance is still lower than that of the other models, which indicates that cross-lingual conversation modeling is very challenging. The main contributions of this paper are summarized as follows:

• We present the first multilingual non-goal-oriented dialogue benchmark for evaluating multilingual generative chatbots.

• We provide both cross-lingual and multilingual baselines and discuss their limitations to inspire future research.

• We show the potential of multilingual systems to understand mixed-language dialogue context and generate coherent responses.

2 Related Work

Dialogue Systems are categorized as goal-oriented (Williams and Young, 2007; Young et al., 2013) and chit-chat (Serban et al., 2016; Vinyals and Le, 2015). Interested readers may refer to Gao et al. (2018) for a general overview. In this paper, we focus on the latter, for which, in recent years, several tasks and datasets have been proposed to ground the conversation on knowledge (Dinan et al., 2019b; Gopalakrishnan et al., 2019; Shuster et al., 2018; Fan et al., 2019; Reddy et al., 2019; Choi et al., 2018; Moon et al., 2019), such as Wiki-Articles, Reddit-Post, and CNN-Article. In this work, we focus on personalized dialogue agents, where the dialogues are grounded on persona information.

Li et al. (2016a) was the first to introduce a persona-grounded dialogue dataset for improving response consistency. Later on, Zhang et al. (2018) and Dinan et al. (2019a) introduced Persona-chat, a multi-turn conversational dataset where two speakers are paired and a persona description (4–5 sentences) is randomly assigned to each of them. By
conditioning the response generation on the persona descriptions, a chit-chat model is able to produce a more persona-consistent dialogue (Zhang et al., 2018). Several works have improved on the initial baselines with various methodologies (Kulikov et al., 2018; Yavuz et al., 2019; Hancock et al., 2019; Madotto et al., 2019; Joshi et al., 2017; Zemlyanskiy and Sha, 2018), especially using large pre-trained models (Wolf et al., 2019; Zhang et al., 2019).

Multilingual  Extensive approaches have been introduced to construct multilingual systems, for example, multilingual semantic role labeling (Akbik et al., 2015; He et al., 2019), multilingual machine translation (Johnson et al., 2017), multilingual automatic speech recognition (Toshniwal et al., 2018; Yue et al., 2019; Nakayama et al., 2019; Winata et al., 2019c), and named entity recognition (Winata et al., 2019a,b). Multilingual deep contextualized models such as Multilingual BERT (M-BERT) (Devlin et al., 2018) have been commonly used to represent multiple languages and elevate the performance in many NLP applications, such as classification tasks (Pires et al., 2019), textual entailment, named entity recognition (K et al., 2020), and natural language understanding (Liu et al., 2019c).

Cross-lingual  Cross-lingual adaptation learns the inter-connections among languages and circumvents the requirement of extensive training data in target languages (Wisniewski et al., 2014; Zhang et al., 2016; Liu et al., 2019b). Cross-lingual transfer learning methods have been applied to multiple NLP tasks, such as named entity recognition (Ni et al., 2017; Xie et al., 2018), natural language understanding (Liu et al., 2019c), dialogue state tracking (Chen et al., 2018), part-of-speech tagging (Wisniewski et al., 2014; Zhang et al., 2016; Kim et al., 2017), and dependency parsing (Ahmad et al., 2019; Schuster et al., 2019b). Meanwhile, Lample and Conneau (2019) and Conneau et al. (2019) proposed pre-trained cross-lingual language models to align multiple language representations, achieving state-of-the-art results in many cross-lingual classification tasks. The aforementioned tasks focus on classification and sequence labeling; instead, Chi et al. (2019) proposed to pre-train both the encoder and decoder of a sequence-to-sequence model (XNLG) to conduct cross-lingual generation tasks, namely question generation and abstractive summarization. The latter is the closest to our task since it focuses on language generation; however, cross-lingual dialogue generation has not yet been explored.

Multilingual datasets have also been created for a number of NLP tasks, such as named entity recognition or linking (Sang, 2002; Sang and De Meulder, 2003; Pan et al., 2017; Aguilar et al., 2018), question answering (Liu et al., 2019a; Lewis et al., 2019), semantic role labeling (Hajic et al., 2009), part-of-speech tagging (Nivre et al., 2017), dialogue state tracking (Mrkšić et al., 2017), and natural language understanding (Schuster et al., 2019a). However, none of these datasets includes the multilingual chit-chat task.

          Validation                       Test
Lang   #Dial.  #Utt.   Edit    BLEU    #Dial.  #Utt.   Edit    BLEU
Fr     248     3868    21.23   94.45   249     3900    24.29   94.19
It     140     2160    83.01   80.45   140     2192    81.93   80.08
Id     484     7562    157.58  60.46   484     7540    156.19  60.66
Jp     275     4278    71.41   53.66   275     4322    75.83   49.56
Ko     299     4684    74.04   61.25   300     4678    70.96   62.49
Zh     222     3440    30.33   59.89   222     3458    33.07   64.61

Table 2: The statistics of the collected dataset. We report the number of dialogues (#Dial.) and utterances (#Utt.) of the validation and test sets in six languages. Edit distance per dialogue (Edit) and BLEU score are computed to show the difference between the human-annotated dataset and the auto-translated dataset. (Training set statistics are reported in Appendix A.)
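The Edit and BLEU columns of Table 2 (and of Table 7 in the appendix) quantify how far the human-revised dialogues drift from the raw machine translations. A minimal sketch of how such statistics can be computed is shown below; the paper does not specify the exact tokenization or BLEU implementation, so the character-level edit distance and the sacreBLEU call here are assumptions for illustration.

# Sketch: per-dialogue edit distance and corpus BLEU between the
# auto-translated and the human-revised versions of a dataset.
from typing import List
import sacrebleu  # pip install sacrebleu

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dataset_stats(auto: List[List[str]], revised: List[List[str]]):
    """auto / revised: one list of utterance strings per dialogue."""
    edits = [sum(levenshtein(x, y) for x, y in zip(d_a, d_r))
             for d_a, d_r in zip(auto, revised)]
    flat_auto = [u for d in auto for u in d]
    flat_rev = [[u for d in revised for u in d]]  # one reference stream
    bleu = sacrebleu.corpus_bleu(flat_auto, flat_rev).score
    return sum(edits) / len(edits), bleu  # avg edit per dialogue, BLEU

if __name__ == "__main__":
    auto = [["cool", "I see"]]
    revised = [["génial", "je vois"]]
    print(dataset_stats(auto, revised))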
3 Data Collection

The proposed XPersona dataset is an extension of the Persona-Chat dataset (Zhang et al., 2018; Dinan et al., 2019a). Specifically, we extend ConvAI2 (Dinan et al., 2019a) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. Since the test set of ConvAI2 is hidden, we split the original validation set into a new validation set and a test set. Then, we first automatically translate the training, validation, and
test sets using APIs (PapaGo (2) for Korean, Google Translate (3) for the other languages). For each language, we hired native-speaker annotators with a fluent level of English and asked them to revise the machine-translated dialogues and persona sentences in the validation and test sets according to the original English dialogues. The main goal of the human annotation is to ensure that the revised conversations are coherent and fluent in the target language despite the cultural discrepancies between languages. Therefore, annotators are not restricted to literal translations of the English dialogues; they are also allowed to customize dialogues and persona sentences. The annotated dialogues can deviate from the original translation while retaining persona and conversation consistency. The full annotation instructions are reported in Appendix A.

(2) https://papago.naver.com
(3) https://translate.google.com

Compared to collecting new persona sentences and dialogues in each language, human-annotating the dialogues by leveraging translation APIs has multiple advantages. First, it increases the data distribution similarity across languages (Conneau et al., 2018), which can better examine the system's cross-lingual transferability. Second, revising the machine-translated dialogues based on the original English dialogues improves the data construction efficiency. Third, it leverages the well-constructed English persona conversations as a reference to ensure dialogue quality without the need for training a new pool of workers to generate new samples (Conneau et al., 2018).

On the other hand, human-translating the entire training set (∼130K utterances) in six languages is expensive. Therefore, we propose an iterative method to improve the quality of the automatically translated training set. We first sample 200 dialogues from the training set (∼2600 utterances) in each language, and we assign human annotators to list all frequent translation mistakes in the given dialogues. For example, daily colloquial English expressions such as "cool", "I see", and "lol" are usually literally translated. After that, we use simple string matching to revise the inappropriate translations in the whole training set and return a revision log, which records all the revised utterances. Then, we assign human annotators to check all the revised utterances and list translation mistakes again. We repeat this process at least twice for each language. Finally, we summarize the statistics of the collected dataset in Table 2.
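The string-matching pass just described can be sketched as follows. This is a rough illustration, not the authors' actual tooling: the mistake lexicon, function names, and the format of the revision log are all hypothetical.

# Sketch of one human-in-the-loop revision pass over the auto-translated
# training set: annotators list frequent translation mistakes, a simple
# string match applies the fixes everywhere, and a revision log is kept
# for the next round of human checking.
from typing import Dict, List, Tuple

def revision_pass(utterances: List[str],
                  mistakes: Dict[str, str]) -> Tuple[List[str], List[int]]:
    """Apply string-matching fixes; return revised text and a log of indices."""
    revised, log = [], []
    for i, utt in enumerate(utterances):
        fixed = utt
        for bad, good in mistakes.items():
            fixed = fixed.replace(bad, good)
        if fixed != utt:
            log.append(i)  # flag for the next human-checking round
        revised.append(fixed)
    return revised, log

# Example: a literally-translated colloquialism mapped back to natural usage.
mistakes_fr = {"frais": "cool"}  # made-up entry for French
corpus = ["c'est frais, j'adore ce jeu"]
revised, log = revision_pass(corpus, mistakes_fr)
print(revised, log)  # ["c'est cool, j'adore ce jeu"], [0]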
4 Multilingual Personalized Conversational Models

Let us define a dialogue D = {U_1, S_1, U_2, S_2, ..., U_n, S_n} as an alternating set of utterances from two speakers, where U and S represent the user and the system, respectively. Each speaker has its corresponding persona description, which consists of a set of sentences P = {P_1, ..., P_m}. Given the system persona sentences P_s and the dialogue history D_t = {U_1, S_1, U_2, ..., S_{t-1}, U_t}, we are interested in predicting the system utterance S_t.

4.1 Model Architecture

We explore both encoder-decoder and causal decoder architectures, and we leverage existing pre-trained contextualized multilingual language models as weight initialization. Hence, we first define the multilingual embedding layer and then the two multilingual models used in our experiments.

Figure 1: (a) Multilingual Encoder-Decoder model. (b) Multilingual Causal Decoder model. (A detailed illustration is reported in Appendix B.)

Embedding  We define three embedding matrices: word embedding E^W ∈ R^{|V|×d}, positional embedding E^P ∈ R^{M×d}, and segmentation embedding E^S ∈ R^{|S|×d}, where |·| denotes set cardinality, d is the embedding size, V denotes the vocabulary, M denotes the maximum sequence length, and S denotes the set of segmentation tokens. The segmentation embedding (Wolf et al., 2019) is used to indicate whether the current token is part of i) persona sentences, ii) system (Sys.) utterances, iii) user utterances, or iv) the response in language l_id. The language embedding l_id is used to inform the model which language to generate. Hence, given a sequence of tokens X, the embedding function E is defined as:

E(X) = E^W(X) ⊕ E^P(X_pos) ⊕ E^S(X_seg),   (1)

where ⊕ denotes the position-wise sum, X_pos = {1, ..., |X|}, and X_seg is the sequence of segmentation tokens, as in Wolf et al. (2019). Figure 1 shows a visual representation of the embedding process. A more detailed illustration is reported in Appendix B.

Encoder-Decoder  To model the response generation, we use a Transformer-based (Vaswani et al., 2017) encoder-decoder (Vinyals and Le, 2015). As illustrated in Figure 1, we concatenate (4) the system persona P_s with the dialogue history D_t, apply the embedding layer E, and pass the result to the encoder. In short, we have:

H = Encoder(E([P_s; D_t])),   (2)

where H ∈ R^{L×d_model} is the hidden representation computed by the encoder, and L denotes the input sequence length. Then, the decoder attends to H and generates the system response S_t token by token. In the decoder, the segmentation embedding is the language ID embedding (e.g., we look up the embedding for Italian to decode Italian). Thus:

S_t = Decoder(H, l_id).   (3)

(4) We use the notation [a; b] for concatenating the vectors a and b.

Causal Decoder  As an alternative to encoder-decoders, causal decoders (Radford et al., 2018, 2019; He et al., 2018) have been used to model conversational responses (Wolf et al., 2019; Zhang et al., 2019) by giving the dialogue history as a prefix. In our model, we concatenate the persona P_s and the dialogue history D_t as the language model prefix and autoregressively decode the system response S_t based on the language embedding (i.e., l_id):

S_t = Decoder(E([P_s; D_t]), l_id).   (4)

Figure 1 shows the conceptual differences between the encoder-decoder and the causal decoder. Note that in both multilingual models, the dialogue history encoding process is language-agnostic, while the decoding language is controlled by the language embedding. This design allows the model to understand mixed-language dialogue contexts and to respond in the desired language (details in Section 5.3.2).

4.2 Training Strategy

We consider two training strategies to learn a multilingual conversational model: multilingual training and cross-lingual training.

Multilingual Training  jointly learns to perform personalized conversations in multiple languages. We follow a transfer learning approach (Wolf et al., 2019; See et al., 2019) by initializing our models with the weights of the large multilingual pre-trained model M-Bert (Pires et al., 2019). For the causal decoder, we add a causal mask to the self-attention layers to convert the M-Bert encoder into a decoder. For the encoder-decoder model, we randomly initialize the cross encoder-decoder attention (Rothe et al., 2019). Then, we train both models on the combined training set of all 7 languages using the cross-entropy loss.

Cross-lingual Training  transfers knowledge from the source language data to the target languages. In this setting, the model is trained on English (source language) conversational samples and evaluated on the other 6 languages. Following the methodology proposed by Chi et al. (2019), we align the embedded representations of different languages into the same embedding space by applying cross-lingual pre-training to the encoder-decoder model. The pre-training procedure consists of two stages:

• pre-training the encoder and the decoder independently using masked language modeling, as in Lample and Conneau (2019);

• jointly pre-training the encoder and decoder using two objective functions: Cross-Lingual Auto-Encoding (XAE) and Denoising Auto-Encoding (DAE) (Chi et al., 2019).
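A minimal PyTorch sketch of the embedding layer of Eq. (1) and the input packing [P_s; D_t] may clarify how the language ID enters the model: it is simply one more entry in the segmentation table, assigned to the response positions. All sizes and segment indices below are illustrative assumptions, not the released configuration.

import torch
import torch.nn as nn

class XPersonaEmbedding(nn.Module):
    # Eq. (1): E(X) = E^W(X) + E^P(X_pos) + E^S(X_seg), summed position-wise.
    def __init__(self, vocab_size=119547, max_len=512, n_segments=16, d=768):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d)   # E^W
        self.pos = nn.Embedding(max_len, d)       # E^P
        self.seg = nn.Embedding(n_segments, d)    # E^S: persona/sys/usr + language IDs

    def forward(self, x: torch.Tensor, seg: torch.Tensor) -> torch.Tensor:
        # x, seg: (batch, length); positions are simply 0..L-1
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        return self.word(x) + self.pos(pos) + self.seg(seg)

# Packing [P_s; D_t]: persona tokens get the PERSONA segment, context turns
# alternate SYS/USR, and the response being decoded carries the language ID.
PERSONA, SYS, USR, LANG_IT = 0, 1, 2, 7  # hypothetical segment indices
emb = XPersonaEmbedding()
tokens = torch.randint(0, 119547, (1, 6))
segments = torch.tensor([[PERSONA, PERSONA, USR, USR, LANG_IT, LANG_IT]])
print(emb(tokens, segments).shape)  # torch.Size([1, 6, 768])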
For instance, DAE adds perturbations to the input sentence of the encoder and tries to reconstruct the original sentence using the decoder, whereas XAE uses parallel translation data to pre-train both the encoder and the decoder with a machine translation objective. As in the multilingual models, the language IDs are fed into the decoder to control the language of the generated sentences. Both pre-training stages require parallel as well as non-parallel data in the target language.

After the two stages of pre-training, the model is fine-tuned using just the source language samples (i.e., English) with the same cross-entropy loss as in the multilingual training. However, as suggested in Chi et al. (2019), only the encoder parameters are updated with back-propagation, while both the decoder and the word embedding layer remain frozen. This retains the decoder's ability to generate multilingual output while still being able to learn new tasks using only the source language.

        Bert2Bert       M-Bert2Bert     CausalBert      M-CausalBert    XNLG
Lang    ppl.    BLEU    ppl.    BLEU    ppl.    BLEU    ppl.    BLEU    ppl.     BLEU
En      21.99   1.53    25.99   0.57    16.08   1.79    15.62   1.97    54.74*   2.25*
Zh      21.35   3.36    13.24   1.25    8.69    5.51    9.27    5.70    3482.27  2.16
It      50.36   0.60    24.16   0.31    18.41   1.32    15.12   1.30    917.63   0.41
Jp      10.09   5.23    10.64   0.79    11.00   6.74    7.13    4.53    999.81   0.00
Ko      12.81   0.24    34.31   0.00    9.66    1.06    9.56    1.08    -        -
Id      21.37   0.11    22.83   0.22    14.77   2.10    14.61   1.92    844.98   0.15
Fr      13.22   0.35    15.58   0.50    10.39   1.97    10.59   2.17    640.33   0.09

Table 3: Results of the automatic evaluation on the test set in seven languages. We compute the BLEU score and perplexity (ppl.) for the monolingual, multilingual, and cross-lingual models.

5 Experiments

5.1 Evaluation Metrics

Evaluating open-domain chit-chat models is challenging, especially in multiple languages and at the dialogue level. Hence, we evaluate our models using both automatic and human evaluation. In both cases, the human-annotated dialogues are used, which shows the importance of the provided dataset.

Automatic  For each language, we evaluate the responses generated by the models using perplexity (ppl.) and BLEU (Papineni et al., 2002) with reference to the human-annotated responses. Although these automatic measures are not perfect (Liu et al., 2016), they help to roughly estimate the performance of different models on the same test set. More recently, Adiwardana et al. (2020) showed the correlation between perplexity and human judgment in open-domain chit-chat models.
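A minimal sketch of the perplexity computation follows: perplexity is the exponential of the mean token-level negative log-likelihood of the gold response under the model. The model interface, tokenization, and padding convention are placeholder assumptions, not the paper's exact setup.

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, batches) -> float:
    """batches yield (input_ids, target_ids); targets use -100 for padding."""
    nll, n_tokens = 0.0, 0
    for input_ids, targets in batches:
        logits = model(input_ids)                       # (B, L, |V|)
        loss = F.cross_entropy(logits.transpose(1, 2),  # summed token-level NLL
                               targets, ignore_index=-100,
                               reduction="sum")
        nll += loss.item()
        n_tokens += (targets != -100).sum().item()
    return math.exp(nll / n_tokens)

if __name__ == "__main__":
    dummy = lambda x: torch.randn(x.size(0), x.size(1), 100)  # stand-in model
    data = [(torch.zeros(2, 5, dtype=torch.long),
             torch.randint(0, 100, (2, 5)))]
    print(perplexity(dummy, data))  # roughly the vocab size for a random model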
Human  Asking humans to evaluate the quality of a dialogue model is challenging, especially when multiple models have to be compared. The Likert score (a.k.a. 1-to-5 scoring) has been widely used to evaluate the interactive experience with conversational models (Venkatesh et al., 2018; See et al., 2019; Zhang et al., 2018; Dinan et al., 2019a). In such an evaluation, a human interacts with the systems for several turns and then assigns a score from 1 to 5 based on three questions (Zhang et al., 2018) about fluency, engagingness, and consistency. This evaluation is both expensive to conduct and requires many samples to achieve statistically significant results (Li et al., 2019). To cope with these issues, Li et al. (2019) proposed ACUTE-EVAL, an A/B-test evaluation for dialogue systems. The authors proposed two modes: human-model chats and self-chat (Li et al., 2016b; Ghandeharioun et al., 2019). In this work, we opt for the latter, since it is cheaper to conduct and achieves similar results to the former (Li et al., 2019). Another advantage of this method is the ability to evaluate multi-turn conversations instead of single-turn responses.

Following ACUTE-EVAL, the annotator is provided with two full dialogues made by self-chat or human dialogue. The annotator is asked to choose which of the two dialogues is better in terms of engagingness, interestingness, and humanness. For each comparison, we sample 60–100 conversations from both models. In Appendix C, we report the exact questions and instructions given to the annotators, and the user interface used in the evaluation. We hired native-speaker annotators for all six considered languages.
The annotators were different from the dataset collection annotators to avoid any possible bias.

        Engagingness            Interestingness         Humanness
Lang    Human   Mono    Poly    Human   Mono    Poly    Human   Mono    Poly
En      23.33   68.57   36.36   23.33   64.29   32.73   30.00   62.86   42.73
Fr      32.00   55.17   42.86   16.00   53.45   48.21   28.00   50.00   44.64
Id      21.67   51.67   65.45   23.33   46.67   55.45   25.00   46.67   65.45
It      35.00   48.33   56.36   30.00   48.33   53.64   30.00   40.00   57.27
Jp      18.33   50.00   61.82   13.33   43.33   45.45   18.33   51.67   59.09
Ko      30.00   52.46   62.39   26.67   50.82   59.63   28.33   52.46   64.22
Zh      36.67   55.00   65.45   36.67   60.00   61.82   36.67   55.00   70.91

Table 4: Results of the ACUTE-EVAL human evaluation. Tests are conducted pairwise between M-CausalBert (Multi) and the other models (Human, Poly-encoder (Poly), monolingual CausalBert (Mono)). Numbers indicate the winning rate of Multi. Numbers in bold are statistically significant (p < 0.05).

5.2 Implementation Details

Multilingual Models  We use the "BERT-Base, Multilingual Cased" checkpoint, and we denote the multilingual encoder-decoder model as M-Bert2Bert (∼220M parameters) and the causal decoder model as M-CausalBert (∼110M parameters). We fine-tune both models on the combined training set (English from Persona-chat (Zhang et al., 2018), six languages from XPersona) for five epochs with the AdamW (5) optimizer and a learning rate of 6.25e-5.

(5) AdamW: Adam algorithm with weight decay.

Monolingual Models  To verify whether the multilingual agent under-performs the monolingual agent in the monolingual conversational task, we build a monolingual encoder-decoder model and a causal decoder model for each language. For a fair comparison, we initialize the monolingual models with a pre-trained monolingual BERT (6) (Devlin et al., 2018; Cui et al., 2019; Martin et al., 2019). We denote the monolingual encoder-decoder model as Bert2Bert (∼220M parameters) and the causal decoder model as CausalBert (∼110M parameters). We then fine-tune each model on each language independently, with the same number of epochs and the same optimizer as the multilingual model.

(6) The monolingual BERT pre-trained models are available at https://github.com/huggingface/transformers

Translation-based Models  Another strong baseline we compare with is the Poly-encoder (Humeau et al., 2019), a large-scale pre-trained retrieval model that has shown state-of-the-art performance on the English Persona-chat dataset (Li et al., 2019). We adapt this model to the other languages by using the Google Translate API to translate target-language (e.g., Chinese) queries into English as the input to the model, and then translate the English response back into the target language. Thus, the response generation flow is: target query → English query → English response → target response. We denote this model as Poly.
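The three-stage flow of the Poly baseline can be sketched as below. Both translate and respond_en are stand-ins (for the Google Translate API and the pre-trained Poly-encoder, respectively); they are not real library calls.

from typing import Callable

def pipeline_response(query: str, lang: str,
                      translate: Callable[[str, str, str], str],
                      respond_en: Callable[[str], str]) -> str:
    en_query = translate(query, lang, "en")  # step 1: to English
    en_reply = respond_en(en_query)          # step 2: English-only model
    return translate(en_reply, "en", lang)   # step 3: back to the target language

# Toy stand-ins so the flow is runnable; a real system would call an MT API.
toy_mt = lambda text, src, tgt: f"[{src}->{tgt}] {text}"
toy_model = lambda q: "i usually stream the game on my twitch channel"
print(pipeline_response("你知道星际争霸吗?", "zh", toy_mt, toy_model))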
Cross-lingual Models  In the first pre-training stage, we use the pre-trained weights of XLMR-base (Conneau et al., 2019). Then, we follow the second pre-training stage of XNLG (Chi et al., 2019) to pre-train Italian, Japanese, Korean, and Indonesian cross-lingual transferable models. For Chinese and French, we directly apply the pre-trained XNLG weights (7) (Chi et al., 2019). Then, the pre-trained models are fine-tuned on the English PersonaChat training set and early-stopped based on the perplexity on the target-language validation set.

(7) Available at https://github.com/CZWin32768/XNLG

5.3 Results and Discussion

5.3.1 Quantitative Analysis

Table 3 compares the monolingual, multilingual, and cross-lingual models in terms of BLEU and perplexity on the human-translated test set. On both evaluation metrics, the causal decoder models outperform the encoder-decoder models. We observe that the encoder-decoder model tends to overlook the dialogue context and generate digressive responses (generated samples are available in Appendix D). We hypothesize that this is because the one-to-many problem (Zhao et al., 2017) in open-domain conversation weakens the relation between encoder and decoder; thus, the well pre-trained decoder (Bert) easily converges to a local optimum and
learns to ignore the dialogue context from the encoder, generating the response as an unconditional language model would. We leave the investigation of this problem to future work. On the other hand, M-CausalBert achieves comparable or slightly better performance than CausalBert, which suggests that M-CausalBert leverages the data from other languages. As expected, we observe a significant gap between the cross-lingual model and the other models, which indicates that cross-lingual zero-shot conversation modeling is very challenging.

Table 4 shows the human evaluation results comparing M-CausalBert (Multi) against the human, the translation-based Poly-encoder (Poly), and the monolingual CausalBert (Mono). The results illustrate that Multi outperforms Mono in English and Chinese, and is on par with Mono in the other languages. On the other hand, Poly shows strong performance in English, as it was pre-trained on a large-scale English conversation corpus. In contrast, the performance of Poly drops in the other languages, which indicates that imperfect translation affects translation-based systems. We also conducted a human evaluation of M-CausalBert (Multi) against XNLG (cross), and Multi achieves a winning rate of nearly 100 percent.

System Persona:
My favorite activity is rock climbing.
I’m a mechanical engineer.
I like to go hiking in the spring.
I live in colorado.

Dialogue:
Usr: studio inglese oggi e tu? (I am studying English today, you?)
Sys: just watching netflix with friends
Usr: 我喜欢攀岩,你呢? (I like rock climbing, what about you?)
Sys: nice, I enjoy it while in the summer. what are you up to?
Usr: saya ingin melakukan panjat tebing (I want to do rock climbing)
Sys: I’m going hiking the most time of year
Usr: スポーツがお好きなようですね (It seems you like sports)
Sys: I love the outdoors too, do you have any hobbies?
Usr: 나는 티비 보는 것을 좋아해. (I like to watch tv)
Sys: i really like hiking and listening to music

Table 5: Many-to-one: the model understands mixed-language dialogue context in multiple languages and generates responses in one language.

5.3.2 Qualitative Analysis and Discussion

We randomly sample 7 self-chat dialogues for each baseline model in the seven languages and report them in Appendix D. We summarize the generations of each model as follows:

Poly  The Poly-encoder, pre-trained on 174 million Reddit examples, can accurately retrieve coherent and diverse responses in English. However, in the other six languages, some of the retrieved responses are digressive due to translation errors.

Monolingual & Multilingual  We observe that both the monolingual and multilingual models can generate fluent responses. Compared to Bert2Bert and M-Bert2Bert, CausalBert and M-CausalBert generate more on-topic responses but sometimes repeat across turns. CausalBert and M-CausalBert are on par with each other in monolingual conversational tasks, while M-CausalBert shows the advantage of handling a mixed-language context. For multilingual speakers, the conversation may involve multiple languages. Therefore, we experiment with M-CausalBert in two settings: 1) many-to-one, in which users converse with the model in 6 languages and the model generates responses in English, and 2) one-to-many, in which users converse with the model in English and the model generates responses in 6 languages using the language embedding and the corresponding persona sentences. Table 5 and Table 6 illustrate generation examples under these settings (more examples are reported in Appendix D.1). Most of the time, M-CausalBert can understand the mixed-language context and decode coherent responses in different languages. Understanding mixed-language dialogue context is a desirable skill for end-to-end chit-chat systems, and a systematic study of this research question is needed in the future.

Cross-lingual  The current state-of-the-art cross-lingual generation approach, XNLG (Chi et al., 2019), shows inferior performance on multi-turn dialogue tasks and generates repetitive responses. Although cross-lingual dialogue generation is challenging, it reduces the human effort required for data annotation in different languages. Therefore, cross-lingual transfer is an important direction to investigate.
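Operationally, the many-to-one and one-to-many behaviors come from fixing the language embedding fed to the decoder while leaving the language-agnostic context encoding untouched. Below is a greedy-decoding sketch with a dummy model standing in for M-CausalBert; the model signature is an assumption for illustration, not the released interface.

import torch

def greedy_decode(model, context_ids: torch.Tensor, lang_id: int,
                  eos_id: int, max_new: int = 40) -> torch.Tensor:
    out = context_ids
    for _ in range(max_new):
        logits = model(out, lang_id)  # the language ID steers decoding
        next_id = logits[:, -1].argmax(-1, keepdim=True)
        out = torch.cat([out, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return out

# Dummy stand-in: random logits over a 100-token vocabulary. Swapping lang_id
# (e.g., English vs. Italian) is all that changes between the two settings.
dummy = lambda ids, lang: torch.randn(ids.size(0), ids.size(1), 100)
ctx = torch.zeros(1, 4, dtype=torch.long)
print(greedy_decode(dummy, ctx, lang_id=3, eos_id=1).shape)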
System Persona:
I love to drink fancy tea.
I have a big library at home.
I’m a museum tour guide.
I’m partly deaf.

Dialogue:
Usr: Hi, I am a computer science student, you?
En:  I’m a bookkeeper for the local museum of art.
Zh:  你好,我是一名博物馆老师。 (I am a teacher in a museum)
Fr:  bonjour, je suis juste un séjour à la maison maman de mon immense bibliothèque. (hello, I’m just a stay-at-home mom of my huge library.)
It:  Sono un bibliotecario, ma ho bisogno di rilassarmi. (I am a bookkeeper, but I need to relax)
Jp:  こんにちは。私は大学にいます、あなたの専攻は何ですか? (Hello. I am in college, what is your major?)
Id:  Saya tidak tahu, tetapi saya tuli. (I don’t know, but I am deaf)
Ko:  안녕, 나는 선생님이야. 너는 무엇을 공부하고 있니? (Hello, I am a teacher. What are you studying?)

Table 6: One-to-many: responses to one dialogue context in 7 different languages.

6 Conclusion

In this paper, we studied both cross-lingual and multilingual approaches in end-to-end personalized dialogue modeling. We presented the XPersona dataset, a multilingual extension of Persona-Chat, for evaluating multilingual personalized chatbots. We further provided both cross-lingual and multilingual baselines and compared them with the monolingual approach and the two-stage translation approach. Extensive automatic and human evaluations were conducted to examine the models' performance. The experimental results showed that multilingual trained models, with a single model across multiple languages, can outperform the two-stage translation approach and are on par with monolingual models. On the other hand, the current state-of-the-art cross-lingual approach XNLG achieved lower performance than the other baselines. In future work, we plan to research more advanced cross-lingual generation approaches and construct a mixed-language conversational benchmark for evaluating multilingual systems.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Named entity recognition on code-switched data: Overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 138–147.

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2440–2452.

Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2015. Generating high quality proposition banks for multilingual semantic role labeling. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397–407.

Wenhu Chen, Jianshu Chen, Yu Su, Xin Wang, Dong Yu, Xifeng Yan, and William Yang Wang. 2018. XL-NBT: A cross-lingual neural belief tracking framework. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 414–424.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2019. Cross-lingual natural language generation via pre-training. arXiv preprint arXiv:1909.10481.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485.
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019a. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Darrell Etherington. 2019. Amazon launches multilingual mode for using Alexa in multiple languages at once.

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190.

Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. ACM.

Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, and Amazon Alexa AI. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891–1895.

Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, M Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18.

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.

Shexia He, Zuchao Li, and Hai Zhao. 2019. Syntax-aware multilingual semantic role labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5353–5362.

Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pages 7944–7954.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. CoRR, abs/1905.01969.

Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Chaitanya K Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503.

Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838.

Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003.
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132. Association for Computational Linguistics.

Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019a. XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368.

Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, Andrea Madotto, and Pascale Fung. 2019b. Zero-shot cross-lingual dialogue systems with transferable latent variables. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1297–1303.

Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2019c. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems.

Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5454–5459.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.

Sahoko Nakayama, Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2019. Zero-shot code-switching ASR and TTS with multilingual machine speech chain. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 964–971. IEEE.

Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470–1480.

Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.

Erik F Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. arXiv preprint cs/0209010.

Erik F Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019a. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019b. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613.

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216.

Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945.

Shubham Toshniwal, Tara N Sainath, Ron J Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao. 2018. Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4904–4908. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2018. On evaluating and comparing conversational agents. arXiv preprint arXiv:1801.03625, 4:60–68.

Oriol Vinyals and Quoc V Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Jason D Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

Genta Indra Winata, Zhaojiang Lin, and Pascale Fung. 2019a. Learning multilingual meta-embeddings for code-switching named entity recognition. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 181–186. Association for Computational Linguistics.

Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, and Pascale Fung. 2019b. Hierarchical meta-embeddings for code-switching named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3532–3538.

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2019c. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 271–280.

Guillaume Wisniewski, Nicolas Pécheux, Souhir Gahbiche-Braham, and François Yvon. 2014. Cross-lingual part-of-speech tagging through ambiguous learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1779–1785.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369–379.

Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, and Dilek Hakkani-Tur. 2019. DeepCopy: Grounded response generation with hierarchical pointer networks. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 122–132.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Xianghu Yue, Grandee Lee, Emre Yılmaz, Fang Deng, and Haizhou Li. 2019. End-to-end code-switching ASR for low-resourced language pairs. arXiv preprint arXiv:1909.12681.

Yury Zemlyanskiy and Fei Sha. 2018. Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, page 551.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016. Ten pairs to tag – multilingual POS tagging via coarse mapping between embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1307–1317.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664.
A Dataset Collection

A.1 Annotation Instructions

In this section, we show the instructions for the French annotation:

• There are two existing columns of conversations: the first column (en) is the original conversations in English, and the second column (fr) is the conversations translated by an automatic system (e.g., Google Translate).

• You should copy the conversation from the second column (the translated conversations) into the third column (named fr_annotation). In that column, you should then revise the incorrect or inappropriate translations.

• The goal of the revision is to make the conversations more coherent and fluent in the target language (French). Hence you can customize dialogues and persona sentences to make them fluent and coherent in the target language, including by deviating from the original translation. However, you should retain persona and conversation consistency.

A.2 Training Set Statistics

We report the statistics of the iteratively revised training set in Table 7.

Lang   #Dial.   #Utt.    Edit   BLEU
Fr     16878    248244   0.06   99.98
It     16878    248244   1.09   99.80
Id     16878    248244   0.18   99.94
Jp     16878    248244   0.38   99.17
Ko     16878    248244   0.97   99.51
Zh     16878    248244   0.52   98.10

Table 7: The number of dialogues (#Dial.) and utterances (#Utt.) of the training set in six languages. Edit distance per dialogue and BLEU score are computed to show the difference between the iteratively revised dataset and the auto-translated dataset.

B Model Detail

Figures 3 and 4 illustrate the details of the multilingual causal decoder and the multilingual encoder-decoder models.

C Human Evaluation

As illustrated in Figure 2, the annotator is provided with two full dialogues made by a self-chat model or human dialogues. Then the annotators are asked the following questions:

• Who would you talk to for a long conversation?

• If you had to say one of these speakers is interesting and one is boring, who would you say is more interesting?

• Which speaker sounds more human?

Figure 2: Human evaluation interface modified from ACUTE-EVAL (Li et al., 2019).

D Generated Samples

D.1 Mixed-language Samples

We report more mixed-language samples generated by M-CausalBert in Tables 8 and 9.
Figure 3: Multilingual Causal Decoder model.

Figure 4: Multilingual Encoder-Decoder model.

D.2 Model Comparison Samples

We randomly sample one self-chat dialogue example for each model in each language and report them in Figures 5–37.
Example 1
System Persona:
i have two dogs.
i like to work on vintage cars.
my favorite music is country.
i own two vintage mustangs.

Dialogue:
Usr: do you have any hobbies?
En:  i like music, country is my favorite. you?
Zh:  我喜欢听乡村音乐。你呢 (I like country music, you?)
Fr:  je possède deux voitures anciennes et j’aime faire de l’artisanat (I have two vintage cars and I like to make crafts)
It:  Ciao. mi piace lavorare su auto d’epoca e ascoltare musica country. (Hello. I love working on vintage cars and listening to country music.)
Jp:  こんにちは。私は大学にいます、 (Hi, i really like it)
Id:  Saya suka mendengarkan musik country, dan Anda? (I like listening to country music, and you?)
Ko:  안녕, 나는 선생님이야. (I have two vintage mustangs.)

Example 2
System Persona:
i love disneyland and mickey mouse.
i love to spend time with my family.
i’m a baby delivery nurse.
i walk three miles every day.

Dialogue:
Usr: what do you do in your free time?
En:  i love to play with my family. how about you?
Zh:  我主要是在家里 (I just stay at home)
Fr:  j’aime passer du temps avec ma famille et mon mari. (i love spending time with my family and my husband.)
It:  Ciao. Sono in pensione da quando ero un’infermiera. (Hello. I have been retired since I was a nurse.)
Jp:  こんにちは、私は家族と時間を過ごすのが大好きです (Hi, i love spending time with my family)
Id:  Saya suka berjalan 3 mil setiap hari dan Anda? (I love walking 3 miles every day, and you?)
Ko:  안녕, 나는 가족과 시간을 보내는 것을 즐긴다. (Hi, I enjoy spending time with my family.)

Table 8: One-to-many by M-CausalBert.

Example 1
System Persona:
i am a cancer survivor.
my parents were both very athletic.
i got a new job just yesterday to be a life coach.
i am a clean eater.

Dialogue:
Usr: こんにちは (Hello)
Sys: hey there how are you?
Usr: 我很好,我是一名学生,你呢? (I am good, I am a student, what about you?)
Sys: i am a full time student and have no job yet.
Usr: cosa Studi? (What do you study?)
Sys: oh i see. what are you studying?
Usr: l’informatique (computer science)
Sys: computer stuff. what is your favorite subject?
Usr: i like machine learning.
Sys: i like to read and make fun of things.
Usr: Apa pekerjaan orang tuamu? (what do your parents do?)
Sys: i’m from cali, my parents were both athletic.

Example 2
System Persona:
my brother is older than me.
i ride the bus to school.
i’m on the soccer team.
i am 13 years old.

Dialogue:
Usr: 안녕 (Hello)
Sys: hello, how are you?
Usr: 我很好,刚刚下班回家 (I am good, just came home from work)
Sys: oh, cool. what do you do for work?
Usr: Sono un insegnante di inglese (I am an English teacher.)
Sys: oh, that’s awesome! i hope to get a job soon.
Usr: quel âge avez-vous? (how old are you?)
Sys: oh ok. i’m only 13, but my brother is older than me
Usr: Apakah Anda pelajar? (are you a student?)
Sys: do you play soccer? i am a football player
Usr: no, i like to read
Sys: i like to ride the bus and play soccer

Table 9: Many-to-one by M-CausalBert.