TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems

Bill Byrne*, Karthik Krishnamoorthi*, Saravanan Ganesh*, Mihir Sanjay Kale
Google, Mountain View, CA
{billb,krishnamoorthi,srrvnn,mihirkale}@google.com
* Equal contribution

arXiv:2012.12458v1 [cs.CL] 23 Dec 2020

Abstract

We present a data-driven, end-to-end approach to transaction-based dialog systems that performs at near-human levels in terms of verbal response quality and factual grounding accuracy. We show that two essential components of the system produce these results: a sufficiently large and diverse, in-domain labeled dataset, and a neural network-based, pre-trained model that generates both verbal responses and API call predictions. In terms of data, we introduce TicketTalk, a movie ticketing dialog dataset with 23,789 annotated conversations. The movie ticketing conversations range from completely open-ended and unrestricted to more structured, both in terms of their knowledge base, discourse features, and number of turns. In qualitative human evaluations, model-generated responses trained on just 10,000 TicketTalk dialogs were rated to "make sense" 86.5% of the time, almost the same as human responses in the same contexts. Our simple, API-focused annotation schema results in a much easier labeling task, making it faster and more cost effective. It is also the key component for being able to predict API calls accurately. We handle factual grounding by incorporating API calls in the training data, allowing our model to learn which actions to take and when. Trained on the same 10,000-dialog set, the model's API call predictions were rated to be correct 93.9% of the time in our evaluations, surpassing the ratings for the corresponding human labels. We show how API prediction and response generation scores improve as the dataset size incrementally increases from 5000 to 21,000 dialogs. Our analysis also clearly illustrates the benefits of pre-training. To facilitate future work on transaction-based dialogs, we have publicly released the TicketTalk dataset at https://git.io/JL8an.

1 Introduction

Building a dialog system that handles human conversational behavior is challenging because it must respond sensibly and relevantly to a wide variety of context-sensitive user input over multiple conversation turns. Task-based systems, e.g. those used for ticket booking, food ordering, etc., face further hurdles to incorporate ever changing, real-world knowledge into the dialog and execute transactions. Recently, there has been growing interest in the so-called end-to-end approach to task-based dialog systems (Peng et al., 2020; Hosseini-Asl et al., 2020; Lin et al., 2020; Wen et al., 2017; Bordes et al., 2016) due to its relatively simple and scalable architecture, and promising results in chatbot applications (Vinyals and Le, 2015; Serban et al., 2015b). Inspired by sequence-to-sequence learning (Sutskever et al., 2014), this approach trains a single model on a dialog dataset to form the basis for a given application. For each dialog turn, the model effectively takes the conversation history as its input and generates an appropriate response.

To gain wider adoption, the end-to-end approach must overcome challenges with respect to training data and factual grounding. In terms of training data, there is already general concern in the NLP community about the lack of quality, task-oriented dialog datasets, especially domain-specific collections (Wen et al., 2017; Bordes et al., 2016). This problem is compounded for end-to-end approaches since they typically require a large amount of in-domain data to generate competitive results. With respect to grounding, since the end-to-end approach is based on a single neural network, it must either incorporate the knowledge base (KB) into the model itself, or the model must be able to accurately predict which API calls to make and when. In addition, details returned from the API calls must be accurately incorporated in conversational
responses. This is contrasted with modular architectures where the user's intent is derived from a structured representation and then used to determine which API calls to make, such as in Rastogi et al. (2020) and Madotto (2020).

In this work we promote an end-to-end approach to single-domain, transaction-based dialog systems and describe how we overcome both the data and grounding challenges described above. In qualitative evaluations, our models perform on par with humans in generating verbal responses as well as predicting API calls. Just two components form the basis for this system: a sufficiently large, in-domain, labeled dataset and a pre-trained transformer model. Combining natural language output and structured API calls into a unified text-to-text format allows us to leverage general purpose text-to-text transformers to train models. Specifically, we use the T5 infrastructure (Raffel et al., 2019) and show that its pre-training feature has a significant impact on evaluations, boosting scores by 30 percent.

Models were trained on our TicketTalk dataset, our new movie ticketing dialog dataset with 23,789 conversations labeled with a simple yet unique API-based annotation schema. This makes it one of the largest single-domain datasets to date. A public release of the dataset accompanies this paper. We chose movie ticketing since it is both transaction-based and relatively complex, but our overall approach to dialog systems applies to any task-based domain. While there is a lot of recent work on multi-domain task-based dialog systems, human-like interaction for even single-domain tasks has yet to be demonstrated. By first solving the problem for a single domain, we argue that replicating the process for multiple domains will be achievable by simply training additional high-quality datasets labeled with the same API-focused strategy.

2 Related work and background

2.1 Datasets

Over the past few years the NLP community has responded to the lack of dialog data with larger, publicly released task-oriented datasets spanning multiple domains (Wu et al., 2020; Budzianowski and Vulić, 2019). This underscores the crucial role data plays in any approach to task-based dialog systems. MultiWOZ (Budzianowski et al., 2018) consists of 10,420 dialogs in multiple domains and has become a popular benchmarking corpus for state tracking. It has also undergone a series of subsequent refinements. MSR-E2E, featured in the Microsoft dialog challenge (Li et al., 2018), has 10,087 dialogues in three domains: movie-ticket booking, restaurant reservation, and taxi booking. Taskmaster-1 (Byrne et al., 2019) offers 13,215 dialogs in six domains and has been updated with a second installment, Taskmaster-2 (Byrne et al., 2020), which adds 17,289 more dialogs, totalling over 30,000. The Schema Guided Dialogue dataset (Rastogi et al., 2020) has 22,825 dialogs in multiple domains. MetaLWOZ (Lee et al., 2019) has 37,884 dialogs in 227 domains and is aimed at helping models more accurately predict user responses in new domains. Both Schema and MetaLWOZ are used in DSTC8 (Kim et al., 2019). In addition to these, Serban et al. (2018) provides a thorough survey of dialog corpora released in previous years.

2.2 Modular vs. end-to-end architectures

In contrast to the end-to-end[1] approach, traditional, modular strategies employ a division of labor among the components, e.g. understanding, state tracking, dialog policy, generation, etc., which are either largely hand-crafted or derived from training individual models on labeled datasets (Wen et al., 2017; Young et al., 2013). This architecture is inherently more complex than the single-model end-to-end strategy we propose and can require significantly more design and engineering. Moreover, since each module requires its own supervised training dataset, it is harder to apply to different domains (Serban et al., 2015a).

[1] The term "end-to-end" is sometimes also used when describing parts of modular systems (Li et al., 2017; Wen et al., 2017) but it is fundamentally different from the single text-to-text transformer model approach we present here.

Figure 1: Traditional modular system

However, the separation of functions makes the modular approach more transparent and in some respects easier to debug. It has also been considered by some to be better equipped to interact with
external APIs (Sukhbaatar et al., 2015; Wen et al., 2017) and therefore might be better suited for task-based dialogs. As mentioned above, we show that our single model-based approach can accurately generate both the appropriate response as well as predict the correct API call at the right time.

Figure 2: Simplified end-to-end system

3 The TicketTalk dataset

3.1 Overview

The TicketTalk movie ticketing dataset was created using the self-dialog collection method (Krause et al., 2017; Moghe et al., 2018; Byrne et al., 2019) where a crowd-sourced worker writes both sides of the dialog (i.e. both customer and ticketing agent turns) based on a particular scenario and set of instructions. Following the annotation strategy used for Taskmaster-1 (Byrne et al., 2019), we limit labels to basic entities and events (i.e. API calls).

STAT TYPE               VALUE
Dialogs                 23,789
Total turns             481,632
Unique tokens           62,868
Avg. turns per dialog   20.25
Avg. tokens per turn    10.35
Unique named entities   57,285

Table 1: TicketTalk Dataset Statistics

The rationale for limiting dialogs to a single domain (movie ticketing) is based on our hypothesis that human-level performance in terms of both response generation and API call prediction for a particular task requires larger (i.e. 10,000+), more diverse datasets than are currently available. In other words, carefully curated, annotated datasets that cover all the idiosyncrasies of a single task or transaction are a key factor in model performance. Concern about the cost and efficiency of creating these larger corpora has led some researchers to look for approaches that alleviate dependencies on annotated data (Budzianowski and Vulić, 2019; Wen et al., 2017). However, significant time and expense can be saved when assembling these corpora by simplifying the collection and annotation procedures. In addition, little to no training is required for workers to be able to perform consistently well.

3.2 Collection methodology

Using self-dialogs (where a worker creates the whole conversation, both user and agent turns) facilitates building large and linguistically rich datasets since it is both simple and cost effective, and allows users to draw on their lifetime of conversational experiences. This in turn ensures the model can handle the wide range of human conversational behaviors that emerge in natural dialog. For this project we extended the self-dialog to include over three dozen sets of user instructions to generate a wider variety of conversations, from open-ended prompts to more specific instructions that require specific types of exchanges. For example, one set simply instructs workers to "write the transcription of a conversation" in which a person makes a successful ticket transaction with a booking agent. This allows dialog creators to express their unique view of what a typical movie ticketing transaction would be, structuring each conversation how they see fit. They are also instructed to find real values for required details (i.e. slots) such as time, date, theater, movie, etc. using a movie or theater site of their choice for a specific location. This ensures the dataset has a large and diverse KB. In contrast, the more restrictive sets of instructions focus on specific sub-dialogs for error handling, changing a detail, entity resolution, and the like. In such cases we often provide a limited KB with one or more values for all the details so the worker can focus on the primary task of creating a realistic set of exchanges for this type of interaction. In a third type of scenario, the conversation is partially completed and the user's task is focused on a very specific part of the exchange. This allows us to "fill holes" in the data quickly and cost effectively. That is, we can create large numbers of short, conversational examples that the model does not handle adequately and then retrain for better results.

3.3 Annotation

Dialog data annotation can be complex and time consuming even for trained linguists as it typically involves carefully and consistently labeling dialog states, user intents, and dialog acts, among other possible labels (Henderson et al., 2013; Wen
et al., 2017; Budzianowski et al., 2018). The API-targeted approach is far more straightforward since only basic entities (e.g. name, time, number of tickets, theater, movie attributes, etc.) and API calls (e.g. to find theaters, movies, and showtimes, book tickets, etc.) are labeled. The task is therefore easier to learn, faster to complete, and cheaper to run. Moreover, as we discuss below, it fits well with the text-to-text format we use in our approach to transaction-based dialog systems. The full annotation schema is included with the dataset release.

4 A novel end-to-end approach

4.1 Overview

We implement a new approach to end-to-end dialog systems by combining natural language output and structured API calls into a unified text-to-text format where the input and output are always text strings. This allows us to leverage widely available, state of the art, general purpose text-to-text transformers as the foundation of our system. Specifically, we used the publicly available Text-To-Text Transfer Transformer (T5) (Raffel et al., 2019) to train our models. The T5 framework was designed specifically to explore transfer learning techniques for NLP and includes pre-training on the Colossal Clean Crawled Corpus (C4), composed of hundreds of gigabytes of web-based English text (Raffel et al., 2019). The original pre-training objective for the C4 corpus in the T5 framework was a denoising task, i.e. recovering missing words from the input. Since this type of task scales well to multiple downstream tasks, we used our custom inputs/targets from the TicketTalk dataset to represent an end-to-end task-based dialog system and ultimately achieve positive results.

TOKEN   FUNCTION
U       user
A       agent
PN      program name
PAN     program argument name
PAV     program argument value
PR      program response
PRAN    program response argument name
PRAV    program response argument value
C       conversation context

Table 2: Tokens identifying string type and function

4.2 Setup

We use T5-Base (Raffel et al., 2019) as our pre-trained model, which follows the transformer architecture (Vaswani et al., 2017) and consists of 220M parameters. It was pre-trained on the large scale C4 dataset mentioned above for 1M steps with a span corruption objective. We fine-tune this model on the Taskmaster-3 dataset for 40,000 steps with a constant learning rate of 0.001 using 16 TPU v3 chips. The batch size was set to 131,072 tokens per batch. The maximum input sequence length and output length were set to 1024 and 256 tokens respectively.
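As an illustration only (not the training setup used for the results reported here, which relied on the T5 framework and TPU hardware described above), the same text-to-text fine-tuning configuration can be sketched with the Hugging Face Transformers library. The tab-separated file of serialized (input, target) pairs and the single-example batching are assumptions made to keep the example self-contained.

```python
# Minimal T5-Base fine-tuning sketch for serialized (input, target) dialog pairs.
# "tickettalk_pairs.tsv" is a hypothetical file with one "input<TAB>target" line
# per training example, produced by the serialization described in Section 4.3.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from transformers.optimization import Adafactor

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.train()

# Constant learning rate of 0.001, mirroring the setup described in Section 4.2.
optimizer = Adafactor(model.parameters(), lr=1e-3,
                      scale_parameter=False, relative_step=False)

pairs = [line.rstrip("\n").split("\t", 1)
         for line in open("tickettalk_pairs.tsv", encoding="utf-8")]

for source, target in pairs:
    # Inputs are truncated to 1024 tokens and targets to 256, as in Section 4.2.
    inputs = tokenizer(source, max_length=1024, truncation=True, return_tensors="pt")
    labels = tokenizer(target, max_length=256, truncation=True, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```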
4.3 Model and implementation

The goal of our model is to generate a text string that either serves as a verbal response to the user or that contains one or more API calls with the data required at the current stage of the conversation. Verbal responses come in two flavors: those that depend on a particular API call's details and those that do not. For example, when an API is invoked to find theater names for a given movie and location, the details returned from the API call must be correctly incorporated into the system's next response, e.g. "I found two theaters, AMC 20 and Century City 16." In contrast, other verbal outputs, e.g. "What city do you plan to see the movie in?", are derived from the overall conversation history.

Given the required text-to-text format used in our approach, we identify the type and function of each string by converting the annotations to a set of tokens. As shown in Table 2, tokens identify the speaker, i.e. user vs. agent, the string type, i.e. utterance vs. API call, and the details of each API call, both names as well as input parameters and values, and response parameters and values. We also tag the conversation "context", which separates the most recent turn from previous turns. Our token key is shown in Table 2.

The first step is to use tokens to represent the user and agent interactions, providing speaker information to the model by the use of "<U>" and "<A>". We then convert any API invocations into their text equivalent using tokens for marking API names, argument types and values, i.e. "<PN>", "<PAN>", etc. The results of these two steps are shown in Table 3.

<U> I'd like to watch a movie.
<A> Sure. I can help you with that. What kind of movies are you interested in?
<U> Are there any good action movies?
<PN> find movies <PAN> name.genre <PAV> action
<PR> find movies <PRAN> name.movie <PRAV> John Wick <PRAV> Jack Ryan
<A> I found John Wick and Jack Ryan.

Table 3: Speaker turns and API calls identified with tokens

The next step is to create the model inputs and targets. We use the following algorithm to accomplish this:

1. Initialize the conversation context to an empty string.

2. Iterate through the interactions and do the following:

   (a) If the sentence is a user utterance (<U>) or a program response (<PR>), add it to the model input along with the conversation context (if present).

   (b) If the sentence is an agent utterance (<A>) or a program invocation (<PN>), add it to the model target.

   (c) If both a model input and a target have been created, output the (input, target) pair and update the conversation context to reflect this.

   (d) Continue from (2) to generate the next (input, target) pair.

Using these rules, the model inputs and targets are generated as in Table 4.

INPUTS:  <U> I'd like to watch a movie.
TARGETS: <A> Sure. I can help you with that. What kind of movies are you interested in?

INPUTS:  <C> I'd like to watch a movie. Sure. I can help you with that. What kind of movies are you interested in? <U> Are there any good action movies?
TARGETS: <PN> find movies <PAN> name.genre <PAV> action

INPUTS:  <C> I'd like to watch a movie. Sure. I can help you with that. What kind of movies are you interested in? Are there any good action movies? <PR> find movies <PRAN> name.movie <PRAV> John Wick <PRAV> Jack Ryan
TARGETS: <A> I found John Wick and Jack Ryan.

Table 4: Generating inputs vs. targets

Once the model has been trained on inputs and targets, we can use the system to accomplish tasks in the following manner:

1. Obtain the user utterance and format it by adding the speaker token.

2. Provide the formatted utterance to the model.

3. Obtain the model prediction:

   (a) If the model prediction contains the agent (<A>) token, format it and show it to the user.
       i. Update the conversation context and start again from (1).

   (b) If the model prediction contains the program (<PN>) token:
       i. Extract the program argument name (<PAN>) and value (<PAV>).
       ii. Issue the API call by providing it to the API adapter.
       iii. Format the API results and provide them to the model along with the conversation context.
       iv. Start from (3).

This interaction lifecycle is illustrated in Figure 3.

Figure 3: System interaction life cycle
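To make the input/target construction concrete, the sketch below is one possible reading of the algorithm above applied to a token-tagged dialog. The function and variable names are illustrative assumptions, and the context flattening is simplified relative to whatever tooling produced the released data.

```python
# A minimal sketch of the (input, target) generation described in Section 4.3.
# Tagged turns follow the token key in Table 2; helper names are illustrative.
from typing import List, Tuple

USER_SIDE = ("<U>", "<PR>")    # contributes to model inputs
AGENT_SIDE = ("<A>", "<PN>")   # contributes to model targets
ALL_TAGS = ("<U>", "<A>", "<PN>", "<PAN>", "<PAV>", "<PR>", "<PRAN>", "<PRAV>")

def make_pairs(tagged_turns: List[str]) -> List[Tuple[str, str]]:
    """Convert a token-tagged dialog into (input, target) training pairs."""
    pairs = []
    context = ""        # running conversation context, prefixed with <C> when used
    model_input = None
    for turn in tagged_turns:
        if turn.startswith(USER_SIDE):
            # User utterances and program responses become the model input.
            model_input = f"<C> {context} {turn}" if context else turn
        elif turn.startswith(AGENT_SIDE):
            # Agent utterances and program invocations become the model target.
            if model_input is not None:
                pairs.append((model_input, turn))
                model_input = None
        # Grow the context with the untagged text of every processed turn.
        text = turn
        for tag in ALL_TAGS:
            text = text.replace(tag, " ")
        context = " ".join(f"{context} {text}".split())
    return pairs

dialog = [
    "<U> I'd like to watch a movie.",
    "<A> Sure. I can help you with that. What kind of movies are you interested in?",
    "<U> Are there any good action movies?",
    "<PN> find movies <PAN> name.genre <PAV> action",
]
for inp, tgt in make_pairs(dialog):
    print(inp, "=>", tgt)
```

Run on the four turns above, this reproduces the first two rows of Table 4.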
4.4 Invoking APIs

When we detect an API call in the output, we invoke the API, retrieve the results, and embed the responses in the next model input. As shown in Figure 4, each API call predicted by the model typically contains a generic API name, such as "find-movies" or "find-theaters", and a list of key-value pairs that detail the specific parameters to be used while invoking the API.

Figure 4: Example API invocation (outside model)

The API call, while structured, may still include pronouns or other co-referential phrases as input parameters. For example, the date parameter for an API call might contain the value "tonight", and the location value might be "nearby". The resolution of these entities happens outside the core interaction layer in what can be understood as the "API adapter" (and not the actual API itself). This not only helps simplify annotation, but also helps leverage existing solutions to these well defined problems. This separation of the API layer is also useful for encapsulating all API-specific artifacts, like authentication tokens, endpoint addresses and data formatters. In this way, the end-to-end system is able to interact with the user to solicit details relevant to the task, generate API calls to fetch data from external knowledge sources, and use the responses provided by the API call to construct natural language responses.
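The sketch below illustrates the kind of adapter this section describes: it parses a predicted call, resolves relative argument values, consults a knowledge source (here an in-memory stand-in for an external API), and re-serializes the result with the response tokens from Table 2 so it can be fed back to the model. The helper names, the toy knowledge base, and the resolution rules are assumptions for illustration, not part of the system used in our experiments.

```python
# Sketch of an "API adapter" layer: entity resolution, API plumbing, and response
# formatting all live here, outside the core text-to-text interaction loop.
import datetime

KB = {  # stand-in for an external movie-catalog API
    ("find movies", "action"): ["John Wick", "Jack Ryan"],
}

def resolve(arg_name: str, value: str) -> str:
    """Resolve co-referential values (e.g. "tonight") outside the core layer."""
    if arg_name == "name.date" and value == "tonight":
        return datetime.date.today().isoformat()
    return value

def parse_call(prediction: str):
    """Split '<PN> find movies <PAN> name.genre <PAV> action' into name and args."""
    api_name = prediction.split("<PAN>")[0].replace("<PN>", "").strip()
    args = {}
    for chunk in prediction.split("<PAN>")[1:]:
        arg_name, arg_value = chunk.split("<PAV>", 1)  # one value per argument here
        args[arg_name.strip()] = resolve(arg_name.strip(), arg_value.strip())
    return api_name, args

def invoke(prediction: str) -> str:
    """Issue the call and format the result as a <PR> string for the next input."""
    api_name, args = parse_call(prediction)
    movies = KB.get((api_name, args.get("name.genre", "")), [])
    out = f"<PR> {api_name} <PRAN> name.movie"
    for title in movies:
        out += f" <PRAV> {title}"
    return out

print(invoke("<PN> find movies <PAN> name.genre <PAV> action"))
# -> <PR> find movies <PRAN> name.movie <PRAV> John Wick <PRAV> Jack Ryan
```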
5 Experiments

5.1 Overview

In this section, we show how our end-to-end approach to transaction-based dialog systems produces verbal responses and predicts API calls with near human-level quality and accuracy. Through human qualitative evaluations, we show that two aspects in particular, dataset size and pre-training, significantly affect performance. Below we describe our evaluation methodology followed by a detailed discussion of the experiment results.

5.2 Evaluation methodology

Dataset size and pre-training are key factors in creating models for end-to-end dialog systems. To understand the amount of data required for our approach, we trained four models, each on a randomly selected subset of the TicketTalk dataset of a different size, namely 5000, 7500, 10,000 and 21,000 dialogs. To measure the effect of transfer learning, we trained a second 10,000-dialog model without the T5 framework's pre-training component, setting up an A-B comparison with the pre-trained model.

As mentioned earlier, our models generate three types of output: API calls, verbal responses based on the results of an API call, and "plain" verbal responses based on the conversation context (i.e. not dependent on a particular API call response). We set up a pair of evaluations for each type. The first evaluation asked human raters to evaluate the model's output given a specific conversation history (i.e. context) while the second asked raters to evaluate the human's response for the same set of contexts. Each experiment included 1000 context-response pairs of varying lengths, i.e. some conversation histories might have just one exchange (a user and agent turn) while others could have up to nine exchanges. We requested three ratings for each question, distributed among a pool of about 900 paid raters, for a total of 3000 data points per experiment. Table 5 and Table 6 below show a sample context-response pair presented to human raters for each type of model output.

CONTEXT:
Cust: Can you help me book a movie ticket?
Agent: Yes I can.
Cust: Can you find tickets for the movie Knives Out?
Agent: Sure! What time did you want to book?
Cust: 5 PM would be best.

NEXT RESPONSE:
Agent: OK. Do you have any theaters in mind?

Table 5: Context paired with generated verbal response

We use our "makes-sense" metric to evaluate the model-generated responses and API call predictions against the human standard. For verbal responses, we ask one question:

• Does the agent's next response make sense?

For negative answers, we give a list of reasons raters believe it does not make sense (i.e. off topic, repeated information, incorrect details, language mistakes, other). For API call predictions there are two questions:

1. Do all the action types, their details, and their order make sense at this point in the conversation?

2. Are there any actions that should be listed here but that are missing (either as additions or replacements)?

Again, raters are given options to choose from for negative answers.

CONTEXT:
Cust: I would like to see a movie tonight.
Agent: Sure. What movie would you like to see?
Cust: I'm not really sure. Can you help me pick something?
Agent: No problem. I can give you the names of a couple of movies playing in your area. What city are you going to see the movie in?

ACTION:
FIND MOVIES location: Oak Valley Arkansas

Table 6: Context paired with predicted API call

This offline evaluation strategy offers scalability and minimal rater training. However, an online, interactive evaluation infrastructure would allow us to evaluate the ability of the model to handle errors in its own output (from previous predictions) and its robustness while dealing with novel inputs. Future evaluation will be carried out on this new infrastructure.

5.3 Results

Comparing the "makes-sense" scores for model-generated vs. human-generated responses, a clear pattern of improvement emerges based on dataset size. As shown in Table 7, when 5K and 7.5K dialogs are used for the training set, scores for model-generated responses lag behind the human-generated scores by up to 5.5%. At 10K dialogs, the response scores differ by less than 2% and model-generated API predictions outperform human labels by 2.5%. At 21K dialogs, model-generated responses improve to near human-level performance. The 21K model's API call prediction fares better than human API labeling. As an automatic metric, we also provide the BLEU score generated for each model.

SIZE   MODEL/HUMAN   PLAIN RESP.       RESP. TO API      API CALL
5K     model         86.9% (-5.5%)     92.3% (-3.9%)     95.2% (-2.2%)
       human         92.4%             96.2%             97.4%
       BLEU: 56
7.5K   model         87.8% (-3.0%)     93.8% (-2.4%)     95.2% (-2.3%)
       human         90.8%             96.2%             97.7%
       BLEU: 59
10K    model         86.5% (-1.9%)     91.8% (-1.4%)     97.1% (+2.5%)
       human         88.4%             93.2%             94.6%
       BLEU: 61
21K    model         89.8% (-1.4%)     95.3% (-0.3%)     93.9% (+0.3%)
       human         91.2%             95.6%             93.6%
       BLEU: 60
10K (no pre-training)
       model         55.8% (-32.6%)    63.1% (-30.1%)    72.8% (-21.8%)
       BLEU: 51

Table 7: Effects of training set size and pre-training on model accuracy

The effect of pre-training is also very clear. After training a fifth model, this time without the T5 framework's pre-training feature, we see a huge drop in evaluation scores. As shown at the bottom of Table 7, we see a decrease of 30% in model performance for verbal responses and about a 25% drop in API call prediction accuracy.

Finally, the quality of the model's prediction stays on par with human scores throughout the conversation as the context grows. Figure 5 shows how the model's "makes sense" score stays on the same path after each exchange.

Figure 5: Model accuracy per dialog exchange
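For reference, a corpus-level BLEU score like those reported in Table 7 can be computed as sketched below. The sacrebleu package and the response files named here are illustrative assumptions, not the tooling used to produce the reported numbers.

```python
# Corpus-level BLEU between model-generated and human-written responses.
# Each (hypothetical) file holds one response per evaluation context.
import sacrebleu

model_responses = [line.rstrip("\n") for line in open("model_responses.txt", encoding="utf-8")]
human_responses = [line.rstrip("\n") for line in open("human_responses.txt", encoding="utf-8")]

bleu = sacrebleu.corpus_bleu(model_responses, [human_responses])
print(f"BLEU: {bleu.score:.1f}")
```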
6 Conclusion

We have described an end-to-end dialog system approach that shows promising potential for transaction-based dialog applications. In offline human evaluations, our single-domain models trained on just 10,000 dialogs generate responses and predict API calls with near-human level accuracy. One key aspect of this strategy is combining natural language output and structured API calls into a unified text-to-text format in order to leverage general purpose text-to-text transformers, such as the T5 framework. In this way, predicting which API call to make and when is essentially the same as generating the appropriate utterance at a given point in the conversation. The pre-training component significantly boosts performance on our downstream task of fine-tuning models on our datasets. These carefully curated and sufficiently large datasets are also core to this strategy, and creating them is straightforward using the self-dialog technique and simple, API-focused annotation. The TicketTalk dataset released with this paper is one such example. When compared with more traditional, modular system architectures, our end-to-end approach should significantly reduce the design and engineering time and resources needed to build task-based dialog systems. Future work will include interactive evaluation of current models as well as an application of this approach to multiple-domain systems.

Acknowledgments

We would like to thank our colleagues Daniel De Freitas Adiwardana, Noam Shazeer, Filip Radlinski, and Pedro Moreno for their discussion and insights through several iterations of this paper. We thank Hadar Shemtov for his guidance and support of the overall project.

References

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 – how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Andy Cedilnik, and Kyu-Young Kim. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358.

Bill Byrne, Karthik Krishnamoorthi, Saravanan Ganesh, Amit Dubey, Andy Cedilnik, and Kyu-Young Kim. 2020. Taskmaster-2. Dataset two in a series of three. https://github.com/google-research-datasets/Taskmaster/tree/master/TM-2-2020.

Matthew Henderson, Blaise Thomson, and Steve Young. 2013. Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 467–471.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.

Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, and Raghav Gupta. 2019. The eighth dialog system technology challenge. arXiv preprint.

Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Emmanuel Kahembwe, Jianpeng Cheng, and Bonnie Webber. 2017. Edina: Building an open domain socialbot with self-dialogues. arXiv preprint arXiv:1709.09816.

S. Lee, H. Schulz, A. Atkinson, J. Gao, K. Suleman, L. El Asri, M. Adada, M. Huang, S. Sharma, W. Tay, et al. 2019. Multi-domain task-completion dialog challenge. Dialog System Technology Challenges, 8.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.

Xiujun Li, Sarah Panda, JJ (Jingjing) Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. In SLT 2018.

Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, and Pascale Fung. 2020. MinTL: Minimalist transfer learning for task-oriented dialogue systems. arXiv preprint arXiv:2009.12005.

Andrea Madotto. 2020. Language models as few-shot learner for task-oriented dialogue systems. arXiv preprint arXiv:2008.06239.

Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. arXiv preprint arXiv:1809.08205.

Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020. SOLOIST: Few-shot task-oriented dialog with a single pre-trained auto-regressive model.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015a. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015b. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808, 7(8):434–441.

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1):1–49.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27:3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998–6008.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. Association for Computational Linguistics.

Chien-Sheng Wu, Steven Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.