Few Shot Dialogue State Tracking using Meta-learning

Saket Dingliwal (1,2), Bill Gao (1), Sanchit Agarwal (1), Chien-Wei Lin (1), Tagyoung Chung (1), Dilek Hakkani-Tür (1)
(1) Amazon Alexa AI  (2) Carnegie Mellon University
{skdin,shuyag,agsanchi,chienwel,tagyoung,hakkanit}@amazon.com

arXiv:2101.06779v2 [cs.CL] 23 Jan 2021

Abstract

Dialogue State Tracking (DST) forms a core component of automated chatbot-based systems designed for specific goals like hotel or taxi reservation, tourist information, etc. With the increasing need to deploy such systems in new domains, solving the problem of zero/few-shot DST has become necessary. There has been a rising trend for learning to transfer knowledge from resource-rich domains to unknown domains with minimal need for additional data. In this work, we explore the merits of meta-learning algorithms for this transfer and hence propose a meta-learner, D-REPTILE, specific to the DST problem. With extensive experimentation, we provide clear evidence of benefits over conventional approaches across different domains, methods, base models and datasets, with significant (5-25%) improvement over the baseline in the low-data setting. Our proposed meta-learner is agnostic of the underlying model, and hence any existing state-of-the-art DST system can improve its performance on unknown domains using our training strategy.

1 Introduction

Task-Oriented Dialogue (TOD) systems are automated conversational agents built for a specific goal (for example, hotel reservation). Many businesses from a wide variety of domains (like hotel, restaurant, car-rental, payments, etc.) have adopted these systems to cut down their cost on customer support services. Almost all such systems have a Dialogue State Tracking (DST) module which keeps track of values for some predefined domain-specific slots (for example hotel-name, restaurant-rating, etc.) after every turn of utterances from user and system. These values are then used by a Natural Language Generator (NLG) to generate system responses and fulfill the user goals.

Many of the recent works (Wu et al., 2019; Zhang et al., 2019; Goel et al., 2019; Heck et al., 2020) have proposed various neural models that achieve good performance for the task but are data-hungry in general. Therefore, adapting to a new unknown domain (target domain) requires large amounts of domain-specific annotations, limiting their use. However, given the wide range of practical applications, there has been a recent interest in data-efficient approaches. Lee et al. (2019) and Gao et al. (2020) used transformer-based (Vaswani et al., 2017) models which significantly reduce data dependence. Further, Gao et al. (2020) model the problem as a machine reading comprehension task and benefit from its readily available external datasets and methods. Wu et al. (2019) were the first to propose transferring knowledge from one domain to another. Since many domains like restaurant and hotel share a lot of common slots like name, area and rating, such a transfer proved to be effective for a low-resource domain. More recently, Campagna et al. (2020) aimed at zero-shot DST using synthetic data generation for the target domain, imitating data from other domains.
Recent meta-learning methods like MAML (Finn et al., 2017) and REPTILE (Nichol et al., 2018) have proven to be very successful in efficient and fast adaptation to new tasks with very few labelled samples. These methods specifically aim at the setting where there are many similar tasks but a very small amount of data for each task. Agnostic of the underlying model, these meta-learning algorithms produce an initialization for the model parameters which, when fine-tuned on the low-resource target task, achieves good performance. Following their widespread success in few-shot image classification, there has been a lot of recent work on their merit in natural language processing tasks. Huang et al. (2018); Gu et al. (2018); Sennrich and Zhang (2019); Bansal et al. (2019); Dou et al. (2019); Yan et al. (2020) attempt to use meta-learning for efficient transfer of knowledge from high-resource tasks to a low-resource task. Further, some of the more recent works (Dai et al., 2020; Qian and Yu, 2019) have shown that meta-learners can be used for system response generation in TOD systems, which is generally a downstream task of our DST task.

To the best of our knowledge, ours is the first work exploring meta-learning algorithms for the DST problem. While prior work focused on training models with a mixture of data from other available domains (train domains) followed by fine-tuning with data from the target domain, we identify that this method of transferring knowledge between domains is inefficient, particularly in the very low-data setting with just 0, 1, 2 or 4 annotated examples from the target domain. We, on the other hand, use the train domains to meta-learn the parameters used to initialize the fine-tuning process. We hypothesize that though different domains share many common slots, they can have different complexities. For some of the domains, it might be easier to train the model using very few examples, while others may require a large number of gradient steps (based on their different data complexity and training curves with 1%, 5%, 10% data in Gao et al. (2020)). Meta-learning takes this gradient information into account and shares it across domains. Rather than looking for an initialization that tries to simultaneously minimize the joint loss over all the domains, it looks for a point from which the optimum parameters of the individual domains are reachable in few (< 5) gradient steps (and hence with very few training examples for these steps). The hope, then, is that the target domain is similar to at least one of the train domains (for example hotel & restaurant, or taxi & train) and hence the learned initialization will achieve efficient fine-tuning with very few examples for the target domain as well. This direction of limited data is motivated by practical applicability, where it might be possible for any developer to manually annotate 4-8 examples before deploying the chatbot for a new domain.

We highlight the main contributions of our work below: (i) We are the first to explore and reason about the benefits of meta-learning algorithms for the DST problem. (ii) We propose a meta-learner, D-REPTILE, that is agnostic to the underlying model and hence has the ability to improve state-of-the-art performance in zero/few-shot DST for new domains. (iii) With extensive experimentation, we provide evidence of the benefit of our approach over conventional methods. We achieve a significant 5-25% improvement over the baseline in few-shot DST that is consistent across different target domains, methods, base models and datasets.

2 Background

2.1 Dialogue State Tracking

DST refers to keeping track of the state of the dialogue at every turn. The state of a dialogue can be defined as ⟨slot name, slot value⟩ pairs that represent, for a given domain-specific slot, the value that the user provides or the system-provided value that the user accepts. Further, many domains have a pre-defined ontology that specifies the set of values each slot can take. Note that the number of values in the ontology varies a lot across slots. Some slots like hotel-stars might have just five different values (called categorical slots), while those like hotel-name have hundreds of possible values (called extractive slots). It is possible that a slot has never been discussed in the dialogue sequence, in which case the model has to predict a None value for that particular slot.

Various models have been proposed for the above task, but particularly relevant to this work is the transformer-based model STARC by Gao et al. (2020). For each slot, they form a question (like "what is the name of the hotel?" for the hotel-name slot) and then, at each turn, append the tokens from the dialogue utterances and the question separated by the [SEP] token. They then pass this sequence of tokens through a transformer to form token embeddings. For the extractive slots, they use the token embeddings to mark the span (start and end position) of the answer value in the dialogue itself (called the extractive model). For the categorical slots with a small number of possible values, the categorical model appends embeddings of each possible value to the token embeddings and then uses a classifier with a softmax layer to predict the correct option.
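To make this reading-comprehension framing concrete, the Python sketch below shows one way such per-slot inputs could be assembled with a Hugging Face tokenizer. It is only an illustrative sketch: the slot-to-question templates and helper names are our own assumptions, not the released STARC code, and the tokenizer is left to insert the model-specific separator tokens.

    # Illustrative sketch (not the authors' code): casting a DST slot as a
    # reading-comprehension input in the spirit of STARC (Gao et al., 2020).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-large")

    # Hypothetical slot-to-question templates; the real templates are model-specific.
    SLOT_QUESTIONS = {
        "hotel-name": "What is the name of the hotel?",
        "hotel-stars": "How many stars does the hotel have?",
    }

    def build_slot_input(slot, dialogue_history):
        # Encode the slot question and the dialogue history as a text pair;
        # the tokenizer inserts the separator tokens between the two segments.
        question = SLOT_QUESTIONS[slot]
        return tokenizer(question, dialogue_history, truncation=True,
                         max_length=512, return_tensors="pt")

An extractive slot would then be scored with a span-prediction head over the resulting token embeddings, while a categorical slot would be scored against its candidate values with a softmax classifier, as described above.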
2.2 Meta-Learning

With advances in the model-agnostic meta-learning frameworks of Finn et al. (2017) and Nichol et al. (2018), few-shot problems have been revolutionized. These frameworks define an underlying task distribution from which both train tasks (τ) and target tasks (τ′) are sampled. For each task τ, we are given very few labelled data points D_τ^train and a loss function L_τ. Now, given new data points D_{τ′}^test from a target task τ′, the goal is to learn parameters θ_M of any model M such that L_{τ′}(D_{τ′}^test; θ_M) is minimized. This is achieved by k steps of gradient descent using D_{τ′}^train with learning rate α. More formally, θ_M = SGD(D_{τ′}^train, L_{τ′}, θ_M^INIT; k, α), where SGD(D, L, θ^INIT; k, α) returns θ^(k) such that

    θ^(t) = θ^(t−1) − α ∇_θ L(D; θ^(t−1)),    θ^(0) = θ^INIT    (1)

Therefore, the goal now is to find a good initialization θ_M^INIT for the gradient descent using the data from the train tasks τ. This is achieved by minimizing the empirical loss:

    θ_τ^(k) = SGD(D_τ^train, L_τ, θ; k, α)    (2)

    θ_M^INIT = argmin_θ Σ_τ L_τ(D_τ^train; θ_τ^(k))    (3)

Note that the above optimization is complex and involves second-order derivatives. For computational benefits, Nichol et al. (2018) proposed REPTILE and showed that these terms can be ignored without affecting the performance of the meta-learning algorithm. We refer the reader to their work for more details.
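The inner-loop function SGD(D, L, θ^INIT; k, α) of equation 1 can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming assumptions; the experiments in Section 4 actually use Adam for both loops, while plain SGD is shown here simply to mirror the equation.

    # Minimal PyTorch sketch of the inner loop SGD(D, L, theta_init; k, alpha)
    # from equation (1); names and data layout are ours, not the paper's code.
    import copy
    import torch

    def sgd_adapt(model, batches, loss_fn, k, alpha):
        # Copy the current parameters, take k plain gradient steps on the given
        # batches, and return the adapted copy; the original model is untouched.
        adapted = copy.deepcopy(model)
        optimizer = torch.optim.SGD(adapted.parameters(), lr=alpha)
        for step in range(k):
            inputs, targets = batches[step % len(batches)]
            loss = loss_fn(adapted(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return adapted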
3 Methodology

In this work, we propose D-REPTILE, a meta-learning algorithm specific to the DST task. Following what Qian and Yu (2019) did for the dialogue generation problem, we treat different domains as tasks for the meta-learning algorithm. Let D = {d_1, d_2, ..., d_n} (e.g. {restaurant, taxi, payment, ...}) be the set of train domains for which we have annotated data available. Let p_D(.) define a probability distribution over these domains. Let D_{d_1}, D_{d_2}, ..., D_{d_n} be the training data from each of these domains. Let M be any DST model with parameters θ_M. Let m be the task-batch size (the number of domains in a batch in our case), let α and β be the inner and outer learning rates respectively, and let k be the number of gradient steps. Let SGD(.) be the function defined in equation 1. Borrowing the meta-learning theory regarding optimizing the objective in equation 3 from Nichol et al. (2018), we define the algorithm D-REPTILE in Algorithm 1. The update rule for the initialization (as defined in step 8) is the same as that of REPTILE. We chose REPTILE over other meta-learning algorithms because of its simplicity and computational advantages. Nonetheless, it is straightforward to switch to any other initialization-based meta-learner by changing the meta-update step. The novelty of our learner lies in its definition of the meta-learning tasks, which represent the different domains of the DST problem. The algorithm aims to find θ_M^INIT, which we use to initialize the model for the fine-tuning stage with the target domain.

Algorithm 1: D-REPTILE: Meta-Learner for DST
  Input: D_{d_1}, D_{d_2}, ..., D_{d_n}
  Parameters: M, L, p_D(.), α, β, k, m
  Result: θ_M^INIT
   1  Initialize θ_M randomly
   2  for iteration i = 1, 2, ... do
   3      sample m domains D_i using p_D(.)
   4      for domain d_j ∈ D_i do
   5          sample data points D_ij from D_{d_j}
   6          θ_M^{d_j} ← SGD(D_ij, L, θ_M; k, α)
   7      end
   8      θ_M ← θ_M + β (1/m) Σ_{j=1}^{m} (θ_M^{d_j} − θ_M)
   9  end
  10  return θ_M
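For readers who prefer code, a compact PyTorch sketch of Algorithm 1 follows, reusing the sgd_adapt helper from the previous sketch. The data layout (a dict from domain to batches) and the function names are assumptions of ours; the released implementation referenced in Section 4.3 remains the authoritative version.

    # Sketch of Algorithm 1 (D-REPTILE); data loading and sampling helpers are
    # placeholders, not the released code.
    import random
    import torch

    def d_reptile(model, domain_data, loss_fn, p_d, alpha, beta, k, m, iterations):
        # domain_data: dict mapping each train domain to a list of (inputs, targets)
        # batches; p_d: dict mapping each domain to its sampling probability.
        domains = list(domain_data.keys())
        weights = [p_d[d] for d in domains]
        for _ in range(iterations):
            sampled = random.choices(domains, weights=weights, k=m)      # step 3
            adapted = []
            for d in sampled:                                            # steps 4-7
                batches = random.sample(domain_data[d], min(k, len(domain_data[d])))
                adapted.append(sgd_adapt(model, batches, loss_fn, k, alpha))
            with torch.no_grad():                                        # step 8
                for name, theta in model.named_parameters():
                    mean_adapted = sum(dict(a.named_parameters())[name]
                                       for a in adapted) / m
                    theta += beta * (mean_adapted - theta)
        return model                                                     # theta_M^INIT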
We argue that the meta-learned initialization is better suited for fine-tuning than conventional methods. In the hope that the joint optimal parameters for the train domains lie close to those of the individual domains, Wu et al. (2019) initialize the fine-tuning stage of the target domain from the joint minimum of the loss over data from all the train domains (called Naive pre-training before Fine-Tuning, or NFT, here). More formally, they chose the following initialization:

    θ_M^INIT = argmin_θ Σ_{j=1}^{n} L(D_{d_j}; θ)    (4)

Such an initialization tries to simultaneously minimize the loss for all the domains, which might be useful if the goal were to perform well on test data coming from a mixture of these domains. However, here our goal is to perform well on a single unknown target domain, and no direct relation between this initialization and the optimal parameters for the target domain can be seen. Further, as the number of train domains increases or the training data for each domain decreases, the joint optimum can be very far off from the individual domain-optimum parameters. Therefore, these methods perform particularly badly. We show empirical evidence for this hypothesis in Section 4. On the other hand, if we optimize equation 3, we will reach a point in the parameter space from which all the domain-optimum parameters are reachable in k gradient descent steps. Therefore, we can hope to reach the optimum parameters for the target domain efficiently as well. This hope is much larger for the DST problem specifically because of the similarities between different related domains (specifically related slots, as shown in Section 5).

Let us consider the following example: let restaurant and taxi be two of the train domains. Optimizing equation 3, we might reach a point which is closer to the optimum parameters of the restaurant domain than the taxi domain if we have smaller gradient values for restaurant data but large ones for taxi. Notably, however, both optimum parameters are reachable in k gradient steps. Now if the target domain is hotel (similar to the restaurant domain, with common slots like rating, name, etc.), we will already be close to its optimum parameters. Also, if the target domain is bus (similar to the taxi domain, with common slots like time, place, etc.), we will have larger gradients in the fine-tuning stage and thus will reach the optimum parameters for bus as well. This might not have been possible with equation 4, as the optimum parameters for the joint of restaurant and taxi data might be very far from both the individual train domains and will also have no specific gradient properties for faster adaptation to either the hotel or bus target domain.

4 Experiments

4.1 Datasets

We used two different DST datasets for our experiments: (i) MultiWoz 2.0 (Budzianowski et al., 2018) and 2.1 (Eric et al., 2019), and (ii) DSTC8 (Rastogi et al., 2019). The former is a manually annotated, complex dataset with mostly 5 different domains and 8438 dialogues, while the latter is a relatively simple, synthetically generated dataset with 26 domains and 16142 dialogues. Both datasets contain dialogues spanning multiple domains. Following the setting from Wu et al. (2019), for extracting the data of a particular domain from a dataset, we consider all the dialogues in which that domain is present and ignore slots from other domains in both the train and test sets. Further, as shown by Gao et al. (2020), we use external datasets from the Machine Reading for Question Answering (MRQA) 2019 shared task (Fisch et al., 2019), DREAM (Sun et al., 2019) and RACE (Lai et al., 2017) to pre-train our transformer in our experiments, and we label it with the suffix '-RC' to distinguish it from the '-base' model.

4.2 Evaluation Metric

Based on the objective in DST, there is a well-established metric, Joint Goal Accuracy (JGA). JGA is the fraction of total turns across all dialogues for which the predicted and ground-truth dialogue states match for all slots. Following Wu et al. (2019), for testing on a single target domain in a multi-domain dialogue, we only consider slots from that domain in the metric computation. Note that in some of our experiments (where explicitly mentioned), we further restrict the slots to only extractive or only categorical slots. Also, as happens most of the time, whenever a slot is not mentioned in any turn, the ground-truth value for that slot is None. For analysis, we further use the metric Active Slot Accuracy, which is the fraction of predicted values of a particular slot that were correct whenever the ground-truth value was not None.
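A minimal sketch of how these two metrics could be computed is given below, assuming per-turn dialogue states are stored as slot-to-value dicts; the actual evaluation scripts may organize the data differently.

    def joint_goal_accuracy(predictions, ground_truths):
        # Both arguments are lists of per-turn dialogue states, each a dict
        # mapping slot -> value (None for slots never mentioned). A turn counts
        # as correct only if every slot of the target domain matches exactly.
        correct = sum(pred == gold for pred, gold in zip(predictions, ground_truths))
        return correct / len(ground_truths)

    def active_slot_accuracy(predictions, ground_truths, slot):
        # Accuracy for a single slot, restricted to turns where the ground-truth
        # value is not None.
        active = [(pred[slot], gold[slot])
                  for pred, gold in zip(predictions, ground_truths)
                  if gold[slot] is not None]
        if not active:
            return 0.0
        return sum(p == g for p, g in active) / len(active)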
4.3 Experimental Setting

For all our experiments, both D-REPTILE and the baseline (NFT, Sec. 3) use STARC (Sec. 2) as the base model M. This ensures that all the gains achieved in our experiments are due only to meta-learning. In our implementation (available at https://github.com/saketdingliwal/Few-Shot-DST), we use pre-trained word embeddings from Roberta-Large (Liu et al., 2019), the Adam optimizer (Kingma and Ba, 2014) for gradient updates in both the inner and outer loop, α = 5e−5, β = 1, m = 4, k = 5 and p_D(i) ∝ |D_{d_i}| (chosen using dev-set experiments as explained in Section 5). As shown recently (Mosbach et al., 2020), the fine-tuning of transformer-based models is unstable; therefore, we run fine-tuning multiple times and report the mean and standard deviation of the performance. Also, the performance varies with the choice of training data from the target domain used for fine-tuning. For our experiments, however, we chose these dialogues based on the number of active slots (not None) and used the same dialogue for both D-REPTILE and the baseline. Since we use very little data (0, 1, or 2 examples) from the target domain, we obviously would like to have dialogues that at least have all the slots being discussed in the utterances. In practical scenarios, where a developer might be creating 1 or 2 examples for a new domain, it is always possible to include all the slots in the dialogue utterances.
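For illustration, the reported meta-learning hyper-parameters and the p_D(.) sampling rule could be expressed as follows; the dictionary layout is our assumption and the values are simply those listed above. The resulting distribution is what the d_reptile sketch in Section 3 would consume.

    # Assumed layout for the hyper-parameters reported in Section 4.3.
    META_HPARAMS = {
        "alpha": 5e-5,  # inner-loop learning rate
        "beta": 1.0,    # outer-loop (meta-update) step size
        "m": 4,         # domains per meta-batch
        "k": 5,         # inner-loop gradient steps
    }

    def domain_sampling_distribution(domain_data):
        # p_D(i) proportional to |D_di|, i.e. to the amount of training data
        # available for each train domain.
        sizes = {d: len(batches) for d, batches in domain_data.items()}
        total = sum(sizes.values())
        return {d: n / total for d, n in sizes.items()}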
Table 1: Zero-shot performance on the MultiWoz 2.0 dataset. Domains like restaurant and train witness a significant boost in performance over the baselines.

    Joint Goal Accuracy        Hotel  Restaurant  Taxi   Attraction  Train
    TRADE (Wu et al., 2019)    13.7   11.52       60.58  19.87       22.37
    STARC (Gao et al., 2020)   28.6   28.2        65     36.9        26.1
    STARC + D-REPTILE          32.4   47.8        67.2   45.9        46.1

4.4 Results

In our experiments, we are able to achieve significant improvement over the baseline method in the low-data setting (< 32 dialogues). Note that the choice of the low-data setting is guided by the practical applications of the method. It also validates our hypothesis that the initialization chosen by meta-learning is closer to the optimal parameters of the target domain in terms of gradient steps and therefore performs better when there is very little data. However, as the fine-tuning data is increased to thousands of dialogues, any random initialization is also able to reach the optimal parameters for the target domain. We observe the benefits of D-REPTILE with limited data consistently across different domains, datasets and models, as explained one by one below.

Across domains - We used all the different domains of the MultiWoz 2.0 data as target domains in the 5 plots of Figure 1. We pre-train D-REPTILE (solid) and NFT (dashed) versions of different models (represented by different colors). For the models represented by the red and blue colors, we used all domains other than the target domain as our train domains. For example, for the first plot, hotel is our target domain, while restaurant, train, attraction and taxi are our train domains. The red corresponds to starting with Roberta-Base embeddings, while the blue represents Roberta-RC, which is Roberta-Base pre-trained with reading comprehension datasets (Gao et al., 2020). The green dotted line represents a model without any pre-training. It is clearly very bad and unstable, which shows the importance of using other domains for few-shot experiments. We fine-tune all our models using different amounts of training data from the target domain (x-axis). For each of our models, the solid lines (D-REPTILE) lie strictly above the dashed lines (NFT) in the JGA metric. The gains obtained are as high as 47.8% (D-REPTILE) vs 22.3% (NFT) for the restaurant domain with 1 dialogue, which is more than a 100% improvement at no annotation cost at all.

Across models - The results are consistent not only across different transformer base models, as shown in Figure 1, but also across different DST methods. As done in Gao et al. (2020), we train separate categorical and extractive models for the hotel domain (using categorical and extractive data respectively from the train domains), which we have combined to plot Figure 1. If we consider these two fairly different models separately, we achieve similar trends in each individually, as plotted in Figure 2. Note that the JGA metric is computed here with restricted slots based on the type of the model. The gains are larger for the extractive model, possibly because marking a span in the original dialogue can be considered a slightly harder task than choosing among a limited number of choices.

Across datasets - To show that the merits of D-REPTILE are not limited to MultiWoz data, we tested with domains from the DSTC8 dataset as both train and target domains. In Figure 1, the orange lines represent models pre-trained using all the domains in DSTC8 as train domains while the target domain is from MultiWoz. As expected, the performance of these models falls below the red and blue lines (models pre-trained with MultiWoz train domains) but above the green (no pre-training), as the training and testing datasets are different. However, the solid orange line (D-REPTILE) lies above the dashed line (NFT). In another set of experiments, we used a target domain from DSTC8 and compiled the results in Figure 3. Except for Hotels 1, Hotels 2 and Hotels 3, all other domains from DSTC8 are used as train domains, while Hotels 2 is kept as the target domain. We see that the benefits of meta-learning are much larger for the DSTC8 dataset than for MultiWoz. For example, with 8 dialogues for fine-tuning, D-REPTILE achieves a JGA of 43.9% while NFT is only able to get 14.1%. This can be attributed to the increase in the number of different training tasks (23 domains were used as train domains for DSTC8, as opposed to 4 for the MultiWoz experiments).

Surprisingly, the meta-learned initializations not only adapt faster but are also better to start with: we see an improvement in zero-shot performance as well. In addition to the comparison with the NFT baseline, we also show improvement over existing models on the MultiWoz 2.0 dataset in Table 1. Also note that D-REPTILE is model-agnostic and therefore has the capability to improve the JGA of any underlying model on a new unknown domain.
Figure 1: Performance of D-REPTILE vs NFT for different MultiWoz domains with three different models.

Figure 2: Performance of D-REPTILE vs NFT for different DST models for different slots in the hotel domain.

Figure 3: Performance of D-REPTILE vs NFT for the Hotels 2 domain in DSTC8 data as the target domain.

5 Ablation Studies

To validate our various theoretical hypotheses, search for hyper-parameters, and clearly identify and reason about the situations where using meta-learning helps DST, we perform additional analysis in the subsections below.

5.1 Slot-wise Analysis

To pin-point exactly the advantage of D-REPTILE, we do a slot-wise analysis of our models in Figures 4 and 5. Note that slots are denoted as domain_name.slot_name. For example, hotel.day represents the performance of the models in predicting the values for the day slot where the target domain was hotel. The overall performance, or JGA, in plot 1 of Figure 1 is a combination of all the hotel slots like day, people, area, etc. Figure 4 shows the slots which are common among different domains, while Figure 5 compares the performance for slots that are unique to a target domain. We can see that for the common slots, the solid lines (D-REPTILE) mostly lie higher than the dashed (NFT) counterparts. However, nothing can be said in particular about the slots in Figure 5. This behaviour is expected, as unique slots particular to a target domain have little to gain from the different slots present in the train domains (which were used for pre-training). This is evident from the fact that slots like hotel.internet and hotel.parking have zero-shot active accuracy close to zero for all kinds of pre-training strategies (Figure 5). However, wherever slots between different domains are similar, the pre-training has a much larger influence. In that case, the merit of learning a generalizable initialization with D-REPTILE rather than NFT is much more clearly evident (Figure 4).
Figure 4: Active Slot Accuracy for slots common between different domains.

Figure 5: Active Slot Accuracy for slots unique to a specific target domain.
5.2 Hyper-parameter Search

We briefly discuss the choice of various hyper-parameters here. We use the dev set from the restaurant domain to search for the optimum values of the parameters introduced by meta-learning, while the rest are kept the same as in the STARC model (Gao et al., 2020). In Figure 7, we plot the variation in performance with k and p_D(.). As for any meta-learning algorithm, setting k too small or too large hurts the performance in our case as well (especially k = 1, where it becomes theoretically similar to NFT (Nichol et al., 2018)). Hence, the optimum value k = 5 is used for all our experiments. Also, similar to the conclusion in Dou et al. (2019), we find that choosing p_D(.) of any domain proportional to the size of the training dataset of that domain is helpful (blue vs red line). This is attributed to the fact that in case of an imbalance in data among the different train domains, the algorithm gets to see all the data from the resource-rich domain, as it is chosen more often, and hence generalizes better.

Figure 7: JGA for the restaurant domain dev set with different hyper-parameters for the best D-REPTILE model.

5.3 Adding more train domains

As mentioned in the previous section, we observe that the benefits of D-REPTILE are much more profound when the target domain is from the DSTC8 dataset than when it is from MultiWoz (Figure 3). Given that DSTC8 has 23 train domains as compared to 4 in MultiWoz, it is not difficult to see the reason for this boost in performance. In this subsection, we try to answer the question of whether a MultiWoz target domain can also gain from the additional domains of DSTC8. Here, for ease of computation, we only experiment with the categorical model with hotel as the target domain. We use both DSTC8 and MultiWoz domains (of course excluding hotel domain data during pre-training) and test on hotel data from MultiWoz. These are represented by the additional pink and black lines in Figure 6. We observe that although D-REPTILE helps to improve performance over the NFT baseline, adding additional domains does not help the model much overall (the solid black line is similar to the solid blue line). This shows that in addition to the number of different training tasks, the relatedness of those tasks is also very crucial for meta-learning. The DSTC8 domains, which are out-of-sample for the MultiWoz target domain, did not prove to be effective. (The small difference between the JGA values for 1-dialogue fine-tuning in Figure 6 and the categorical model in Figure 2 is due to the difference in the choice of the single dialogue from the hotel domain used for fine-tuning.)

Figure 6: JGA for the categorical model for the hotel domain with different datasets.

6 Conclusion

We conclude our analysis of the merits of meta-learning as compared to naive pre-training for the DST problem on a very positive note. Given the practical applicability of very-low-data analysis, we provide enough evidence to a developer of an automated conversational system for an unknown domain that, irrespective of his/her model and target domain, D-REPTILE can achieve significant improvement (sometimes almost double) over conventional fine-tuning methods with no additional cost. With detailed ablations, we further provide insights on which slots and domains will particularly benefit from pre-training strategies and which of those will require additional data. Being agnostic to the underlying model, our proposed algorithm has the capability to push the state of the art in the zero/few-shot DST problem, giving hope for expanding the scope of similar chatbot-based systems to new businesses.
References

Trapit Bansal, Rishikesh Jha, and Andrew McCallum. 2019. Learning to few-shot learn across diverse natural language classification tasks. arXiv preprint arXiv:1911.03863.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica S. Lam. 2020. Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. arXiv preprint arXiv:2005.00891.

Yinpei Dai, Hangyu Li, Chengguang Tang, Yongbin Li, Jian Sun, and Xiaodan Zhu. 2020. Learning low-resource end-to-end goal-oriented dialog for fast and reliable system deployment. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 609-618.

Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms for low-resource natural language understanding tasks. arXiv preprint arXiv:1908.10423.

Mihail Eric, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753.

Shuyang Gao, Sanchit Agarwal, Tagyoung Chung, Di Jin, and Dilek Hakkani-Tur. 2020. From machine reading comprehension to dialogue state tracking: Bridging the gap. arXiv preprint arXiv:2004.05827.

Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür. 2019. HyST: A hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883.

Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor O.K. Li. 2018. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437.

Michael Heck, Carel van Niekerk, Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Marco Moresi, and Milica Gašić. 2020. TripPy: A triple copy strategy for value independent neural dialog state tracking. arXiv preprint arXiv:2005.02877.

Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. arXiv preprint arXiv:1803.02400.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim. 2019. SUMBT: Slot-utterance matching for universal and scalable belief tracking. arXiv preprint arXiv:1907.07421.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884.

Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.

Kun Qian and Zhou Yu. 2019. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.

Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. arXiv preprint arXiv:1905.11901.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217-231.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743.

Ming Yan, Hao Zhang, Di Jin, and Joey Tianyi Zhou. 2020. Multi-source meta transfer for low resource multiple-choice question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7331-7341.

Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, and Caiming Xiong. 2019. Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.