Show, Don't Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue

Raghav Gupta∗, Harrison Lee∗, Jeffrey Zhao, Abhinav Rastogi, Yuan Cao, Yonghui Wu
Google Research
{raghavgupta, harrisonlee}@google.com
∗ Equal contribution

Abstract

Building universal dialogue systems that can seamlessly operate across multiple domains/APIs and generalize to new ones with minimal supervision and maintenance is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions can only indirectly convey schema semantics. In this work, we propose Show, Don't Tell, a prompt format for seq2seq modeling which uses a short labeled example dialogue to show the semantics of schema elements rather than tell the model via descriptions. While requiring similar effort from service developers, we show that using short examples as schema representations with large language models results in stronger performance and better generalization on two popular dialogue state tracking benchmarks: the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.

1 Introduction

Task-oriented dialogue (TOD) systems need to support an ever-increasing variety of services. Since many service developers lack the resources to collect labeled data and/or the requisite ML expertise, zero- and few-shot transfer to unseen services is critical to the democratization of dialogue agents.

New approaches to TOD that can generalize to new services primarily rely on combining two techniques: large language models like BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), and schema-guided modeling, i.e. using natural language descriptions of schema elements (intents and slots) as model inputs to enable inference on unseen services (Rastogi et al., 2020a,b). Models combining the two currently show state-of-the-art results on dialogue state tracking (DST) (Heck et al., 2020; Lee et al., 2021a; Zhao et al., 2022).

However, description-based schema representations have drawbacks. Writing precise natural language descriptions requires manual effort and can be tricky, and descriptions only provide indirect supervision about how to interact with a service, compared to an example. Furthermore, Lee et al. (2021b) showed that state-of-the-art schema-guided DST models may not be robust to variation in schema descriptions, causing significant accuracy drops.

We propose using a single dialogue example with state annotations as an alternative to the description-based schema representation, similar to one-shot priming (Brown et al., 2020). Rather than tell the model about schema element semantics in natural language, we show the schema through a demonstration, as in Figure 1. Applying our approach, Show, Don't Tell (SDT), to two SotA DST models consistently results in superior accuracy and generalization to new APIs across both the Schema-Guided Dialogue (SGD) (Rastogi et al., 2020b) and MultiWOZ leave-one-out (Budzianowski et al., 2018; Lin et al., 2021b) benchmarks, while being more data-efficient and robust to schema variations.

2 Show, Don't Tell

Following SotA models, we pose DST as a seq2seq task (Wu et al., 2019; Zhao et al., 2021a), where the seq2seq language model (in our case, T5) is finetuned on a DST dataset. During finetuning and evaluation, the model input consists of a prompt and a context, and the target contains the ground truth belief state. We compare against two baselines:

• T5-ind (Lee et al., 2021a): The model input comprises the dialogue history as context, concatenated with one slot description as the prompt. The target is the value of that slot in the dialogue state. Inference is done per slot, i.e. values for different slots are independently decoded.
• T5-seq (Zhao et al., 2022): The model input comprises the descriptions of all slots as the prompt, followed by the dialogue history as the context. The target is the sequence of slot-value pairs in the dialogue state, i.e. the dialogue state is decoded sequentially in a single pass.

We modify the prompt formats above to utilize demonstrations instead of descriptions, as described below and illustrated in Figure 1.

• SDT-ind: A prompt P_i^ind comprises a single utterance labeled with a slot-value pair, formatted as

P_i^ind = [ex]; u_i^ind; [slot]; sv_i

where u_i^ind is a user utterance in which slot i is active and sv_i is the slot-value pair. [ex] and [slot] are special delimiter tokens, and ; denotes concatenation.

• SDT-seq: A prompt P^seq comprises a single labeled dialogue, formatted as

P^seq = [ex]; u_1; ...; u_n; [slot]; sv_1; ...; sv_m

where u_j is an utterance and the other symbols are shared with SDT-ind. In simple terms, the prompt is constructed by concatenating all utterances in the example dialogue, followed by all slot-value pairs in the final dialogue state.

The context in both formats is a concatenation of the dialogue history for the current training example. The final model input is formed by concatenating the prompt and the context strings. The target string is the same as for T5-*, containing a single slot value for *-ind models and the entire turn's belief state for *-seq models.

For both T5-* (baseline) and SDT-*, we enumerate the categorical slot values in multiple-choice format in the prompt and task the model with decoding the correct multiple-choice letter. More details on the prompt design and its impact on performance are provided in Appendix C.

Formulating prompt examples: It is imperative that SDT prompts contain sufficient information to infer the semantics of all slots in the schema. This is easy for SDT-ind, which uses a separate prompt for each slot. For SDT-seq, however, we only choose example dialogues in which all slots in the schema are used.

Figure 1: Illustration of all prompt formats for a payment service, for both description-based and Show, Don't Tell models with independent (top) and sequential (bottom) decoding of dialogue state.

T5-ind:
  P1 = amount: The amount of money to send or request
  P2 = receiver: Name of the contact or account to make the transaction with
  ...

SDT-ind:
  P1_ind = [ex] [user] I need to transfer 125 dollars [slot] amount=125 dollars
  P2_ind = [ex] [user] Make the transfer to Victoria. [slot] receiver=Victoria
  ...

T5-seq:
  P = 0: The amount of money to send or request 1: Name of the contact or account to make the transaction with 2: Whether the transaction is private or not a) True b) False 3: The source of money used for making the payment a) credit card b) debit card c) app balance

SDT-seq:
  P_seq = [ex] [user] I want to make a payment to Jerry for $82 from my mastercard [system] Confirming you want to pay Jerry $82 with your credit card yes? [user] Yes that's right, make the transaction private too [slot] amount=$82 receiver=Jerry private_visibility=a of a) True b) False payment_method=a of a) credit card b) debit card c) app balance
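To make the prompt construction concrete, below is a minimal Python sketch of how the SDT-ind and SDT-seq prompt strings and the final model input could be assembled. This is our illustration, not the authors' code: the function names and data structures are assumptions, and only the [ex]/[user]/[system]/[slot] delimiter layout follows Figure 1.

```python
# A minimal sketch of SDT prompt assembly. Function names and data
# structures are hypothetical; the delimiter layout follows Figure 1.

def sdt_ind_prompt(utterance: str, slot: str, value: str) -> str:
    """SDT-ind: one labeled utterance per slot."""
    return f"[ex] [user] {utterance} [slot] {slot}={value}"

def sdt_seq_prompt(turns, state) -> str:
    """SDT-seq: a full example dialogue followed by its final dialogue state."""
    dialogue = " ".join(f"[{speaker}] {text}" for speaker, text in turns)
    slots = " ".join(f"{slot}={value}" for slot, value in state)
    return f"[ex] {dialogue} [slot] {slots}"

def model_input(prompt: str, history) -> str:
    """Final seq2seq input: prompt concatenated with the dialogue-history context."""
    context = " ".join(f"[{speaker}] {text}" for speaker, text in history)
    return f"{prompt} {context}"

# Example mirroring the payment service in Figure 1 (state list truncated):
turns = [
    ("user", "I want to make a payment to Jerry for $82 from my mastercard"),
    ("system", "Confirming you want to pay Jerry $82 with your credit card yes?"),
    ("user", "Yes that's right, make the transaction private too"),
]
state = [("amount", "$82"), ("receiver", "Jerry")]
prompt = sdt_seq_prompt(turns, state)
print(model_input(prompt, [("user", "I need to transfer 125 dollars")]))
```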
3 Experimental Setup

Datasets: We conduct experiments on two DST benchmarks: Schema-Guided Dialogue (SGD) (Rastogi et al., 2020b) and MultiWOZ 2.1 (Budzianowski et al., 2018; Eric et al., 2020). For MultiWOZ, we evaluate on the leave-one-out setup (Wu et al., 2019; Lin et al., 2021a), where models are trained on all domains but one and evaluated on the holdout domain. Additionally, we apply the recommended TRADE pre-processing script¹ for fair comparison with other work. For both datasets, we created concise prompt dialogues modeled after dialogues observed in the datasets.

¹ https://github.com/budzianowski/multiwoz#dialog-state-tracking

Implementation: We train SDT models by finetuning pretrained T5 1.1 checkpoints. For both datasets, we select one example prompt per service schema (for SDT-seq) or slot (for SDT-ind), and use the same prompt for all examples for that service/slot across training and evaluation. Unless otherwise noted, all T5-based models (T5/SDT-seq/ind) are finetuned on T5-XXL (11B parameters). See Appendices A and B for more details on training and baselines, respectively.
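All results below are reported as joint goal accuracy (JGA), the standard DST metric: the fraction of dialogue turns whose predicted belief state matches the ground truth on every slot. The following is our own minimal sketch of the computation (the paper does not include one), with hypothetical helper names:

```python
def joint_goal_accuracy(predicted, gold):
    """JGA: fraction of turns where the predicted belief state (a dict of
    slot -> value) exactly matches the reference. One wrong or missing
    slot makes the whole turn count as incorrect."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# A turn with one wrong slot scores 0 for that turn:
pred = [{"to": "Sacramento", "from": "Anaheim"}]
gold = [{"to": "Anaheim", "from": "Sacramento"}]
assert joint_goal_accuracy(pred, gold) == 0.0
```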
4 Results

4.1 Results on SGD

Table 1 contains results on the SGD test set. Since SDT results may depend on the choice of example turn/dialogue provided in the prompt, 5 different versions of the prompts are created for each service using different examples. We report the average JGA across these versions along with 95% confidence intervals. SDT-seq achieves the highest JGA, showing major gains, particularly on unseen services, over its description-based counterpart T5-seq and the next-best model T5-ind. SDT-ind is comparable to its counterpart T5-ind, and better than T5-seq.

Model        All        Seen       Unseen
MRC+WD-DST*  86.5       92.4       84.6
T5-seq       86.4       95.8       83.3
T5-ind       87.7       95.3       85.2
SDT-ind      87.5±0.9   95.2±0.7   85.0±1.4
SDT-seq      88.8±0.5   95.8±0.2   86.4±0.7

Table 1: SGD test set JGA for SDT versus other approaches. *Data augmentation/special rules applied.

Based on these results, conveying service semantics via a single dialogue example appears more effective than using natural language descriptions. We hypothesize that SDT-seq outperforms SDT-ind because the full dialogue prompts used in SDT-seq demonstrate more complex linguistic patterns (e.g. coreference resolution, long-term dependencies) than the single-utterance prompts of SDT-ind. On the other hand, T5-seq does not outperform T5-ind because no additional information is conveyed to the model through stacking descriptions; moreover, all else equal, decoding all slots in one pass is more challenging than decoding each slot independently. We also experimented with using more than one dialogue to prompt SDT-seq, but did not see an increase in performance.
4.2 MultiWOZ Results

Table 2 summarizes results for the MultiWOZ 2.1 leave-one-out setup. Comparing T5-seq and SDT-seq, both finetuned on T5-XXL, SDT achieves state-of-the-art results on the overall task by +2% and in 3 of the 5 domains.

Model       Attraction  Hotel  Restaurant  Taxi   Train  Avg
TRADE       20.1        14.2   12.6        59.2   22.4   25.7
SUMBT       22.6        19.8   16.5        59.5   22.5   28.2
TransferQA  31.3        22.7   26.3        61.9   36.7   35.8
T5-seq      76.1        28.6   69.8        87.0   60.4   64.4
SDT-seq     73.2        34.2   71.9        83.7   68.4   66.3

Table 2: Cross-domain (leave-one-out) JGA on MultiWOZ 2.1. Results for TRADE, SUMBT, and TransferQA from Kumar et al. (2020), Campagna et al. (2020), and Lin et al. (2021a), respectively.

4.3 Impact of Model Size

T5's XXL size may be unsuitable in a number of settings; consequently, we measure SDT's performance on SGD across other model sizes in Table 3. For the base and large model sizes, both SDT variations offer higher JGA than their description-based counterparts, possibly due to smaller T5 models being less capable of inferring unseen slots with just a description. Additionally, SDT-ind outperforms SDT-seq for these sizes.

Model    Base (250M)  Large (800M)  XXL (11B)
T5-seq   72.9         80.0          86.4
T5-ind   72.6         82.2          87.7
SDT-ind  78.2±0.6     83.7±0.8      87.5±0.9
SDT-seq  76.3±1.6     83.2±0.6      88.8±0.5

Table 3: SGD test set JGA across T5 model sizes.

4.4 Data Efficiency

To examine the data efficiency of SDT models, we train SDT-seq in a low-resource setting with 0.16% (10-shot), 1%, and 10% of the SGD training data and evaluate on the entire test set. For 10-shot, we randomly sample 10 training dialogues from every service; for 1% and 10%, we sample uniformly across the entire dataset (a sketch of this sampling follows Table 4). SDT-seq demonstrates far higher data efficiency than T5-seq (Table 4).

Model    10-shot  1%    10%
T5-seq   51.0     79.4  83.0
SDT-seq  70.7     84.5  87.4

Table 4: Data efficiency experiments on the SGD test set.
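For illustration, a minimal sketch of the two sampling schemes described above (our own code; the names and the per-service grouping structure are assumptions):

```python
import random

def ten_shot_sample(dialogues_by_service, k=10, seed=0):
    """10-shot setting: sample k training dialogues from every service.
    `dialogues_by_service` maps a service name to its training dialogues."""
    rng = random.Random(seed)
    sample = []
    for service, dialogues in sorted(dialogues_by_service.items()):
        sample.extend(rng.sample(dialogues, min(k, len(dialogues))))
    return sample

def uniform_sample(dialogues, fraction, seed=0):
    """1%/10% settings: sample a fraction of the training set uniformly."""
    rng = random.Random(seed)
    return rng.sample(dialogues, round(len(dialogues) * fraction))
```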
4.5 Robustness

Large LMs are often sensitive to the choice of prompt (Zhao et al., 2021b; Reynolds and McDonell, 2021). To this end, we evaluate SDT-seq on the SGD-X (Lee et al., 2021b) benchmark, which comprises 5 variants with paraphrased slot names and descriptions per schema (see Appendix Figure 4). Table 5 shows that SDT-seq achieves the highest average JGA (JGA_v1-5) and the lowest schema sensitivity (SS_JGA), indicating it is the most robust of the compared models. Even so, the decline in JGA indicates that SDT-seq remains sensitive to how slot names are written.

Model           JGA_Orig  JGA_v1-5  Diff_rel  SS_JGA
SGP-DST*        60.5      49.9      -17.5     51.9
T5-ind (base)*  72.6      64.0      -11.9     40.4
T5-seq (name)†  79.7      73.0      -8.4      35.0
T5-seq          86.4      77.8      -10.0     27.0
SDT-seq         88.8      81.2      -8.6      24.1

Table 5: Robustness evaluation on the SGD-X test sets. *Results from Lee et al. (2021b). †Result of using T5-seq with only slot names and not descriptions, from Zhao et al. (2022).

5 Discussion

5.1 Writing descriptions vs. demonstrations

We note that the information provided to SDT is not identical to what is provided to typical schema-guided models, as SDT exchanges natural language descriptions for a demonstration of identifying slots in a dialogue. However, we argue that from the developer standpoint, creating a single example is similar in effort to writing descriptions, so we consider the methods comparable. Creating the SDT-seq prompts for all 45 services in SGD took an experienced annotator ~2 hours, compared to ~1.5 hours for generating slot descriptions. SDT-ind prompts are even simpler to write because they relax the requirement of creating a coherent dialogue in which all slots are used.

One advantage of descriptions is that they can be easier to generate than a succinct dialogue that covers all slots. However, given the performance gain, example-based prompts may be a better choice in many settings, especially at smaller model sizes where the gain is more pronounced.

5.2 Descriptions plus demonstrations

Training with descriptions has proven effective for improving DST performance (Zhao et al., 2022; Lee et al., 2021a), and our experiments show that demonstrations are even more effective. We combined the two to see whether they are complementary or overlapping, and we find that performance does not improve over using demonstrations alone (Appendix Table A1). We hypothesize that demonstrations already convey slot semantics sufficiently, making descriptions extraneous.

5.3 Prompting vs. traditional finetuning

To understand the impact of using a single demonstration as a prompt vs. traditional finetuning, we finetune T5-seq a second time on the same set of dialogues used in the SDT-seq prompts; it therefore has access to both slot descriptions and a single demonstration for each service. In this case, T5-seq is provided strictly more information than SDT-seq. T5-seq with finetuning obtains a JGA of 87.7% on SGD, on par with T5-ind but still lower than SDT-seq, suggesting that dialogue examples are better used as prompts (Le Scao and Rush, 2021). Interestingly, finetuning on more than one dialogue example per service did not improve performance (Appendix Figure 3).
5.4 Error analysis

Figure 2 compares some common errors made by T5-seq and SDT-seq. The patterns suggest that SDT's demonstrations are helpful when multiple slots are similar to each other (#1) and when prompt dialogues closely match target dialogues (#2). However, SDT can be limited by its prompt. For instance, in #3 it has only seen the "music" value for the event_type slot in the prompt, potentially resulting in under-predicting the other categorical value ("theater").

Figure 2: Examples of common error patterns made by T5-seq but not SDT-seq, and vice versa.

1. T5-seq confused by similar-sounding slots
   Dialogue: "I need to find tickets to Anaheim, CA." / "When would you like to travel, and where are you going to?" / "Traveling to Sacramento on the 4th."
   Error: T5-seq: to=Sacramento, from=Anaheim

2. T5-seq misses active slots
   Dialogue: "Can you please add an alarm called Grocery run."
   Error: T5-seq: new_alarm_name=None

3. SDT-seq misses categorical values not seen in prompt
   Dialogue: "I like Broadway shows and want to see one on Tuesday next week."
   Error: SDT-seq: event_type=music (ground truth: theater)

6 Related Work

Prior approaches focused on framing DST as question answering (Ruan et al., 2020; Ma et al., 2019; Zhang et al., 2021). Many MultiWOZ cross-domain models leverage slot names/descriptions (Wu et al., 2019; Lee et al., 2019; Lin et al., 2021a). Pretrained generative LLMs (Raffel et al., 2020; Brown et al., 2020) have enabled framing NLP tasks as seq2seq problems. Some DST papers (Zhao et al., 2021a; Feng et al., 2021) look at settings with no train-test discrepancy. Many studies explore the efficacy of task-specific prompts (Jiang et al., 2020; Liu et al., 2021). Madotto et al. (2020) prime LMs with examples for dialogue tasks, but without finetuning. Wei et al. (2021) finetune language models to understand prompts for a different task.

7 Conclusion

We study the use of demonstrations as LM prompts to convey the semantics of APIs, in lieu of natural language descriptions, for TOD. While taking similar effort to construct, demonstrations outperform description-based prompts in our experiments across DST datasets (SGD and MultiWOZ), model sizes, and training data sizes, while being more robust to changes in schemata. This work provides developers of TOD systems with more options for API representations to enable transfer to unseen services. In future work, we would like to explore this representation for other TOD tasks (e.g. dialogue management and response generation).

8 Ethical Considerations

We proposed a more efficient way of building TOD systems by leveraging demonstrations in place of descriptions, leading to increased accuracy with minimal/no data preparation overhead. We conduct our experiments on publicly-available TOD datasets in English, covering domains which are popular for building conversational agents. We hope our work leads to building more accurate TOD systems with similar or less overhead, and encourages further research in the area.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of EMNLP 2018, pages 5016–5026.

Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica Lam. 2020. Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. In Proceedings of ACL 2020, pages 122–132.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh Kumar, Anuj Goyal, Peter Ku, and Dilek Hakkani-Tur. 2020. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of LREC 2020, pages 422–428.

Yue Feng, Yang Wang, and Hang Li. 2021. A sequence-to-sequence approach to dialogue state tracking. In Proceedings of ACL-IJCNLP 2021, pages 1714–1725.

Michael Heck, Carel van Niekerk, Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Marco Moresi, and Milica Gašić. 2020. TripPy: A triple copy strategy for value independent neural dialog state tracking. In Proceedings of SIGDIAL 2020, pages 35–44.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know?

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, et al. 2017. In-datacenter performance analysis of a tensor processing unit. SIGARCH Computer Architecture News, 45(2):1–12.

Adarsh Kumar, Peter Ku, Anuj Goyal, Angeliki Metallinou, and Dilek Hakkani-Tur. 2020. MA-DST: Multi-attention-based scalable dialog state tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8107–8114.

Teven Le Scao and Alexander Rush. 2021. How many data points is a prompt worth? In Proceedings of NAACL-HLT 2021, pages 2627–2636.

Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. 2021a. Dialogue state tracking with a language model using schema-driven prompting. In Proceedings of EMNLP 2021, pages 4937–4949.

Harrison Lee, Raghav Gupta, Abhinav Rastogi, Yuan Cao, Bin Zhang, and Yonghui Wu. 2021b. SGD-X: A benchmark for robust generalization in schema-guided dialogue systems.

Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim. 2019. SUMBT: Slot-utterance matching for universal and scalable belief tracking. In Proceedings of ACL 2019, pages 5478–5483.

Zhaojiang Lin, Bing Liu, Andrea Madotto, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Eunjoon Cho, Rajen Subba, and Pascale Fung. 2021a. Zero-shot dialogue state tracking via cross-task transfer.

Zhaojiang Lin, Bing Liu, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Andrea Madotto, Eunjoon Cho, and Rajen Subba. 2021b. Leveraging slot descriptions for zero-shot cross-domain dialogue state tracking. In Proceedings of NAACL-HLT 2021, pages 5640–5648.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too.

Yue Ma, Zengfeng Zeng, Dawei Zhu, Xuan Li, Yiying Yang, Xiaoyuan Yao, Kaijie Zhou, and Jianping Shen. 2019. An end-to-end dialogue state tracking system with machine reading comprehension and wide & deep classification.

Andrea Madotto, Zihan Liu, Zhaojiang Lin, and Pascale Fung. 2020. Language models as few-shot learner for task-oriented dialogue systems. CoRR, abs/2008.06239.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020a. Schema-guided dialogue state tracking task at DSTC8.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020b. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8689–8696.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm.

Yu-Ping Ruan, Zhen-Hua Ling, Jia-Chen Gu, and Quan Liu. 2020. Fine-tuning BERT for schema-guided zero-shot dialogue state tracking.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems.

Yang Zhang, Vahid Noroozi, Evelina Bakhturina, and Boris Ginsburg. 2021. SGD-QA: Fast schema-guided dialogue state tracking for unseen services.

Jeffrey Zhao, Raghav Gupta, Yuan Cao, Dian Yu, Mingqiu Wang, Harrison Lee, Abhinav Rastogi, Izhak Shafran, and Yonghui Wu. 2022. Description-driven task-oriented dialog modeling.

Jeffrey Zhao, Mahdis Mahdieh, Ye Zhang, Yuan Cao, and Yonghui Wu. 2021a. Effective sequence-to-sequence dialogue state tracking. In Proceedings of EMNLP 2021, pages 7486–7493.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021b. Calibrate before use: Improving few-shot performance of language models.
A SDT Model Details

All T5 checkpoints used are available publicly². For all experiments, we use a sequence length of 2048, dropout of 10%, and a batch size of 16. We used a constant learning rate of 1e-3 or 1e-4. All models were trained for 50k steps or until convergence, and each experiment was conducted on either 64 or 128 TPU v3 chips (Jouppi et al., 2017).

² https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md

B Baseline Models

For SGD, we compare against SGP-DST (Ruan et al., 2020), MRC+WD-DST (Ma et al., 2019), T5-seq (Zhao et al., 2022), and T5-ind (Lee et al., 2021a). For MultiWOZ, we compare against TRADE (Wu et al., 2019), SUMBT (Lee et al., 2019), TransferQA (Lin et al., 2021a), and T5-seq. TransferQA is based on T5-large; T5-ind and T5-seq are based on T5-XXL in this paper unless otherwise noted.

C Prompt Design

We experimented with various formats for the SDT prompt before arriving at the final format. Below, we list the alternative designs we tried and their impact on JGA, as evaluated on the SGD test set.

C.1 Categorical value strings vs. multiple-choice answers

We found that JGA dropped -2% when we tasked the model with decoding categorical values instead of multiple-choice answers - e.g. payment_method=debit card instead of payment_method=b (where b is linked to the value debit card in the prompt, as described in Section 2). When tasked with decoding categorical values, the model would often decode related yet invalid values, which we counted as false in our evaluation. For example, instead of debit card, the model might decode bank balance. A code sketch of this encoding appears at the end of this appendix.

C.2 Slot IDs vs. slot names

When we delexicalized slot names into slot IDs, JGA dropped -5%. One downside of this approach is that the model lost access to the valuable semantic information conveyed by the slot name. Another downside is that the model could not distinguish two slots that had the same value in the prompt. For example, if the prompt was "I would like a pet-friendly hotel room with wifi" and the corresponding slots were 1=True (has_wifi) and 2=True (pets_allowed), it is ambiguous which ID refers to which slot. The potential upside of using slot IDs was to remove the dependence on the choice of slot name, but this did not succeed for the reasons above.

C.3 Decoding active slots vs. all slots

We experimented with training the model to decode only active slots, rather than all slots with none values for inactive ones. JGA dropped -0.4%, which we hypothesized could be a result of greater dissimilarity between the slot-value string in the prompt (which contains all slots by construction) and the target, which only contains a subset of slots.

C.4 In-line annotations vs. dialogue + slots concatenated

We hypothesized that bringing each slot annotation in the prompt closer to where the slot is mentioned in the dialogue might help the model better understand the slot's semantic meaning. We changed the format as follows:

• Original: [example] [user] I would like a pet-friendly hotel room with wifi [system] I found ... [slot] has_wifi=True
• In-line: [example] [user] I would like a pet-friendly hotel room with wifi [has_wifi=True] [system] I found ...

However, this decreased JGA by more than -20%. We hypothesized that this was likely due to a mismatch between the prompt's annotations and the target string format, which remained the same.
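To make the multiple-choice encoding of Section 2 and C.1 concrete, here is a minimal sketch (our own illustration, not the authors' code; function names are hypothetical) of rendering a categorical slot's values as lettered options and mapping a decoded letter back to its value:

```python
import string

def to_multiple_choice(slot: str, values):
    """Render a categorical slot's values as lettered options, e.g.
    'payment_method a) credit card b) debit card c) app balance'."""
    options = " ".join(
        f"{letter}) {value}"
        for letter, value in zip(string.ascii_lowercase, values)
    )
    return f"{slot} {options}"

def decode_choice(letter: str, values):
    """Map a decoded letter (e.g. 'b') back to its categorical value."""
    return values[string.ascii_lowercase.index(letter)]

values = ["credit card", "debit card", "app balance"]
print(to_multiple_choice("payment_method", values))
assert decode_choice("b", values) == "debit card"
```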
Model           All        Seen       Unseen
SDT-seq + desc  88.6±0.9   95.7±0.5   86.2±1.0
SDT-seq         88.8±0.5   95.8±0.2   86.4±0.7

Table A1: We experiment with prompting using both descriptions and demonstrations (SDT-seq + desc) vs. demonstrations only (SDT-seq) and find that descriptions do not improve performance.

Figure 3: Results of secondarily finetuning T5-seq on dialogues, to help understand whether prompting or finetuning is more effective. The examples used for finetuning are derived from the set of dialogues used as prompts across the 5 trials of SDT-seq. From this, we observe that prompting outperforms finetuning.
Figure 4: The original schema for a Payment service and its closest (v1) and farthest (v5) SGD-X variants, as measured by linguistic distance functions. For the SGD-X benchmark, models are trained on the original SGD dataset and evaluated on the test set, where the original test set schemas are replaced by SGD-X variant schemas.