Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing
Osman Ramadan, Paweł Budzianowski, Milica Gašić
Department of Engineering, University of Cambridge, U.K.
{oior2,pfb30,mg436}@cam.ac.uk

Abstract

Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling up with domains because of the dependency of the model parameters on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogues dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

1 Introduction

Spoken Dialogue Systems (SDS) are computer programs that can hold a conversation with a human. These can be task-based systems that help the user achieve specific goals, e.g. finding and booking hotels or restaurants. In order for the SDS to infer the user goals/intentions during the conversation, its Belief Tracking (BT) component maintains a distribution of states, called a belief state, across dialogue turns (Young et al., 2010). The belief state is used by the system to take actions in each turn until the conversation is concluded and the user goal is achieved. In order to extract these belief states from the conversation, traditional approaches use a Spoken Language Understanding (SLU) unit that utilizes a semantic dictionary to hold all the key terms, rephrasings and alternative mentions of a belief state. The SLU then delexicalises each turn using this semantic dictionary before passing it to the BT component (Wang and Lemon, 2013; Henderson et al., 2014b; Williams, 2014; Zilka and Jurcicek, 2015; Perez and Liu, 2016; Rastogi et al., 2017). However, this approach is not scalable to multi-domain dialogues because of the effort required to define a semantic dictionary for each domain. More advanced approaches, such as the Neural Belief Tracker (NBT), use word embeddings to alleviate the need for delexicalisation and combine the SLU and BT into one unit, mapping directly from turns to belief states (Mrkšić et al., 2017). Nevertheless, the NBT model does not tackle the problem of mixing different domains in a conversation. Moreover, as each slot is trained independently without sharing information between different slots, scaling such approaches to large multi-domain systems is greatly hindered.

In this paper, we propose a model that jointly identifies the domain and tracks the belief states corresponding to that domain. It uses semantic similarity between ontology terms and turn utterances to allow for parameter sharing, both between different slots within a single domain and across domains. In addition, the model parameters are independent of the ontology/belief states; thus the dimensionality of the parameters does not increase with the size of the ontology, making the model practically feasible to deploy in multi-domain environments without any modifications. Finally, we introduce a new, large-scale corpus of natural, human-human conversations, providing new possibilities to train complex, neural-based models. Our model systematically improves upon state-of-the-art neural approaches both in single and multi-domain conversations.
2 Background

The belief states of the BT are defined based on an ontology: the structured representation of the database which contains entities the system can talk about. The ontology defines the terms over which the distribution is to be tracked in the dialogue. This ontology is constructed in terms of slots and values in a single-domain setting or, alternatively, in terms of domains, slots and values in a multi-domain environment. Each domain consists of multiple slots and each slot contains several values, e.g. domain=hotel, slot=price, value=expensive. In each turn, the BT fits a distribution over the values of each slot in each domain, and a none value is added to each slot to indicate if the slot is not mentioned, so that the distribution sums up to 1. The BT then passes these states to the Policy Optimization unit as full probability distributions to take actions. This allows robustness to noisy environments (Young et al., 2010). The larger the ontology, the more flexible and multi-purposed the system is, but the harder it is to train and maintain a good quality BT.

3 Related Work

In recent years, a plethora of research has been generated on belief tracking (Williams et al., 2016). For the purposes of this paper, two previously proposed models are particularly relevant.

3.1 Neural Belief Tracker (NBT)

The main idea behind the NBT (Mrkšić et al., 2017) is to use semantically specialized pre-trained word embeddings to encode the user utterance, the system act and the candidate slots and values taken from the ontology. These are fed to semantic decoding and context modeling modules that apply a three-way gating mechanism and pass the output to a non-linear classifier layer to produce a distribution over the values for each slot. It uses a simple update rule, $p(s_t) = \beta p(s_{t-1}) + \lambda y$, where $p(s_t)$ is the belief state at time step $t$, $y$ is the output of the binary decision maker of the NBT, and $\beta$ and $\lambda$ are tunable parameters.

The NBT leverages semantic information from the word embeddings to resolve lexical/morphological ambiguity and maximize the shared parameters across the values of each slot. However, it only applies to a single domain and does not share parameters across slots.

3.2 Multi-domain Dialogue State Tracking

Recently, Rastogi et al. (2017) proposed a multi-domain approach using delexicalized utterances fed to a two-layer stacked bi-directional GRU network to extract features from the user and the system utterances. These, combined with the candidate slots and values, are passed to a feed-forward neural network with a softmax in the last layer. The candidate set fed to the network consists of the selected candidates from the previous turn and candidates from the ontology up to a limit $K$, which restricts the maximum size of the chosen set. Consequently, the model does not need an ad-hoc belief state update mechanism like in the NBT.

The parameters of the GRU network are defined per domain, whereas the parameters of the feed-forward network are defined per slot, allowing transfer learning across different domains. However, the model relies on delexicalization to extract the features, which limits the performance of the BT, as it does not scale to the rich variety of the language. Moreover, the number of parameters increases with the number of slots.

4 Method

The core idea is to leverage semantic similarities between the utterances and ontology terms to compute the belief state distribution. In this way, the model parameters only learn to model the interactions between turn utterances and ontology terms in the semantic space, rather than the mapping from utterances to states. Consequently, information is shared both between slots and across domains. Additionally, the number of parameters does not increase with the ontology size. Domain tracking is considered as a separate task but is learned jointly with the belief state tracking of the slots and values.

The proposed model uses semantically specialized pre-trained word embeddings (Wieting et al., 2015). To encode the user and system utterances, we employed 7 independent bi-directional LSTMs (Graves and Schmidhuber, 2005). Three of them are used to encode the system utterance for domain, slot and value tracking respectively. Similarly, three Bi-LSTMs encode the user utterance, while the last one is used to track the user affirmation. A variant of the model that uses CNNs as feature extractors, similar to the NBT-CNN (Mrkšić et al., 2017), is also employed.
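To make the encoding step concrete, the following is a minimal sketch of one such utterance encoder, assuming PyTorch; the paper does not include code, so the class name and dimensions are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Sketch of one of the 7 independent Bi-LSTM utterance encoders:
    it runs over pre-trained word embeddings and concatenates the last
    forward and backward hidden states into a single vector (of size L
    in the paper's notation). Names and sizes are hypothetical."""

    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embeddings)
        # h_n: (2, batch, hidden_dim) -> last forward and backward states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
```

Three such encoders (with separate weights) would process the system utterance and three the user utterance, for domain, slot and value tracking respectively, with a seventh tracking user affirmation.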
[Figure 1: The proposed model architecture, using Bi-LSTMs as encoders. Other variants of the model use CNNs as feature extractors (Kim, 2014; Kalchbrenner et al., 2014).]

4.1 Domain Tracking

Figure 1 presents the system architecture with two bi-directional LSTM networks as information encoders running over the word embeddings of the user and system utterances. The last hidden states of the forward and backward layers are concatenated to produce $h^d_{usr}$, $h^d_{sys}$ of size $L$ for the user and system utterances respectively. In the second variant of the model, CNNs are used to produce these vectors (Kim, 2014; Kalchbrenner et al., 2014). To detect the presence of the domain in the dialogue turn, element-wise multiplication is used as a similarity metric between the hidden states and the ontology embeddings of the domain:

$$d_k = h^d_k \odot \tanh(W_d e_d + b_d),$$

where $k \in \{usr, sys\}$, $e_d$ is the embedding vector of the domain and $W_d \in \mathbb{R}^{L \times D}$ transforms the domain word embeddings of dimension $D$ to the hidden representation. The information about semantic similarity is held by $d_{usr}$ and $d_{sys}$, which are fed to a non-linear layer to output a binary decision:

$$P_t(d) = \sigma(w_d \{d_{usr} \oplus d_{sys}\} + b_d),$$

where $w_d \in \mathbb{R}^{2L}$ and $b_d$ are learnable parameters that map the semantic similarity to a belief state probability $P_t(d)$ of a domain $d$ at a turn $t$.
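Continuing the same hedged PyTorch sketch, the two equations above could be implemented as follows; treating `e_d` as a single (e.g. summed) word embedding of the domain name is our reading, not a detail stated in this section.

```python
import torch
import torch.nn as nn

class DomainTracker(nn.Module):
    """Sketch of Section 4.1: element-wise similarity between utterance
    encodings and a projected domain embedding, then a binary decision.
    Shapes follow the equations above; otherwise illustrative only."""

    def __init__(self, L: int, D: int):
        super().__init__()
        self.proj = nn.Linear(D, L)      # W_d e_d + b_d
        self.out = nn.Linear(2 * L, 1)   # w_d {d_usr (+) d_sys} + b_d

    def forward(self, h_usr, h_sys, e_d):
        # d_k = h_k * tanh(W_d e_d + b_d) for k in {usr, sys}
        sim = torch.tanh(self.proj(e_d))
        d_usr, d_sys = h_usr * sim, h_sys * sim
        # P_t(d) = sigmoid(w_d [d_usr ; d_sys] + b_d)
        return torch.sigmoid(self.out(torch.cat([d_usr, d_sys], dim=-1)))
```

The slot and value similarity vectors $s_k$ and $v_k$ used in the next subsection would be computed analogously, with their own projection parameters.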
4.2 Candidate Slots and Values Tracking

Slots and values are tracked using a similar architecture as for domain tracking (Figure 1). However, to correctly model the context of the system-user dialogue at each turn, three different cases are considered when computing the similarity vectors:

1. Inform: The user is informing the system about his/her goal, e.g. 'I am looking for a restaurant that serves Turkish food'.

2. Request: The system is requesting information by asking the user about the value of a specific slot. For example, the system utterance is 'When do you want the taxi to arrive?' and the user answers with '19:30'.

3. Confirm: The system wants to confirm information about the value of a specific slot. For example, if the system asks 'Would you like free parking?', the user can affirm either positively or negatively. The model detects the user affirmation using a separate bi-directional LSTM or CNN to output $h^a_{usr}$.

The three cases are modelled as follows:

$$y^{s,v}_{inf} = w_{inf} \{s_{usr} \oplus v_{usr}\} + b_{inf},$$
$$y^{s,v}_{req} = w_{req} \{s_{sys} \oplus v_{usr}\} + b_{req},$$
$$y^{s,v}_{af} = w_{af} \{s_{sys} \oplus v_{sys} \oplus h^a_{usr}\} + b_{af},$$

where $s_k$, $v_k$ for $k \in \{usr, sys\}$ represent the semantic similarity between the user and system utterances and the ontology slot and value terms respectively, computed as shown in Figure 1, and $w$ and $b$ are learnable parameters. The distribution over the values of slot $s$ in domain $d$ at turn $t$ can be computed by summing the unscaled states $y_{inf}$, $y_{req}$ and $y_{af}$ for each value $v$ in $s$, and applying a softmax to normalize the distribution:

$$P_t(s, v) = \mathrm{softmax}(y^{s,v}_{inf} + y^{s,v}_{req} + y^{s,v}_{af}).$$
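Under the same assumptions, the three scoring functions could look as below; stacking the scalar scores over all candidate values of a slot and applying the softmax yields $P_t(s, v)$. Again a sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class SlotValueScorer(nn.Module):
    """Sketch of the inform/request/confirm cases of Section 4.2.
    s_usr, s_sys, v_usr, v_sys are similarity vectors between the
    utterance encodings and the ontology slot/value embeddings;
    h_a_usr encodes the user (dis)affirmation. Sizes are illustrative."""

    def __init__(self, dim: int, affirm_dim: int):
        super().__init__()
        self.w_inf = nn.Linear(2 * dim, 1)
        self.w_req = nn.Linear(2 * dim, 1)
        self.w_af = nn.Linear(2 * dim + affirm_dim, 1)

    def forward(self, s_usr, s_sys, v_usr, v_sys, h_a_usr):
        y_inf = self.w_inf(torch.cat([s_usr, v_usr], dim=-1))
        y_req = self.w_req(torch.cat([s_sys, v_usr], dim=-1))
        y_af = self.w_af(torch.cat([s_sys, v_sys, h_a_usr], dim=-1))
        # one unscaled score per (slot, value) candidate; the caller
        # softmaxes the stacked scores over all values v of slot s
        return y_inf + y_req + y_af
```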
4.3 Belief State Update

Since dialogue systems in the real world operate in noisy environments, a robust BT should utilize the flow of the conversation to reduce the uncertainty in the belief state distribution. This can be achieved by passing the output of the decision maker, at each turn, as an input to an RNN that runs over the dialogue turns as shown in Figure 1, which allows the gradients to be propagated across turns. This alleviates the problem of tuning hyper-parameters for rule-based updates. To avoid the vanishing gradient problem, three networks were tested: a simple RNN, an RNN with a memory cell (Henderson et al., 2014a) and an LSTM. The RNN with a memory cell proved to give the best results. In addition to reducing the vanishing gradient problem, this variant is less complex than an LSTM, which makes training easier. Furthermore, the variant of this RNN used for domain tracking has all its weights of the form $W_i = \alpha_i I$, where $\alpha_i$ is a distinct learnable parameter for the hidden, memory and previous state layers and $I$ is the identity matrix. Similarly, the weights of the RNN used to track the slots and values are of the form $W_j = \gamma_j I + \lambda_j (1 - I)$, where $\gamma_j$ and $\lambda_j$ are the learnable parameters. These two variants of the RNN combine ideas from the earlier work of Henderson et al. (2014a) and Mrkšić and Vulić (2018). The output is $P_{1:T}(d)$ and $P_{1:T}(s, v)$, which represent the joint probability distributions of the domains and of the slots and values respectively over the complete dialogue. Combining these together produces the full belief state distribution of the dialogue:

$$P_{1:T}(d, s, v) = P_{1:T}(d) \, P_{1:T}(s, v).$$

4.4 Training Criteria

Domain tracking and slots and values tracking are trained disjointly. Belief state labels for each turn are split into domains and slots and values. Thanks to the disjoint training, the learning of slot and value belief states is not restricted to a specific domain. Therefore, the model shares the knowledge of slots and values across different domains. The loss function for the domain tracking is:

$$L_d = -\sum_{n=1}^{N} \sum_{d \in D} t_n(d) \log P^n_{1:T}(d),$$

where $d$ is a vector of domains over the dialogue, $t_n(d)$ is the domain label for the dialogue $n$ and $N$ is the number of dialogues. Similarly, the loss function for the slots and values tracking is:

$$L_{s,v} = -\sum_{n=1}^{N} \sum_{s,v \in S,V} t_n(s, v) \log P^n_{1:T}(s, v),$$

where $s$ and $v$ are vectors of slots and values over the dialogue and $t_n(s, v)$ is the joint label vector for the dialogue $n$.
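The constrained weight matrices of Section 4.3 reduce each recurrent weight to one or two scalars. Below is a hedged sketch of just that parameterisation; the surrounding memory-cell RNN equations are not reproduced here, and the class name is our own.

```python
import torch
import torch.nn as nn

class ConstrainedWeight(nn.Module):
    """Sketch of the Section 4.3 weight constraints. Domain tracking:
    W_i = alpha_i * I (a single scalar per weight matrix). Slot/value
    tracking: W_j = gamma_j * I + lambda_j * (1 - I), tying a value's
    own state (diagonal) and the other values' states (off-diagonal)
    with separate scalars. Illustrative only."""

    def __init__(self, dim: int, slot_value: bool = False):
        super().__init__()
        self.dim, self.slot_value = dim, slot_value
        self.gamma = nn.Parameter(torch.ones(1))  # alpha_i or gamma_j
        self.lam = nn.Parameter(torch.zeros(1))   # lambda_j (domains: unused)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        eye = torch.eye(self.dim, device=state.device)
        W = self.gamma * eye
        if self.slot_value:
            W = W + self.lam * (1.0 - eye)
        return state @ W.T  # apply the constrained weight to a state vector
```

Keeping the recurrent weights this close to the identity matrix also keeps the number of learnable recurrent parameters independent of the ontology size, in line with the design goal stated in Section 4.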
5 Datasets and Baselines

Neural approaches to statistical dialogue development, especially in a task-oriented paradigm, are greatly hindered by the lack of large-scale datasets. That is why, following the Wizard-of-Oz (WOZ) approach (Kelley, 1984; Wen et al., 2017), we ran a text-based multi-domain corpus data collection scheme through Amazon MTurk. The main goal of the data collection was to acquire human-human conversations between a tourist visiting a city and a clerk from an information center. At the beginning of each dialogue the user (visitor) was given explicit instructions about the goal to fulfill, which often spanned multiple domains. The task of the system (wizard) is to assist the visitor, having access to databases covering the domains. The WOZ paradigm allowed us to obtain natural and semantically rich multi-topic dialogues spanning multiple domains such as hotels, attractions, restaurants, and booking trains or taxis. The dialogues cover from 1 up to 5 domains per dialogue, greatly varying in length and complexity.

5.1 Data Structure

The data consists of 2480 single-domain dialogues and 7375 multi-domain dialogues usually spanning from 2 up to 5 domains. Some domains also contain sub-domains, such as booking. The average sentence lengths are 11.63 and 15.01 for users and wizards respectively. The combined ontology consists of 5 domains, 27 slots and 663 values, making it significantly larger than observed in other datasets. To enforce reproducibility of results, we distribute the corpus with a pre-specified train/test/development random split. The test and development sets contain 1k examples each. Each dialogue consists of a goal, user and system utterances and a belief state per turn. The data and model are publicly available.[1]
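Purely as an illustration of the per-turn annotation described above (the released corpus defines the actual serialization, so the keys and slot names here are hypothetical), a multi-domain belief state can be pictured as nested domain-slot-value assignments:

```python
# Hypothetical sketch of one annotated turn; the field and slot names
# are illustrative, not the corpus schema.
example_turn = {
    "system": "When do you want the taxi to arrive?",
    "user": "19:30. I am also looking for an expensive hotel.",
    "belief_state": {
        "taxi": {"arrive by": "19:30"},
        "hotel": {"price": "expensive"},
        # slots that are not mentioned take the special value "none"
    },
}
```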
5.2 Evaluation

We also used the extended WOZ 2.0 dataset (Wen et al., 2017).[2] The WOZ 2.0 dataset consists of 1200 single-topic dialogues constrained to the restaurant domain. All the weights were initialised using a normal distribution of zero mean and unit variance, and biases were initialised to zero. The Adam optimizer (Kingma and Ba, 2014), with a batch size of 64, is used to train all the models for 600 epochs. Dropout (Srivastava et al., 2014) was used for regularisation (50% dropout rate on all the intermediate representations). For each of the two datasets we compare our proposed architecture (using either Bi-LSTM or CNN as encoders) to the NBT model[3] (Mrkšić et al., 2017).

6 Results

Table 1 shows the performance of our model in tracking the belief state of single-domain dialogues, compared to the NBT-CNN variant of the NBT discussed in Section 3.1. Our model outperforms the NBT in all three slots and the joint goals for the two datasets. The NBT previously achieved state-of-the-art results (Mrkšić et al., 2017). Moreover, the performance of all models is worse on the restaurant portion of the new dataset compared to WOZ 2.0. This is because the dialogues in the new dataset are richer and noisier, more closely resembling real-environment dialogues.

              WOZ 2.0                      New WOZ (only restaurants)
Slot          NBT-CNN   Bi-LSTM   CNN      NBT-CNN   Bi-LSTM   CNN
Food          88.9      96.1      96.4     78.3      84.7      85.3
Price range   93.7      98.0      97.9     92.6      95.6      93.6
Area          94.3      97.8      98.1     78.3      82.6      86.4
Joint goals   84.2      85.1      85.5     57.7      59.9      63.7

Table 1: WOZ 2.0 and new dataset test set accuracies of the NBT-CNN and the two variants of the proposed model, for the slots food, price range and area, and for the joint goals.

Table 2 presents the results on multi-domain dialogues from the new dataset described in Section 5. To demonstrate the difficulty of the multi-domain belief tracking problem, values of a theoretical baseline that samples the belief state uniformly at random are also presented. Our model gracefully handles such a difficult task. In most of the cases, CNNs demonstrate better performance than Bi-LSTMs. We hypothesize that this comes from their effectiveness at extracting local and position-invariant features, which are crucial for semantic similarities (Yin et al., 2017).

                   New WOZ (multi-domain)
Model              F1 score   Accuracy %
Uniform Sampling   0.108      10.8
Bi-LSTM            0.876      93.7
CNN                0.878      93.2

Table 2: The overall F1 score and accuracy for the multi-domain dialogues test set.[4]

7 Conclusions

In this paper, we proposed a new approach that tackles the issues of multi-domain belief tracking, such as the scalability of model parameters with the ontology size. Our model shows improved performance in single-domain tasks compared to the state-of-the-art NBT method. By exploiting semantic similarities between dialogue utterances and ontology terms, the model alleviates the need for ontology-dependent parameters and maximizes the amount of information shared between slots and across domains. In future work, we intend to investigate introducing new domains and ontology terms without further training, thus performing zero-shot learning.

[1] http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/
[2] Publicly available at https://mi.eng.cam.ac.uk/~nm480/woz_2.0.zip.
[3] Publicly available at https://github.com/nmrksic/neural-belief-tracker.
[4] The F1 score is computed by considering all the values in each slot of each domain as positive and the "none" state of the slot as negative.
Acknowledgments

The authors would like to thank Nikola Mrkšić, Jacquie Rowe, the Cambridge Dialogue Systems Group and the ACL reviewers for their constructive feedback. Paweł Budzianowski is supported by the EPSRC Council and Toshiba Research Europe Ltd, Cambridge Research Laboratory. The data collection was funded through a Google Faculty Award.

References

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014a. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, pages 360–365.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 292–299.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

John F. Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS) 2(1):26–41.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1777–1788.

Nikola Mrkšić and Ivan Vulić. 2018. Fully statistical neural belief tracking. In Proceedings of ACL.

Julien Perez and Fei Liu. 2016. Dialog state tracking, a machine reading approach using memory network. arXiv preprint arXiv:1606.04052.

Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. 2017. Scalable multi-domain dialogue state tracking. arXiv preprint arXiv:1712.10224.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference. pages 423–432.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics 3:345–358.

Jason D. Williams. 2014. Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 282–291.

Jason Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse 7(3):4–33.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.

Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language 24(2):150–174.

Lukas Zilka and Filip Jurcicek. 2015. Incremental LSTM-based dialog state tracker. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, pages 757–762.