Efficient Context and Schema Fusion Networks for Multi-Domain Dialogue State Tracking
Su Zhu, Jieyu Li, Lu Chen∗ and Kai Yu∗
State Key Laboratory of Media Convergence Production Technology and Systems
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
SpeechLab, Department of Computer Science and Engineering
Shanghai Jiao Tong University, Shanghai, China
{paul2204,oracion,chenlusz,kai.yu}@sjtu.edu.cn
∗ The corresponding authors are Lu Chen and Kai Yu.

arXiv:2004.03386v4 [cs.CL] 7 Oct 2020

Abstract

Dialogue state tracking (DST) aims at estimating the current dialogue state given all the preceding conversation. For multi-domain DST, the data sparsity problem is a major obstacle due to increased numbers of state candidates and dialogue lengths. To encode the dialogue context efficiently, we utilize the previous dialogue state (predicted) and the current dialogue utterance as the input for DST. To consider relations among different domain-slots, the schema graph involving prior knowledge is exploited. In this paper, a novel context and schema fusion network is proposed to encode the dialogue context and schema graph by using internal and external attention mechanisms. Experimental results show that our approach can outperform strong baselines, and the previous state-of-the-art method (SOM-DST) can also be improved by our proposed schema graph.

Figure 1: An example of multi-domain dialogues. Utterances at the left side are from the system agent, and utterances at the right side are from a user. The dialogue state of each domain is represented as a set of (slot, value) pairs.

1 Introduction

Dialogue state tracking (DST) is a key component in task-oriented dialogue systems which cover certain narrow domains (e.g., booking hotels and travel planning). As a kind of context-aware language understanding task, DST aims to extract user goals or intents hidden in human-machine conversation and represent them as a compact dialogue state, i.e., a set of slots and their corresponding values. For example, as illustrated in Fig. 1, (slot, value) pairs like (name, huntingdon marriott hotel) are extracted from the dialogue. It is essential to build an accurate DST for dialogue management (Young et al., 2013), where the dialogue state determines the next machine action and response.

Recently, motivated by the tremendous growth of commercial dialogue systems like Apple Siri, Microsoft Cortana, Amazon Alexa, or Google Assistant, multi-domain DST becomes crucial to help users across different domains (Budzianowski et al., 2018; Eric et al., 2019). As shown in Fig. 1, the dialogue covers three domains (i.e., Hotel, Attraction and Taxi). The goal of multi-domain DST is to predict the value (including NONE) for each domain-slot pair based on all the preceding dialogue utterances. However, due to increasing numbers of dialogue turns and domain-slot pairs, the data sparsity problem becomes the main issue in this field.

To tackle the above problem, we emphasize that DST models should support open-vocabulary value decoding, encode context efficiently and incorporate domain-slot relations:

1. Open-vocabulary DST is essential for real-world applications (Wu et al., 2019; Gao et al., 2019; Ren et al., 2019), since value sets for some slots can be very large and variable (e.g., song names).

2. To encode the dialogue context efficiently, we attempt to get the context representation from the previous (predicted) dialogue state and the current turn dialogue utterance, while not concatenating all the preceding dialogue utterances.
3. To consider relations among domains and slots, we introduce the schema graph, which contains domain, slot and domain-slot nodes and their relationships. It is a kind of prior knowledge and may help alleviate the data imbalance problem.

To this end, we propose a multi-domain dialogue state tracker with context and schema fusion networks (CSFN-DST). The fusion network is exploited to jointly encode the previous dialogue state, the current turn dialogue and the schema graph by internal and external attention mechanisms. After multiple layers of attention networks, the final representation of each domain-slot node is utilized to predict the corresponding value, involving context and schema information. For the value prediction, a slot-gate classifier is applied to decide whether a domain-slot is mentioned in the conversation, and then an RNN-based value decoder is exploited to generate the corresponding value.

Our proposed CSFN-DST is evaluated on the MultiWOZ 2.0 and MultiWOZ 2.1 benchmarks. An ablation study on each component further reveals that both context and schema are essential. Contributions in this work are summarized as:

• To alleviate the data sparsity problem and enhance the context encoding, we propose exploiting domain-slot relations within the schema graph for open-vocabulary DST.

• To fully encode the schema graph and dialogue context, fusion networks are introduced with graph-based, internal and external attention mechanisms.

• Experimental results show that our approach surpasses strong baselines, and the previous state-of-the-art method (SOM-DST) can also be improved by our proposed schema graph.

2 Related Work

Traditional DST models rely on semantics extracted by natural language understanding to predict the current dialogue states (Young et al., 2013; Williams et al., 2013; Henderson et al., 2014d; Sun et al., 2014b,a; Yu et al., 2015), or jointly learn language understanding in an end-to-end way (Henderson et al., 2014b,c). These methods heavily rely on hand-crafted features and complex domain-specific lexicons for delexicalization, which are difficult to extend to new domains. Recently, most works about DST focus on encoding the dialogue context with deep neural networks (such as CNN, RNN, LSTM-RNN, etc.) and predicting a value for each possible slot (Mrkšić et al., 2017; Xu and Hu, 2018; Zhong et al., 2018; Ren et al., 2018).

Multi-domain DST Most traditional state tracking approaches focus on a single domain and extract a value for each slot in that domain (Williams et al., 2013; Henderson et al., 2014a). They can be directly adapted to multi/mixed-domain conversations by replacing slots in a single domain with domain-slot pairs (i.e., domain-specific slots) (Ramadan et al., 2018; Gao et al., 2019; Wu et al., 2019; Zhang et al., 2019; Kim et al., 2019). Despite its simplicity, this approach extracts the value for each domain-slot independently, which may fail to capture features from slot co-occurrences. For example, hotels with higher stars are usually more expensive (price range).

Predefined ontology-based DST Most of the previous works assume that a predefined ontology is provided in advance, i.e., all slots and their values of each domain are known and fixed (Williams, 2012; Henderson et al., 2014a). Predefined ontology-based DST can be simplified into a value classification task for each slot (Henderson et al., 2014c; Mrkšić et al., 2017; Zhong et al., 2018; Ren et al., 2018; Ramadan et al., 2018; Lee et al., 2019). It has the advantage of access to the known candidate set of each slot, but these approaches may not be applicable in real scenarios, since a full ontology is hard to obtain in advance (Xu and Hu, 2018) and the number of possible slot values can be substantial and variable (e.g., song names) even if a full ontology exists (Wu et al., 2019).

Open-vocabulary DST Without a predefined ontology, some works choose to directly generate or extract values for each slot from the dialogue context, by using the encoder-decoder architecture (Wu et al., 2019) or the pointer network (Gao et al., 2019; Ren et al., 2019; Le et al., 2020). They can improve the scalability and robustness to unseen slot values, but most of them are not efficient in context encoding since they encode all the previous utterances at each dialogue turn. Notably, a multi-domain dialogue could involve quite a long history, e.g., the MultiWOZ dataset (Budzianowski et al., 2018) contains about 13 turns per dialogue on average.
Graph Neural Network Graph Neural Network (GNN) approaches (Scarselli et al., 2009; Veličković et al., 2018) aggregate information from graph structure and encode node features, which can learn to reason and introduce structure information. Many GNN variants have been proposed and applied in various NLP tasks, such as text classification (Yao et al., 2019), machine translation (Marcheggiani et al., 2018), dialogue policy optimization (Chen et al., 2018, 2019), etc. We introduce graph-based multi-head attention and fusion networks for encoding the schema graph.

Figure 2: An example of schema graph. Domain nodes are in orange, slot nodes are in green and domain-slot nodes are in blue.

3 Problem Formulation

In a multi-domain dialogue state tracking problem, we assume that there are M domains (e.g., taxi, hotel) involved, D = {d_1, d_2, · · · , d_M}. Slots included in each domain d ∈ D are denoted as a set S^d = {s^d_1, s^d_2, · · · , s^d_{|S^d|}}.¹ Thus, there are J possible domain-slot pairs in total, O = {o_1, o_2, · · · , o_J}, where J = Σ_{m=1}^{M} |S^{d_m}|. Since different domains may contain a same slot, we denote the N distinct slots as S = {s_1, s_2, · · · , s_N}, where N ≤ J.

A dialogue can be formally represented as {(A_1, U_1, B_1), (A_2, U_2, B_2), · · · , (A_T, U_T, B_T)}, where A_t is what the agent says at the t-th turn, U_t is the user utterance at the t-th turn, and B_t denotes the corresponding dialogue state. A_t and U_t are word sequences, while B_t is a set of domain-slot-value triplets, e.g., (hotel, price range, expensive). The value v^j_t is a word sequence for the j-th domain-slot pair at the t-th turn. The goal of DST is to correctly predict the value for each domain-slot pair, given the dialogue history.

Most of the previous works choose to concatenate all words in the dialogue history, [A_1, U_1, A_2, U_2, · · · , A_t, U_t], as the input. However, this may lead to increased computation time. In this work, we propose to utilize only the current dialogue turn A_t, U_t and the previous dialogue state B_{t−1} to predict the new state B_t. During training, we use the ground truth of B_{t−1}, while the previously predicted dialogue state is used in the inference stage.

¹ For open-vocabulary DST, possible values for each slot s ∈ S^d are not known in advance.

Schema Graph To consider relations between different domain-slot pairs and exploit them as an additional input to guide the context encoding, we formulate them as a schema graph G = (V, E) with node set V and edge set E. Fig. 2 shows an example of a schema graph. In the graph, there are three kinds of nodes to denote all domains D, slots S, and domain-slot pairs O, i.e., V = D ∪ S ∪ O. Four types of undirected edges between different nodes are exploited to encode prior knowledge:

1. (d, d'): Any two domain nodes, d ∈ D and d' ∈ D, are linked to each other.

2. (s, d): We add an edge between a slot node s ∈ S and a domain node d ∈ D if s ∈ S^d.

3. (d, o) and (s, o): If a domain-slot pair o ∈ O is composed of the domain d ∈ D and the slot s ∈ S, there are two edges from d and s to this domain-slot node, respectively.

4. (s, s'): If the candidate values of two different slots (s ∈ S and s' ∈ S) would overlap, there is also an edge between them, e.g., destination and departure, leave at and arrive by.

4 Context and Schema Fusion Networks for Multi-domain DST

In this section, we introduce our approach for multi-domain DST, which jointly encodes the current dialogue turn (A_t and U_t), the previous dialogue state B_{t−1} and the schema graph G by fusion networks. After that, we can obtain context-aware and schema-aware node embeddings for all the J domain-slot pairs. Finally, a slot-gate classifier and an RNN-based value decoder are exploited to extract the value for each domain-slot pair.

The architecture of CSFN-DST is illustrated in Fig. 3, which consists of input embeddings, context and schema fusion networks, and state prediction modules.
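To make the schema graph construction concrete, here is a minimal Python sketch that builds the node set V = D ∪ S ∪ O and the binary adjacency matrix A^G from the four edge types above; the toy two-domain schema and the overlapping-slot list are illustrative assumptions, not the full MultiWOZ schema or the authors' code.

```python
import numpy as np

def build_schema_graph(schema, overlapping_slots=()):
    """schema: dict mapping domain -> list of slots.
    Returns node names (domains + distinct slots + domain-slot pairs) and adjacency A^G."""
    domains = list(schema)
    slots = sorted({s for ss in schema.values() for s in ss})
    pairs = [(d, s) for d in domains for s in schema[d]]
    nodes = domains + slots + [f"{d}-{s}" for d, s in pairs]
    idx = {name: i for i, name in enumerate(nodes)}

    A = np.eye(len(nodes), dtype=int)             # self-connections (an assumption)
    def link(a, b):
        A[idx[a], idx[b]] = A[idx[b], idx[a]] = 1  # undirected edge

    for d in domains:                              # 1. (d, d'): all domain pairs
        for d2 in domains:
            link(d, d2)
    for d in domains:                              # 2. (s, d): slot belongs to domain
        for s in schema[d]:
            link(s, d)
    for d, s in pairs:                             # 3. (d, o) and (s, o)
        link(d, f"{d}-{s}")
        link(s, f"{d}-{s}")
    for s, s2 in overlapping_slots:                # 4. (s, s'): overlapping value sets
        link(s, s2)
    return nodes, A

# Example with a toy schema; "departure"/"destination" share candidate values (place names).
schema = {"taxi": ["departure", "destination", "leave at"],
          "train": ["departure", "destination", "day"]}
nodes, A = build_schema_graph(schema, overlapping_slots=[("departure", "destination")])
print(len(nodes), A.sum())
```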
Figure 3: The overview of the proposed CSFN-DST. It takes the current dialogue utterance, the previous dialogue state and the schema graph as the input and predicts the current dialogue state. It consists of an embedding layer, context and schema fusion networks, a slot-gate classifier and an RNN-based value decoder.

4.1 Input Embeddings

Besides token and position embeddings for encoding literal information, segment embeddings are also exploited to discriminate different types of input tokens.

(1) Dialogue Utterance We denote the representation of the dialogue utterances at the t-th turn as a joint sequence, X_t = [CLS] ⊕ A_t ⊕ ; ⊕ U_t ⊕ [SEP], where [CLS] and [SEP] are auxiliary tokens for separation and ⊕ is the operation of sequence concatenation. As [CLS] is designed to capture the sequence embedding, it has a different segment type from the other tokens. The input embeddings of X_t are the sum of the token embeddings, the segment embeddings and the position embeddings (Vaswani et al., 2017), as shown in Fig. 3.

(2) Previous Dialogue State As mentioned before, a dialogue state is a set of domain-slot-value triplets with a mentioned value (not NONE). Therefore, we denote the previous dialogue state as B_{t−1} = [CLS] ⊕ R^1_{t−1} ⊕ · · · ⊕ R^K_{t−1}, where K is the number of triplets in B_{t−1}. Each triplet d-s-v is denoted as a sub-sequence, i.e., R = d ⊕ - ⊕ s ⊕ - ⊕ v. The domain and slot names are tokenized, e.g., price range is replaced with "price range". The value is also represented as a token sequence. The special value DONTCARE, which means the user does not care about the value, is replaced with "dont care". The input embeddings of B_{t−1} are the sum of the token, segment and position embeddings. Positions are re-enumerated for different triplets.

(3) Schema Graph As mentioned before, the schema graph G is comprised of M domain nodes, N slot nodes and J domain-slot nodes. These nodes are arranged as G = d_1 ⊕ · · · ⊕ d_M ⊕ s_1 ⊕ · · · ⊕ s_N ⊕ o_1 ⊕ · · · ⊕ o_J. Each node embedding is initialized by averaging the embeddings of the tokens in the corresponding domain/slot/domain-slot. Position embeddings are omitted in the graph. The edges of the graph are represented as an adjacency matrix A^G whose items are either one or zero, which is used in the fusion network. Since edges between different types of nodes can behave differently in the computation, we exploit node types to get segment embeddings.
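As a concrete illustration of the serialization in Section 4.1, the sketch below builds the token sequences X_t and B_{t−1} (with per-triplet position indices); the whitespace tokenizer and the example dialogue state are assumptions made for illustration, not the paper's preprocessing pipeline.

```python
# Minimal sketch of serializing the CSFN-DST inputs (Section 4.1).
# The whitespace "tokenizer" and the example state are illustrative assumptions.

def tokenize(text):
    return text.lower().split()

def build_utterance_sequence(system_utt, user_utt):
    # X_t = [CLS] + A_t + ";" + U_t + [SEP]
    return ["[CLS]"] + tokenize(system_utt) + [";"] + tokenize(user_utt) + ["[SEP]"]

def build_state_sequence(prev_state):
    # B_{t-1} = [CLS] + R^1 + ... + R^K, each triplet R = d + "-" + s + "-" + v.
    tokens, positions = ["[CLS]"], [0]
    for (domain, slot, value) in prev_state:
        if value in (None, "none"):
            continue  # only triplets with a mentioned value are kept
        value = "dont care" if value == "dontcare" else value
        triplet = tokenize(domain) + ["-"] + tokenize(slot) + ["-"] + tokenize(value)
        tokens += triplet
        positions += list(range(len(triplet)))  # positions re-enumerated per triplet
    return tokens, positions

if __name__ == "__main__":
    x_t = build_utterance_sequence("How about train tr3934?",
                                   "Can I get tickets for my whole group?")
    b_prev, pos = build_state_sequence([("hotel", "price range", "expensive"),
                                        ("train", "book people", "dontcare")])
    print(x_t)
    print(b_prev, pos)
```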
4.2 Context and Schema Fusion Network

At this point, we have input representations H^G_0 ∈ R^{|G|×d_model}, H^{X_t}_0 ∈ R^{|X_t|×d_model} and H^{B_{t−1}}_0 ∈ R^{|B_{t−1}|×d_model}, where |.| gets the token or node number. The context and schema fusion network (CSFN) is utilized to compute hidden states for the tokens or nodes in X_t, B_{t−1} and G layer by layer. We apply a stack of L context- and schema-aware self-attention layers to get the final hidden states H^G_L, H^{X_t}_L, H^{B_{t−1}}_L. The i-th layer (0 ≤ i < L) can be formulated as:

H^G_{i+1}, H^{X_t}_{i+1}, H^{B_{t−1}}_{i+1} = CSFNLayer_i(H^G_i, H^{X_t}_i, H^{B_{t−1}}_i)

4.2.1 Multi-head Attention

Before describing the fusion network, we first introduce multi-head attention (Vaswani et al., 2017), which is a basic module. Multi-head attention can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Consider a source sequence of vectors Y = {y_i}_{i=1}^{|Y|} where y_i ∈ R^{1×d_model} and Y ∈ R^{|Y|×d_model}, and a target sequence of vectors Z = {z_j}_{j=1}^{|Z|} where z_j ∈ R^{1×d_model} and Z ∈ R^{|Z|×d_model}. For each vector y_i, we can compute an attention vector c_i over Z by using H heads as follows:

e^{(h)}_{ij} = (y_i W^{(h)}_Q)(z_j W^{(h)}_K)^T / \sqrt{d_model / H}
a^{(h)}_{ij} = exp(e^{(h)}_{ij}) / Σ_{l=1}^{|Z|} exp(e^{(h)}_{il})
c^{(h)}_i = Σ_{j=1}^{|Z|} a^{(h)}_{ij} (z_j W^{(h)}_V)
c_i = Concat(c^{(1)}_i, · · · , c^{(H)}_i) W_O

where 1 ≤ h ≤ H, W_O ∈ R^{d_model×d_model}, and W^{(h)}_Q, W^{(h)}_K, W^{(h)}_V ∈ R^{d_model×(d_model/H)}. We can compute c_i for every y_i and get a transformed matrix C ∈ R^{|Y|×d_model}. The entire process is denoted as a mapping MultiHead_Θ:

C = MultiHead_Θ(Y, Z)    (1)

Graph-based Multi-head Attention To apply the multi-head attention on a graph, the graph adjacency matrix A ∈ R^{|Y|×|Z|} is involved to mask unrelated nodes/tokens, where A_ij ∈ {0, 1}. Thus, e^{(h)}_{ij} is changed as:

e^{(h)}_{ij} = (y_i W^{(h)}_Q)(z_j W^{(h)}_K)^T / \sqrt{d_model / H}  if A_ij = 1;  −∞ otherwise

and Eqn. (1) is modified as:

C = GraphMultiHead_Θ(Y, Z, A)    (2)

Eqn. (1) can be treated as a special case of Eqn. (2) in which the graph is fully connected, i.e., A = 1.

4.2.2 Context- and Schema-Aware Encoding

Each layer of CSFN consists of internal and external attentions to incorporate different types of inputs. The hidden states of the schema graph G at the i-th layer are updated as follows:

I_GG = GraphMultiHead_{Θ_GG}(H^G_i, H^G_i, A^G)
E_GX = MultiHead_{Θ_GX}(H^G_i, H^{X_t}_i)
E_GB = MultiHead_{Θ_GB}(H^G_i, H^{B_{t−1}}_i)
C_G = LayerNorm(H^G_i + I_GG + E_GX + E_GB)
H^G_{i+1} = LayerNorm(C_G + FFN(C_G))

where A^G is the adjacency matrix of the schema graph, and LayerNorm(.) is the layer normalization function (Ba et al., 2016). FFN(x) is a feed-forward network with two fully-connected layers and a ReLU activation in between, i.e., FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. Similarly, more details about updating H^{X_t}_i and H^{B_{t−1}}_i are described in Appendix A. The context- and schema-aware encoding can also be simply implemented as the original transformer (Vaswani et al., 2017) with graph-based multi-head attentions.
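The graph-based multi-head attention of Eqn. (2) is ordinary scaled dot-product attention in which the adjacency matrix acts as a hard mask before the softmax. The NumPy sketch below follows that reading; the random parameter initialization and the per-head loop are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def graph_multi_head_attention(Y, Z, A, H, rng=np.random.default_rng(0)):
    """Sketch of GraphMultiHead(Y, Z, A): Y queries attend over Z, masked by A (|Y| x |Z|)."""
    d_model = Y.shape[1]
    d_head = d_model // H
    W_O = rng.normal(size=(d_model, d_model)) * 0.02
    heads = []
    for _ in range(H):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) * 0.02 for _ in range(3))
        q, k, v = Y @ W_Q, Z @ W_K, Z @ W_V            # (|Y|,d_head), (|Z|,d_head)
        e = (q @ k.T) / np.sqrt(d_head)                # compatibility scores
        e = np.where(A == 1, e, -1e9)                  # mask unrelated nodes/tokens
        a = np.exp(e - e.max(axis=-1, keepdims=True))
        a = a / a.sum(axis=-1, keepdims=True)          # attention weights per head
        heads.append(a @ v)                            # (|Y|, d_head)
    return np.concatenate(heads, axis=-1) @ W_O        # (|Y|, d_model)

# Ordinary multi-head attention (Eqn. 1) is the special case with a fully connected graph:
# C = graph_multi_head_attention(Y, Z, np.ones((len(Y), len(Z))), H)
```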
4.3 State Prediction

The goal of state prediction is to produce the next dialogue state B_t, which is formulated in two stages: 1) We first apply a slot-gate classifier for each domain-slot node. The classifier makes a decision among {NONE, DONTCARE, PTR}, where NONE denotes that a domain-slot pair is not mentioned at this turn, DONTCARE implies that the user can accept any value for this slot, and PTR represents that the slot should be filled with a value. 2) For domain-slot pairs tagged with PTR, we further introduce an RNN-based value decoder to generate the token sequences of their values.

4.3.1 Slot-gate Classification

We utilize the final hidden vector of the j-th domain-slot node in G for the slot-gate classification, and the probability for the j-th domain-slot pair at the t-th turn is calculated as:

P^{gate}_{tj} = softmax(FFN(H^G_{L,M+N+j}))

The loss for slot-gate classification is

L_gate = − Σ_{t=1}^{T} Σ_{j=1}^{J} log(P^{gate}_{tj} · (y^{gate}_{tj})^T)

where y^{gate}_{tj} is the one-hot gate label for the j-th domain-slot pair at turn t.

4.3.2 RNN-based Value Decoder

After the slot-gate classification, there are J' domain-slot pairs tagged with the PTR class, which indicates that the domain-slot should take a real value. They are denoted as C_t = {j | argmax(P^{gate}_{tj}) = PTR}, and J' = |C_t|.

We use a Gated Recurrent Unit (GRU) (Cho et al., 2014) decoder like Wu et al. (2019) and the soft copy mechanism (See et al., 2017) to get the final output distribution P^{value,k}_{tj} over all candidate tokens at the k-th step. More details are illustrated in
Appendix B. The loss function for value decoder is Domain Slots Train Valid Test Restaurant area, food, name, 3813 438 437 T X X price range, book day, X book people, book Lvalue = − log(Ptjvalue,k · (ytj value,k > ) ) time t=1 j∈Ct k Hotel area, internet, name, 3381 416 394 parking, price range, value,k stars, type, book day, where ytj is the one-hot token label for the j-th book people, book domain-slot pair at k-th step. stay During training process, the above modules can Train arrive by, day, de- 3103 484 494 parture, destination, be jointly trained and optimized by the summations leave at, book people of different losses as: Taxi arrive by, departure, 1654 207 195 destination, leave at Attraction area, name, type 2717 401 395 Ltotal = Lgate + Lvalue Total 8438 1000 1000 5 Experiment Table 1: Data statistics of MultiWOZ2.1. 5.1 Datasets We use MultiWOZ 2.0 (Budzianowski et al., 2018) (Hashimoto et al., 2017). We do a grid search over and MultiWOZ 2.1 (Eric et al., 2019) to evaluate {4, 5, 6, 7, 8} for the layer number of CSFN on the our approach. MultiWOZ 2.0 is a task-oriented validation set. We use a batch size of 32. The dataset of human-human written conversations DST model is trained using ADAM (Kingma and spanning over seven domains, consists of 10348 Ba, 2014) with the learning rate of 1e-4. During multi-turn dialogues. MultiWOZ 2.1 is a revised training, we use the ground truth of the previous version of MultiWOZ 2.0, which is re-annotated dialogue state and the ground truth value tokens. with a different set of inter-annotators and also In the inference, the predicted dialogue state of the canonicalized entity names. According to the work last turn is applied, and we use a greedy search of Eric et al. (2019), about 32% of the state anno- strategy in the decoding process of the value de- tations is corrected so that the effect of noise is coder. counteracted. Note that hospital and police are excluded since 5.3 Baseline Models they appear in training set with a very low fre- We make a comparison with the following exist- quency, and they do not even appear in the test set. ing models, which are either predefined ontology- To this end, five domains (restaurant, train, hotel, based DSTs or open-vocabulary based DSTs. Pre- taxi, attraction) are involved in the experiments defined ontology-based DSTs have the advantage with 17 distinct slots and 30 domain-slot pairs. of access to the known candidate set of each slot, We follow similar data pre-processing proce- while these approaches may not be applicable in dures as Wu et al. (2019) on both MultiWOZ 2.0 the real scenario. and 2.1. 2 The resulting corpus includes 8,438 FJST (Eric et al., 2019): It exploits a bidirectional multi-turn dialogues in training set with an aver- LSTM network to encode the dialog history and a age of 13.5 turns per dialogue. Data statistics of separate FFN to predict the value for each slot. MultiWOZ 2.1 is shown in Table 1. The adjacency HJST (Eric et al., 2019): It encodes the dialogue matrix AG of MultiWOZ 2.0 and 2.1 datasets is history using an LSTM like FJST, but utilizes a shown in Figure 4 of Appendix, while domain-slot hierarchical network. pairs are omitted due to space limitations. SUMBT (Lee et al., 2019): It exploits BERT (De- vlin et al., 2018) as the encoder for the dialogue 5.2 Experiment Settings context and slot-value pairs. After that, it scores We set the hidden size of CSFN, dmodel , as 400 every candidate slot-value pair with the dialogue with 4 heads. Following Wu et al. 
(2019), the context by using a distance measure. token embeddings with 400 dimensions are ini- HyST (Goel et al., 2019): It is a hybrid approach tialized by concatenating Glove embeddings (Pen- based on hierarchical RNNs, which incorporates nington et al., 2014) and character embeddings both a predefined ontology-based setting and an 2 https://github.com/budzianowski/ open-vocabulary setting. multiwoz DST-Reader (Gao et al., 2019): It models the DST
Models BERT used MultiWOZ 2.0 MultiWOZ 2.1 HJST (Eric et al., 2019)* 7 38.40 35.55 FJST (Eric et al., 2019)* 7 40.20 38.00 SUMBT (Lee et al., 2019) 3 42.40 - HyST (Goel et al., 2019)* 7 42.33 38.10 predefined DS-DST (Zhang et al., 2019) 3 - 51.21 ontology DST-Picklist (Zhang et al., 2019) 3 - 53.30 DSTQA (Zhou and Small, 2019) 7 51.44 51.17 SST (Chen et al., 2020) 7 51.17 55.23 DST-Span (Zhang et al., 2019) 3 - 40.39 DST-Reader (Gao et al., 2019)* 7 39.41 36.40 TRADE (Wu et al., 2019)* 7 48.60 45.60 COMER (Ren et al., 2019) 3 48.79 - open- NADST (Le et al., 2020) 7 50.52 49.04 vocabulary SOM-DST (Kim et al., 2019) 3 51.72 53.01 CSFN-DST (ours) 7 49.59 50.81 CSFN-DST + BERT (ours) 3 51.57 52.88 SOM-DST (our implementation) 3 51.66 52.85 SOM-DST + Schema Graph (ours) 3 52.23 53.19 Table 2: Joint goal accuracy (%) on the test set of MultiWOZ 2.0 and 2.1. * indicates a result borrowed from Eric et al. (2019). 3means that a BERT model (Devlin et al., 2018) with contextualized word embeddings is utilized. from the perspective of text reading comprehen- jointly encode the previous state, the previous and sions, and get start and end positions of the corre- current dialogue utterances. An RNN-decoder is sponding text span in the dialogue context. also applied to generate values for slots that need DST-Span (Zhang et al., 2019): It treats all to be updated in the open-vocabulary setting. domain-slot pairs as span-based slots like DST- Reader, and applies a BERT as the encoder. 5.4 Main Results DST-Picklist (Zhang et al., 2019): It defines picklist-based slots for classification similarly to Joint goal accuracy is the evaluation metric in our SUMBT and applies a pre-trained BERT for the experiments, which is represented as the ratio of encoder. It relies on a predefined ontology. turns whose predicted dialogue states are entirely consistent with the ground truth in the test set. DS-DST (Zhang et al., 2019): Similar to HyST, it Table 2 illustrates that the joint goal accuracy of is a hybrid system of DS-Span and DS-Picklist. CSFN-DST and other baselines on the test set of DSTQA (Zhou and Small, 2019): It models multi- MultiWOZ 2.0 and MultiWOZ 2.1 datasets. domain DST as a question answering problem, and generates a question asking for the value of each As shown in the table, our proposed CSFN-DST domain-slot pair. It heavily relies on a predefined can outperform other models except for SOM-DST. ontology, i.e., the candidate set for each slot is By combining our schema graphs with SOM-DST, known, except for five time-related slots. we can achieve state-of-the-art performances on both MultiWOZ 2.0 and 2.1 in the open-vocabulary TRADE (Wu et al., 2019): It contains a slot gate setting. Additionally, our method using BERT module for slots classification and a pointer gener- (Bert-base-uncased) can obtain very com- ator for dialogue state generation. petitive performance with the best systems in the COMER (Ren et al., 2019): It uses a hierarchical predefined ontology-based setting. When a BERT decoder to generate the current dialogue state itself is exploited, we initialize all parameters of CSFN as the target sequence. with the BERT encoder’s and initialize the to- NADST (Le et al., 2020): It uses a non- ken/position embeddings with the BERT’s. autoregressive decoding scheme to generate the current dialogue state. 
5.5 Analysis SST (Chen et al., 2020): It utilizes a graph attention matching network to fuse information from utter- In this subsection, we will conduct some ablation ances and schema graphs, and a recurrent graph studies to figure out the potential factors for the im- attention network to control state updating. How- provement of our method. (Additional experiments ever, it heavily relies on a predefined ontology. and results are reported in Appendix C, case study SOM-DST (Kim et al., 2019): It uses a BERT to is shown in Appendix D.)
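For reference, joint goal accuracy (the metric reported in Table 2 and throughout this section) counts a turn as correct only if every predicted domain-slot value matches the ground truth. A minimal sketch of this computation is given below; the state format is an assumption, and the numbers in the paper are computed with the TRADE-DST evaluation script mentioned in Section 5.6.

```python
def joint_goal_accuracy(predictions, references):
    """predictions/references: lists of dialogue states, one dict per turn,
    mapping 'domain-slot' -> value (missing or 'none' means the slot is empty)."""
    def normalize(state):
        return {k: v for k, v in state.items() if v not in (None, "", "none")}

    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / max(len(references), 1)

# Example: the second turn misses one slot value, so joint accuracy is 1/2 = 50%.
preds = [{"hotel-area": "west"}, {"hotel-area": "west", "hotel-stars": "none"}]
golds = [{"hotel-area": "west"}, {"hotel-area": "west", "hotel-stars": "4"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```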
Models Joint Acc. (%) Models Attr. Hotel Rest. Taxi Train CSFN-DST 50.81 CSFN-DST 64.78 46.29 64.64 47.35 69.79 B (-) Omit HL t−1 in the decoder 48.66 (-) No SG 65.97 45.48 62.94 46.42 67.58 (-) Omit HX t L in the decoder 48.45 (+) The previous utterance Xt−1 50.75 Table 5: Domain-specific joint accuracy on MultiWOZ 2.1. SG means Schema Graph. Table 3: Ablation studies for context information on MultiWOZ 2.1. Turn Proportion (%) w/ SG w/o SG 1 13.6 89.39 88.19 (−1.20) BERT used 2 13.6 73.87 72.87 (−1.00) Models 7 3 3 13.4 58.69 57.78 (−0.91) CSFN-DST 50.81 52.88 4 12.8 51.96 50.80 (−1.16) (-) No schema graph, AG = 1 49.93 52.50 5 11.9 41.01 39.63 (−1.38) (-) No schema graph, AG = I 49.52 52.46 6 10.7 34.51 35.15 (+0.64) (+) Ground truth of the previous state 78.73 80.35 7 9.1 27.91 29.55 (+1.64) (+) Ground truth slot-gate classifi. 77.31 80.66 8 6.3 24.73 23.23 (−1.50) (+) Ground truth value generation 56.50 59.12 9 4.0 20.55 19.18 (−1.37) 10 2.3 16.37 12.28 (−4.09) 11 1.3 12.63 8.42 (−4.21) Table 4: Joint goal accuracy(%) of ablation studies on 12 0.6 12.77 8.51 (−4.26) MultiWOZ 2.1. > 12 0.4 9.09 0.00 (−9.09) all 100 50.81 49.93 5.5.1 Effect of context information Table 6: Joint accuracies over different dialogue turns on MultiWOZ 2.1. It shows the impact of using schema Context information consists of the previous dia- graph on our proposed CSFN-DST. logue state or the current dialogue utterance, which are definitely key for the encoder. It would be in- teresting to know whether the two kinds of context slot relations is essential for joint accuracy, we information are also essential for the RNN-based further make analysis on domain-specific and turn- value decoder. As shown in Table 3, we choose to specific results. As shown in Table 5, the schema omit the top hidden states of the previous dialogue graph can benefit almost all domains except for B state (HL t−1 ) or the current utterance (HXt L ) in the Attaction (Attr.). As illustrated in Table 1, the At- RNN-based value decoder. The results show both taction domain contains only three slots, which of them are crucial for generating real values. should be much simpler than the other domains. Do we need more context? Only the current Therefore, we may say that the schema graph can dialogue utterance is utilized in our model, which help complicated domains. would be more efficient than the previous meth- The turn-specific results are shown in Table 6, ods involving all the preceding dialogue utterance. where joint goal accuracies over different dialogue However, we want to ask whether the performance turns are calculated. From the table, we can see will be improved when more context is used. In that data proportion of larger turn number becomes Table 3, it shows that incorporating the previous smaller while the larger turn number refers to more dialogue utterance Xt−1 gives no improvement, challenging conversation. From the results of the which implies that jointly encoding the current ut- table, we can find the schema graph can make im- terance and the previous dialogue state is effective provements over most dialogue turns. as well as efficient. 5.5.3 Oracle experiments 5.5.2 Effect of the schema graph The predicted dialogue state at the last turn is uti- In CSFN-DST, the schema graph with domain-slot lized in the inference stage, which is mismatched relations is exploited. To check the effectiveness with the training stage. 
An oracle experiment is of the schema graph used, we remove knowledge- conducted to show the impact of training-inference aware domain-slot relations by replacing the ad- mismatching, where ground truth of the previous jacency matrix AG as a fully connected one 1 or dialogue state is fed into CSFN-DST. The results node-independent one I. Results in Table 4 show in Table 4 show that joint accuracy can be nearly that joint goal accuracies of models without the 80% with ground truth of the previous dialogue schema graph are decreased similarly when BERT state. Other oracle experiments with ground truth is either used or not. slot-gate classification and ground truth value gen- To reveal why the schema graph with domain- eration are also conducted, as shown in Table 4.
5.5.4 Slot-gate classification value prediction that DSTQA exploits a hybrid of We conduct experiments to evaluate our model per- value classifier and span prediction layer, which formance on the slot-gate classification task. Ta- relies on a predefined ontology. ble 7 shows F1 scores of the three slot gates, i.e., SOM-DST (Kim et al., 2019) is very similar to {NONE, DONTCARE, PTR}. It seems that the pre- our proposed CSFN-DST with BERT. The main trained BERT model helps a lot in detecting slots difference between SOM-DST and CSFN-DST is of which the user doesn’t care about values. The how to exploit the previous dialogue state. For the F1 score of DONTCARE is much lower than the previous dialogue state, SOM-DST considers all others’, which implies that detecting DONTCARE domain-slot pairs and their values (if a domain- is a much challenging sub-task. slot pair contains an empty value, a special token NONE is used), while CSFN-DST only consid- Gate CSFN-DST CSFN-DST + BERT ers the domain-slot pairs with a non-empty value. NONE 99.18 99.19 Thus, SOM-DST knows which domain-slot pairs DONTCARE 72.50 75.96 PTR 97.66 98.05 are empty and would like to be filled with a value. We think that it is the strength of SOM-DST. How- Table 7: Slot-gate F1 scores on MultiWOZ 2.1. ever, we choose to omit the domain-slot pairs with an empty value for a lower computation burden, which is proved in Table 8. As shown in the last 5.6 Reproducibility two rows of Table 2, the schema graph can also We run our models on GeForce GTX 2080 Ti improve SOM-DST, which achieves 52.23% and Graphics Cards, and the average training time for 53.19% joint accuracies on MultiWOZ 2.0 and 2.1, each epoch and number of parameters in each respectively. Appendix E shows how to exploit model are provided in Table 8. If BERT is ex- schema graph in SOM-DST. ploited, we accumulate the gradients with 4 steps for a minibatch of data samples (i.e., 32/4 = 8 6 Conclusion and Future Work samples for each step), due to the limitation of GPU memory. As mentioned in Section 5.4, joint We introduce a multi-domain dialogue state tracker goal accuracy is the evaluation metric used in our with context and schema fusion networks, which experiments, and we follow the computing script involves slot relations and learns deep representa- provided in TRADE-DST 3 . tions for each domain-slot pair dependently. Slots from different domains and their relations are or- Method Time per Batch # Parameters ganized as a schema graph. Our approach outper- CSFN-DST 350ms 63M CSFN-DST + BERT 840ms 115M forms strong baselines on both MultiWOZ 2.0 and SOM-DST + SG 1160ms 115M 2.1 benchmarks. Ablation studies also show that the effectiveness of the schema graph. Table 8: Runtime and mode size of our methods. It will be a future work to incorporate rela- tions among dialogue states, utterances and domain 5.7 Discussion schemata. To further mitigate the data sparsity problem of multi-domain DST, it would be also in- The main contributions of this work may focus on teresting to incorporate data augmentations (Zhao exploiting the schema graph with graph-based at- et al., 2019) and semi-supervised learnings (Lan tention networks. Slot-relations are also utilized et al., 2018; Cao et al., 2019). in DSTQA (Zhou and Small, 2019). However, DSTQA uses a dynamically-evolving knowledge Acknowledgments graph for the dialogue context, and we use a static schema graph. 
We absorb the dialogue context We thank the anonymous reviewers for their by using the previous (predicted) dialogue state thoughtful comments. This work has been as another input. We believe that the two different supported by Shanghai Jiao Tong University usages of the slot relation graph can be complemen- Scientific and Technological Innovation Funds tary. Moreover, these two methods are different in (YG2020YQ01) and No. SKLMCPTS2020003 3 https://github.com/jasonwu0731/ Project. trade-dst
References Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür. 2019. HyST: A hybrid approach for flexible and Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- accurate dialogue state tracking. arXiv preprint ton. 2016. Layer normalization. arXiv preprint arXiv:1907.00883. arXiv:1607.06450. Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsu- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- ruoka, and Richard Socher. 2017. A joint many-task gio. 2014. Neural machine translation by jointly model: Growing a neural network for multiple NLP learning to align and translate. arXiv preprint tasks. In Proceedings of the 2017 Conference on arXiv:1409.0473. Empirical Methods in Natural Language Processing, pages 1923–1933. Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ra- Matthew Henderson, Blaise Thomson, and Jason D madan, and Milica Gašić. 2018. MultiWOZ - a Williams. 2014a. The second dialog state tracking large-scale multi-domain wizard-of-Oz dataset for challenge. In Proceedings of Annual SIGdial Meet- task-oriented dialogue modelling. In Proceedings of ing on Discourse and Dialogue, pages 263–272. the 2018 Conference on Empirical Methods in Natu- ral Language Processing, pages 5016–5026. Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Robust dialog state tracking using Ruisheng Cao, Su Zhu, Chen Liu, Jieyu Li, and Kai delexicalised recurrent neural networks and unsuper- Yu. 2019. Semantic parsing with dual learning. In vised adaptation. In Proceedings of IEEE Spoken Proceedings of the 57th Annual Meeting of the Asso- Language Technology Workshop (SLT). ciation for Computational Linguistics, pages 51–64. Matthew Henderson, Blaise Thomson, and Steve Lu Chen, Zhi Chen, Bowen Tan, Sishan Long, Mil- Young. 2014c. Word-based dialog state tracking ica Gasic, and Kai Yu. 2019. AgentGraph: To- with recurrent neural networks. In Proceedings of wards universal dialogue management with struc- Annual SIGdial Meeting on Discourse and Dialogue, tured deep reinforcement learning. IEEE/ACM pages 292–299. Transactions on Audio, Speech, and Language Pro- cessing, 27(9):1378–1391. Mettew Henderson, Blaise Thomson, and Jason D. Lu Chen, Boer Lv, Chi Wang, Su Zhu, Bowen Tan, Williams. 2014d. The third dialog state tracking and Kai Yu. 2020. Schema-guided multi-domain di- challenge. In Proceedings of IEEE Spoken Lan- alogue state tracking with graph attention neural net- guage Technology Workshop (SLT). works. In AAAI, pages 7521–7528. Sungdong Kim, Sohee Yang, Gyuwan Kim, and Sang- Lu Chen, Bowen Tan, Sishan Long, and Kai Yu. 2018. Woo Lee. 2019. Efficient dialogue state tracking Structured dialogue policy with graph neural net- by selectively overwriting memory. arXiv preprint works. In Proceedings of the 27th International arXiv:1911.03906. Conference on Computational Linguistics, pages 1257–1268. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bah- arXiv:1412.6980. danau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder ap- Ouyu Lan, Su Zhu, and Kai Yu. 2018. Semi-supervised proaches. arXiv preprint arXiv:1409.1259. training using adversarial multi-task learning for spoken language understanding. In 2018 IEEE Inter- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and national Conference on Acoustics, Speech and Sig- Kristina Toutanova. 2018. BERT: Pre-training of nal Processing (ICASSP), pages 6049–6053. IEEE. 
deep bidirectional transformers for language under- standing. arXiv preprint arXiv:1810.04805. Hung Le, Richard Socher, and Steven C.H. Hoi. 2020. Non-autoregressive dialog state tracking. In Interna- Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, tional Conference on Learning Representations. Sanchit Agarwal, Shuyag Gao, and Dilek Hakkani- Tur. 2019. MultiWOZ 2.1: Multi-domain dialogue Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim. 2019. state corrections and state tracking baselines. arXiv SUMBT: Slot-utterance matching for universal and preprint arXiv:1907.01669. scalable belief tracking. In Proceedings of the 57th Annual Meeting of the Association for Computa- Shuyang Gao, Abhishek Sethi, Sanchit Agarwal, Tagy- tional Linguistics, pages 5478–5483. oung Chung, and Dilek Hakkani-Tur. 2019. Dia- log state tracking: A neural reading comprehension Diego Marcheggiani, Joost Bastings, and Ivan Titov. approach. In Proceedings of the 20th Annual SIG- 2018. Exploiting semantics in neural machine trans- dial Meeting on Discourse and Dialogue, pages 264– lation with graph convolutional networks. arXiv 273. preprint arXiv:1804.08313.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Jason Williams, Antoine Raux, Deepak Ramachandran, Wen, Blaise Thomson, and Steve Young. 2017. Neu- and Alan Black. 2013. The dialog state tracking ral belief tracker: Data-driven dialogue state track- challenge. In Proceedings of Annual SIGdial Meet- ing. In Proceedings of the 55th Annual Meeting ing on Discourse and Dialogue, pages 404–413. of the Association for Computational Linguistics, pages 1777–1788. Jason D Williams. 2012. A belief tracking challenge task for spoken dialog systems. In NAACL-HLT Jeffrey Pennington, Richard Socher, and Christopher Workshop on Future Directions and Needs in the Manning. 2014. Glove: Global vectors for word rep- Spoken Dialog Community: Tools and Data, pages resentation. In Proceedings of the 2014 Conference 23–24. on Empirical Methods in Natural Language Process- ing (EMNLP), pages 1532–1543. Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini- Asl, Caiming Xiong, Richard Socher, and Pascale Osman Ramadan, Paweł Budzianowski, and Milica Ga- Fung. 2019. Transferable multi-domain state gener- sic. 2018. Large-scale multi-domain belief track- ator for task-oriented dialogue systems. In Proceed- ing with knowledge sharing. In Proceedings of the ings of the 57th Annual Meeting of the Association 56th Annual Meeting of the Association for Compu- for Computational Linguistics. tational Linguistics, pages 432–437. Liliang Ren, Jianmo Ni, and Julian McAuley. 2019. Puyang Xu and Qi Hu. 2018. An end-to-end approach Scalable and accurate dialogue state tracking via hi- for handling unknown slot values in dialogue state erarchical sequence generation. In Proceedings of tracking. In Proceedings of the 56th Annual Meeting the 2019 Conference on Empirical Methods in Nat- of the Association for Computational Linguistics. ural Language Processing and the 9th International Joint Conference on Natural Language Processing Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. (EMNLP-IJCNLP), pages 1876–1885. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Liliang Ren, Kaige Xie, Lu Chen, and Kai Yu. 2018. Intelligence, pages 7370–7377. Towards universal dialogue state tracking. In Pro- ceedings of the 2018 Conferenceon Empirical Meth- Steve Young, Milica Gašić, Blaise Thomson, and Ja- ods in Natural Language Processing (EMNLP), son D Williams. 2013. Pomdp-based statistical spo- pages 2780–2786. ken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The Kai Yu, Kai Sun, Lu Chen, and Su Zhu. 2015. graph neural network model. IEEE Transactions on Constrained markov bayesian polynomial for effi- Neural Networks, 20(1):61–80. cient dialogue state tracking. IEEE/ACM Transac- tions on Audio, Speech, and Language Processing, Abigail See, Peter J Liu, and Christopher D Man- 23(12):2177–2188. ning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng arXiv:1704.04368. Wu, Yao Wan, Philip S Yu, Richard Socher, and Caiming Xiong. 2019. Find or classify? dual strat- Kai Sun, Lu Chen, Su Zhu, and Kai Yu. 2014a. A gen- egy for slot-value predictions on multi-domain dia- eralized rule based tracker for dialogue state track- log state tracking. arXiv preprint arXiv:1910.03544. ing. In Proceedings of IEEE Spoken Language Tech- nology Workshop (SLT). Zijian Zhao, Su Zhu, and Kai Yu. 2019. 
Data augmen- Kai Sun, Lu Chen, Su Zhu, and Kai Yu. 2014b. The tation with atomic templates for spoken language SJTU system for dialog state tracking challenge 2. understanding. In Proceedings of the 2019 Confer- In Proceedings of the 15th Annual Meeting of the ence on Empirical Methods in Natural Language Special Interest Group on Discourse and Dialogue Processing and the 9th International Joint Confer- (SIGDIAL), pages 318–326. ence on Natural Language Processing (EMNLP- IJCNLP), pages 3628–3634. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Victor Zhong, Caiming Xiong, and Richard Socher. Kaiser, and Illia Polosukhin. 2017. Attention is all 2018. Global-locally self-attentive encoder for di- you need. In Advances in neural information pro- alogue state tracking. In Proceedings of the 56th An- cessing systems. nual Meeting of the Association for Computational Linguistics, pages 1458–1467. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Li Zhou and Kevin Small. 2019. Multi-domain dia- 2018. Graph attention networks. In Proceedings logue state tracking as dynamic knowledge graph of International Conference on Learning Represen- enhanced question answering. arXiv preprint tations. arXiv:1911.06192.
A Context- and Schema-Aware Encoding

Besides the hidden states H^G_i of the schema graph G, we show the details of updating H^{X_t}_i and H^{B_{t−1}}_i in the i-th layer of CSFN:

H^G_{i+1}, H^{X_t}_{i+1}, H^{B_{t−1}}_{i+1} = CSFNLayer_i(H^G_i, H^{X_t}_i, H^{B_{t−1}}_i)

The hidden states of the dialogue utterance X_t at the i-th layer are updated as follows:

I_XX = MultiHead_{Θ_XX}(H^{X_t}_i, H^{X_t}_i)
E_XB = MultiHead_{Θ_XB}(H^{X_t}_i, H^{B_{t−1}}_i)
E_XG = MultiHead_{Θ_XG}(H^{X_t}_i, H^G_i)
C_X = LayerNorm(H^{X_t}_i + I_XX + E_XB + E_XG)
H^{X_t}_{i+1} = LayerNorm(C_X + FFN(C_X))

where I_XX contains internal attention vectors, and E_XB and E_XG are external attention vectors.

The hidden states of the previous dialogue state B_{t−1} at the i-th layer are updated as follows:

I_BB = GraphMultiHead_{Θ_BB}(H^{B_{t−1}}_i, H^{B_{t−1}}_i, A^{B_{t−1}})
E_BX = MultiHead_{Θ_BX}(H^{B_{t−1}}_i, H^{X_t}_i)
E_BG = MultiHead_{Θ_BG}(H^{B_{t−1}}_i, H^G_i)
C_B = LayerNorm(H^{B_{t−1}}_i + I_BB + E_BX + E_BG)
H^{B_{t−1}}_{i+1} = LayerNorm(C_B + FFN(C_B))

where A^{B_{t−1}} is the adjacency matrix of the previous dialogue state. The adjacency matrix indicates that the triplets in B_{t−1} are separated from each other, while tokens in a same triplet are connected with each other. The [CLS] token is connected with all triplets, serving as a transit node.

B RNN-based Value Decoder

After the slot-gate classification, there are J' domain-slot pairs tagged with the PTR class, which indicates that the domain-slot should take a real value. They are denoted as C_t = {j | argmax(P^{gate}_{tj}) = PTR}, and J' = |C_t|.

We use a Gated Recurrent Unit (GRU) (Cho et al., 2014) decoder like Wu et al. (2019) and See et al. (2017). The hidden state g^k_{tj} ∈ R^{1×d_model} is recursively updated by taking a word embedding e^k_{tj} as the input until the [EOS] token is generated:

g^k_{tj} = GRU(g^{k−1}_{tj}, e^k_{tj})

The GRU is initialized with g^0_{tj} = H^{X_t}_{L,0} + H^{B_{t−1}}_{L,0} and e^0_{tj} = H^G_{L,M+N+j}.

The value generator transforms the hidden state into a probability distribution over the token vocabulary at the k-th step, which consists of two parts: 1) a distribution over all input tokens, and 2) a distribution over the input vocabulary. The first part is computed as

P^{ctx,k}_{tj} = softmax(ATT(g^k_{tj}, [H^{X_t}_L; H^{B_{t−1}}_L]))

where P^{ctx,k}_{tj} ∈ R^{1×(|X_t|+|B_{t−1}|)}, and ATT(., .) is a function to get attention weights (Bahdanau et al., 2014), with more details shown in Appendix B.1. The second part is calculated as

c^k_{tj} = P^{ctx,k}_{tj} [H^{X_t}_L; H^{B_{t−1}}_L]
P^{vocab,k}_{tj} = softmax([g^k_{tj}; c^k_{tj}] W_proj E^T)

where P^{vocab,k}_{tj} ∈ R^{1×d_vocab}, c^k_{tj} ∈ R^{1×d_model} is a context vector, W_proj ∈ R^{2d_model×d_model} is a trainable parameter, and E ∈ R^{d_vocab×d_model} is the token embedding matrix shared across the encoder and the decoder.

We use the soft copy mechanism (See et al., 2017) to get the final output distribution over all candidate tokens:

P^{value,k}_{tj} = p_gen P^{vocab,k}_{tj} + (1 − p_gen) P^{ctx,k}_{tj}
p_gen = sigmoid([g^k_{tj}; e^k_{tj}; c^k_{tj}] W_gen)

where W_gen ∈ R^{3d_model×1} is a trainable parameter.

The loss function for the value decoder is

L_value = − Σ_{t=1}^{T} Σ_{j∈C_t} Σ_k log(P^{value,k}_{tj} · (y^{value,k}_{tj})^T)

where y^{value,k}_{tj} is the one-hot token label for the j-th domain-slot pair at the k-th step.

B.1 Attention Weights

For the attention mechanism computing P^{ctx,k}_{tj} in the RNN-based value decoder, we follow Bahdanau et al. (2014) and define the ATT(., .) function as:

u_i = tanh(x W^{att}_1 + h_i W^{att}_2 + b^{att}) v^T
a_i = exp(u_i) / Σ_{j=1}^{S} exp(u_j)
a = {a_1, · · · , a_S} = ATT(x, H)

where x ∈ R^{1×d}, H ∈ R^{S×d}, W^{att}_1 ∈ R^{d×d}, W^{att}_2 ∈ R^{d×d}, b^{att} ∈ R^{1×d}, v ∈ R^{1×d}, and h_i is the i-th row vector of H. Therefore, ATT(x, H) returns an attention distribution of x over H.
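To illustrate the value decoder of Appendix B and the attention function of Appendix B.1, the following NumPy sketch performs one decoding step: it computes the copy distribution over context tokens, the vocabulary distribution, and mixes them with p_gen. The shapes follow the equations above, but the randomly initialized parameters and the fact that the two distribution parts are returned separately (rather than scattered into a single distribution over candidate tokens) are simplifications for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)

def bahdanau_att(x, H, rng=np.random.default_rng(0)):
    """Sketch of ATT(x, H) from Appendix B.1 with randomly initialized parameters."""
    d = x.shape[1]
    W1, W2 = rng.normal(size=(d, d)) * 0.02, rng.normal(size=(d, d)) * 0.02
    b, v = np.zeros((1, d)), rng.normal(size=(1, d)) * 0.02
    u = np.tanh(x @ W1 + H @ W2 + b) @ v.T          # (n_ctx, 1) scores
    return softmax(u.T)                              # (1, n_ctx) attention weights

def value_decoder_step(g_k, e_k, H_ctx, W_proj, W_gen, E, att):
    """One step of the soft-copy value decoder (Appendix B).
    g_k: (1, d) GRU hidden state; e_k: (1, d) current input embedding;
    H_ctx: (n_ctx, d) concatenated encoder states [H^Xt_L; H^Bt-1_L];
    E: (vocab, d) shared token embedding matrix; att: attention-weight function (B.1)."""
    p_ctx = att(g_k, H_ctx)                                   # (1, n_ctx) copy distribution
    c_k = p_ctx @ H_ctx                                       # (1, d) context vector
    p_vocab = softmax(np.concatenate([g_k, c_k], axis=-1) @ W_proj @ E.T)
    p_gen = 1.0 / (1.0 + np.exp(-np.concatenate([g_k, e_k, c_k], axis=-1) @ W_gen))
    # In the paper, the copy part is merged onto the corresponding tokens; here the
    # weighted vocabulary and copy parts are simply returned as a pair.
    return p_gen * p_vocab, (1.0 - p_gen) * p_ctx

if __name__ == "__main__":
    d, n_ctx, vocab = 8, 5, 20
    rng = np.random.default_rng(1)
    g, e = rng.normal(size=(1, d)), rng.normal(size=(1, d))
    H = rng.normal(size=(n_ctx, d))
    W_proj = rng.normal(size=(2 * d, d)) * 0.02
    W_gen = rng.normal(size=(3 * d, 1)) * 0.02
    E = rng.normal(size=(vocab, d)) * 0.02
    p_vocab_part, p_copy_part = value_decoder_step(g, e, H, W_proj, W_gen, E, bahdanau_att)
    print(p_vocab_part.sum() + p_copy_part.sum())  # ~1.0
```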
C Additional Results

Domain-specific Results Domain-specific accuracy is the joint goal accuracy measured on the subset of the predicted dialogue state that only contains the slots belonging to one domain. From the results of Table 9, we can find that BERT makes improvements on all domains, and the improvement on the Taxi domain is the largest.

Table 9: Domain-specific joint accuracy on MultiWOZ 2.1.
Domain | CSFN-DST | CSFN-DST + BERT
Attraction | 64.78 | 67.82
Hotel | 46.29 | 48.80
Restaurant | 64.64 | 65.23
Taxi | 47.35 | 53.58
Train | 69.79 | 70.91

Slot-specific Results Slot-specific F1 score is measured for predicting slot-value pairs of the corresponding slot. Table 10 shows the slot-specific F1 scores of CSFN-DST without the schema graph, CSFN-DST and CSFN-DST with BERT on the test set of MultiWOZ 2.1.

D Case Study

We also conduct a case study on the test set of MultiWOZ 2.1, and four cases are shown in Table 11. From the first three cases, we can see that the schema graph helps copy values from related slots in the memory (i.e., the previous dialogue state). In case C1, the model makes the accurate reference of the phrase "whole group" through the context, and the value of restaurant-book people is copied as the value of train-book people. We can also see a failed case (C4). It is too complicated to infer the departure and destination from the single word "commute".

E SOM-DST with Schema Graph

For SOM-DST (Kim et al., 2019), the input tokens to the state operation predictor are the concatenation of the previous turn dialog utterances, the current turn dialog utterances, and the previous turn dialog state:

X_t = [CLS] ⊕ D_{t−1} ⊕ D_t ⊕ B_{t−1},

where D_{t−1} and D_t are the last and current utterances, respectively. The dialogue state B_t is denoted as B_t = B^1_t ⊕ . . . ⊕ B^J_t, where B^j_t = [SLOT]^j ⊕ S^j ⊕ − ⊕ V^j_t is the representation of the j-th slot-value pair. To incorporate the schema graph, we exploit the special token [SLOT]^j to replace the domain-slot node o_j in the schema graph (j = 1, · · · , J). Then, the domain and slot nodes G' = d_1 ⊕ · · · ⊕ d_M ⊕ s_1 ⊕ · · · ⊕ s_N are concatenated into X_t, i.e.,

X_t = [CLS] ⊕ D_{t−1} ⊕ D_t ⊕ B_{t−1} ⊕ G',

where the relations among domain, slot and domain-slot nodes are also considered in the attention masks of BERT.
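As a rough illustration of Appendix E, the sketch below assembles the extended SOM-DST input X_t = [CLS] ⊕ D_{t−1} ⊕ D_t ⊕ B_{t−1} ⊕ G' and a toy attention mask that wires each [SLOT]_j token to its domain and slot nodes; the node naming, the subset of edges shown, and the mask layout are all assumptions for illustration and not the released SOM-DST code.

```python
# Sketch of extending the SOM-DST input with schema-graph nodes (Appendix E).
# Node names, the edge subset and the mask layout are illustrative assumptions.

def build_som_dst_input(prev_utt, curr_utt, prev_state, domains, slots, domain_slot_pairs):
    """prev_state: token list containing one '[SLOT]j' marker per domain-slot pair j.
    Returns the extended token sequence X_t and a 0/1 attention mask in which each
    [SLOT]j marker is additionally connected to its domain node and slot node."""
    graph_nodes = [f"<domain:{d}>" for d in domains] + [f"<slot:{s}>" for s in slots]
    tokens = ["[CLS]"] + prev_utt + curr_utt + prev_state + graph_nodes
    n = len(tokens)
    pos = {tok: i for i, tok in enumerate(tokens)}  # graph/[SLOT] tokens are unique

    mask = [[1] * n for _ in range(n)]              # regular tokens attend everywhere
    start = n - len(graph_nodes)
    for i in range(start, n):                       # graph nodes start isolated ...
        for k in range(n):
            mask[i][k] = mask[k][i] = 1 if i == k else 0
    for j, (d, s) in enumerate(domain_slot_pairs):  # ... then schema edges are added
        for node in (f"<domain:{d}>", f"<slot:{s}>"):
            mask[pos[f"[SLOT]{j}"]][pos[node]] = 1
            mask[pos[node]][pos[f"[SLOT]{j}"]] = 1
    # Domain-domain and slot-domain edges from Section 3 would be added analogously.
    return tokens, mask

tokens, mask = build_som_dst_input(
    ["how", "about", "tr3934", "?"], ["book", "it", "please"],
    ["[SLOT]0", "train", "-", "day", "-", "saturday"],
    domains=["train"], slots=["day"], domain_slot_pairs=[("train", "day")])
print(len(tokens), sum(map(sum, mask)))
```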
Figure 4: Adjacency matrix A^G of the MultiWOZ 2.0 and 2.1 datasets. It contains only domain and slot nodes, while domain-slot pairs are omitted due to space limitations. The first five items are domains ("attraction, hotel, restaurant, taxi, train"), and the rest are slots.
Domain-slot CSFN-DST (no SG) CSFN-DST CSFN-DST + BERT attraction-area 91.67 91.92 92.81 attraction-name 78.28 77.77 79.55 attraction-type 90.95 90.89 91.97 hotel-area 84.59 84.21 84.86 hotel-book day 97.16 97.79 97.03 hotel-book people 95.43 96.14 97.35 hotel-book stay 96.04 97.00 96.98 hotel-internet 86.79 89.98 86.58 hotel-name 82.97 83.11 84.61 hotel-parking 86.68 87.66 86.07 hotel-price range 89.09 90.10 92.56 hotel-stars 91.51 93.49 93.34 hotel-type 77.58 77.87 82.12 restaurant-area 93.73 94.27 94.09 restaurant-book day 97.75 97.66 97.42 restaurant-book people 96.79 96.67 97.84 restaurant-book time 92.43 91.60 94.29 restaurant-food 94.90 94.18 94.48 restaurant-name 80.72 81.39 80.59 restaurant-price range 93.47 94.28 94.49 taxi-arrive by 78.81 81.09 86.08 taxi-departure 73.15 71.39 75.15 taxi-destination 73.79 78.06 79.83 taxi-leave at 77.29 80.13 88.06 train-arrive by 86.43 87.56 88.77 train-book people 89.59 91.41 92.33 train-day 98.44 98.41 98.52 train-departure 95.91 96.52 96.22 train-destination 97.08 97.06 96.13 train-leave at 69.97 70.97 74.50 Joint Acc. overall 49.93 50.81 52.88 Table 10: Slot-specific F1 scores on MultiWOZ 2.1. SG means Schema Graph. The results in bold black are the best slot F1 scores.
(restaurant-book day, friday), (restaurant-book people, 8), (restaurant-book time, 10:15), Previous DS: (restaurant-name, restaurant 2 two), (train-leave at, 12:15), (train-destination, peterborough), (train-day, saturday), (train-departure, cambridge) System: How about train tr3934? It leaves at 12:34 & arrives at 13:24. Travel time is 50 minutes. Human: That sounds fine. Can I get tickets for my whole group please? (restaurant-name, restaurant 2 two), (restaurant-book day, friday), (restaurant-book people, 8), (restaurant-book time, 10:15), (train-departure, cambridge), Gold DS: (train-leave at, 12:15), (train-day, saturday), (train-destination, peterborough), (train-book people, 8) C1 (restaurant-name, restaurant 2 two), (restaurant-book day, friday), (restaurant-book people, 8), (restaurant-book time, 10:15), (train-departure, cambridge), CSFN-DST (no SG): (train-leave at, 12:15), (train-day, saturday), (train-destination, peterborough), (train-book people, 1) (restaurant-name, restaurant 2 two), (restaurant-book day, friday), (restaurant-book people, 8), (restaurant-book time, 10:15), (train-departure, cambridge), CSFN-DST: (train-leave at, 12:15), (train-day, saturday), (train-destination, peterborough), (train-book people, 8) (hotel-area, west), (hotel-price range, cheap), (hotel-type, guest house), Previous DS: (hotel-internet, yes), (hotel-name, warkworth house), (restaurant-area, centre), (restaurant-food, italian), (restaurant-price range, cheap), (restaurant-name, ask) System: 01223364917 is the phone number. 12 bridge street city centre, cb21uf is the address. Human: Thanks. I will also need a taxi from the hotel to the restaurant. Will you handle this? (hotel-area, west), (hotel-price range, cheap), (hotel-type, guest house), (hotel-internet, yes), (hotel-name, warkworth house), (restaurant-area, centre), Gold DS: (restaurant-food, italian), (restaurant-price range: cheap), (restaurant-name, ask), (taxi-departure, warkworth house), (taxi-destination, ask) C2 (hotel-area, west), (hotel-price range, cheap), (hotel-type, guest house), (hotel-internet, yes), (hotel-name, warkworth house), (restaurant-area, centre), CSFN-DST (no SG): (restaurant-food, italian), (restaurant-price range: cheap), (restaurant-name, ask), (taxi-departure, warkworth house), (taxi-destination, warkworth house) (hotel-area, west), (hotel-price range, cheap), (hotel-type, guest house), (hotel-internet, yes), (hotel-name, warkworth house), (restaurant-area, centre), CSFN-DST: (restaurant-food, italian), (restaurant-price range: cheap), (restaurant-name, ask), (taxi-departure, warkworth house), (taxi-destination, ask) (attraction-area, east), (attraction-name, funky fun house), (restaurant-area, east), Previous DS: (restaurant-food, indian), (restaurant-price range, moderate), (restaurant-name, curry prince) System: cb58jj is there postcode. Their address is 451 newmarket road fen ditton. Great, thank you! Also, can you please book me a taxi between the restaurant and funky Human: fun house? I want to leave the restaurant by 01:30. 
(attraction-area, east), (attraction-name, funky fun house), (restaurant-area, east), (restaurant-food, indian), (restaurant-price range, moderate), Gold DS: (restaurant-name, curry prince), (taxi-departure, curry prince), (taxi-destination, funky fun house), (taxi-leave at, 01:30) C3 (attraction-area, east), (attraction-name, funky fun house), (restaurant-area, east), (restaurant-food, indian), (restaurant-price range, moderate), CSFN-DST (no SG): (restaurant-name, curry prince), (taxi-departure, curry garden), (taxi-destination, funky fun house), (taxi-leave at, 01:30) (attraction-area, east), (attraction-name, funky fun house), (restaurant-area, east), (restaurant-food, indian), (restaurant-price range, moderate), CSFN-DST: (restaurant-name, curry prince), (taxi-departure, curry prince), (taxi-destination, funky fun house), (taxi-leave at, 01:30) (hotel-name, a and b guest house), (hotel-book day, tuesday), (hotel-book people, 6), Previous DS: (hotel-book stay, 4), (attraction-area, west), (attraction-type, museum) Cafe jello gallery has a free entrance fee. The address is cafe jello gallery, 13 magdalene System: street and the post code is cb30af. Can I help you with anything else? Human: Yes please. I need a taxi to commute. (hotel-name, a and b guest house), (hotel-book day, tuesday), (hotel-book people, 6), Gold DS: (hotel-book stay, 4), (attraction-area, west), (attraction-type, museum), C4 (taxi-destination, cafe jello gallery), (taxi-departure, a and b guest house) (hotel-name, a and b guest house), (hotel-book day, tuesday), (hotel-book people, 6), CSFN-DST (no SG): (hotel-book stay, 4), (attraction-area, west), (attraction-type, museum) (hotel-name, a and b guest house), (hotel-book day, tuesday), (hotel-book people, 6), CSFN-DST: (hotel-book stay, 4), (attraction-area, west), (attraction-type, museum), (taxi-destination, cafe jello gallery) Table 11: Four cases on the test set of MultiWOZ 2.1. DS means Dialogue State, and SG means Schema Graph.