Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing
Osman Ramadan, Paweł Budzianowski, Milica Gašić
Department of Engineering, University of Cambridge, U.K.
{oior2,pfb30,mg436}@cam.ac.uk

Abstract

Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling up with domains because of the dependency of the model parameters on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogues dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

1 Introduction

Spoken Dialogue Systems (SDS) are computer programs that can hold a conversation with a human. These can be task-based systems that help the user achieve specific goals, e.g. finding and booking hotels or restaurants. In order for the SDS to infer the user goals/intentions during the conversation, its Belief Tracking (BT) component maintains a distribution of states, called a belief state, across dialogue turns (Young et al., 2010). The belief state is used by the system to take actions in each turn until the conversation is concluded and the user goal is achieved. In order to extract these belief states from the conversation, traditional approaches use a Spoken Language Understanding (SLU) unit that utilizes a semantic dictionary to hold all the key terms, rephrasings and alternative mentions of a belief state. The SLU then delexicalises each turn using this semantic dictionary before passing it to the BT component (Wang and Lemon, 2013; Henderson et al., 2014b; Williams, 2014; Zilka and Jurcicek, 2015; Perez and Liu, 2016; Rastogi et al., 2017). However, this approach is not scalable to multi-domain dialogues because of the effort required to define a semantic dictionary for each domain. More advanced approaches, such as the Neural Belief Tracker (NBT), use word embeddings to alleviate the need for delexicalisation and combine the SLU and BT into one unit, mapping directly from turns to belief states (Mrkšić et al., 2017). Nevertheless, the NBT model does not tackle the problem of mixing different domains in a conversation. Moreover, as each slot is trained independently without sharing information between different slots, scaling such approaches to large multi-domain systems is greatly hindered.

In this paper, we propose a model that jointly identifies the domain and tracks the belief states corresponding to that domain. It uses semantic similarity between ontology terms and turn utterances to allow for parameter sharing, both between different slots within a single domain and across domains. In addition, the model parameters are independent of the ontology/belief states; thus the dimensionality of the parameters does not increase with the size of the ontology, making the model practically feasible to deploy in multi-domain environments without any modifications. Finally, we introduce a new, large-scale corpus of natural, human-human conversations, providing new possibilities to train complex, neural-based models. Our model systematically improves upon state-of-the-art neural approaches both in single and multi-domain conversations.
2 Background

The belief states of the BT are defined based on an ontology: the structured representation of the database which contains entities the system can talk about. The ontology defines the terms over which the distribution is to be tracked in the dialogue. This ontology is constructed in terms of slots and values in a single-domain setting or, alternatively, in terms of domains, slots and values in a multi-domain environment. Each domain consists of multiple slots and each slot contains several values, e.g. domain=hotel, slot=price, value=expensive. In each turn, the BT fits a distribution over the values of each slot in each domain, and a none value is added to each slot to indicate if the slot is not mentioned, so that the distribution sums up to 1. The BT then passes these states to the Policy Optimization unit as full probability distributions to take actions. This allows robustness to noisy environments (Young et al., 2010). The larger the ontology, the more flexible and multi-purposed the system is, but the harder it is to train and maintain a good quality BT.

3 Related Work

In recent years, a plethora of research has been generated on belief tracking (Williams et al., 2016). For the purposes of this paper, two previously proposed models are particularly relevant.

3.1 Neural Belief Tracker (NBT)

The main idea behind the NBT (Mrkšić et al., 2017) is to use semantically specialized pre-trained word embeddings to encode the user utterance, the system act and the candidate slots and values taken from the ontology. These are fed to semantic decoding and context modeling modules that apply a three-way gating mechanism and pass the output to a non-linear classifier layer to produce a distribution over the values for each slot. It uses a simple update rule, $p(s_t) = \beta p(s_{t-1}) + \lambda y$, where $p(s_t)$ is the belief state at time step $t$, $y$ is the output of the binary decision maker of the NBT, and $\beta$ and $\lambda$ are tunable parameters.

The NBT leverages semantic information from the word embeddings to resolve lexical/morphological ambiguity and maximize the shared parameters across the values of each slot. However, it only applies to a single domain and does not share parameters across slots.

3.2 Multi-domain Dialogue State Tracking

Recently, Rastogi et al. (2017) proposed a multi-domain approach using delexicalized utterances fed to a two-layer stacked bi-directional GRU network to extract features from the user and the system utterances. These, combined with the candidate slots and values, are passed to a feed-forward neural network with a softmax in the last layer. The candidate set fed to the network consists of the selected candidates from the previous turn and candidates from the ontology up to a limit $K$, which restricts the maximum size of the chosen set. Consequently, the model does not need an ad-hoc belief state update mechanism like in the NBT.

The parameters of the GRU network are defined per domain, whereas the parameters of the feed-forward network are defined per slot, allowing transfer learning across different domains. However, the model relies on delexicalization to extract the features, which limits the performance of the BT, as it does not scale to the rich variety of the language. Moreover, the number of parameters increases with the number of slots.

4 Method

The core idea is to leverage semantic similarities between the utterances and ontology terms to compute the belief state distribution. In this way, the model parameters only learn to model the interactions between turn utterances and ontology terms in the semantic space, rather than the mapping from utterances to states. Consequently, information is shared both between slots and across domains. Additionally, the number of parameters does not increase with the ontology size. Domain tracking is considered as a separate task but is learned jointly with the belief state tracking of the slots and values.

The proposed model uses semantically specialized pre-trained word embeddings (Wieting et al., 2015). To encode the user and system utterances, we employed 7 independent bi-directional LSTMs (Graves and Schmidhuber, 2005). Three of them are used to encode the system utterance for domain, slot and value tracking respectively. Similarly, three Bi-LSTMs encode the user utterance, while the last one is used to track the user affirmation. A variant of the model that uses CNNs as feature extractors, similar to the NBT-CNN (Mrkšić et al., 2017), is also employed.
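To make the encoding step concrete, the following is a minimal sketch of one such utterance encoder, assuming PyTorch; the paper does not include code, so the class name and dimensions are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Sketch of one of the 7 independent Bi-LSTM utterance encoders:
    it runs over pre-trained word embeddings and concatenates the last
    forward and backward hidden states into a single vector (of size L
    in the paper's notation). Names and sizes are hypothetical."""

    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embeddings)
        # h_n: (2, batch, hidden_dim) -> last forward and backward states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
```

Three such encoders (with separate weights) would process the system utterance and three the user utterance, for domain, slot and value tracking respectively, with a seventh tracking user affirmation.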
[Figure 1: The proposed model architecture, using Bi-LSTMs as encoders. Other variants of the model use CNNs as feature extractors (Kim, 2014; Kalchbrenner et al., 2014).]

4.1 Domain Tracking

Figure 1 presents the system architecture with two bi-directional LSTM networks as information encoders running over the word embeddings of the user and system utterances. The last hidden states of the forward and backward layers are concatenated to produce $h^d_{usr}$, $h^d_{sys}$ of size $L$ for the user and system utterances respectively. In the second variant of the model, CNNs are used to produce these vectors (Kim, 2014; Kalchbrenner et al., 2014). To detect the presence of the domain in the dialogue turn, element-wise multiplication is used as a similarity metric between the hidden states and the ontology embeddings of the domain:

$$d_k = h^d_k \odot \tanh(W_d e_d + b_d),$$

where $k \in \{usr, sys\}$, $e_d$ is the embedding vector of the domain and $W_d \in \mathbb{R}^{L \times D}$ transforms the domain word embeddings of dimension $D$ to the hidden representation. The information about semantic similarity is held by $d_{usr}$ and $d_{sys}$, which are fed to a non-linear layer to output a binary decision:

$$P_t(d) = \sigma(w_d \{d_{usr} \oplus d_{sys}\} + b_d),$$

where $w_d \in \mathbb{R}^{2L}$ and $b_d$ are learnable parameters that map the semantic similarity to a belief state probability $P_t(d)$ of a domain $d$ at a turn $t$.
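Continuing the same hedged PyTorch sketch, the two equations above could be implemented as follows; treating `e_d` as a single (e.g. summed) word embedding of the domain name is our reading, not a detail stated in this section.

```python
import torch
import torch.nn as nn

class DomainTracker(nn.Module):
    """Sketch of Section 4.1: element-wise similarity between utterance
    encodings and a projected domain embedding, then a binary decision.
    Shapes follow the equations above; otherwise illustrative only."""

    def __init__(self, L: int, D: int):
        super().__init__()
        self.proj = nn.Linear(D, L)      # W_d e_d + b_d
        self.out = nn.Linear(2 * L, 1)   # w_d {d_usr (+) d_sys} + b_d

    def forward(self, h_usr, h_sys, e_d):
        # d_k = h_k * tanh(W_d e_d + b_d) for k in {usr, sys}
        sim = torch.tanh(self.proj(e_d))
        d_usr, d_sys = h_usr * sim, h_sys * sim
        # P_t(d) = sigmoid(w_d [d_usr ; d_sys] + b_d)
        return torch.sigmoid(self.out(torch.cat([d_usr, d_sys], dim=-1)))
```

The slot and value similarity vectors $s_k$ and $v_k$ used in the next subsection would be computed analogously, with their own projection parameters.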
4.2 Candidate Slots and Values Tracking

Slots and values are tracked using a similar architecture as for domain tracking (Figure 1). However, to correctly model the context of the system-user dialogue at each turn, three different cases are considered when computing the similarity vectors:

1. Inform: The user is informing the system about his/her goal, e.g. 'I am looking for a restaurant that serves Turkish food'.

2. Request: The system is requesting information by asking the user about the value of a specific slot. For example, the system utterance is 'When do you want the taxi to arrive?' and the user answers with '19:30'.

3. Confirm: The system wants to confirm information about the value of a specific slot. For example, if the system asks 'Would you like free parking?', the user can affirm either positively or negatively. The model detects the user affirmation using a separate bi-directional LSTM or CNN to output $h^a_{usr}$.

The three cases are modelled as follows:

$$y^{s,v}_{inf} = w_{inf} \{s_{usr} \oplus v_{usr}\} + b_{inf},$$
$$y^{s,v}_{req} = w_{req} \{s_{sys} \oplus v_{usr}\} + b_{req},$$
$$y^{s,v}_{af} = w_{af} \{s_{sys} \oplus v_{sys} \oplus h^a_{usr}\} + b_{af},$$

where $s_k$, $v_k$ for $k \in \{usr, sys\}$ represent the semantic similarity between the user and system utterances and the ontology slot and value terms respectively, computed as shown in Figure 1, and $w$ and $b$ are learnable parameters. The distribution over the values of slot $s$ in domain $d$ at turn $t$ can be computed by summing the unscaled states $y_{inf}$, $y_{req}$ and $y_{af}$ for each value $v$ in $s$, and applying a softmax to normalize the distribution:

$$P_t(s, v) = \mathrm{softmax}(y^{s,v}_{inf} + y^{s,v}_{req} + y^{s,v}_{af}).$$
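Under the same assumptions, the three scoring functions could look as below; stacking the scalar scores over all candidate values of a slot and applying the softmax yields $P_t(s, v)$. Again a sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class SlotValueScorer(nn.Module):
    """Sketch of the inform/request/confirm cases of Section 4.2.
    s_usr, s_sys, v_usr, v_sys are similarity vectors between the
    utterance encodings and the ontology slot/value embeddings;
    h_a_usr encodes the user (dis)affirmation. Sizes are illustrative."""

    def __init__(self, dim: int, affirm_dim: int):
        super().__init__()
        self.w_inf = nn.Linear(2 * dim, 1)
        self.w_req = nn.Linear(2 * dim, 1)
        self.w_af = nn.Linear(2 * dim + affirm_dim, 1)

    def forward(self, s_usr, s_sys, v_usr, v_sys, h_a_usr):
        y_inf = self.w_inf(torch.cat([s_usr, v_usr], dim=-1))
        y_req = self.w_req(torch.cat([s_sys, v_usr], dim=-1))
        y_af = self.w_af(torch.cat([s_sys, v_sys, h_a_usr], dim=-1))
        # one unscaled score per (slot, value) candidate; the caller
        # softmaxes the stacked scores over all values v of slot s
        return y_inf + y_req + y_af
```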
4.3 Belief State Update

Since dialogue systems in the real world operate in noisy environments, a robust BT should utilize the flow of the conversation to reduce the uncertainty in the belief state distribution. This can be achieved by passing the output of the decision maker, at each turn, as an input to an RNN that runs over the dialogue turns as shown in Figure 1, which allows the gradients to be propagated across turns. This alleviates the problem of tuning hyper-parameters for rule-based updates. To avoid the vanishing gradient problem, three networks were tested: a simple RNN, an RNN with a memory cell (Henderson et al., 2014a) and an LSTM. The RNN with a memory cell proved to give the best results. In addition to reducing the vanishing gradient problem, this variant is less complex than an LSTM, which makes training easier. Furthermore, the variant of this RNN used for domain tracking has all its weights of the form $W_i = \alpha_i I$, where $\alpha_i$ is a distinct learnable parameter for the hidden, memory and previous state layers and $I$ is the identity matrix. Similarly, the weights of the RNN used to track the slots and values are of the form $W_j = \gamma_j I + \lambda_j (1 - I)$, where $\gamma_j$ and $\lambda_j$ are the learnable parameters. These two variants of the RNN combine ideas from the earlier work of Henderson et al. (2014a) and Mrkšić and Vulić (2018). The output is $P_{1:T}(d)$ and $P_{1:T}(s, v)$, which represent the joint probability distributions of the domains and of the slots and values respectively over the complete dialogue. Combining these together produces the full belief state distribution of the dialogue:

$$P_{1:T}(d, s, v) = P_{1:T}(d) \, P_{1:T}(s, v).$$

4.4 Training Criteria

Domain tracking and slots and values tracking are trained disjointly. Belief state labels for each turn are split into domains and slots and values. Thanks to the disjoint training, the learning of slot and value belief states is not restricted to a specific domain. Therefore, the model shares the knowledge of slots and values across different domains. The loss function for the domain tracking is:

$$L_d = -\sum_{n=1}^{N} \sum_{d \in D} t_n(d) \log P^n_{1:T}(d),$$

where $d$ is a vector of domains over the dialogue, $t_n(d)$ is the domain label for the dialogue $n$ and $N$ is the number of dialogues. Similarly, the loss function for the slots and values tracking is:

$$L_{s,v} = -\sum_{n=1}^{N} \sum_{s,v \in S,V} t_n(s, v) \log P^n_{1:T}(s, v),$$

where $s$ and $v$ are vectors of slots and values over the dialogue and $t_n(s, v)$ is the joint label vector for the dialogue $n$.
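The constrained weight matrices of Section 4.3 reduce each recurrent weight to one or two scalars. Below is a hedged sketch of just that parameterisation; the surrounding memory-cell RNN equations are not reproduced here, and the class name is our own.

```python
import torch
import torch.nn as nn

class ConstrainedWeight(nn.Module):
    """Sketch of the Section 4.3 weight constraints. Domain tracking:
    W_i = alpha_i * I (a single scalar per weight matrix). Slot/value
    tracking: W_j = gamma_j * I + lambda_j * (1 - I), tying a value's
    own state (diagonal) and the other values' states (off-diagonal)
    with separate scalars. Illustrative only."""

    def __init__(self, dim: int, slot_value: bool = False):
        super().__init__()
        self.dim, self.slot_value = dim, slot_value
        self.gamma = nn.Parameter(torch.ones(1))  # alpha_i or gamma_j
        self.lam = nn.Parameter(torch.zeros(1))   # lambda_j (domains: unused)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        eye = torch.eye(self.dim, device=state.device)
        W = self.gamma * eye
        if self.slot_value:
            W = W + self.lam * (1.0 - eye)
        return state @ W.T  # apply the constrained weight to a state vector
```

Keeping the recurrent weights this close to the identity matrix also keeps the number of learnable recurrent parameters independent of the ontology size, in line with the design goal stated in Section 4.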
5 Datasets and Baselines

Neural approaches to statistical dialogue development, especially in a task-oriented paradigm, are greatly hindered by the lack of large-scale datasets. That is why, following the Wizard-of-Oz (WOZ) approach (Kelley, 1984; Wen et al., 2017), we ran a text-based multi-domain corpus data collection scheme through Amazon MTurk. The main goal of the data collection was to acquire human-human conversations between a tourist visiting a city and a clerk from an information center. At the beginning of each dialogue the user (visitor) was given explicit instructions about the goal to fulfill, which often spanned multiple domains. The task of the system (wizard) is to assist the visitor, having access to databases covering the domains. The WOZ paradigm allowed us to obtain natural and semantically rich multi-topic dialogues spanning multiple domains such as hotels, attractions, restaurants, and booking trains or taxis. The dialogues cover from 1 up to 5 domains per dialogue, greatly varying in length and complexity.

5.1 Data Structure

The data consists of 2480 single-domain dialogues and 7375 multi-domain dialogues usually spanning from 2 up to 5 domains. Some domains also contain sub-domains, such as booking. The average sentence lengths are 11.63 and 15.01 for users and wizards respectively. The combined ontology consists of 5 domains, 27 slots and 663 values, making it significantly larger than observed in other datasets. To enforce reproducibility of results, we distribute the corpus with a pre-specified train/test/development random split. The test and development sets contain 1k examples each. Each dialogue consists of a goal, user and system utterances and a belief state per turn. The data and model are publicly available.[1]
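Purely as an illustration of the per-turn annotation described above (the released corpus defines the actual serialization, so the keys and slot names here are hypothetical), a multi-domain belief state can be pictured as nested domain-slot-value assignments:

```python
# Hypothetical sketch of one annotated turn; the field and slot names
# are illustrative, not the corpus schema.
example_turn = {
    "system": "When do you want the taxi to arrive?",
    "user": "19:30. I am also looking for an expensive hotel.",
    "belief_state": {
        "taxi": {"arrive by": "19:30"},
        "hotel": {"price": "expensive"},
        # slots that are not mentioned take the special value "none"
    },
}
```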
5.2 Evaluation

We also used the extended WOZ 2.0 dataset (Wen et al., 2017).[2] The WOZ 2.0 dataset consists of 1200 single-topic dialogues constrained to the restaurant domain. All the weights were initialised using a normal distribution of zero mean and unit variance, and biases were initialised to zero. The Adam optimizer (Kingma and Ba, 2014), with a batch size of 64, is used to train all the models for 600 epochs. Dropout (Srivastava et al., 2014) was used for regularisation (50% dropout rate on all the intermediate representations). For each of the two datasets we compare our proposed architecture (using either Bi-LSTM or CNN as encoders) to the NBT model[3] (Mrkšić et al., 2017).

6 Results

Table 1 shows the performance of our model in tracking the belief state of single-domain dialogues, compared to the NBT-CNN variant of the NBT discussed in Section 3.1. Our model outperforms the NBT in all three slots and the joint goals for the two datasets. The NBT previously achieved state-of-the-art results (Mrkšić et al., 2017). Moreover, the performance of all models is worse on the restaurant portion of the new dataset compared to WOZ 2.0. This is because the dialogues in the new dataset are richer and noisier, more closely resembling real-environment dialogues.

              WOZ 2.0                      New WOZ (only restaurants)
Slot          NBT-CNN   Bi-LSTM   CNN      NBT-CNN   Bi-LSTM   CNN
Food          88.9      96.1      96.4     78.3      84.7      85.3
Price range   93.7      98.0      97.9     92.6      95.6      93.6
Area          94.3      97.8      98.1     78.3      82.6      86.4
Joint goals   84.2      85.1      85.5     57.7      59.9      63.7

Table 1: WOZ 2.0 and new dataset test set accuracies of the NBT-CNN and the two variants of the proposed model, for the slots food, price range and area, and for the joint goals.

Table 2 presents the results on multi-domain dialogues from the new dataset described in Section 5. To demonstrate the difficulty of the multi-domain belief tracking problem, values of a theoretical baseline that samples the belief state uniformly at random are also presented. Our model gracefully handles such a difficult task. In most of the cases, CNNs demonstrate better performance than Bi-LSTMs. We hypothesize that this comes from their effectiveness at extracting local and position-invariant features, which are crucial for semantic similarities (Yin et al., 2017).

                   New WOZ (multi-domain)
Model              F1 score   Accuracy %
Uniform Sampling   0.108      10.8
Bi-LSTM            0.876      93.7
CNN                0.878      93.2

Table 2: The overall F1 score and accuracy for the multi-domain dialogues test set.[4]

7 Conclusions

In this paper, we proposed a new approach that tackles the issues of multi-domain belief tracking, such as the scalability of model parameters with the ontology size. Our model shows improved performance in single-domain tasks compared to the state-of-the-art NBT method. By exploiting semantic similarities between dialogue utterances and ontology terms, the model alleviates the need for ontology-dependent parameters and maximizes the amount of information shared between slots and across domains. In future work, we intend to investigate introducing new domains and ontology terms without further training, thus performing zero-shot learning.

[1] http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/
[2] Publicly available at https://mi.eng.cam.ac.uk/~nm480/woz_2.0.zip.
[3] Publicly available at https://github.com/nmrksic/neural-belief-tracker.
[4] The F1 score is computed by considering all the values in each slot of each domain as positive and the "none" state of the slot as negative.
Acknowledgments

The authors would like to thank Nikola Mrkšić, Jacquie Rowe, the Cambridge Dialogue Systems Group and the ACL reviewers for their constructive feedback. Paweł Budzianowski is supported by the EPSRC Council and Toshiba Research Europe Ltd, Cambridge Research Laboratory. The data collection was funded through a Google Faculty Award.

References

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014a. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, pages 360–365.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 292–299.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

John F. Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS) 2(1):26–41.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1777–1788.

Nikola Mrkšić and Ivan Vulić. 2018. Fully statistical neural belief tracking. In Proceedings of ACL.

Julien Perez and Fei Liu. 2016. Dialog state tracking, a machine reading approach using memory network. arXiv preprint arXiv:1606.04052.

Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. 2017. Scalable multi-domain dialogue state tracking. arXiv preprint arXiv:1712.10224.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference. pages 423–432.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics 3:345–358.

Jason D. Williams. 2014. Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 282–291.

Jason Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse 7(3):4–33.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.

Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language 24(2):150–174.

Lukas Zilka and Filip Jurcicek. 2015. Incremental LSTM-based dialog state tracker. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, pages 757–762.