Augmenting Transformers with KNN-Based Composite Memory for Dialog


Angela Fan
Facebook AI Research; Université de Lorraine; LORIA
angelafan@fb.com

Claire Gardent
CNRS/LORIA
claire.gardent@loria.fr

Chloé Braud
CNRS/IRIT
chloe.braud@irit.fr

Antoine Bordes
Facebook AI Research
abordes@fb.com

Transactions of the Association for Computational Linguistics, vol. 9, pp. 82–99, 2021. https://doi.org/10.1162/tacl_a_00356
Action Editor: Masaaki Nagata. Submission batch: 6/2020; Revision batch: 9/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.

1 Introduction

Machine learning approaches to various tasks, such as game-playing or dialog, are often dependent on external information. This information can take multimodal forms, including structured knowledge bases, free text, and images, and also comes in overwhelmingly large quantities. A pressing challenge is to create models that can identify which specific elements of multiple information sources are relevant in a particular context, and incorporate them into standard architectures on each task. In this work, we focus on human–machine dialog and how to efficiently retrieve external knowledge that is relevant to the dialog. We consider two scenarios and, for each scenario, retrieve two types of knowledge: (i) knowledge about similar dialog contexts and (ii) external knowledge used to ground the conversation in real-world information.

Knowledge about similar dialog contexts allows for a hybrid retrieval/generative approach to dialog where the system response is generated based not only on a representation of the current dialog context and of the relevant world knowledge, but also based on a response retrieved from a similar dialog context. The retrieved knowledge can be viewed as providing information about structure and dialog sentences, or utterances: which response is likely given a similar context?

External knowledge is also retrieved to improve the semantic content of the dialog model. In one scenario, Wizard of Wikipedia (Dinan et al., 2018), general topics are provided to crowdworkers, who are asked to have in-depth and specific conversations about these topics by referencing specific Wikipedia sentences as knowledge. In this scenario, external knowledge is retrieved from a pre-selected set of Wikipedia sentences associated with the current dialog topic. Retrieval aims to select the sentence that is most relevant at each step of the dialog and thereby to ground system responses in relevant world knowledge (e.g., by referring to Star Wars when talking about science fiction).

In the other scenario, Engaging ImageChat (Shuster et al., 2020), crowdworkers are provided with images and asked to have a conversation inspired by or about the image. In this case, the retrieved external knowledge is images and their associated dialogs. By retrieving images that are similar to the image being talked about, we aim to enrich system responses with knowledge about what is typically mentioned when describing similar images (e.g., when talking about an image with dogs, mentioning their breed).

Our work on incorporating different types and modalities of knowledge is related to methods that strive to add external memory, such as knowledge bases, to neural networks. Previous work has explored incorporating large external memories into neural network layers (Weston et al., 2015; Sukhbaatar et al., 2015, 2019; Lample et al., 2019). Many existing approaches focus on using attention over the memory slots, which is computationally intensive and becomes less effective as the size of the memory grows. In this work, we propose representing multiple sources of external information as fixed encodings and using K Nearest Neighbors (KNN) search to fetch relevant information. KNN search is computationally efficient and scalable, and libraries like faiss (Johnson et al., 2019) allow KNN to be easily used on GPUs and integrated into neural networks. Further, the external memories are pre-encoded, so the information encoding is only computed once. As the external memories are kept fixed, they do not require any training to learn the memories along with the model. We can thus scale easily to larger memories by learning only the KNN-based read operation to identify relevant information from the memory.

Our core contribution is an efficient, KNN-based Information Fetching (KIF) module that can access relevant external knowledge, combine knowledge from different sources, and integrate this information into standard sequence to sequence architectures. We apply these flexible modules to two dialog datasets that challenge generative models to leverage external information to write coherent, on-topic responses. Both of our chosen tasks require models to leverage external information, such as information from Wikipedia or images, to engage in the conversation. We show that relevant information can be identified from hundreds of thousands of candidates in a multimodal, multi-knowledge-source setting to improve the performance of generative dialog models. Further, the output of the KIF modules is interpretable, as specific human-readable knowledge elements are selected, allowing users to better understand the information the generative model conditions upon when writing the subsequent utterance. On both datasets, we achieve state-of-the-art results compared to generative models and find there is no statistically significant difference in the interestingness or human preference of our model output compared to state-of-the-art retrieval models.

2 Related Work

We discuss related work on learning to incorporate external knowledge into neural networks and efficiently access relevant information. We then describe work in generative dialog that incorporates knowledge.

2.1 Incorporating External Knowledge

Augmenting neural networks with memory, or longer-term components that can be accessed with read and write operations, has been explored in various proposed architectures. For example, Memory Networks (Weston et al., 2015; Sukhbaatar et al., 2015, 2019) introduce attention mechanisms over large external memories. Neural cache models (Grave et al., 2017b) simplify these to access previous memories with a dot product. Previous work has also studied how to read and write into these memory architectures (Rae et al., 2016; Graves et al., 2014; Joulin and Mikolov, 2015). In contrast, we focus on how to read large memories.

Another line of research has focused on computational scalability for larger external memories to allow efficient access of information. For example, Chandar et al. (2016) propose a hierarchical memory network rather than a flat one and Rae et al. (2016) learn sparse operations to read and write. Lample et al. (2019) focus on learning memories of up to one million slots and how to efficiently access the slots using product keys. Khandelwal et al. (2019) use nearest neighbor operations to augment language models by performing retrieval at the token level; in contrast, we focus on multimodal retrieval of multiple pieces of knowledge based on an entire dialog context. Beyond explicit memory representations, it may be possible to store information implicitly during training time by memorizing common patterns present in text (Petroni et al., 2019). We focus on learning

to fetch relevant information from multiple explicit external multimodal knowledge sources and integrate them into one network. Further, our work allows the retrieved information to be interpreted, as each memory slot is an explicit fact that can be read as text, rather than a learned vector such as in Lample et al. (2019).

Work has also focused on computationally efficient softmax operations (Mnih and Hinton, 2009; Grave et al., 2017a; Chen et al., 2016). Many approximate softmax techniques use KNN-like operations to form clusters, and the overall softmax operation is constrained by the slow calculation of the exponential. Our usage of KNN benefits from efficient and scalable libraries such as faiss and nmslib.

2.2 Generative Dialog

We develop a general architecture for incorporating external information and apply it to the case of generative dialog models. Previous work in dialog has leveraged knowledge as necessary information to accomplish the task. For example, airline and restaurant booking tasks often use API calls to access information about reservation times and availability (Bordes et al., 2017). In contrast, our work focuses on how to incorporate unstructured knowledge, such as free text found on the Web. Previous work has used architectures that attend over the available knowledge and identify relevant pieces of information, which scales poorly with large quantities of information (Dinan et al., 2018; Qin et al., 2019; Lian et al., 2019). We replace the use of attention over external information with the output of a KNN module. Other work has investigated incorporating information retrieval in language modeling and question answering (Chen et al., 2017; Fan et al., 2019; Seo et al., 2019; Guu et al., 2020), while we focus on dialog applications and flexibly incorporating knowledge from multiple, multimodal sources.

On the modeling side, work has explored both generative (Serban et al., 2016a, 2016b) and retrieval-based models (Zhang et al., 2018), which identify the best utterance from the training set to return as the dialog response. This often leverages self-attention or cross-attention mechanisms (Humeau et al., 2019). Further work has explored hybrid models, for example, using the output of a retrieval model as input for a generative model (Dinan et al., 2018; Weston et al., 2018; Cai et al., 2019; Zhu et al., 2020). Some of this work has specialized to use both types of models to generate conversations in an ensemble (Song et al., 2016) or to specifically improve consistency (Song et al., 2020). We extend these approaches by augmenting generative models with retrieval-like operations based on KNN search, allowing dialog models to flexibly incorporate various sources of external knowledge at the same time and scale to large quantities of retrieval candidates.

3 KNN-based Information Fetching Modules

Broadly, the KIF module assumes an encoder model $M$ can access inputs $X = \{x_1, x_2, \ldots, x_n\}$. For example, $X$ can be a collection of sentences, and $x_i$ represents an individual sentence. In a setting without additional supporting information, the encoder will process an input $x_i$ and produce the encoder output $M(x_i)$. If $x_i$ is a sequence such as a sentence, then $M(x_i)$ is a representation whose size is the variable sequence length by the fixed hidden size of the encoder $M$. However, in many tasks, additional information is present, represented as $E = \{e_1, e_2, \ldots, e_m\}$. We encode each element of $X$ and $E$ into a vector representation using the encoder. To identify the closest information in $E$ that is relevant to $x_i$, our general approach will be to use KNN by comparing the representation of $x_i$ with the representation of each element in the set $E$. KNN is a fully differentiable operation (Plötz and Roth, 2018), so it can be incorporated in a straightforward way into neural models. The most relevant information in $E$ will then be available in the model. We display a KIF-Augmented model in Figure 1 and describe how the KIF module operates.

Figure 1: KIF modules fetch relevant information from multimodal external knowledge. External knowledge sources $E_1$ and $E_2$ are pre-encoded by encoder $M$ (green). In the model, input $x_i$ is encoded by encoder $M'$ (blue) to produce $M'(x_i)$. KIF modules (orange) operate on $M'(x_i)$ and identify the nearest neighbors encoded in $M(E_1)$ and $M(E_2)$ using KNN. Identified relevant elements from $E_1$ and $E_2$ are re-encoded by $M'$ in a gating mechanism with a weighted sum (represented by $\sigma(\mathrm{WS}_{1i}) \cdot \mathrm{WS}_{1i}$, where WS stands for weighted sum), then concatenated to $M'(x_i)$. Full description with notation can be found in Section 3.

One challenge to overcome is that the representations of all elements of the knowledge source $E$ are pre-computed and kept fixed, creating $M(E)$: we do not backpropagate to affect the embeddings of the pre-encoded knowledge. In the early stages of training, the model receives large amounts of loss, which would affect the quality of the pre-encoded embeddings if we backpropagated to them. Further, encoding the fixed external knowledge once and re-using it allows for greater scalability. However, this lack of backpropagation can introduce a mismatch between the encoding of $E$ and the encodings produced by a model that is training, as the training model has constantly changing representations because the weights are being learned. We use $M$ to represent the original encoder model used to encode $E$ and $M'$ to represent the constantly training model that is encoding $X$. The model must learn a function to align $M'(x_i)$ to the pre-encoded elements of the external memory $M(E)$.

To circumvent this misalignment, we learn a mapping operator $f_E(M'(x_i))$ that trains to map elements of the model's representation of $X$, or $M'(X)$, into the additional information representation space $M(E)$. Concretely, $f_E(M'(x_i))$ is a multilayer perceptron with ReLU nonlinearities. From the input elements of $X$, $f_E(M'(x_i))$ learns representations of an output close to the corresponding projection of $X$ into $E$. This can be interpreted as learning a read operation on a fixed external memory. If there was no change to the encoding of the model compared to the pre-computed knowledge, then the ideal mapping operator would be the identity function (as $M'$ would equal $M$). However, as the model changes significantly during the training process, the nonlinear mapping capability of $f_E(M'(x_i))$ is essential to be able to identify the correct knowledge $E$ from the input $X$.

Thus, a model augmented with KIF will incorporate external knowledge in the following manner. First, we find the $k$ nearest elements to $f_E(M'(x_i))$ in $M(E)$, based on KNN search with inner product. Then, the relevant elements identified by KNN are re-encoded by $M'$. For example, if element $e_j$ is retrieved by KIF, it would produce $M'(e_j)$. We use the optimized faiss library for KNN search, which can conduct billion-scale KNN efficiently on GPUs.

The KNN output for an element $x_i$ is produced by using faiss to search for the $k$ nearest representations to $f_E(M'(x_i))$ in $M(E)$. Note that as the encoders $M$ and $M'$ produce output representations of variable length (for example, in the case where $x_i$ is a variable length sequence, such as a sentence), we average across the length dimension to produce a fixed-size representation $r$ to conduct the KNN search.

$$r_{x_i} = \mathrm{Avg}\big(f_E(M'(x_i))\big) \tag{1}$$

$$R_E = \{\mathrm{Avg}(M(e)) \mid e \in E\} \tag{2}$$

$$\mathrm{KNN}_{x_i} = \mathrm{KNearest}(k, r_{x_i}, R_E) \tag{3}$$

Then, the KIF module output for an element $x_i$ is the set of all re-encoded representations of the KNN-retrieved knowledge:

$$\mathrm{KIF}_{x_i} = \{M'(e) \mid e \in \mathrm{KNN}_{x_i}\} \tag{4}$$

These elements are weighted by their normalized nearest neighbor scores and then summed. This is subsequently concatenated to $M'(x_i)$ to form the final encoder output:

$$[M'(x_i), \mathrm{WeightedSum}(\mathrm{KIF}_{x_i})] \tag{5}$$

This can be easily extended to using multiple modules simultaneously. For instance, two sources of external information, $E_1$ and $E_2$, can be combined by identifying the top candidates of each information source. The weighted sum of the KIF output on each information source is concatenated with the encoded input $M'(x_i)$. The KIF output dimensionality is the same size as the hidden size of $M'(x_i)$, so they can be directly concatenated.

Finally, different sources of information may not be required for every prediction and some information sources can be more important than others. To allow the model to make more fine-grained decisions about what information to use from what source, and how much of it, we add a gating mechanism using a sigmoid function around each weighted sum of KNN representations. $\mathrm{KIF}_{1i}$ and $\mathrm{KIF}_{2i}$ denote the KIF module from Equation (4) applied to $E_1$ and $E_2$, respectively.

$$\mathrm{WS}_{1i} = \mathrm{WeightedSum}(\mathrm{KIF}_{1i}) \tag{6}$$

$$\mathrm{WS}_{2i} = \mathrm{WeightedSum}(\mathrm{KIF}_{2i}) \tag{7}$$

which produces the final encoder output, a concatenation of $M'(x_i)$ with the output of multiple KIF modules:

$$[M'(x_i), \sigma(\mathrm{WS}_{1i}) \cdot \mathrm{WS}_{1i}, \sigma(\mathrm{WS}_{2i}) \cdot \mathrm{WS}_{2i}] \tag{8}$$

This concatenation represents the output of the encoder $M'$ and can be used for various purposes, such as providing the encoder output to a decoder in a sequence to sequence model.

4 Applying KIF to Dialog Tasks

We describe how to apply KIF to the task of generative dialog, a setting where models must generate engaging and on-topic responses. We investigate dialog for two reasons: First, dialog agents must be able to consult relevant information to maintain the topic of the conversation. Second, retrieval-based agents have strong performance compared to generative ones, due to their ability to copy dialog utterances from the training set. Using KIF, we can incorporate the benefits of retrieval architectures into generative, knowledge-based models.

4.1 KIF for Generative Dialog

In dialog, $x_i$ represents the text of the conversation $i$. A conversation consists of multiple back-and-forth utterances (or turns). For example, a conversation could consist of 4 turns: $x_i = [x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}]$ where $x_{i,4}$ is the direct utterance the model should respond to, and the earlier utterances are the conversation context.

Standard generative dialog models use a Transformer neural network as the encoder $M$ and aim to produce an output that is an appropriate response to the conversation. However, in many cases, the conversation history alone does not include all of the information required to produce an appropriate response. For example, if a model needs to chat about a specific movie, it can be helpful to provide the model with more information about that movie so a more interesting dialog response could be produced. To incorporate knowledge, models often concatenate a knowledge source $E$ such as Wikipedia to $x_i$ and use attention modules to identify the most relevant knowledge. However, this approach is computationally intensive when handling large quantities of information. Further, attention mechanisms have been found to operate poorly over long sequences, as the mechanism becomes blurry due to the softmax and struggles to make fine-grained decisions (Fan et al., 2018b). The same is true for hierarchical approaches, which lack scalability.

We augment Transformer sequence to sequence (seq2seq) networks on the encoder side with KIF to improve generative dialog models. We experiment on two dialog tasks, Wizard of Wikipedia (Dinan et al., 2018) and Engaging ImageChat (Shuster et al., 2020). In both datasets, models must leverage information external to the dialog history alone: in Wizard of Wikipedia, the chat requires access to knowledgeable facts, and in Engaging ImageChat, discussion about a specific image. As models must process multiple inputs and ground responses in the knowledgeable facts or images, these tasks challenge existing seq2seq approaches.

4.2 Wizard of Wikipedia

The goal of the Wizard of Wikipedia dataset is to train knowledgeable agents that can chat in any domain. The dataset contains 1,365 topics discussed in 18,430 dialogs in the training set,
totalling 166,787 training utterances. Each topic is        information learned by accessing the Wikipedia
a general concept, such as dogs or ice cream, and is        knowledge.
included as the first utterance of the conversation.
The conversation is meant to be in-depth and                Additional KNN Features. To better identify
detailed, so individual utterances must reference           relevant training utterances from the large quantity
specific knowledge as a basis for the utterance. The        available, we break down xi into conversation
knowledge takes the form of Wikipedia sentences.            sub-features for a more fine-grained match in the
For example, the chat utterance I love Toy Story!           KNN search step. By conducting KNN on more
It was released in 1995 would reference the                 features, we can achieve higher quality retrieval.
Wikipedia sentence Toy Story is a 1995 American             We leverage the nature of dialog to decide these
computer-animated buddy comedy [...]. For each              features.
utterance, a set of sentences are identified by an             We concatenate the encoding of the most
information retrieval system, and the crowdworker           recent dialog utterance (e.g., xi,last ) with the
selected one knowledge sentence as the basis for            encoding of the dialog context from the current
their utterance.                                            conversation and the turn number t, such that
                                                            M ′ (xi,last ), M ′ (xi,−last ), t is the representation
Knowledge Sources. Our model for Wizard of                  used for KNN search. Concretely, if the model is
Wikipedia has access to two sources of external             trying to produce the 5th turn of the conversation,
information, E1 and E2 :                                    then xi,last is the most recent utterance from the
                                                            dialog partner, xi,−last would be the last 3 turns
  • E1 is Wikipedia Knowledge provided by the
                                                            of exchange, and t would be 4. Note that the turn
    dataset as evidence to support knowledgeable
                                                            number is represented as a standalone number.
    chitchat (initially curated by the information
                                                            These are known to be salient conversation fea-
    retrieval system used in Dinan et al. [2018]).
    This KNN search filters through an average of 34 sentences. The KIF module uses dialog features to fetch relevant knowledge to condition upon to generate the subsequent utterance.
  • E2 is Training Utterances. To incorporate the benefits of retrieval-based dialog models into the generative setting, we use KIF to identify relevant utterances from the training set and take their responses as input. If many conversations about dogs have already occurred, models should be able to take advantage of these human-written examples to improve their generations. For example, conversations are likely to cover the breed of the dog, daily routines with a pet, and similar topics. There are around 170K dialog utterances as inputs to the KNN search. This can be interpreted as incorporating the benefits of retrieval models by identifying an utterance with a structure similar to the text the model would like to generate. We do not allow the module to fetch the correct response of the current conversation context.

   Access to these two sources of knowledge can be seen as learning a template and a topic separately. Sample templates can be identified from the training utterances, and topic-specific

tures. The most recent dialog utterance is the direct turn the model is responding to, and the dialog context may provide additional clues. The turn number is important, as earlier turns are often generic (e.g., how are you doing today) and later turns are more specific.

4.3 Engaging ImageChat

The goal of Engaging ImageChat is to create agents capable of chitchatting about images selected from the YFCC100M dataset (Thomee et al., 2016). The dataset contains 186,782 dialogs in the training set, each about a unique image, totalling 355,862 utterances. Agents are assigned one of 215 personalities (e.g., sweet, caring, excited) to increase engagingness. Previous work (Shuster et al., 2020, 2019) identified that both crowdworkers and models, when provided with personalities, produced more diverse, interesting responses, as evaluated by humans.

   We use a multimodal neural network designed to handle both image input and text input. Following Shuster et al. (2020), the images are encoded using a pre-trained ResNeXt network (Xie et al., 2017). To extract the final image representation, we project the 2048-dimensional output of the image encoder to 512 dimensions using a deep multilayer perceptron with ReLU activation units. The conversation history, which
includes the one-word personality, is encoded with a Transformer encoder network. The image and conversation are integrated using the Multimodal-Sum-Combiner module proposed in Shuster et al. (2020).

Knowledge Sources. Our model for Engaging ImageChat has access to two sources of external information, E1 and E2:
  • E1 is Chat on Similar Images. Although there are over 180K different images in this dataset, many of the images are similar. For example, conversations associated with two pictures of dogs could be relevant to each other. The model is able to use KIF directly on the current image features to fetch from around 180K different images and return 6 turns of related chat for each fetched image. Fetching from E1 consists of identifying related image chats, or conversations on related topics.
  • E2 is Training Utterances. Similar to the motivation for the previous dataset, we allow the model to identify training utterances that could be useful for responding in the current conversation. The scale of this fetching task is large: 350K dialog utterances. This could be interpreted as identifying utterances with similar structure to what the model would like to generate, and is complementary to the topic-based related image chats.

Additional KNN Features. To identify relevant information from training utterances, we use the same dialog features as Wizard of Wikipedia in the KNN search step, with one modification: we add the personality provided by the dataset. We represent the personality feature as the personality word, such as caring, and embed it with the encoder $M'$. As utterances from speakers with the same personality are more likely to be similar, this feature improves the quality of the fetched information. For example, conversations with the sweet personality often include similar text such as aww, that’s wonderful. We use two additional features for the KNN search: $t$, the turn number, and $p$, the personality. The personality feature is explicitly used in Shuster et al. (2020) to improve the engagingness and flow of the conversation. Similar to Wizard of Wikipedia, we represent the conversation turn $t$ as a number. The Transformer model is used to encode text $x_i$ and produce a representation of the text, then the turn number $t$ and personality $p$ are represented separately. As the personality is a word, we use the same Transformer to encode it. The concatenation of features used for KNN search is: $M'(x_{i,\text{last}})$, $M'(x_{i,-\text{last}})$, $t$, $p$.

5 Experimental Setup

5.1 Implementation Details

Parameter Settings. We use parl.ai (Miller et al., 2017) to implement our models. The data for both datasets is also available for download from parl.ai. We use byte-pair encoding (Sennrich et al., 2016) to represent the text to better handle the rare word problem (Dinan et al., 2018; Fan et al., 2018a). Our generative Transformer models have 8 encoder layers and 8 decoder layers, with FFN size 2048, embedding dimension 512, and 4 attention heads. We optimize using Adam (Kingma and Ba) and the inverse square root learning schedule (Vaswani et al., 2017) with 10k warmup updates. The initial learning rate is 0.0001 and we optimize for model perplexity. We use a dropout of 0.5 and set gradient clipping to 0.1. We set k = 5 for all cases. For both datasets, we use a vocabulary of size 54,944 based on the BPE-based vocabulary from the Reddit pre-training. We tuned the learning rate and batch size hyperparameters together.

Pre-training. We pre-train the Transformer seq2seq model used for both datasets on 250M comments from Reddit. The Reddit dataset was made available by pushshift.io. The comments are parsed to maintain conversational threads of users responding to each other, so the encoder network has been exposed to conversational context at training time. Note that the Reddit dataset does not include aspects such as personality, as those are unique to specific datasets such as Engaging ImageChat. The context size in pre-training is set to 512 tokens. The ResNeXt encoder used to model images for the Engaging ImageChat dataset was pre-trained on 3.5 billion images (Mahajan et al., 2018).

5.2 Evaluation

Generation. We generate with beam search, setting the beam size to 4. We use 3-gram blocking. This technique prevents any n-gram from being generated more than once and reduces repetition.
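The n-gram blocking step can be illustrated with a minimal sketch. It assumes tokens are integer ids and that blocking is applied by setting the score of a banned token to negative infinity before the next beam search step. The function names `blocked_next_tokens` and `mask_scores` are hypothetical and are not taken from the implementation used in the paper.

```python
from typing import List, Sequence, Set


def blocked_next_tokens(prefix: Sequence[int], n: int = 3) -> Set[int]:
    """Return tokens that would repeat an n-gram already present in `prefix`."""
    if n < 1 or len(prefix) < n:
        # No complete n-gram has been generated yet, so nothing is banned.
        return set()
    # Any new n-gram must start with the last n-1 generated tokens.
    context = tuple(prefix[len(prefix) - n + 1:])
    banned: Set[int] = set()
    for start in range(len(prefix) - n + 1):
        ngram = tuple(prefix[start:start + n])
        if ngram[:-1] == context:
            banned.add(ngram[-1])
    return banned


def mask_scores(scores: List[float], prefix: Sequence[int], n: int = 3) -> List[float]:
    """Set the score of every banned token to -inf so it cannot be selected."""
    banned = blocked_next_tokens(prefix, n)
    return [float("-inf") if tok in banned else s for tok, s in enumerate(scores)]


# Toy usage: after generating 0 1 2 0 1, emitting 2 would repeat the 3-gram (0, 1, 2).
masked = mask_scores([0.1, 0.2, 0.3], prefix=[0, 1, 2, 0, 1], n=3)
```

In a beam search, the same mask would be applied independently to each hypothesis in the beam, since each hypothesis has its own prefix.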

Automatic Metrics. Following Dinan et al. (2018), we compute F1, a metric of unigram overlap, between the generated utterance and the human-written reference utterance from the dataset. For generative models, utterances are generated using beam search. For retrieval models, the next utterance is predicted by ranking the entire set of training utterances, and the highest scoring utterance is chosen.

   In Wizard of Wikipedia, there are two test sets. The first is a set of seen topics: topics that have been seen at training time, with new test-time dialogs. The second set is unseen: topics that have not been encountered at all during training time. We evaluate on both subsets.

Human Evaluation. We follow the setup and use the analysis questions proposed in the Acute-Eval dialog evaluation system (Li et al., 2019). For reproducibility, we adopt this existing evaluation setting that has been applied to several dialog datasets. We use the question wording suggested by Acute-Eval and follow their self-chat procedure and interface. As one of the original datasets assessed in this system was Wizard of Wikipedia, their evaluation setting extends naturally to ours. We collect 100 human-bot conversational dialogs on a crowdsourcing platform for both datasets. The dialogs are eight turns long. Then, we show pairs of the collected conversations side by side, one conversation with a human and model A and the other conversation with a human and model B. We ask annotators the following questions:

  • Who would you prefer to talk to for a long conversation?
  • If you had to say one of the speakers is interesting and one is boring, who would you say is more interesting?
  • Which speaker sounds more human?
  • Which speaker has more coherent responses in the conversation?
  • If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable? (Wizard of Wikipedia only)

   We measure the percentage of time one model was chosen over the other, taking the majority agreement between three evaluators. To reduce variance, dialogs paired in the evaluation were collected on the same topic for Wizard of Wikipedia and collected on the same image and personalities for Engaging ImageChat. Topics and images selected for evaluation are unique and taken randomly from the test set.

5.3 Baselines

We compare Transformers augmented with KIF to other existing approaches on Wizard of Wikipedia and Engaging ImageChat. The best approaches, judged by human evaluation, are retrieval models: the Retrieval Transformer Memory Network from Dinan et al. (2018) and the Retrieval Transformer from Shuster et al. (2020). These have been shown to be strong baselines compared with other retrieval techniques based on TF-IDF (Chen et al., 2017). Thus, we report the existing retrieval models for both datasets, but focus on comparing to other generative baselines.

   We compare to three additional generative baselines. Note that Wizard of Wikipedia is constructed so that sentences of Wikipedia knowledge are provided with the utterances in a concatenated form. Models must identify the relevant information in this provided knowledge, or can access more Wikipedia knowledge beyond the provided sentences. The following baseline methods always have access to the information already provided in the dataset, but no additional Wikipedia knowledge beyond that.

  • Transformer Memory Networks. To contrast the ability of KIF with existing work, we compare our models to published Transformer Memory Networks (Dinan et al., 2018). These models encode each piece of external information independently with a Transformer Encoder, and these are stored as memory slots. To access information in the memory slots, a model performs dot-product attention between the memory slots and the dialog context. In Dinan et al. (2018), the knowledge selection from Wikipedia was supervised with either (a) a two-stage model where the first model was trained to predict the right knowledge and a second model conditions on the predicted knowledge to generate the next utterance, or (b) an end-to-end model with an auxiliary loss for knowledge prediction accuracy.
  • Retrieve and Refine. We implement a hybrid model (Weston et al., 2018) that incorporates
    top retrieval candidates as additional input to Generative Transformer MemNets. In Retrieve and Refine, a fixed number of candidates are retrieved and concatenated to the conversational history in the encoder, making the input much longer. For both datasets, the Retrieve and Refine mechanism that fetches a fixed number of training utterances is added to the Generative Transformer MemNet with Reddit Pre-Training baseline.
    Unlike the KIF-Augmented Transformer, the retrieval is conducted with a separate model so there is no backpropagation to affect the retrieval. With KIF, models can alter the retrieved candidates by learning the mapping operator. Further, a fixed amount of information is always retrieved, without the capability to easily rescale to focus on specific candidates. KIF modules have weighting mechanisms to focus more on certain information, and the modules are combined with gating so models can learn which knowledge sources are more important and adjust flexibly. Lastly, Retrieve and Refine is only used to retrieve one source of information: training set utterances.
  • Response Generation with MR. We implement the model proposed in Qin et al. (2019), which encodes the conversation history and document contextually with a biLSTM before generating the next dialog utterance. The initial model was applied to a machine reading task where a knowledge document was provided along with the conversation history. For Wizard of Wikipedia, we replace the knowledge document with the Wikipedia sentences provided in the dataset. The model then uses the conversation to identify the most relevant information in the document using a cross-attention mechanism. For the Engaging ImageChat dataset, as there is no document provided with the dataset, we replace the expected document with the conversation history, and use the most recent utterance in the conversation to attend to the conversation history.
    We make an additional improvement to this baseline: in Qin et al. (2019), the embeddings used pre-trained CoVE vectors (McCann et al., 2017). We found our Reddit pre-trained Transformer embeddings to work more effectively as they are trained for dialog. Thus, we replace CoVE embeddings with domain-specific ones.

   All of the Transformer generative baselines are initialized with the same pre-training on Reddit that we use for our models, for a fair comparison of modeling quality.

6 Results

We describe the results of incorporating KIF modules into Transformer networks. We display an example conversation between a human and our model in Figure 4, and show the top scoring Wikipedia knowledge and Training Utterance fetched by KIF modules. We compare to various baselines using automatic and human evaluation, and discuss our experiments. We present various ablation settings to understand the key features that make our method function.

6.1 KIF is Effective for Incorporating Knowledge

Automatic Evaluation. Comparing KIF-augmented Transformer networks to published baselines and Retrieve and Refine, we find improved results.

   For Wizard of Wikipedia, the improvement in F1 score over the best baseline is around 8 points (see Table 1). A major contributing factor is the construction of the dataset: as each dialog turn is grounded in a specific knowledge sentence from Wikipedia, improving the ability to identify the relevant fact strongly improves performance. Contrasting the results from the seen and unseen test sets in Table 1, the improvement on unseen is smaller, as it is harder to fetch training utterances for unseen topics.

   While Engaging ImageChat has no explicit dependency on knowledge, we still see a 2 point improvement compared to the Generative Transformer MemNet (with the additional Reddit pre-training), indicating that KIF can be generally useful (see Table 2). Compared to an even stronger baseline that we tune in this work, Retrieve and Refine, we see a 1 point improvement.

Human Evaluation. Results are shown in Figure 2. On both datasets, we find there is a large improvement over existing generative models (green bars) that is statistically significant for some of the evaluation questions. Evaluators agree that

  Model                                                                         Test F1 (Seen)        Test F1 (Unseen)
  Retrieval Baselines
  Retrieval Transformer MemNet (Dinan et al., 2018)                                  15.4                   12.4
  Generative Baselines
  2-Stage Generative MemNet (Dinan et al., 2018)                                     18.9                   17.4
  Generative Transformer MemNet (Dinan et al., 2018)                                 16.9                   14.4
  + Reddit Pre-Training                                                              17.6                   16.3
  Retrieve and Refine (Weston et al., 2018)                                          18.2                   17.9
  Response Generation with MR (Qin et al., 2019)                                     17.5                   16.8
  KIF-Augmented Transformer                                                          25.9                   22.3

 Table 1: Results on the Wizard of Wikipedia dataset. We implement the Retrieve and Refine
 and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on
 Wizard of Wikipedia. The Seen test set consists of conversations on topics seen at training time, and
 the Unseen test set consists of conversations about new topics that were not in the training set.
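
The F1 metric reported in the tables is described in Section 5.2 as unigram overlap between the generated and reference utterances. A minimal sketch follows, assuming lowercased whitespace tokenization; the paper's evaluation code may normalize text differently, and the tables appear to report the score scaled by 100. The function name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(generated: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between two utterances."""
    # Assumption: simple lowercased whitespace tokenization.
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both utterances.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Two of four tokens overlap on each side, so precision = recall = F1 = 0.5.
score = unigram_f1("i love the incredibles", "i love disney movies")
```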

  Model                                                                                           Test F1
  Retrieval Baselines
  Retrieval Transformer (Shuster et al., 2020)                                                       9.8¹
  Generative Baselines
  Generative Transformer MemNet (Dinan et al., 2018)                                                7.1
  + Reddit Pre-Training                                                                            12.8
  Retrieve and Refine (Weston et al., 2018)                                                        13.6
  Response Generation with MR (Qin et al., 2019)                                                   13.2
  KIF-Augmented Transformer                                                                        14.4

  Table 2: Results on the Engaging ImageChat dataset. We implement the Generative Transformer
  Memory Network, Retrieve and Refine, and Response Generation with MR approaches, all with
  Reddit Pre-Training, and evaluate them on Engaging ImageChat.

¹ In Shuster et al. (2020), retrieval Transformer models report Hits@N using a fixed candidate set of 99 distractor candidates and 1 true candidate. We compute F1 using their open-sourced model by scoring the entire training set of over 350K utterances with the model and taking the top scoring candidate as the response.

KIF-augmented Transformers are generally more coherent and human-sounding compared to the Generative MemNet.

   Comparison with existing retrieval models (shown in blue) is more nuanced. Along the lines of existing work (Zhang et al., 2018; Dinan et al., 2018), we find that retrieval-based models score very well in human evaluations that ask how human or interesting a dialog sounds. This is because retrieval models return human-written utterances from the training set and do not suffer from decoding mistakes present in generative models. For example, on Engaging ImageChat, while our model has significantly improved over the generative baseline (see green bars in Figure 2, right), it does not beat retrieval-based methods in sounding more human or being more interesting (see blue bars in Figure 2, right). As the Retrieval baseline returns human-written text for other humans to evaluate, we hypothesize that humans score each other’s writing quite well. Compared with generative models, which we focus on improving, retrieval models often produce longer text with more interesting, nuanced vocabulary usage, and do not make generation mistakes such as repetition. These factors often lead to the stronger performance of retrieval models.

   A surprising result is that KIF-augmented Transformers are more human sounding than

Figure 2: Human Evaluation Results on Both Datasets. More than 50% indicates the KNN Model is
preferred. Stars indicate statistical significance at p < 0.05.
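
The percentages in Figure 2 can be read as majority-vote preference rates. The sketch below assumes each conversation pair is judged by three evaluators, as described under Human Evaluation in Section 5.2. The labels "kif" and "baseline" and both function names are hypothetical, and significance testing is not shown.

```python
from collections import Counter
from typing import List, Sequence


def majority_choice(votes: Sequence[str]) -> str:
    """Return the model chosen by most evaluators for one conversation pair."""
    # With three votes over two models, the majority is always unique.
    return Counter(votes).most_common(1)[0][0]


def preference_rate(all_votes: List[Sequence[str]], model: str) -> float:
    """Percentage of conversation pairs whose majority vote favors `model`."""
    wins = sum(1 for votes in all_votes if majority_choice(votes) == model)
    return 100.0 * wins / len(all_votes)


# Toy example with four conversation pairs, three evaluators each.
votes = [
    ("kif", "kif", "baseline"),
    ("baseline", "baseline", "kif"),
    ("kif", "baseline", "kif"),
    ("kif", "kif", "kif"),
]
rate = preference_rate(votes, "kif")  # "kif" wins the majority in 3 of 4 pairs
```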

retrieval models on Wizard of Wikipedia. This is because the dataset’s utterances are long and factual due to the tendency of crowdworkers to copy Wikipedia. Sometimes humans chatting with the retrieval bot would respond uh... that’s an interesting fact? Otherwise, our model scores similarly to retrieval models, with most evaluations not having a statistically significant difference.

   We conduct a second evaluation on the Unseen Test Set of the Wizard of Wikipedia dataset. Results are shown in Figure 3. Trends are similar to the results on the Seen Test Set, though the preference for the KIF-augmented Transformer over the retrieval baseline is greater. We hypothesize that because the Unseen Test Set is on entirely held out topics, the retrieval baseline can struggle to identify relevant utterances. In contrast, the KIF-augmented Transformer, similar to the generative baseline from Dinan et al. (2018), can use the generative capability to produce utterances.

Figure 3: Human Evaluation on the Unseen Test Set of Wizard of Wikipedia. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance at p < 0.05.

   Lastly, we conduct an additional study to examine the variance of the comparative dialog judgements. The evaluation study for Wizard of Wikipedia is repeated three times on different days, and evaluators who have answered on previous days are not allowed to evaluate again in any subsequent experiments. Overall, we find reasonable interannotator agreement rates, around 73% averaged across all evaluations, which is similar to the agreement rates reported in Li et al. (2019). We find there is greater variance on questions asking which dialog is more human and more interesting, most likely as different evaluators can interpret these in different ways. Further, we see that comparison with the Retrieval model has less variance compared to the Generative model, possibly because the Retrieval model’s human-written text is devoid of mistakes. Overall, we find that the conclusions (and statistical significance) are stable across multiple evaluations.

6.2 Analysis of Fetched Knowledge

Example conversations from our KIF-augmented generative model are shown in Figure 4 on Wizard of Wikipedia. We find that relevant knowledge is identified that affects the content of the generated utterance. For example, the model finds knowledge sentences about Disney movies as the human conversationalist starts the conversation discussing Disney. The model leverages the fetched knowledge to write the content of the generated utterance. In a concrete example, the fetched sentence disney announced intentions [...] after the success of the incredibles leads the model to generate the utterance i love the incredibles, they are my favorite disney movie.

   In contrast, the model often uses the form of the fetched training utterance as a template for writing a response. For example, the model copies the training utterance Ohhh ... what do people with color blindness do to cope with the effects? and starts the model generation with Ohhh ... and continues with the question i think toy story is a classic? following the form of the selected training utterance.

Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of
Wikipedia. The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed
with model output.

   Figure 5 displays the top-3 fetched training set utterances and knowledge sentences on the Wizard of Wikipedia dataset when responding to a human utterance. KIF modules can identify multiple relevant items. In response to the human question about blue skies the 1946 movie, the model identifies both the comedy film and the band.

   Finally, the elements retrieved by KIF modules provide a more interpretable understanding of what the model is conditioning upon to generate a dialog response. In Table 3, for the same dialog history, we manually replace the model’s fetched training utterance and knowledge sentence with our own examples. The model heavily incorporates our manual changes of the fetched information into the generated utterance. For example, changing the knowledge directly affects what the model generates as the favorite character (from buzz lightyear to mr potato head to slinky dog), while changing the fetched training utterance changes the form of the generated sentence.

6.3 Scaling KIF to Challenging Retrieval Settings

KIF modules can be used in more realistic and challenging settings for knowledge retrieval that test the scalability of the module. In Figure 6(a), we compare the Generative Transformer MemNet Baseline with KIF-Augmented Transformers in three settings. The first is the standard Wikipedia sentences provided by the dataset (average 34 sentences). Then, we extend to providing the model with the full Wikipedia article (on average, 57 sentences) and finally to multiple Wikipedia articles (on average, totaling 205 sentences), identified using the conversation’s topic. This increasing size of available knowledge could be realistic for settings where it is unclear what information is most relevant, if filtering steps to preprocess the data remove potentially relevant information, or if information synthesis from multiple knowledge sources is necessary to produce a high-quality generation. As the Wikipedia knowledge becomes more difficult to identify, performance decreases, but KIF still outperforms the baseline that uses the dataset-provided set of 34 sentences.

   Comparing the scaling capability of KIF to the standard Generative Transformer MemNet Baseline highlights the advantage of using KNN. The attention-based mechanism used in Dinan et al. (2018) struggles to identify salient information

Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding
to a human chat from the dataset using a trained Wizard of Wikipedia model. Examples are taken from
validation.

when given increasingly larger quantities of knowledge, unlike the KNN information fetch. We hypothesize the attention mechanism is challenged by softmax-ing over a larger quantity of inputs, as it can be difficult to make sharp distinctions.

6.4 Ablations

Importance of Multiple Knowledge Sources. One benefit of the KIF module approach is that several modules can be combined, each capturing information from a different source. In both settings, Wizard of Wikipedia and Engaging ImageChat, two modules were used to incorporate multiple forms of knowledge: training utterances to capture the capability of a retrieval-based model, and knowledge from Wikipedia or related chats based on image features. We perform here an ablation study to evaluate the impact of using only one source of information. As can be seen in Table 4, performance decreases when only one source of information is used.

   For Engaging ImageChat, this study also underlines the importance of being able to fetch in a multimodal fashion. The general form of the KIF module, which requires only a feature vector to find nearest neighbors from, allows fetching on multiple modalities such as text and images. In Table 4, using the Image-based KIF to fetch text from Related Images is important to reach the strongest performance (compare Training Utterances Only, which uses text-based KIF, with using both Training Utterances and Related Images).

Using Dialog Features for KNN Performance. The quality of the KNN search is critical to the performance of KIF modules. As the external knowledge is kept fixed, KIF must be able to align the dialog context with the knowledge to identify relevant pieces of information. In Table 5, we show that matching on more features can improve the quality of the retrieved information. Using only the encoding of the immediate previous utterance can improve results on Wizard of Wikipedia by 7 F1 points, but this is further improved by also leveraging the encoding of context (+1.8 F1) and using the dialog turn number (+1 F1). These features are available in the datasets, and

Knowledge                          Training Utterance                     Generation

buzz lightyear’s name is in        my favorite character in that          cool! my favorite character in
honor of astronaut edwin ‘buzz’    book series is hermione granger        that movie is buzz lightyear
aldrin

mr potato head is based on the     my favorite character in that          my favorite character in that
real-life mr. potato head toy      book series is hermione granger        movie is real-life mr potato
                                                                          head

slinky dog is a toy dachschund     my favorite character in that          cool! my favorite character is the
with a metal slinky for a body     book series is hermione granger        slinky dog

slinky dog is a toy dachschund     i really like the character            cool! i really like slinky dog
with a metal slinky for a body     hermione granger

slinky dog is a toy dachschund     my favorite character of all           i love that movie, my favorite
with a metal slinky for a body     time has to be hermione granger        character has to be slinky dog
                                                                          the dachshund

slinky dog is a toy dachschund     i agree with you! that’s my            i think so too! my favorite is
with a metal slinky for a body     favorite character as well             slinky

Table 3: Effect of Fetched Information on Generated Utterances. The top section provides examples
for a fixed training utterance, changing the knowledge—the generated text maintains the construction
of the training utterance but changes the favorite character to match the knowledge. The bottom section
provides examples for fixed knowledge but changing the training utterance—the generated text modifies
its form to match the training utterance, but the favorite character information remains consistent.

Figure 6: Ablations on Wizard of Wikipedia. (a) KIF can scale to hundreds of relevant sentences (blue),
while the baseline model, the Generative Transformer MemNet (gray), scales poorly. (b) Gating can
remove irrelevant information. In the 3 Sources case, one source of external information is unrelated.
(c) Performance as k varies.
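
Panel (c) concerns the number of neighbors k. The toy sketch below is not the authors' implementation: it assumes the fetched items are combined by a softmax-weighted sum over similarity scores, and the function name `fuse_top_k`, the scores, and the vectors are all made-up for illustration. It shows one way a weighted sum can become "blurry": as k grows, weight shifts from the single best match to many weaker ones.

```python
import math

def softmax(scores):
    """Normalize similarity scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_top_k(scores, vectors, k):
    """Softmax-weighted sum of the k highest-scoring vectors (hypothetical helper)."""
    ranked = sorted(zip(scores, vectors), key=lambda pair: pair[0], reverse=True)[:k]
    weights = softmax([s for s, _ in ranked])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, (_, v) in zip(weights, ranked)) for i in range(dim)]

# Toy memory (assumed values): one relevant item pointing along axis 0,
# and 99 weaker matches pointing along axis 1.
scores = [5.0] + [3.0] * 99
vectors = [[1.0, 0.0]] + [[0.0, 1.0]] * 99

for k in (1, 5, 20, 100):
    fused = fuse_top_k(scores, vectors, k)
    # fused[0] is the share of the output that comes from the relevant item.
    print(k, round(fused[0], 3))
```

With these hypothetical scores, the first component of the fused vector (the weight placed on the relevant item) shrinks steadily as k increases, even though that item is always among the neighbors.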

we leverage them to improve the relatedness of retrieved knowledge.

Multi-Hop Retrieval with KIF. Work in memory networks (Weston et al., 2015;
Sukhbaatar et al., 2015) utilized multi-hop mechanisms. Such capacity could be
useful when multiple sources are necessary or information is incrementally
fetched. To emulate multi-hop memory mechanisms, we use KIF to retrieve
relevant information for N = 2 or N = 3 fixed hops. As the number of hops is
fixed, the multi-hop operation remains differentiable. We do not allow the
model to retrieve the same information in a second hop.

   We experimented in two settings. First, the same KIF module is used
multiple times to fetch different information, and then all of the fetched
knowledge is concatenated. Results are shown in Table 6 (top). Second, we
examine spreading the fetches into different KIF modules at various

Model                                       Test F1
Wizard of Wikipedia
  Training Utterances Only                   18.1
  Wiki Knowledge Only                        23.9
  Training Utterances and Wiki Knowledge     25.9
Engaging ImageChat
  Training Utterances Only                   13.9
  Related Images Only                        13.8
  Training Utterances and Related Images     14.4

Table 4: Using Multiple KIF Modules on Multiple Sources is important for improved performance.

Model                                       Valid F1
KIF-Augmented Transformer                    27.4
One KIF Module fetches multiple times
  2 Fetches                                  26.9
  3 Fetches                                  26.0
Multiple KIF Modules fetch once each
  2 Fetches                                  26.5
  3 Fetches                                  25.9

Table 6: Multi-hop with KIF to retrieve information with multiple fetch steps.
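
The first multi-hop setting in Table 6 (one module fetching several times, with repeats disallowed and the results concatenated) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the name `multi_hop_fetch`, the dot-product scoring, and the choice to keep the query fixed across hops are all hypothetical, and the memory is a plain list of vectors.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def multi_hop_fetch(query, memory, n_hops, k=1):
    """Run a fixed number of fetch steps, never returning the same item twice."""
    fetched = []   # indices retrieved so far, in hop order
    seen = set()   # items excluded from later hops
    for _ in range(n_hops):
        candidates = [i for i in range(len(memory)) if i not in seen]
        candidates.sort(key=lambda i: dot(query, memory[i]), reverse=True)
        hop = candidates[:k]
        fetched.extend(hop)
        seen.update(hop)
    # Concatenate everything that was fetched across hops.
    concatenated = [x for i in fetched for x in memory[i]]
    return fetched, concatenated

# Toy memory of three 2-d vectors (assumed values).
memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
indices, concatenated = multi_hop_fetch(query, memory, n_hops=2)
print(indices)       # [0, 1]
print(concatenated)  # [1.0, 0.0, 0.9, 0.1]
```

Because the number of hops is a fixed loop bound, the amount of fetched knowledge is known in advance; the second hop simply skips whatever the first hop already returned.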

Model                                 Valid F1
Wizard of Wikipedia
  Previous Utterance Only               24.6
  + Dialog Context                      26.4
  + Turn Embedding                      27.4
Engaging ImageChat
  Previous Utterance Only               13.3
  + Dialog Context                      14.5
  + Turn Embedding + Personality        15.1

Table 5: Important Features for KNN Search using KIF. Salient conversation features improve performance on both datasets.

encoder depths. This could be interpreted as the model learning to access more
information at each layer. As the model progresses deeper, more abstract and
high-level representations are built, which could allow different knowledge to
be retrieved. Results are shown in Table 6 (bottom).

   In both multi-hop settings, no improvement in performance on the Wizard of
Wikipedia dataset is observed. We hypothesize that this can be partially
attributed to the construction of the dataset, as humans explicitly based
their written dialog utterance on one knowledge sentence. Further, it is
possible that concatenation brings together too much information for the model
to incorporate, so that additional fetches make the retrieval noisier.

Effect of Gating. We analyze the effect of the gating mechanism by evaluating
the capability of the gate to identify and focus on salient information. On
Wizard of Wikipedia, we concatenate a third source of information: dialog
turns from a completely different corpus called PersonaChat (Zhang et al.,
2018). This dataset looks quite different (short utterances without factual
knowledge) and should be easy for the model to identify as distinct from
Wizard of Wikipedia. As shown in Figure 6(b), if KIF on PersonaChat is
included without gating, it has a harmful effect as the model includes
irrelevant information. When equipped with gating, the model learns to use the
gate to ignore some inputs, and can recover almost the full performance of a
model without this irrelevant information source.

Size of K in KNN. Figure 6(c) shows the performance on Wizard of Wikipedia
when varying the amount of knowledge. Being able to access multiple relevant
pieces of information is helpful, but too much information can be harmful.
This is likely because the weighted sum becomes blurry if too many sentences
are incorporated.

7 Conclusion

We present a KNN-based Information Fetching module that learns to identify
relevant information from external knowledge sources by learning a
mapping-based read operation. KIF modules benefit from the scalability and
efficiency of KNN search, enabling computation with large external memories.
We show in the context of two dialog datasets that relevant knowledge can be
identified and incorporated to create more engaging, high-quality dialog.

Acknowledgments

We thank the reviewers and action editor for their comments and insightful
discussion. We thank Emily Dinan and Kurt Shuster for providing assistance to
reproduce their original works.

References

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end
  goal-oriented dialog. In 5th International Conference on Learning
  Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference
  Track Proceedings.

Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiao-jiang Liu, and Shuming Shi.
  2019. Retrieval-guided dialogue response generation via a
  matching-to-generation framework. In Proceedings of the 2019 Conference on
  Empirical Methods in Natural Language Processing and the 9th International
  Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages
  1866–1875. DOI: https://doi.org/10.18653/v1/D19-1195

Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro,
  and Yoshua Bengio. 2016. Hierarchical memory networks. CoRR,
  abs/1605.07427.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading
  Wikipedia to answer open-domain questions. In Proceedings of the 55th
  Annual Meeting of the Association for Computational Linguistics (Volume 1:
  Long Papers), pages 1870–1879. DOI: https://doi.org/10.18653/v1/P17-1171,
  PMCID: PMC5579958

Wenlin Chen, David Grangier, and Michael Auli. 2016. Strategies for training
  large vocabulary neural language models. In Proceedings of the 54th Annual
  Meeting of the Association for Computational Linguistics (Volume 1: Long
  Papers), pages 1975–1985. DOI: https://doi.org/10.18653/v1/P16-1186

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and
  Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational
  agents. In International Conference on Learning Representations.

Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019. Using
  local knowledge graph construction to scale seq2seq models to
  multi-document inputs. In Proceedings of the 2019 Conference on Empirical
  Methods in Natural Language Processing and the 9th International Joint
  Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4177–4187.

Angela Fan, David Grangier, and Michael Auli. 2018a. Controllable abstractive
  summarization. In Proceedings of the 2nd Workshop on Neural Machine
  Translation and Generation, pages 45–54.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018b. Hierarchical neural story
  generation. In Proceedings of the 56th Annual Meeting of the Association
  for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé
  Jégou. 2017a. Efficient softmax approximation for GPUs. In Proceedings of
  the 34th International Conference on Machine Learning-Volume 70, pages
  1302–1310.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017b. Improving neural
  language models with a continuous cache. In 5th International Conference on
  Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
  Conference Track Proceedings.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines.
  arXiv preprint arXiv:1410.5401.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang.
  2020. Retrieval augmented language model pre-training. In Proceedings of
  the International Conference on Machine Learning, pages 5695–5704.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019.
  Poly-encoders: Architectures and pre-training strategies for fast and
  accurate multi-sentence scoring. In International Conference on Learning
  Representations.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity
  search with GPUs. IEEE Transactions on Big Data. DOI:
  https://doi.org/10.1109/TBDATA.2019.2921572

Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with
  stack-augmented recurrent nets. In Advances
