A Bi-Encoder LSTM Model For Learning Unstructured Dialogs
Diwanshu Shekhar∗, Pooran S. Negi†, Mohammad Mahoor‡
University of Denver

arXiv:2104.12269v1 [cs.CL] 25 Apr 2021

∗ diwanshu.shekhar@du.edu
† pooran.negi@du.edu
‡ mohammad.mahoor@du.edu

Abstract

Creating a data-driven model that is trained on a large dataset of unstructured dialogs is a crucial step in developing Retrieval-based Chatbot systems. This paper presents a Long Short Term Memory (LSTM) based architecture that learns unstructured multi-turn dialogs and provides results on the task of selecting the best response from a collection of given responses. Ubuntu Dialog Corpus Version 2 was used as the corpus for training. We show that our model achieves 0.8%, 1.0% and 0.3% higher accuracy for Recall@1, Recall@2 and Recall@5 respectively than the benchmark model. We also show results on experiments performed by using several similarity functions, model hyper-parameters and word embeddings on the proposed architecture.

1 Introduction

Recently, statistical techniques based on recurrent neural networks (RNN) have achieved remarkable successes in a variety of natural language processing tasks, leading to a great deal of commercial and academic interest in the field (Bengio et al., 2013; Cambria and White, 2014). Significant progress has been made in Machine Translation, Text Categorization, Spam Filtering, and Summarization. Research in developing Dialog Systems or Conversational Agents, perhaps a desirable application of the future, has been growing rapidly. A Dialog System can communicate with humans in text, speech or both, and can be classified as either a Task-oriented System or a Chatbot System.

Task-oriented systems are designed for a particular task and set up to have short conversations. These systems interact with humans to get information to help complete the task. They include the digital assistants that are now on every cellphone or home controller, and voice assistants such as Siri, Cortana, Alexa, Google Now/Home, etc.

Chatbot Systems, the area of this paper, are systems that can carry on extended conversations with the goal of mimicking the unstructured conversations or 'chats' characteristic of human-human interaction. Lowe et al. (2015) explored learning models such as TF-IDF (Term Frequency-Inverse Document Frequency), Recurrent Neural Network (RNN) and a Dual Encoder (DE) based on the Long Short Term Memory (LSTM) model, suitable to learn from the Ubuntu Dialog Corpus Version 1 (UDCv1). We use this same architecture, but on Ubuntu Dialog Corpus Version 2 (UDCv2), as a benchmark and introduce a new LSTM based architecture called the Bi-Encoder LSTM model (BE) that achieves 0.8%, 1.0% and 0.3% higher accuracy for Recall@1, Recall@2 and Recall@5 respectively than the DE model. In contrast to the DE model, the proposed BE model has separate LSTM networks for encoding utterances and responses. The BE model also has a different similarity measure for utterance and response matching than that of the benchmark model. We further show results of various experiments necessary to select the best similarity function, hyper-parameters and word embedding for the BE model.

Section 2 describes the related current state-of-the-art research on Chatbot Systems. We describe the proposed BE model in Section 3. The experiments and results are described in Section 4, and we conclude the paper in Section 5 with suggestions for potential future work.
2 Background

For clarity, we establish a notation in this paper wherein the type of the mathematical quantity involved is denoted by its representation. Scalars are represented by lower case letters $i, j, k, \cdots$; $\alpha, \beta, \gamma, \cdots$, vectors are represented by lower case bold letters $\mathbf{a}, \mathbf{b}, \cdots, \mathbf{e}, \cdots$ and matrices are represented by bold upper case letters $\mathbf{A}, \mathbf{B}, \cdots, \mathbf{E}, \cdots$. Calligraphic letters $\mathcal{A}, \mathcal{T}, \cdots$ are used to represent sets of objects. We consistently follow a similar convention for functions, where $f$ represents scalar valued functions, bold $\mathbf{f}$ represents vector valued functions and bold $\mathbf{A}_{i,\cdot}$ represents the $i$th row of the matrix $\mathbf{A}$.

2.1 Related Work

Early Chatbot systems such as ELIZA (Weizenbaum, 1966), ALICE (Wallace, 2008) and PARRY (Colby et al., 1971) were based on pattern matching, where a human statement was matched to a pattern and a response was retrieved that pertained to the matched pattern. These Chatbot systems were rule based and needed domain expertise to hand-craft rules in advance, which made the design of these systems very expensive and tedious. To address this limitation, the idea of the corpora-based Chatbot system was introduced. At the time of this research, two large corpora are available to design corpora-based Chatbot systems: the Twitter Corpus (Ritter et al., 2010) and the Ubuntu Dialog Corpus (Lowe et al., 2015). Serban et al. (2015) surveyed all available corpora for corpora-based Chatbot systems.

A popular type of corpora-based Chatbot system is the Information Retrieval (IR) based Chatbot system. In IR-based Chatbot systems, an utterance is matched against a repository of responses and the response that matches best is retrieved. If this repository is too big, the retrieval process may be too slow. To address this problem, Jafarpour et al. (2010) devised a filtering technique based on feature selection to reduce the size of the set of responses to match the given utterance with. Wang et al. (2013) used the same filtering technique but used RankSVM to match the utterance with responses. Lowe et al. (2015) used an LSTM-based Dual Encoder model (DE) to retrieve the best response from a set of responses of size 10 (since the set of responses to choose from was already small, filtering was not necessary). Kadlec et al. (2015) showed that an ensemble of LSTM, DE and CNN models performed better than the DE model.

Another type of corpora-based Chatbot system is the Generative Chatbot system. One clear benefit of Generative systems is that they don't need a repository of responses to choose from, as a response is generated by the system itself. Ritter et al. (2011) used a Sequence-to-Sequence RNN (seq2seq) model, a model that is commonly used for Machine Translation (Sutskever et al., 2014), to generate a response given an utterance. Although seq2seq models work well in machine translation, the model did not perform very well in the response generation task: in machine translation, words or phrases in the source and target sentences tend to align well with each other, but in dialogs a user utterance may share no words or phrases with a coherent response. Several modifications of the seq2seq model have been made for response generation. Li et al. (2015) made a modification to address the problem of the seq2seq model producing responses like "I'm OK" or "I don't know" that tend to end the conversation. Lowe et al. (2017) used a hierarchical approach to use longer prior context in the seq2seq model. The basic seq2seq model focuses on generating single responses, and so does not tend to do a good job of continuously generating responses that cohere across multiple turns. This can be addressed by using reinforcement learning (Li et al., 2016), as well as techniques like adversarial networks (Li et al., 2017) that can select multiple responses that make the overall conversation more natural.

Not all Generative Chatbot systems are based on the seq2seq model. Shang et al. (2015) showed that transduction models can be used to generate responses. Wen et al. (2015) presented a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. Although not related to Chatbot systems, Pan et al. (2016) introduced an LSTM-E architecture that was able to generate a description given a video. Kannan et al. (2016) demonstrated a hybrid system called Smart Reply that leverages both the Retrieval and Generative concepts. At the time of this research, generative systems do not yet perform well and most production systems, such as Cleverbot (Carpenter, 2011) and Microsoft's Little Bing system, are essentially retrieval-based.
3 Bi Encoder Model

Figure 1: Bi Encoder LSTM Architecture. RNNs are colored in grey and white to show two different LSTM networks

The proposed BE model architecture in Figure 1 is motivated by the typical setup of a conversation between two persons. Each person has to encode long and short term conversation contexts to best respond to a spoken sentence (an utterance or context).

As a natural design choice, in the BE model one LSTM cell (colored in grey) learns the encoding of an utterance (or question or context) while the other LSTM cell (colored in white) learns the encoding of a response (or answer). A sequence of GloVe embedding vectors (Pennington et al., 2014) of an utterance is fed into the upper LSTM cell while the sequence of embedding vectors of a response is fed into the lower LSTM cell. Vectors representing the final states $h_t \in \mathbb{R}^s$ of the upper and lower LSTM cells are used as the final representations of the utterance and the response, $u_e$ and $r_e$ respectively. To drive learning of $u_e$ and $r_e$, we measure their similarity in the hidden vector space using the dot product, i.e.

$$\mathrm{sim}(u_e, r_e) = \langle u_e, r_e \rangle = u_e^{\top} r_e \tag{1}$$

The model is trained via backpropagation through time (BPTT) by minimizing the binary cross entropy $X(q, p)$ between the learned probability $p$ and the ground truth pairing label $q \in \{0, 1\}$, where 1 denotes that $u_e$ and $r_e$ are genuinely paired and 0 denotes otherwise. Using the similarity in Eq. (1), $p$ is calculated as:

$$p = \frac{1}{1 + e^{-(\mathrm{sim}(u_e, r_e) + b)}} \tag{2}$$

In Eq. (2), $b$ is a scalar bias, a free parameter that is learned by the model. The aforementioned cross entropy loss is given by:

$$X(q, p) = -q \cdot \log(p) - (1 - q) \log(1 - p) \tag{3}$$

The model is trained using the Adam Optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 by minimizing the loss function in Eq. (3).

Figure 2 shows the Cumulative Match Characteristic (CMC) curve, which gives the true positive identification rate of the BE model for Recall@k for $k \in \{1, 2, ..., 10\}$.

Figure 2: CMC Curve of the BE Model

In the subsequent section, we look at the experiments that helped us select the best similarity function, hyper-parameters and word embedding for the BE model. We also show the performance of the BE model in comparison to the DE model.

4 Experiments and Results

All our models were implemented in Tensorflow v1.7 and trained using an NVIDIA GeForce GTX 1080 Ti GPU. We used the same corpus as Lowe et al. (2015), the UDC, but in its second version (UDCv2). Models are trained on the 1 million utterance-response pairs of the training set and evaluated against a test set. We tune the hyper-parameters and determine the best similarity function and word embedding using the validation dataset.

For evaluation and model selection, we present our model with 10 response candidates, consisting of one correct response and nine incorrect responses. This set of 10 response candidates per context is provided in the validation and test sets of UDCv2 (more details in Section 4.1). The model ranks these responses, and its ranking is considered correct if the correct response is among the first k candidates. This quantity is denoted as Recall@k. Specifically, we report mean values of Recall@1, Recall@2 and Recall@5.
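Eqs. (1) to (3) and the Recall@k metric can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the TensorFlow implementation used in the paper: the function names are hypothetical, the encodings u_e and r_e are assumed to have already been produced by the two LSTMs, Eq. (2) is assumed to be the standard logistic function (so that p grows with the similarity), and ties in the ranking are resolved in favour of the true response.

```python
import numpy as np


def pair_probability(u_e, r_e, b=0.0):
    """Eqs. (1)-(2): dot-product similarity passed through a logistic
    function with scalar bias b (assumed form: p increases with similarity)."""
    sim = float(np.dot(u_e, r_e))            # Eq. (1)
    return 1.0 / (1.0 + np.exp(-(sim + b)))  # Eq. (2)


def binary_cross_entropy(q, p, eps=1e-12):
    """Eq. (3): loss for one pair with label q in {0, 1} and probability p."""
    p = min(max(p, eps), 1.0 - eps)          # clip to avoid log(0)
    return -q * np.log(p) - (1 - q) * np.log(1 - p)


def recall_at_k(scores, k, true_index=0):
    """Fraction of examples whose true response is ranked in the top k.

    scores: array of shape (n_examples, n_candidates), higher is better.
    true_index: column holding the true response (hypothetical default 0).
    """
    scores = np.asarray(scores, dtype=float)
    true_scores = scores[:, true_index]
    # Rank = number of candidates scoring strictly higher than the true one,
    # so ties count in favour of the true response.
    rank = (scores > true_scores[:, None]).sum(axis=1)
    return float((rank < k).mean())
```

With the validation and test layout described in Section 4.1, where the true response is the first of the 10 candidates, `true_index=0` yields Recall@1, Recall@2 and Recall@5 directly from a matrix of model scores.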
Figure 3: DE Model. All RNNs are colored in white to show the same LSTM network has been fed first by the utterance and then by the response

For benchmarking, we use the DE model in (Lowe et al., 2015) and the results of the DE model on UDCv2 as published in https://github.com/rkadlec/ubuntu-ranking-dataset-creator. In contrast to the BE model, the DE model has one LSTM cell that encodes both the utterance and the response. The encoding for the utterance, $u_e$, is multiplied with a trainable matrix $M$, and the result is compared with the encoding for the response, $r_e$, by a dot product (Figure 3).

We also reproduced the DE model for comparison and refer to it as the DER model. Note that the DE model was originally modeled and trained in Theano.

4.1 Data

The Ubuntu Dialog Corpus (UDC) is the largest freely available multi-turn dialog corpus (Lowe et al., 2015). It was constructed from the Ubuntu chat logs, a collection of logs from Ubuntu-related chat rooms on the Freenode IRC network. Although multiple users can talk at the same time in the chat room, the logs were pre-processed using heuristics to create two-person conversations. The resulting corpus consists of almost one million two-person conversations, where a user seeks help with his/her Ubuntu-related problems (the average length of a dialog is 8 turns, with a minimum of 3 turns). Because of its size, the corpus is well-suited for deep learning in the context of dialogue systems.

UDCv2, released in 2017, made several significant updates to its predecessor (https://github.com/rkadlec/ubuntu-ranking-dataset-creator). To summarize: UDCv2 is separated into training, validation and test sets; the sampling procedure for the context length in the validation and test sets is changed from an inverse distribution to a uniform distribution; the tokenization and entity replacement procedure was removed; differentiation between the end of an utterance (__eou__) and the end of a turn (__eot__) has been added; a bug that caused the distribution of false responses in the test and validation sets to be different from the true responses was fixed.

Example 1 (Label: 1)
  Context:  you just click userprefer and creat one ... __eou__ __eot__ i ca n't access the wiki at all - it throw http auth at me __eou__ __eot__
  Response: unless nat get distract by someth shinier

Example 2 (Label: 0)
  Context:  i think that tutori be outdat - veri first instruct fail __eou__ and it differ slight to what ubotu have to say __eou__
  Response: i think that tutori be outdat - veri first instruct fail __eou__ and it differ slight to what ubotu have to say __eou__

Table 1: Examples from the training dataset of UDCv2 showing both the correct (1) and incorrect response (0) labels

The training set consists of 1 million labelled pairs of utterances and responses. It has an equal distribution of true context-response pairs, labeled as 1, and context-distraction pairs, labeled as 0. Keeping all the words that occur at least 5 times, the training set has a vocabulary of 91,620 words. The average utterance is 86 words long and the average response is 17 words long.

The validation dataset consists of 19,560 examples, where each example consists of a context and 10 responses, of which the first response is always the true response. The test dataset, structured the same as the validation dataset, consists of 18,920 examples. The correct response is the actual next utterance in the dialogue, and a false response is a randomly sampled utterance from elsewhere within a set of dialogues in UDC that has been set aside for the creation of the validation and test sets (Lowe et al., 2015). The words of the UDCv2 are stemmed
(strip suffixes from the end of the word), and lemmatized (normalize words that have the same root, despite their surface differences).

Model Id   Description                 Recall@1   Recall@2   Recall@5
DE         Dual Encoder (Benchmark)    55.2       72.09      92.43
DER        Dual Encoder Reproduced     52.6       70.09      91.51
BE         Bi-Encoder (Proposed)       56.0       73.15      92.7

Table 2: Comparison of top-k % accuracy on UDCv2 on the test set

4.2 Effect of similarity measures

In the BE model, we used dot product similarity between the encoded utterance $u_e$ and response $r_e$. Before we made that decision, we evaluated several other similarity measures. These similarity measures are described in the subsequent sections.

Model Id   Description                     Recall@1   Recall@2   Recall@5
BE-19      BE with Cosine similarity       43.09      61.99      86.97
BE-20      BE with Polynomial Similarity   55.11      71.64      92.17
BE-21      BE using all hidden states      54.7       71.54      91.63
BE-22      BE with deep LSTM model         53.8       71.6       92.4
BE         BE with Dot Similarity          56.88      73.24      92.86

Table 3: Results of different similarity measures used on the BE model using the validation set

4.2.1 Cosine Similarity

Instead of taking the dot product of $u_e$ and $r_e$, we ignore their magnitude and take the dot product of their unit vectors. This is shown in the following equation:

$$\mathrm{sim}(u_e, r_e) = \frac{u_e^{\top} \cdot r_e}{|u_e| \, |r_e|} \tag{4}$$

4.2.2 Polynomial Similarity

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models. Although the Radial Basis Function (RBF) kernel is more popular in SVM classification than the polynomial kernel, Goldberg and Elhadad (2008) showed that the polynomial kernel gives better results than the RBF kernel for NLP applications. For degree-$d$ polynomials, the polynomial kernel is defined as:

$$K(x, y) = (x^{\top} \cdot y + c)^{d} \tag{5}$$

where $x$ and $y$ are vectors in the input space, i.e. vectors of features computed from training or test samples, and $c \geq 0$ is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When $c = 0$, the kernel is called homogeneous.

In this experiment, we summed the homogeneous polynomial kernels ($c = 0$) of degree 0 through 3 to form the similarity measure. The following equation gives the similarity function:

$$\mathrm{sim}(u_e, r_e) = \sum_{d=0}^{3} (u_e^{\top} \cdot r_e)^{d} \tag{6}$$

4.3 Effect of using all hidden states

In this experiment, we used all the hidden states of the LSTM, and not just the final states, to encode the utterance and response. The encoding for the utterance is given by:

$$u_e = \sum_{t=1}^{T} \frac{t^2}{T^2} \cdot h_t \tag{7}$$

where $T$ is the maximum context length of the utterance.

Similarly, the encoding for the response is given by:

$$r_e = \sum_{t=1}^{T} \frac{t^2}{T^2} \cdot h'_t \tag{8}$$

We keep the similarity function as in the original BE model, as shown in Eq. (1).

4.4 Deep LSTM

In this experiment, we added two more layers to the shallow LSTM BE model and looked at the result. We keep the similarity function as in the original BE model, as shown in Eq. (1).

4.5 Results and Discussion

Table 2 compares the performance of the proposed BE model, the benchmark DE model and the reproduced DE model (the DER model) on the UDCv2 dataset. Compared to the benchmark DE model, the proposed BE model achieves 0.8%, 1.0% and 0.3% higher accuracy for Recall@1, Recall@2 and Recall@5 respectively. Note that the margin is larger against the reproduced DER model than against the benchmark model: 3.4, 3.06 and 1.19 points for Recall@1, Recall@2 and Recall@5 respectively.

Table 3 shows the results of various experiments we performed on the BE model.

For a given NLP task, the choice of word embedding into a real vector space can affect the performance of a model. Table 4 shows the results
of using various embedding vectors with the BE model. We first looked at the random embedding and then used the Word2Vec embedding (Mikolov et al., 2013) trained on the UDCv2. We also used the pre-trained GloVe embedding (Pennington et al., 2014) and ran the model with all four pre-trained GloVe embeddings that are available: (1) Wikipedia - 6B tokens, 400K vocab, uncased, 50d, 100d, 200d, and 300d vectors, (2) Common Crawl - 42B tokens, 1.9M vocab, uncased, 300d vectors, (3) Common Crawl - 840B tokens, 2.2M vocab, cased, 300d vectors, and (4) Twitter - 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d and 200d vectors. Both the pre-trained embeddings and the embedding trained on UDCv2 show better results than the random embedding. Between Word2Vec and GloVe, the Common Crawl 42B GloVe embedding gives the best Recall@1 and Recall@5, while Word2Vec gives a slightly higher Recall@2. The T-SNE plot of the Common Crawl 42B embeddings is shown in Figure 4. As can be seen in the diagram, similar words (for example "thank", "thx" and "ty") appear embedded close to each other.

Embedding            Recall@1   Recall@2   Recall@5
Random               41.7       61.1       87.8
Word2Vec             56.55      73.61      92.7
Twitter 27B 200d     52.50      69.59      91.44
Common Crawl 42B     56.88      73.24      92.86
Common Crawl 840B    56.43      73.25      92.66

Table 4: Comparison of performances of the BE model with various embedding types. Results are shown on the validation set

Figure 4: T-SNE plot of word embeddings of some frequently occurring words in UDCv2

In our experiments, we tuned the LSTM cell size and the training batch size (Figure 5).

Figure 5: Effect of (a) RNN cell size and (b) training batch size on the BE (Bi-Encoder) model

4.6 Error Analysis

Similar to Lowe et al. (2017), we performed a qualitative error analysis on 100 randomly chosen examples from the test dataset where the model made an error for Recall@1 (Table 5). The errored examples were evaluated by three persons, each of whom manually scored every example on three metrics: Difficulty Rating, Model Response Rating and Error Category.

Difficulty Rating [1-5] measures how difficult a human finds it to match the context to the right response. A rating of 1 on the difficulty scale means that the question is easily answerable by all humans. A 2 indicates moderate difficulty, which should still be answerable by all humans but only if they are paying attention. A 3 means that the question is fairly challenging, and may require either some familiarity with Ubuntu or the human respondent paying very close attention to answer correctly. A 4 is very hard, usually meaning that there are other responses that are nearly as good as the true response; many humans would be unable to answer questions of difficulty 4 correctly. A 5 means that the question is effectively impossible: either the true response is completely unrelated to the context, or it is very short and generic.

Model Response Rating [1-3] measures how reasonable the model's choice is. A score of 1 indicates that the model predicted response is completely unreasonable given the context. A 2 means that the response chosen was somewhat reasonable, and that it's possible for a human to make a similar mistake. A 3 means that the model's response was more suited to the context than the actual response.

Error Category [1-4] puts the model error in a specific category. Error Category 1 relates to the tone and style of the context: if the model makes an error attributable to misspellings, incorrect grammar, use of emoticons, use of technical jargon or commands etc. in the context, then the error category is 1. Error Category 2 applies when the context and the chosen response relate to the same topic. Error Category 3 relates to the model's inability to account for turn-taking structure, for example if the last turn in the context asks a question and the model chose a response that does not answer the question. Error Category 4 means the model picked the response because it sees some common words between the context and the response.

Difficulty Rating        % of Errors
Impossible (5)           19%
Very Difficult (4)       11%
Difficult (3)            22%
Moderate (2)             30%
Easy (1)                 18%

Model Response Rating    % of Errors
Better than actual (3)   23%
Reasonable (2)           21%
Unreasonable (1)         56%

Error Category           % of Errors
Common words (4)         13%
Turn-taking (3)          45%
Same topic (2)           26%
Tone and style (1)       16%

Table 5: Qualitative evaluation of the errors from the BE model

The qualitative analysis results (Table 5) show that the BE model was not able to predict well the turn-taking structure of the dialogs, which accounts for 45% of the errors. A little more than half of the errored examples (52%) had a human difficulty level ranging from 3 to 5, and almost half of the model responses in the errored examples (44%) were either reasonable or better than the actual response.

5 Conclusions and Future Work

This paper presented a new LSTM based RNN architecture that can score a set of pre-defined responses given a context (utterance). Empirically we have shown that on average 92.7%, 73.15% and 56.0% of the time, the correct response will be in the top 5, top 2 and top 1 ranked responses respectively on Ubuntu Dialog Corpus Version 2, exceeding the accuracy of the benchmark model in all three metrics.

Collobert and Weston (2008) used a language model with a Rank loss/similarity where they had only positive examples and generated negative examples by corrupting the positive ones. Several other works have shown the Rank loss to be useful in training situations where pairs of correct or incorrect items are to be scored (Goldberg, 2016). Since the UDC dataset matches this scenario, we recommend that future work explore the BE model with the Rank loss.

In a large corpus like UDC, where users are seeking help with Ubuntu related problems, it is reasonable to assume that there can be multiple threads of discussion (topics) related to Ubuntu. Identifying the latent topics and grouping the utterances based on topics will allow training an ensemble of BE models. As there is no explicit grouping of the utterances, we plan to identify these hidden topics using Latent Dirichlet Allocation (LDA). The topic distributions of utterances can be used to group them using a probabilistic measure of distance. We hypothesize that ensembles of BE models may serve in the efficient selection of correct responses.

Since retrieval-based systems have to loop through every single possible response, a system that needs to go through a very large set may not be practically feasible in production. As shown by Kannan et al. (2016), one way to reduce the number of possible responses is through clustering. Jafarpour et al. (2010) and Wang et al. (2013) also showed several ways of reducing the large set of possible responses to a smaller set. We intend to apply such ideas in our future work.

Moreover, in a multi-turn dialog system, capturing longer term context is essential to selecting the correct response. Our proposed architecture can be extended to more hierarchical RNN layers, capturing longer context. We plan to investigate this further in conjunction with the paragraph vector (Le and Mikolov, 2014).

References

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828.
Erik Cambria and Bebo White. 2014. Jumping nlp curves: A review of natural language processing research. IEEE Computational intelligence magazine, 9(2):48–57.

Rollo Carpenter. 2011. Cleverbot.

Kenneth Mark Colby, Sylvia Weber, and Franklin Dennis Hilf. 1971. Artificial paranoia. Artificial Intelligence, 2(1):1–25.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM.

Yoav Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420.

Yoav Goldberg and Michael Elhadad. 2008. splitsvm: fast, space-efficient, non-heuristic, polynomial kernel computation for nlp applications. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 237–240. Association for Computational Linguistics.

Sina Jafarpour, Christopher JC Burges, and Alan Ritter. 2010. Filter, rank, and transfer the knowledge: Learning to chat. Advances in Ranking, 10:2329–9290.

Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for ubuntu corpus dialogs. arXiv preprint arXiv:1510.03753.

Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 955–964. ACM.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.

Ryan Thomas Lowe, Nissan Pow, Iulian Vlad Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training end-to-end dialogue systems with the ubuntu dialogue corpus. Dialogue & Discourse, 8(1):31–65.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. Proceedings of the IEEE conference on computer vision and pattern recognition.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 172–180. Association for Computational Linguistics.

Alan Ritter, Colin Cherry, and William B Dolan. 2011. Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing, pages 583–593. Association for Computational Linguistics.

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

Richard S Wallace. 2008. Alice: Artificial intelligence foundation inc. Received from: http://www.alicebot.org.
Hao Wang, Zhengdong Lu, Hang Li, and Enhong Chen. 2013. A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 935–945.

Joseph Weizenbaum. 1966. Eliza - a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

A Source Code

The source code for this project can be found at https://github.com/DiwanshuShekhar/bi_encoder_lstm
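As a compact companion to the repository, the similarity measures of Section 4.2 (Eqs. 1, 4 and 6) and the weighted hidden-state encoding of Section 4.3 (Eqs. 7 and 8) can be sketched in NumPy. This is an illustrative sketch only and is not taken from the repository: all function names are hypothetical, the small `eps` guard in the cosine similarity is an added assumption, and the number of rows passed to `weighted_state_encoding` is assumed to stand in for the maximum context length T.

```python
import numpy as np


def dot_similarity(u_e, r_e):
    """Eq. (1): plain dot product of the two encodings."""
    return float(np.dot(u_e, r_e))


def cosine_similarity(u_e, r_e, eps=1e-12):
    """Eq. (4): dot product of the unit vectors.
    eps is an assumption added here to avoid division by zero."""
    norm = np.linalg.norm(u_e) * np.linalg.norm(r_e)
    return float(np.dot(u_e, r_e) / (norm + eps))


def polynomial_similarity(u_e, r_e, max_degree=3):
    """Eq. (6): sum of homogeneous polynomial kernels of degree 0..max_degree."""
    s = float(np.dot(u_e, r_e))
    return sum(s ** d for d in range(max_degree + 1))


def weighted_state_encoding(hidden_states):
    """Eqs. (7)/(8): sum over t of (t^2 / T^2) * h_t.

    hidden_states: array of shape (T, s), one row per time step, so later
    states receive larger weights and the final state has weight 1.
    """
    h = np.asarray(hidden_states, dtype=float)
    T = h.shape[0]
    t = np.arange(1, T + 1)
    weights = (t ** 2) / float(T ** 2)
    return weights @ h
```

For example, with u_e = (3, 4) and r_e = (4, 3) the dot product is 24, the cosine similarity is 24/25, and the polynomial similarity is 1 + 24 + 24^2 + 24^3. The polynomial sum grows quickly with the dot product, which is one reason the scale of the encodings matters when comparing these measures.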