Continuity of Topic, Interaction, and Query: Learning to Quote in Online Conversations
Lingzhi Wang1,2, Jing Li3, Xingshan Zeng1,2∗, Haisong Zhang4, Kam-Fai Wong1,2
1 The Chinese University of Hong Kong, Hong Kong, China
2 MoE Key Laboratory of High Confidence Software Technologies, China
3 Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
4 Tencent AI Lab, China
{lzwang,xszeng,kfwong}@se.cuhk.edu.hk, jing-amelia.li@polyu.edu.hk, hansonzhang@tencent.com
∗ Xingshan Zeng is the corresponding author.

arXiv:2106.09896v1 [cs.CL] 18 Jun 2021

Abstract

Quotations are crucial for successful explanations and persuasions in interpersonal communications. However, finding what to quote in a conversation is challenging for both humans and machines. This work studies automatic quotation generation in an online conversation and explores how language consistency affects whether a quotation fits the given context. Here, we capture the contextual consistency of a quotation in terms of latent topics, interactions with the dialogue history, and coherence to the query turn's existing content. Further, an encoder-decoder neural framework is employed to continue the context with a quotation via language generation. Experiment results on two large-scale datasets in English and Chinese demonstrate that our quotation generation model outperforms the state-of-the-art models. Further analysis shows that topic, interaction, and query consistency are all helpful for learning how to quote in online conversations.

[T1]: Save your money. Scuf is the biggest ripoff in gaming.
[T2]: What would you suggest instead?
[T3]: Just use a normal controller.
[T4]: Ooooooh, I get it now... you're just dumb.
[Q]: The dumb ones are the people spending over $100 for a controller. A fool and his money are soon parted.

Figure 1: A Reddit conversation snippet about buying a Scuf controller. The quotation is in blue and italic. [T1] to [T4] are history turns while [Q] is the query turn.

1 Introduction

Quotations, or quotes, are memorable phrases or sentences widely echoed to spread patterns of wisdom (Booten and Hearst, 2016). They are derived from the ancient art of rhetoric and now appear in various daily activities, ranging from formal writings (Tan et al., 2015) to everyday conversations (Lee et al., 2016), all helping us present clear, beautiful, and persuasive language. However, for many individuals, writing a suitable quotation that fits the ongoing context is a daunting task. The issue becomes more pressing for quoting in online conversations, where quick responses are usually needed on mobile devices (Lee et al., 2016).

To help online users find what to quote in the discussions they are involved in, our work studies how to recommend a quote for an ongoing conversation while ensuring its continuity of sense with the existing context. For task illustration, Figure 1 displays a Reddit conversation snippet centered around whether a Scuf controller is worth buying. To argue against T4's viewpoint, the query turn quotes Tusser's old saying to show that buying such a controller is a waste of money. As can be observed, a quotation recommendation model needs to capture the key points being discussed (reflected by words like "money" and "dumb" here) and align them with words in the quotation to be predicted (such as "fool" and "money"), which allows it to quote something relevant and consistent with the previous concern.

To predict quotations, our work explores the semantic consistency between what will be quoted and what was given in the context. In context modeling, we distinguish the query turn (henceforth query) from the other turns in the earlier history (henceforth history), where topic, interaction, and query consistency work together to determine whether a quote fits the context. Here topic consistency ensures that the words in the quotation reflect the discussion topic (such as "fool" and "money" in Figure 1). Interaction consistency identifies the turns in history to which the query responds (e.g., T1 and T4 in Figure 1) and guides the quote to follow such interactions. Query consistency measures the language coherence of the quote in continuing the story started by the query. For example, the quote in Figure 1 supports the query's argument.

In previous work on quotation recommendation, many methods are designed for formal writings (Tan et al., 2015; Liu et al., 2019), whereas much less effort has been made for online conversations with informal language and complex interactions in their contexts. Lee et al. (2016) use a ranking model to recommend quotes for Twitter conversations. Different from them, we attempt to generate quotations in a word-by-word manner, which allows the semantic consistency of quotes and contexts to be explored.

Concretely, we propose a neural encoder-decoder framework to predict a quotation that continues the given conversation context. We capture topic consistency with latent topics (i.e., word distributions), which are learned by a neural topic model (Zeng et al., 2018a) and inferred jointly with the other components. Interaction consistency is modeled with a turn-based attention over the history turns, and the query is additionally encoded to initialize the decoder's states for query consistency. To the best of our knowledge, we are the first to explore quotation generation in conversations and to extensively study the effects of topic, interaction, and query consistency on this task.

Our empirical study is conducted on two large-scale datasets, one in Chinese from Weibo and the other in English from Reddit, both of which are constructed as part of this work. Experiment results show that our model significantly outperforms both the state-of-the-art model based on quote ranking (Lee et al., 2016) and the recent topic-aware encoder-decoder model for social media language generation (Wang et al., 2019a). For example, we achieve 27.2 precision@1 on Weibo compared with 24.0 by Wang et al. (2019a). Further discussions show that topic, interaction, and query consistency can all usefully indicate what to quote in online conversations. We also study how the length of history and quotations affects the quoting results and find that we perform consistently better than comparison models in varying scenarios.

2 Related Work

Our work is in line with content-based recommendation (Liu et al., 2019) and cloze-style reading comprehension (Zheng et al., 2019), which learn to put suitable text fragments (e.g., words, phrases, sentences) in the given contexts. Most prior studies explore the task in formal writings, such as citing previous work in scientific papers (He et al., 2010), quoting famous sayings in books (Tan et al., 2015, 2016), and using idioms in news articles (Liu et al., 2019; Zheng et al., 2019). The language they face is mostly formal and well-edited, while we tackle online conversations exhibiting noisy contexts, which requires modeling quote consistency with turn interactions. Lee et al. (2016) also recommend quotations for conversations. However, they treat quotations as discrete attributes (for learning to rank) and hence largely ignore the rich information reflected by a quotation's internal word patterns. Compared with them, our model learns to quote with language generation, which can usefully exploit how words appear in both contexts and quotations.

For methodology, we are inspired by encoder-decoder neural language generation models (Sutskever et al., 2014; Bahdanau et al., 2014). In dialogue domains, such models have achieved huge success in digesting contexts to generate microblog hashtags (Wang et al., 2019b), meeting summaries (Li et al., 2019), dialogue responses (Hu et al., 2019), etc. Here we explore how the encoder-decoder architecture can be used to generate quotations in conversations, which has never been studied in existing work. Our study is also related to previous research on understanding conversation contexts (Ma et al., 2018; Liu and Chen, 2019; Sun et al., 2019), where it is shown to be useful to capture interaction structures (Liu and Chen, 2019) and latent topics (Zeng et al., 2019). For latent topics, we benefit from the recent advance of neural topic models (Miao et al., 2017; Wang et al., 2019a), which allows end-to-end topic inference in neural architectures. Nevertheless, none of the above work attempts to study the semantic consistency of quotes in conversation contexts, which is the gap our work fills.
3 Our Quotation Generation Model

This section describes our neural encoder-decoder framework that generates quotations in conversations, whose architecture is shown in Figure 2. The encoding process handles context modeling of turn interactions (described in Section 3.1) and latent topics (presented in Section 3.2). For the decoding process, to be discussed in Section 3.3, we predict the words in quotes taking topic, interaction, and query consistency into consideration. The learning objective of the entire framework is given at last in Section 3.4.

Figure 2: Our encoder-decoder framework for conversation quotation generation. It encodes turn interactions (for both query and earlier history) and latent topics in contexts. The decoder predicts quotes aware of topic, interaction, and query consistency.

3.1 Interaction Modeling

To describe turn interactions, we first assume that there are m chronologically ordered turns given as context, and each turn Ti is formulated as a word sequence ⟨w_{i,1}, w_{i,2}, ..., w_{i,ni}⟩ (ni denotes the number of words). We consider the m-th turn as the query and the others as the history (T_history = ⟨T1, T2, ..., T_{m-1}⟩). Here we distinguish the query from its earlier history to separately explore the quote's language coherence to the query (for query consistency) and its interaction consistency with the earlier posted turns. In the following, we describe how to encode history and query turns, and how the learned representations work together to explore the conversation structure.

History Encoder. Here we describe how to encode the turns in history. We first feed each word w_{i,j} (the j-th word in the i-th turn) in history into an embedding layer to obtain its word vector c_{i,j}. The word vectors of the i-th turn, C_i = ⟨c_{i,1}, c_{i,2}, ..., c_{i,ni}⟩, are then processed with a bidirectional gated recurrent unit (Bi-GRU) (Cho et al., 2014b), whose hidden states are defined as:

\overrightarrow{h^c_{i,j}} = f_{GRU}(c_{i,j}, \overrightarrow{h^c_{i,j-1}}), \quad \overleftarrow{h^c_{i,j}} = f_{GRU}(c_{i,j}, \overleftarrow{h^c_{i,j+1}})    (1)

The turn-level representation is then captured by concatenating the last hidden states of both directions: h^c_i = [\overrightarrow{h^c_{i,n_i}}; \overleftarrow{h^c_{i,0}}]. Further, we define the history representations as h^c = ⟨h^c_1, h^c_2, ..., h^c_{m-1}⟩, which will be further used to encode the interaction structure (described later).

Query Encoder. Similar to the way we encode each turn in history, a Bi-GRU is first employed to learn the query representation q = h^c_m. Then, we identify which turns in history the query responds to for learning interaction consistency. To this end, we put a query-aware attention over the history turns, which results in the context vector below:

c = \sum_{i=1}^{m-1} \alpha_i \cdot h^c_i, \quad \alpha_i = \mathrm{softmax}(h^c_i \cdot q)    (2)

Afterwards, we enrich the query representation with the features from history and obtain the history-aware query representation:

\tilde{q} = W_q [q; c] + b_q    (3)

where W_q and b_q are learnable parameters.

Structure Encoder. With the representations learned above for the query \tilde{q} and the history h^c, we can further explore how turns interact with their neighbors (henceforth conversation structure) with another Bi-GRU. It is fed with the sequence ⟨h^c_1, h^c_2, ..., h^c_{m-1}, \tilde{q}⟩, and the resulting hidden state sequence ⟨h_1, h_2, ..., h_{m-1}, h_m⟩ is put into a memory bank M for the decoder's attentive retrieval in quotation generation (see Section 3.3).
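The interaction encoding above maps naturally onto a few standard PyTorch modules. Below is a minimal sketch of Section 3.1 (Eqs. 1-3 plus the structure Bi-GRU), written for illustration only: the class and variable names, tensor shapes, and batching scheme are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionEncoder(nn.Module):
    """Sketch of Sec. 3.1: turn-level Bi-GRU, query-aware attention (Eqs. 2-3),
    and a structure Bi-GRU whose hidden states form the memory bank M."""
    def __init__(self, emb_dim=150, hidden=150):
        super().__init__()
        self.turn_gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.struct_gru = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.fuse = nn.Linear(4 * hidden, 2 * hidden)   # W_q [q; c] + b_q  (Eq. 3)

    def encode_turn(self, turn_emb):                    # turn_emb: (batch, n_i, emb_dim)
        states, _ = self.turn_gru(turn_emb)
        # concatenate the last forward state and the first backward state -> h^c_i
        half = states.size(-1) // 2
        fwd, bwd = states[:, -1, :half], states[:, 0, half:]
        return torch.cat([fwd, bwd], dim=-1)            # (batch, 2*hidden)

    def forward(self, turn_embs):                       # list of m tensors; the last one is the query
        reps = [self.encode_turn(t) for t in turn_embs]
        history = torch.stack(reps[:-1], dim=1)         # (batch, m-1, 2*hidden)
        q = reps[-1]                                    # (batch, 2*hidden)
        # query-aware attention over history turns (Eq. 2)
        scores = torch.bmm(history, q.unsqueeze(-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)
        c = torch.bmm(alpha.unsqueeze(1), history).squeeze(1)
        q_tilde = self.fuse(torch.cat([q, c], dim=-1))  # history-aware query (Eq. 3)
        # structure encoding over <h^c_1, ..., h^c_{m-1}, q~> -> memory bank M
        seq = torch.cat([history, q_tilde.unsqueeze(1)], dim=1)
        memory, _ = self.struct_gru(seq)                # (batch, m, 2*hidden)
        return memory, q_tilde

# Toy usage: one conversation with three 4-word turns (the last turn is the query).
enc = InteractionEncoder()
turns = [torch.randn(1, 4, 150) for _ in range(3)]
memory, q_tilde = enc(turns)   # memory: (1, 3, 300), q_tilde: (1, 300)
```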
3.2 Topic Modeling

Following common practice (Blei et al., 2003; Miao et al., 2017), we model topics under the bag-of-words (BoW) assumption. Hence, we form a BoW vector x_bow (over vocabulary V) of the words in the context to learn its discussion topic. The topic inference process is inspired by neural topic models (NTM) (Miao et al., 2017). It is based on a variational auto-encoder (VAE) (Kingma and Welling, 2013) involving an encoding and a decoding step to reconstruct the BoW of the context.

BoW Encoding Step. This step is designed to learn a latent topic variable z from x_bow. Here words in conversation contexts are assumed to follow a Gaussian prior with mean µ and standard deviation σ (Miao et al., 2017), estimated by:

\mu = f_\mu(f_e(x_{bow})), \quad \log\sigma = f_\sigma(f_e(x_{bow}))    (4)

where f_*(·) is a neural perceptron performing a linear transformation activated with a ReLU function (Nair and Hinton, 2010).

BoW Decoding Step. Conditioned on the latent topic z, we further generate words to form the BoW of each conversation x_bow. Here we assume each word w_n ∈ x_bow is drawn from the conversation's topic mixture θ, which is a distribution vector over the topics. The generative story to decode x_bow is:

• Draw the latent topic z ∼ N(µ, σ²).
• Topic mixture θ = softmax(f_θ(z)).
• For the n-th word in the conversation:
  – Draw the word w_n ∼ softmax(f_φ(θ)).

Here f_*(·) is a ReLU-activated neural perceptron as defined above. The topic mixture θ will later be applied to capture topic consistency when predicting the quotation.

3.3 Quotation Generation

To predict the quotation y, we first define the probability of its words with the following formula:

\Pr(y \mid T_{history}, T_{query}) = \prod_{i=1}^{|y|} \Pr(y_i \mid y_{<i}, T_{history}, T_{query})    (5)

Topic and Interaction Consistency. For modeling the quote's consistency with the discussion topics (via θ) and the turn interactions (via M), we design a turn-based attention over the conversation context to decode the quotation. Its attention weights are computed from the decoder hidden state h^d_i, the structure-encoded turn representations h_j in M, and the topic distribution θ:

\alpha_{ij} = \frac{\exp(f_d(h^d_i, h_j, \theta))}{\sum_{j'=1}^{m} \exp(f_d(h^d_i, h_{j'}, \theta))}    (7)

where f_d(h^d_i, h_j, θ) captures the topic-aware semantic dependency of the i-th word in the quotation on the j-th turn in the context and is defined as:

f_d(h^d_i, h_j, \theta) = W_d [h^d_i \cdot h^\theta_j] + b_d    (8)

where h^θ_j = W_θ[h_j; θ] + d_θ, and the parameters W_d, b_d, W_θ, and d_θ are all trainable. Then we compute the context vector t_i, conveying both topic and interaction features, for the i-th word to be generated:

t_i = \sum_{j=1}^{m} \alpha_{ij} h_j    (9)

Finally, we predict the i-th word in the quotation conditioned on this context vector.
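To make Eqs. 7-9 concrete, here is a small PyTorch sketch of the topic-aware turn attention applied at each decoding step. The module layout and tensor shapes are our own illustrative choices (e.g., reading the product in Eq. 8 as element-wise), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicAwareTurnAttention(nn.Module):
    """Sketch of Eqs. 7-9: score each turn in the memory bank against the
    decoder state, conditioning the scores on the topic mixture theta."""
    def __init__(self, dec_dim, mem_dim, topic_num):
        super().__init__()
        self.topic_proj = nn.Linear(mem_dim + topic_num, dec_dim)  # h^theta_j = W_theta [h_j; theta] + d_theta
        self.score = nn.Linear(dec_dim, 1)                         # f_d = W_d [h^d_i * h^theta_j] + b_d

    def forward(self, dec_state, memory, theta):
        # dec_state: (batch, dec_dim); memory: (batch, m, mem_dim); theta: (batch, topic_num)
        theta_exp = theta.unsqueeze(1).expand(-1, memory.size(1), -1)
        h_theta = self.topic_proj(torch.cat([memory, theta_exp], dim=-1))  # (batch, m, dec_dim)
        scores = self.score(dec_state.unsqueeze(1) * h_theta).squeeze(-1)  # Eq. 8 -> (batch, m)
        alpha = F.softmax(scores, dim=-1)                                  # Eq. 7
        t_i = torch.bmm(alpha.unsqueeze(1), memory).squeeze(1)             # Eq. 9, context vector
        return t_i, alpha

# Toy usage with made-up dimensions: 3 turns in memory, 50 latent topics.
attn = TopicAwareTurnAttention(dec_dim=300, mem_dim=300, topic_num=50)
theta = torch.softmax(torch.randn(1, 50), dim=-1)
t_i, alpha = attn(torch.randn(1, 300), torch.randn(1, 3, 300), theta)
```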
As for L_QGM, it is defined as the cross-entropy loss over all training instances to train the quotation generation model (QGM):

\mathcal{L}_{QGM} = -\sum_{n=1}^{N} \log\big(\Pr(y_n \mid C_n, \theta_n)\big)    (13)

where N is the number of training instances, C_n = {T_history, T_query}_n represents the context of the n-th conversation, and θ_n is C_n's topic composition induced by the NTM.

4 Experimental Setup

Datasets. For the experiments, we construct two new datasets: one in Chinese from Weibo (a popular microblog platform in China, henceforth Weibo) and the other in English from Reddit (henceforth Reddit).3 The raw Weibo data is released by Wang et al. (2019a), and the Reddit data is obtained from a publicly available corpus.4 For both Weibo and Reddit, we follow the common practice of forming conversations from posts and their comments (Li et al., 2015; Zeng et al., 2018b), where a post or comment is considered a conversation turn.

To gather conversations with quotations, we maintain a quotation list and remove conversations containing no quotation from the list. For the remaining ones, if a conversation has multiple quotes, we construct multiple instances, each corresponding to the prediction of one quotation therein. On Weibo, we explore the quoting of Chinese Chengyu.5 For Reddit, we obtain the quotation list from Wikiquote.6 Afterwards, we remove conversation instances with quotations appearing fewer than 5 times to avoid sparsity (Tan et al., 2015). Finally, the datasets are randomly split into 80%, 10%, and 10% for training, development, and test.

3 The datasets are available at https://github.com/Lingzhi-WANG/Datasets-for-Quotation-Recommendation
4 https://files.pushshift.io/reddit/comments/
5 https://en.wikipedia.org/wiki/Chengyu. Chengyu can be seen as quotable phrases (Wang and Wang, 2013), memorable rhetorical figures that convey wit and striking statements (Bendersky and Smith, 2012).
6 https://en.wikiquote.org/wiki/Main_Page

The statistics of the two datasets are shown in Table 1. We observe that the two datasets exhibit different characteristics. For example, from the average turn number in contexts, we find that Reddit users tend to quote in later turns while Weibo users quote earlier. To further compare users' quoting behavior, we show the distribution of quotation frequency in Figure 3(a) and of quotation position in Figure 3(b). Figure 3(a) shows that only a few quotations are commonly used in online conversations, probably because of the informal writing style of online conversations. For Figure 3(b), we find that only a few Weibo conversations quote 5 turns or later, while the distribution on Reddit is much flatter.

                          Weibo    Reddit
# of quotes               1,053    1,111
Avg len of quotes         4.0      10.1
|Voc| of quotes           1,251    4,111
# of convs                19,081   44,539
Avg # of turns per conv   2.51     4.25
Avg len of turn per conv  21.6     71.8
|Voc| of convs            44,134   72,375

Table 1: Statistics of the Weibo and Reddit datasets. The upper rows are for quotes and the lower rows for conversations. "len" refers to the number of tokens contained; "Avg # of turns" means the average turn number in context.

Figure 3: Quotation distribution over frequency (left) and position (right). X-axis: frequency (left) and context turn number (right); Y-axis: proportion of quotations (left) and conversations (right).

Preprocessing. To preprocess the Weibo data, we adopt the open-source Jieba toolkit7 for Chinese word segmentation. For the Reddit dataset, we employ the Natural Language Toolkit (NLTK8) for tokenization. In BoW preparation, all stop words and punctuation are removed, following common practice for training topic models (Blei et al., 2003).

7 https://github.com/fxsjy/jieba
8 https://www.nltk.org
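A rough sketch of this preprocessing pipeline is given below. The exact cleaning rules are not specified in the paper, so the stopword handling, vocabulary filtering, and function names here are illustrative assumptions.

```python
import string
from collections import Counter

import jieba
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def tokenize_weibo(text):
    """Segment a Chinese Weibo turn with Jieba."""
    return [tok for tok in jieba.lcut(text) if tok.strip()]

def tokenize_reddit(text):
    """Tokenize an English Reddit turn with NLTK."""
    return word_tokenize(text.lower())

def build_bow(tokens, vocab, stop_set):
    """Drop stopwords/punctuation and count the rest over the topic-model vocabulary."""
    kept = [t for t in tokens if t not in stop_set and t not in string.punctuation]
    counts = Counter(t for t in kept if t in vocab)
    return [counts.get(w, 0) for w in vocab]   # BoW vector x_bow over vocabulary V

# Example on the English side; nltk.download('punkt') and nltk.download('stopwords')
# may be needed before the first run.
if __name__ == "__main__":
    stop_set = set(stopwords.words("english"))
    vocab = ["money", "controller", "dumb", "fool"]
    toks = tokenize_reddit("Save your money. Scuf is the biggest ripoff in gaming.")
    print(build_bow(toks, vocab, stop_set))
```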
Parameter Setting. In the model architecture, the hidden size of all GRUs is set to 300 (bidirectional, 150 for each direction). For the encoder, we adopt two layers of bidirectional GRU, and a unidirectional GRU for the decoder. The parameters of the NTM are set following Zeng et al. (2018a). For the input, we set the maximum turn length to 150 for Reddit and 200 for Weibo, and the maximum quotation length to 20. Word embeddings are randomly initialized as 150-dimensional vectors. In model training, we employ the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3 and early stopping (Caruana et al., 2001). Dropout (Srivastava et al., 2014) is also used to avoid overfitting. We adopt beam search (beam size = 5) to generate a ranked list for quote recommendation.

Evaluation Metrics. We first adopt the popular information retrieval metrics precision at K (P@K) and mean average precision (MAP) (Schütze et al., 2008). For P@K, we set K=1 to measure the top prediction, while for MAP we consider the top 5 outputs. Here the generation models are measured with their predictions after quotation matching (Section 3.3). We then employ generation metrics to evaluate word-level predictions: ROUGE (Lin, 2004) from summarization (F1 scores of ROUGE-1 and ROUGE-L are adopted) and BLEU (Papineni et al., 2002) from machine translation. To allow comparable results, generation models are measured with their original outputs (without quotation matching), while for the ranking competitors we take their top-1 ranked quotes.
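For concreteness, the two ranking metrics can be computed from the beam-search outputs as follows. This is the standard formulation with a single gold quotation per instance, not the authors' evaluation script.

```python
def precision_at_1(ranked_lists, gold):
    """ranked_lists[i] holds the ranked quotation candidates for instance i; gold[i] is its reference."""
    hits = sum(1 for preds, g in zip(ranked_lists, gold) if preds and preds[0] == g)
    return hits / len(gold)

def map_at_k(ranked_lists, gold, k=5):
    """Mean average precision over the top-k candidates. With a single gold quotation
    per instance this reduces to the reciprocal rank of the gold item within the top k."""
    total = 0.0
    for preds, g in zip(ranked_lists, gold):
        for rank, p in enumerate(preds[:k], start=1):
            if p == g:
                total += 1.0 / rank
                break
    return total / len(gold)

# Toy usage: three instances whose gold quote is found at rank 1, rank 2, and not at all.
ranked = [["q1", "q7", "q2"], ["q5", "q3", "q9"], ["q4", "q8", "q6"]]
gold = ["q1", "q3", "q2"]
print(precision_at_1(ranked, gold))   # 0.333...
print(map_at_k(ranked, gold, k=5))    # (1 + 1/2 + 0) / 3
```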
Comparisons. We first adopt two weak baselines that select quotations unaware of the target conversation: 1) RANDOM: selecting quotations randomly; 2) FREQUENCY: ranking quotations by frequency. We then compare against two ranking baselines: 3) LTR: a non-neural learning-to-rank model with the handcrafted features proposed in Tan et al. (2015); 4) CNN-LSTM (Lee et al., 2016): a previous quotation recommendation model (CNN for turn and quotation encoding and LSTM for conversation structure). Next, we consider encoder-decoder generation models that do not model conversation structure: 5) SEQ2SEQ (Cho et al., 2014a): using an RNN for encoding and another RNN for decoding; 6) TAKG (Wang et al., 2019a): a Seq2Seq framework incorporating latent topics for decoding; 7) NCIR (Liu et al., 2019): the state-of-the-art (SOTA) model designed for Chinese idiom generation. Finally, the following variants of our model are tested: 8) IE ONLY: using only the interaction modeling results for decoding (without topic and query consistency modeling); 9) IE+QE: coupling interaction and query consistency (without the NTM for topic consistency); 10) IE+QE+NTM: our full model.

5 Experimental Results

In this section, we first show the main comparison results in Section 5.1. Then Section 5.2 discusses what we learn to represent consistency. Finally, Section 5.3 presents more analysis to characterize quotations in online conversations.

5.1 Main Comparison Results

Table 2 reports the main comparison results on the two datasets, where our full model significantly outperforms all comparisons by a large margin. Several interesting observations can be drawn:

• Quotation is related to context. The poor performance of the weak baselines reveals the challenging nature of quoting in online conversations. It is not possible to learn what to quote without considering context.

• Generation models outperform ranking models. Generation models in the encoder-decoder style perform much better than ranking ones. This may be attributed to the generation models' ability to learn word-level mappings from the source context to the quotation.

• Interaction, query, and topic consistency are all useful. We see that IE ONLY outperforms SEQ2SEQ, showing that interaction modeling helps encode indicative features from context. Likewise, the results of IE+QE are better than IE ONLY, and IE+QE+NTM better than IE+QE, both suggesting that learning query and topic consistency contributes to yielding a better quotation.

• Quoting on Reddit is more challenging than Weibo Chengyu. All models perform worse on Reddit than on Weibo. The likely reason is that Chinese Chengyu are shorter and come from a smaller vocabulary than English quotes (see Table 1).
                        Weibo                                 Reddit
Models          P@1   MAP   RG-1  RG-L  BLEU    P@1   MAP   RG-1  RG-L  BLEU
Weak Baselines
 RANDOM         0.2   0.7   2.1   2.1   0.2     0.1   0.7   5.6   4.5   0.1
 FREQUENCY      2.3   6.9   3.1   3.1   2.3     1.0   4.7   1.7   1.5   1.0
Ranking Models
 LTR            3.6   9.3   5.1   5.1   3.6     1.7   7.1   4.1   3.6   1.7
 CNN-LSTM       7.3   11.3  10.5  10.5  7.3     4.1   5.2   6.8   6.0   3.7
Generation Models
 SEQ2SEQ        19.9  24.1  22.6  22.5  19.9    7.2   9.8   11.7  10.6  4.7
 TAKG           24.0  27.3  26.8  26.7  24.0    12.5  16.0  15.7  14.4  6.7
 NCIR           22.6  26.5  25.3  25.2  22.6    7.3   12.2  10.9  9.9   4.1
Our models
 IE ONLY        21.5  24.8  24.5  24.4  21.5    11.2  14.6  13.9  12.8  5.7
 IE+QE          22.0  24.7  25.2  25.1  22.0    13.5  17.4  17.0  15.5  7.0
 IE+QE+NTM      27.2‡ 31.6† 29.5‡ 29.5‡ 27.2‡   17.5† 24.0† 20.3† 18.8† 9.5†

Table 2: Comparison results on the Weibo and Reddit datasets (in %). RG-1 and RG-L refer to ROUGE-1 and ROUGE-L, respectively. The best results in each column are in bold. Our full model IE+QE+NTM achieves significantly better performance than all comparisons (paired t-test; ‡: p < 0.05; †: p < 0.01).

5.2 Quotation and Consistency

Having shown our effectiveness in the main results, we further examine the learned consistency and its effects on quoting. In the rest of this paper, unless otherwise specified, "our model" is used as a short form of our full model (IE+QE+NTM). For comparison, we select TAKG for its best performance over all comparison models in Table 2.

Interaction Consistency. To understand the positions of turns a quote is likely to respond to, we display the turn-based attention weights (Eq. 7) over turn positions in Figure 4, along with the attention weights from TAKG (Wang et al., 2019a) for comparison. Here we use Reddit conversations for interpretation because they involve larger turn numbers (see Table 1). We see that TAKG only attends to the first three turns, while our model assigns higher weights to turns closer to the query. In doing so, the quotes will continue the sense of the later history, which fits the intuition that participants tend to interact with the latest information.

Figure 4: Attention weights over turns. X-axis: turn position (T1 to T10); Y-axis: the normalized weight.

Query Consistency. We carry out a human evaluation to test the coherence of the query and the predicted quotations. 100 conversations are sampled from Weibo, and two native Chinese speakers are invited to examine whether a quote carries on the query's sense ("yes") or not ("no"). Table 3 shows the count of "yes" for the ground-truth quote and the outputs of IE ONLY and IE+QE. Interestingly, even ground-truth quotations cannot attain over 85% "yes", probably because of the prominent misuse of quotations on social media. Nevertheless, the better performance of IE+QE compared with IE ONLY shows the usefulness of modeling query consistency for ensuring the quotation's language coherence to the query.

              Human 1   Human 2
Ground Truth     84        78
IE ONLY          36        32
IE+QE            49        46

Table 3: Human evaluation results for quotes coherent with the query (count out of 100).

Topic Consistency. Here we use the example in Figure 1 to analyze the topics we learn for modeling consistency. Recall that the conversation centers around price and value, and the quote is used to argue that only fools will waste the money. We look into the top 3 latent topics (by topic mixture θ) and display their top 10 words (by likelihood) in Table 4. Words like "pay" and "stupid" appear, which might help to correctly predict "fool" and "money" in the quote.
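This inspection can be reproduced with a few lines once the NTM's topic mixture and topic-word weights are available. The sketch below uses random placeholders for these trained parameters, so the words it prints are purely illustrative.

```python
import numpy as np

def top_topics_and_words(theta, topic_word, vocab, n_topics=3, n_words=10):
    """theta: (K,) topic mixture for one conversation; topic_word: (K, V) word weights per topic."""
    top_k = np.argsort(-theta)[:n_topics]                 # topics with the largest mixture weight
    result = []
    for k in top_k:
        top_w = np.argsort(-topic_word[k])[:n_words]      # highest-likelihood words of topic k
        result.append((int(k), [vocab[i] for i in top_w]))
    return result

# Toy usage with random placeholders standing in for trained NTM parameters.
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(50)]
theta = rng.dirichlet(np.ones(20))        # K = 20 latent topics
topic_word = rng.random((20, 50))
for k, words in top_topics_and_words(theta, topic_word, vocab):
    print(k, words)
```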
Topic 1: game property child rights pay guy state church guys paid
Topic 2: f**k evidence sh*t guys stupid edit nice proof dude dumb
Topic 3: car buy cops police scrubs gun technology shot crime energy

Table 4: The top 10 words of the 3 latent topics related to the conversation in Figure 1. Words suggesting the conversation's focus are in blue and italic.

5.3 Sensitivity to Context and Quotations

In this section, we study how varying contexts and quotations affect our performance.

The Effects of Context. Here we examine whether longer context results in better predictions. In the following, we measure context length in terms of turn number and token number.

Turn Number. Figure 5 shows our MAP scores for Reddit conversations with varying turn numbers. Weibo results are not shown here due to the limited data with turn number > 4. Generally, more turns result in better MAP, thanks to the richer information to be captured from turn interactions. The scores drop for turn number > 8, probably because of underfitting, and a more complex model might be needed for interaction modeling.

Figure 5: Our MAP scores over conversations with varying turn number. X-axis: turn number (2 to 10); Y-axis: MAP scores. The best results are seen for 8-turn conversations.

To further explore the model's sensitivity to turn number, we first rank the conversations by turn number and separate them into four quartiles (Q1, Q2, Q3, Q4, in order of increasing turn number). We then train and test in each quartile and compare the results of our model and TAKG in Figure 6(a). As can be seen, our model presents a larger margin for quartiles corresponding to larger turn numbers, indicating our ability to encode rich information from complex turn interactions.

Token Number. For context length measured in token number, we follow the above steps to form training and test quartiles by token number. The results are shown in Figure 6(b), where our model consistently outperforms TAKG over conversation contexts with varying token numbers.

Figure 6: MAP scores (y-axis) over context length (turn number on the left, token number on the right) in varying quartiles. For each subfigure, from left to right are the results in Q1 ([0, 0.25)), Q2 ([0.25, 0.5)), Q3 ([0.5, 0.75)), and Q4 ([0.75, 1]).

The Effects of Quotation. We further study our results in predicting quotations of varying frequency; the MAP scores are reported in Figure 7(a). In general, higher scores are observed for more frequent quotations, as better representations can be learned from more training data. We also notice a slower growth rate as the frequency increases. To go into more detail, we compare the growth rates with the ranking model CNN-LSTM and show the results on Weibo in Figure 7(b) (Reddit results show similar trends). In comparison, we are generally less sensitive to quotation frequency (except for very rare quotes). This likely benefits from exploiting quotations' internal structure, while ranking models can be largely affected by label sparsity.

Figure 7: Left subfigure: our MAP scores (y-axis) on different frequency ranges ([5, 100), [100, 200), [200, 300), [300, 400), [400, 500]) on the x-axis. Right: growth rate (y-axis) on Weibo, where the x-axis indicates the order of neighboring ranges.
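For reference, the quartile split used in the context-length analysis above can be implemented as below; the equal-sized binning via numpy is our assumption, as the paper does not spell out the exact procedure.

```python
import numpy as np

def split_into_quartiles(conversations, length_fn):
    """Rank conversations by a length measure (turn or token count) and split
    them into four equal-sized groups Q1..Q4 with increasing length."""
    lengths = np.array([length_fn(c) for c in conversations])
    order = np.argsort(lengths)
    quartiles = np.array_split(order, 4)
    return [[conversations[i] for i in idx] for idx in quartiles]

# Toy usage: conversations represented as lists of turns, split by turn number.
convs = [["t"] * n for n in [2, 3, 5, 8, 4, 6, 9, 7]]
for qi, group in enumerate(split_into_quartiles(convs, len), start=1):
    print(f"Q{qi}: turn counts {[len(c) for c in group]}")
```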
5.4 Further Discussions

Here we probe into our outputs to provide more insights into quoting in conversations.

Case Study. We first present a qualitative analysis of the example in Figure 1. To analyze what the model learns, we visualize our turn-based attention and TAKG's topic-aware attention over words in Figure 8. As can be seen, TAKG focuses more on the topic words "Scuf", "suggest", and "controller", all reflecting the global discussion focus while ignoring the query's intention. Thus, it mistakenly quotes "A penny saved is a penny earned.". Instead, we attend to the query's interaction with T1 and T4, which results in the correct quotation.

Figure 8: Attention weights of our model over turns (in blue) and of TAKG over words (in red).

Comparing with Human. Finally, we discuss how humans perform on our task. 50 Weibo conversations were sampled, and two human annotators (native Chinese speakers) were invited to quote a Chinese Chengyu in each given context. The two annotators gave 7 and 8 correct answers respectively, which shows the task is challenging for humans. Our model made 13 correct predictions, exhibiting a better ability to quote in online conversations.

6 Conclusion

We present a novel quotation generation framework for online conversations via the modeling of topic, interaction, and query consistency. Experiment results on two newly constructed online conversation datasets, Weibo and Reddit, show that our model outperforms the previous state-of-the-art models. Further discussions provide more insights on quoting in online conversations.

Acknowledgements

The research described in this paper is partially supported by CUHK RSFS #3133237, HK RGC GRF #14204118, and HK ITF #ITS/335/18. Jing Li is supported by the Hong Kong Polytechnic University internal funds (1-BE2W and 1-ZVRH) and NSFC Young Scientists Fund 62006203.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Michael Bendersky and David A. Smith. 2012. A dictionary of wisdom and wit: Learning to extract quotable phrases. In Proceedings of the Workshop on Computational Linguistics for Literature, CLfL@NAACL-HLT 2012, pages 69-77.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Kyle Booten and Marti A. Hearst. 2016. Patterns of wisdom: Discourse-level style in multi-sentence quotations. In Proceedings of NAACL-HLT 2016, pages 1139-1144.

Rich Caruana, Steve Lawrence, and C. Lee Giles. 2001. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402-408.

Kyunghyun Cho, Bart van Merriënboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014a. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. 2010. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 421-430. ACM.

Wenpeng Hu, Zhangming Chan, Bing Liu, Dongyan Zhao, Jinwen Ma, and Rui Yan. 2019. GSN: A graph-structured network for multi-party dialogues. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pages 5010-5016.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. CoRR, abs/1312.6114.

Hanbit Lee, Yeonchan Ahn, Haejun Lee, Seungdo Ha, and Sang-goo Lee. 2016. Quote recommendation in dialogue using deep neural network. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16, pages 957-960. ACM.

Jing Li, Wei Gao, Zhongyu Wei, Baolin Peng, and Kam-Fai Wong. 2015. Using content-level structures for summarizing microblog repost trees. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 2168-2178.

Manling Li, Lingyu Zhang, Heng Ji, and Richard J. Radke. 2019. Keep meeting summaries on topic: Abstractive multi-modal meeting summarization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Volume 1: Long Papers, pages 2190-2196.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81. Association for Computational Linguistics.

Yuanchao Liu, Bo Pang, and Bingquan Liu. 2019. Neural-based Chinese idiom recommendation for enhancing elegance in essay writing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Zhengyuan Liu and Nancy Chen. 2019. Reading turn by turn: Hierarchical attention architecture for spoken dialogue comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi. 2018. Challenging reading comprehension on daily conversation: Passage completion on multiparty dialog. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2410-2419. PMLR.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 807-814. Omnipress.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge University Press.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge dataset and models for dialogue-based reading comprehension. CoRR, abs/1902.00164.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2015. Learning to recommend quotes for writing. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2016. A neural network approach to quote recommendation in writings. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16.

Lanchun Wang and Shuo Wang. 2013. A study of idiom translation strategies between english and chinese. Theory and Practice in Language Studies, 3(9):1691.

Yue Wang, Jing Li, Hou Pong Chan, Irwin King, Michael R. Lyu, and Shuming Shi. 2019a. Topic-aware neural keyphrase generation for social media language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Yue Wang, Jing Li, Irwin King, Michael R. Lyu, and Shuming Shi. 2019b. Microblog hashtag generation via encoding conversation contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 1624-1633.

Jichuan Zeng, Jing Li, Yulan He, Cuiyun Gao, Michael R. Lyu, and Irwin King. 2019. What you say and how you say it: Joint modeling of topics and discourse in microblog conversations. Transactions of the Association for Computational Linguistics, 7:267-281.

Jichuan Zeng, Jing Li, Yan Song, Cuiyun Gao, Michael R. Lyu, and Irwin King. 2018a. Topic memory networks for short text classification. arXiv preprint arXiv:1809.03664.

Xingshan Zeng, Jing Li, Lu Wang, Nicholas Beauchamp, Sarah Shugars, and Kam-Fai Wong. 2018b. Microblog conversation recommendation via joint modeling of topics and discourse. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, Volume 1 (Long Papers), pages 375-385.

Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A large-scale Chinese idiom dataset for cloze test. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Volume 1: Long Papers, pages 778-787.