Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting Yen-Chun Chen and Mohit Bansal UNC Chapel Hill {yenchun, mbansal}@cs.unc.edu Abstract sequence models (Chopra et al., 2016; Nallap- ati et al., 2016; See et al., 2017; Paulus et al., Inspired by how humans summarize long 2018). Abstractive models can be more concise arXiv:1805.11080v1 [cs.CL] 28 May 2018 documents, we propose an accurate and by performing generation from scratch, but they fast summarization model that first selects suffer from slow and inaccurate encoding of very salient sentences and then rewrites them long documents, with the attention model being abstractively (i.e., compresses and para- required to look at all encoded words (in long phrases) to generate a concise overall sum- paragraphs) for decoding each generated summary mary. We use a novel sentence-level pol- word (slow, one by one sequentially). Abstrac- icy gradient method to bridge the non- tive models also suffer from redundancy (repeti- differentiable computation between these tions), especially when generating multi-sentence two neural networks in a hierarchical way, summary. while maintaining language fluency. Em- To address both these issues and combine pirically, we achieve the new state-of-the- the advantages of both paradigms, we pro- art on all metrics (including human eval- pose a hybrid extractive-abstractive architecture, uation) on the CNN/Daily Mail dataset, as with policy-based reinforcement learning (RL) to well as significantly higher abstractiveness bridge together the two networks. Similar to how scores. Moreover, by first operating at humans summarize long documents, our model the sentence-level and then the word-level, first uses an extractor agent to select salient sen- we enable parallel decoding of our neural tences or highlights, and then employs an abstrac- generative model that results in substan- tor network to rewrite (i.e., compress and para- tially faster (10-20x) inference speed as phrase) each of these extracted sentences. To over- well as 4x faster training convergence than come the non-differentiable behavior of our ex- previous long-paragraph encoder-decoder tractor and train on available document-summary models. We also demonstrate the general- pairs without saliency label, we next use actor- ization of our model on the test-only DUC- critic policy gradient with sentence-level metric 2002 dataset, where we achieve higher rewards to connect these two neural networks and scores than a state-of-the-art model. to learn sentence saliency. We also avoid com- mon language fluency issues (Paulus et al., 2018) 1 Introduction by preventing the policy gradients from affect- The task of document summarization has two ing the abstractive summarizer’s word-level train- main paradigms: extractive and abstractive. The ing, which is supported by our human evaluation former method directly chooses and outputs the study. Our sentence-level reinforcement learn- salient sentences (or phrases) in the original doc- ing takes into account the word-sentence hierar- ument (Jing and McKeown, 2000; Knight and chy, which better models the language structure Marcu, 2000; Martins and Smith, 2009; Berg- and makes parallelization possible. Our extractor Kirkpatrick et al., 2011). The latter abstractive combines reinforcement learning and pointer net- approach involves rewriting the summary (Banko works, which is inspired by Bello et al. (2017)’s et al., 2000; Zajic et al., 2004), and has seen sub- attempt to solve the Traveling Salesman Problem. stantial recent gains due to neural sequence-to- Our abstractor is a simple encoder-aligner-decoder
model (with copying) and is trained on pseudo sentence-pairs between the document and ground document-summary sentence pairs obtained via truth summary. Next, our model achieves the simple automatic matching criteria. new state-of-the-art on all metrics of multiple ver- Thus, our method incorporates the abstractive sions of a popular summarization dataset (as well paradigm’s advantages of concisely rewriting sen- as a test-only dataset) both extractively and ab- tences and generating novel words from the full stractively, without loss in language fluency (also vocabulary, yet it adopts intermediate extractive demonstrated via human evaluation and abstrac- behavior to improve the overall model’s quality, tiveness scores). Finally, our parallel decoding re- speed, and stability. Instead of encoding and at- sults in a significant 10-20x speed-up over the pre- tending to every word in the long input document vious best neural abstractive summarization sys- sequentially, our model adopts a human-inspired tem with even better accuracy.1 coarse-to-fine approach that first extracts all the salient sentences and then decodes (rewrites) them 2 Model (in parallel). This also avoids almost all redun- In this work, we consider the task of summa- dancy issues because the model has already cho- rizing a given long text document into several sen non-redundant salient sentences to abstrac- (ordered) highlights, which are then combined tively summarize (but adding an optional final to form a multi-sentence summary. Formally, reranker component does give additional gains by given a training set of document-summary pairs removing the fewer across-sentence repetitions). {xi , yi }N i=1 , our goal is to approximate the func- Empirically, our approach is the new state-of- tion h : X → Y, X = {xi }N N i=1 , Y = {yi }i=1 the-art on all ROUGE metrics (Lin, 2004) as well such that h(xi ) = yi , 1 ≤ i ≤ N . Further- as on METEOR (Denkowski and Lavie, 2014) more, we assume there exists an abstracting func- of the CNN/Daily Mail dataset, achieving sta- tion g defined as: ∀s ∈ Si , ∃d ∈ Di such that tistically significant improvements over previous g(d) = s, 1 ≤ i ≤ N , where Si is the set of sum- models that use complex long-encoder, copy, and mary sentences in xi and Di the set of document coverage mechanisms (See et al., 2017). The sentences in yi . i.e., in any given pair of docu- test-only DUC-2002 improvement also shows our ment and summary, every summary sentence can model’s better generalization than this strong ab- be produced from some document sentence. For stractive system. In addition, we surpass the pop- simplicity, we omit subscript i in the remainder ular lead-3 baseline on all ROUGE scores with an of the paper. Under this assumption, we can fur- abstractive model. Moreover, our sentence-level ther define another latent function f : X → Dn abstractive rewriting module also produces sub- that satisfies f (x) = {dt }nj=1 and y = h(x) = stantially more (3x) novel N -grams that are not [g(d1 ), g(d2 ), . . . , g(dn )], where [, ] denotes sen- seen in the input document, as compared to the tence concatenation. This latent function f can be strong flat-structured model of See et al. (2017). seen as an extractor that chooses the salient (or- This empirically justifies that our RL-guided ex- dered) sentences in a given document for the ab- tractor has learned sentence saliency, rather than stracting function g to rewrite. Our overall model benefiting from simply copying longer sentences. consists of these two submodules, the extractor We also show that our model maintains the same agent and the abstractor network, to approximate level of fluency as a conventional RNN-based the above-mentioned f and g, respectively. model because the reward does not leak to our ab- stractor’s word-level training. Finally, our model’s 2.1 Extractor Agent training is 4x and inference is more than 20x faster The extractor agent is designed to model f , which than the previous state-of-the-art. The optional can be thought of as extracting salient sentences final reranker gives further improvements while from the document. We exploit a hierarchical neu- maintaining a 7x speedup. ral model to learn the sentence representations of Overall, our contribution is three fold: First the document and a ‘selection network’ to extract we propose a novel sentence-level RL technique sentences based on their representations. for the well-known task of abstractive summariza- 1 We are releasing our code, best pretrained models, tion, effectively utilizing the word-then-sentence as well as output summaries, to promote future research: hierarchical structure without annotated matching https://github.com/ChenRocks/fast_abs_rl
Extraction Probabilities (Policy) r1 Convolutional Sentence Encoder r1 r2 bi-LSTM bi-LSTM bi-LSTM bi-LSTM LSTM Embedded Word Vectors LSTM LSTM r1 r2 r3 r1 r2 r3 r4 r1 r2 r3 r4 h0 CONV r1 r2 r3 r4 h0 h1 r1 r2 r3 r4 h0 h1 h4 r2 r3 r4 h0 h1 h4 r3 r4 h0 h1 h4 Context-aware Sent. Reps. Encoded Sentence Representations r4 h0 h1 h4 (from previous extraction step) h0 Figure 1: Our extractor agent: the convolutional h1 encoder h4 computes representation rj for each sentence. The RNN encoder (blue) computes context-awareh1 h4 representation hj and then the RNN decoder (green) h4 selects sentence jt at time step t. With jt selected, hjt will be fed into the decoder at time t + 1. 2.1.1 Hierarchical Sentence Representation RL Agent We use a temporal convolutional model (Kim, Observation Extractor Action (extract sent.) 2014) to compute rj , the representation of each in- d1 d2 d3 d4 d5 d1jt dg(d 2 jdt) 3 sdt4 d5 djt g(djt ) st dividual sentence in the documents (details in sup- Policy Gradient plementary). To further incorporate global context d1 d2 d3 d4 d5 djt g(d jt ) s t Update of the document and capture the long-range se- d1 d2 d3 d4 d5 djt g(djt ) st Reward Abstractor mantic dependency between sentences, a bidirec- d1 d2 d3 d4 d5 djt g(djt ) st Document Sentences tional LSTM-RNN (Hochreiter and Schmidhuber, d1 d2 d3 d4 d5 dd1jt dg(d 2 jdt 3) d st4 d5 djt g(djt ) st Summary Sentence Generated Sentence 1997; Schuster et al., 1997) is applied on the con- (ground truth) volutional output. This enables learning a strong Figure 2: Reinforced training of the extractor (for representation, denoted as hj for the j-th sentence one extraction step) and its interaction with the ab- in the document, that takes into account the con- stractor. For simplicity, the critic network is not text of all previous and future sentences in the shown. Note that all d’s and st are raw sentences, same document. not vector representations. 2.1.2 Sentence Selection Next, to select the extracted sentences based on the In Eqn. 3, zt is the output of the added LSTM- above sentence representations, we add another RNN (shown in green in Fig. 1) which is referred LSTM-RNN to train a Pointer Network (Vinyals to as the decoder. All the W ’s and v’s are trainable et al., 2015), to extract sentences recurrently. We parameters. At each time step t, the decoder per- calculate the extraction probability by: forms a 2-hop attention mechanism: It first attends to hj ’s to get a context vector et and then attends > vp tanh(Wp1 hj + Wp2 et ) if jt 6= jk to hj ’s again for the extraction probabilities.2 This utj = ∀k < t model is essentially classifying all sentences of the −∞ otherwise document at each extraction step. An illustration (1) of the whole extractor is shown in Fig. 1. P (jt |j1 , . . . , jt−1 ) = softmax(ut ) (2) 2.2 Abstractor Network where et ’s are the output of the glimpse opera- tion (Vinyals et al., 2016): The abstractor network approximates g, which compresses and paraphrases an extracted docu- atj = vg> tanh(Wg1 hj + Wg2 zt ) (3) ment sentence to a concise summary sentence. We t t α = softmax(a ) (4) 2 Note that we force-zero the extraction prob. of already X et = αjt Wg1 hj (5) extracted sentences so as to prevent the model from using re- peating document sentences and suffering from redundancy. j This is non-differentiable and hence only done in RL training.
use the standard encoder-aligner-decoder (Bah- Abstractor Training: For the abstractor training, danau et al., 2015; Luong et al., 2015). We add the we create training pairs by taking each summary copy mechanism3 to help directly copy some out- sentence and pairing it with its extracted docu- of-vocabulary (OOV) words (See et al., 2017). For ment sentence (based on Eqn. 6). The network more details, please refer to the supplementary. is trained as an usual sequence-to-sequence model to minimize the cross-entropy loss L(θabs ) = 3 Learning 1 PM −M m=1 logP θabs (wm |w1:m−1 ) of the decoder language model at each generation step, where Given that our extractor performs a non- θabs is the set of trainable parameters of the ab- differentiable hard extraction, we apply stan- stractor and wm the mth generated word. dard policy gradient methods to bridge the back- propagation and form an end-to-end trainable 3.2 Reinforce-Guided Extraction (stochastic) computation graph. However, sim- Here we explain how policy gradient techniques ply starting from a randomly initialized network are applied to optimize the whole model. To to train the whole model in an end-to-end fash- make the extractor an RL agent, we can formu- ion is infeasible. When randomly initialized, the late a Markov Decision Process (MDP)5 : at each extractor would often select sentences that are extraction step t, the agent observes the current not relevant, so it would be difficult for the ab- state ct = (D, djt−1 ), samples an action jt ∼ stractor to learn to abstractively rewrite. On the πθa ,ω (ct , j) = P (j) from Eqn. 2 to extract a doc- other hand, without a well-trained abstractor the ument sentence and receive a reward6 extractor would get noisy reward, which leads to a bad estimate of the policy gradient and a r(t + 1) = ROUGE-LF1 (g(djt ), st ) (7) sub-optimal policy. We hence propose optimiz- after the abstractor summarizes the extracted sen- ing each sub-module separately using maximum- tence djt . We denote the trainable parameters of likelihood (ML) objectives: train the extractor to the extractor agent by θ = {θa , ω} for the decoder select salient sentences (fit f ) and the abstractor to and hierarchical encoder respectively. We can then generate shortened summary (fit g). Finally, RL is train the extractor with policy-based RL. We illus- applied to train the full model end-to-end (fit h). trate this process in Fig. 2. The vanilla policy gradient algorithm, REIN- 3.1 Maximum-Likelihood Training for FORCE (Williams, 1992), is known for high vari- Submodules ance. To mitigate this problem, we add a critic Extractor Training: In Sec. 2.1.2, we have network with trainable parameters θc to predict formulated our sentence selection as classifica- the state-value function V πθa ,ω (c). The predicted tion. However, most of the summarization datasets value of critic bθc ,ω (c) is called the ‘baseline’, are end-to-end document-summary pairs with- which is then used to estimate the advantage func- out extraction (saliency) labels for each sentence. tion: Aπθ (c, j) = Qπθa ,ω (c, j) − V πθa ,ω (c) be- Hence, we propose a simple similarity method to cause the total return Rt is an estimate of action- provide a ‘proxy’ target label for the extractor. value function Q(ct , jt ). Instead of maximizing Similar to the extractive model of Nallapati et al. Q(ct , jt ) as done in REINFORCE, we maximize (2017), for each ground-truth summary sentence, Aπθ (c, j) with the following policy gradient: we find the most similar document sentence djt ∇θa ,ω J(θa , ω) = by:4 (8) E[∇θa ,ω logπθ (c, j)Aπθ (c, j)] jt = argmaxi (ROUGE-Lrecall (di , st )) (6) And the critic is trained to minimize the square Given these proxy training labels, the extractor is loss: Lc (θc , ω) = (bθc ,ω (ct ) − Rt )2 . This is then trained to minimize the cross-entropy loss. 5 Strictly speaking, this is a Partially Observable Markov 3 We use the terminology of copy mechanism (originally Decision Process (POMDP). We approximate it as an MDP named pointer-generator) in order to avoid confusion with by assuming that the RNN hidden state contains all past info. the pointer network (Vinyals et al., 2015). 6 In Eqn. 6, we use ROUGE-recall because we want the 4 Nallapati et al. (2017) selected sentences greedily to extracted sentence to contain as much information as possible maximize the global summary-level ROUGE, whereas we for rewriting. Nevertheless, for Eqn. 7, ROUGE-F1 is more match exactly 1 document sentence for each GT summary suitable because the abstractor g is supposed to rewrite the sentence based on the individual sentence-level score. extracted sentence d to be as concise as the ground truth s.
known as the Advantage Actor-Critic (A2C), a namic decisions of number-of-sentences based on synchronous variant of A3C (Mnih et al., 2016). the input document, eliminates the need for tuning For more A2C details, please refer to the supp. a fixed number of steps, and enables a data-driven Intuitively, our RL training works as follow: If adaptation for any specific dataset/application. the extractor chooses a good sentence, after the ab- stractor rewrites it the ROUGE match would be 3.3 Repetition-Avoiding Reranking high and thus the action is encouraged. If a bad Existing abstractive summarization systems on sentence is chosen, though the abstractor still pro- long documents suffer from generating repeating duces a compressed version of it, the summary and redundant words and phrases. To mitigate would not match the ground truth and the low this issue, See et al. (2017) propose the coverage ROUGE score discourages this action. Our RL mechanism and Paulus et al. (2018) incorporate with a sentence-level agent is a novel attempt in tri-gram avoidance during beam-search at test- neural summarization. We use RL as a saliency time. Our model without these already performs guide without altering the abstractor’s language well because the summary sentences are gener- model, while previous work applied RL on the ated from mutually exclusive document sentences, word-level, which could be prone to gaming the which naturally avoids redundancy. However, we metric at the cost of language fluency.7 do get a small further boost to the summary quality by removing a few ‘across-sentence’ repetitions, Learning how many sentences to extract: In a via a simple reranking strategy: At sentence-level, typical RL setting like game playing, an episode we apply the same beam-search tri-gram avoid- is usually terminated by the environment. On the ance (Paulus et al., 2018). We keep all k sentence other hand, in text summarization, the agent does candidates generated by beam search, where k is not know in advance how many summary sentence the size of the beam. Next, we then rerank all to produce for a given article (since the desired k n combinations of the n generated summary sen- length varies for different downstream applica- tence beams. The summaries are reranked by the tions). We make an important yet simple, intuitive number of repeated N -grams, the smaller the bet- adaptation to solve this: by adding a ‘stop’ ac- ter. We also apply the diverse decoding algorithm tion to the policy action space. In the RL training described in Li et al. (2016) (which has almost no phase, we add another set of trainable parameters computation overhead) so as to get the above ap- vEOE (EOE stands for ‘End-Of-Extraction’) with proach to produce useful diverse reranking lists. the same dimension as the sentence representation. We show how much the redundancy affects the The pointer-network decoder treats vEOE as one summarization task in Sec. 6.2. of the extraction candidates and hence naturally results in a stop action in the stochastic policy. 4 Related Work We set the reward for the agent performing EOE to ROUGE-1F1 ([{g(djt )}t ], [{st }t ]); whereas for Early summarization works mostly focused on ex- any extraneous, unwanted extraction step, the tractive and compression based methods (Jing and agent receives zero reward. The model is there- McKeown, 2000; Knight and Marcu, 2000; Clarke fore encouraged to extract when there are still re- and Lapata, 2010; Berg-Kirkpatrick et al., 2011; maining ground-truth summary sentences (to ac- Filippova et al., 2015). Recent large-sized corpora cumulate intermediate reward), and learn to stop attracted neural methods for abstractive summa- by optimizing a global ROUGE and avoiding extra rization (Rush et al., 2015; Chopra et al., 2016). extraction.8 Overall, this modification allows dy- Some of the recent success in neural abstractive 7 models include hierarchical attention (Nallapati During this RL training of the extractor, we keep the ab- stractor parameters fixed. Because the input sentences for the et al., 2016), coverage (Suzuki and Nagata, 2016; abstractor are extracted by an intermediate stochastic policy Chen et al., 2016; See et al., 2017), RL based met- of the extractor, it is impossible to find the correct target sum- ric optimization (Paulus et al., 2018), graph-based mary for the abstractor to fit g with ML objective. Though it is possible to optimize the abstractor with RL, in out prelim- attention (Tan et al., 2017), and the copy mecha- inary experiments we found that this does not improve the nism (Miao and Blunsom, 2016; Gu et al., 2016; overall ROUGE, most likely because this RL optimizes at a See et al., 2017). sentence-level and can add across-sentence redundancy. We achieve SotA results without this abstractor-level RL. important information been generated); while ROUGE-L is 8 We use ROUGE-1 for terminal reward because it is a used as intermediate rewards since it is known for better mea- better measure of bag-of-words information (i.e., has all the surement of language fluency within a local sentence.
Our model shares some high-level intuition with to enhance the copying abstractive summarizer. extract-then-compress methods. Earlier attempts Finally, there are some loosely-related recent in this paradigm used Hidden Markov Models and works: Zhou et al. (2017) proposed selective gate rule-based systems (Jing and McKeown, 2000), to improve the attention in abstractive summa- statistical models based on parse trees (Knight rization. Tan et al. (2018) used an extract-then- and Marcu, 2000), and integer linear programming synthesis approach on QA, where an extraction based methods (Martins and Smith, 2009; Gillick model predicts the important spans in the passage and Favre, 2009; Clarke and Lapata, 2010; Berg- and then another synthesis model generates the fi- Kirkpatrick et al., 2011). Recent approaches in- nal answer. Swayamdipta et al. (2017) attempted vestigated discourse structures (Louis et al., 2010; cascaded non-recurrent small networks on extrac- Hirao et al., 2013; Kikuchi et al., 2014; Wang tive QA, resulting a scalable, parallelizable model. et al., 2015), graph cuts (Qian and Liu, 2013), Fan et al. (2017) added controlling parameters to and parse trees (Li et al., 2014; Bing et al., 2015). adapt the summary to length, style, and entity pref- For neural models, Cheng and Lapata (2016) used erences. However, none of these used RL to bridge a second neural net to select words from an ex- the non-differentiability of neural models. tractor’s output. Our abstractor does not merely ‘compress’ the sentences but generatively produce 5 Experimental Setup novel words. Moreover, our RL bridges the ex- Please refer to the supplementary for full training tractor and the abstractor for end-to-end training. details (all hyperparameter tuning was performed Reinforcement learning has been used to op- on the validation set). We use the CNN/Daily Mail timize the non-differential metrics of language dataset (Hermann et al., 2015) modified for sum- generation and to mitigate exposure bias (Ran- marization (Nallapati et al., 2016). Because there zato et al., 2016; Bahdanau et al., 2017). Henß are two versions of the dataset, original text and et al. (2015) use Q-learning based RL for extrac- entity anonymized, we show results on both ver- tive summarization. Paulus et al. (2018) use RL sions of the dataset for a fair comparison to prior policy gradient methods for abstractive summa- works. The experiment runs training and evalu- rization, utilizing sequence-level metric rewards ation for each version separately. Despite the fact with curriculum learning (Ranzato et al., 2016) that the 2 versions have been considered separately or weighted ML+RL mixed loss (Paulus et al., by the summarization community as 2 different 2018) for stability and language fluency. We use datasets, we use same hyper-parameter values for sentence-level rewards to optimize the extractor both dataset versions to show the generalization of while keeping our ML trained abstractor decoder our model. We also show improvements on the fixed, so as to achieve the best of both worlds. DUC-2002 dataset in a test-only setup. Training a neural network to use another fixed 5.1 Evaluation Metrics network has been investigated in machine trans- lation for better decoding (Gu et al., 2017a) and For all the datasets, we evaluate standard ROUGE- real-time translation (Gu et al., 2017b). They used 1, ROUGE-2, and ROUGE-L (Lin, 2004) on full- a fixed pretrained translator and applied policy length F1 (with stemming) following previous gradient techniques to train another task-specific works (Nallapati et al., 2017; See et al., 2017; network. In question answering (QA), Choi et al. Paulus et al., 2018). Following See et al. (2017), (2017) extract one sentence and then generate the we also evaluate on METEOR (Denkowski and answer from the sentence’s vector representation Lavie, 2014) for a more thorough analysis. with RL bridging. Another recent work attempted a new coarse-to-fine attention approach on sum- 5.2 Modular Extractive vs. Abstractive marization (Ling and Rush, 2017) and found de- Our hybrid approach is capable of both extrac- sired sharp focus properties for scaling to larger in- tive and abstractive (i.e., rewriting every sentence) puts (though without metric improvements). Very summarization. The extractor alone performs ex- recently (concurrently), Narayan et al. (2018) use tractive summarization. To investigate the effect of RL for ranking sentences in pure extraction-based the recurrent extractor (rnn-ext), we implement a summarization and Çelikyilmaz et al. (2018) in- feed-forward extractive baseline ff-ext (details in vestigate multiple communicating encoder agents supplementary). It is also possible to apply RL
Models ROUGE-1 ROUGE-2 ROUGE-L METEOR Extractive Results lead-3 (See et al., 2017) 40.34 17.70 36.57 22.21 Narayan et al. (2018) (sentence ranking RL) 40.0 18.2 36.6 - ff-ext 40.63 18.35 36.82 22.91 rnn-ext 40.17 18.11 36.41 22.81 rnn-ext + RL 41.47 18.72 37.76 22.35 Abstractive Results See et al. (2017) (w/o coverage) 36.44 15.66 33.42 16.65 See et al. (2017) 39.53 17.28 36.38 18.72 Fan et al. (2017) (controlled) 39.75 17.29 36.54 - ff-ext + abs 39.30 17.02 36.93 20.05 rnn-ext + abs 38.38 16.12 36.04 19.39 rnn-ext + abs + RL 40.04 17.61 37.59 21.00 rnn-ext + abs + RL + rerank 40.88 17.80 38.54 20.38 Table 1: Results on the original, non-anonymized CNN/Daily Mail dataset. Adding RL gives statisti- cally significant improvements for all metrics over non-RL rnn-ext models (and over the state-of-the-art See et al. (2017)) in both extractive and abstractive settings with p < 0.01. Adding the extra reranking stage yields statistically significant better results in terms of all ROUGE metrics with p < 0.01. to extractor without using the abstractor (rnn-ext Models R-1 R-2 R-L Extractive Results + RL).9 Benefiting from the high modularity of lead-3 (Nallapati et al., 2017) 39.2 15.7 35.5 our model, we can make our summarization sys- Nallapati et al. (2017) 39.6 16.2 35.3 tem abstractive by simply applying the abstractor ff-ext 39.51 16.85 35.80 rnn-ext 38.97 16.65 35.32 on the extracted sentences. Our abstractor rewrites rnn-ext + RL 40.13 16.58 36.47 each sentence and generates novel words from a Abstractive Results large vocabulary, and hence every word in our Nallapati et al. (2016) 35.46 13.30 32.65 Fan et al. (2017) (controlled) 38.68 15.40 35.47 overall summary is generated from scratch; mak- Paulus et al. (2018) (ML) 38.30 14.81 35.49 ing our full model categorized into the abstractive Paulus et al. (2018) (RL+ML) 39.87 15.82 36.90 paradigm.10 We run experiments on separately ff-ext + abs 38.73 15.70 36.33 rnn-ext + abs 37.58 14.68 35.24 trained extractor/abstractor (ff-ext + abs, rnn-ext + rnn-ext + abs + RL 38.80 15.66 36.37 abs) and the reinforced full model (rnn-ext + abs + rnn-ext + abs + RL + rerank 39.66 15.85 37.34 RL) as well as the final reranking version (rnn-ext Table 2: ROUGE for anonymized CNN/DM. + abs + RL + rerank). (2017) and a strong lead-3 baseline. For producing 6 Results our summary, we simply concatenate the extracted For easier comparison, we show separate tables sentences from the extractors. From Table 1 and for the original-text vs. anonymized versions – Table 2, we can see that our feed-forward extrac- Table 1 and Table 2, respectively. Overall, our tor out-performs the lead-3 baseline, empirically model achieves strong improvements and the new showing that our hierarchical sentence encoding state-of-the-art on both extractive and abstractive model is capable of extracting salient sentences.11 settings for both versions of the CNN/DM dataset The reinforced extractor performs the best, be- (with some comparable results on the anonymized cause of the ability to get the summary-level re- version). Moreover, Table 3 shows the gener- ward and the reduced train-test mismatch of feed- alization of our abstractive system to an out-of- ing the previous extraction decision. The improve- domain test-only setup (DUC-2002), where our ment over lead-3 is consistent across both tables. model achieves better scores than See et al. (2017). In Table 2, it outperforms the previous best neural extractive model (Nallapati et al., 2017). In Ta- 6.1 Extractive Summarization ble 1, our model also outperforms a recent, con- In the extractive paradigm, we compare our model 11 The ff-ext model outperforms rnn-ext possibly because with the extractive model from Nallapati et al. it does not predict sentence ordering; thus is easier to opti- mize and the n-gram based metrics do not consider sentence 9 In this case the abstractor function g(d) = d. ordering. Also note that in our MDP formulation, we cannot 10 Note that the abstractive CNN/DM dataset does not in- apply RL on ff-ext due to its historyless nature. Even if ap- clude any human-annotated extraction label, and hence our plied naively, there is no mean for the feed-forward model to models do not receive any direct extractive supervision. learn the EOE described in Sec. 3.2.
Models R-1 R-2 R-L Relevance Readability Total See et al. (2017) 37.22 15.78 33.90 See et al. (2017) 120 128 248 rnn-ext + abs + RL 39.46 17.34 36.72 rnn-ext + abs + RL + rerank 137 133 270 Table 3: Generalization to DUC-2002 (F1). Equally good/bad 43 39 82 Table 4: Human Evaluation: pairwise comparison current sentence-ranking RL model by Narayan between our final model and See et al. (2017). et al. (2018), showing that our pointer-network ex- tractor and reward formulations are very effective model (rnn-ext + abs + RL + rerank) is clearly su- when combined with A2C RL. perior than the one of See et al. (2017). We are comparable on R-1 and R-2 but a 0.4 point im- 6.2 Abstractive Summarization provement on R-L w.r.t. Paulus et al. (2018).14 After applying the abstractor, the ff-ext based We also outperform the results of Fan et al. (2017) model still out-performs the rnn-ext model. Both on both original and anonymized dataset versions. combined models exceed the pointer-generator Several previous works have pointed out that ex- model (See et al., 2017) without coverage by a tractive baselines are very difficult to beat (in large margin for all metrics, showing the effec- terms of ROUGE) by an abstractive system (See tiveness of our 2-step hierarchical approach: our et al., 2017; Nallapati et al., 2017). Note that our method naturally avoids repetition by extracting best model is one of the first abstractive models multiple sentences with different keypoints.12 to outperform the lead-3 baseline on the original- Moreover, after applying reinforcement learn- text CNN/DM dataset. Our extractive experiment ing, our model performs better than the best model serves as a complementary analysis of the effect of of See et al. (2017) and the best ML trained model RL with extractive systems. of Paulus et al. (2018). Our reinforced model out- performs the ML trained rnn-ext + abs baseline 6.3 Human Evaluation with statistical significance of p < 0.01 on all met- We also conduct human evaluation to ensure ro- rics for both version of the dataset, indicating the bustness of our training procedure. We measure effectiveness of the RL training. Also, rnn-ext + relevance and readability of the summaries. Rel- abs + RL is statistically significant better than See evance is based on the summary containing im- et al. (2017) for all metrics with p < 0.01.13 In portant, salient information from the input article, the supplementary, we show the learning curve of being correct by avoiding contradictory/unrelated our RL training, where the average reward goes information, and avoiding repeated/redundant in- up quickly after the extractor learns the End-of- formation. Readability is based on the summa- Extract action and then stabilizes. For all the rys fluency, grammaticality, and coherence. To above models, we use standard greedy decoding evaluate both these criteria, we design the follow- and find that it performs well. ing Amazon MTurk experiment: we randomly se- lect 100 samples from the CNN/DM test set and Reranking and Redundancy Although the ask the human testers (3 for each sample) to rank extract-then-abstract approach inherently will not between summaries (for relevance and readabil- generate repeating sentences like other neural- ity) produced by our model and that of See et al. decoders do, there might still be across-sentence (2017) (the models were anonymized and ran- redundancy because the abstractor is not aware domly shuffled), i.e. A is better, B is better, both of other extracted sentences when decoding one. are equally good/bad. Following previous work, Hence, we incorporate an optional reranking strat- the input article and ground truth summaries are egy described in Sec. 3.3. The improved ROUGE also shown to the human participants in addition scores indicate that this successfully removes to the two model summaries.15 From the results some remaining redundancies and hence produces shown in Table 4, we can see that our model is more concise summaries. Our best abstractive better in both relevance and readability w.r.t. See 12 A trivial lead-3 + abs baseline obtains ROUGE of et al. (2017). (37.37, 15.59, 34.82), which again confirms the importance 14 of our reinforce-based sentence selection. We do not list the scores of their pure RL model because 13 they discussed its bad readability. We calculate statistical significance based on the boot- 15 strap test (Noreen, 1989; Efron and Tibshirani, 1994) with We selected human annotators that were located in the 100K samples. Output of Paulus et al. (2018) is not available US, had an approval rate greater than 95%, and had at least so we couldn’t test for statistical significance there. 10,000 approved HITs on record.
Speed Novel N -gram (%) Models total time (hr) words / sec Models 1-gm 2-gm 3-gm 4-gm (See et al., 2017) 12.9 14.8 See et al. (2017) 0.1 2.2 6.0 9.7 rnn-ext + abs + RL 0.68 361.3 rnn-ext + abs + RL + rerank 0.3 10.0 21.7 31.6 rnn-ext + abs + RL + rerank 2.00 (1.46 +0.54) 109.8 reference summaries 10.8 47.5 68.2 78.2 Table 5: Speed comparison with See et al. (2017). Table 6: Abstractiveness: novel n-gram counts. 6.4 Speed Comparison 7 Analysis Our two-stage extractive-abstractive hybrid model 7.1 Abstractiveness is not only the SotA on summary quality met- rics, but more importantly also gives a significant We compute an abstractiveness score (See et al., speed-up in both train and test time over a strong 2017) as the ratio of novel n-grams in the gen- neural abstractive system (See et al., 2017).16 erated summary that are not present in the in- Our full model is composed of a extremely fast put document. The results are shown in Table 6: extractor and a parallelizable abstractor, where the our model rewrites substantially more abstractive computation bottleneck is on the abstractor, which summaries than previous work. A potential rea- has to generate summaries with a large vocabulary son for this is that when trained with individual from scratch.17 The main advantage of our ab- sentence-pairs, the abstractor learns to drop more stractor at decoding time is that we can first com- document words so as to write individual sum- pute all the extracted sentences for the document, mary sentences as concise as human-written ones; and then abstract every sentence concurrently (in thus the improvement in multi-gram novelty. parallel) to generate the overall summary. In Ta- 7.2 Qualitative Analysis on Output Examples ble 5, we show the substantial test-time speed-up of our model compared to See et al. (2017).18 We We show examples of how our best model selects calculate the total decoding time for producing all sentences and then rewrites them. In the supple- summaries for the test set.19 Due to the fact that mentary Fig. 4 and Fig. 5, we can see how the ab- the main test-time speed bottleneck of RNN lan- stractor rewrites the extracted sentences concisely guage generation model is that the model is con- while keeping the mentioned facts. Adding the strained to generate one word at a time, the total reranker makes the output more compact globally. decoding time is dependent on the number of to- We observe that when rewriting longer text, the tal words generated; we hence also report the de- abstractor would have many facts to choose from coded words per second for a fair comparison. Our (Fig. 5 sentence 2) and this is where the reranker model without reranking is extremely fast. From helps avoid redundancy across sentences. Table 5 we can see that we achieve a speed up of 18x in time and 24x in word generation rate. Even 8 Conclusion after adding the (optional) reranker, we still main- We propose a novel sentence-level RL model tain a 6-7x speed-up (and hence a user can choose for abstractive summarization, which makes the to use the reranking component depending on their model aware of the word-sentence hierarchy. Our downstream application’s speed requirements).20 model achieves the new state-of-the-art on both 16 CNN/DM versions as well a better generalization The only publicly available code with a pretrained model for neural summarization which we can test the speed. on test-only DUC-2002, along with a significant 17 The time needed for extractor is negligible w.r.t. the ab- speed-up in training and decoding. stractor because it does not require large matrix multiplica- tion for generating every word. Moreover, with convolutional encoder at word-level made parallelizable by the hierarchical Acknowledgments rnn-ext, our model is scalable for very long documents. 18 For details of training speed-up, please see the supp. We thank the anonymous reviewers for their help- 19 We time the model of See et al. (2017) using beam size of ful comments. This work was supported by a 4 (used for their best-reported scores). Without beam-search, Google Faculty Research Award, a Bloomberg it gets significantly worse ROUGE of (36.62, 15.12, 34.08), so we do not compare speed-ups w.r.t. that version. Data Science Research Grant, an IBM Faculty 20 Most of the recent neural abstractive summarization sys- Award, and NVidia GPU awards. tems are of similar algorithmic complexity to that of See et al. (2017). The main differences such as the training objective attentional-decoder’s sequential generation; and this is the (ML vs. RL) and copying (soft/hard) has negligible test run- component that we substantially speed up via our parallel time compared to the slowest component: the long-summary sentence decoding with sentence-selection RL.
References James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computa- Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, tional Linguistics, 36(3):411–441. Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. An actor-critic Michael Denkowski and Alon Lavie. 2014. Meteor algorithm for sequence prediction. In ICLR. universal: Language specific translation evaluation for any target language. In Proceedings of the EACL Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- 2014 Workshop on Statistical Machine Translation. gio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR. Bradley Efron and Robert J Tibshirani. 1994. An intro- Michele Banko, Vibhu O. Mittal, and Michael J. Wit- duction to the bootstrap. CRC press. brock. 2000. Headline generation based on statis- Angela Fan, David Grangier, and Michael Auli. 2017. tical translation. In Proceedings of the 38th An- Controllable abstractive summarization. arXiv nual Meeting on Association for Computational Lin- preprint, abs/1711.05217. guistics, ACL ’00, pages 318–325, Stroudsburg, PA, USA. Association for Computational Linguistics. Katja Filippova, Enrique Alfonseca, Carlos Col- menares, Lukasz Kaiser, and Oriol Vinyals. 2015. Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Sentence compression by deletion with lstms. In Norouzi, and Samy Bengio. 2017. Neural combi- Proceedings of the 2015 Conference on Empir- natorial optimization with reinforcement learning. ical Methods in Natural Language Processing arXiv preprint 1611.09940. (EMNLP’15). Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. Dan Gillick and Benoit Favre. 2009. A scalable global 2011. Jointly learning to extract and compress. In model for summarization. In Proceedings of the Proceedings of the 49th Annual Meeting of the Asso- Workshop on Integer Linear Programming for Nat- ciation for Computational Linguistics: Human Lan- ural Langauge Processing, ILP ’09, pages 10–18, guage Technologies - Volume 1, HLT ’11, pages Stroudsburg, PA, USA. Association for Computa- 481–490, Stroudsburg, PA, USA. Association for tional Linguistics. Computational Linguistics. Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Jiatao Gu, Kyunghyun Cho, and Victor O. K. Li. 2017a. Guo, and Rebecca J. Passonneau. 2015. Abstractive Trainable greedy decoding for neural machine trans- multi-document summarization via phrase selection lation. In EMNLP. and merging. In ACL. Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Asli Çelikyilmaz, Antoine Bosselut, Xiaodong He, and Li. 2016. Incorporating copying mechanism in Yejin Choi. 2018. Deep communicating agents for sequence-to-sequence learning. In Proceedings of abstractive summarization. NAACL-HLT. the 54th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and pages 1631–1640, Berlin, Germany. Association for Hui Jiang. 2016. Distraction-based neural networks Computational Linguistics. for modeling documents. In IJCAI. Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Vic- Jianpeng Cheng and Mirella Lapata. 2016. Neural tor O. K. Li. 2017b. Learning to translate in real- summarization by extracting sentences and words. time with neural machine translation. In EACL. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume Sebastian Henß, Margot Mieskes, and Iryna Gurevych. 1: Long Papers), pages 484–494, Berlin, Germany. 2015. A reinforcement learning approach for adap- Association for Computational Linguistics. tive single- and multi-document summarization. In International Conference of the German Society for Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Computational Linguistics and Language Technol- Polosukhin, Alexandre Lacoste, and Jonathan Be- ogy (GSCL-2015), pages 3–12. rant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Karl Moritz Hermann, Tomáš Kočiský, Edward Meeting of the Association for Computational Lin- Grefenstette, Lasse Espeholt, Will Kay, Mustafa Su- guistics (Volume 1: Long Papers), pages 209–220. leyman, and Phil Blunsom. 2015. Teaching ma- Association for Computational Linguistics. chines to read and comprehend. In Advances in Neu- ral Information Processing Systems (NIPS). Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with at- Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, tentive recurrent neural networks. In Proceedings of Norihito Yasuda, and Masaaki Nagata. 2013. the 2016 Conference of the North American Chap- Single-document summarization as a tree knapsack ter of the Association for Computational Linguistics: problem. In Proceedings of the 2013 Conference on Human Language Technologies, pages 93–98, San Empirical Methods in Natural Language Process- Diego, California. Association for Computational ing, pages 1515–1520, Seattle, Washington, USA. Linguistics. Association for Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Annie Louis, Aravind Joshi, and Ani Nenkova. 2010. short-term memory. Neural Comput., 9(9):1735– Discourse indicators for content selection in summa- 1780. rization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dia- Hakan Inan, Khashayar Khosravi, and Richard Socher. logue, SIGDIAL ’10, pages 147–156, Stroudsburg, 2017. Tying word vectors and word classifiers: A PA, USA. Association for Computational Linguis- loss framework for language modeling. In ICLR. tics. Hongyan Jing and Kathleen R. McKeown. 2000. Cut Minh-Thang Luong, Hieu Pham, and Christopher D. and paste based text summarization. In Proceed- Manning. 2015. Effective approaches to attention- ings of the 1st North American Chapter of the As- based neural machine translation. In Empir- sociation for Computational Linguistics Conference, ical Methods in Natural Language Processing NAACL 2000, pages 178–185, Stroudsburg, PA, (EMNLP), pages 1412–1421, Lisbon, Portugal. As- USA. Association for Computational Linguistics. sociation for Computational Linguistics. Yuta Kikuchi, Tsutomu Hirao, Hiroya Takamura, Man- André F. T. Martins and Noah A. Smith. 2009. Sum- abu Okumura, and Masaaki Nagata. 2014. Single marization with a joint model for sentence extraction document summarization based on nested tree struc- and compression. In Proceedings of the Workshop ture. In Proceedings of the 52nd Annual Meeting of on Integer Linear Programming for Natural Lan- the Association for Computational Linguistics (Vol- gauge Processing, ILP ’09, pages 1–9, Stroudsburg, ume 2: Short Papers), pages 315–320, Baltimore, PA, USA. Association for Computational Linguis- Maryland. Association for Computational Linguis- tics. tics. Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sen- Yoon Kim. 2014. Convolutional neural networks tence compression. In EMNLP. for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Language Processing (EMNLP), pages 1746–1751, rado, and Jeff Dean. 2013. Distributed representa- Doha, Qatar. Association for Computational Lin- tions of words and phrases and their composition- guistics. ality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Ad- Diederik P. Kingma and Jimmy Ba. 2014. Adam: A vances in Neural Information Processing Systems method for stochastic optimization. In ICLR. 26, pages 3111–3119. Curran Associates, Inc. Kevin Knight and Daniel Marcu. 2000. Statistics- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi based summarization - step one: Sentence compres- Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, sion. In Proceedings of the Seventeenth National David Silver, and Koray Kavukcuoglu. 2016. Asyn- Conference on Artificial Intelligence and Twelfth chronous methods for deep reinforcement learning. Conference on Innovative Applications of Artificial In Proceedings of The 33rd International Confer- Intelligence, pages 703–710. AAAI Press. ence on Machine Learning, volume 48 of Proceed- ings of Machine Learning Research, pages 1928– Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang 1937, New York, New York, USA. PMLR. Weng. 2014. Improving multi-documents summa- rization by sentence compression based on expanded Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. constituent parse trees. In Proceedings of the 2014 Summarunner: A recurrent neural network based se- Conference on Empirical Methods in Natural Lan- quence model for extractive summarization of doc- guage Processing (EMNLP), pages 691–701. Asso- uments. In AAAI Conference on Artificial Intelli- ciation for Computational Linguistics. gence. Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A sim- Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos ple, fast diverse decoding algorithm for neural gen- santos, Caglar Gulcehre, and Bing Xiang. 2016. eration. arXiv preprint, abs/1611.08562. Abstractive text summarization using sequence-to- sequence rnns and beyond. In CoNLL. Chin-Yew Lin. 2004. Rouge: A package for automatic Shashi Narayan, Shay B. Cohen, and Mirella Lapata. evaluation of summaries. In Text Summarization 2018. Ranking sentences for extractive summariza- Branches Out: Proceedings of the ACL-04 Work- tion with reinforcement learning. NAACL-HLT. shop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. Eric W Noreen. 1989. Computer-intensive methods for testing hypotheses. Wiley New York. Jeffrey Ling and Alexander Rush. 2017. Coarse-to-fine attention models for document summarization. In Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Proceedings of the Workshop on New Frontiers in 2013. On the difficulty of training recurrent neural Summarization, pages 33–42. Association for Com- networks. In Proceedings of the 30th International putational Linguistics. Conference on Machine Learning, volume 28 of
Proceedings of Machine Learning Research, pages Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 1310–1318, Atlanta, Georgia, USA. PMLR. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Romain Paulus, Caiming Xiong, and Richard Socher. editors, Advances in Neural Information Processing 2018. A deep reinforced model for abstractive sum- Systems 28, pages 2692–2700. Curran Associates, marization. In ICLR. Inc. Ofir Press and Lior Wolf. 2017. Using the output em- Xun Wang, Yasuhisa Yoshida, Tsutomu Hirao, Kat- bedding to improve language models. In Proceed- suhito Sudoh, and Masaaki Nagata. 2015. Sum- ings of the 15th Conference of the European Chap- marization based on task-oriented discourse parsing. ter of the Association for Computational Linguistics: IEEE/ACM Trans. Audio, Speech and Lang. Proc., Volume 2, Short Papers, pages 157–163. Association 23(8):1358–1367. for Computational Linguistics. Ronald J. Williams. 1992. Simple statistical gradient- Xian Qian and Yang Liu. 2013. Fast joint compres- following algorithms for connectionist reinforce- sion and summarization via graph cuts. In Proceed- ment learning. Mach. Learn., 8(3-4):229–256. ings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1492–1502, David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. Seattle, Washington, USA. Association for Compu- Bbn/umd at duc-2004: Topiary. In HLT-NAACL tational Linguistics. 2004 Document Understanding Workshop, pages 112–119, Boston, Massachusetts. Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level train- Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. ing with recurrent neural networks. In ICLR. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th An- Alexander M. Rush, Sumit Chopra, and Jason Weston. nual Meeting of the Association for Computational 2015. A neural attention model for abstractive sen- Linguistics (Volume 1: Long Papers), pages 1095– tence summarization. In Proceedings of the 2015 1104. Association for Computational Linguistics. Conference on Empirical Methods in Natural Lan- guage Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics. Mike Schuster, Kuldip K. Paliwal, and A. General. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing. Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer- generator networks. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083. Association for Computational Linguistics. Jun Suzuki and Masaaki Nagata. 2016. Rnn-based encoder-decoder approach with word frequency es- timation. In EACL. Swabha Swayamdipta, Ankur P. Parikh, and Tom Kwiatkowski. 2017. Multi-mention learning for reading comprehension with neural cascades. arXiv preprint, abs/1711.00894. Chuanqi Tan, Furu Wei, Nan Yang, Weifeng Lv, and Ming Zhou. 2018. S-net: From answer extraction to answer generation for machine reading comprehen- sion. In AAAI. Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph- based attentional neural model. In ACL. Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In International Conference on Learning Represen- tations (ICLR).
Supplementary Materials A.3 Actor-Critic Policy Gradient A Model Details Here we discuss the details of the actor-critic pol- icy gradient training. Given the MDP formulation A.1 Convolutional Encoder described in Sec. 3.2 , the return (total discounted future reward) is Here we describe the convolutional sentence rep- resentation used in Sec. 2.1.1. We use the tempo- Ns ral convolutional model proposed by Kim (2014) Rt = X γ t r(t + 1) (9) to compute the representation of every individual t=1 sentence in the document. First, the words are con- verted to the distributed vector representation by a for each recurrent step t. To learn a optimal policy learned word embedding matrix Wemb . The se- π ∗ that maximize the state-value function: quence of the word vectors from each sentence is ∗ then fed through 1-D single-layer convolution fil- V π (c) = Eπ∗ [Rt |ct = c] ters with various window sizes (3, 4, 5) to capture the temporal dependencies of nearby words and we will make use of the action-value function then followed by relu non-linear activation and max-over-time pooling. The convolutional repre- Qπθ (c, j) = Eπθ [Rt |ct = c, jt = j] sentation rj for the jth sentence is then obtained by concatenating the outputs from the activations of all filter window sizes. We then take the policy gradient theorem and then substitute the action-value function with the A.2 Abstractor Monte-Carlo sample: In this section we discuss the architecture choices ∇θ J(θ) = Eπθ [∇θ logπθ (c, j)Qπθ (c, j)] (10) for our abstractor network in Sec. 2.2. At a high- Ns level, it is a sequence-to-sequence model with at- 1 X = ∇θ logπθ (ct , jt )Rt (11) tention and copy mechanism (but no coverage). Ns t=1 Note that the abstractor network is a separate neu- ral network from the extractor agent without any which runs a single episode and gets the return (es- form of parameter sharing. timate of action-value function) by sampling from the policy πθ , where Ns is the total number of Sequence-Attention-Sequence Model We use sentences the agent extracts. This gradient up- a standard encoder-aligner-decoder model (Bah- date is also known as the REINFORCE algorithm danau et al., 2015; Luong et al., 2015) with the (Williams, 1992). bilinear multiplicative attention function (Luong The vanilla REINFORCE algorithm is known et al., 2015), fatt (hi , zj ) = h> i Wattn zj , for the for high variance. To mitigate this problem we context vector ej . We share the source and target add a critic network with trainable parameters θc embedding matrix Wemb as well as output projec- having the same structure as the pointer-network’s tion matrix as in Inan et al. (2017); Press and Wolf decoder (described in Sec. 2.1.2) but change the fi- (2017); Paulus et al. (2018). nal output layer to regress the state-value function Copy Mechanism We add the copying mech- V πθa ,ω (c). The predicted value bθc ,ω (c) is called anism as in See et al. (2017) to extend the de- the baseline and is subtracted from the action- coder to predict over the extended vocabulary of value function to estimate the advantage words in the input document. A copy probability pcopy = σ(vẑ> ẑj + vs> zj + vw > w + b) is calculated j Aπθ (c, j) = Qπθa ,ω (c, j) − bθc ,ω (c) by learnable parameters v’s and b, and then is used to further compute a weighted sum of the probabil- where θ = {θa , θc , ω} denotes the set of all train- ity of source vocabulary and the predefined vocab- able parameters. The new policy gradient for ulary. At test time, an OOV prediction is replaced our extractor can be estimated by substituting the by the document word with the highest attention action-value function in Eqn. 10 by the advantage score. and then use Monte-Carlo samples (use Rt to esti-
You can also read