Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models
Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Junji Tomita
NTT Media Intelligence Laboratory, NTT Corporation
itsumi.saito.df@hco.ntt.co.jp

arXiv:2003.13028v1 [cs.CL] 29 Mar 2020
* Work in progress.

Abstract

Pre-trained sequence-to-sequence (seq-to-seq) models have significantly improved the accuracy of several language generation tasks, including abstractive summarization. Although the fluency of abstractive summarization has been greatly improved by fine-tuning these models, it is not clear whether they can also identify the important parts of the source text to be included in the summary. In this study, we investigated the effectiveness of combining saliency models that identify the important parts of the source text with pre-trained seq-to-seq models through extensive experiments. We also proposed a new combination model consisting of a saliency model that extracts a token sequence from a source text and a seq-to-seq model that takes the sequence as an additional input text. Experimental results showed that most of the combination models outperformed a simple fine-tuned seq-to-seq model on both the CNN/DM and XSum datasets, even if the seq-to-seq model is pre-trained on large-scale corpora. Moreover, for the CNN/DM dataset, the proposed combination model exceeded the previous best-performing model by 1.33 points on ROUGE-L.

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019) have significantly improved the accuracy of various language processing tasks. However, we cannot apply BERT to language generation tasks as is because its model structure is not suitable for language generation. Several pre-trained seq-to-seq models for language generation (Lewis et al., 2019; Raffel et al., 2019), based on an encoder-decoder Transformer, which is a standard model for language generation, have recently been proposed. These models have achieved state-of-the-art results in various language generation tasks, including abstractive summarization.

However, when generating a summary, it is essential to correctly predict which part of the source text should be included in the summary. Some previous studies without pre-training have examined combining extractive summarization with abstractive summarization (Gehrmann et al., 2018; Hsu et al., 2018). Although pre-trained seq-to-seq models have achieved higher accuracy than previous models, it is not clear whether modeling "Which part of the source text is important?" can be learned through pre-training.

The purpose of this study is to clarify the effectiveness of combining saliency models that identify the important parts of the source text with a pre-trained seq-to-seq model in the abstractive summarization task. Our main contributions are as follows:

• We investigated nine combinations of pre-trained seq-to-seq and token-level saliency models, where the saliency models either share the parameters with the encoder of the seq-to-seq model or extract important tokens independently of the encoder.

• We proposed a new combination model, the conditional summarization model with important tokens (CIT), in which a token sequence extracted by a saliency model is explicitly given to a seq-to-seq model as an additional input text.

• We evaluated the combination models on the CNN/DM (Hermann et al., 2015) and XSum (Narayan et al., 2018) datasets. Our CIT model outperformed a simple fine-tuned model in terms of ROUGE scores on both datasets.
2 Task Definition

Our study focuses on two tasks: abstractive summarization and saliency detection. The main task is abstractive summarization, and the sub-task is saliency detection, which is the prediction of important parts of the source text. The problem formulations of the two tasks are described below.

Task 1 (Abstractive summarization) Given the source text X, the output is an abstractive summary Y = (y_1, ..., y_T).

Task 2 (Saliency detection) Given the source text X with L words, X = (x_1, ..., x_L), the output is the saliency score S = {S_1, S_2, ..., S_L}.

In this study, we investigate several combinations of models for these two tasks.

3 Pre-trained seq-to-seq Model

There are several pre-trained seq-to-seq models applied to abstractive summarization (Song et al., 2019; Dong et al., 2019; Raffel et al., 2019). These models use a simple Transformer-based encoder-decoder model (Vaswani et al., 2017) in which the encoder-decoder model is pre-trained on large unlabeled data.

3.1 Transformer-based Encoder-Decoder

In this work, we define the Transformer-based encoder-decoder model as follows.

Encoder The encoder consists of M encoder blocks. The input of the encoder is X = {x_1, x_2, ..., x_L}. The output through the M encoder blocks is defined as

  H_e^M = {h_{e1}^M, h_{e2}^M, ..., h_{eL}^M} ∈ ℝ^{L×d}.  (1)

Each encoder block consists of a self-attention module and a two-layer feed-forward network.

Decoder The decoder consists of M decoder blocks. The inputs of the decoder are the output of the encoder H_e^M and the outputs of the previous steps of the decoder {y_1, ..., y_{t−1}}. The output through the M Transformer decoder blocks is defined as

  H_d^M = {h_{d1}^M, ..., h_{dt}^M} ∈ ℝ^{t×d}.  (2)

At each step t, h_{dt}^M is projected to the vocabulary space, and the decoder outputs the highest-probability token as the next token. Each Transformer decoder block consists of a self-attention module, a context-attention module, and a two-layer feed-forward network.

Multi-head Attention The encoder and decoder blocks use multi-head attention, which consists of a combination of K attention heads and is denoted as Multihead(Q, K, V) = Concat(head_1, ..., head_K) W^O, where each head is head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

The weight matrix A in each attention head, Attention(Q̃, K̃, Ṽ) = A Ṽ, is defined as

  A = softmax( Q̃ K̃^⊤ / √d_k ) ∈ ℝ^{I×J},  (3)

where d_k = d/K, Q̃ ∈ ℝ^{I×d}, and K̃, Ṽ ∈ ℝ^{J×d}. In the m-th layer of self-attention, the same representation H^m is given to Q, K, and V. In the context-attention, we give H_d^m to Q and H_e^M to K and V.

3.2 Summary Loss Function

To fine-tune the seq-to-seq model for abstractive summarization, we use the cross-entropy loss

  L_sum = −(1/NT) Σ_{n=1}^{N} Σ_{t=1}^{T} log P(y_t^n),  (4)

where N is the number of training samples.

4 Saliency Models

Several studies have proposed combinations of a token-level saliency model and a seq-to-seq model that is not pre-trained, and reported their effectiveness (Gehrmann et al., 2018; Zhou et al., 2017). We also use a simple token-level saliency model as a basic model in this study.

4.1 Basic Saliency Model

A basic saliency model consists of M-layer Transformer encoder blocks (Encoder_sal) and a single-layer feed-forward network. We define the saliency score of the l-th token (1 ≤ l ≤ L) in the source text as

  S_l = σ( W_1^⊤ Encoder_sal(X)_l + b_1 ),  (5)

where Encoder_sal(·) represents the output of the last layer of Encoder_sal, W_1 ∈ ℝ^d and b_1 are learnable parameters, and σ is the sigmoid function.
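To make Eq. (5) concrete, the following is a minimal PyTorch-style sketch of the basic saliency model: an M-layer Transformer encoder followed by a single linear layer and a sigmoid. The class, argument, and hyperparameter names are illustrative assumptions, not the authors' code; in the paper the encoder is initialized from a pre-trained model rather than trained from scratch.

```python
import torch
import torch.nn as nn

class BasicSaliencyModel(nn.Module):
    """Token-level saliency scorer: M-layer Transformer encoder + 1-layer FFN (Eq. 5)."""

    def __init__(self, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder_sal = nn.TransformerEncoder(layer, n_layers)  # Encoder_sal
        self.w1 = nn.Linear(d_model, 1)                            # W_1 and b_1

    def forward(self, token_embeddings, pad_mask=None):
        # token_embeddings: (batch, L, d) embeddings of the source tokens x_1..x_L
        h = self.encoder_sal(token_embeddings, src_key_padding_mask=pad_mask)
        # S_l = sigmoid(W_1^T Encoder_sal(X)_l + b_1): one score per source token
        return torch.sigmoid(self.w1(h)).squeeze(-1)               # (batch, L)
```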
4.2 Two Types of Saliency Model for Combination

In this study, we use two types of saliency model for combination: a shared encoder and an extractor. Both model structures are based on the basic saliency model. We describe them below.

Shared encoder The shared encoder shares the parameters of Encoder_sal with the encoder of the seq-to-seq model. This model is jointly trained with the seq-to-seq model, and the saliency score is used to bias the representations of the seq-to-seq model.

Extractor The extractor extracts the important tokens or sentences from the source text on the basis of the saliency score. The extractor is separate from the seq-to-seq model, and each model is trained independently.

4.3 Pseudo Reference Label

The saliency model predicts the saliency score S_l for each token x_l. If there is a reference label r_l ∈ {0, 1} for each x_l, we can train the saliency model in a supervised manner. However, the reference label for each token is typically not given, since the training data for summarization consists of only the source text and its reference summary. Although there are no reference saliency labels, we can make pseudo reference labels by aligning the source and summary token sequences and extracting the common tokens (Gehrmann et al., 2018). We use these pseudo labels when we train the saliency model in a supervised manner.
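As an illustration of the pseudo reference labels of §4.3, the sketch below marks each source token that also appears in the reference summary. This is only a simplified picture of "extracting common tokens"; the alignment heuristic actually used (following Gehrmann et al., 2018) may differ in detail, e.g. in how repeated tokens and sub-word units are handled.

```python
def pseudo_saliency_labels(src_tokens, ref_tokens):
    """Pseudo reference labels r_l: 1 if source token x_l also occurs in the
    reference summary, else 0 (a simple token-overlap heuristic)."""
    ref_vocab = set(ref_tokens)
    return [1 if tok in ref_vocab else 0 for tok in src_tokens]

# pseudo_saliency_labels(["the", "cat", "sat", "on", "a", "mat"],
#                        ["a", "cat", "on", "a", "mat"])
# -> [0, 1, 0, 1, 1, 1]
```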
4.4 Saliency Loss Function

To train the saliency model in a supervised way with pseudo reference labels, we use the binary cross-entropy loss

  L_sal = −(1/NL) Σ_{n=1}^{N} Σ_{l=1}^{L} [ r_l^n log S_l^n + (1 − r_l^n) log(1 − S_l^n) ],  (6)

where r_l^n is the pseudo reference label of token x_l in the n-th sample.

5 Combined Models

This section describes nine combinations of the pre-trained seq-to-seq model and saliency models.

Combination types We roughly categorize the combinations into three types. Figure 1 shows an image of each combination.

The first type uses the shared encoder (§5.1). These models consist of the shared encoder and the decoder, where the shared encoder module plays two roles: saliency detection and encoding for the seq-to-seq model. For several models of this type, the saliency scores are used to bias the representations of the seq-to-seq model.

The second type uses the extractor (§5.2, §5.3). These models consist of the extractor, encoder, and decoder and follow two steps: first, the extractor extracts the important tokens or sentences from the source text, and second, the encoder uses them as an input of the seq-to-seq model. Our proposed model (CIT) belongs to this type.

The third type uses both the shared encoder and the extractor (§5.4). These models consist of the extractor, shared encoder, and decoder and also follow two steps: first, the extractor extracts the important tokens from the source text, and second, the shared encoder uses them as an input of the seq-to-seq model.

Loss function From the viewpoint of the loss function, there are two major types of model: those that use the saliency loss (§4.4) and those that do not. We denote the loss function for the seq-to-seq model as L_abs and the loss function for the extractor as L_ext. The extractor is trained with L_ext = L_sal, and the seq-to-seq model is trained with L_abs = L_sum or L_abs = L_sum + L_sal.

5.1 Using the Shared Encoder to Combine the Saliency Model and the Seq-to-seq Model

Multi-Task (MT) This model trains the shared encoder and the decoder by minimizing both the summary and saliency losses. The loss function of this model is L_abs = L_sum + L_sal.

Selective Encoding (SE) This model uses the saliency score to weight the output of the shared encoder. Specifically, the final output h_{el}^M of the shared encoder is weighted as

  h̃_{el}^M = h_{el}^M · S_l.  (7)

Then, we replace the input of the decoder h_{el}^M with h̃_{el}^M. Although Zhou et al. (2017) used a BiGRU, we use a Transformer for fair comparison. The loss function of this model is L_abs = L_sum.
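A rough sketch of the SE weighting of Eq. (7) and of the loss options used by MT, SE, and SE + MT (Eqs. (4) and (6)) is given below. The function and tensor names are illustrative, and padding masks and ignore indices are omitted; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def selective_encoding(h_enc, saliency):
    """Eq. (7): scale each final encoder state h_el^M by its saliency score S_l.

    h_enc:    (batch, L, d) shared-encoder outputs H_e^M
    saliency: (batch, L)    token saliency scores S
    """
    return h_enc * saliency.unsqueeze(-1)  # h~_el^M = h_el^M * S_l

def combined_loss(summary_logits, summary_ids, saliency, pseudo_labels,
                  use_saliency_loss=True):
    """L_abs = L_sum (+ L_sal): token-level cross entropy over the summary plus,
    optionally, binary cross entropy of saliency scores against pseudo labels."""
    # summary_logits: (batch, T, vocab); summary_ids: (batch, T)
    l_sum = F.cross_entropy(summary_logits.transpose(1, 2), summary_ids)
    if not use_saliency_loss:
        return l_sum                      # SE: L_abs = L_sum
    l_sal = F.binary_cross_entropy(saliency, pseudo_labels.float())
    return l_sum + l_sal                  # MT / SE + MT: L_abs = L_sum + L_sal
```

In the SE model the saliency scores are still produced by the shared encoder but receive no direct supervision, while MT and SE + MT add the saliency term.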
[Figure 1 diagrams the five representative combinations: (a) SE, (b) SA, (c) SEG, (d) CIT, (e) CIT + SE.]

Figure 1: Combinations of seq-to-seq and saliency models. Purple: encoder. Blue: decoder. Red: shared encoder, which is a shared model for saliency detection and encoding, used in (a), (b), and (e). Yellow: extractor, which is an independent saliency model that extracts important (c) sentences X_s or (d), (e) tokens C from the source text X. Each of these colored blocks represents M-layer Transformer blocks. Gray: linear transformation. Green: context attention. Pink: output trained in a supervised manner, where S is the saliency score and Y is the summary.

Combination of SE and MT This model has the same structure as SE. The loss function of this model is L_abs = L_sum + L_sal.

Selective Attention (SA) This model weights the attention scores on the decoder side, unlike the SE model. Specifically, the attention score a_i^t ∈ ℝ^L at each step t is weighted by S_l, where a_i^t is the t-th row of A_i ∈ ℝ^{T×L}, the weight matrix of the i-th attention head in the context-attention (Eq. (3)):

  ã_{il}^t = a_{il}^t S_l / Σ_{l'} a_{il'}^t S_{l'}.  (8)

Gehrmann et al. (2018) took a similar approach in that their model weights the copy probability of a pointer-generator model. However, as the pre-trained seq-to-seq model does not have a copy mechanism, we weight the context-attention in all Transformer decoder blocks. The loss function of this model is L_abs = L_sum.

Combination of SA and MT This model has the same structure as SA. The loss function of this model is L_abs = L_sum + L_sal.

5.2 Using the Extractor to Refine the Input Text

Sentence Extraction then Generation (SEG) This model first extracts the salient sentences on the basis of a sentence-level saliency score S_j, which is calculated from the token-level saliency scores S_l of the extractor as

  S_j = (1/N_j) Σ_{l: x_l ∈ X_j} S_l,  (9)

where N_j and X_j are the number of tokens and the set of tokens within the j-th sentence. The top P sentences are extracted according to the sentence-level saliency score and then concatenated as one text X_s. These extracted sentences are then used as the input of the seq-to-seq model.

In training, we extracted the X_s that maximizes the ROUGE-L score with the reference summary text. In testing, we used the average number of sentences in X_s over the training set as P. The loss function of the extractor is L_ext = L_sal, and that of the seq-to-seq model is L_abs = L_sum.

5.3 Proposed: Using the Extractor to Extract an Additional Input Text

Conditional Summarization Model with Important Tokens (CIT) We propose a new combination of the extractor and the seq-to-seq model, CIT, which can consider important tokens explicitly. Although the SE and SA models softly weight the representations of the source text or the attention scores, they cannot select salient tokens explicitly. SEG explicitly extracts the salient sentences from the source text, but it cannot give token-level information to the seq-to-seq model, and it sometimes drops important information when extracting sentences. In contrast, CIT uses the tokens extracted according to the saliency scores as an additional input of the seq-to-seq model. By adding token-level information, CIT can effectively guide the abstractive summary without dropping any important information.

Specifically, K tokens C = {c_1, ..., c_K} are extracted in descending order of the saliency score S, where S is obtained by feeding X to the extractor. The order of C retains the order of the source text X. A combined text X̃ = Concat(C, X) is given to the seq-to-seq model as the input text. The loss function of the extractor is L_ext = L_sal, and that of the seq-to-seq model is L_abs = L_sum.
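The CIT input construction just described can be sketched as follows. This is illustrative code, not the released implementation; in particular, the separator token between C and X and the tokenization granularity are assumptions.

```python
def build_cit_input(src_tokens, saliency_scores, k, sep_token="</s>"):
    """Build the CIT input X~ = Concat(C, X): the top-k salient tokens C,
    kept in their original source order, concatenated with the source text X.

    src_tokens:      list of source (sub)tokens x_1..x_L
    saliency_scores: list of extractor saliency scores S_1..S_L
    """
    # indices of the k highest-scoring tokens
    topk = sorted(range(len(src_tokens)),
                  key=lambda i: saliency_scores[i], reverse=True)[:k]
    # keep the extracted tokens in the order they appear in the source text
    important = [src_tokens[i] for i in sorted(topk)]
    # the combined text is fed to the pre-trained seq-to-seq model
    return important + [sep_token] + src_tokens

# build_cit_input(["the", "cat", "sat", "on", "the", "mat"],
#                 [0.1, 0.9, 0.2, 0.1, 0.1, 0.8], k=2)
# -> ["cat", "mat", "</s>", "the", "cat", "sat", "on", "the", "mat"]
```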
5.4 Proposed: Combination of Extractor and Shared Encoder

Combination of CIT and SE This model combines CIT and SE: CIT uses an extractor for extracting the important tokens, and SE is trained by using the shared encoder in the seq-to-seq model. The SE model is trained in an unsupervised way. The output H_e^M ∈ ℝ^{(L+K)×d} of the shared encoder is weighted by the saliency score S ∈ ℝ^{L+K} with Eq. (7), where S is estimated by using the output of the shared encoder with Eq. (5). The loss function of the extractor is L_ext = L_sal, and that of the seq-to-seq model is L_abs = L_sum.

Combination of CIT and SA This model combines CIT and SA, so we also train two saliency models. The SA model is trained in an unsupervised way, the same as in the CIT + SE model. The attention score a_i^t ∈ ℝ^{L+K} is weighted by S ∈ ℝ^{L+K} with Eq. (8). The loss function of the extractor is L_ext = L_sal, and that of the seq-to-seq model is L_abs = L_sum.

6 Experiments

6.1 Dataset

We used the CNN/DM dataset (Hermann et al., 2015) and the XSum dataset (Narayan et al., 2018), which are both standard datasets for news summarization. The details of the two datasets are listed in Table 1. CNN/DM is a highly extractive summarization dataset, and XSum is a highly abstractive summarization dataset.

  set      train    dev     eval    avg. length
  CNN/DM   287,227  13,368  11,490  69.6
  XSum     203,150  11,279  11,267  26.6

Table 1: Details of the datasets used in this paper.

6.1.1 Model Configurations

We used BART_LARGE (Lewis et al., 2019), which is one of the state-of-the-art models, as the pre-trained seq-to-seq model and RoBERTa_BASE (Liu et al., 2019) as the initial model of the extractor. In the extractor of CIT, stop words and duplicate tokens are ignored for the XSum dataset. We used fairseq (https://github.com/pytorch/fairseq) for the implementation of the seq-to-seq model. For fine-tuning of BART_LARGE and the combination models, we used the same parameters as the official code. For fine-tuning of RoBERTa_BASE, we used Transformers (https://github.com/huggingface/transformers). We set the learning rate to 0.00005 and the batch size to 32.

6.2 Evaluation Metrics

We used ROUGE scores (F1), including ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L), as the evaluation metrics (Lin, 2004). ROUGE scores were calculated using the files2rouge toolkit (https://github.com/pltrdy/files2rouge).
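For intuition, ROUGE-1 F1 reduces to clipped unigram-overlap precision and recall between a generated summary and its reference. The simplified sketch below is for illustration only; the scores reported in this paper were computed with the files2rouge toolkit, not with this function.

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())          # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# rouge1_f1("the cat sat".split(), "the cat was sitting".split()) -> 0.571...
```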
6.3 Results

Do saliency models improve summarization accuracy in highly extractive datasets? ROUGE scores of the combined models on the CNN/DM dataset are shown in Table 2. We can see that all combined models outperformed the simple fine-tuned BART. This indicates that saliency detection is effective for highly extractive datasets. One of the proposed models, CIT + SE, achieved the highest accuracy. The CIT model alone also outperformed the other saliency models. This indicates that the CIT model effectively guides the abstractive summarization by combining explicitly extracted tokens.

  models                      R-1    R-2    R-L
  BART (Lewis et al., 2019)   44.16  21.28  40.90
  BART (our fine-tuning)      43.79  21.00  40.58
  MT                          44.84  21.71  41.52
  SE                          44.59  21.49  41.28
  SE + MT                     45.23  22.07  41.94
  SA                          44.72  21.59  41.40
  SA + MT                     44.93  21.81  41.61
  SEG                         44.62  21.51  41.29
  CIT                         45.74  22.50  42.44
  CIT + SE                    45.80  22.53  42.48
  CIT + SA                    45.74  22.48  42.44

Table 2: Results of BART and combined models on the CNN/DM dataset. The five row groups are the models described in §3, §5.1, §5.2, §5.3, and §5.4, in order from top to bottom.

Do saliency models improve summarization accuracy in highly abstractive datasets? ROUGE scores of the combined models on the XSum dataset are shown in Table 3. The CIT model performed the best, although its improvement was smaller than on the CNN/DM dataset. Moreover, the accuracy of the MT, SE + MT, and SEG models decreased on the XSum dataset. These results were very different from those on the CNN/DM dataset.

  models                      R-1    R-2    R-L
  BART (Lewis et al., 2019)   45.14  22.27  37.25
  BART (our fine-tuning)      45.10  21.80  36.54
  MT                          44.57  21.40  36.31
  SE                          45.34  21.98  36.79
  SE + MT                     44.64  21.30  36.07
  SA                          45.35  22.02  36.83
  SA + MT                     45.37  21.98  36.77
  SEG                         41.03  18.20  32.75
  CIT                         45.42  22.13  36.92
  CIT + SE                    45.36  22.02  36.83
  CIT + SA                    45.30  21.94  36.76

Table 3: Results of BART and combined models on the XSum dataset. The underlined result represents the best result among the models that outperformed our simple fine-tuning result.

One reason for the difference can be traced to the quality of the pseudo saliency labels. CNN/DM is a highly extractive dataset, so it is relatively easy to create token alignments for generating pseudo saliency labels. In contrast, a summary in XSum is highly abstractive and short, which makes it difficult to create high-quality pseudo labels by simple token alignment. To improve the accuracy of summarization on this dataset, we have to improve the quality of the pseudo saliency labels and the accuracy of the saliency model.

How accurate are the outputs of the extractors? We analyzed the quality of the tokens extracted by the extractor in CIT. The results are summarized in Table 4. On the CNN/DM dataset, the ROUGE-1 and ROUGE-2 scores of our extractor (top-K tokens) were higher than those of the other models, while the ROUGE-L score was lower than that of the sentence-based extraction methods. This is because our token-level extractor finds the important tokens, whereas the seq-to-seq model learns how to generate a fluent summary incorporating these important tokens.

                       CNN/DM                   XSum
  models               R-1    R-2    R-L       R-1    R-2    R-L
  Lead3                40.3   17.7   36.6      16.30  1.61   11.9
  BertSumExt (1)       43.25  20.24  39.63     –      –      –
  CIT: Top-K toks      46.72  20.53  37.73     24.28  3.50   15.63
  CIT: Top-3 sents     42.64  20.05  38.89     21.71  4.45   16.91

Table 4: Results of saliency models on the CNN/DM and XSum datasets. CIT extracted the top-K tokens or top-3 sentences from the source text. (1): (Liu and Lapata, 2019). The summaries in XSum are highly abstractive, so the result of BertSumExt for XSum was not reported.

On the other hand, the extractive results on the XSum dataset were lower. For highly abstractive datasets, there is little token overlap between the source text and the reference summary. We need to consider how to make high-quality pseudo saliency labels and how to evaluate the similarity of these two sequences.

Does the CIT model outperform other fine-tuned models? Our study focuses on combinations of saliency models and the pre-trained seq-to-seq model. However, there are several studies that focus more on the pre-training strategy. We compared the CIT model with those models; their ROUGE scores are shown in Tables 5 and 6.

  models                 data   params  R-1    R-2    R-L
  MASS_BASE (1)          16G    123M    42.12  19.50  39.01
  BertSumExtAbs (2)      16G    156M    42.13  19.60  39.18
  UniLM (3)              16G    340M    43.33  20.21  40.51
  T5_11B (4)             750G   11B     43.52  21.55  40.69
  BART_LARGE (5)         160G   400M    44.16  21.28  40.90
  PEGASUS_C4 (6)         750G   568M    43.90  21.20  40.76
  PEGASUS_HugeNews (6)   3.8T   568M    44.17  21.47  41.11
  ProphetNet (7)         160G   400M    44.20  21.17  41.30
  ERNIE-GEN_LARGE (8)    16G    340M    44.02  21.17  41.26
  UniLMv2_BASE (9)       160G   110M    43.16  20.42  40.14
  CIT                    160G   525M    45.74  22.50  42.44

Table 5: Results of state-of-the-art models and the proposed model on the CNN/DM dataset. We also report the size of the pre-training data and the number of parameters for each model. (1): (Song et al., 2019); (2): (Liu and Lapata, 2019); (3): (Dong et al., 2019); (4): (Raffel et al., 2019); (5): (Lewis et al., 2019); (6): (Zhang et al., 2019a); (7): (Yan et al., 2020); (8): (Xiao et al., 2020); (9): (Bao et al., 2020).

  models                 data   params  R-1    R-2    R-L
  MASS_BASE (1)          16G    123M    39.75  17.24  31.95
  BertSumExtAbs (2)      16G    156M    38.81  16.50  31.27
  BART_LARGE (3)         160G   400M    45.14  22.27  37.25
  PEGASUS_C4 (4)         750G   568M    45.20  22.06  36.99
  PEGASUS_HugeNews (4)   3.8T   568M    47.21  24.56  39.25
  UniLMv2_BASE (5)       160G   110M    44.00  21.11  36.08
  CIT                    160G   525M    45.42  22.13  36.92

Table 6: Results of state-of-the-art models and the proposed model on the XSum dataset. (1): (Song et al., 2019); (2): (Liu and Lapata, 2019); (3): (Lewis et al., 2019); (4): (Zhang et al., 2019a); (5): (Bao et al., 2020).

From Table 5, we can see that our model outperformed the recent pre-trained models on the CNN/DM dataset.
Even though PEGASUS_HugeNews was pre-trained on the largest corpus comprised of news-like articles, the accuracy of abstractive summarization was not improved much. Our model improved the accuracy without any additional pre-training. This result indicates that it is more effective to combine saliency models with the seq-to-seq model for generating a highly extractive summary.

On the other hand, on the XSum dataset, PEGASUS_HugeNews improved the ROUGE scores and achieved the best results. In the XSum dataset, summaries often include expressions that are not written in the source text. Therefore, increasing the pre-training data and learning more patterns were effective. However, by improving the quality of the pseudo saliency labels, we should be able to improve the accuracy of the CIT model.

7 Related Work and Discussion

Pre-trained Language Models for Abstractive Summarization Liu (2019) used BERT for their sentence-level extractive summarization model. Zhang et al. (2019b) proposed a new pre-trained model that considers document-level information for sentence-level extractive summarization. Several researchers have published pre-trained encoder-decoder models very recently (Wang et al., 2019; Lewis et al., 2019; Raffel et al., 2019). Wang et al. (2019) pre-trained a Transformer-based pointer-generator model. Lewis et al. (2019) pre-trained a standard Transformer-based encoder-decoder model using large unlabeled data and achieved state-of-the-art results. Dong et al. (2019) and Xiao et al. (2020) extended the BERT structure to handle seq-to-seq tasks.

All the studies above focused on how to learn a universal pre-trained model; they did not consider the combination of pre-trained and saliency models for an abstractive summarization model.

Abstractive Summarization with Saliency Models Hsu et al. (2018), Gehrmann et al. (2018), and You et al. (2019) incorporated a sentence- and word-level extractive model into the pointer-generator model. Their models weight the copy probability for the source text by using an extractive model and guide the pointer-generator model to copy important words. Li et al. (2018) proposed a keyword-guided abstractive summarization model. Chen and Bansal (2018) proposed a sentence extraction and re-writing model that is trained in an end-to-end manner by using reinforcement learning. Cao et al. (2018) proposed a search-and-rewrite model. Mendes et al. (2019) proposed a combination of sentence-level extraction and compression. None of these models are based on a pre-trained model.

In contrast, our purpose is to clarify whether combined models are effective or not, and we are the first to investigate the combination of pre-trained seq-to-seq and saliency models. We compared a variety of combinations and clarified which combination is the most effective.

8 Conclusion

This is the first study that has conducted extensive experiments to investigate the effectiveness of incorporating saliency models into a pre-trained seq-to-seq model. From the results, we found that saliency models were effective in finding the important parts of the source text, even if the seq-to-seq model is pre-trained on large-scale corpora, especially for generating a highly extractive summary. We also proposed a new combination model, CIT, that outperformed simple fine-tuning and the other combination models. Our combination model improved the summarization accuracy without any additional pre-training data and can be applied to any pre-trained model. While recent studies have been conducted to improve summarization accuracy by increasing the amount of pre-training data and developing new pre-training strategies, this study sheds light on the importance of saliency models in abstractive summarization.

References

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. CoRR, abs/2002.12804.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. In ACL, pages 152–161.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL, pages 675–686.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In NeurIPS, pages 13042–13054.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In EMNLP, pages 4098–4109.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693–1701.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In ACL, pages 132–141.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461.

Chenliang Li, Weiran Xu, Si Li, and Sheng Gao. 2018. Guiding generation for abstractive text summarization based on key information guide network. In ACL, pages 55–60.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL, pages 74–81.

Yang Liu. 2019. Fine-tune BERT for extractive summarization. CoRR, abs/1903.10318.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In EMNLP-IJCNLP, pages 3728–3738.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Afonso Mendes, Shashi Narayan, Sebastião Miranda, Zita Marinho, André F. T. Martins, and Shay B. Cohen. 2019. Jointly extracting and compressing documents with summary state representations. In NAACL, pages 3955–3966.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP, pages 1797–1807.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML, pages 5926–5936.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Liang Wang, Wei Zhao, Ruoyu Jia, Sujian Li, and Jingming Liu. 2019. Denoising based sequence-to-sequence pre-training for text generation. In EMNLP-IJCNLP, pages 4001–4013.

Dongling Xiao, Han Zhang, Yu-Kun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. CoRR, abs/2001.11314.

Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. CoRR, abs/2001.04063.

Yongjian You, Weijia Jia, Tianyi Liu, and Wenmian Yang. 2019. Improving abstractive document summarization with salient information modeling. In ACL, pages 2132–2141.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR, abs/1912.08777.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, pages 5059–5069.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In ACL, pages 1095–1104.