BACO: A Background Knowledge- and Content-Based Framework for Citing Sentence Generation
Yubin Ge1, Ly Dinh1, Xiaofeng Liu2, Jinsong Su3, Ziyao Lu3, Ante Wang3, Jana Diesner1
1 University of Illinois at Urbana-Champaign, USA
2 Harvard University, USA
3 Xiamen University, China
{yubinge2, dinh4, jdiesner}@illinois.edu, jssu@xmu.edu.cn

Abstract

In this paper, we focus on the problem of citing sentence generation, which entails generating a short text to capture the salient information in a cited paper and the connection between the citing and cited paper. We present BACO, a BAckground knowledge- and COntent-based framework for citing sentence generation, which considers two types of information: (1) background knowledge, by leveraging structural information from a citation network; and (2) content, which represents in-depth information about what to cite and why to cite. First, a citation network is encoded to provide background knowledge. Second, we apply salience estimation to identify what to cite by estimating the importance of sentences in the cited paper. During the decoding stage, both types of information are combined to facilitate the text generation. We then conduct joint training of the generator and citation function classification to make the model aware of why to cite. Our experimental results show that our framework outperforms comparative baselines.

1 Introduction

A citation systematically, strategically, and critically synthesizes content from a cited paper in the context of a citing paper (Smith, 1981). A paper's text that refers to prior work, which we herein refer to as citing sentences, forms the conceptual basis for a research question or problem; identifies issues, contradictions, or gaps with state-of-the-art solutions; and prepares readers to understand the contributions of a citing paper, e.g., in terms of theory, methods, or findings (Elkiss et al., 2008). Writing meaningful and concise citing sentences that capture the gist of cited papers and identify connections between citing and cited papers is not trivial (White, 2004). Learning how to write up information about related work with appropriate and meaningful citations is particularly challenging for new scholars (Mansourizadeh and Ahmad, 2011).

To assist scholars with note taking on prior work when working on a new research problem, this paper focuses on the task of citing sentence generation, which entails identifying salient information from cited papers and capturing connections between cited and citing papers. With this work, we hope to reduce scientific information overload for researchers by providing examples of concise citing sentences that address information from cited papers in the context of a new research problem and related write-up. While this task cannot and is not meant to replace the scholarly tasks of finding, reading, and synthesizing prior work, the proposed computational solution is intended to support especially new researchers in practicing the process of writing effective and focused reflections on prior work given a new context or problem.

A number of recent papers have focused on the task of citing sentence generation (Hu and Wan, 2014; Saggion et al., 2020; Xing et al., 2020), which is defined as generating a short text that describes a cited paper B in the context of a citing paper A; the sentences before and after the citing sentences in paper A are considered as context. However, previous work has mainly utilized limited information from citing and cited papers to solve this task. We acknowledge that any such solution, including ours, is a simplification of the intricate process of how scholars write citing sentences.

Given this motivation, we explore two sets of information to generate citing sentences, namely background knowledge in the form of citation networks, and content from both citing and cited papers, as shown in Figure 1. Using citation networks was inspired by the fact that scholars have analyzed such networks to identify the main themes and research developments in domain areas such as information sciences (Hou et al., 2018), business modeling (Li et al., 2017), and pharmaceutical research (Chen and Guan, 2011).
Figure 1: An example from our dataset (source: ACL Anthology Network corpus (Radev et al., 2013)). The red text in the citing paper is the citing sentence, and the special token #REFR indicates the citation of the cited paper. Our framework aims at capturing information from two perspectives: background knowledge and content. The background knowledge is learned by obtaining structural features of the citation network. The content information entails estimated sentence salience (higher salience is highlighted by darker color) in the cited paper and the corresponding citation function of the cited paper to the citing paper.

We use the content of citing and cited papers as a second set of features to capture two more in-depth content features: (1) What to cite - while the overall content of a cited paper needs to be understood by the authors of the citing paper, not all content is relevant for writing citing sentences. Therefore, we follow the example of estimating salient sentences (Yasunaga et al., 2019) and use the predicted salience to filter crucial information that should be integrated into the resulting citing sentence. (2) Why to cite - we define "citation function" as an approximation of an author's reason for citing a paper (Teufel et al., 2006). A number of previous studies on citation functions have used citing sentences and their context for classification (Zhao et al., 2019; Cohan et al., 2019). Our paper incorporates citation functions into citing sentence generation so that the generated citing sentences can be coherent given their context, and can still convey the motivation for a specific citation.

In this paper, we propose a BAckground knowledge- and COntent-based framework, named BACO. Specifically, we encode a citation network based on citation relations among papers to obtain background knowledge, and the given citing and cited papers to provide content information. We extend a standard pointer-generator (See et al., 2017) to copy words from cited and citing papers, and determine what to cite by estimating sentence salience in the cited paper. The various pieces of captured information are then combined as the context for the decoder. Furthermore, we extend our framework to include why to cite by jointly training the generation with citation function classification, which facilitates the acquisition of the content information.

As for the dataset, we extended the ACL Anthology Network corpus (AAN) (Radev et al., 2013) with citing sentences extracted using RegEx. We then hand-annotated the citation functions on a subset of the dataset, and trained a citation function labeling model based on SciBERT (Beltagy et al., 2019). The resulting labeling model was then used to automatically label the rest of the data to build a large-scale dataset.

We summarize our contributions as follows:
• We propose a BAckground knowledge- and COntent-based framework, named BACO, for citing sentence generation.
• We manually annotated a subset of citing sentences with citation functions to train a SciBERT-based model that automatically labels the rest of the data for citing sentence generation.
• Based on the results from experiments, we show that BACO outperforms comparative baselines by at least 2.57 points on ROUGE-2.

2 Related Work

Several studies on citing sentence generation have used keyword-based summarization methods (Hoang and Kan, 2010; Chen and Zhuge, 2016, 2019). To that end, they built keyword-based trees to extract sentences from cited papers as related work write-ups. These studies have two limitations: First, since related work sections are not simply (chronological) summaries of cited papers, synthesizing prior work in this manner is insufficient. Second, extractive summarization uses verbatim content from cited papers, which implies intellectual property issues (e.g., copyright violations) as well as ethical problems, such as a lack of intellectual engagement with prior work. Alternatively, abstractive summarization approaches, such as methods based on linear programming (Hu and Wan, 2014) and neural seq2seq methods (Wang et al., 2018), have also been explored. These approaches mainly focus on utilizing papers' content information, specifically the text of cited papers directly.
A recent paper that went beyond summarizing the content of cited papers (Xing et al., 2020) used a multi-source pointer-generator network with a cross attention mechanism to calculate the attention distribution between the citing sentence's context and the cited paper's abstract.

Our paper is based on the premise that citation network analysis can provide background knowledge that facilitates the understanding of papers in a field. Prior analyses of citation networks have been used to reveal the cognitive structure and interconnectedness of scientific (sub-)fields (Moore et al., 2005; Bruner et al., 2010), and to understand and detect trends in academic fields (You et al., 2017; Asatani et al., 2018). Network analysis has also been applied to citation networks to identify influential papers and key concepts (Huang et al., 2018), and to scope out research areas.

While previous studies have shown that using text from citing papers is useful to generate citing sentences, the benefit of other content-based features of a citation (e.g., reasons for citing) is insufficiently understood (Xing et al., 2020). Extant literature on citation context analysis (Moravcsik and Murugesan, 1975; Lipetz, 1965), which focused on the connections between the citing and cited papers with respect to purposes and reasons for citations, has found that citation function (Ding et al., 2014; White, 2004) is an important indicator of why a paper chose to cite specific paper(s). Based on a content analysis of 750 citing sentences from 60 papers published in two prominent physics journals, Lipetz (1965) identified 11 citation functions, such as questioned, affirmed, or refuted the cited paper's premises. Similarly, Moravcsik and Murugesan (1975) qualitatively coded the citation context of 30 articles on high energy physics, finding 10 citation functions grouped into 5 pairs: conceptual-operational, organic-perfunctory, evolutionary-juxtapositional, confirmative-negational, and valuable-redundant.

Citation context analysis has also been used to study the valence of citing papers towards cited papers (Athar, 2011; Abu-Jbara et al., 2013) by classifying citation context as positive, negative, or neutral. In this paper, we adopt Abu-Jbara et al. (2013)'s definition of a positive citation as a citation that explicitly states the strength(s) of a cited paper, or a situation where the citing paper's work is guided by the cited paper. In contrast to that, a negative citation is one that explicitly states the weakness(es) of a cited paper. A neutral citation is one that objectively summarizes the cited paper without an additional evaluation. In addition to these three categories, we also consider mixed citation contexts (Cullars, 1990), which are citations that contain both positive and negative evaluations of a cited paper, or where the evaluation is unclear. Given that our paper is a first attempt to integrate citation functions into citing sentence generation, we opted to start with a straightforward valence category schema before exploring more complex schemas in future work.

3 Dataset and Annotation

We first extended the AAN [1] (Radev et al., 2013) with the extracted citing sentences using RegEx. We followed the process in Xing et al. (2020) to label 1,200 randomly sampled citing sentences with their citation functions. The mark-up was done by 6 coders who were provided with definitions of positive, negative, neutral, and mixed citation functions, and ample examples for each valence category. Our codebook, including definitions and examples of citation functions, is shown in Table 1. After the annotation, we randomly split the dataset into 800 instances for training and the remaining 400 for testing. We then used the 800 human-annotated instances to train a citation function labeling model with 10-fold cross validation. The labeling task was treated as a multi-class classification problem.

Our labeling model was built upon SciBERT (Beltagy et al., 2019), a pre-trained language model based on BERT (Devlin et al., 2019) but trained on a large corpus of scientific text. We added a multilayer perceptron (MLP) to SciBERT, and fine-tuned the whole model on our dataset. As for the input, we concatenated each citing sentence with its context in the citing paper, and inserted a special tag [CLS] at the beginning and another special tag [SEP] to separate them. The final hidden state that corresponded to [CLS] was used as the aggregate sequence representation. This state was fed into the MLP, followed by the softmax function, for predicting the citation function of the citing sentence. We report details of test results and dataset statistics in the Appendix, Section A.1.

[1] http://aan.how/download/
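To make this setup concrete, the following is a minimal sketch of such a labeling model in PyTorch with the HuggingFace transformers library. It follows the description above and the hyperparameters in Appendix A.1 (MLP hidden size 256, dropout 0.2, four labels); the class name, the MLP depth, and the activation are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CitationFunctionLabeler(nn.Module):
    """SciBERT + MLP head for 4-way citation function classification (a sketch)."""

    def __init__(self, num_labels=4, mlp_hidden=256, dropout=0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        self.mlp = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(self.encoder.config.hidden_size, mlp_hidden),
            nn.Tanh(),  # activation choice is an assumption; the paper does not specify
            nn.Linear(mlp_hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = out.last_hidden_state[:, 0]  # final hidden state at [CLS]
        return self.mlp(cls_state)               # logits; softmax is applied below

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
# Passing the citing sentence and its context as a sentence pair makes the
# tokenizer insert [CLS] at the start and [SEP] between the two segments.
batch = tokenizer(
    "Most methods for relation classification are supervised #REFR.",
    "There is considerable interest in automatic relation classification.",
    return_tensors="pt",
)
logits = CitationFunctionLabeler()(batch["input_ids"], batch["attention_mask"])
probs = torch.softmax(logits, dim=-1)  # distribution over {positive, negative, neutral, mixed}
```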
Table 1: Definitions and examples of our proposed citation functions. The special token #REFR indicates the citation of a paper.

Positive
  Definition: When a citing paper explicitly supports a cited paper's premises and findings, and/or the cited paper's finding(s) are used to corroborate the citing paper's study.
  Example: "Our architecture is inspired by state-of-the-art model #REFR"

Negative
  Definition: When a citing paper points out weaknesses in the cited paper's premises and findings, or explicitly rejects the finding(s) from the cited paper.
  Example: "Unbounded Content Early versions of the customization system #REFR only allowed a small number of entries for things like lexical types."

Neutral
  Definition: When a citing paper objectively summarizes the cited paper's premises and findings, without explicitly offering any evaluations of the cited paper's finding(s).
  Example: "#REFR translates the extracted Vietnamese phrases into corresponding English ones"

Mixed
  Definition: When a citing paper does not clearly express its evaluation of the cited paper, or the citing sentence contains multiple citation functions.
  Example: "In previous work, this has been done successfully for question answering tasks #REFR, but not for web search in general."

4 Methodology

Our proposed framework includes an encoder and a generator, as shown in Figure 2. The encoder takes the citation network and the citing and cited papers as input, and encodes them to provide background knowledge and content information, respectively. The generator contains a decoder that can copy words from the citing and cited papers while retaining the ability to produce novel words, and a salience estimator that identifies key information from the cited paper. We then trained the framework with citation function classification to enable the recognition of why a paper was cited.

4.1 Encoder

Our encoder (the yellow shaded area in Figure 2) consists of two parts: a graph encoder that was trained to provide background knowledge based on the citation network, and a hierarchical RNN-based encoder that encodes the content information of the citing and cited papers.

4.1.1 Graph Encoder

We designed a citation network pre-training method for providing the background knowledge. In detail, we first constructed a citation network as a directed graph G = (V, E), where V is a set of nodes/papers [2] and E is a set of directed edges. Each edge links a citing paper (source) to a cited paper (target). To utilize G in our task, we employed a graph attention network (GAT) (Veličković et al., 2018) as our graph encoder, which leverages masked self-attentional layers to compute the hidden representation of each node. This GAT has been shown to be effective on multiple citation network benchmarks. We input a set of node pairs {(v_p, v_q)} into it for training on the link prediction task. We pre-trained our graph encoder using negative sampling to learn a node representation h^n_p for each paper p, which contains structural information of the citation network and can provide background knowledge for the downstream task.

[2] We use node and paper interchangeably.
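The paper specifies a GAT pre-trained for link prediction with negative sampling, but not the exact architecture or edge-scoring function. The sketch below, written with PyTorch Geometric, uses a two-layer GAT and scores candidate (citing, cited) pairs with a dot product followed by a binary cross-entropy loss; the layer sizes, the dot-product scorer, and all names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv
from torch_geometric.utils import negative_sampling

class CitationGraphEncoder(torch.nn.Module):
    """Two-layer GAT over the citation graph; pre-trained via link prediction."""

    def __init__(self, in_dim, hid_dim=64, out_dim=128, heads=8):
        super().__init__()
        self.gat1 = GATConv(in_dim, hid_dim, heads=heads)       # masked self-attention, layer 1
        self.gat2 = GATConv(hid_dim * heads, out_dim, heads=1)  # layer 2 -> node embeddings h^n

    def forward(self, x, edge_index):
        h = F.elu(self.gat1(x, edge_index))
        return self.gat2(h, edge_index)

def link_prediction_loss(model, x, edge_index):
    """Score positive (citing -> cited) edges against negatively sampled pairs."""
    h = model(x, edge_index)
    neg_edge_index = negative_sampling(edge_index, num_nodes=x.size(0))
    pos = (h[edge_index[0]] * h[edge_index[1]]).sum(dim=-1)        # dot-product edge scores
    neg = (h[neg_edge_index[0]] * h[neg_edge_index[1]]).sum(dim=-1)
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))

# x: bag-of-words features of each paper's abstract (per Appendix A.2);
# edge_index: 2 x |E| long tensor of (citing, cited) node indices.
```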
4.1.2 Hierarchical RNN-based Encoder

Given the word sequence {cw_i} of the citing sentence's context and the word sequence {aw_j} of the cited paper's abstract, we input the embeddings of the word tokens (e.g., e(w_t)) into a hierarchical RNN-based encoder that includes a word-level Bi-LSTM and a sentence-level Bi-LSTM. The output word-level representation of the citing sentence's context is denoted as {h^{cw}_i}, and the cited paper's abstract is encoded similarly as its word-level representation {h^{aw}_j}. Meanwhile, their sentence-level representations are denoted as {h^{cs}_m} and {h^{as}_n}.
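A minimal PyTorch version of this hierarchical encoder is sketched below. The paper does not state how word-level states are pooled into inputs for the sentence-level Bi-LSTM; mean pooling, used here, is one standard choice and is our assumption, as are the dimensions (300-dimensional embeddings and hidden size 256 follow Appendix A.2).

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word-level Bi-LSTM followed by a sentence-level Bi-LSTM (Section 4.1.2)."""

    def __init__(self, vocab_size, emb_dim=300, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hid_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):
        # tokens: (num_sents, max_words) word ids of one document (context or abstract)
        word_h, _ = self.word_lstm(self.embed(tokens))    # word-level states, e.g., {h^cw}
        sent_in = word_h.mean(dim=1)                      # pool words into one vector per sentence
        sent_h, _ = self.sent_lstm(sent_in.unsqueeze(0))  # sentence-level states, e.g., {h^cs}
        return word_h, sent_h.squeeze(0)

# Usage: word_h, sent_h = HierarchicalEncoder(50000)(torch.randint(0, 50000, (6, 30)))
```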
Figure 2: The overall architecture of the proposed framework. The encoder is used to encode the citation network and the papers' (both citing and cited) text. The generator estimates the salience of sentences in the cited paper's abstract, and utilizes this information for text generation. The framework is additionally trained with citation functions.

4.2 Generator

Our generator (the green shaded area in Figure 2) is an extension of the standard pointer-generator (See et al., 2017). It integrates both background knowledge and content information as context for text generation. The generator contains a decoder and an additional salience estimator that predicts the salience of sentences in the cited paper's abstract for refining the corresponding attention.

4.2.1 Decoder

The decoder is a unidirectional LSTM conditioned on all encoded hidden states. The attention distribution is calculated as in Bahdanau et al. (2015). Since we considered both the citing sentence's context and the cited paper's abstract on the source side, we applied the attention mechanism to {h^{cw}_i} and {h^{aw}_j} separately to obtain two attention vectors a^{ctx}_t, a^{abs}_t and their corresponding context vectors c^{ctx}_t, c^{abs}_t at step t. We then aggregated the input context c^*_t from the citing sentence's context, the cited paper's abstract, and the background knowledge by applying a dynamic fusion operation based on modality attention as described in Moon et al. (2018b,a), which selectively attenuates or amplifies each modality based on its importance to the task:

[att_{ctx}; att_{abs}; att_{net}] = \sigma(W_m [c^{ctx}_t; c^{abs}_t; c^{net}_t] + b_m),    (1)

\widetilde{att}_m = \frac{\exp(att_m)}{\sum_{m' \in \{abs, ctx, net\}} \exp(att_{m'})},    (2)

c^*_t = \sum_{m \in \{abs, ctx, net\}} \widetilde{att}_m \, c^m_t,    (3)

where c^{net}_t = [h^n_p; h^n_q] represents the learned background knowledge for papers p and q and is kept constant during all decoding steps t, and [att_{ctx}; att_{abs}; att_{net}] is the attention vector.

To enable our model to copy words from both the citing sentence's context and the cited paper's abstract, we calculated the generation and copy probabilities as follows:

[p_{gen}, p_{copy1}, p_{copy2}] = \mathrm{softmax}(W_{ctx} c^{ctx}_t + W_{abs} c^{abs}_t + W_{net} c^{net}_t + W_{dec} s_t + W_{emb} e(w_{t-1}) + b_{ptr}),    (4)

where p_{gen} is the probability of generating words, p_{copy1} is the probability of copying words from the citing sentence's context, p_{copy2} is the probability of copying words from the cited paper's abstract, s_t represents the hidden state of the decoder at step t, and e(w_{t-1}) indicates the input word embedding. Meanwhile, the context vector c^*_t, which can be seen as an enhanced representation of the source-side information, was concatenated with the decoder state s_t to produce the vocabulary distribution P_{vocab}:

P_{vocab} = \mathrm{softmax}(V'(V [s_t; c^*_t] + b) + b').    (5)

Finally, for each text, we defined an extended vocabulary as the union of the vocabulary and all words appearing in the source text, and calculated the probability distribution over the extended vocabulary to predict words w:

P(w) = p_{gen} P_{vocab}(w) + p_{copy1} \sum_{i: cw_i = w} a^{ctx}_{t,i} + p_{copy2} \sum_{i: aw_i = w} a^{abs}_{t,i}.    (6)
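To make Eqs. (1)-(6) concrete, here is a minimal PyTorch sketch of the modality fusion and the extended-vocabulary copy distribution. Merging the separate projection matrices of Eq. (4) into one linear layer over the concatenated inputs is mathematically equivalent; the batch handling, dimensions, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityFusion(nn.Module):
    """Eqs. (1)-(3): attenuate/amplify each modality, then mix into c*_t."""

    def __init__(self, dim):
        super().__init__()
        self.w_m = nn.Linear(3 * dim, 3)  # W_m and b_m, one logit per modality

    def forward(self, c_ctx, c_abs, c_net):
        att = torch.sigmoid(self.w_m(torch.cat([c_ctx, c_abs, c_net], -1)))  # Eq. (1)
        att = F.softmax(att, dim=-1)                                         # Eq. (2)
        modalities = torch.stack([c_ctx, c_abs, c_net], dim=1)               # (B, 3, dim)
        return (att.unsqueeze(-1) * modalities).sum(dim=1)                   # Eq. (3), c*_t

class PointerSwitch(nn.Module):
    """Eq. (4): one linear layer over the concatenation replaces the separate
    W_ctx, W_abs, W_net, W_dec, W_emb projections (equivalent formulation)."""

    def __init__(self, dim, emb_dim):
        super().__init__()
        self.proj = nn.Linear(4 * dim + emb_dim, 3)

    def forward(self, c_ctx, c_abs, c_net, s_t, e_prev):
        logits = self.proj(torch.cat([c_ctx, c_abs, c_net, s_t, e_prev], -1))
        return F.softmax(logits, dim=-1)  # columns: p_gen, p_copy1, p_copy2

def extended_vocab_distribution(p, p_vocab, a_ctx, a_abs, ctx_ids, abs_ids, ext_size):
    """Eq. (6): mix generation and the two copy distributions over the
    extended vocabulary (fixed vocabulary plus all source words)."""
    p_gen, p_copy1, p_copy2 = p[:, 0:1], p[:, 1:2], p[:, 2:3]
    dist = torch.zeros(p_vocab.size(0), ext_size)
    dist[:, :p_vocab.size(1)] = p_gen * p_vocab
    dist.scatter_add_(1, ctx_ids, p_copy1 * a_ctx)  # copy from citing context
    dist.scatter_add_(1, abs_ids, p_copy2 * a_abs)  # copy from cited abstract
    return dist
```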
4.2.2 Salience Estimation

The estimated salience of each sentence that occurs in a cited paper's abstract was used to identify what information needed to be concentrated on for the generation. We assumed a sentence's salience to depend on the citing paper, such that the same sentences from one cited paper can have different salience in the context of different citing papers. Hence, we represented this salience as a conditional probability P(s_i | D_{src}), which can be interpreted as the probability of picking sentence s_i from a cited paper's abstract given the citing paper D_{src}.

We first obtained the document representation d_{src} of a citing paper as the average of all its abstract's sentence representations. Then, for calculating salience, which is defined as P(s_i | D_{src}), we designed an attention mechanism that assigns a weight α_i to each sentence s_i in a cited paper's abstract D_{tgt}. This weight is expected to be large if the semantics of s_i are similar to d_{src}. Formally, we have:

\alpha_i = v^T \tanh(W_{doc} d_{src} + W_{sent} h^{as}_i + b_{sal}),    (7)

\tilde{\alpha}_i = \frac{\alpha_i}{\sum_{s_k \in D_{tgt}} \alpha_k},    (8)

where h^{as}_i is the i-th sentence representation in the cited paper's abstract, v, W_{doc}, W_{sent}, and b_{sal} are learnable parameters, and \tilde{\alpha}_i is the salience score of the sentence s_i.

We then used the estimated salience of sentences in the cited paper's abstract to update the word-level attention of the cited paper's abstract {h^{aw}_j}, so that the decoder can focus on these important sentences during text generation. Considering that the estimated salience \tilde{\alpha}_i is a sentence weight, we let each token in a sentence share the same value of \tilde{\alpha}_i. Accordingly, the new attention of the cited paper's abstract became \hat{a}^{abs}_t = \tilde{\alpha}_i a^{abs}_t. After normalizing \hat{a}^{abs}_t, the context vector c^{abs}_t was updated accordingly.

4.3 Model Training

During model training, the objective of our framework covers three parts: generation loss, salience estimation loss, and citation function classification.

4.3.1 Generation Loss

The generation loss was based on the prediction of words from the decoder. We minimized the negative log-likelihood of all target words w^*_t and used it as the objective function of generation:

L_{gen} = - \sum_t \log P(w^*_t).    (9)

4.3.2 Salience Estimation Loss

To include extra supervision into the salience estimation, we adopted a ROUGE-based approximation (Yasunaga et al., 2017) as the target. We assume citing sentences to depend heavily on salient sentences from the cited papers' abstracts. Based on this premise, we calculated the ROUGE scores between the citing sentence and the sentences in the corresponding cited paper's abstract to obtain an approximation of the salience distribution as the ground truth. If a sentence shared a high ROUGE score with the citing sentence, this sentence would be considered a salient sentence, because the citing sentence was likely to be generated based on this sentence, while a low ROUGE score implied that this sentence may be ignored during the generation process due to its low salience. Kullback-Leibler divergence was used as our loss function for enforcing the output salience distribution to be close to the normalized ROUGE score distribution of sentences in the cited paper's abstract:

L_{sal} = D_{KL}(R \| \tilde{\alpha}),    (10)

R_i = \frac{\beta \, r(s_i)}{\sum_{s_k \in D_{tgt}} \beta \, r(s_k)},    (11)

where \tilde{\alpha}, R \in \mathbb{R}^m, R_i refers to the scalar indexed i in R (1 ≤ i ≤ m), and r(s_i) is the average of the ROUGE-1 and ROUGE-2 F1 scores between the sentence s_i in the cited paper's abstract and the citing sentence. We also introduced a hyper-parameter β as a constant rescaling factor to sharpen the distribution.
4.3.3 Citation Function Classification

We added a supplementary component to enable the citation function classification to be trained with the generator, aiming to make the generation conscious of why to cite. Following a prior general pipeline of citation function classification (Cohan et al., 2019; Zhao et al., 2019), we first concatenated the last hidden state s_T of the decoder, which we considered as a representation of the generated citing sentence, with the document representation d_{ctx} of the citing sentence's context. Here, d_{ctx} was calculated as the average of its sentence representations. We then fed the concatenated representation into an MLP, followed by the softmax function, to predict the probability of the citation function \hat{y}_{func} for the generated citing sentence. Cross-entropy loss was set as the objective function for training the classifier with the ground-truth label y_{func}, which is a one-hot vector:

L_{func} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y^i_{func}(j) \log \hat{y}^i_{func}(j),    (12)

where N refers to the size of the training data and K is the number of different citation functions.

Finally, all aforementioned losses were combined as the training objective of the whole framework:

J(\theta) = L_{gen} + \lambda_S L_{sal} + \lambda_F L_{func},    (13)

where \lambda_S and \lambda_F are hyper-parameters to balance these losses.
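In code, the joint objective of Eqs. (12)-(13) reduces to a few lines. The λ defaults below are placeholders, since the paper does not report the values used.

```python
import torch.nn.functional as F

def citation_function_loss(logits, labels):
    """Eq. (12): cross-entropy over the K = 4 citation functions, averaged
    over the batch; expects integer class indices rather than one-hot vectors."""
    return F.cross_entropy(logits, labels)

def joint_objective(l_gen, l_sal, l_func, lambda_s=1.0, lambda_f=1.0):
    """Eq. (13): J(theta) = L_gen + lambda_S * L_sal + lambda_F * L_func."""
    return l_gen + lambda_s * l_sal + lambda_f * l_func
```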
efficacy of the three main components in our frame- Additionally, we report results from using sev- work: (1) we removed the node features (papers) eral extractive methods that have been used for that are output from the graph encoder to test the summarization tasks3 , including: effectiveness of background knowledge; (2) we re- • LexRank (Erkan and Radev, 2004): is an un- moved the predicted salience of sentences in the supervised graph-based method for computing rel- abstracts of cited papers to assess the effectiveness ative importance of extractive summarization. of one part of content (what to cite); and (3) we • TextRank (Mihalcea and Tarau, 2004): is an removed the training of citation function classifi- unsupervised algorithm where sentence importance cation and only trained the generator to test the 3 We apply extractive methods on the cited paper’s abstract effectiveness of the other part of content (why to to extract one sentence as the citing sentence. cite). As the removal of node features of papers re- 1472
5.2 Experimental Results

As the results in Table 2 show, our proposed framework (BACO) outperformed all of the considered baselines. BACO achieved scores of 32.54 (ROUGE-1), 9.71 (ROUGE-2), and 24.90 (ROUGE-L). We also observed that the extractive methods performed comparatively poorly and notably worse than the abstractive methods. All abstractive methods did better than EXT-Oracle, a result different from performance on other summarization tasks, such as news document summarization. We think that this deviation from prior performance outcomes arises because citing sentences in the domain of scholarly papers contain new expressions when referring to cited papers, which requires high-level summarizing or paraphrasing of cited papers instead of copying sentences verbatim from them. Our results suggest that extractive methods may not be suitable for our task. Among the extractive methods we tested, we observed EXT-Oracle to be superior to the others, which aligns with our expectation of EXT-Oracle serving as an upper bound of extractive methods. For the abstractive methods, our framework achieved about a 2.57 point improvement in ROUGE-2 F1 score compared to PTGEN-Cross. We assume two reasons for this improvement: First, BACO uses richer text features, e.g., what to cite (sentence salience estimation) and why to cite (citation function classification), that provide useful information for this task. Second, we included structural information from the citation network, which might offer supplemental background knowledge about a field that is not explicitly covered by the given cited and citing papers.

Table 2: Experimental results for our framework and comparative models. * indicates our re-implementation.

Models         R-1     R-2    R-L
Extractive
  LexRank      11.96   1.04    9.69
  TextRank     12.35   1.19   10.04
  EXT-Oracle   22.60   4.21   16.83
Abstractive
  PTGEN        24.60   6.16   19.19
  PTGEN-Cross* 27.08   7.14   20.61
  BACO         32.54   9.71   24.90

5.3 Ablation Study

We performed an ablation study to investigate the efficacy of the three main components in our framework: (1) we removed the node features (papers) that are output from the graph encoder to test the effectiveness of the background knowledge; (2) we removed the predicted salience of sentences in the abstracts of cited papers to assess the effectiveness of one part of the content (what to cite); and (3) we removed the training of citation function classification and only trained the generator to test the effectiveness of the other part of the content (why to cite). As the removal of the node features of papers reduces the input to the dynamic fusion operation for the context vector (Equation 1), we changed Equation 2 to a sigmoid function so that the calculated attention becomes a vector of size 2 when combining the context vectors of the citing sentence's context and the cited paper's abstract.

Table 3: Ablation study for different components of our framework. w/o = without, BK = background knowledge, SA = salience estimation, CF = citation function.

Models    R-1     R-2    R-L
BACO      32.54   9.71   24.90
-w/o BK   31.02   8.90   22.46
-w/o SA   31.43   7.51   22.77
-w/o CF   30.84   8.67   23.31

Table 3 presents the results of the ablation study. We observed the ROUGE-2 F1 score to drop by 0.81 after the removal of the node (paper) features. This indicates that considering background knowledge in a structured representation is useful for citing sentence generation. The ROUGE-2 F1 score dropped by 2.20 after disregarding the salience of sentences in the cited paper. This implies that sentence-level salience estimation is beneficial, and it can be used to identify important sentences during the decoding phase so that the decoder can pay higher attention to those sentences. This might also align with how scholars write citing sentences: they focus on specific parts or elements of cited papers, e.g., methods or results, and do not consider all parts equally when writing citing sentences. Lastly, the ROUGE-2 F1 score dropped by 1.04 after the removal of citation function classification, indicating that this feature is also helpful for the text generation task. We conclude that for citing sentence generation, considering and training a model on background knowledge, sentence salience, and citation function improves the performance.

5.4 Case Study

We present an illustrative example generated by our re-implementation of PTGEN-Cross versus by BACO, and compare both to the ground truth (see Appendix, Section A.3). The output from BACO showed a higher overlap with the ground truth, specifically because it included background that is not explicitly covered in the cited paper. Furthermore, our output contained the correct citation function ("... have been shown to be effective"), which was present in the ground truth but missing in PTGEN-Cross's output.

5.5 Human Evaluation

We sampled 50 instances from the generated texts. Three graduate students who are fluent in English and familiar with NLP were asked to rate citing sentences produced by BACO and the re-implemented PTGEN-Cross with respect to four aspects on a 1 (very poor) to 5 (excellent) point scale: fluency (whether a citing sentence is fluent), relevance (whether a citing sentence is relevant to the cited paper's abstract), coherence (whether a citing sentence is coherent within its context), and overall quality. Every instance was scored by the three judges, and we averaged their scores (Table 4).

Table 4: Human evaluation results.

            Gold   BACO   PTGEN-Cross
Fluency     4.91   3.64   3.52
Relevance   4.86   3.07   2.64
Coherence   4.88   2.77   2.61
Overall     4.79   2.95   2.69

Our results showed that citing sentences generated by BACO were generally rated better than the output of PTGEN-Cross (e.g., relevance score: BACO = 3.07; PTGEN-Cross = 2.64). This finding provides further evidence for the effectiveness of the features we used for this task.
6 Conclusions and Future Work

We have brought together multiple pieces of information from and about cited and citing papers to improve citing sentence generation. We integrated them into BACO, a BAckground knowledge- and COntent-based framework for citing sentence generation, which learns and uses information that relates to (1) background knowledge and (2) content. Extensive experimental results suggest that our framework outperforms competitive baseline models.

This work is limited in several ways. We only demonstrated the utility of our model within the standard RNN-based seq2seq framework. Secondly, our citation function scheme only contained valence-based items. Finally, while this method is intended to support scholars in practicing strategic note taking on prior work with respect to a new literature review or research project, we did not evaluate the usefulness or effectiveness of this training option for researchers.

In future work, we plan to investigate the adaptation of our framework to more powerful models such as the Transformer (Vaswani et al., 2017). We also hope to extend our citation function scheme beyond the valence of the citing sentences to more fine-grained categories, such as those outlined in Moravcsik and Murugesan (1975) and Lipetz (1965).

Impact Statement

This work is intended to support scholars in doing research, not to replace or automate any scholarly responsibilities. Finding, reading, understanding, reviewing, reflecting upon, and properly citing literature are key components of the research process and require deep intellectual engagement, which remains a human task. The presented approach is meant to help scholars to see examples for how to strategically synthesize scientific papers relevant to a certain topic or research problem, thereby helping them to cope with information overload (or "research deluge") and honing their scholarly writing skills. Additional professional responsibilities also still apply, such as not violating intellectual property/copyright.

We believe that this work does not present foreseeable negative societal consequences. While not intended, our method may be misused for the automated generation of parts of literature reviews. We strongly discourage this misuse, as it violates basic assumptions about scholarly diligence, responsibilities, and expectations. We advocate for our method to be used as a scientific writing training tool.

References

Amjad Abu-Jbara, Jefferson Ezra, and Dragomir Radev. 2013. Purpose and polarity of citation: Towards NLP-based bibliometrics. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Kimitaka Asatani, Junichiro Mori, Masanao Ochi, and Ichiro Sakata. 2018. Detecting trends in academic research from a citation network using network representation learning. PLoS ONE, 13(5):e0197260.

Awais Athar. 2011. Sentiment analysis of citations using sentence structure-based features. In Proceedings of the ACL 2011 Student Session.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Mark W. Bruner, Karl Erickson, Brian Wilson, and Jean Côté. 2010. An appraisal of athlete development models through citation network analysis. Psychology of Sport and Exercise, 11(2):133–139.

Jingqiang Chen and Hai Zhuge. 2016. Summarization of related work through citations. In 12th International Conference on Semantics, Knowledge and Grids (SKG). IEEE.

Jingqiang Chen and Hai Zhuge. 2019. Automatic generation of related work through summarizing citations. Concurrency and Computation: Practice and Experience, 31(3):e4261.

Kaihua Chen and Jiancheng Guan. 2011. A bibliometric investigation of research performance in emerging nanobiopharmaceuticals. Journal of Informetrics, 5(2):233–247.

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

John Cullars. 1990. Citation characteristics of Italian and Spanish literary monographs. The Library Quarterly, 60(4):337–356.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
Ying Ding, Guo Zhang, Tamy Chambers, Min Song, Xiaolong Wang, and Chengxiang Zhai. 2014. Content-based citation analysis: The next generation of citation analysis. Journal of the Association for Information Science and Technology, 65(9):1820–1833.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Aaron Elkiss, Siwei Shen, Anthony Fader, Güneş Erkan, David States, and Dragomir Radev. 2008. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology, 59(1):51–62.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Cong Duy Vu Hoang and Min-Yen Kan. 2010. Towards automated related work summarization. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING).

Jianhua Hou, Xiucai Yang, and Chaomei Chen. 2018. Emerging trends and new developments in information science: A document co-citation analysis (2009–2016). Scientometrics, 115(2):869–892.

Yue Hu and Xiaojun Wan. 2014. Automatic generation of related work sections in scientific papers: An optimization approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xin Huang, Chang-an Chen, Changhuan Peng, Xudong Wu, Luoyi Fu, and Xinbing Wang. 2018. Topic-sensitive influential paper discovery in citation network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).

Xuerong Li, Han Qiao, and Shouyang Wang. 2017. Exploring evolution and emerging trends in business model study: A co-citation analysis. Scientometrics, 111(2):869–887.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Ben-Ami Lipetz. 1965. Improvement of the selectivity of citation indexes to science literature through inclusion of citation relationship indicators. American Documentation, 16(2):81–90.

Kobra Mansourizadeh and Ummul K. Ahmad. 2011. Citation practices among non-native expert and novice scientific writers. Journal of English for Academic Purposes, 10(3):152–161.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018a. Multimodal named entity disambiguation for noisy social media posts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).

Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018b. Multimodal named entity recognition for short social media posts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Spencer Moore, Alan Shiell, Penelope Hawe, and Valerie A. Haines. 2005. The privileging of communitarian ideas: Citation practices and the translation of social capital into public health research. American Journal of Public Health, 95(8):1330–1337.

Michael J. Moravcsik and Poovanalingam Murugesan. 1975. Some results on the function and quality of citations. Social Studies of Science, 5(1):86–92.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. The ACL Anthology Network corpus. Language Resources and Evaluation, 47(4):919–944.

Horacio Saggion, Alexander Shvets, Àlex Bravo, et al. 2020. Automatic related work section generation: Experiments in scientific document abstracting. Scientometrics, 125(3):3159–3185.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL).

Linda C. Smith. 1981. Citation analysis. Library Trends, 30(3):83–106.

Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006. Automatic classification of citation function. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS).

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations (ICLR).

Yongzhen Wang, Xiaozhong Liu, and Zheng Gao. 2018. Neural related work summarization with a joint context-driven attention mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Howard D. White. 2004. Citation analysis and discourse analysis revisited. Applied Linguistics, 25(1):89–116.

Xinyu Xing, Xiaosheng Fan, and Xiaojun Wan. 2020. Automatic generation of citation texts in scholarly papers: A pilot study. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).

Michihiro Yasunaga, Jungo Kasai, Rui Zhang, Alexander R. Fabbri, Irene Li, Dan Friedman, and Dragomir R. Radev. 2019. ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL).

Hanlin You, Mengjun Li, Keith W. Hipel, Jiang Jiang, Bingfeng Ge, and Hante Duan. 2017. Development trend forecasting for coherent light generator technology based on patent citation network analysis. Scientometrics, 111(1):297–315.

He Zhao, Zhunchen Luo, Chong Feng, Anqing Zheng, and Xiaopeng Liu. 2019. A context-based framework for modeling the role and function of on-line resource citations in scientific literature. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
A Appendices

A.1 Experiments for the Citation Function Labeling Model

To test our citation function labeling model, we applied 10-fold cross-validation to our training dataset with 800 citing sentences. We then tested our trained model on the test data with 400 sentences, which we refer to as the external test set. We set the hidden size of the MLP in our labeling model to 256, and adopted dropout with a rate of 0.2. For the optimizer, Adam (Kingma and Ba, 2015) with a learning rate of 2e-3 was used. The batch size was set to 8. We used the F1 score for evaluating labeling accuracy. Since there was an imbalance among the distributions of the labels, we chose the micro-F1 score specifically. The results are shown in Table 5. After training, we used the trained model to label the rest of the data (84,376 instances) for further training the citing sentence generation model. The final dataset contains 85,576 instances. Following Xing et al. (2020), we used the above-mentioned 400 citing sentences as the test set, and combined the 800 citing sentences with the rest of the model-labelled instances as our training set. The average length of citing sentences in the training and test data is 28.72 words and 26.45 words, respectively.

Table 5: The results of the citation function labeling model for cross-validation (denoted as cross) and on the external test data (denoted as test).

        Precision   Recall   F1-score
cross   95.43       95.43    95.43
test    91.59       91.59    91.59

A.2 Implementation Details

We used pre-trained GloVe vectors (Pennington et al., 2014) to initialize word embeddings with a vector dimension of 300, and followed Veličković et al. (2018) in initializing the node features for the graph encoder as a bag-of-words representation of the paper's abstract. The hidden state size of the LSTM was set to 256 for the encoder and 512 for the decoder. An AdaGrad (Duchi et al., 2011) optimizer was used with a learning rate of 0.15 and an initial accumulator value of 0.1. We picked 64 as the batch size for training. For the rescaling factor β in Equation 11, we chose 40 based on the results reported in Yasunaga et al. (2017). We also used ROUGE scores (Lin, 2004) for quantitative evaluation, and report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L for comparing BACO to alternative models.

A.3 Case Study

We present an example generated by our re-implementation of the baseline PTGEN-Cross and by our framework in Table 6. Note that we used #REFR to mark the citation of the cited paper. The reference signs to other papers are masked as #OTHEREFR. The #CITE in the context indicates the position where the citing sentence should be inserted. The output of our framework has a higher overlap with the ground truth than the output from PTGEN-Cross. Please note that our framework was able to infer that the methods mentioned in the generated citing sentence are "supervised", and we believe that this knowledge was gained from the citation network, where other neighboring cited papers explicitly mentioned "supervised methods". Also, the generated citing sentence from our framework expressed a positive citation function ("... have been shown to be effective"), as did the ground truth, while PTGEN-Cross's output expressed the wrong citation function (neutral). We think the underlying reason for this difference in outputs may be our joint training of citing sentence generation and citation function classification, which forced our framework to recognize the corresponding citation function during generation and further improved the performance.
Table 6: Example output citing sentences.

Citing Paper's Abstract: The state-of-the-art methods used for relation classification are primarily based on statistical machine learning, and their performance strongly depends on the quality of the extracted features. The extracted features are often derived from the output of pre-existing natural language processing (NLP) systems, which leads to the propagation of the errors in the existing tools and hinders the performance of these systems. In this paper, we exploit a convolutional deep neural network (DNN) to extract lexical and sentence level features. Our method takes all of the word tokens as input without complicated pre-processing. First, the word tokens are transformed to vectors by looking up word embeddings. Then, lexical level features are extracted according to the given nouns. Meanwhile, sentence level features are learned using a convolutional approach. These two level features are concatenated to form the final extracted feature vector. Finally, the features are fed into a softmax classifier to predict the relationship between two marked nouns. The experimental results demonstrate that our approach significantly outperforms the state-of-the-art methods.

Cited Paper's Abstract: We present a novel approach to relation extraction, based on the observation that the information required to assert a relationship between two named entities in the same sentence is typically captured by the shortest path between the two entities in the dependency graph. Experiments on extracting top-level relations from the ACE (Automated Content Extraction) newspaper corpus show that the new shortest path dependency kernel outperforms a recent approach based on dependency tree kernels.

Context: The task of relation classification is to predict semantic relations between pairs of nominals and can be defined as follows: given a sentence S with the annotated pairs of nominals e1 and e2, we aim to identify the relations between e1 and e2 #OTHEREFR. There is considerable interest in automatic relation classification, both as an end in itself and as an intermediate step in a variety of NLP applications. #CITE. Supervised approaches are further divided into feature-based methods and kernel-based methods. Feature-based methods use a set of features that are selected after performing textual analysis. They convert these features into symbolic IDs, which are then transformed into a vector using a paradigm that is similar to the bag-of-words model.

Ground Truth: The most representative methods for relation classification use supervised paradigm; such methods have been shown to be effective and yield relatively high performance #OTHEREFR; #REFR.

PTGEN-Cross: There has been a wide body of approaches to predict relation extraction between nominals #OTHEREFR; #REFR.

BACO: Most methods for relation classification are supervised and have been shown to be effective #OTHEREFR; #REFR.