W-RST: Towards a Weighted RST-style Discourse Framework
Patrick Huber*, Wen Xiao*, Giuseppe Carenini
Department of Computer Science, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4
{huberpat, xiaowen3, carenini}@cs.ubc.ca

* Equal contribution.

arXiv:2106.02658v1 [cs.CL] 4 Jun 2021

Abstract

Aiming for a better integration of data-driven and linguistically-inspired approaches, we explore whether RST nuclearity, assigning a binary assessment of importance between text segments, can be replaced by automatically generated, real-valued scores, in what we call a Weighted-RST framework. In particular, we find that weighted discourse trees from auxiliary tasks can benefit key NLP downstream applications compared to nuclearity-centered approaches. We further show that real-valued importance distributions partially and interestingly align with the assessment and uncertainty of human annotators.

1 Introduction

Ideally, research in Natural Language Processing (NLP) should balance and integrate findings from machine learning approaches with insights and theories from linguistics. With the enormous success of data-driven approaches over the last decades, this balance has arguably and excessively shifted, with linguistic theories playing a less and less critical role. Even more importantly, only few attempts have been made to improve such theories in light of recent empirical results.

In the context of discourse, two main theories have emerged in the past: the Rhetorical Structure Theory (RST) (Carlson et al., 2002) and PDTB (Prasad et al., 2008). In this paper, we focus on RST, exploring whether the underlying theory can be refined in a data-driven manner.

In general, RST postulates a complete discourse tree for a given document. To obtain this formal representation as a projective constituency tree, a given document is first separated into so-called Elementary Discourse Units (EDUs for short), representing clause-like sentence fragments of the input document. Afterwards, the discourse tree is built by hierarchically aggregating EDUs into larger constituents annotated with an importance indicator (in RST called nuclearity) and a relation holding between siblings in the aggregation. The nuclearity attribute in RST thereby assigns each sub-tree either a nucleus attribute, indicating central importance of the sub-tree in the context of the document, or a satellite attribute, categorizing the sub-tree as of peripheral importance. The relation attribute further characterizes the connection between sub-trees (e.g., Elaboration, Cause, Contradiction).

One central requirement of the RST discourse theory, as for all linguistic theories, is that a trained human should be able to specify and interpret the discourse representations. While this is a clear advantage when trying to generate explainable outcomes, it also introduces problematic, human-centered simplifications, the most radical of which is arguably the nuclearity attribute, indicating the importance among siblings.

Intuitively, such a coarse (binary) importance assessment cannot represent nuanced differences in sub-tree importance, which can potentially be critical for downstream tasks. For instance, the importance of two nuclei siblings is rather intuitive to interpret. However, having siblings annotated as "nucleus-satellite" or "satellite-nucleus" leaves open the question of how much more important the nucleus sub-tree is compared to the satellite, as shown in Figure 1. In general, it is unclear (and unlikely) that the actual importance distributions between siblings with the same nuclearity attribution are consistent.

Based on this observation, we investigate the potential of replacing the binary nuclearity assessment postulated by RST with automatically generated, real-valued importance scores in a new, Weighted-RST framework.
In contrast with previous work that has assumed RST and developed computational models of discourse by simply applying machine learning methods to RST-annotated treebanks (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Joty et al., 2015; Li et al., 2016; Wang et al., 2017; Yu et al., 2018), we rely on very recent empirical studies showing that weighted "silver-standard" discourse trees can be inferred from auxiliary tasks such as sentiment analysis (Huber and Carenini, 2020b) and summarization (Xiao et al., 2021).

In our evaluation, we assess both computational benefits and linguistic insights. In particular, we find that automatically generated, weighted discourse trees can benefit key NLP downstream tasks. We further show that real-valued importance scores (at least partially) align with human annotations and can interestingly also capture uncertainty in human annotators, implying some alignment of the importance distributions with linguistic ambiguity.

Figure 1: Document wsj_0639 from the RST-DT corpus with inconsistent importance differences between N-S attributions. (The top-level satellite is clearly more central to the overall context than the lower-level satellite. However, both are similarly assigned the satellite attribution by at least one annotator.) Top relation: Annotator 1: N-S, Annotator 2: N-N.

2 Related Work

First introduced by Mann and Thompson (1988), the Rhetorical Structure Theory (RST) has been one of the primary guiding theories for discourse analysis (Carlson et al., 2002; Subba and Di Eugenio, 2009; Zeldes, 2017; Gessler et al., 2019; Liu and Zeldes, 2019), discourse parsing (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Joty et al., 2015; Li et al., 2016; Wang et al., 2017; Yu et al., 2018), and text planning (Torrance, 2015; Gatt and Krahmer, 2018; Guz and Carenini, 2020). The RST framework thereby comprehensively describes the organization of a document, guided by the author's communicative goals, encompassing three components: (1) A projective constituency tree structure, often referred to as the tree span. (2) A nuclearity attribute, assigned to every internal node of the discourse tree, encoding relative importance between the node's sub-trees, with the nucleus expressing primary importance and a satellite signifying supplementary sub-trees. (3) A relation attribute for every internal node, describing the relationship between the sub-trees of a node (e.g., Contrast, Evidence, Contradiction).

Arguably, the weakest aspect of an RST representation is the nuclearity assessment, which makes a too coarse differentiation between primary and secondary importance of sub-trees. However, despite its binary assignment of importance, and even though the nuclearity attribute is only one of three components of an RST tree, it has major implications for many downstream tasks, as already shown early on by Marcu (1999), who used the nuclearity attribute as the key signal in extractive summarization. Further work in sentiment analysis (Bhatia et al., 2015) also showed the importance of nuclearity for the task by first converting the constituency tree into a dependency tree (more aligned with the nuclearity attribute) and then using that tree to predict sentiment more accurately. Both of these results indicate that nuclearity, even in the coarse RST version, already contains valuable information. Hence, we believe that this coarse-grained classification is reasonable when manually annotating discourse, but we see it as a major point of improvement if a more fine-grained assessment could be correctly assigned. We therefore explore the potential of assigning a weighted nuclearity attribute in this paper.

While plenty of studies have highlighted the important role of discourse for real-world downstream tasks, including summarization (Gerani et al., 2014; Xu et al., 2020; Xiao et al., 2020), sentiment analysis (Bhatia et al., 2015; Hogenboom et al., 2015; Nejat et al., 2017) and text classification (Ji and Smith, 2017), more critical to our approach is very recent work exploring such a connection in the opposite direction. In Huber and Carenini (2020b), we exploit sentiment-related information to generate "silver-standard" nuclearity-annotated discourse trees, showing their potential on the domain-transfer discourse parsing task. Crucially for our purposes, this approach internally generates real-valued importance weights for trees. For the task of extractive summarization, we follow our intuition given in Xiao et al. (2020) and Xiao et al. (2021), exploiting the connection between summarization and discourse. In particular, in Xiao et al. (2021), we demonstrate that the self-attention matrix learned during the training of a transformer-based summarizer captures valid aspects of constituency and dependency discourse trees.

To summarize, building on our previous work on creating discourse trees through distant supervision, we take a first step towards generating weighted discourse trees from the sentiment analysis and summarization tasks.
Figure 2: Three phases of our approach to generate weighted RST-style discourse trees. Left and center steps are described in section 3, the right component is described in section 4. † = as in Huber and Carenini (2020b), ‡ = as in Marcu (1999), * = the sentiment prediction component is a linear combination, mapping the aggregated embedding to the sentiment output; the linear combination has been previously learned on the training portion of the dataset.

3 W-RST Treebank Generation

Given the intuition from above, we combine information from machine learning approaches with insights from linguistics, replacing the human-centered nuclearity assignment with real-valued weights obtained from the sentiment analysis and summarization tasks[1]. An overview of the process to generate weighted RST-style discourse trees is shown in Figure 2, containing the training phase (left) and the W-RST discourse inference phase (center) described here. The W-RST discourse evaluation (right) is covered in section 4.

3.1 Weighted Trees from Sentiment

To generate weighted discourse trees from sentiment, we slightly modify the publicly available code[2] presented in Huber and Carenini (2020b) by removing the nuclearity discretization component. An overview of our method is shown in Figure 2 (top), while a detailed view is presented in the left and center parts of Figure 3. First (on the left), we train the Multiple Instance Learning (MIL) model proposed by Angelidis and Lapata (2018) on a corpus with document-level sentiment gold-labels, internally annotating each input unit (in our case EDUs) with a sentiment and an attention score. After the MIL model is trained (center), a tuple (s_i, a_i) containing a sentiment score s_i and an attention score a_i is extracted for each EDU i. Based on these tuples representing leaf nodes, the CKY algorithm (Jurafsky and Martin, 2014) is applied to find the tree structure that best aligns with the overall document sentiment, through a bottom-up aggregation approach defined as[3]:

    s_p = (s_l · a_l + s_r · a_r) / (a_l + a_r)        a_p = (a_l + a_r) / 2

with nodes l and r as the left and right child-nodes of p, respectively. The attention scores (a_l, a_r) are here interpreted as the importance weights for the respective sub-trees (w_l = a_l / (a_l + a_r) and w_r = a_r / (a_l + a_r)), resulting in a complete, normalized and weighted discourse structure as required for W-RST. We call the discourse treebank generated with this approach W-RST-Sent.
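To make the aggregation concrete, the following minimal sketch shows a single bottom-up merge step; the function name and the example values are ours for illustration and are not taken from the MEGA-DT code:

    # Minimal sketch of one bottom-up merge following the equations above:
    # each node carries a sentiment score s and an attention score a; merging
    # the children (s_l, a_l) and (s_r, a_r) yields the parent tuple plus the
    # normalized W-RST importance weights.

    def merge(s_l, a_l, s_r, a_r):
        s_p = (s_l * a_l + s_r * a_r) / (a_l + a_r)   # attention-weighted sentiment
        a_p = (a_l + a_r) / 2.0                       # averaged attention
        w_l = a_l / (a_l + a_r)                       # normalized importance
        w_r = a_r / (a_l + a_r)                       # weights, w_l + w_r == 1
        return s_p, a_p, (w_l, w_r)

    # Two EDUs with (sentiment, attention) tuples (0.8, 0.6) and (-0.2, 0.2):
    s_p, a_p, (w_l, w_r) = merge(0.8, 0.6, -0.2, 0.2)
    # s_p == 0.55, a_p == 0.4, w_l == 0.75, w_r == 0.25

During the CKY search, this merge is applied to every candidate pair of adjacent constituents, and the structure whose root sentiment s_p best matches the document-level gold label is retained.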
[1] Please note that both tasks use binarized discourse trees, as commonly used in computational models of RST.
[2] Code available at https://github.com/nlpat/MEGA-DT
[3] Equations taken from Huber and Carenini (2020b).
[4] Code available at https://github.com/Wendy-Xiao/summ_guided_disco_parser
Figure 3: Three phases of our approach. Left/Center: Detailed view of the generation of weighted RST-style discourse trees using the sentiment analysis downstream task. Right: Sentiment discourse application evaluation.

Figure 4: Three phases of our approach. Left/Center: Detailed view of the generation of weighted RST-style discourse trees using the summarization downstream task. Right: Summarization discourse application evaluation.

3.2 Weighted Trees from Summarization

In order to derive weighted discourse trees from a summarization model, we follow Xiao et al. (2021)[4], generating weighted discourse trees from the self-attention matrices of a transformer-based summarization model. An overview of our method is shown in Figure 2 (bottom), while a detailed view is presented in the left and center parts of Figure 4. We start by training a transformer-based extractive summarization model (left), containing three components: (1) a pre-trained BERT EDU encoder generating EDU embeddings, (2) a standard transformer architecture as proposed in Vaswani et al. (2017) and (3) a final classifier, mapping the outputs of the transformer to a probability score for each EDU, indicating whether the EDU should be part of the extractive summary.

With the trained transformer model, we then extract the self-attention matrix A and build a discourse tree in bottom-up fashion (as shown in the center of Figure 4). Specifically, the self-attention matrix A reflects the relationships between units in the document, where entry A_ij measures how much the i-th EDU relies on the j-th EDU. Given this information, we generate an unlabeled constituency tree using the CKY algorithm (Jurafsky and Martin, 2014), optimizing the overall tree score, as previously done in Xiao et al. (2021).

In terms of weight assignment, given a sub-tree spanning EDUs i to j, split into child-constituents at EDU k, the value max(A_{i:k, (k+1):j}), representing the maximal attention value that any EDU in the left constituent pays to an EDU in the right child-constituent, reflects how much the left sub-tree relies on the right sub-tree, while max(A_{(k+1):j, i:k}) defines how much the right sub-tree depends on the left. We define the importance weights of the left (w_l) and right (w_r) sub-trees as:

    w_l = max(A_{(k+1):j, i:k}) / Z        w_r = max(A_{i:k, (k+1):j}) / Z

with Z = max(A_{(k+1):j, i:k}) + max(A_{i:k, (k+1):j}) normalizing the two weights to sum to one. In this way, the importance scores of the two sub-trees represent a real-valued distribution. In combination with the unlabeled structure computation, we generate a weighted discourse tree for each document. We call the discourse treebank generated with the summarization downstream information W-RST-Summ.
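As an illustration of this weight assignment, the sketch below re-implements the block-max rule under our own naming (it is not the released summ_guided_disco_parser code); it assumes a numpy array A where A[x, y] measures how much EDU x relies on EDU y, with inclusive, 0-indexed span boundaries:

    import numpy as np

    def split_weights(A, i, k, j):
        """W-RST weights for the EDU span i..j (inclusive) split after EDU k.
        Illustrative re-implementation of the block-max rule above."""
        # How much the right sub-tree depends on the left: rows (k+1)..j, cols i..k
        left_score = A[k + 1:j + 1, i:k + 1].max()
        # How much the left sub-tree relies on the right: rows i..k, cols (k+1)..j
        right_score = A[i:k + 1, k + 1:j + 1].max()
        z = left_score + right_score      # normalize so that w_l + w_r == 1
        return left_score / z, right_score / z

    # Toy example with 4 EDUs: weights for splitting the span 0..3 after EDU 1
    A = np.random.rand(4, 4)
    w_l, w_r = split_weights(A, i=0, k=1, j=3)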
4 W-RST Discourse Evaluation

To assess the potential of W-RST, we consider two evaluation scenarios (Figure 2, right): (1) apply weighted discourse trees to the tasks of sentiment analysis and summarization and (2) analyze the weight alignment with human annotations.

4.1 Weight-based Discourse Applications

In this evaluation scenario, we address the question of whether W-RST trees can support downstream tasks better than traditional RST trees with nuclearity. Specifically, we leverage the discourse trees learned from sentiment for the sentiment analysis task itself and, similarly, rely on the discourse trees learned from summarization to benefit the summarization task.

4.1.1 Sentiment Analysis

In order to predict the sentiment of a document in W-RST-Sent based on its weighted discourse tree, we need to introduce an additional source of information to be aggregated according to such a tree. Here, we choose word embeddings, commonly used as an initial transformation in many models tackling the sentiment prediction task (Kim, 2014; Tai et al., 2015; Yang et al., 2016; Adhikari et al., 2019; Huber and Carenini, 2020a). To avoid introducing additional confounding factors through sophisticated tree aggregation approaches (e.g. TreeLSTMs (Tai et al., 2015)), we select a simple method, aiming to directly compare the inferred tree structures and allowing us to better assess the performance differences originating from the weight/nuclearity attribution (see right step in Figure 3). More specifically, we start by computing the average word embedding for each leaf node (here containing a single EDU) in the discourse tree. Following Figure 2, these leaf embeddings are then aggregated bottom-up along the discourse tree, with each child's embedding weighted by its importance weight, and the aggregated document embedding is mapped to the sentiment output through a linear combination previously learned on the training portion of the dataset.

4.1.2 Summarization

To evaluate the weighted discourse trees in W-RST-Summ on the summarization task, we employ the unsupervised summarization method by Marcu (1999). We choose this straightforward algorithm over more elaborate and hyper-parameter-heavy approaches to avoid confounding factors, since our aim is to evaluate solely the potential of the weighted discourse trees compared to standard RST-style annotations. In the original algorithm, a summary is computed based on the nuclearity attribute by recursively computing the importance scores for all units as:

    S_n(u, N) = d_N,                           if u ∈ Prom(N)
    S_n(u, N) = S(u, C(N)) s.t. u ∈ C(N),      otherwise

where C(N) represents the child of N, and Prom(N) is the promotion set of node N, which is defined in bottom-up fashion as follows: (1) Prom of a leaf node is the leaf node itself. (2) Prom of an internal node is the union of the promotion sets of its nucleus children. Furthermore, d_N represents the level of node N in the tree, so that a unit receives the score of the highest node whose promotion set contains it.
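A compact sketch of this nuclearity-based recursion is given below (our own rendering on binarized trees; the Node layout and function names are hypothetical, not taken from any released implementation). Each unit ends up scored by the d_N of the highest node that promotes it; in the W-RST variant, the real-valued importance weights would take the place of this binary promotion decision:

    # Sketch of Marcu (1999)-style importance scoring on a binarized
    # nuclearity tree; node layout and names are illustrative.

    class Node:
        def __init__(self, left=None, right=None, nuclearity=None, edu=None):
            self.left, self.right = left, right
            self.nuclearity = nuclearity   # "NS", "SN" or "NN" (None for leaves)
            self.edu = edu                 # EDU id for leaves

    def score_units(node, depth, scores):
        """Score each EDU with the d_N of the highest node that promotes it."""
        if node.edu is not None:           # Prom(leaf) = {leaf}
            scores.setdefault(node.edu, depth)
            return {node.edu}
        prom_l = score_units(node.left, depth - 1, scores)
        prom_r = score_units(node.right, depth - 1, scores)
        prom = set()                       # union of the nucleus children's sets
        if node.nuclearity in ("NS", "NN"):
            prom |= prom_l
        if node.nuclearity in ("SN", "NN"):
            prom |= prom_r
        for u in prom:                     # promotion lifts a unit's score to d_N
            scores[u] = max(scores[u], depth)
        return prom

    # Toy tree: ( (EDU0 [N], EDU1 [S]) [N], EDU2 [N] )
    root = Node(left=Node(Node(edu=0), Node(edu=1), "NS"),
                right=Node(edu=2), nuclearity="NN")
    scores = {}
    score_units(root, depth=2, scores=scores)
    # scores == {0: 2, 1: 0, 2: 2}: EDU0 and EDU2 are promoted to the root,
    # while the satellite EDU1 keeps its leaf-level score.

The summary is then formed from the highest-scoring units in natural order (top-6 EDUs in our experiments, see section 5.1).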
4.1.3 Nuclearity-attributed Baselines

To compare weighted trees against standard RST-style treebanks, we need to specify how close two weights have to be for a nucleus-nucleus configuration to apply. Formally, we set a threshold t as follows:

    If   |w_l − w_r| < t   →  nucleus-nucleus
    Else if  w_l > w_r     →  nucleus-satellite
    Else (w_l ≤ w_r)       →  satellite-nucleus

This way, RST-style treebanks with nuclearity attributions can be generated from W-RST-Sent and W-RST-Summ and used for the sentiment analysis and summarization downstream tasks. For the nuclearity-attributed baseline of the sentiment task, we use a similar approach as for the W-RST evaluation procedure, but assign two distinct weights w_n and w_s to the nucleus and satellite child, respectively. Since it is not clear how much more important a nucleus node is compared to a satellite using the traditional RST notation, we define the two weights based on the threshold t as:

    w_n = 1 − (1 − 2t)/4        w_s = (1 − 2t)/4

The intuition behind this formulation is that for a high threshold t (e.g. 0.8), the nuclearity needs to be very prominent (the difference between the normalized weights needs to exceed 0.8), making the nucleus clearly more important than the satellite, while for a small threshold (e.g. 0.1), even relatively balanced weights (for example w_l = 0.56, w_r = 0.44) will be assigned as "nucleus-satellite", making the potential difference in importance of the siblings less eminent. For the nuclearity-attributed baseline for summarization, we apply the algorithm by Marcu (1999) described above to the trees discretized in the same manner, thereby avoiding any positional bias in the data (e.g. the lead bias), which would confound the results.

4.2 Weight Alignment with Human Annotation

As our second W-RST discourse evaluation task, we investigate if the real-valued importance weights align with human annotations. To be able to explore this scenario, we generate weighted tree annotations for an existing discourse treebank (RST-DT (Carlson et al., 2002)). In this evaluation task we verify: (1) whether the nucleus in a gold annotation generally receives more weight than a satellite (i.e. if importance weights generally favour nuclei over satellites) and, similarly, if nucleus-nucleus relations receive more balanced weights; (2) in accordance with Figure 1, how well the weights capture the extent to which a relation is dominated by the nucleus. Here, our intuition is that for inconsistent human nuclearity annotations the spread should generally be lower than for consistent annotations, assuming that human misalignment in the discourse annotation indicates ambivalence about the importance of sub-trees.

Figure 5: Three phases of our approach. Left: Generation of W-RST-Sent/Summ discourse trees. Right: Linguistic evaluation.

To test for these two properties, we use discourse documents individually annotated by two human annotators and analyze each sub-tree within the doubly-annotated documents with consistent inter-annotator structure assessment for its nuclearity assignment. For each of the 6 possible inter-annotator nuclearity assessments, consisting of 3 consistent annotation classes (namely N-N/N-N, N-S/N-S and S-N/S-N) and 3 inconsistent annotation classes (namely N-N/N-S, N-N/S-N and N-S/S-N), we explore the respective weight distribution of the document annotated with the two W-RST tasks – sentiment analysis and summarization (see Figure 5). We compute an average spread s_c for each of the 6 inter-annotator nuclearity assessment classes c as the mean signed difference between the left and right sub-tree weights over all sub-trees j in the class:

    s_c = ( Σ_{j ∈ c} (w_l^j − w_r^j) ) / |c|
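The discretization rule, the baseline weights and the average spread fit in a few lines; the sketch below is our own restatement (not code from the paper), and it takes the spread of a sub-tree to be the signed difference w_l − w_r, which is consistent with the values reported in Table 2:

    def weights_to_nuclearity(w_l, w_r, t):
        """Discretize normalized weights (w_l + w_r == 1) into RST nuclearity."""
        if abs(w_l - w_r) < t:
            return "nucleus-nucleus"
        return "nucleus-satellite" if w_l > w_r else "satellite-nucleus"

    def baseline_weights(t):
        """Nucleus/satellite weights (w_n, w_s) of the nuclearity baseline."""
        w_s = (1 - 2 * t) / 4
        return 1 - w_s, w_s

    def average_spread(pairs):
        """Average spread s_c of one annotation class, given (w_l, w_r) pairs."""
        return sum(w_l - w_r for w_l, w_r in pairs) / len(pairs)

    # The example from above: with t = 0.1 the pair (0.56, 0.44) is discretized
    # as nucleus-satellite, while with t = 0.8 it falls below the threshold.
    assert weights_to_nuclearity(0.56, 0.44, t=0.1) == "nucleus-satellite"
    assert weights_to_nuclearity(0.56, 0.44, t=0.8) == "nucleus-nucleus"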
5 Experiments

5.1 Experimental Setup

Sentiment Analysis: We follow our previous approach in Huber and Carenini (2020b) for the model training and W-RST discourse inference steps (left and center in Figure 3), using the adapted MILNet model from Angelidis and Lapata (2018), trained with a batch size of 200 and 100 neurons in a single-layer bi-directional GRU with 20% dropout for 25 epochs. Next, discourse trees are generated using the best-performing heuristic CKY method with the stochastic exploration-exploitation trade-off from Huber and Carenini (2020b) (beam size 10, linearly decreasing τ). As word embeddings in the W-RST discourse evaluation (right in Figure 3), we use GloVe embeddings (Pennington et al., 2014), which previous work (Tai et al., 2015; Huber and Carenini, 2020a) indicates to be suitable for aggregation in discourse processing. For training and evaluation of the sentiment analysis task, we use the 5-class Yelp'13 review dataset (Tang et al., 2015). To compare our approach against the traditional RST approach with nuclearity, we explore the impact of 11 distinct thresholds for the baseline described in §4.1.3, ranging from 0 to 1 in 0.1 intervals.

Summarization: To be consistent with RST, our summarizer extracts EDUs instead of sentences from a given document. The model is trained on the EDU-segmented CNNDM dataset containing EDU-level Oracle labels published by Xu et al. (2020). We further use a pre-trained BERT-base ("uncased") model to generate the embeddings of EDUs. The transformer used is the standard model with 6 layers and 8 heads in each layer (d = 512). We train the extractive summarizer on the training set of the CNNDM corpus (Nallapati et al., 2016) and pick the best attention head using the RST-DT dataset (Carlson et al., 2002) as the development set. We test the trees by running the summarization algorithm of Marcu (1999) on the test set of the CNNDM dataset, and select the top-6 EDUs based on the importance score to form a summary in natural order. Regarding the baseline model using thresholds, we apply the same 11 thresholds as for the sentiment analysis task.

Weight Alignment with Human Annotation: As discussed in §4.2, this evaluation requires two parallel human-generated discourse trees for every document. Luckily, in the RST-DT corpus published by Carlson et al. (2002), 53 of the 385 documents annotated with full RST-style discourse trees are doubly tagged by a second linguist. We use the 53 documents containing 1,354 consistent structure annotations between the two analysts to evaluate the linguistic alignment of our generated W-RST documents with human discourse interpretations. Out of the 1,354 structure-aligned sub-trees, in 1,139 cases both annotators agreed on the nuclearity attribute, while 215 times a nuclearity mismatch appeared, as shown in detail in Table 1.

Figure 6: Top: Sentiment analysis accuracy of the W-RST model compared to the standard RST framework with different thresholds (x-axis: threshold from 0.0 to 1.0; left y-axis: accuracy; right y-axis: nucleus ratio). Bottom: Average ROUGE score (ROUGE-1, -2 and -L) of the W-RST summarization model compared to different thresholds (left y-axis: average ROUGE; right y-axis: nucleus ratio). Full numerical results are shown in Appendix A.

           N-N    N-S    S-N
    N-N    273     99     41
    N-S     -     694     75
    S-N     -      -     172

Table 1: Statistics on consistently and inconsistently annotated samples of the 1,354 structure-aligned sub-trees generated by two distinct human annotators.
5.2 Results and Analysis

The results of the experiments on the discourse applications for sentiment analysis and summarization are shown in Figure 6.
The results for sentiment analysis (top) and summarization (bottom) thereby show a similar trend: with an increasing threshold, and therefore a larger number of N-N relations (shown as grey bars in the figure), the standard RST baseline (blue line) consistently improves on the respective performance measure of both tasks. However, after reaching the best performance at a threshold of 0.8 for sentiment analysis and 0.6 for summarization, the performance starts to deteriorate. This general trend seems reasonable, given that N-N relations represent a rather frequent nuclearity connection; however, classifying every connection as N-N leads to a severe loss of information. Furthermore, the performance suggests that while the N-N class is important in both cases, the optimal threshold varies depending on the task and potentially also the corpus used, making further task-specific fine-tuning steps mandatory. The weighted discourse trees following our W-RST approach, on the other hand, do not require the definition of a threshold, resulting in a single, promising performance (red line) for both tasks in Figure 6.

For comparison, we apply the generated trees of a standard RST-style discourse parser (here the Two-Stage parser by Wang et al. (2017)), trained on the RST-DT dataset (Carlson et al., 2002), to both downstream tasks. The fully-supervised parser reaches an accuracy of 44.77% for sentiment analysis and an average ROUGE score of 26.28 for summarization. While the average ROUGE score of the fully-supervised parser is above the performance of our W-RST results for the summarization task, the accuracy on the sentiment analysis task is well below our approach. We believe that these results are a direct indication of the problematic domain adaptation of fully-supervised discourse parsers: the application to a similar domain (Wall Street Journal articles vs. CNN-Daily Mail articles) leads to superior performance compared to our distantly supervised method; however, with larger domain shifts (Wall Street Journal articles vs. Yelp customer reviews), the performance drops significantly, allowing our distantly supervised model to outperform the supervised discourse trees on the downstream task. Arguably, this indicates that although our weighted approach is still not competitive with fully-supervised models in the same domain, it is the most promising solution available for cross-domain discourse parsing.

With respect to the weight alignment with human annotations, we show a set of confusion matrices based on human annotation for each W-RST discourse generation task, on the absolute and relative weight spread, in Tables 2 and 3, respectively. The results for the sentiment analysis task are shown at the top of both tables, the results for the summarization task at the bottom.

    Sentiment      N-N            N-S            S-N
    N-N            -0.228 (106)   -0.238 (33)    -0.240 (19)
    N-S            -              -0.038 (325)   -0.044 (22)
    S-N            -              -              -0.278 (115)

    Summarization  N-N            N-S            S-N
    N-N            0.572 (136)    0.604 (42)     0.506 (25)
    N-S            -              0.713 (418)    0.518 (36)
    S-N            -              -              0.616 (134)

Table 2: Confusion matrices based on human annotation showing the absolute weight spread using the sentiment (top) and summarization (bottom) tasks on 620 and 791 sub-trees aligned with the human structure prediction, respectively. Cell value: absolute weight spread for the respective combination of human-annotated nuclearities. Value in brackets: support for this configuration.

    Sentiment      N-N      N-S      S-N
    N-N            -0.36    -0.43    -0.45
    N-S            -        +1.00    +0.96
    S-N            -        -        -0.72

    Summarization  N-N      N-S      S-N
    N-N            -0.13    +0.13    -0.66
    N-S            -        +1.00    -0.56
    S-N            -        -        +0.22

Table 3: Confusion matrices based on human annotation showing the weight spread relative to the task average for sentiment (top) and summarization (bottom), aligned with the human structure prediction. Cell value: relative weight spread, computed as the divergence from the average spread ∅ across all cells in Table 2 (in the original figure, color encoded positive/negative divergence).

For instance, the top right cell of the upper confusion matrix in Table 2 shows that for 19 sub-trees in the doubly-annotated subset of RST-DT, one annotator labelled the sub-tree with a nucleus-nucleus nuclearity attribution, while the second annotator identified it as satellite-nucleus. The average weight spread (see §4.2) for those 19 sub-trees is −0.24. Regarding Table 3, we subtract the average spread across Table 2, defined as ∅ = ( Σ_{c_i ∈ C} c_i ) / |C| (with C = {c_1, c_2, ..., c_6} containing the cell values in the upper triangular matrix), from each cell value c_i and normalize by max = max_{c_i ∈ C}(|c_i − ∅|); here, ∅ = −0.177 and max = 0.1396 for the top table. Accordingly, the −0.24 in the top right cell is transformed into (−0.24 − ∅)/max = −0.45.
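This normalization can be verified directly from the published Table 2 values; the following snippet (our own numpy re-computation, not code from the paper) reproduces the top table of Table 3:

    import numpy as np

    # Absolute weight spreads from Table 2 (sentiment, upper triangle, row-wise):
    # N-N/N-N, N-N/N-S, N-N/S-N, N-S/N-S, N-S/S-N, S-N/S-N
    cells = np.array([-0.228, -0.238, -0.240, -0.038, -0.044, -0.278])

    avg = cells.mean()                  # -0.1777..., reported as -0.177
    mx = np.abs(cells - avg).max()      #  0.1397..., reported as  0.1396
    relative = (cells - avg) / mx
    # -> approximately [-0.36, -0.43, -0.45, 1.00, 0.96, -0.72],
    #    matching Table 3 (top)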
Moving to the analysis of the results, we find the following trends in this experiment: (1) As presented in Table 2, the sentiment analysis task tends to strongly over-predict S-N (i.e., w_l < w_r), leading to exclusively negative spreads, while the summarization task tends to strongly over-predict N-S (i.e., w_l > w_r), leading to exclusively positive spreads. We believe both trends are consistent with the intrinsic properties of the tasks, given that the general structure of reviews tends to become more important towards the end of a review (leading to increased S-N assignments), while for summarization, the lead bias potentially produces the overall strong nucleus-satellite trend. (2) To investigate the relative weight spreads for different human annotations (i.e., between cells) beyond the trends shown in Table 2, we normalize values within a table by subtracting the average and scaling to [−1, 1]. As a result, Table 3 shows the relative weight spread for different human annotations. Apart from the general trends described in Table 2, the consistently annotated samples of the two linguists (along the diagonal of the confusion matrices) align reasonably: the most positive weight spread is consistently found in the agreed-upon nucleus-satellite case, while the nucleus-nucleus annotation has, as expected, the lowest divergence (i.e., closest to zero) along the diagonal in Table 3.
(3) Regarding the inconsistently annotated samples (shown in the triangular matrix above the diagonal), it becomes clear that in the sentiment analysis model the values for the N-N/N-S and N-N/S-N annotated samples (top row in Table 3) are relatively close to the average value. This indicates that, similar to the nucleus-nucleus case, the weights are also ambivalent, with the N-N/N-S value (top center) slightly larger than the value for N-N/S-N (top right). The N-S/S-N case for the sentiment analysis model is less aligned with our intuition, showing a strongly positive relative weight spread (i.e. w_l > w_r relative to the task average), where we would have expected a more ambivalent result with w_l ≈ w_r (however, this is aligned with the overall trend shown in Table 2). For summarization, we see a very similar trend for the N-N/N-S and N-N/S-N annotated samples. Again, both values are close to the average, with the N-N/N-S cell showing a more positive spread than N-N/S-N. However, for summarization, the consistent satellite-nucleus annotation (bottom right cell) seems misaligned with the rest of the table, following instead the general trend for summarization described in Table 2. All in all, the results suggest that the values in most cells are well aligned with what we would expect regarding the relative spread. Interestingly, human uncertainty, reflected in inconsistent annotations, thereby tends to coincide with more ambivalent weights for the respective sub-trees.

6 Conclusion and Future Work

We propose W-RST as a new discourse framework, where the binary nuclearity assessment postulated by RST is replaced with more expressive weights that can be automatically generated from auxiliary tasks. A series of experiments indicate that W-RST is beneficial to the two key NLP downstream tasks of sentiment analysis and summarization. Further, we show that W-RST trees interestingly align with the uncertainty of human annotations.

In the future, we plan to develop a neural discourse parser that learns to predict importance weights instead of nuclearity attributions when trained on large W-RST treebanks. Longer term, we want to explore other aspects of RST that can be refined in light of empirical results, plan to integrate our results into state-of-the-art sentiment analysis and summarization approaches (e.g. Xu et al. (2020)), and generate parallel W-RST structures in a multi-task manner to improve the generality of the discourse trees.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. This research was supported by the Language & Speech Innovation Lab of Cloud BU, Huawei Technologies Co., Ltd and the Natural Sciences and Engineering Research Council of Canada (NSERC). Nous remercions le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG) de son soutien.
References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4046–4051.

Stefanos Angelidis and Mirella Lapata. 2018. Multiple instance learning networks for fine-grained sentiment analysis. Transactions of the Association for Computational Linguistics, 6:17–31.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2212–2218.

Lynn Carlson, Mary Ellen Okurowski, and Daniel Marcu. 2002. RST discourse treebank. Linguistic Data Consortium, University of Pennsylvania.

Vanessa Wei Feng and Graeme Hirst. 2014. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 511–521.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1602–1613.

Luke Gessler, Yang Janet Liu, and Amir Zeldes. 2019. A discourse signal annotation system for RST trees. In Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019, pages 56–61.

Grigorii Guz and Giuseppe Carenini. 2020. Towards domain-independent text structuring trainable on large discourse treebanks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3141–3152.

Alexander Hogenboom, Flavius Frasincar, Franciska De Jong, and Uzay Kaymak. 2015. Using rhetorical structure in sentiment analysis. Communications of the ACM, 58(7):69–77.

Patrick Huber and Giuseppe Carenini. 2020a. From sentiment annotations to sentiment prediction through discourse augmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 185–197.

Patrick Huber and Giuseppe Carenini. 2020b. MEGA RST discourse treebanks with structure and nuclearity from scalable distant sentiment supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7442–7457.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24.

Yangfeng Ji and Noah A Smith. 2017. Neural discourse structure for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 996–1005.

Shafiq Joty, Giuseppe Carenini, and Raymond T Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3).

Dan Jurafsky and James H Martin. 2014. Speech and Language Processing, volume 3. Pearson London.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Qi Li, Tianshi Li, and Baobao Chang. 2016. Discourse parsing with attention-based hierarchical neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 362–371.

Yang Liu and Amir Zeldes. 2019. Discourse relations and signaling information: Anchoring discourse signals in RST-DT. Proceedings of the Society for Computation in Linguistics, 2(1):314–317.

William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Daniel Marcu. 1999. Discourse trees are good indicators of importance in text. Advances in Automatic Text Summarization, 293:123–136.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics.

Bita Nejat, Giuseppe Carenini, and Raymond Ng. 2017. Exploring joint neural model for sentence level discourse parsing and sentiment analysis. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 289–298.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In LREC.

Rajen Subba and Barbara Di Eugenio. 2009. An effective discourse parser that uses rich linguistic information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 566–574. Association for Computational Linguistics.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.

Mark Torrance. 2015. Understanding planning in text production. Handbook of Writing Research, pages 1682–1690.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010.

Yizhong Wang, Sujian Li, and Houfeng Wang. 2017. A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 184–188.

Wen Xiao, Patrick Huber, and Giuseppe Carenini. 2020. Do we really need that many parameters in transformer for extractive summarization? Discourse can help! In Proceedings of the First Workshop on Computational Approaches to Discourse, pages 124–134.

Wen Xiao, Patrick Huber, and Giuseppe Carenini. 2021. Predicting discourse trees from transformer-based neural summarizers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4139–4152, Online. Association for Computational Linguistics.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Discourse-aware neural extractive text summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5021–5031. Association for Computational Linguistics.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Nan Yu, Meishan Zhang, and Guohong Fu. 2018. Transition-based neural RST parsing with implicit syntax features. In Proceedings of the 27th International Conference on Computational Linguistics, pages 559–570.

Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.

A Numeric Results

The numeric results of our W-RST approach for the sentiment analysis and summarization downstream tasks presented in Figure 6 are shown in Table 4 below, along with the threshold-based approach, as well as the supervised parser.

    Approach      Sentiment     Summarization
                  Accuracy      R-1     R-2     R-L
    Nuclearity with Threshold
    t = 0.0       53.76         28.22   8.58    26.45
    t = 0.1       53.93         28.41   8.69    26.61
    t = 0.2       54.13         28.64   8.85    26.83
    t = 0.3       54.33         28.96   9.08    27.14
    t = 0.4       54.44         29.36   9.34    27.51
    t = 0.5       54.79         29.55   9.50    27.68
    t = 0.6       54.99         29.78   9.65    27.90
    t = 0.7       55.07         29.57   9.45    27.74
    t = 0.8       55.32         29.18   9.08    27.32
    t = 0.9       54.90         28.11   8.29    26.35
    t = 1.0       54.15         26.94   7.60    25.27
    Our Weighted RST Framework
    weighted      54.76         29.70   9.58    27.85
    Supervised Training on RST-DT
    supervised    44.77         34.20   12.77   32.09

Table 4: Results of the W-RST approach compared to threshold-based nuclearity assignments and supervised training on RST-DT.