DAGN: Discourse-Aware Graph Network for Logical Reasoning
Yinya Huang1∗  Meng Fang2  Yu Cao3  Liwei Wang4  Xiaodan Liang1†
1 Shenzhen Campus of Sun Yat-sen University  2 Tencent Robotics X
3 School of Computer Science, The University of Sydney  4 The Chinese University of Hong Kong
yinya.huang@hotmail, mfang@tencent.com, ycao8647@uni.sydney.edu.au, lwwang@cse.cuhk.edu.hk, xdliang328@gmail.com

∗ This work was done during Yinya Huang's internship in Tencent with L. Wang and M. Fang.
† Corresponding Author: Xiaodan Liang.

Abstract

Recent QA with logical reasoning questions requires passage-level relations among the sentences. However, current approaches still focus on sentence-level relations interacting among tokens. In this work, we explore aggregating passage-level clues for solving logical reasoning QA by using discourse-based information. We propose a discourse-aware graph network (DAGN) that reasons relying on the discourse structure of the texts. The model encodes discourse information as a graph with elementary discourse units (EDUs) and discourse relations, and learns the discourse-aware features via a graph network for downstream QA tasks. Experiments are conducted on two logical reasoning QA datasets, ReClor and LogiQA, and our proposed DAGN achieves competitive results. The source code is available at https://github.com/Eleanor-H/DAGN.

1 Introduction

A variety of QA datasets have promoted the development of reading comprehension, for instance, SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019), and so on. Recently, QA datasets with more complicated reasoning types, i.e., logical reasoning, have also been introduced, such as ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020). The logical questions are taken from standardized exams such as GMAT and LSAT, and require QA models to read complicated argument passages and identify the logical relationships therein, for example, selecting a correct assumption that supports an argument, or finding a claim that weakens an argument in a passage. Such logical reasoning is beyond the capability of most previous QA models, which focus on reasoning with entities or numerical keywords.

A main challenge for QA models is to uncover the logical structures under passages, such as identifying claims or hypotheses, or pointing out flaws in arguments. To achieve this, the QA models should first be aware of logical units, which can be sentences, clauses, or other meaningful text spans, and then identify the logical relationships between the units. However, the logical structures are usually hidden and difficult to extract, and most datasets do not provide such logical structure annotations.

An intuitive idea for unwrapping such logical information is to use discourse relations. For instance, as a conjunction, "because" indicates a causal relationship, whereas "if" indicates a hypothetical relationship. However, such discourse-based information is seldom considered in logical reasoning tasks. Modeling logical structures is still lacking in logical reasoning tasks, and current publicly available methods rely on contextual pre-trained models (Yu et al., 2020). Besides, previous graph-based methods (Ran et al., 2019; Chen et al., 2020a) that construct entity-based graphs are not suitable for logical reasoning tasks because the reasoning units differ.

In this paper, we propose a new approach to solve logical reasoning QA tasks by incorporating discourse-based information. First, we construct discourse structures. We use discourse relations from the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008) as delimiters to split texts into elementary discourse units (EDUs). A logic graph is constructed in which EDUs are nodes and discourse relations are edges. Then, we propose a Discourse-Aware Graph Network (DAGN) for learning high-level discourse features to represent passages. The discourse features are incorporated with the contextual token features from pre-trained language models. With the enhanced features, DAGN predicts answers to logical questions. Our experiments show that DAGN surpasses current publicly available methods on two recent logical reasoning QA datasets, ReClor and LogiQA.
Figure 1: The architecture of our proposed method with an example below. (1) The discourse-based logic graph, with EDUs E1–E8 from the context and the answer option as nodes. (2) The EDU encoding and graph reasoning modules: token embeddings from a large-scale pre-trained model are pooled into EDU embeddings, which are updated over the graph and added back to the token embeddings. (3) The answer prediction module with a residual BiGRU, pooling, concatenation, and an FFN.
Context: Theoretically, analog systems are superior to digital systems. A signal in a pure analog system can be infinitely detailed, while digital systems cannot produce signals that are more precise than their digital units. With this theoretical advantage there is a practical disadvantage. Since there is no limit on the potential detail of the signal, the duplication of an analog representation allows tiny variations from the original, which are errors.
Question: The statements above, if true, most strongly support which one of the following?
Option: Digital systems are the best information systems because error cannot occur in the emission of digital signals.

Our main contributions are three-fold:

• We propose to construct logic graphs from texts by using discourse relations as edges and elementary discourse units as nodes.

• We obtain discourse features via graph neural networks to facilitate logical reasoning in QA models.

• We show the effectiveness of using the logic graph and feature enhancement by noticeable improvements on two datasets, ReClor and LogiQA.

2 Method

Our intuition is to explicitly use discourse-based information to mimic the human reasoning process for logical reasoning questions. The questions are in multiple-choice format, which means that given a triplet (context, question, answer options), models answer the question by selecting the correct answer option. Our framework is shown in Figure 1. We first construct a discourse-based logic graph from the raw text. Then we conduct reasoning via graph networks to learn and update the discourse-based features, which are incorporated with the contextual token embeddings for downstream answer prediction.

2.1 Graph Construction

Our discourse-based logic graph is constructed via two steps: delimiting text into elementary discourse units (EDUs) and forming the graph using their relations as edges, as illustrated in Figure 1(1).

Discourse Units Delimitation  It has been studied that clause-like text spans delimited by discourse relations can be discourse units that reveal the rhetorical structure of texts (Mann and Thompson, 1988; Prasad et al., 2008). We further observe that such discourse units are essential units in logical reasoning, such as being assumptions or opinions. In the example shown in Figure 1, the "while" in the context indicates a comparison between the attributes of the "pure analog system" and those of "digital systems". The "because" in the option provides the evidence "error cannot occur in the emission of digital signals" for the claim "digital systems are the best information systems".

We use PDTB 2.0 (Prasad et al., 2008) to help draw discourse relations. PDTB 2.0 contains discourse relations that are manually annotated on the 1-million-word Wall Street Journal (WSJ) corpus and are broadly characterized into "Explicit" and "Implicit" connectives. The former appear explicitly in sentences, such as the discourse adverbial "instead" or the subordinating conjunction "because", whereas the latter are inferred by annotators between successive pairs of text spans split by punctuation marks such as "." or ";". We simply take all the "Explicit" connectives as well as common punctuation marks to form our discourse delimiter library (details are given in Appendix A), with which we delimit the texts into EDUs. For each data sample, we segment the context and options, ignoring the question since the question usually does not carry logical content.
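The delimitation step amounts to scanning the text for delimiters and cutting at each occurrence. The following is a minimal sketch of such a splitter, assuming only a small excerpt of the delimiter library in Appendix A; the function name and the returned format are illustrative and are not taken from the released code.

```python
import re

# A small excerpt of the delimiter library in Appendix A; the full library uses
# all PDTB 2.0 "Explicit" connectives plus the punctuation marks '.', ',', ';', ':'.
EXPLICIT_CONNECTIVES = ["because", "while", "since", "if", "although", "therefore"]
PUNCTUATION_MARKS = [".", ",", ";", ":"]

DELIMITER_PATTERN = re.compile(
    "(" + "|".join([r"\b%s\b" % re.escape(c) for c in EXPLICIT_CONNECTIVES]
                   + [re.escape(p) for p in PUNCTUATION_MARKS]) + ")",
    flags=re.IGNORECASE,
)

def split_into_edus(text):
    """Split a passage into EDUs at every connective or punctuation mark.

    Returns a list of (edu, delimiter) pairs, where `delimiter` is the connective
    or punctuation mark separating the EDU from the next one (None for the last).
    """
    edus = []
    for piece in DELIMITER_PATTERN.split(text):
        piece = piece.strip()
        if not piece:
            continue
        if DELIMITER_PATTERN.fullmatch(piece):
            if edus:                               # attach the delimiter to the EDU before it
                edus[-1] = (edus[-1][0], piece)
        else:
            edus.append((piece, None))
    return edus

# The option sentence of Figure 1 is split by "because" into EDU7 and EDU8:
print(split_into_edus("Digital systems are the best information systems "
                      "because error cannot occur in the emission of digital signals."))
```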
Discourse Graph Construction  We define the discourse-based graphs with EDUs as nodes, and with the "Explicit" connectives as well as the punctuation marks as two types of edges. We assume that each connective or punctuation mark connects the EDUs before and after it. For example, the option sentence in Figure 1 is delimited into two EDUs, EDU7 = "digital systems are the best information systems" and EDU8 = "error cannot occur in the emission of digital signals", by the connective r = "because". The returned triplets are then (EDU7, r, EDU8) and (EDU8, r, EDU7). For each data sample with the context and multiple answer options, we separately construct a graph for each option, using the EDUs of the shared context together with those of the single option. The graph for the single option k is denoted by G^k = (V^k, E^k).

2.2 Discourse-Aware Graph Network

We present the Discourse-Aware Graph Network (DAGN), which uses the constructed graph to exploit discourse-based information for answering logical questions. It consists of three main components: an EDU encoding module, a graph reasoning module, and an answer prediction module. The former two are shown in Figure 1(2), whereas the final component is shown in Figure 1(3).

EDU Encoding  An EDU span embedding is obtained from its token embeddings in two steps. First, similar to previous works (Yu et al., 2020; Liu et al., 2020), we encode the input sequence "<s> context </s> question || option </s>" into contextual token embeddings with pre-trained language models, where <s> and </s> are the special tokens of the RoBERTa (Liu et al., 2019) model and || denotes concatenation. Second, given the token embedding sequence {t_1, t_2, ..., t_L}, the n-th EDU embedding is obtained by e_n = Σ_{l∈S_n} t_l, where S_n is the set of token indices belonging to the n-th EDU.

Graph Reasoning  After EDU encoding, DAGN performs reasoning over the discourse graph. Inspired by previous graph-based models (Ran et al., 2019; Chen et al., 2020a), we also learn graph node representations to obtain higher-level features; however, we consider different graph construction and encoding. Specifically, let G^k = (V^k, E^k) denote the graph corresponding to the k-th option in the answer choices. For each node v_i ∈ V^k, the node embedding v_i is initialized with the corresponding EDU embedding e_i. N_i = {j | (v_j, v_i) ∈ E^k} denotes the neighbors of node v_i. W_{r_ji} is the weight matrix for one of the two edge types, where r_E indicates graph edges corresponding to the explicit connectives and r_I indicates graph edges corresponding to the punctuation marks.

The model first calculates a weight α_i for each node with a linear transformation and a sigmoid function, α_i = σ(W_α v_i + b_α), then conducts message propagation with these weights:

    ṽ_i = (1 / |N_i|) Σ_{j∈N_i} α_j W_{r_ji} v_j,   r_ji ∈ {r_E, r_I},    (1)

where ṽ_i is the message representation of node v_i, and α_j and v_j are the weight and the node embedding of v_j respectively.

After the message propagation, the node representations are updated from the initial node embeddings and the message representations by

    v'_i = ReLU(W_u v_i + ṽ_i + b_u),    (2)

where W_u and b_u are a weight matrix and a bias respectively. The updated node representations v'_i are used to enhance the contextual token embeddings via summation at the corresponding positions, i.e., t'_l = t_l + v'_n, where l ∈ S_n and S_n is the set of token indices of the n-th EDU.
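The following is a minimal PyTorch-style sketch of one step of Equations (1)–(2). It is not the released implementation: it assumes the EDU embeddings come as a dense (num_edus, dim) tensor and that the two edge types are given as dense 0/1 adjacency matrices; module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    """One message-passing step over the discourse graph, following Eq. (1)-(2)."""

    def __init__(self, dim):
        super().__init__()
        self.node_weight = nn.Linear(dim, 1)               # W_alpha, b_alpha
        self.rel_transform = nn.ModuleDict({               # W_r for the two edge types
            "explicit": nn.Linear(dim, dim, bias=False),
            "punct": nn.Linear(dim, dim, bias=False),
        })
        self.update = nn.Linear(dim, dim)                  # W_u, b_u

    def forward(self, edu_emb, adj):
        # alpha_i = sigmoid(W_alpha v_i + b_alpha)
        alpha = torch.sigmoid(self.node_weight(edu_emb))   # (N, 1)
        # Total number of neighbours per node over both edge types, for the 1/|N_i| factor.
        degree = sum(a.sum(dim=1, keepdim=True) for a in adj.values()).clamp(min=1.0)
        message = torch.zeros_like(edu_emb)
        for rel, a in adj.items():
            # alpha_j * W_r v_j, summed over the neighbours j of each node i.
            message = message + a @ self.rel_transform[rel](alpha * edu_emb)
        message = message / degree                         # Eq. (1)
        return torch.relu(self.update(edu_emb) + message)  # Eq. (2)
```

The updated node vectors would then be added back onto the token embeddings of the corresponding EDU spans (t'_l = t_l + v'_n) before they enter the answer prediction module.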
Answer Prediction  The probabilities of the answer options are obtained by feeding the discourse-enhanced token embeddings into the answer prediction module. The model is trained end-to-end using cross-entropy loss. Specifically, the embedding sequence first goes through a layer normalization (Ba et al., 2016) and then a bidirectional GRU (Cho et al., 2014). The output embeddings are added to the input ones as a residual structure (He et al., 2016), and we obtain the encoded sequence after another layer normalization on the summed embeddings.

We then merge the high-level discourse features and the low-level token features. Specifically, the variable-length encoded context sequence and the question-and-option sequence are each pooled via weighted summation, where the weights are the softmax of a linear transformation of the sequence, resulting in a single feature vector for each. We concatenate these vectors with the "<s>" embedding from the backbone pre-trained model and feed the resulting vector into a two-layer perceptron with a GELU activation (Hendrycks and Gimpel, 2016) to obtain the output features for classification.
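A minimal sketch of such a prediction head is given below. The interface and the half-sized GRU hidden state are simplifying assumptions made here so the residual addition works directly; they are not taken from the released code.

```python
import torch
import torch.nn as nn

class AnswerPrediction(nn.Module):
    """Sketch of the prediction module described above; names are illustrative.

    `tokens`: discourse-enhanced token sequence (batch, seq_len, dim).
    `cls`: the backbone's "<s>" embedding (batch, dim).
    `masks`: a pair of 0/1 tensors (batch, seq_len) marking the context span
    and the question-and-option span.
    """

    def __init__(self, dim):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        # Half-sized hidden state per direction so the BiGRU output keeps `dim`
        # and can be added back to its input (a simplification of Appendix B).
        self.gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.norm_out = nn.LayerNorm(dim)
        self.pool = nn.Linear(dim, 1)      # scores for the weighted summation
        self.mlp = nn.Sequential(nn.Linear(dim * 3, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, tokens, cls, masks):
        hidden, _ = self.gru(self.norm_in(tokens))
        hidden = self.norm_out(hidden + tokens)            # residual structure
        pooled = []
        for mask in masks:                                 # context, question-and-option
            scores = self.pool(hidden).squeeze(-1)
            scores = scores.masked_fill(mask == 0, -1e4)
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
            pooled.append((weights * hidden).sum(dim=1))
        features = torch.cat(pooled + [cls], dim=-1)       # (batch, 3 * dim)
        return self.mlp(features).squeeze(-1)              # one logit per option
```

In the multiple-choice setting, this head would be applied once per (context, question, option) sequence, and the per-option logits compared with a softmax cross-entropy loss.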
3 Experiments

We evaluate the performance of DAGN on two logical reasoning datasets, ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020), and conduct an ablation study on the graph construction and the graph network. The implementation details are given in Appendix B.

3.1 Datasets

ReClor contains 6,138 questions modified from standardized tests such as GMAT and LSAT, which are split into train / dev / test sets with 4,638 / 500 / 1,000 samples respectively. The training set and the development set are available. The test set is blind and held out, and is split into an EASY subset and a HARD subset according to the performance of a BERT-base model (Devlin et al., 2019). The test results are obtained by submitting the test predictions to the leaderboard. LogiQA consists of 8,678 questions that are collected from the National Civil Servants Examinations of China and manually translated into English by professionals. The dataset is randomly split into train / dev / test sets with 7,376 / 651 / 651 samples respectively. Both datasets contain multiple logical reasoning types.

3.2 Results

The experimental results are shown in Tables 1 and 2. Since there is no publicly available method for either dataset, we compare DAGN with the baseline models. For DAGN, we fine-tune RoBERTa-Large as the backbone. DAGN (Aug) is a variant that augments the graph features.

Table 1: Experimental results (accuracy %) of DAGN compared with baseline models on the ReClor dataset. Test-E = Test-EASY, Test-H = Test-HARD. The baseline results are taken from the ReClor paper. DAGN ranked 1st on the public ReClor leaderboard (https://bit.ly/2UOQfaS) until 17th Nov. 2020, before the NAACL submission; several better results have since appeared on the leaderboard but are not publicly available.

Methods          Dev    Test   Test-E  Test-H
BERT-Large       53.80  49.80  72.00   32.30
XLNet-Large      62.00  56.00  75.70   40.50
RoBERTa-Large    62.60  55.60  75.50   40.00
DAGN             65.20  58.20  76.14   44.11
DAGN (Aug)       65.80  58.30  75.91   44.46

Table 2: Experimental results (accuracy %) of DAGN compared with baseline models on the LogiQA dataset.

Methods          Dev    Test
BERT-Large       34.10  31.03
RoBERTa-Large    35.02  35.33
DAGN             35.48  38.71
DAGN (Aug)       36.87  39.32

DAGN reaches 58.20% test accuracy on ReClor. DAGN (Aug) reaches 58.30% overall, with 75.91% on the EASY subset and 44.46% on the HARD subset. Compared with RoBERTa-Large, the improvement on the HARD subset is a remarkable 4.46%. This indicates that the incorporated discourse-based information supplements the shortcomings of the baseline model, and that the discourse features are beneficial for such logical reasoning. Besides, DAGN and DAGN (Aug) also outperform the baseline models on LogiQA, especially showing a 4.01% improvement over RoBERTa-Large on the test set.

3.3 Ablation Study

We conduct an ablation study on the graph construction details as well as the graph reasoning module. The results are reported in Table 3.

Table 3: Ablation study results (accuracy %) on the ReClor development set.

Methods                          Dev
DAGN                             65.20
Ablation on nodes
DAGN - clause nodes              64.40
DAGN - sentence nodes            64.40
Ablation on edges
DAGN - single edge type          64.80
DAGN - fully connected edges     61.60
Ablation on graph reasoning
DAGN w/o graph module            64.00

Varied Graph Nodes  We first use clauses or sentences in place of EDUs as graph nodes. For clause nodes, we simply remove the "Explicit" connectives during discourse unit delimitation, so that the texts are delimited only by punctuation marks. For sentence nodes, we further reduce the delimiter library to solely the period ("."). Using the modified graphs with clause nodes or coarser sentence nodes, the accuracy of DAGN drops to 64.40%. This indicates that clause or sentence nodes carry less discourse information and act poorly as logical reasoning units.
Varied Graph Edges  We make two changes to the edges: (1) modifying the edge types, and (2) modifying the edge linking. For the edge types, all edges are regarded as a single type. For the edge linking, we ignore discourse relations and connect every pair of nodes, turning the graph into a fully connected one. The resulting accuracies drop to 64.80% and 61.60% respectively. This shows that in the graph we build, the edges link EDUs in reasonable manners that properly reflect the logical relations.

Ablation on Graph Reasoning  We remove the graph module from DAGN for comparison. This model only contains an extra prediction module on top of the baseline. Its performance on the ReClor dev set lies between the baseline model and DAGN. Therefore, although the prediction module benefits the accuracy, removing graph reasoning leads to the absence of discourse features and degrades the performance. This demonstrates the necessity of the discourse-based structure in logical reasoning.

4 Related Works

Recent datasets for reading comprehension tend to be more complicated and require models' capability of reasoning. For instance, HotpotQA (Yang et al., 2018), WikiHop (Welbl et al., 2018), OpenBookQA (Mihaylov et al., 2018), and MultiRC (Khashabi et al., 2018) require the models to perform multi-hop reasoning. DROP (Dua et al., 2019) and MA-TACO (Zhou et al., 2019) need the models to perform numerical reasoning. WIQA (Tandon et al., 2019) and CosmosQA (Huang et al., 2019) require causal reasoning, in which the models should understand counterfactual hypotheses or find cause-effect relationships in events. The logical reasoning datasets (Yu et al., 2020; Liu et al., 2020), however, require the models to have the logical reasoning capability of uncovering the inner logic of texts.

Deep neural networks are used for reasoning-driven reading comprehension. Evidence-based methods (Madaan et al., 2020; Huang et al., 2020; Rajagopal et al., 2020) generate explainable evidence from a given context as the backup of reasoning. Graph-based methods (Qiu et al., 2019; De Cao et al., 2019; Cao et al., 2019; Ran et al., 2019; Chen et al., 2020b; Xu et al., 2020b; Zhang et al., 2020) explicitly model the reasoning process with constructed graphs, then learn and update features through message passing on the graphs. There are also other methods such as neuro-symbolic models (Saha et al., 2021) and adversarial training (Pereira et al., 2020). Our paper uses a graph-based model; however, for uncovering logical relations, the graph nodes and edges are customized with discourse information.

Discourse information provides a high-level understanding of texts and hence is beneficial for many natural language tasks, for instance, text summarization (Cohan et al., 2018; Joty et al., 2019; Xu et al., 2020a; Feng et al., 2020), neural machine translation (Voita et al., 2018), and coherent text generation (Wang et al., 2020; Bosselut et al., 2018). There are also discourse-based applications for reading comprehension. DISCERN (Gao et al., 2020) segments texts into EDUs and learns interactive EDU features. Mihaylov and Frank (2019) provide additional discourse-based annotations and encode them with discourse-aware self-attention models. Unlike previous works, DAGN first uses discourse relations as graph edges connecting the EDUs of texts, then learns the discourse features via message passing with graph neural networks.

5 Conclusion

In this paper, we introduce a Discourse-Aware Graph Network (DAGN) for addressing logical reasoning QA tasks. We first treat elementary discourse units (EDUs), which are split by discourse relations, as basic reasoning units. We then build discourse-based logic graphs with EDUs as nodes and discourse relations as edges. DAGN learns the discourse-based features and enhances them with contextual token embeddings. DAGN reaches competitive performance on two recent logical reasoning datasets, ReClor and LogiQA.

Acknowledgements

The authors would like to thank Wenge Liu, Jianheng Tang, Guanlin Li and Wei Wang for their support and useful discussions. This work was supported in part by National Natural Science Foundation of China (NSFC) under Grant No.U19A2073 and No.61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No.2019B1515120039, Shenzhen Basic Research Project (Project No. JCYJ20190807154211365), Zhijiang Lab's Open Fund (No. 2020AA3AB14) and CSIG Young Fellow Support Fund.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. stat, 1050:21.

Antoine Bosselut, Asli Celikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. Discourse-aware neural rewards for coherent text generation. In Proceedings of NAACL-HLT 2018, pages 173–184.

Yu Cao, Meng Fang, and Dacheng Tao. 2019. BAG: Bi-directional attention entity graph convolutional network for multi-hop reasoning question answering. In Proceedings of NAACL-HLT 2019, pages 357–362.

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020a. Question directed graph attention network for numerical reasoning over text. In Proceedings of EMNLP 2020, pages 6759–6768.

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020b. Question directed graph attention network for numerical reasoning over text. In Proceedings of EMNLP 2020, pages 6759–6768.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of NAACL-HLT 2018, pages 615–621.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of NAACL-HLT 2019, pages 2306–2317.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL-HLT 2019, pages 2368–2378.

Xiachong Feng, Xiaocheng Feng, Bing Qin, Xinwei Geng, and Ting Liu. 2020. Dialogue discourse-aware graph convolutional networks for abstractive meeting summarization. arXiv preprint arXiv:2012.03502.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C. H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of EMNLP 2020, pages 2439–2449.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of EMNLP-IJCNLP 2019, pages 2391–2401.

Yinya Huang, Meng Fang, Xunlin Zhan, Qingxing Cao, Xiaodan Liang, and Liang Lin. 2020. REM-Net: Recursive erasure memory network for commonsense evidence refinement. arXiv preprint arXiv:2012.13185.

Shafiq Joty, Giuseppe Carenini, Raymond Ng, and Gabriel Murray. 2019. Discourse analysis and its applications. In Proceedings of ACL 2019: Tutorial Abstracts, pages 12–17.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL-HLT 2018, pages 252–262.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In IJCAI 2020.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Aman Madaan, Dheeraj Rajagopal, Yiming Yang, Abhilasha Ravichander, Eduard Hovy, and Shrimai Prabhumoye. 2020. EIGEN: Event influence generation using pre-trained language models. arXiv preprint arXiv:2010.11764.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP 2018, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2019. Discourse-aware semantic self-attention for narrative reading comprehension. In Proceedings of EMNLP-IJCNLP 2019, pages 2541–2552.

Lis Pereira, Xiaodong Liu, Fei Cheng, Masayuki Asahara, and Ichiro Kobayashi. 2020. Adversarial training for commonsense inference. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 55–60.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In LREC.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of ACL 2019, pages 6140–6150.

Dheeraj Rajagopal, Niket Tandon, Peter Clark, Bhavana Dalvi, and Eduard Hovy. 2020. What-if I ask you to explain: Explaining the effects of perturbations in procedural text. In Findings of EMNLP 2020, pages 3345–3355.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, pages 2383–2392.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of EMNLP-IJCNLP 2019, pages 2474–2484.

Amrita Saha, Shafiq Joty, and Steven C. H. Hoi. 2021. Weakly supervised neuro-symbolic module networks for numerical reasoning. arXiv preprint arXiv:2101.11802.

Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. WIQA: A dataset for "what if..." reasoning over procedural text. In Proceedings of EMNLP-IJCNLP 2019, pages 6078–6087.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of ACL 2018, pages 1264–1274.

Wei Wang, Piji Li, and Hai-Tao Zheng. 2020. Consistency and coherency enhanced story generation. arXiv preprint arXiv:2010.08822.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020a. Discourse-aware neural extractive text summarization. In Proceedings of ACL 2020, pages 5021–5031.

Yunqiu Xu, Meng Fang, Ling Chen, Yali Du, Joey Tianyi Zhou, and Chengqi Zhang. 2020b. Deep reinforcement learning with stacked hierarchical attention for text-based games. In Advances in Neural Information Processing Systems, volume 33, pages 16495–16507.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP 2018, pages 2369–2380.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In ICLR 2020.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of ACL 2020, pages 3928–3937.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of EMNLP-IJCNLP 2019, pages 3363–3369.

A Discourse Delimiter Library

Our discourse delimiter library consists of two parts: the "Explicit" connectives annotated in the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008), and a set of punctuation marks. The overall discourse delimiters used in our method are presented in Table 4.

Table 4: The discourse delimiter library in our implementation.

Explicit Connectives:
'once', 'although', 'though', 'but', 'because', 'nevertheless', 'before', 'for example', 'until', 'if', 'previously', 'when', 'and', 'so', 'then', 'while', 'as long as', 'however', 'also', 'after', 'separately', 'still', 'so that', 'or', 'moreover', 'in addition', 'instead', 'on the other hand', 'as', 'for instance', 'nonetheless', 'unless', 'meanwhile', 'yet', 'since', 'rather', 'in fact', 'indeed', 'later', 'ultimately', 'as a result', 'either or', 'therefore', 'in turn', 'thus', 'in particular', 'further', 'afterward', 'next', 'similarly', 'besides', 'if and when', 'nor', 'alternatively', 'whereas', 'overall', 'by comparison', 'till', 'in contrast', 'finally', 'otherwise', 'as if', 'thereby', 'now that', 'before and after', 'additionally', 'meantime', 'by contrast', 'if then', 'likewise', 'in the end', 'regardless', 'thereafter', 'earlier', 'in other words', 'as soon as', 'except', 'in short', 'neither nor', 'furthermore', 'lest', 'as though', 'specifically', 'conversely', 'consequently', 'as well', 'much as', 'plus', 'and', 'hence', 'by then', 'accordingly', 'on the contrary', 'simultaneously', 'for', 'in sum', 'when and if', 'insofar as', 'else', 'as an alternative', 'on the one hand on the other hand'

Punctuation Marks:
'.', ',', ';', ':'

B Implementation Details

We fine-tune RoBERTa-Large (Liu et al., 2019) as the backbone pre-trained language model for DAGN; it contains 24 hidden layers with hidden size 1024. The overall model is trained end-to-end and updated by the Adam (Kingma and Ba, 2015) optimizer with an overall learning rate of 5e-6 and a weight decay of 0.01. The overall dropout rate is 0.1. The maximum sequence length is 256.
We tune the model on the dev set to obtain the best number of graph reasoning iteration steps, which is 2 for ReClor and 3 for LogiQA. The model is trained for 10 epochs with a batch size of 16 on an Nvidia Tesla V100 GPU. For the answer prediction module, the hidden size of the GRU is the same as that of the token embeddings in the pre-trained language model, which is 1024. The two-layer perceptron first projects the concatenated vector of hidden size 1024 × 3 to 1024, then projects 1024 to 1.
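Gathered into one place, the reported settings look roughly as follows. This is only an illustrative summary of this appendix; the field names are chosen for readability and do not correspond to the released configuration files.

```python
# Hyperparameters reported in Appendix B, collected for reference.
# Field names are illustrative and do not come from the released code.
DAGN_CONFIG = {
    "backbone": "roberta-large",            # 24 layers, hidden size 1024
    "optimizer": "Adam",
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "dropout": 0.1,
    "max_sequence_length": 256,
    "graph_reasoning_steps": {"ReClor": 2, "LogiQA": 3},
    "epochs": 10,
    "batch_size": 16,
    "gru_hidden_size": 1024,
    "mlp_projection": [1024 * 3, 1024, 1],  # two-layer perceptron in the prediction module
}
```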