S&I Reader: Multi-granularity Gated Multi-hop Skimming and Intensive Reading Model for Machine Reading Comprehension
Yong Wang (1), Chong Lei (2), and Duoqian Miao (3)
1. School of Artificial Intelligence, Liangjiang, Chongqing University of Technology, Chongqing 401135, China
2. College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
3. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
Corresponding author: Yong Wang (e-mail: ywang@cqut.edu.cn)
This research was supported by the National Natural Science Foundation of China under Grant 61976158 and Grant 61673301.
DOI: 10.1109/ACCESS.2021.3079165, IEEE Access

ABSTRACT Machine reading comprehension is a challenging task that aims to determine the answer span for a given context and question. Recently developed pre-trained language models have achieved a series of successes on various natural language understanding tasks thanks to their powerful contextual representation ability. However, these pre-trained language models generally lack downstream processing structures for specific tasks, which limits further performance improvement. To address this problem and deepen the model's understanding of the question and context, this paper proposes S&I Reader. On top of the pre-trained model, skimming, intensive reading, and gated mechanism modules are added to simulate how humans read text and filter information. Based on the idea of granular computing, a multi-granularity module that computes context granularity and sequence granularity is added to the model to simulate how humans understand text from words to sentences and from parts to the whole. Compared with previous machine reading comprehension models, our model structure is novel. The skimming module and multi-granularity module proposed in this paper address the problems that previous models ignore the key information of the text and cannot understand the text at multiple granularities. Experiments show that the proposed model is effective on both Chinese and English datasets: it better understands the question and context, gives more accurate answers, and improves performance over the baseline model.

INDEX TERMS Gated mechanism, granular computing, intensive reading, machine reading comprehension, pre-training model, skimming.

I. INTRODUCTION

"Important books must be read over and over again, and every time you read them, you will find it beneficial to open the book."
Jules Renard (1864-1910)

Machine reading comprehension (MRC) is a basic and challenging task in natural language processing. It requires the machine to give the answer while understanding the given question and context. Common machine reading comprehension tasks are divided into cloze test, multiple choice, span extraction, and free answering according to the answer form [1]. With the development of natural language processing and deep learning theories, more and more models for these machine reading comprehension tasks have achieved good results.

In recent years, with the development of popular pre-trained language models with ultra-large-scale parameters, which use large-scale corpora for training [2], the latest state-of-the-art (SOTA) results have been achieved on various natural language processing tasks, including machine reading comprehension. These pre-trained language models, such as the bidirectional Transformer [3] structure of BERT [4] and ALBERT [5], and the generalized autoregressive pre-training model XLNet [6], are used as the model encoder to extract contextual language features of related texts and are fine-tuned in conjunction with downstream processing structures for specific tasks.
With the great success of pre-trained language models, people have focused more attention on the encoder side of the model. People can directly benefit from multiple powerful encoders with similar structures, which has led to a bottleneck in the development of downstream processing structures tailored to specific tasks. However, it is time-consuming and resource-consuming to encode the general knowledge contained in large-scale corpora into language models with ultra-large-scale parameters. Moreover, due to the slow development of language representation encoding technology, the performance of the pre-trained language model is limited. All of this highlights the importance of developing downstream processing structures for specific tasks. Therefore, this paper focuses on the downstream processing structure of the pre-trained model.

Many studies have shown that many models pay attention to the unimportant parts of the text and ignore the important parts [7]. At the same time, this paper also found that previous models still exhibit over-stability: they are susceptible to interference from context sentences that share many words with the question, which indicates that the model matches only literally rather than semantically. This paper selects an example from the Chinese dataset DuReader 2.0 as an explanation, as shown in Table 1.

Table 1. An over-stability MRC example.
Context: The 32nd Summer Olympic Games was originally scheduled to be held in Tokyo, Japan in July 2020. However, in 2020, affected by the global COVID-19 epidemic, after consultation with the International Olympic Committee, the Tokyo Olympic Games was postponed by about one year, and it was tentatively scheduled to officially open on July 23, 2021 and close on August 8, 2021.
Question: Opening time of the 32nd Summer Olympic Games
GroundTruth: July 23, 2021
Over-stability Answer: July 2020

Granular computing (GrC) has been applied in many fields since Zadeh [8],[9] proposed it. Granular computing is an effective solution to structured problems. It simulates the realization of human-centered operations in the presence of multi-faceted data, and it contains a large number of techniques to minimize uncertainty [10]. A recognized feature of artificial intelligence is that people can observe and analyze the same problem from extremely different granularities. Not only can people solve problems in different granular worlds, but they can also quickly jump from one granular world to another. This ability to deal with worlds of different granularities is a powerful manifestation of human problem solving [11]. The granular computing model divides the research object into several layers with different granularities, and the layers are related to each other to form a unified whole [12]. Different granularities indicate different angles and ranges of information. The idea of granular computing helps the model solve the problem from multiple levels and angles, and helps the model understand the relationship between the parts of the text and the whole.

When humans perform reading comprehension tasks, they usually combine reading behaviors such as skimming and intensive reading: first grasp the key information in the question and context by skimming, then further grasp the main idea of the text and filter the important information that matches the question by intensive reading, read repeatedly from the global theme to the partial information of the text, and finally determine the answer to the question. Therefore, this paper proposes S&I Reader, which aims to help the model determine the valid information in the question and context and understand the text in terms of word granularity, context granularity, and sequence granularity. Our model alleviates the problem of incomplete prediction-answer semantics caused by insufficient learning and, to some extent, the over-stability caused by purely literal matching. This paper uses RoBERTa [13] as the encoder of the model and as the baseline model for comparison, to give full play to the advantages of its encoding of contextual language features.

The model contains the following four parts:
1. Skimming Reading Module, which is used to determine the keywords in the question and context and their corresponding related parts, helping the model pay attention to the important content of the text.
2. Intensive Reading Module, which is used to compare the similarity and relevance of each word in the question and context to generate a question-aware context representation. The model repeatedly cycles through the Skimming Reading Module and Intensive Reading Module via the Multi-hop mechanism to simulate the process of humans skimming and reading intensively multiple times.
3. Gated Mechanism, which is used to determine the parts of the context that need to be memorized or forgotten and to update them. This module simulates the human behavior of filtering and memorizing important information.
4. Multi-granularity Module, which is used to help the model understand the relationship between the whole and the parts of the context at multiple levels in terms of word granularity, context granularity, and sequence granularity.
The S&I Reader proposed in this paper is evaluated on DuReader 2.0 [14] and SQuAD v1.1 [15]. Experiments prove that this model is effective on both Chinese and English datasets, and its performance further improves on the basis of the pre-trained model. At the same time, ablation experiments and experimental analysis also show that this model is more effective at grasping the key information of the text and alleviating the over-stability problem.

II. RELATED WORK
In recent years, various machine reading comprehension datasets have been released one after another, such as SQuAD, MS MARCO [16], RACE [17], CNN & Daily Mail [18], and DuReader. This has aroused people's interest in machine reading comprehension research, made it possible to build deep network models for specific tasks [19], and provided a test platform to evaluate MRC models extensively.

In the early stage, the focus of classic reading comprehension models was to study the multiple ways of interaction between the question and the context; the corresponding attention mechanisms increase the model's understanding of the relationship between the question words and the context words before the predicted answer is output. These include GA Reader [20], which uses a one-way attention mechanism; Match-LSTM [21], which uses an LSTM structure with an attention mechanism for information alignment; Bi-DAF [22], which uses a bidirectional attention flow mechanism; R-Net [23], which uses a self-matching attention mechanism to obtain more comprehensive contextual information; J-Net [24], which uses an attention pooling mechanism to filter key information; and QANet [25], which uses depthwise separable convolution [26] and a multi-head self-attention mechanism [3] to simulate local and global information interaction. These end-to-end neural network models have achieved remarkable success.

In recent years, with the accumulation of large-scale corpora, pre-trained language models based on the multi-head self-attention mechanism, such as BERT, RoBERTa, and ALBERT, have demonstrated powerful performance in a number of natural language processing fields. Among them, BERT uses two unsupervised prediction tasks for pre-training: a Masked Language Model to predict the words in the covered part, and a Next Sentence Prediction (NSP) task to understand the relationship between two sentences. RoBERTa changes BERT's static mask mechanism into a dynamic mask mechanism and eliminates the NSP task; at the same time, it uses a larger corpus and larger data batches for pre-training to achieve better results. ALBERT removes BERT's restriction that the word vector dimension and hidden layer dimension must be the same, and it shares parameters across layers on the Encoder side, avoiding the large increase of parameters with model depth. At the same time, ALBERT cancels the NSP task and adds a Sentence Order Prediction (SOP) task to predict the order between sentences. ALBERT maintains good performance while greatly reducing the number of parameters.

The novelty of the model proposed in this paper is that we propose a bidirectional attention over attention mechanism, which overcomes the inability of previous models to focus on key information when establishing the relationship between context and question. The multi-granularity module proposed in this paper helps the model understand the text from the perspective of the whole and the parts, and overcomes the limitation that previous models understand the text only at word granularity. At the same time, our model builds a downstream processing structure for span extraction reading comprehension tasks on top of the pre-trained model and further improves accuracy over the pre-trained model alone.

The main contributions of this paper include the following three aspects.
1. This paper introduces the development process of machine reading comprehension and analyzes the current necessity of developing downstream data processing structures based on pre-trained language models.
2. This paper proposes the S&I Reader model, Bidirectional Attention over Attention, and the Multi-granularity Module. The model can effectively focus on and filter useful information and understand the text at multiple levels and from multiple angles. It addresses the problems of over-focusing on unimportant parts while ignoring important parts, over-stability, and insufficient learning in previous models.
3. The experiments and analysis in this paper show that the model can further improve performance on the basis of the pre-trained model.

III. S&I READER
The model is used for span extraction reading comprehension tasks. The task is defined as follows: given a question Q = {q_1, q_2, ..., q_m} containing m words and a context C = {c_1, c_2, ..., c_n} containing n words, the model extracts a continuous word subsequence A = {c_{s+1}, c_{s+2}, ..., c_{s+l}} from C according to Q and C and outputs it as the predicted answer.

The model architecture is shown in Figure 1. The model is composed of an Encoder and a downstream structure. The downstream structure is mainly composed of the following four parts: Skimming Reading Module, Intensive Reading Module, Gated Mechanism, and Multi-granularity Module. Among them, the Skimming Reading Module and Intensive Reading Module are included in the Multi-hop Mechanism.

FIGURE 1. S&I Reader architecture.
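To make the overall data flow concrete, the sketch below traces the forward pass implied by this architecture. It is a control-flow outline only: the callables encode, skim, intensive, gate, multi_granularity, and predict are hypothetical stand-ins for the modules defined in the following subsections, and the wiring (especially what is fed back on each hop) follows our reading of the text rather than the authors' released code.

```python
def si_reader_forward(encode, skim, intensive, gate, multi_granularity, predict,
                      question_ids, context_ids, num_hops=3):
    """Schematic forward pass of S&I Reader (control flow only).

    All arguments except the token id sequences are stand-in callables for the
    Encoder, Skimming Reading Module, Intensive Reading Module, Gated Mechanism,
    Multi-granularity Module, and Prediction Layer described below.
    """
    # Encoder: contextual representations, split into question part, context part, and [CLS]
    H_Q, H_C, h_cls = encode(question_ids, context_ids)
    G = H_C
    # Multi-hop: alternate skimming and intensive reading several times
    for _ in range(num_hops):
        H_C_key, H_Q_key = skim(G, H_Q)        # bidirectional attention over attention
        G = intensive(H_C_key, H_Q_key)        # query-aware context representation
        H_Q = H_Q_key
    O = gate(G, H_Q)                           # keep / forget and update the context memory
    M = multi_granularity(O, h_cls)            # fuse word, context, and sequence granularity
    return predict(M)                          # start / end probabilities of the answer span
```

A hop count of 3 matches the final configuration reported in the ablation study later in the paper.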
1) ENCODER
The encoder includes Embedding and Interaction, which represent the embedding of the text sequence and the contextual language association features, respectively.

• Embedding
Firstly, the question and context are sent to the Encoder. The words in the text are converted into sub-tokens by WordPiece [27]. For each sub-token, its input embedding is the sum of its token embedding, segment embedding, and position embedding.

Then the embeddings of all sub-tokens in the question and context are spliced together to form one sequence. At the same time, the lengths of the question and context are fixed to m and n respectively: sequences that are too long are truncated, and sequences that are too short are padded with [PAD]. The starting position of each sequence is the special classification indicator [CLS], and the indicator [SEP] is added to the end of the question and the context respectively. Let the output embedding of the sequence be expressed as E = {e_1, e_2, e_3, ..., e_L}, which is sent to the interaction layer to establish contextual language association features.

• Interaction
The output embedding is sent to a multi-layer bidirectional Transformer Encoder. Each layer includes a multi-head self-attention mechanism and a fully connected layer, and both parts include dropout, a residual connection, and layer normalization. As shown in formulas (1)-(5), E^l = {e_1^l, e_2^l, e_3^l, ..., e_L^l} denotes the features of the l-th layer, and the (l+1)-th layer features E^{l+1} are calculated as follows.

\alpha_{h,ij}^{l+1} = \mathrm{softmax}_j\left(\frac{(e_i^{l} W_{Q,h}^{l+1})(e_j^{l} W_{K,h}^{l+1})^\top}{\sqrt{d_k}}\right)   (1)

\bar{h}_i^{l+1} = \sum_{h=1}^{H} W_{O,h}^{l+1}\left(\sum_{j=1}^{L} \alpha_{h,ij}^{l+1}\,(e_j^{l} W_{V,h}^{l+1})\right)   (2)

h^{l+1} = \mathrm{LayerNorm}\left(E^{l} + \mathrm{Dropout}(\bar{h}^{l+1})\right)   (3)

\bar{E}^{l+1} = W_2^{l+1}\,\mathrm{GELU}\left(W_1^{l+1} h^{l+1} + b_1^{l+1}\right) + b_2^{l+1}   (4)

E^{l+1} = \mathrm{LayerNorm}\left(h^{l+1} + \mathrm{Dropout}(\bar{E}^{l+1})\right)   (5)

where h is the index of the attention head, W_{Q,h}^{l+1}, W_{K,h}^{l+1}, W_{V,h}^{l+1} and W_{O,h}^{l+1} are the trainable matrices of the h-th attention head, and W_1^{l+1}, W_2^{l+1}, b_1^{l+1} and b_2^{l+1} are the trainable matrices and biases of the (l+1)-th layer.

We use H = {h_1, h_2, h_3, ..., h_L} to represent the output of the last layer of the Encoder, and H is sent to the Skimming Reading Module.
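For readers who want to map formulas (1)-(5) onto code, the following is a minimal sketch of one such interaction layer in PyTorch. It is not the authors' implementation; in practice these layers come pre-trained inside RoBERTa, and the hyper-parameter values here (hidden size 768, 12 heads, GELU activation, recent PyTorch with batch_first attention) are the usual BERT-base settings, assumed for illustration.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """One Transformer encoder layer as in formulas (1)-(5): multi-head
    self-attention and a position-wise feed-forward block, each wrapped with
    dropout, a residual connection, and layer normalization."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, E, pad_mask=None):
        # (1)-(2): scaled dot-product attention over all heads, then output projection
        h_bar, _ = self.attn(E, E, E, key_padding_mask=pad_mask)
        # (3): residual connection + dropout + layer normalization
        h = self.norm1(E + self.drop(h_bar))
        # (4): two-layer feed-forward transformation
        e_bar = self.ffn(h)
        # (5): second residual connection + dropout + layer normalization
        return self.norm2(h + self.drop(e_bar))
```

Stacking several such layers over the embedding sequence E yields the output H that is consumed by the downstream modules.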
2) SKIMMING READING MODULE
After the output vector of the last layer of the Encoder is obtained, it is split according to the positions of the question and context in the sequence to obtain H_Q = {h_2, h_3, h_4, ..., h_{m+1}} and H_C = {h_{m+3}, h_{m+4}, ..., h_{m+n+2}}. Inspired by attention over attention [28], this paper proposes bidirectional attention over attention. As shown in Figure 2, it aims to help the model determine the keywords in the question and their associated key parts in the context, as well as the keywords in the context and their associated key parts in the question, so that the model perceives the key content in the text. Bidirectional attention over attention is described in Algorithm 1.

FIGURE 2. Skimming Reading Module (Bidirectional Attention over Attention).

Algorithm 1. Calculate keywords and associated key parts in context and question.
Input: Question representation H_Q and context representation H_C encoded by the Encoder.
Output: Key information-aware context representation \tilde{H}_C and key information-aware question representation \tilde{H}_Q.
Step 1: Calculate the word-level similarity matrix S between the question and the context by the trilinear function.
Step 2: Apply softmax to each row and each column of the similarity matrix to obtain S_1 and S_2 respectively.
Step 3: Average S_1 in the context direction to get the question keyword weights w_Q; average S_2 in the question direction to get the context keyword weights w_C.
Step 4: Calculate the weights of the key parts associated with the question and context keywords to get a_C and a_Q respectively.
Step 5: Highlight the key parts of the context and question to obtain the key information-aware context representation \tilde{H}_C and question representation \tilde{H}_Q.

As described in Algorithm 1, firstly, the similarity of each pair of words between the question and the context is calculated by the trilinear function, as shown in formula (6), to obtain the similarity matrix S, S \in \mathbb{R}^{n \times m}, where S_{ij} represents the similarity between the i-th word in the context and the j-th word in the question. Softmax is applied to each row of S to get S_1, as shown in formula (7), to determine which word in the question is closest to each word in the context. Softmax is applied to each column of S to get S_2, as shown in formula (8), to determine which word in the context is most relevant to each word in the question.

S_{ij} = W_S^\top\,[\,c_i\,;\,q_j\,;\,c_i \odot q_j\,]   (6)

S_1 = \mathrm{softmax}_{\rightarrow}(S)   (7)

S_2 = \mathrm{softmax}_{\downarrow}(S)   (8)

where W_S is a trainable matrix, and c_i and q_j denote the representations of the i-th context word and the j-th question word in H_C and H_Q.

w_Q is obtained by averaging S_1 in the context direction, as shown in formula (9), to highlight the keywords in the question. The attention over the context parts associated with the question keywords is then computed to get a_C, as shown in formula (10). As shown in formula (11), the key parts of the context are highlighted and added to the context vector representation to get the key information-aware context representation \tilde{H}_C.

w_Q = \mathrm{mean}_{\downarrow}(S_1)   (9)

a_C = S_2 \cdot w_Q   (10)

\tilde{H}_C = H_C + a_C \odot H_C   (11)

where \odot represents element-wise multiplication.

S_2 is averaged in the direction of the question words to get w_C, as shown in formula (12), to highlight the context keywords. The attention over the question parts associated with the context keywords is then computed to get a_Q, as shown in formula (13). As shown in formula (14), the key parts of the question are highlighted and added to the question word representation to get the key information-aware question representation \tilde{H}_Q.

w_C = \mathrm{mean}_{\rightarrow}(S_2)   (12)

a_Q = S_1^\top \cdot w_C   (13)

\tilde{H}_Q = H_Q + a_Q \odot H_Q   (14)
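A compact PyTorch sketch of this bidirectional attention over attention is given below. It follows formulas (6)-(14) as reconstructed above; the batched trilinear similarity and the exact averaging details are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class BidirectionalAttentionOverAttention(nn.Module):
    """Skimming Reading Module sketch, following formulas (6)-(14)."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w = nn.Linear(3 * d_model, 1, bias=False)   # trilinear weights W_S, formula (6)

    def forward(self, H_C, H_Q):
        # H_C: (batch, n, d) context; H_Q: (batch, m, d) question
        n, m = H_C.size(1), H_Q.size(1)
        c = H_C.unsqueeze(2).expand(-1, -1, m, -1)               # (batch, n, m, d)
        q = H_Q.unsqueeze(1).expand(-1, n, -1, -1)               # (batch, n, m, d)
        S = self.w(torch.cat([c, q, c * q], dim=-1)).squeeze(-1) # (batch, n, m), formula (6)

        S1 = torch.softmax(S, dim=2)   # row-wise softmax over question words, formula (7)
        S2 = torch.softmax(S, dim=1)   # column-wise softmax over context words, formula (8)

        w_Q = S1.mean(dim=1)                                     # question keyword weights, formula (9)
        a_C = torch.bmm(S2, w_Q.unsqueeze(-1))                   # context key parts, formula (10): (batch, n, 1)
        H_C_key = H_C + a_C * H_C                                # formula (11)

        w_C = S2.mean(dim=2)                                     # context keyword weights, formula (12)
        a_Q = torch.bmm(S1.transpose(1, 2), w_C.unsqueeze(-1))   # question key parts, formula (13): (batch, m, 1)
        H_Q_key = H_Q + a_Q * H_Q                                # formula (14)
        return H_C_key, H_Q_key
```

Note that the weighted terms enter as additive residuals in formulas (11) and (14), so this module only emphasizes positions and never erases them; filtering is left to the Gated Mechanism described later.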
3) INTENSIVE READING MODULE
In this module, the \tilde{H}_C and \tilde{H}_Q obtained above are sent to the bidirectional attention flow layer [22] to establish a complete relationship between the question and context and obtain the query-aware context representation.

Firstly, \bar{S}_1 and \bar{S}_2 are obtained from \tilde{H}_C and \tilde{H}_Q by formulas (6)-(8). Context-to-query attention A is obtained by formula (15), and query-to-context attention B is obtained by formula (16). After that, \tilde{H}_C, A, and B are spliced together by formula (17) and sent to a linear layer to obtain the query-aware context representation G.

A = \bar{S}_1 \cdot \tilde{H}_Q   (15)

B = \bar{S}_1 \cdot \bar{S}_2^\top \cdot \tilde{H}_C   (16)

G = W_3\,[\,\tilde{H}_C\,;\,\tilde{H}_C \odot A\,;\,\tilde{H}_C \odot B\,] + b_3   (17)

where W_3 and b_3 are a trainable matrix and bias.

4) MULTI-HOP MECHANISM
In the model, the multi-hop mechanism is used to simulate how humans deepen their understanding of a text by reading it many times. When the model passes through the skimming reading and intensive reading modules multiple times, it constantly adjusts its judgment of the key information and obtains a more comprehensive context representation.

The above-mentioned query-aware context representation G and key information-aware question representation \tilde{H}_Q are re-sent into the Skimming Reading Module and Intensive Reading Module multiple times.

Experiments show that the multi-hop mechanism can improve the performance of the model. At the same time, this paper further tests the effect of the number of hops on model performance in the subsequent ablation experiments.

5) GATED MECHANISM
This paper is inspired by LSTM [29], GRU [30], and Ruminating Reader [31], and a Gated Mechanism is added to the model. It is used to simulate how humans filter and memorize important content after repeated reading while ignoring unimportant content. That is, the above-mentioned query-aware context representation G and key information-aware question representation \tilde{H}_Q are sent to the gated mechanism to allow the model to determine the parts that need to be memorized or forgotten, and they are used to generate an update vector that updates the result of the model's memory.

The model generates the update vector U by formula (18). At the same time, G and \tilde{H}_Q are combined in a linear layer with a sigmoid by formula (19) to produce the memory weight z: when a part of G is more related to the content of the question, the memory weight z approaches 1 and more of that information is retained. The model uses the update vector to update G and obtain the output vector O by formula (20), O \in \mathbb{R}^{n \times d}.

U = \tanh(W_U \cdot G + b_U)   (18)

z = \sigma(W_z\,[\,G\,;\,\tilde{H}_Q\,] + b_z)   (19)

O = z \circ G + (1 - z) \circ U   (20)

where W_U, W_z, b_U and b_z are trainable matrices and biases.

6) MULTI-GRANULARITY MODULE
If a whole is divided into three parts and interpreted as three granules, we immediately obtain a three-way granular computing model [32]. This paper proposes a multi-granularity module so that the model can understand text in terms of word granularity, context granularity, and sequence granularity.

The aforementioned modules use the similarity between the context and question to find key information and establish the association relationship. Therefore, the output vector O obtained before can be regarded as the word granularity, which reflects how the model processes the local information of the text. At the same time, in order to simulate the human behavior of understanding the main idea of a text from the whole, this module computes the context granularity, which represents the global meaning of the context, and the sequence granularity, which represents the global meaning of the sequence composed of the context and question. This lets the model understand the text from word granularity, context granularity, and sequence granularity, and grasp the relationship between the whole text and its parts.

The Multi-granularity Module and the aforementioned modules are parallel processing structures in the model architecture. This module removes the [PAD] positions of the context representation and takes the average to obtain the context granularity vector g_C. The [CLS] identifier in the sequence has the ability to characterize the global sequence, so the corresponding vector in H is taken out as the sequence granularity vector g_S. The three granularities are added and sent to a linear layer to obtain the output vector M of the parallel structure by formulas (21)-(22). It should be pointed out that a question granularity was also introduced in earlier versions of this work; subsequent experiments found that removing the question granularity improves the performance of the model, so only the word granularity, context granularity, and sequence granularity are retained.

\bar{M} = O + g_C + g_S   (21)

M = W_4 \cdot \bar{M} + b_4   (22)

where W_4 and b_4 are a trainable matrix and bias.
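The sketch below puts the Gated Mechanism (formulas (18)-(20)) and the granularity fusion (formulas (21)-(22)) into PyTorch form. Because the text does not spell out how the question representation \tilde{H}_Q enters the gate of formula (19) when its length differs from the context length, the sketch mean-pools it and broadcasts it over context positions; that pooling, and pooling O rather than another context representation for g_C, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedMultiGranularity(nn.Module):
    """Sketch of the Gated Mechanism (18)-(20) and Multi-granularity Module (21)-(22)."""
    def __init__(self, d_model=768):
        super().__init__()
        self.W_u = nn.Linear(d_model, d_model)        # update vector, formula (18)
        self.W_z = nn.Linear(2 * d_model, d_model)    # memory weight, formula (19)
        self.W_4 = nn.Linear(d_model, d_model)        # granularity fusion, formula (22)

    def forward(self, G, H_Q_key, h_cls, pad_mask=None):
        # G: (batch, n, d) query-aware context; H_Q_key: (batch, m, d) question;
        # h_cls: (batch, d) [CLS] vector used as the sequence granularity g_S;
        # pad_mask: (batch, n) True at [PAD] positions.
        U = torch.tanh(self.W_u(G))                                     # formula (18)
        q_pool = H_Q_key.mean(dim=1, keepdim=True).expand_as(G)         # assumed question pooling
        z = torch.sigmoid(self.W_z(torch.cat([G, q_pool], dim=-1)))     # formula (19)
        O = z * G + (1.0 - z) * U                                       # formula (20): word granularity

        if pad_mask is not None:                                        # ignore [PAD] positions
            keep = (~pad_mask).unsqueeze(-1).float()
            g_c = (O * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        else:
            g_c = O.mean(dim=1)                                         # context granularity vector g_C
        M_bar = O + g_c.unsqueeze(1) + h_cls.unsqueeze(1)               # formula (21)
        return self.W_4(M_bar)                                          # formula (22)
```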
7) PREDICTION LAYER
The span extraction reading comprehension task requires extracting a continuous subsequence of the context as the predicted answer. Therefore, the output vector M obtained above is sent to a linear layer, and softmax is used to obtain the probability of each word being the starting position and the ending position of the predicted answer. The continuous subsequence with the highest probability is extracted as the predicted answer, as shown in formulas (23)-(25).

p^s = \mathrm{softmax}(W_s \cdot M + b_s)   (23)

p^e = \mathrm{softmax}(W_e \cdot M + b_e)   (24)

\bar{P}_{ij} = p_i^s + p_j^e   (25)

where W_s, W_e, b_s and b_e are trainable matrices and biases. The model selects the index pair (start, end) with the maximum value in \bar{P} subject to start <= end as the starting and ending positions of the predicted answer.

The loss function of the model during training is shown in formula (26).

\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log p^s(y_i^s) + \log p^e(y_i^e)\right]   (26)

where y_i^s and y_i^e represent the starting and ending positions of the ground truth of the i-th sample, and N is the total number of samples.
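A small PyTorch sketch of this prediction head is shown below. The per-token start/end classifiers, the span-scoring rule, and the loss follow formulas (23)-(26) as reconstructed above; the maximum answer length used when enumerating legal spans is our own addition, a common practical constraint rather than something stated in the paper.

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """Prediction layer sketch, following formulas (23)-(26)."""
    def __init__(self, d_model=768):
        super().__init__()
        self.start = nn.Linear(d_model, 1)   # W_s, b_s
        self.end = nn.Linear(d_model, 1)     # W_e, b_e

    def forward(self, M):
        # M: (batch, n, d) output of the multi-granularity module
        p_s = torch.softmax(self.start(M).squeeze(-1), dim=-1)   # formula (23)
        p_e = torch.softmax(self.end(M).squeeze(-1), dim=-1)     # formula (24)
        return p_s, p_e

    @staticmethod
    def best_span(p_s, p_e, max_answer_len=30):
        # formula (25): score every pair (i, j) with i <= j and keep the best one.
        # The max_answer_len cap is an assumed practical constraint, not from the paper.
        scores = p_s.unsqueeze(2) + p_e.unsqueeze(1)              # (batch, n, n)
        ones = torch.ones_like(scores[0], dtype=torch.bool)
        legal = torch.triu(ones) & ~torch.triu(ones, diagonal=max_answer_len)
        scores = scores.masked_fill(~legal, float("-inf"))
        flat = scores.flatten(1).argmax(dim=-1)
        return flat // scores.size(-1), flat % scores.size(-1)    # start and end indices

    @staticmethod
    def span_loss(p_s, p_e, y_s, y_e, eps=1e-12):
        # formula (26): averaged negative log-likelihood of the gold start / end positions
        nll_s = -torch.log(p_s.gather(1, y_s.unsqueeze(1)).squeeze(1) + eps)
        nll_e = -torch.log(p_e.gather(1, y_e.unsqueeze(1)).squeeze(1) + eps)
        return (nll_s + nll_e).mean()
```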
IV. EXPERIMENTS AND ANALYSIS
1) SETUP
This paper uses the Chinese pre-trained model RoBERTa as the Encoder of the model and uses it as the baseline model. At the same time, the pre-trained models BERT and ALBERT with the same hyper-parameters, as well as ALBERT-Large with larger hyper-parameters, are used for comparative experiments. The model is implemented with TensorFlow 1.12.0 [33] for DuReader 2.0 and with PyTorch 1.0.1 [34] for SQuAD v1.1, and experiments are performed on an NVIDIA GeForce GTX 1080Ti. The hyper-parameters used in the model are shown in Table 2.

Table 2. Model hyper-parameters
Hyper-parameter                  Value
batch size                       4
epoch                            3
max query length (DuReader)      16
max query length (SQuAD)         24
max sequence length              512
learning rate                    3 x 10^-5
doc stride                       384
warmup rate                      0.1
multi-hop                        3

2) DATASET
• DuReader 2.0
DuReader 2.0 [14] is a Chinese machine reading comprehension dataset. The questions and contexts come from Baidu Search and Baidu Zhidao, and the answers are manually annotated. Previous machine reading comprehension datasets, such as SQuAD [15], pay more attention to descriptions of facts and true-or-false questions; DuReader 2.0 adds opinion questions on this basis to better fit people's questioning habits. The DuReader 2.0 training set contains 15K samples and the dev set contains 1.4K samples. Since the test set is not publicly available, this paper evaluates the performance of the model and the other related models on the dev set.

• SQuAD v1.1
SQuAD v1.1 is an English question answering dataset in which each sample is a (context, question, answer) triple. The contexts are derived from 536 Wikipedia articles. Annotators asked questions and provided answers from the context, yielding more than 100,000 question-answer pairs. Unlike CNN & Daily Mail, the answer is no longer a single word or entity but a continuous word sequence in the context, which increases the difficulty of the task. SQuAD v1.1 includes 87.5K training samples, 10.1K dev samples, and 10.1K test samples. This paper evaluates model performance on the dev set.

3) METRICS
The model uses F1 and EM as evaluation metrics. EM measures whether the predicted answer matches the ground truth exactly. F1 measures the word-level overlap between the predicted answer and the ground truth and is calculated from Precision and Recall: Precision is the proportion of the correctly predicted continuous span within the predicted answer, and Recall is the proportion of the correctly predicted continuous span within the ground truth. F1 is defined as follows.

F1 = \frac{2 \times P \times R}{P + R}
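As a concrete reference, the snippet below computes EM and word-level F1 for a single prediction in the way just described. It is a generic SQuAD-style sketch (whitespace tokenization, no answer normalization), not the exact evaluation script used by the authors; Chinese answers would normally be compared at the character level.

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    """EM: 1.0 if the predicted answer equals the ground truth exactly."""
    return float(prediction.strip() == ground_truth.strip())

def f1_score(prediction: str, ground_truth: str) -> float:
    """Word-level F1 between the predicted answer and the ground truth."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)   # overlap within the prediction
    recall = num_same / len(gt_tokens)        # overlap within the ground truth
    return 2 * precision * recall / (precision + recall)
```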
4) PERFORMANCE
In Table 3, this paper compares the evaluation results of multiple models on the DuReader 2.0 and SQuAD v1.1 dev sets. Compared with the baseline model, our model further improves F1 (+0.940 on DuReader; +0.526 on SQuAD) and EM (+0.918; +0.464). Table 4 shows the comparison of model parameters. The improvement in EM is notable, indicating that the model deepens its understanding of the text and predicts more accurate answers.

Table 3. Results on the DuReader 2.0 and SQuAD v1.1 dev sets
                DuReader (dev)       SQuAD (dev)
Model           F1       EM          F1       EM
ALBERT          82.615   70.430      86.719   78.742
ALBERT-Large    83.664   71.983      88.179   80.643
BERT            83.856   72.912      87.882   80.047
RoBERTa         83.911   72.618      88.565   80.766
Our Model       84.851   73.536      89.091   81.230

Table 4. Parameter comparison
Model         Params (M)
BERT          110
RoBERTa       110
S&I Reader    119

In the DuReader 2.0 experiment, the model was trained for 10,890 steps; a checkpoint was saved every 2,000 steps and the performance of the model was recorded. The changes of F1 and EM of our model and the baseline model with the number of training steps are shown in Figure 3 and Figure 4. In the early stage of training, due to the increased number of parameters, the performance of our model is slightly lower than the baseline model. After full training, the performance of our model is consistently better than the baseline.

FIGURE 3. F1 comparison line chart.
FIGURE 4. EM comparison line chart.

The model can further understand the text semantics. As shown in Table 5, we give a sample from the DuReader 2.0 dev set. In this sample, the baseline model misunderstands the question and context and cannot accurately locate and predict the answer. Our model adds the multi-hop mechanism on the basis of the baseline model, which deepens the understanding of the text and leads to the correct answer span.

Table 5. A comparative MRC example
Context: For the ATM machine's daily withdrawal limit, "daily" means from 00:00 to 23:59 of the day. This period is the same day, and after 0:00 it is the next day. The bank's ATM withdrawals can only withdraw 20,000 yuan, and ATM transfers can only transfer 50,000 yuan. Cash withdrawals of more than 20,000 yuan must be processed at the bank counter; if the cash exceeds 50,000 yuan, an appointment must be made one day in advance. If the transfer exceeds 50,000 yuan, it must be processed at the bank counter.
Question: ATM withdrawal limit
GroundTruth: 20,000 yuan
RoBERTa: 00:00 to 23:59 of the day
Our Model: 20,000 yuan

This paper finds that the model can alleviate the over-stability problem of previous models to a certain extent. As shown in Table 6, a sample from the DuReader 2.0 dev set is selected as an example. The baseline model matches only literally: it matches the part of the context similar to the question text (marked in bold in the original table) and gets the wrong answer. Our model matches according to the semantics of the question and context and finds the correct answer.
Table 6. An over-stability example in the dev set
Context: On the 24th, the Brexit referendum ended at 1:01 pm Beijing time. According to the BBC report, the results of the Brexit referendum were announced and the UK will officially leave the European Union. The number of votes for Brexit was 16,835,512 and 15,692,093 for staying in Europe. According to the current results, the Brexit side has the upper hand in the referendum, which means that Britain will withdraw from the European Union. The voting results will be officially announced around 14:00. The BBC made an official prediction on June 24, local time, that the final result of the Brexit referendum has basically been determined: Britain will leave the European Union. This makes other regions, including Northern Ireland and Scotland, consider the possibility of independence. According to another report, the Irish Sinn Fein Party stated that they will hold new votes on the independence of Northern Ireland and the unification of Ireland. The Chief Minister of Scotland said in an interview: "We have already seen the day when Scotland joins the European Union." This indicates that Scotland may become independent from Britain. | It was only announced that the result of the referendum was Brexit, but the UK has not formally left the European Union. The British High Court's ruling on Brexit has allowed the parliament to have endless discussions and may be appealed to the Supreme Court. Barring surprises, Brexit should be in March next year. | The UK's final result of the referendum count of 382 voting districts showed that 51.9% of the people chose to support leaving the European Union. Britain's "Brexit" has hit the British financial market severely. The pound fell sharply today, hitting a 30-year low. The prospects for European and American stock markets were bleak, and Asian stock markets also fell. In addition, some media speculated that British Prime Minister Cameron, who holds a "stay in Europe" position, might resign as a result.
Question: Brexit time
GroundTruth: March next year
RoBERTa: 1:01 pm Beijing time
Our Model: March next year

5) ABLATION
In order to analyze the influence of the Skimming Reading Module, Intensive Reading Module, Gated Mechanism, Multi-granularity Module, and the number of multi-hops on the performance of the model, ablation experiments on DuReader 2.0 were carried out in this paper. Table 7 shows the performance of the model in the different ablation settings.

Table 7. Ablation results on the DuReader 2.0 dev set
Ablation                          F1       EM
1. w/o Skimming Module            84.368   72.406
2. w/o Intensive Module           84.418   72.430
3. w/o Gated Mechanism            83.735   72.054
4. w/o Multi-granularity          84.542   73.465
5. multi-hop 1                    84.329   72.618
6. multi-hop 2                    84.198   72.689
7. multi-hop 3 (final model)      84.851   73.536
8. multi-hop 4                    84.097   73.042
9. multi-hop 5                    84.151   72.406

According to the multiple groups of ablation experiments in Table 7, compared with Experiment 7, Experiment 1 shows that bidirectional attention over attention helps the model pay attention to key content and improves model performance to a certain extent. Experiment 2 shows that further establishing a more complete relationship between the question and the context also helps performance. Experiment 3 shows that the Gated Mechanism helps the model filter out unimportant information and significantly improves the performance of the model. Experiment 4 shows that processing text at multiple granularities helps the model understand the text at multiple levels and further improves performance.

At the same time, Experiments 5-9 show that appropriately increasing the number of multi-hops helps the model understand the text semantics more deeply, alleviates insufficient learning and over-stability, and improves the accuracy of the predicted answer.
However, increasing the number of multi-hops also increases the number of model parameters and the amount of computation, which affects the performance and efficiency of the model. The experimental results show that the model achieves the best performance with multi-hop 3.

6) SKIMMING READING MODULE VERIFICATION
In order to further verify and explain the effectiveness of the Skimming Reading Module, this paper selects a sample from the DuReader 2.0 dev set, as shown in Table 8. When the sample enters the Skimming Reading Module, the model judges the question keywords and their associated context key parts, as well as the context keywords and their associated question key parts, and the resulting heat maps are drawn in Figure 5 and Figure 6.

Table 8. A dev set example
Context: It is reported that there are still many people who pass the registered safety engineer exam every year. In addition, fewer companies require this certificate, and the registered safety engineers of many units are basically nominal. The market situation is not very good, and the affiliation price of registered safety engineers is not very high: last year it was between 10,000 and 20,000 yuan, and the income depends on the region and the company's profitability.
Question: The affiliation price of a registered safety engineer
GroundTruth: Between 10,000 and 20,000 yuan

FIGURE 5. Question keywords and associated context key parts.
FIGURE 6. Context keywords and associated question key parts.
The horizontal and vertical axes of Figure 5 and Figure 6 represent the question and context text, respectively. From Figure 5 and Figure 6, we can see that the model accurately identifies the keyword of the question, "price". It can also be seen from Figure 5 that the Skimming Reading Module identifies the parts of the context that are semantically highly related to the question keyword "price", such as "income", "market", "region", and "company's profitability". Therefore, the above sample and heat maps further verify and explain that the Skimming Reading Module has the ability to semantically identify keywords and their corresponding key parts.

V. CONCLUSION
This paper proposes S&I Reader, a reading comprehension model that combines multiple skimming and intensive reading modules, a gated mechanism, and a multi-granularity module. The model focuses on developing the downstream processing structure of the pre-trained model to address the performance bottleneck caused by the current pace of pre-trained model development. The downstream structure simulates the behaviors humans use when solving reading comprehension tasks, such as skimming, intensive reading, filtering useful information, and understanding text at multiple granularities. Experiments prove that the model is effective on both Chinese and English datasets, that its performance can be further improved on the basis of the pre-trained model, that it can alleviate the errors caused by insufficient learning and over-stability to a certain extent, and that it deepens the model's understanding of the text. At the same time, the experiments and analyses also verify the effectiveness of the proposed Skimming Reading Module in selecting the key information of the text.

We note that S&I Reader adds a large number of parameters and a multi-layer structure on top of a pre-trained model that already has a large number of parameters. Therefore, our next important work is to simplify the structure and parameters of the model while maintaining its performance, so that the model can better meet the resource and time requirements of industrial applications. At the same time, while testing the model we found that, for a small number of samples, the model cannot capture the internal meaning of the entities in the text and the relationships between them, and therefore gives irrelevant answers. How to introduce a knowledge graph into the model to obtain prior knowledge, and how to better understand the context and question to avoid absurd answers, are also our next research directions.

REFERENCES
[1] S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang, "Neural machine reading comprehension: Methods and trends," Applied Sciences, vol. 9, no. 18, p. 3698, 2019.
[2] Z. Zhang et al., "Semantics-aware BERT for language understanding," arXiv preprint arXiv:1909.02209, 2019.
[3] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[5] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[6] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems, 2019, pp. 5753-5763.
[7] R. Jia and P. Liang, "Adversarial examples for evaluating reading comprehension systems," arXiv preprint arXiv:1707.07328, 2017.
[8] L. A. Zadeh, "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic," Fuzzy Sets and Systems, vol. 90, no. 2, pp. 111-127, 1997.
[9] L. A. Zadeh, "Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems," Soft Computing, vol. 2, no. 1, pp. 23-25, 1998.
[10] Y. Zhang, D. Miao, W. Pedrycz, T. Zhao, J. Xu, and Y. Yu, "Granular structure-based incremental updating for multi-label classification," Knowledge-Based Systems, vol. 189, p. 105066, 2020.
[11] B. Zhang and L. Zhang, "Problem solving theory and application," Beijing, China: Tsinghua University Press, 1990.
[12] C. Yunxian, L. Renjie, Z. Shuliang, and G. Fenghua, "Measuring multi-spatiotemporal scale tourist destination popularity based on text granular computing," PLoS ONE, vol. 15, no. 4, p. e0228175, 2020.
[13] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[14] W. He et al., "DuReader: A Chinese machine reading comprehension dataset from real-world applications," arXiv preprint arXiv:1711.05073, 2017.
[15] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016.
[16] T. Nguyen et al., "MS MARCO: A human-generated machine reading comprehension dataset," 2016.
[17] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, "RACE: Large-scale reading comprehension dataset from examinations," arXiv preprint arXiv:1704.04683, 2017.
[18] K. M. Hermann et al., "Teaching machines to read and comprehend," in Advances in Neural Information Processing Systems, 2015, pp. 1693-1701.
[19] Z. Wu and H. Xu, "Improving the robustness of machine reading comprehension model with hierarchical knowledge and auxiliary unanswerability prediction," Knowledge-Based Systems, p. 106075, 2020.
[20] B. Dhingra, H. Liu, Z. Yang, W. W. Cohen, and R. Salakhutdinov, "Gated-attention readers for text comprehension," arXiv preprint arXiv:1606.01549, 2016.
[21] S. Wang and J. Jiang, "Machine comprehension using Match-LSTM and answer pointer," arXiv preprint arXiv:1608.07905, 2016.
[22] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bidirectional attention flow for machine comprehension," arXiv preprint arXiv:1611.01603, 2016.
[23] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, "Gated self-matching networks for reading comprehension and question answering," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 189-198.
[24] J. Zhang, X. Zhu, Q. Chen, L. Dai, S. Wei, and H. Jiang, "Exploring question understanding and adaptation in neural-network-based question answering," arXiv preprint arXiv:1703.04617, 2017.
[25] A. W. Yu et al., "QANet: Combining local convolution with global self-attention for reading comprehension," arXiv preprint arXiv:1804.09541, 2018.
[26] L. Kaiser, A. N. Gomez, and F. Chollet, "Depthwise separable convolutions for neural machine translation," arXiv preprint arXiv:1706.03059, 2017.
[27] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[28] Y. Cui, Z. Chen, S. Wei, S. Wang, T. Liu, and G. Hu, "Attention-over-attention neural networks for reading comprehension," arXiv preprint arXiv:1607.04423, 2016.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[30] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[31] Y. Gong and S. R. Bowman, "Ruminating reader: Reasoning with gated multi-hop attention," arXiv preprint arXiv:1704.07415, 2017.
[32] Y. Yao, "Three-way granular computing, rough sets, and formal concept analysis," International Journal of Approximate Reasoning, vol. 116, pp. 106-125, 2020.
[33] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[34] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026-8037.

Yong Wang received the Ph.D. degree from East China Normal University in 2007. He is currently an Associate Professor with the School of Artificial Intelligence, Liangjiang, Chongqing University of Technology. His research interests include deep learning, natural language processing, multimedia, and big data technology.

Chong Lei received the bachelor's degree from Taiyuan University of Technology in 2017. He is currently pursuing the master's degree with the College of Computer Science and Engineering, Chongqing University of Technology. His research interests include deep learning, natural language processing, and machine reading comprehension.

Duoqian Miao received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 1997. He is currently a Professor with the Department of Computer Science and Technology, Tongji University. His research interests include artificial intelligence, machine learning, big data analysis, and granular computing.