Matching Article Pairs with Graphical Decomposition and Convolutions
Bang Liu†, Di Niu†, Haojie Wei‡, Jinghong Lin‡, Yancheng He‡, Kunfeng Lai‡, Yu Xu‡
† University of Alberta, Edmonton, AB, Canada
{bang3, dniu}@ualberta.ca
‡ Platform and Content Group, Tencent, Shenzhen, China
{fayewei, daphnelin, collinhe, calvinlai, henrysxu}@tencent.com

arXiv:1802.07459v2 [cs.CL] 28 May 2019

Abstract

Identifying the relationship between two articles, e.g., whether two articles published from different sources describe the same breaking news, is critical to many document understanding tasks. Existing approaches for modeling and matching sentence pairs do not perform well in matching longer documents, which embody more complex interactions between the enclosed entities than a sentence does. To model article pairs, we propose the Concept Interaction Graph to represent an article as a graph of concepts. We then match a pair of articles by comparing the sentences that enclose the same concept vertex through a series of encoding techniques, and aggregate the matching signals through a graph convolutional network. To facilitate the evaluation of long article matching, we have created two datasets, each consisting of about 30K pairs of breaking news articles covering diverse topics in the open domain. Extensive evaluations of the proposed methods on the two datasets demonstrate significant improvements over a wide range of state-of-the-art methods for natural language matching.

1 Introduction

Identifying the relationship between a pair of articles is an essential natural language understanding task, which is critical to news systems and search engines. For example, a news system needs to cluster various articles on the Internet reporting the same breaking news (probably in different ways of wording and narratives), remove redundancy and form storylines (Shahaf et al., 2013; Liu et al., 2017; Zhou et al., 2015; Vossen et al., 2015; Bruggermann et al., 2016). The rich semantic and logic structures in longer documents make it a different and more challenging task to match a pair of articles than to match a pair of sentences or a query-document pair in information retrieval.

Traditional term-based matching approaches estimate the semantic distance between a pair of text objects via unsupervised metrics, e.g., via TF-IDF vectors, BM25 (Robertson et al., 2009), LDA (Blei et al., 2003) and so forth. These methods have achieved success in query-document matching, information retrieval and search. In recent years, a wide variety of deep neural network models have also been proposed for text matching (Hu et al., 2014; Qiu and Huang, 2015; Wan et al., 2016; Pang et al., 2016), which can capture the semantic dependencies (especially sequential dependencies) in natural language through layers of recurrent or convolutional neural networks. However, existing deep models are mainly designed for matching sentence pairs, e.g., for paraphrase identification or answer selection in question answering, omitting the complex interactions among keywords, entities or sentences that are present in a longer article. Therefore, article pair matching remains under-explored in spite of its importance.

In this paper, we apply the divide-and-conquer philosophy to matching a pair of articles and bring deep text understanding from the currently dominating sequential modeling of language elements to a new level of graphical document representation, which is more suitable for longer articles. Specifically, we have made the following contributions:

First, we propose the so-called Concept Interaction Graph (CIG) to represent a document as a weighted graph of concepts, where each concept vertex is either a keyword or a set of tightly connected keywords.
The sentences in the article associated with each concept serve as the features for local comparison to the same concept appearing in another article. Furthermore, two concept vertices in an article are also connected by a weighted edge which indicates their interaction strength. The CIG not only captures the essential semantic units in a document but also offers a way to perform anchored comparison between two articles along the common concepts found.
Second, we propose a divide-and-conquer framework to match a pair of articles based on the constructed CIGs and graph convolutional networks (GCNs). The idea is that for each concept vertex that appears in both articles, we first obtain the local matching vectors through a range of text pair encoding schemes, including both neural encoding and term-based encoding. We then aggregate the local matching vectors into the final matching result through graph convolutional layers (Kipf and Welling, 2016; Defferrard et al., 2016). In contrast to RNN-based sequential modeling, our model factorizes the matching process into local matching sub-problems on a graph, each focusing on a different concept, and by using GCN layers, generates matching results based on a holistic view of the entire graph.

Although there exist many datasets for sentence matching, the semantic matching between longer articles is a largely unexplored area. To the best of our knowledge, to date, there does not exist a labeled public dataset for long document matching. To facilitate evaluation and further research on document and especially news article matching, we have created two labeled datasets¹, one annotating whether two news articles found on the Internet (from different media sources) report the same breaking news event, while the other annotating whether they belong to the same news story (yet not necessarily reporting the same breaking news event). These articles were collected from major Internet news providers in China, including Tencent, Sina, WeChat, Sohu, etc., covering diverse topics, and were labeled by professional editors.

Note that similar to most other natural language matching models, all the approaches proposed in this paper can easily work on other languages as well.

Through extensive experiments, we show that our proposed algorithms have achieved significant improvements on matching news article pairs, as compared to a wide range of state-of-the-art methods, including both term-based and deep text matching algorithms. With the same encoding or term-based feature representation of a pair of articles, our approach based on graphical decomposition and convolutions can improve the classification accuracy by 17.31% and 23.09% on the two datasets, respectively.

¹ Our code and datasets are available at: https://github.com/BangLiu/ArticlePairMatching
Figure 1: An example to show a piece of text and its Concept Interaction Graph representation. The example text is: [1] Rick asks Morty to travel with him in the universe. [2] Morty doesn't want to go as Rick always brings him dangerous experiences. [3] However, the destination of this journey is the Candy Planet, which is a fascinating place that attracts Morty. [4] The planet is full of delicious candies. [5] Summer wishes to travel with Rick. [6] However, Rick doesn't like to travel with Summer. In the resulting graph, the concept vertex (Rick, Morty) is attached to sentences [1, 2], (Rick, Summer) to sentences [5, 6], and (Candy Planet) to sentences [3, 4].

2 Concept Interaction Graph

In this section, we present our Concept Interaction Graph (CIG) to represent a document as an undirected weighted graph, which decomposes a document into subsets of sentences, each subset focusing on a different concept. Given a document D, a CIG is a graph G_D, where each vertex in G_D is called a concept, which is a keyword or a set of highly correlated keywords in document D. Each sentence in D will be attached to the single concept vertex that it is the most related to, which most frequently is the concept the sentence mentions. Hence, vertices will have their own sentence sets, which are disjoint. The weight of the edge between a pair of concepts denotes how much the two concepts are related to each other and can be determined in various ways.

As an example, Fig. 1 illustrates how we convert a document into a Concept Interaction Graph. We can extract the keywords Rick, Morty, Summer, and Candy Planet from the document using standard keyword extraction algorithms, e.g., TextRank (Mihalcea and Tarau, 2004). These keywords are further clustered into three concepts, where each concept is a subset of highly correlated keywords. After grouping keywords into concepts, we attach each sentence in the document to its most related concept vertex. For example, in Fig. 1, sentences 1 and 2 are mainly talking about the relationship between Rick and Morty, and are thus attached to the concept (Rick, Morty). Other sentences are attached to vertices in a similar way. The attachment of sentences to concepts naturally dissects the original document into multiple disjoint sentence subsets. As a result, we have represented the original document with a graph of key concepts, each with a sentence subset, as well as the interaction topology among them.
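For concreteness, the CIG of the Fig. 1 example could be held in memory as a simple mapping from concept vertices to their sentence sets, as sketched below; this structure is only illustrative and is not the data format used in the released code.

```python
# The Fig. 1 example as a plain data structure (illustrative only).
fig1_cig = {
    ("Rick", "Morty"): [1, 2],    # sentence ids attached to each concept vertex
    ("Rick", "Summer"): [5, 6],
    ("Candy", "Planet"): [3, 4],
}
# Edge weights between any two concept vertices would record how strongly the
# two concepts interact, e.g., the TF-IDF similarity of their sentence sets.
```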
Figure 2: An overview of our approach for constructing the Concept Interaction Graph (CIG) from a pair of documents and classifying it by Graph Convolutional Networks. The four panels show (a) Representation, (b) Encoding, (c) Transformation, and (d) Aggregation.

Fig. 2 (a) illustrates the construction of CIGs for a pair of documents aligned by the discovered concepts. Here we first describe the detailed steps to construct a CIG for a single document:

KeyGraph Construction. Given a document D, we first extract the named entities and keywords by TextRank (Mihalcea and Tarau, 2004). After that, we construct a keyword co-occurrence graph, called KeyGraph, based on the set of found keywords. Each keyword is a vertex in the KeyGraph. We connect two keywords by an edge if they co-occur in the same sentence.

We can further improve our model by performing co-reference resolution and synonym analysis to merge keywords with the same meaning. However, we do not apply these operations due to the time complexity.

Concept Detection (Optional). The structure of the KeyGraph reveals the connections between keywords. If a subset of keywords are highly correlated, they will form a densely connected subgraph in the KeyGraph, which we call a concept. Concepts can be extracted by applying community detection algorithms on the constructed KeyGraph. Community detection is able to split a KeyGraph G_key into a set of communities C = {C_1, C_2, ..., C_|C|}, where each community C_i contains the keywords for a certain concept. By using overlapping community detection, each keyword may appear in multiple concepts. As the number of concepts in different documents varies a lot, we utilize the betweenness centrality score based algorithm (Sayyadi and Raschid, 2013) to detect keyword communities in the KeyGraph. Note that this step is optional, i.e., we can also use each keyword directly as a concept. The benefit brought by concept detection is that it reduces the number of vertices in a graph and speeds up matching, as will be shown in Sec. 4.

Sentence Attachment. After the concepts are discovered, the next step is to group sentences by concepts. We calculate the cosine similarity between each sentence and each concept, where sentences and concepts are represented by TF-IDF vectors. We assign each sentence to the concept which is the most similar to the sentence. Sentences that do not match any concepts in the document will be attached to a dummy vertex that does not contain any keywords.

Edge Construction. To construct edges that reveal the correlations between different concepts, for each vertex, we represent its sentence set as a concatenation of the sentences attached to it, and calculate the edge weight between any two vertices as the TF-IDF similarity between their sentence sets. Although edge weights may be decided in other ways, our experience shows that constructing edges by TF-IDF similarity generates a CIG that is more densely connected.
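The following sketch ties these steps together for a single document. It is a simplified illustration rather than the authors' released implementation: keywords are assumed to be given (e.g., by TextRank), concept detection is skipped so that each keyword is its own concept, and all function and variable names are hypothetical.

```python
# A minimal, illustrative sketch of single-document CIG construction.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_cig(sentences, keywords):
    """sentences: list of sentence strings; keywords: keywords of the document."""
    # KeyGraph Construction: connect two keywords if they co-occur in a sentence.
    keygraph_edges = set()
    for sent in sentences:
        present = sorted(k for k in keywords if k in sent)  # crude containment test
        keygraph_edges.update(combinations(present, 2))

    # Concept Detection is optional; here every keyword is its own concept.
    # (With community detection enabled, concepts would be communities of keygraph_edges.)
    concepts = [(k,) for k in keywords]

    # Sentence Attachment: assign each sentence to its most similar concept
    # under TF-IDF cosine similarity; unmatched sentences go to a dummy vertex.
    concept_texts = [" ".join(c) for c in concepts]
    vec = TfidfVectorizer().fit(sentences + concept_texts)
    sims = cosine_similarity(vec.transform(sentences), vec.transform(concept_texts))
    sentence_sets = {c: [] for c in concepts}
    sentence_sets["dummy"] = []
    for i, row in enumerate(sims):
        best = int(row.argmax())
        target = concepts[best] if row[best] > 0 else "dummy"
        sentence_sets[target].append(sentences[i])

    # Edge Construction: weight = TF-IDF similarity between concatenated sentence sets.
    nonempty = [c for c in sentence_sets if sentence_sets[c]]
    set_vecs = vec.transform([" ".join(sentence_sets[c]) for c in nonempty])
    w = cosine_similarity(set_vecs)
    edges = {(nonempty[i], nonempty[j]): float(w[i, j])
             for i in range(len(nonempty)) for j in range(i + 1, len(nonempty))}
    return keygraph_edges, sentence_sets, edges
```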
When performing article pair matching, the above steps are applied to a pair of documents D_A and D_B, as shown in Fig. 2 (a). The only additional step is that we align the CIGs of the two articles by the concept vertices, and for each common concept vertex, merge the sentence sets from D_A and D_B for local comparison.
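A pair-level CIG can then be formed by taking the union of the two documents' concept vertices and keeping the two per-document sentence sets on each common vertex, roughly as sketched below (illustrative names only, not the released code):

```python
# Illustrative sketch: merge two per-document CIGs into one pair-level CIG.
def merge_cigs(sentence_sets_a, sentence_sets_b):
    """Each argument maps a concept vertex to the sentences attached to it."""
    merged = {}
    for concept in set(sentence_sets_a) | set(sentence_sets_b):
        merged[concept] = (
            sentence_sets_a.get(concept, []),  # S_A(v); empty if v appears only in B
            sentence_sets_b.get(concept, []),  # S_B(v); empty if v appears only in A
        )
    return merged
```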
3 Article Pair Matching through Graph Convolutions

Given the merged CIG G_AB of two documents D_A and D_B described in Sec. 2, we match a pair of articles in a "divide-and-conquer" manner by matching the sentence sets from D_A and D_B associated with each concept and aggregating local matching results into a final result through multiple graph convolutional layers. Our approach overcomes the limitation of previous text matching algorithms by extending text representation from a sequential (or grid) point of view to a graphical view, and can therefore better capture the rich semantic interactions in longer text.

Fig. 2 illustrates the overall architecture of our proposed method, which consists of four steps: a) representing a pair of documents by a single merged CIG, b) learning multi-viewed matching features for each concept vertex, c) structurally transforming local matching features by graph convolutional layers, and d) aggregating local matching features to get the final result. Steps (b)-(d) can be trained end-to-end.

Encoding Local Matching Vectors. Given the merged CIG G_AB, our first step is to learn an appropriate matching vector of a fixed length for each individual concept v ∈ G_AB to express the semantic similarity between S_A(v) and S_B(v), the sentence sets of concept v from documents D_A and D_B, respectively. This way, the matching of two documents is converted to matching the pair of sentence sets on each vertex of G_AB. Specifically, we generate local matching vectors based on both neural networks and term-based techniques.

Siamese Encoder: we apply a Siamese neural network encoder (Neculoiu et al., 2016) onto each vertex v ∈ G_AB to convert the word embeddings (Mikolov et al., 2013) of {S_A(v), S_B(v)} into a fixed-sized hidden feature vector m_AB(v), which we call the match vector.

We use a Siamese structure to take S_A(v) and S_B(v) (which are two sequences of word embeddings) as inputs, and encode them into two context vectors through context layers that share the same weights, as shown in Fig. 2 (b). The context layer usually contains one or multiple bi-directional LSTM (BiLSTM) or CNN layers with max pooling, aiming to capture the contextual information in S_A(v) and S_B(v).

Let c_A(v) and c_B(v) denote the context vectors obtained for S_A(v) and S_B(v), respectively. Then, the matching vector m_AB(v) for vertex v is given by the subsequent aggregation layer, which concatenates the element-wise absolute difference and the element-wise multiplication of the two context vectors, i.e.,

m_AB(v) = (|c_A(v) − c_B(v)|, c_A(v) ◦ c_B(v)),    (1)

where ◦ denotes the Hadamard product.

Term-based Similarities: we also generate another matching vector for each v by directly calculating term-based similarities between S_A(v) and S_B(v), based on 5 metrics: the TF-IDF cosine similarity, TF cosine similarity, BM25 cosine similarity, Jaccard similarity of 1-grams, and the Ochiai similarity measure. These similarity scores are concatenated into another matching vector m′_AB(v) for v, as shown in Fig. 2 (b).
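A minimal PyTorch-style sketch of the neural part of this per-vertex encoder is shown below, assuming a weight-shared 1-D convolutional context layer with max pooling (the configuration used in the implementation details later); the class name and dimensions are illustrative, and the five term-based similarities are computed separately and concatenated to form m′_AB(v).

```python
import torch
import torch.nn as nn

class SiameseVertexEncoder(nn.Module):
    """Sketch of the per-vertex Siamese encoder: a weight-shared 1-D CNN context
    layer with max pooling, followed by the aggregation of Eq. (1)."""
    def __init__(self, emb_dim=64, hidden=32):
        super().__init__()
        self.context = nn.Sequential(
            nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # max-pool over the token positions
        )

    def encode(self, seq):
        # seq: (batch, seq_len, emb_dim) word embeddings of one sentence set
        return self.context(seq.transpose(1, 2)).squeeze(-1)  # (batch, hidden)

    def forward(self, s_a, s_b):
        c_a, c_b = self.encode(s_a), self.encode(s_b)  # shared context layers
        # Eq. (1): concatenate |c_A - c_B| with the Hadamard product c_A * c_B
        return torch.cat([(c_a - c_b).abs(), c_a * c_b], dim=-1)
```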
Matching Aggregation via GCN. The local matching vectors must be aggregated into a final matching score for the pair of articles. We propose to utilize the ability of Graph Convolutional Network (GCN) filters (Kipf and Welling, 2016) to capture the patterns exhibited in the CIG G_AB at multiple scales. In general, the input to the GCN is a graph G = (V, E) with N vertices v_i ∈ V, and edges e_ij = (v_i, v_j) ∈ E with weights w_ij. The input also contains a vertex feature matrix denoted by X = {x_i | i = 1, ..., N}, where x_i is the feature vector of vertex v_i. For a pair of documents D_A and D_B, we input their CIG G_AB (with N vertices), with a (concatenated) matching vector on each vertex, into the GCN, such that the feature vector of vertex v_i is given by

x_i = (m_AB(v_i), m′_AB(v_i)).

Now let us briefly describe the GCN layers (Kipf and Welling, 2016) used in Fig. 2 (c). Denote the weighted adjacency matrix of the graph as A ∈ R^(N×N), where A_ij = w_ij (in a CIG, it is the TF-IDF similarity between vertices i and j). Let D be a diagonal matrix such that D_ii = Σ_j A_ij. The input layer to the GCN is H^(0) = X, which contains the original vertex features. Let H^(l) ∈ R^(N×M_l) denote the matrix of hidden representations of the vertices in the l-th layer. Then each GCN layer applies the following graph convolutional filter onto the previous hidden representations:

H^(l+1) = σ( D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l) ),    (2)

where Ã = A + I_N, I_N is the identity matrix, and D̃ is a diagonal matrix such that D̃_ii = Σ_j Ã_ij; they are the adjacency matrix and the degree matrix of the graph with self-connections added, respectively.
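A minimal dense-matrix implementation of the filter in Eq. (2) looks roughly as follows; this is an illustrative sketch (the module name, ReLU activation, and dense math are assumptions), not the authors' code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolutional layer implementing Eq. (2):
    H^(l+1) = sigma(D~^(-1/2) A~ D~^(-1/2) H^(l) W^(l))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, adj, h):
        # adj: (N, N) weighted adjacency matrix A of the merged CIG
        # h:   (N, in_dim) vertex features H^(l)
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)  # A~ = A + I_N
        d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))      # D~^(-1/2)
        norm_adj = d_inv_sqrt @ a_tilde @ d_inv_sqrt
        return torch.relu(self.w(norm_adj @ h))                    # sigma(...)
```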
4 Evaluation

[Figure 3: A timeline of breaking news events, dated between July and November 2016, grouped into the story "2016 U.S. presidential election".]
As a typical example, Fig. 3 shows the events contained in the story 2016 U.S. presidential election, where each tag shows a breaking news event possibly reported by multiple articles with different narratives (articles not shown here). We group highly coherent events together. For example, there are multiple events about election television debates. One of our objectives is to identify whether two news articles report the same event, e.g., a yes when they are both reporting Trump and Hilary's first television debate, though with different wording, or a no, when one article is reporting Trump and Hilary's second television debate while the other is talking about Donald Trump being elected president.

Datasets. To the best of our knowledge, there is no publicly available dataset for long document matching tasks. We created two datasets: the Chinese News Same Event dataset (CNSE) and the Chinese News Same Story dataset (CNSS), which are labeled by professional editors. They contain long Chinese news articles collected from major Internet news providers in China, covering diverse topics in the open domain. The CNSE dataset contains 29,063 pairs of news articles with labels representing whether a pair of news articles are reporting the same breaking news event. Similarly, the CNSS dataset contains 33,503 pairs of articles with labels representing whether two documents fall into the same news story. The average number of words for all documents in the datasets is 734 and the maximum value is 21,791.

In our datasets, we only labeled the major event (or story) that a news article is reporting, since in the real world, each breaking news article on the Internet must be intended to report some specific breaking news that has just happened to attract clicks and views. Our objective is to determine whether two news articles intend to report the same breaking news.

Note that the negative samples in the two datasets are not randomly generated: we select document pairs that contain similar keywords, and exclude samples with TF-IDF similarity below a certain threshold. The datasets have been made publicly available for research purposes.

Table 1: Description of evaluation datasets.

Dataset | Pos Samples | Neg Samples | Train | Dev  | Test
CNSE    | 12865       | 16198       | 17438 | 5813 | 5812
CNSS    | 16887       | 16616       | 20102 | 6701 | 6700

Table 1 shows a detailed breakdown of the two datasets. For both datasets, we use 60% of all the samples as the training set, 20% as the development (validation) set, and the remaining 20% as the test set. We carefully ensure that different splits do not contain any overlaps to avoid data leakage. The metrics used for performance evaluation are the accuracy and F1 scores of binary classification results. For each evaluated method, we perform training for 10 epochs and then choose the epoch with the best validation performance to be evaluated on the test set.

Baselines. We test the following baselines:

• Matching by representation-focused or interaction-focused deep neural network models: DSSM (Huang et al., 2013), C-DSSM (Shen et al., 2014), DUET (Mitra et al., 2017), MatchPyramid (Pang et al., 2016), ARC-I (Hu et al., 2014), and ARC-II (Hu et al., 2014). We use the implementations from MatchZoo (Fan et al., 2017) for the evaluation of these models.

• Matching by term-based similarities: BM25 (Robertson et al., 2009), LDA (Blei et al., 2003), and SimNet (which extracts the five text-pair similarities mentioned in Sec. 3 and classifies them with a multi-layer feedforward neural network).

• Matching by a large-scale pre-trained language model: BERT (Devlin et al., 2018).
Note that we focus on the capability of long text matching. Therefore, we do not use any short text information, such as titles, in our approach or in any baselines. In fact, the "relationship" between two documents is not limited to "whether the same event or not". Our algorithm is able to identify a general relationship between documents, e.g., whether two episodes are from the same season of a TV series. The definition of the relationship (e.g., same event/story, same chapter of a book) is solely defined and supervised by the labeled training data. For these tasks, the availability of other information such as titles cannot be assumed.

As shown in Table 2, we evaluate different variants of our own model to show the effect of different sub-modules. In model names, "CIG" means that in the CIG, we directly use keywords as concepts without community detection, whereas "CIGcd" means that each concept vertex in the CIG contains a set of keywords grouped via community detection. To generate the matching vector on each vertex, "Siam" indicates the use of the Siamese encoder, while "Sim" indicates the use of the term-based similarity encoder, as shown in Fig. 2. "GCN" means that we convolve the local matching vectors on vertices through GCN layers. Finally, "BERTg" or "Simg" indicates the use of additional global features given by BERT or the five term-based similarity metrics mentioned in Sec. 3, appended to the graphically merged matching vector m_AB, for final classification.
Table 2: Accuracy and F1-score results of different algorithms on the CNSE and CNSS datasets.

Baselines:
Model | CNSE Acc | CNSE F1 | CNSS Acc | CNSS F1
I. ARC-I | 53.84 | 48.68 | 50.10 | 66.58
II. ARC-II | 54.37 | 36.77 | 52.00 | 53.83
III. DUET | 55.63 | 51.94 | 52.33 | 60.67
IV. DSSM | 58.08 | 64.68 | 61.09 | 70.58
V. C-DSSM | 60.17 | 48.57 | 52.96 | 56.75
VI. MatchPyramid | 66.36 | 54.01 | 62.52 | 64.56
VII. BM25 | 69.63 | 66.60 | 67.77 | 70.40
VIII. LDA | 63.81 | 62.44 | 62.98 | 69.11
IX. SimNet | 71.05 | 69.26 | 70.78 | 74.50
X. BERT fine-tuning | 81.30 | 79.20 | 86.64 | 87.08

Our models:
Model | CNSE Acc | CNSE F1 | CNSS Acc | CNSS F1
XI. CIG-Siam | 74.47 | 73.03 | 75.32 | 78.58
XII. CIG-Siam-GCN | 74.58 | 73.69 | 78.91 | 80.72
XIII. CIGcd-Siam-GCN | 73.25 | 73.10 | 76.23 | 76.94
XIV. CIG-Sim | 72.58 | 71.91 | 75.16 | 77.27
XV. CIG-Sim-GCN | 83.35 | 80.96 | 87.12 | 87.57
XVI. CIGcd-Sim-GCN | 81.33 | 78.88 | 86.67 | 87.00
XVII. CIG-Sim&Siam-GCN | 84.64 | 82.75 | 89.77 | 90.07
XVIII. CIG-Sim&Siam-GCN-Simg | 84.21 | 82.46 | 90.03 | 90.29
XIX. CIG-Sim&Siam-GCN-BERTg | 84.68 | 82.60 | 89.56 | 89.97
XX. CIG-Sim&Siam-GCN-Simg&BERTg | 84.61 | 82.59 | 89.47 | 89.71

Implementation Details. We use Stanford CoreNLP (Manning et al., 2014) for word segmentation (on Chinese text) and named entity recognition. For Concept Interaction Graph construction with community detection, we set the minimum community size (number of keywords contained in a concept vertex) to be 2, and the maximum size to be 6.

Our neural network model consists of a word embedding layer, a Siamese encoding layer, graph transformation layers, and a classification layer. For embedding, we load the pre-trained word vectors and fix them during training. The embeddings of out-of-vocabulary words are set to be zero vectors. For the Siamese encoding network, we use 1-D convolution with 32 filters, followed by a ReLU layer and a max pooling layer.
For graph transformation, we utilize 2 layers of GCN (Kipf and Welling, 2016) for experiments on the CNSS dataset, and 3 layers of GCN for experiments on the CNSE dataset. When the vertex encoder is the five-dimensional term-based features, we set the output size of the GCN layers to be 16. When the vertex encoder is the Siamese network encoder, we set the output size of the GCN layers to be 128, except for the last layer. For the last GCN layer, the output size is always set to be 16. The classification module consists of a linear layer with output size 16, a ReLU layer, a second linear layer, and finally a Sigmoid layer. Note that this classification module is also used for the baseline method SimNet.

As we mentioned in Sec. 1, our code and datasets have been open sourced. We implement our model using PyTorch 1.0 (Paszke et al., 2017). The experiments without BERT are carried out on a MacBook Pro with a 2 GHz Intel Core i7 processor and 8 GB memory. We use L2 weight decay on all the trainable variables, with parameter λ = 3 × 10⁻⁷. The dropout rate between every two layers is 0.1. We apply gradient clipping with maximum gradient norm 5.0. We use the ADAM optimizer (Kingma and Ba, 2014) with β1 = 0.8, β2 = 0.999, ε = 10⁻⁸. We use a learning rate warm-up scheme with an inverse exponential increase from 0.0 to 0.001 in the first 1000 steps, and then maintain a constant learning rate for the remainder of training. For all the experiments, we set the maximum number of training epochs to be 10.
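The optimizer and warm-up schedule described above can be set up roughly as follows in PyTorch; since the exact warm-up curve is not spelled out beyond an "inverse exponential increase", the logarithmic ramp below is only one common formulation, and the function name is illustrative.

```python
import math
import torch

def build_optimizer(model, base_lr=0.001, warmup_steps=1000):
    """Sketch of the training setup: Adam with the stated betas, L2 weight decay,
    and a warm-up from 0 to base_lr over the first warmup_steps updates."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                                 betas=(0.8, 0.999), eps=1e-8, weight_decay=3e-7)

    def warmup(step):
        if step >= warmup_steps:
            return 1.0  # constant learning rate after warm-up
        return math.log(step + 1) / math.log(warmup_steps)  # ramps from 0 to 1

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)
    return optimizer, scheduler

# Per training step (gradient clipping at max norm 5.0; dropout lives in the model):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```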
4.1 Results and Analysis

Table 2 summarizes the performance of all the compared methods on both datasets. Our model achieves the best performance on both datasets and significantly outperforms all other methods. This can be attributed to two reasons. First, as the input article pairs are re-organized into Concept Interaction Graphs, the two documents are aligned along the corresponding semantic units for easier concept-wise comparison. Second, our model encodes local comparisons around different semantic units into local matching vectors, and aggregates them via graph convolutions, taking semantic topologies into consideration. Therefore, it solves the problem of matching documents via divide-and-conquer, which is suitable for handling long text.

Impact of Graphical Decomposition. Comparing method XI with methods I-VI in Table 2, they all use the same word vectors and use neural networks for text encoding. The key difference is that our method XI compares a pair of articles over a CIG in a per-vertex decomposed fashion. We can see that the performance of method XI is significantly better than that of methods I-VI. Similarly, comparing our method XIV with methods VII-IX, they all use the same term-based similarities. However, our method achieves significantly better performance by using graphical decomposition. Therefore, we conclude that graphical decomposition can greatly improve long text matching performance.

Note that the deep text matching models I-VI lead to bad performance, because they were invented mainly for sequence matching and can hardly capture meaningful semantic interactions in article pairs. When the text is long, it is hard to get an appropriate context vector representation for matching. For interaction-focused neural network models, most of the interactions between words in two long articles will be meaningless.

Impact of Graph Convolutions. Compare methods XII and XI, and compare methods XV and XIV. We can see that incorporating GCN layers has significantly improved the performance on both datasets. Each GCN layer updates the hidden vector of each vertex by integrating the vectors from its neighboring vertices. Thus, the GCN layers learn to graphically aggregate local matching features into a final result.

Impact of Community Detection. By comparing methods XIII and XII, and comparing methods XVI and XV, we observe that using community detection, such that each concept is a set of correlated keywords instead of a single keyword, leads to slightly worse performance. This is reasonable, as using each keyword directly as a concept vertex provides more anchor points for article comparison.
However, community detection can group highly coherent keywords together and reduces the average size of CIGs from 30 to 13 vertices. This helps to reduce the total training and testing time of our models by as much as 55%. Therefore, one may choose whether to apply community detection to trade accuracy off for speedups.

Impact of Multi-viewed Matching. Comparing methods XVII and XV, we can see that the concatenation of different graphical matching vectors (both term-based and Siamese encoded features) can further improve performance. This demonstrates the advantage of combining multi-viewed matching vectors.

Impact of Added Global Features. Comparing methods XVIII, XIX, and XX with method XVII, we can see that adding more global features, such as global similarities (Simg) and/or global BERT encodings (BERTg) of the article pair, can hardly improve performance any further. This shows that graphical decomposition and convolutions are the main factors that contribute to the performance improvement. Since they already learn to aggregate local comparisons into a global semantic relationship, additionally engineered global features cannot help.

Model Size and Parameter Sensitivity. Our biggest model without BERT is XVIII, which contains only ∼34K parameters. In comparison, BERT contains 110M-340M parameters. However, our model significantly outperforms BERT.

We tested the sensitivity of different parameters in our model. We found that 2 to 3 GCN layers give the best performance. Introducing more GCN layers does not improve the performance, while the performance is much worse with zero or only one GCN layer. Furthermore, GCN hidden representations of a size between 16 and 128 yield good performance. Further increasing this size does not show obvious improvement.

For the optional community detection step in CIG construction, we need to choose the minimum size and the maximum size of communities. We found that the final performance remains similar if we vary the minimum size from 2 to 3 and the maximum size from 6 to 10. This indicates that our model is robust and insensitive to these parameters.

Time Complexity. For keywords of news articles, in real-world industry applications, they are usually extracted in advance by highly efficient off-the-shelf tools and pre-defined vocabularies. For CIG construction, let N_s be the number of sentences in the two documents, N_w be the number of unique words in the documents, and N_k be the number of unique keywords in a document. Building the keyword graph requires O(N_s N_k + N_w²) complexity (Sayyadi and Raschid, 2013), and betweenness-based community detection requires O(N_k³). The complexity of sentence assignment and weight calculation is O(N_s N_k + N_k²). For graph classification, our model size is not big and can process document pairs efficiently.
5 Related Work

Graphical Document Representation. A majority of existing works can be generalized into four categories: word graph, text graph, concept graph, and hybrid graph. Word graphs use the words in a document as vertices, and construct edges based on syntactic analysis (Leskovec et al., 2004), co-occurrences (Zhang et al., 2018; Rousseau and Vazirgiannis, 2013; Nikolentzos et al., 2017) or preceding relations (Schenker et al., 2003). Text graphs use sentences, paragraphs or documents as vertices, and establish edges by word co-occurrence, location (Mihalcea and Tarau, 2004), text similarities (Putra and Tokunaga, 2017), or hyperlinks between documents (Page et al., 1999). Concept graphs link terms in a document to real-world concepts based on knowledge bases such as DBpedia (Auer et al., 2007), and construct edges based on syntactic/semantic rules. Hybrid graphs (Rink et al., 2010; Baker and Ellsworth, 2017) consist of different types of vertices and edges.

Text Matching. Traditional methods represent a text document as a vector of bag-of-words (BOW), term frequency inverse document frequency (TF-IDF), LDA (Blei et al., 2003) and so forth, and calculate the distance between vectors. However, they cannot capture the semantic distance and usually cannot achieve good performance.

In recent years, different neural network architectures have been proposed for text pair matching tasks. Representation-focused models usually transform text pairs into context representation vectors through a Siamese neural network, followed by a fully connected network or score function which gives the matching result based on the context vectors (Qiu and Huang, 2015; Wan et al., 2016; Liu et al., 2018; Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015).
Interaction-focused models extract the features of all pair-wise interactions between words in text pairs, and aggregate the interaction features by deep networks to give a matching result (Hu et al., 2014; Pang et al., 2016). However, the intrinsic structural properties of long text documents are not fully utilized by these neural models. Therefore, they cannot achieve good performance for long text pair matching.

There are also research works which utilize knowledge (Wu et al., 2018), hierarchical properties (Jiang et al., 2019) or graph structures (Nikolentzos et al., 2017; Paul et al., 2016) for long text matching. In contrast, our method represents documents by a novel graph representation and combines the representation with GCNs.

Finally, pre-training models such as BERT (Devlin et al., 2018) can also be utilized for text matching. However, such a model is of high complexity and can hardly satisfy the speed requirements of real-world applications.

Graph Convolutional Networks. We also contribute to the use of GCNs to identify the relationship between a pair of graphs, whereas previously, different GCN architectures have mainly been used for completing missing attributes/links (Kipf and Welling, 2016; Defferrard et al., 2016) or for node clustering or classification (Hamilton et al., 2017), but all within the context of a single graph, e.g., a knowledge graph, citation network or social network. In this work, the proposed Concept Interaction Graph takes a simple approach to represent a document by a weighted undirected graph, which essentially helps to decompose a document into subsets of sentences, each subset focusing on a different sub-topic or concept.

6 Conclusion

We propose the Concept Interaction Graph to organize documents into a graph of concepts, and introduce a divide-and-conquer approach to matching a pair of articles based on graphical decomposition and convolutional aggregation. We created two new datasets for long document matching with the help of professional editors, consisting of about 60K pairs of news articles, on which we have performed extensive evaluations. In the experiments, our proposed approaches significantly outperformed an extensive range of state-of-the-art schemes, including both term-based and deep-model-based text matching algorithms. Results suggest that the proposed graphical decomposition and the structural transformation by GCN layers are critical to the performance improvement in matching article pairs.
References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735.

Collin Baker and Michael Ellsworth. 2017. Graph methods for multilingual FrameNets. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 45–50.

Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Daniel Bruggermann, Yannik Hermey, Carsten Orth, Darius Schneider, Stefan Selzer, and Gerasimos Spanakis. 2016. Storyline detection and tracking using dynamic latent Dirichlet allocation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pages 9–19.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Yixing Fan, Liang Pang, JianPeng Hou, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2017. MatchZoo: A toolkit for deep text matching. arXiv preprint arXiv:1707.07270.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Jyun-Yu Jiang, Mingyang Zhang, Cheng Li, Mike Bendersky, Nadav Golbandi, and Marc Najork. 2019. Semantic text matching for long-form documents.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4):885–916.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500. Association for Computational Linguistics.

Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling. 2004. Learning sub-structures of document semantic graphs for document summarization.

Bang Liu, Di Niu, Kunfeng Lai, Linglong Kong, and Yu Xu. 2017. Growing story forest online from massive breaking news. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 777–785. ACM.

Bang Liu, Ting Zhang, Fred X. Han, Di Niu, Kunfeng Lai, and Yu Xu. 2018. Matching natural language sentences with hierarchical sentence factorization. In Proceedings of the 2018 World Wide Web Conference, pages 1237–1246. International World Wide Web Conferences Steering Committee.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pages 1291–1299. International World Wide Web Conferences Steering Committee.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.

Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning text similarity with Siamese recurrent networks. ACL 2016, page 148.

Giannis Nikolentzos, Polykarpos Meladianos, François Rousseau, Yannis Stavrakas, and Michalis Vazirgiannis. 2017. Shortest-path graph kernels for document similarity. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1890–1900.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In AAAI, pages 2793–2799.

Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration.

Christian Paul, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, and Pedro Szekely. 2016. Efficient graph-based document similarity. In European Semantic Web Conference, pages 334–349. Springer.

Marten Postma, Filip Ilievski, and Piek Vossen. 2018. SemEval-2018 task 5: Counting events and participants in the long tail. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 70–80.

Jan Wira Gotama Putra and Takenobu Tokunaga. 2017. Evaluating text coherence based on semantic similarity graph. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 76–85.

Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In IJCAI, pages 1305–1311.

Bryan Rink, Cosmin Adrian Bejan, and Sanda M. Harabagiu. 2010. Learning textual graph patterns to detect causal event relations. In FLAIRS Conference.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

François Rousseau and Michalis Vazirgiannis. 2013. Graph-of-word and TW-IDF: new approach to ad hoc IR. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 59–68. ACM.

Hassan Sayyadi and Louiqa Raschid. 2013. A graph analytical approach for topic detection. ACM Transactions on Internet Technology (TOIT), 13(2):4.

Adam Schenker, Mark Last, Horst Bunke, and Abraham Kandel. 2003. Clustering of web documents using a graph model. Series in Machine Perception and Artificial Intelligence, 55:3–18.

Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 373–382. ACM.

Dafna Shahaf, Jaewon Yang, Caroline Suen, Jeff Jacobs, Heidi Wang, and Jure Leskovec. 2013. Information cartography: creating zoomable, large-scale maps of information. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097–1105. ACM.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pages 373–374. ACM.

Piek Vossen, Tommaso Caselli, and Yiota Kontzopoulou. 2015. Storylines for structuring massive streams of news. In Proceedings of the First Workshop on Computing News Storylines, pages 40–49.

Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016. A deep architecture for semantic matching with multiple positional sentence representations. In AAAI, volume 16, pages 2835–2841.

Yu Wu, Wei Wu, Can Xu, and Zhoujun Li. 2018. Knowledge enhanced hybrid neural network for text matching. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ting Zhang, Bang Liu, Di Niu, Kunfeng Lai, and Yu Xu. 2018. Multiresolution graph attention networks for relevance matching. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 933–942. ACM.

Deyu Zhou, Haiyang Xu, and Yulan He. 2015. An unsupervised Bayesian modelling approach for storyline detection on news articles. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1943–1948.