Neural Natural Language Inference Models Enhanced with External Knowledge
Qian Chen (University of Science and Technology of China) cq1231@mail.ustc.edu.cn
Xiaodan Zhu (ECE, Queen's University) xiaodan.zhu@queensu.ca
Zhen-Hua Ling (University of Science and Technology of China) zhling@ustc.edu.cn
Diana Inkpen (University of Ottawa) diana@site.uottawa.ca
Si Wei (iFLYTEK Research) siwei@iflytek.com

Abstract

Modeling natural language inference is a very challenging task. With the availability of large annotated data, it has recently become feasible to train complex models such as neural-network-based inference models, which have been shown to achieve the state-of-the-art performance. Although there exist relatively large annotated data, can machines learn all the knowledge needed to perform natural language inference (NLI) from these data? If not, how can neural-network-based NLI models benefit from external knowledge, and how can NLI models be built to leverage it? In this paper, we enrich the state-of-the-art neural natural language inference models with external knowledge. We demonstrate that the proposed models improve neural NLI models and achieve the state-of-the-art performance on the SNLI and MultiNLI datasets.

1 Introduction

Reasoning and inference are central to both human and artificial intelligence. Natural language inference (NLI), also known as recognizing textual entailment (RTE), is an important NLP problem concerned with determining the inferential relationship (e.g., entailment, contradiction, or neutral) between a premise p and a hypothesis h. In general, modeling informal inference in language is a very challenging and basic problem towards achieving true natural language understanding.

In the last several years, larger annotated datasets were made available, e.g., the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017) datasets, which made it feasible to train rather complicated neural-network-based models that fit a large set of parameters to better model NLI. Such models have been shown to achieve the state-of-the-art performance (Bowman et al., 2015, 2016; Yu and Munkhdalai, 2017b; Parikh et al., 2016; Sha et al., 2016; Chen et al., 2017a,b; Tay et al., 2018).

While neural networks have been shown to be very effective in modeling NLI with large training data, they have often focused on end-to-end training by assuming that all inference knowledge is learnable from the provided training data. In this paper, we relax this assumption and explore whether external knowledge can further help NLI. Consider an example:

• p: A lady standing in a wheat field.
• h: A person standing in a corn field.

In this simplified example, when computers are asked to predict the relation between these two sentences and the training data do not provide the knowledge of the relationship between "wheat" and "corn" (e.g., if one of the two words does not appear in the training data or they are not paired in any premise-hypothesis pairs), it will be hard for computers to correctly recognize that the premise contradicts the hypothesis.

In general, although learning tabula rasa has achieved state-of-the-art performance in many tasks, we believe complicated NLP problems such as NLI
could benefit from leveraging knowledge accumulated by humans, particularly in a foreseeable future when machines are unable to learn it by themselves.

In this paper we enrich neural-network-based NLI models with external knowledge in the co-attention, local inference collection, and inference composition components. We show that the proposed model improves the state-of-the-art NLI models and achieves better performance on the SNLI and MultiNLI datasets. The advantage of using external knowledge is more significant when the size of training data is restricted, suggesting that if more knowledge can be obtained, it may bring more benefit. In addition to attaining the state-of-the-art performance, we are also interested in understanding how external knowledge contributes to the major components of typical neural-network-based NLI models.

2 Related Work

Early research on natural language inference and recognizing textual entailment was performed on relatively small datasets (refer to MacCartney (2009) for a good literature survey), and includes a large bulk of contributions made under the name of RTE, such as (Dagan et al., 2005; Iftene and Balahur-Dobrescu, 2007), among many others.

More recently, the availability of much larger annotated data, e.g., SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017), has made it possible to train more complex models. These models mainly fall into two types of approaches: sentence-encoding-based models and models that also use inter-sentence attention. Sentence-encoding-based models use a Siamese architecture (Bromley et al., 1993): parameter-tied neural networks are applied to encode both the premise and the hypothesis, and a neural network classifier is then applied to decide the relationship between the two sentences. Different neural networks have been utilized for sentence encoding, such as LSTMs (Bowman et al., 2015), GRUs (Vendrov et al., 2015), CNNs (Mou et al., 2016), BiLSTMs and their variants (Liu et al., 2016c; Lin et al., 2017; Chen et al., 2017b; Nie and Bansal, 2017), self-attention networks (Shen et al., 2017, 2018), and more complicated neural networks (Bowman et al., 2016; Yu and Munkhdalai, 2017a,b; Choi et al., 2017). Sentence-encoding-based models transform sentences into fixed-length vector representations, which may help a wide range of tasks (Conneau et al., 2017).

The second set of models use inter-sentence attention (Rocktäschel et al., 2015; Wang and Jiang, 2016; Cheng et al., 2016; Parikh et al., 2016; Chen et al., 2017a). Among them, Rocktäschel et al. (2015) were among the first to propose neural attention-based models for NLI. Chen et al. (2017a) proposed an enhanced sequential inference model (ESIM), which is one of the best models so far and is used as one of our baselines in this paper.
In this paper we enrich neural-network-based NLI models with external knowledge. Unlike early work on NLI (Jijkoun and de Rijke, 2005; MacCartney et al., 2008; MacCartney, 2009) that explores external knowledge in conventional NLI models on relatively small NLI datasets, we aim to merge the powerful modeling ability of neural networks with extra external inference knowledge. We show that the proposed model improves the state-of-the-art neural NLI models on the SNLI and MultiNLI datasets. The advantage of using external knowledge is more significant when the size of training data is restricted, suggesting that if more knowledge can be obtained, it may bring more benefit. In addition to attaining the state-of-the-art performance, we are also interested in understanding how external knowledge affects the major components of neural-network-based NLI models.

In general, external knowledge has been shown to be effective in neural networks for other NLP tasks, including word embedding (Chen et al., 2015; Faruqui et al., 2015; Liu et al., 2015; Wieting et al., 2015; Mrksic et al., 2017), machine translation (Shi et al., 2016; Zhang et al., 2017b), language modeling (Ahn et al., 2016), and dialogue systems (Chen et al., 2016b).

3 Neural-Network-Based NLI Models with External Knowledge

In this section we propose neural-network-based NLI models that incorporate external inference knowledge, which, as we will show later in Section 5, achieve the state-of-the-art performance. In addition to attaining the leading performance, we are also interested in investigating the effects of external knowledge on major components of neural-network-based NLI modeling.
Figure 1 shows a high-level view of the proposed framework. While specific NLI systems vary in their implementation, typical state-of-the-art NLI models contain the main components (or equivalents) of representing the premise and hypothesis sentences, collecting local (e.g., lexical) inference information, and aggregating and composing local information to make the global decision at the sentence level. We incorporate and investigate external knowledge accordingly in these major NLI components: computing co-attention, collecting local inference information, and composing inference to make the final decision.

3.1 External Knowledge

As discussed above, although there exist relatively large annotated data for NLI, can machines learn all the inference knowledge needed to perform NLI from the data? If not, how can neural-network-based NLI models benefit from external knowledge, and how can NLI models be built to leverage it?

We study the incorporation of external, inference-related knowledge in major components of neural networks for natural language inference. For example, intuitively, knowledge about synonymy, antonymy, hypernymy and hyponymy between given words may help model soft-alignment between premises and hypotheses; knowledge about hypernymy and hyponymy may help capture entailment; knowledge about antonymy and co-hyponyms (words sharing the same hypernym) may benefit the modeling of contradiction.

In this section, we discuss the incorporation of basic, lexical-level semantic knowledge into neural NLI components. Specifically, we consider external lexical-level inference knowledge between word w_i and word w_j, which is represented as a vector r_ij and is incorporated into the three specific components shown in Figure 1. We will discuss the details of how r_ij is constructed later in the experiment setup section (Section 4), and instead focus on the proposed model in this section. Note that while we study lexical-level inference knowledge in this paper, if inference knowledge about larger pieces of text pairs (e.g., inference relations between phrases) is available, the proposed model can be easily extended to handle it. Here, we instead let the NLI models compose lexical-level knowledge to obtain inference relations between larger pieces of text.

3.2 Encoding Premise and Hypothesis

As in much previous work (Chen et al., 2017a,b), we encode the premise and the hypothesis with bidirectional LSTMs (BiLSTMs). The premise is represented as a = (a_1, ..., a_m) and the hypothesis as b = (b_1, ..., b_n), where m and n are the lengths of the sentences. Then a and b are embedded into d_e-dimensional vectors [E(a_1), ..., E(a_m)] and [E(b_1), ..., E(b_n)] using the embedding matrix E ∈ R^{d_e×|V|}, where |V| is the vocabulary size and E can be initialized with pre-trained word embeddings. To represent words in their context, the premise and the hypothesis are fed into BiLSTM encoders (Hochreiter and Schmidhuber, 1997) to obtain context-dependent hidden states a^s and b^s:

a^s_i = Encoder(E(a), i),   (1)
b^s_j = Encoder(E(b), j),   (2)

where i and j indicate the i-th word in the premise and the j-th word in the hypothesis, respectively.
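The encoding step of Equations (1)-(2) can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' released implementation; the class name and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class NLIEncoder(nn.Module):
    """Embeds a token sequence and runs a BiLSTM over it (Equations 1-2)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300):
        super().__init__()
        # E in R^{d_e x |V|}; may be initialized with pre-trained embeddings.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # BiLSTM produces context-dependent hidden states a^s (or b^s).
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices of premise or hypothesis.
        embedded = self.embed(token_ids)       # (batch, seq_len, d_e)
        states, _ = self.bilstm(embedded)      # (batch, seq_len, 2 * hidden_dim)
        return states                          # a^s or b^s
```

The same parameter-tied encoder is applied to both the premise and the hypothesis.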
3.3 Knowledge-Enriched Co-Attention

As discussed above, soft-alignment of word pairs between the premise and the hypothesis may benefit from a knowledge-enriched co-attention mechanism. Given the relation features r_ij ∈ R^{d_r} between the premise's i-th word and the hypothesis's j-th word derived from the external knowledge, the co-attention is calculated as:

e_ij = (a^s_i)^T b^s_j + F(r_ij).   (3)

The function F can be any non-linear or linear function. In this paper, we use F(r_ij) = λ 1(r_ij), where λ is a hyper-parameter tuned on the development set and 1 is the indicator function:

1(r_ij) = 1 if r_ij is not a zero vector; 0 if r_ij is a zero vector.   (4)

Intuitively, word pairs with a semantic relationship, e.g., synonymy, antonymy, hypernymy, hyponymy and co-hyponyms, are likely to be aligned together. We will discuss how we construct the external knowledge later in Section 4. We have also tried a two-layer MLP as a universal function approximator for F to learn the underlying combination function, but did not observe further improvement over the best performance obtained on the development datasets.
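A sketch of Equations (3)-(4) follows, assuming the relation features have been precomputed into a dense tensor; this is an illustration rather than the authors' code, and the tensor shapes are assumptions.

```python
import torch

def co_attention(a_s, b_s, r, lam=1.0):
    """Knowledge-enriched co-attention, Equations (3)-(4).

    a_s: (batch, m, d) premise hidden states; b_s: (batch, n, d) hypothesis.
    r:   (batch, m, n, d_r) relation features from external knowledge.
    lam: the hyper-parameter lambda tuned on the development set.
    """
    dot = torch.einsum('bid,bjd->bij', a_s, b_s)   # (a^s_i)^T b^s_j
    # Indicator 1(r_ij): 1 where r_ij is a non-zero vector, 0 otherwise.
    indicator = (r.abs().sum(dim=-1) > 0).float()  # (batch, m, n)
    return dot + lam * indicator                   # e in R^{m x n}
```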
Figure 1: A high-level view of neural-network-based NLI models enriched with external knowledge in co-attention, local inference collection, and inference composition.

Soft-alignment is determined by the co-attention matrix e ∈ R^{m×n} computed in Equation (3), which is used to obtain the local relevance between the premise and the hypothesis. For the hidden state of the i-th word in the premise, i.e., a^s_i (already encoding the word itself and its context), the relevant semantics in the hypothesis is identified and composed into a context vector a^c_i using e_ij, more specifically with Equation (5):

α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik),   a^c_i = Σ_{j=1}^{n} α_ij b^s_j,   (5)
β_ij = exp(e_ij) / Σ_{k=1}^{m} exp(e_kj),   b^c_j = Σ_{i=1}^{m} β_ij a^s_i,   (6)

where α ∈ R^{m×n} and β ∈ R^{m×n} are the attention weight matrices normalized with respect to the second and first axes, respectively. The same calculation is performed for each word in the hypothesis, i.e., b^s_j, with Equation (6) to obtain the context vector b^c_j.

3.4 Local Inference Collection with External Knowledge

By comparing the inference-related semantic relation between a^s_i (the individual word representation in the premise) and a^c_i (the context representation from the hypothesis aligned to word a^s_i), we can model local inference (i.e., word-level inference) between aligned word pairs. Intuitively, for example, knowledge about hypernymy or hyponymy may help model entailment, and knowledge about antonymy and co-hyponyms may help model contradiction. Through comparing a^s_i and a^c_i, in addition to their relation from external knowledge, we can obtain word-level inference information for each word. The same calculation is performed for b^s_j and b^c_j. Thus, we collect knowledge-enriched local inference information:

a^m_i = G([a^s_i; a^c_i; a^s_i − a^c_i; a^s_i ◦ a^c_i; Σ_{j=1}^{n} α_ij r_ij]),   (7)
b^m_j = G([b^s_j; b^c_j; b^s_j − b^c_j; b^s_j ◦ b^c_j; Σ_{i=1}^{m} β_ij r_ji]),   (8)

where a heuristic matching trick with difference and element-wise product is used (Mou et al., 2016; Chen et al., 2017a). The last terms in Equations (7) and (8) are used to obtain word-level inference information from the external knowledge. Take Equation (7) as an example: r_ij is the relation feature between the i-th word in the premise and the j-th word in the hypothesis, but we care more about the semantic relation between aligned word pairs between the premise and the hypothesis. Thus, we use a soft-aligned version through the soft-alignment weights α_ij. For the i-th word in the premise, the last term in Equation (7) is word-level inference information based on external knowledge between the i-th word and the aligned words. The same calculation for the hypothesis is performed in Equation (8). G is a non-linear mapping function that reduces dimensionality. Specifically, we use a one-layer feed-forward neural network with the ReLU activation function and a shortcut connection, i.e., we concatenate the hidden states after the ReLU with the input Σ_{j=1}^{n} α_ij r_ij (or Σ_{i=1}^{m} β_ij r_ji) to form the output a^m_i (or b^m_j).
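The following sketch puts Equations (5)-(8) together. It is an illustration under two stated assumptions: a second tensor r_rev with r_rev[:, i, j] = r_ji is provided (relations such as hypernymy are directional, so it is not simply a transpose), and G is passed in as a one-layer ReLU feed-forward module; the shortcut concatenation described above is omitted for brevity.

```python
import torch

def collect_local_inference(a_s, b_s, e, r, r_rev, G):
    """Soft alignment (Eqs. 5-6) and local inference collection (Eqs. 7-8).

    e:     (batch, m, n) co-attention matrix from Equation (3).
    r:     (batch, m, n, d_r) with r[:, i, j] = r_ij.
    r_rev: (batch, m, n, d_r) with r_rev[:, i, j] = r_ji.
    """
    alpha = torch.softmax(e, dim=2)                  # normalize over j (Eq. 5)
    beta = torch.softmax(e, dim=1)                   # normalize over i (Eq. 6)
    a_c = torch.einsum('bij,bjd->bid', alpha, b_s)   # context vectors a^c
    b_c = torch.einsum('bij,bid->bjd', beta, a_s)    # context vectors b^c
    a_r = torch.einsum('bij,bijk->bik', alpha, r)       # sum_j alpha_ij r_ij
    b_r = torch.einsum('bij,bijk->bjk', beta, r_rev)    # sum_i beta_ij r_ji
    # Heuristic matching plus the soft-aligned knowledge term (Eqs. 7-8).
    a_m = G(torch.cat([a_s, a_c, a_s - a_c, a_s * a_c, a_r], dim=-1))
    b_m = G(torch.cat([b_s, b_c, b_s - b_c, b_s * b_c, b_r], dim=-1))
    return a_m, b_m, a_r, b_r  # a_r, b_r are reused in weighted pooling below
```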
3.5 Knowledge-Enhanced Inference Composition

In this component, we introduce knowledge-enriched inference composition. To determine the overall inference relationship between the premise and the hypothesis, we need a composition layer to compose the local inference vectors (a^m and b^m) collected above:

a^v_i = Composition(a^m, i),   (9)
b^v_j = Composition(b^m, j).   (10)

Here, we also use BiLSTMs as the building blocks of the composition layer, but the responsibility of the BiLSTMs in the inference composition layer is completely different from that in the input encoding layer. The BiLSTMs here read the local inference vectors (a^m and b^m) and learn to judge the types of local inference relationships and to distinguish crucial local inference vectors for the overall sentence-level inference relationship. Intuitively, the final prediction is likely to depend on word pairs appearing in the external knowledge that have some semantic relation. Our inference model converts the output hidden vectors of the BiLSTMs to a fixed-length vector with pooling operations and feeds it to the final classifier to determine the overall inference class. In particular, in addition to using mean pooling and max pooling as in ESIM (Chen et al., 2017a), we propose to use weighted pooling based on external knowledge to obtain a fixed-length vector, as in Equations (11) and (12):

a^w = Σ_{i=1}^{m} [ exp(H(Σ_{j=1}^{n} α_ij r_ij)) / Σ_{i=1}^{m} exp(H(Σ_{j=1}^{n} α_ij r_ij)) ] a^v_i,   (11)
b^w = Σ_{j=1}^{n} [ exp(H(Σ_{i=1}^{m} β_ij r_ji)) / Σ_{j=1}^{n} exp(H(Σ_{i=1}^{m} β_ij r_ji)) ] b^v_j.   (12)

In our experiments, we regard the function H as a one-layer feed-forward neural network with the ReLU activation function. We concatenate all pooling vectors, i.e., mean, max, and weighted pooling, into a fixed-length vector and then feed this vector into the final multilayer perceptron (MLP) classifier. The MLP has one hidden layer with tanh activation and a softmax output layer in our experiments. The entire model is trained end-to-end by minimizing the cross-entropy loss.
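The knowledge-based weighted pooling of Equations (11)-(12) can be sketched as below. That H produces a scalar score per position is an assumption consistent with, but not spelled out in, the text; the final sentence representations also concatenate mean and max pooling before the MLP classifier.

```python
import torch

def knowledge_weighted_pool(a_v, a_r, H):
    """Weighted pooling based on external knowledge, Equations (11)-(12).

    a_v: (batch, m, d) composed vectors from Equations (9)-(10).
    a_r: (batch, m, d_r) soft-aligned knowledge, sum_j alpha_ij r_ij,
         reused from local inference collection.
    H:   one-layer ReLU feed-forward network mapping d_r -> 1.
    """
    weights = torch.softmax(H(a_r), dim=1)   # normalize over positions
    return (weights * a_v).sum(dim=1)        # a^w (or b^w), a fixed-length vector
```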
4 Experiment Set-Up

4.1 Representation of External Knowledge

Lexical Semantic Relations. As described in Section 3.1, to incorporate external knowledge (as a knowledge vector r_ij) into the state-of-the-art neural-network-based NLI models, we first explore semantic relations in WordNet (Miller, 1995), motivated by MacCartney (2009). Specifically, the relations of lexical pairs are derived as described in (1)-(5) below. Instead of using the Jiang-Conrath WordNet distance metric (Jiang and Conrath, 1997), which does not improve the performance of our models on the development sets, we add a new feature, i.e., co-hyponyms, which consistently benefits our models.

(1) Synonymy: It takes the value 1 if the words in the pair are synonyms in WordNet (i.e., belong to the same synset), and 0 otherwise. For example, [felicitous, good] = 1, [dog, wolf] = 0.

(2) Antonymy: It takes the value 1 if the words in the pair are antonyms in WordNet, and 0 otherwise. For example, [wet, dry] = 1.

(3) Hypernymy: It takes the value 1 − n/8 if one word is a (direct or indirect) hypernym of the other word in WordNet, where n is the number of edges between the two words in the hierarchies, and 0 otherwise. Note that we ignore pairs in the hierarchy which have more than 8 edges in between. For example, [dog, canid] = 0.875, [wolf, canid] = 0.875, [dog, carnivore] = 0.75, [canid, dog] = 0.

(4) Hyponymy: It is simply the inverse of the hypernymy feature. For example, [canid, dog] = 0.875, [dog, canid] = 0.

(5) Co-hyponyms: It takes the value 1 if the two words have the same hypernym but do not belong to the same synset, and 0 otherwise. For example, [dog, wolf] = 1.

As discussed above, we expect features like synonymy, antonymy, hypernymy, hyponymy and co-hyponyms to help model co-attention alignment between the premise and the hypothesis. Knowledge of hypernymy and hyponymy may help capture entailment; knowledge of antonymy and co-hyponyms may help model contradiction. Their final contributions will be learned in end-to-end model training. We regard the vector r ∈ R^{d_r} as the relation feature derived from the external knowledge, where d_r is 5 here.
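A simplified sketch of deriving this five-dimensional relation vector with NLTK's WordNet interface is given below; the paper does not specify its extraction code, so this is an assumption-laden illustration (a real pipeline would also lemmatize words and cache lookups).

```python
# Requires: nltk and the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def relation_features(w1, w2):
    """Return [synonymy, antonymy, hypernymy, hyponymy, co-hyponyms]."""
    s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    same_synset = bool(s1 & s2)
    synonymy = 1.0 if same_synset else 0.0
    antonymy = 1.0 if any(ant.name() == w2
                          for syn in s1 for lem in syn.lemmas()
                          for ant in lem.antonyms()) else 0.0
    hypernymy = hyponymy = 0.0
    for a in s1:
        for b in s2:
            if a == b:
                continue
            # w2 is a (direct or indirect) hypernym of w1: feature 1 - n/8.
            if any(b in path for path in a.hypernym_paths()):
                n = a.shortest_path_distance(b)
                if n is not None and n <= 8:
                    hypernymy = max(hypernymy, 1.0 - n / 8.0)
            # w2 is a hyponym of w1 (the inverse direction).
            if any(a in path for path in b.hypernym_paths()):
                n = b.shortest_path_distance(a)
                if n is not None and n <= 8:
                    hyponymy = max(hyponymy, 1.0 - n / 8.0)
    co_hyponyms = 1.0 if not same_synset and any(
        set(a.hypernyms()) & set(b.hypernyms())
        for a in s1 for b in s2) else 0.0
    return [synonymy, antonymy, hypernymy, hyponymy, co_hyponyms]
```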
In addition, Table 1 reports some key statistics of these features.

Table 1: Statistics of lexical relation features.

Feature      | #Words | #Pairs
Synonymy     | 84,487 | 237,937
Antonymy     | 6,161  | 6,617
Hypernymy    | 57,475 | 753,086
Hyponymy     | 57,475 | 753,086
Co-hyponyms  | 53,281 | 3,674,700

In addition to the above relations, we also tried more relation features from WordNet, including instance, instance of, same instance, entailment, member meronym, member holonym, substance meronym, substance holonym, part meronym, and part holonym, summing up to 15 features, but these additional features do not bring further improvement on the development dataset, as also discussed in Section 5.

Relation Embeddings. In recent years, graph embedding has been widely employed to learn representations for vertices and their relations in a graph. In our work, we also capture the relation between any two words in WordNet through relation embeddings. Specifically, we employed TransE (Bordes et al., 2013), a widely used graph embedding method, to capture the relation embedding between any two words. We used two typical approaches to obtain the relation embeddings. The first directly uses the 18 relation embeddings pretrained on the WN18 dataset (Bordes et al., 2013): if a word pair has a certain type of relation, we take the corresponding relation embedding, and if a word pair has multiple relations among the 18 types, we take the average of the relation embeddings. The second approach uses TransE's word embeddings (trained on WordNet) to obtain the relation embedding through the objective function used in TransE, i.e., l ≈ t − h, where l indicates the relation embedding, t indicates the tail entity embedding, and h indicates the head entity embedding.

Note that in addition to relation embeddings trained on WordNet, other relational embedding resources exist, e.g., those trained on Freebase (WikiData) (Bollacker et al., 2007), but such knowledge resources are mainly about facts (e.g., the relationship between Bill Gates and Microsoft) and are less about the commonsense knowledge used in general natural language inference (e.g., that the color yellow potentially contradicts red).
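As a concrete illustration of the second approach, a relation vector can be recovered from TransE entity embeddings via l ≈ t − h. This is a hedged sketch: the lookup table `entity_vec` is assumed to come from TransE trained on WordNet, and all names are hypothetical.

```python
import numpy as np

def relation_embedding(head_word, tail_word, entity_vec):
    """Derive a relation vector from TransE entity embeddings: l ~ t - h."""
    h = entity_vec[head_word]   # head entity embedding
    t = entity_vec[tail_word]   # tail entity embedding
    return t - h                # approximate relation embedding l

# Illustrative usage with random 20-dimensional embeddings:
# entity_vec = {"dog": np.random.randn(20), "canid": np.random.randn(20)}
# r = relation_embedding("dog", "canid", entity_vec)
```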
4.2 NLI Datasets

In our experiments, we use the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) and the Multi-Genre Natural Language Inference (MultiNLI) dataset (Williams et al., 2017), which focus on three basic relations between a premise and a potential hypothesis: the premise entails the hypothesis (entailment), they contradict each other (contradiction), or they are not related (neutral). We use the same data splits as in previous work (Bowman et al., 2015; Williams et al., 2017) and classification accuracy as the evaluation metric. In addition, we test our models (trained on the SNLI training set) on a new test set (Glockner et al., 2018), which assesses the lexical inference abilities of NLI systems and consists of 8,193 samples. WordNet 3.0 (Miller, 1995) is used to extract semantic relation features between words. The words are lemmatized using Stanford CoreNLP 3.7.0 (Manning et al., 2014). The premise and hypothesis sentences fed into the input encoding layer are tokenized.

4.3 Training Details

For reproducibility, we release our code at https://github.com/lukecq1231/kim. All our models were strictly selected on the development set of the SNLI data and the in-domain development set of MultiNLI, and were then tested on the corresponding test sets. The main training details are as follows: the dimension of the hidden states of the LSTMs and of the word embeddings is 300. The word embeddings are initialized with 300D GloVe 840B vectors (Pennington et al., 2014), and out-of-vocabulary words are initialized randomly. All word embeddings are updated during training. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 0.0004. The mini-batch size is set to 32. Note that the above hyperparameter settings are the same as those used in the baseline ESIM (Chen et al., 2017a) model. ESIM is a strong NLI baseline framework with its source code made available at https://github.com/lukecq1231/nli (the ESIM core code has also been adapted to summarization (Chen et al., 2016a) and question-answering tasks (Zhang et al., 2017a)).
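The stated optimization setup, restated as a runnable PyTorch fragment; the small `model` here is only a stand-in for the full KIM network.

```python
import torch
import torch.nn as nn

# Stand-in for the full KIM network, just to show the optimization setup.
model = nn.Sequential(nn.Linear(300, 300), nn.Tanh(), nn.Linear(300, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)  # initial learning rate
criterion = nn.CrossEntropyLoss()  # the model is trained end-to-end with this loss
BATCH_SIZE = 32                    # mini-batch size
EMBED_DIM = HIDDEN_DIM = 300       # word-embedding and LSTM hidden dimensions
```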
The trade-off λ for calculating the co-attention in Equation (3) is selected from [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50] based on the development set. When training TransE for WordNet, relations are represented with vectors of 20 dimensions.

5 Experimental Results

5.1 Overall Performance

Table 2 shows the results of state-of-the-art models on the SNLI dataset. Among them, ESIM (Chen et al., 2017a) is one of the previous state-of-the-art systems, with an 88.0% test-set accuracy. The proposed model, namely the Knowledge-based Inference Model (KIM), which enriches ESIM with external knowledge, obtains an accuracy of 88.6%, the best single-model performance reported on the SNLI dataset. The difference between ESIM and KIM is statistically significant under the one-tailed paired t-test at the 99% significance level. Note that the KIM model reported here uses the five semantic relations described in Section 4. We also tried the 15 semantic relation features, which did not bring additional gains in performance. These results highlight the effectiveness of the five semantic relations described in Section 4. To further investigate external knowledge, we added the TransE relation embeddings, and again no further improvement was observed on either the development or test sets when the TransE relation embeddings were concatenated with the semantic relation vectors. We consider this to be due to the fact that the TransE embeddings are not specifically sensitive to inference information; e.g., they do not model the co-hyponyms feature, and their potential benefit has already been covered by the semantic relation features used.

Table 2: Accuracies of models on SNLI.

Model                                    | Test
LSTM Att. (Rocktäschel et al., 2015)     | 83.5
DF-LSTMs (Liu et al., 2016a)             | 84.6
TC-LSTMs (Liu et al., 2016b)             | 85.1
Match-LSTM (Wang and Jiang, 2016)        | 86.1
LSTMN (Cheng et al., 2016)               | 86.3
Decomposable Att. (Parikh et al., 2016)  | 86.8
NTI (Yu and Munkhdalai, 2017b)           | 87.3
Re-read LSTM (Sha et al., 2016)          | 87.5
BiMPM (Wang et al., 2017)                | 87.5
DIIN (Gong et al., 2017)                 | 88.0
BCN + CoVe (McCann et al., 2017)         | 88.1
CAFE (Tay et al., 2018)                  | 88.5
ESIM (Chen et al., 2017a)                | 88.0
KIM (This paper)                         | 88.6

Table 3 shows the performance of models on the MultiNLI dataset. The baseline ESIM achieves 76.8% and 75.8% on the in-domain and cross-domain test sets, respectively. When we extend ESIM with external knowledge, we achieve significant gains to 77.2% and 76.4%, respectively.

Table 3: Accuracies of models on MultiNLI. * indicates models using the extra SNLI training set.

Model                              | In   | Cross
CBOW (Williams et al., 2017)       | 64.8 | 64.5
BiLSTM (Williams et al., 2017)     | 66.9 | 66.9
DiSAN (Shen et al., 2017)          | 71.0 | 71.4
Gated BiLSTM (Chen et al., 2017b)  | 73.5 | 73.6
SS BiLSTM (Nie and Bansal, 2017)   | 74.6 | 73.6
DIIN * (Gong et al., 2017)         | 77.8 | 78.8
CAFE (Tay et al., 2018)            | 78.7 | 77.9
ESIM (Chen et al., 2017a)          | 76.8 | 75.8
KIM (This paper)                   | 77.2 | 76.4
Again, the gains are consistent on SNLI and MultiNLI, and we expect they would be orthogonal to other factors when external knowledge is added into other state-of-the-art models.

5.2 Ablation Results

Figure 2 displays the ablation analysis of the different components using the external knowledge. To compare the effects of external knowledge under different training data scales, we randomly sample different ratios of the entire training set, i.e., 0.8%, 4%, 20% and 100%. "A" indicates adding external knowledge in calculating the co-attention matrix as in Equation (3), "I" indicates adding external knowledge in collecting local inference information as in Equations (7) and (8), and "C" indicates adding external knowledge in composing inference as in Equations (11) and (12). When we have only restricted training data, i.e., the 0.8% training set (about 4,000 samples), the baseline ESIM has a poor accuracy of 62.4%. When we only add external knowledge in calculating co-attention ("A"), the accuracy increases to 66.6% (+4.2% absolute). When we only utilize external knowledge in collecting local inference information ("I"), the accuracy achieves a significant gain, to 70.3% (+7.9% absolute). When we only add external knowledge in inference composition ("C"), the accuracy obtains a smaller gain, to 63.4% (+1.0% absolute). The comparison indicates that "I" plays the most important role among the three components in using external knowledge.
Moreover, when we compose the three components ("A, I, C"), we obtain the best result of 72.6% (+10.2% absolute). When we use more training data, i.e., 4%, 20%, or 100% of the training set, only "I" achieves a significant gain; "A" and "C" do not bring any significant improvement. The results indicate that external semantic knowledge only helps co-attention and composition when training data is limited, but always helps in collecting local inference information. Meanwhile, with less training data, λ is usually set to a larger value. For example, the optimal λ on the development set is 20 for the 0.8% training set, 2 for the 4% training set, 1 for the 20% training set, and 0.2 for the 100% training set.

Figure 2: Accuracies of models incorporating external knowledge into different NLI components, under different sizes of training data (0.8%, 4%, 20%, and the entire training data).

Figure 3 displays the results of using different ratios of the external knowledge (randomly keeping different percentages of the whole set of lexical semantic relations) under different sizes of training data. Note that here we only use external knowledge in collecting local inference information, as it works well for all scales of the training set. Better accuracies are achieved when using more external knowledge. Especially under the condition of restricted training data (0.8%), the model obtains a large gain when using more than half of the external knowledge.

Figure 3: Accuracies of models under different sizes of external knowledge. More external knowledge corresponds to higher accuracies.

5.3 Analysis on the (Glockner et al., 2018) Test Set

In addition, Table 4 shows the results on a newly published test set (Glockner et al., 2018). Compared with their performance on the SNLI test set, the performance of the three baseline models dropped substantially on the (Glockner et al., 2018) test set, with the differences ranging from 22.3% to 32.8% in accuracy. Instead, the proposed KIM achieves 83.5% on this test set (with only a 5.1% drop in performance), which demonstrates its better ability to utilize lexical-level inference and hence its better generalizability.

Table 4: Accuracies of models on the SNLI and (Glockner et al., 2018) test sets. * indicates results taken from (Glockner et al., 2018).

Model                    | SNLI | Glockner's (∆)
(Parikh et al., 2016)*   | 84.7 | 51.9 (-32.8)
(Nie and Bansal, 2017)*  | 86.0 | 62.2 (-23.8)
ESIM *                   | 87.9 | 65.6 (-22.3)
KIM (This paper)         | 88.6 | 83.5 (-5.1)

Table 5 displays the accuracy of ESIM and KIM in each replacement-word category of the (Glockner et al., 2018) test set. KIM outperforms ESIM in 13 out of 14 categories, and only performs worse on synonyms.

5.4 Analysis by Inference Categories

We perform further analysis (Table 6) using the supplementary annotations provided by the MultiNLI dataset (Williams et al., 2017), which cover 495 samples (about 1/20 of the entire development set) for both the in-domain and cross-domain sets. We compare against the outputs of the ESIM model across 13 categories of inference. Table 6 reports the results. We can see that KIM outperforms ESIM on overall accuracy on both the in-domain and
cross-domain subsets of the development set. KIM outperforms or equals ESIM in 10 out of 13 categories in the cross-domain setting, but in only 7 out of 13 categories in the in-domain setting. This indicates that external knowledge helps more in the cross-domain setting. In particular, for the antonym category in the cross-domain set, KIM outperforms ESIM significantly (+5.0% absolute), as expected, because the antonymy feature captured by external knowledge helps on unseen cross-domain samples.

Table 5: The number of instances and the accuracy per category achieved by ESIM and KIM on the (Glockner et al., 2018) test set.

Category          | Instances | ESIM | KIM
Antonyms          | 1,147     | 70.4 | 86.5
Cardinals         | 759       | 75.5 | 93.4
Nationalities     | 755       | 35.9 | 73.5
Drinks            | 731       | 63.7 | 96.6
Antonyms WordNet  | 706       | 74.6 | 78.8
Colors            | 699       | 96.1 | 98.3
Ordinals          | 663       | 21.0 | 56.6
Countries         | 613       | 25.4 | 70.8
Rooms             | 595       | 69.4 | 77.6
Materials         | 397       | 89.7 | 98.7
Vegetables        | 109       | 31.2 | 79.8
Instruments       | 65        | 90.8 | 96.9
Planets           | 60        | 3.3  | 5.0
Synonyms          | 894       | 99.7 | 92.1
Overall           | 8,193     | 65.6 | 83.5

Table 6: Detailed analysis on MultiNLI.

                  | In-domain     | Cross-domain
Category          | ESIM  | KIM   | ESIM  | KIM
Active/Passive    | 93.3  | 93.3  | 100.0 | 100.0
Antonym           | 76.5  | 76.5  | 70.0  | 75.0
Belief            | 72.7  | 75.8  | 75.9  | 79.3
Conditional       | 65.2  | 65.2  | 61.5  | 69.2
Coreference       | 80.0  | 76.7  | 75.9  | 75.9
Long sentence     | 82.8  | 78.8  | 69.7  | 73.4
Modal             | 80.6  | 79.9  | 77.0  | 80.2
Negation          | 76.7  | 79.8  | 73.1  | 71.2
Paraphrase        | 84.0  | 72.0  | 86.5  | 89.2
Quantity/Time     | 66.7  | 66.7  | 56.4  | 59.0
Quantifier        | 79.2  | 78.4  | 73.6  | 77.1
Tense             | 74.5  | 78.4  | 72.2  | 66.7
Word overlap      | 89.3  | 85.7  | 83.8  | 81.1
Overall           | 77.1  | 77.9  | 76.7  | 77.4

5.5 Case Study

Table 7 includes some examples from the SNLI test set where KIM successfully predicts the inference relation and ESIM fails.

Table 7: Examples. Words in bold are key words in making the final prediction. P indicates the predicted label and G indicates the gold-standard label. e and c denote entailment and contradiction, respectively.

P/G | Sentences
e/c | p: An African person standing in a wheat field.
    | h: A person standing in a corn field.
e/c | p: Little girl is flipping an omelet in the kitchen.
    | h: A young girl cooks pancakes.
c/e | p: A middle eastern marketplace.
    | h: A middle easten store.
c/e | p: Two boys are swimming with boogie boards.
    | h: Two boys are swimming with their floats.

In the first example, the premise is "An African person standing in a wheat field" and the hypothesis is "A person standing in a corn field". The KIM model knows that "wheat" and "corn" are both a kind of cereal, i.e., the co-hyponyms relationship in our relation features, and therefore correctly predicts that the premise contradicts the hypothesis. The baseline ESIM cannot learn the relationship between "wheat" and "corn" effectively due to the lack of such samples in the training sets; with the help of the external knowledge that "wheat" and "corn" have the same hypernym "cereal", KIM predicts contradiction correctly.

6 Conclusions

Our neural-network-based model for natural language inference with external knowledge, namely KIM, achieves the state-of-the-art accuracies. The model is equipped with external knowledge in its main components, specifically, in calculating co-attention, collecting local inference, and composing inference. We provide detailed analyses of our model and results. The proposed model of infusing neural networks with external knowledge may also help shed some light on tasks other than NLI.

Acknowledgments

We thank Yibo Sun and Bing Qin for early helpful discussions.
References

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. CoRR, abs/1608.00318.

Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A shared database of structured general human knowledge. In AAAI 2007, pages 1962–1963.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS 2013, pages 2787–2795.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP 2015, pages 632–642.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In ACL 2016.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a siamese time delay neural network. In NIPS 1993, pages 737–744.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016a. Distraction-based neural networks for modeling documents. In IJCAI 2016, pages 2754–2760.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017a. Enhanced LSTM for natural language inference. In ACL 2017, pages 1657–1668.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In RepEval@EMNLP 2017, pages 36–40.

Yun-Nung Chen, Dilek Z. Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Jianfeng Gao, and Li Deng. 2016b. Knowledge as a teacher: Knowledge-guided structural attention networks. CoRR, abs/1609.03286.

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Revisiting word embedding for contrasting meaning. In ACL 2015, pages 106–115.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In EMNLP 2016, pages 551–561.

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2017. Unsupervised learning of task-specific tree structures with tree-LSTMs. CoRR, abs/1707.02786.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP 2017, pages 670–680.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop (MLCW 2005), pages 177–190.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In NAACL HLT 2015, pages 1606–1615.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In ACL 2018.

Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. CoRR, abs/1709.04348.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Adrian Iftene and Alexandra Balahur-Dobrescu. 2007. Hypothesis transformation and semantic variability rules used in recognizing textual entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING 1997, pages 19–33.

Valentin Jijkoun and Maarten de Rijke. 2005. Recognizing textual entailment using lexical similarity. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. CoRR, abs/1703.03130.

Pengfei Liu, Xipeng Qiu, Jifan Chen, and Xuanjing Huang. 2016a. Deep fusion LSTMs for text semantic matching. In ACL 2016.

Pengfei Liu, Xipeng Qiu, Yaqian Zhou, Jifan Chen, and Xuanjing Huang. 2016b. Modelling interaction of sentence pair with coupled-LSTMs. In EMNLP 2016, pages 1703–1712.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In ACL 2015, pages 1501–1511.

Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016c. Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR, abs/1605.09090.

Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

Bill MacCartney, Michel Galley, and Christopher D. Manning. 2008. A phrase-based alignment model for natural language inference. In EMNLP 2008, pages 802–811.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL 2014, System Demonstrations, pages 55–60.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS 2017, pages 6297–6308.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In ACL 2016.

Nikola Mrksic, Ivan Vulic, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gasic, Anna Korhonen, and Steve J. Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. CoRR, abs/1706.00374.

Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In RepEval@EMNLP 2017, pages 41–45.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP 2016, pages 2249–2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pages 1532–1543.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. CoRR, abs/1509.06664.

Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In COLING 2016, pages 2870–2879.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2017. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. CoRR, abs/1709.04696.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018. Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling. CoRR, abs/1801.10296.

Chen Shi, Shujie Liu, Shuo Ren, Shi Feng, Mu Li, Ming Zhou, Xu Sun, and Houfeng Wang. 2016. Knowledge-based semantic embedding for machine translation. In ACL 2016.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. A compare-propagate architecture with alignment factorization for natural language inference. CoRR, abs/1801.00102.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. CoRR, abs/1511.06361.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In NAACL HLT 2016, pages 1442–1451.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In IJCAI 2017, pages 4144–4150.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. TACL, 3:345–358.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. CoRR, abs/1704.05426.

Hong Yu and Tsendsuren Munkhdalai. 2017a. Neural semantic encoders. In EACL 2017, pages 397–407.

Hong Yu and Tsendsuren Munkhdalai. 2017b. Neural tree indexers for text understanding. In EACL 2017, pages 11–21.

Junbei Zhang, Xiaodan Zhu, Qian Chen, Lirong Dai, Si Wei, and Hui Jiang. 2017a. Exploring question understanding and adaptation in neural-network-based question answering. CoRR, abs/1703.04617.

Shiyue Zhang, Gulnigar Mahmut, Dong Wang, and Askar Hamdulla. 2017b. Memory-augmented Chinese-Uyghur neural machine translation. In APSIPA ASC 2017, pages 1092–1096.