Addressing Discourse and Document Structure in the RTE Search Task
Shachar Mirkin§, Roy Bar-Haim§, Jonathan Berant†, Ido Dagan§, Eyal Shnarch§, Asher Stern§, Idan Szpektor§
§ Computer Science Department, Bar-Ilan University, Ramat-Gan 52900, Israel
† The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel

Abstract

This paper describes Bar-Ilan University's submissions to RTE-5. This year we focused on the Search pilot, enhancing our entailment system to address two main issues introduced by this new setting: scalability and, primarily, document-level discourse. Our system achieved the highest score on the Search task amongst participating groups, and this paper proposes first steps towards addressing this challenging setting.

1 Introduction

Bar-Ilan's research focused this year on the Search task, which brings about new challenges to entailment systems. In this work we put two new aspects of the task in the center of attention, namely scalability and discourse-based inference, and enhanced our core textual entailment engine to address them.

While the RTE-5 search dataset is relatively small, we aim at a scalable system that can search for entailing texts over large corpora. To that end, we apply as a first step a retrieval component which considers each hypothesis as a query, expanding its terms using lexical entailment resources. Only sentences with sufficiently high lexical match are retrieved and considered as candidate entailing texts, to be classified by the entailment engine, thus saving the system a considerable amount of processing.

In the Search task, sentences are situated within a set of documents. They rely on other sentences for their interpretation, and their entailment is therefore dependent on other sentences as well. Hence, discourse and document-level information plays a crucial role in the inference process (Bentivogli et al., 2009). In this work we identified several types of discourse phenomena which occur in a discourse-dependent setting and are relevant for inference. As existing tools for coreference and discourse processing provide only limited solutions for such phenomena, we suggest methods to address their gaps. In particular, we examined complementary methods to identify coreferring phrases as well as some types of bridging relations which are realized in the form of "global information" perceived as known for entire documents. As a first step, we considered phrase pairs with a certain degree of lexical overlap as potentially coreferring, but only if no semantic incompatibility is found between them. For instance, noun phrases which have the same head but whose modifiers are antonyms are ruled out. We addressed the issue of global information by identifying and weighting prominent document terms and allowing their inference even when they are not explicitly mentioned in a sentence. To account for coherence-related discourse phenomena – such as the tendency of entailing sentences to be adjacent to each other – we apply a two-phase classification scheme, where a second-phase meta-classifier is applied, extracting features that consider the initial independent classification of each sentence.

Using the above ideas and methods, our system obtained a micro-averaged F1 score of 44.59% on the Search task.

The rest of this paper is organized as follows. In Section 2 we describe the core system used for running both the Main and the Search tasks, highlighting the differences relative to the system used in our RTE-4 submission. In Section 3 we describe the Main task's submission. In Section 4 we present our approach to the Search task, followed by a description of the retrieval module (Section 5) and the way we address discourse aspects of the task (Section 6). A description of the submitted systems and their results is given in Section 7. Section 8 contains conclusions and suggestions for future work.
2 The BIUTEE System

For both the Main and Search tasks we used the Bar-Ilan University Textual Entailment Engine (BIUTEE), based on the system used for our RTE-4 submission (Bar-Haim et al., 2008). BIUTEE applies transformations over the text parse-tree using a knowledge-base of diverse types of entailment rules. These transformations generate many consequents (new texts entailed from the original one), whose parse trees are efficiently stored in a packed representation, termed Compact Forest (Bar-Haim et al., 2009). A classifier then makes the entailment decision by assessing the coverage of the hypothesis by the generated consequents, compensating for knowledge gaps in the available rules.

The following changes were applied to BIUTEE in comparison with (Bar-Haim et al., 2008): (a) several syntactic features are added to our classification module, as described below; (b) a component for supplementing coreference relations is added (see Section 6.1); (c) a different set of entailment resources is employed, based on performance measured on the development set. Further enhancements of the system for accommodating the Search task are described in Section 6.

Knowledge resources. A variety of knowledge resources may be employed to induce parse-tree transformations, as long as the knowledge can be represented as entailment rules (denoted LHS ⇒ RHS). In our submissions, the following resources for entailment rules were utilized. See Sections 3 and 7 for the specific subsets of resources used in each run:

• Syntactic rules: These rules capture entailment inferences associated with common syntactic constructs, such as conjunction, relative clause, apposition, etc. (Bar-Haim et al., 2007).

• WordNet (Fellbaum, 1998): The following WordNet 3.0 relationships were used: synonymy, hyponymy (two levels away from the original term), the hyponym-instance relation and derivation.

• Wiki: All rules from the Wikipedia-based resource (Shnarch et al., 2009) with DICE co-occurrence score above 0.01.

• DIRT: The DIRT algorithm (Lin and Pantel, 2001) learns entailment rules between binary predicates, e.g. X explain to Y ⇒ X talk to Y. We used the version described in (Szpektor and Dagan, 2007), which learns canonical rule forms applied over the Reuters Corpus, Volume 1 (RCV1, http://trec.nist.gov/data/reuters/reuters.html).

The above resources are identical to the ones used in our RTE-4 submission. For RTE-5 we also used the following sources of entailment rules:

• Snow: Snow et al.'s (2006) extension to WordNet 2.1 with 400,000 additional nodes.

• XWN: A resource based on Extended WordNet (Moldovan and Rus, 2001), as described in (Mirkin et al., 2009).

• A geographic resource, denoted Geo, based on TREC's TIPSTER gazetteer. We created "meronymy" entailment rules such that each location entails the location entities in which it is found. For instance, a city entails the county, the state, the country and the continent in which it is located, and a country entails its continent. To attend to the ambiguity of location names, which are often polysemous with common nouns, this resource was applied only when the candidate geographic name in the text was identified as representing a location by the Stanford Named Entity Recognizer (Finkel et al., 2005).

• Abbr: A resource containing about 2000 rules for abbreviations, where the abbreviation entails the complete phrase (e.g. MSG ⇒ Monosodium Glutamate). Rules in this resource were generated based on the abbreviation lists of BADC (http://badc.nerc.ac.uk/help/abbrevs.html) and Acronym-Guide (http://www.acronym-guide.com/).

Lastly, we added a small set of rules we developed addressing lexical variability involving temporal phrases. These rules are based on regular expressions and are generated on the fly. For example, the occurrence of the date 31/1/1948 in the text triggers the generation of a set of entailment rules including: 31/1/1948 ⇒ {31/1, January, January 1948, 20th century, forties}, etc. We refer to this resource as DateRG (Date Rule Generator).
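To make the DateRG mechanism concrete, the following is a minimal sketch rather than the actual implementation: it spots dd/mm/yyyy dates with a regular expression and emits a few of the entailed variants from the example above. The exact rule inventory, the century computation and the decade names are illustrative assumptions.

```python
import re

MONTHS = {1: "January", 2: "February", 3: "March", 4: "April", 5: "May", 6: "June",
          7: "July", 8: "August", 9: "September", 10: "October", 11: "November",
          12: "December"}
DECADES = {4: "forties", 5: "fifties", 6: "sixties", 7: "seventies",
           8: "eighties", 9: "nineties"}
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def date_rules(text):
    """Generate on-the-fly lexical entailment rules (lhs -> rhs) for each dd/mm/yyyy date."""
    rules = []
    for m in DATE_RE.finditer(text):
        day, month, year = m.group(1), int(m.group(2)), int(m.group(3))
        if month not in MONTHS:
            continue
        lhs = m.group(0)
        rhs = [f"{day}/{month}",                      # 31/1
               MONTHS[month],                         # January
               f"{MONTHS[month]} {year}",             # January 1948
               f"{(year - 1) // 100 + 1}th century",  # 20th century (ordinal suffix simplified)
               str(year)]
        decade = (year % 100) // 10
        if decade in DECADES:
            rhs.append(DECADES[decade])               # forties
        rules.extend((lhs, r) for r in rhs)
    return rules

print(date_rules("The state was declared on 31/1/1948."))
```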
Classification. BIUTEE's classification component is based on a set of lexical and lexical-syntactic features, as described in (Bar-Haim et al., 2008). Analysis of those features showed that the lexical features have the most significant impact on the classifier's decision. Thus, we have engineered an additional set of lexical-syntactic features:

1. A binary feature checking if the main predicate of the hypothesis is covered by the text. The main predicate is found by choosing the predicate node closest to the parse-tree root. If no node is labeled as a predicate, we choose the content-word node closest to the root. On the development set this method correctly identifies the main predicate of the hypothesis in approximately 95% of the cases.

2. Features measuring the match between the subject and the object of the hypothesis' main predicate and the corresponding predicate's arguments in the text.

3. A feature measuring the proportion of NP-heads in the hypothesis that are covered by the text.

3 The Main Task

We submitted two runs for the 2-way Main task, denoted Main-BIU1 and Main-BIU2. Main-BIU1 uses the following resources for entailment rules: the syntactic rules resource, WordNet, Wiki, DIRT, Geo, Abbr and DateRG. Main-BIU2 contains the same set of resources with the exception of Geo. Table 1 details the accuracy results achieved by our system on the Main task's test set.

Table 1: Results of our runs on the Main task's test set.

  Run         Accuracy (%)
  Main-BIU1   63.00
  Main-BIU2   63.80

Table 2 shows the results of ablation tests relative to Main-BIU1. As evident from the table, the only resource that clearly provides leverage is WordNet, though performance was also improved by using DIRT. These two results are consistent with our previous ones (Bar-Haim et al., 2008), while Wiki, which was helpful in previous work, was not in this case. Further analysis is required to determine the reason for performance degradation by this and other resources. The two preliminary resources that handle abbreviations and temporal phrases did not provide any marginal contribution over the other resources and are therefore excluded from the table.

Table 2: Results of ablation tests relative to Main-BIU1. The columns from left to right specify, respectively, the name of the resource removed in each ablation test, the accuracy achieved without it, and the marginal contribution of the resource. Negative figures indicate that the removal of the resource increased the system's accuracy.

  Resource removed   Accuracy (%)   ΔAccuracy (%)
  WordNet            60.50           2.50
  DIRT               61.67           1.33
  Geo                63.80          -0.80
  Wiki               64.00          -1.00

4 Addressing the Search Task

The pilot Search task presents new challenges to inference systems. In this task, an inference system is required to identify all sentences that entail a certain hypothesis in a given (small) corpus. In comparison to previous RTE challenges, the task is closer to a practical application setting and better corresponds to the natural distribution of entailing texts in a corpus.

The task may seem at first look like a variant of Information Retrieval (IR), as it requires finding specific texts in a large corpus. Yet, it is fundamentally different from IR for two main reasons. First, the target output is a set of sentences, each one of them evaluated independently, rather than a set of documents. Consequently, a system has to handle target texts which are not self-contained, but are rather dependent on their surrounding text. Hence, discourse is a crucial factor. Second, the decision criterion is entailment rather than relevancy.

A naïve approach may be applied to the task by reducing it to a set of text-hypothesis pairs and applying Main-task techniques on each pair. However, as evident from the development set, where entailing sentences account for merely 4% of the sentences (810 out of over 20,000 possible sentence-hypothesis pairs), such an approach is highly inefficient, and might not be feasible for larger corpora. Note that only limited processing of test sentences can be done in advance, while most of the computational effort is required at inference time, i.e. when the sentence is assessed for entailment of a specific given hypothesis. Hence, we chose to address the Search task with an approach in the spirit of IR (passage retrieval) for Question Answering (e.g. (Tellex et al., 2003)):

First, we apply a simple and fast method to filter the sentences based on lexical coverage of the hypothesis in each sentence, discarding from further processing any document in which no relevant sentences are found. Such a filter reduces significantly the amount of sentences that require deeper processing, while allowing a tradeoff between precision and recall, as required. Next, we process and enrich non-filtered sentences with discourse and document-level information. These sentences are then classified by a set of supervised classifiers, based on features extracted for each sentence independently. Meta-features are then extracted at the document level based on the output of the aforementioned classifiers, and a meta-classifier is applied to determine the final classification.

The details of our retrieval module, the implementation for addressing discourse issues and the two-tier classification process are described in the next two sections.
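The lexical filtering step just outlined (and detailed in Section 5 below) can be sketched as follows. This is illustrative code, not the actual module: the real system stems terms, removes stop-words and tunes the threshold per resource set, whereas the "stems" and the single expansion rule below are hand-picked to mirror the Spain example discussed in Section 5.

```python
def coverage(hypothesis_stems, sentence_stems, rules):
    """Fraction of hypothesis content words covered by the sentence: a word is covered
    if an identical stem occurs in the sentence or a lexical rule ws => wh applies."""
    if not hypothesis_stems:
        return 0.0
    covered = sum(1 for wh in hypothesis_stems
                  if wh in sentence_stems or any((ws, wh) in rules for ws in sentence_stems))
    return covered / len(hypothesis_stems)

def retrieve_candidates(hypothesis_stems, doc_sentences, rules, threshold):
    """Keep only sentences whose coverage meets the threshold; documents in which no
    sentence passes are discarded from further processing."""
    return [s for s in doc_sentences
            if coverage(hypothesis_stems, set(s), rules) >= threshold]

# Toy stems for h: "Spain took steps to legalize homosexual marriages" and the
# covering sentence from Section 5 (simplified by hand).
h = ["spain", "take", "step", "legalize", "homosexual", "marriage"]
s = ["spain", "prime", "minister", "make", "legalize", "gay", "marriage",
     "key", "element", "social", "policy"]
print(coverage(h, set(s), rules=set()))                     # 3/6 = 0.5
print(coverage(h, set(s), rules={("gay", "homosexual")}))   # 4/6, via WordNet synonymy
```

Without any expansion only half of the hypothesis words are covered; the synonymy rule raises the coverage to 2/3, so the sentence would pass, for example, a 50% or 60% threshold.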
5 Candidate Retrieval

The retrieval module of our system is employed to identify candidates for entailment: for each hypothesis h, it retrieves candidate sentences based on their term coverage of h. A word wh in h is covered by a word ws in a sentence s if they are either identical (in terms of their stems; for stemming we used the Porter Stemmer from http://www.tartarus.org/~martin/PorterStemmer) or if a lexical entailment rule ws ⇒ wh is found in the currently employed resource set. A sentence s is retrieved for a hypothesis h if its coverage of h (the percentage of covered content words in h) is equal to or greater than a certain predefined threshold. The threshold is set empirically by tuning it over the development set for each set of resources employed.

At preprocessing, each sentence in the test-set corpus is tokenized, stemmed and stop-words are removed. Given a hypothesis, it is processed the same way. We then utilize lexical resources to apply entailment-based expansion of the hypothesis' content words in order to obtain higher coverage by the corpus sentences and, consequently, a higher recall. For example, the following sentence covers three out of the six content words of the hypothesis simply by means of (stemmed) word identity:

h: "Spain took steps to legalize homosexual marriages"
s: "Spain's Prime Minister ... made legalising gay marriages a key element of his social policy."

Using WordNet's synonymy rule gay ⇔ homosexual increases the coverage from 1/2 to 2/3.

This retrieval process can be performed within minutes on the entire development or test set, with any set of the resources we employed.

6 Discourse Aspects of the Search Task

As mentioned, discourse aspects play a key role in the Search task. We therefore analyzed a sample of the development set's sentence-hypothesis pairs, looking for discourse phenomena that are involved in the inference process. In the following subsections we describe the prominent discourse and document-structure phenomena we have identified and addressed in our implementation. These phenomena are typically poorly addressed by available reference resolvers and discourse processing tools, or fall completely out of their scope.

6.1 Non-conflicting coreference matching

A large number of coreference relations in our sample are comprised of terms which share lexical elements, such as the airliner's first flight and the Airbus A380's first flight. Although such relations are common in coreference, it turns out that standard coreference resolution tools miss many of these cases.

For the purpose of identifying additional coreferring terms, we consider two noun phrases in the same document as coreferring if: (i) their heads are identical and (ii) no semantic incompatibility is found between their modifiers. The types of incompatibility we handle in our current implementation are antonymy and mismatching numbers. For example, two nodes of the noun distance would be considered incompatible if one is modified by short and the second by long. Similarly, two nodes for dollars are considered incompatible if they are modified by different numbers. By allowing such lenient matches we compensate for missing coreference relations, potentially resulting in an increased overall system recall. The precision of this method may be further improved by adding more types of constraints to discard incompatible pairs. For example, it can be verified that modifiers are not co-hyponyms (e.g. dog food, cat food) or otherwise semantically disjoint. These additional coreference relationships are augmented to each document prior to the classification stage.
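A minimal sketch of this compatibility test follows. It assumes that NP heads and modifiers have already been extracted from the parse trees, uses a toy antonym list as a stand-in for a lexical resource such as WordNet, and simplifies the number check; it is meant to illustrate the criterion, not to reproduce the actual component.

```python
# Toy antonym pairs; the actual system would consult a lexical resource.
ANTONYMS = {frozenset({"short", "long"}), frozenset({"rising", "falling"}),
            frozenset({"northern", "southern"})}

def is_number(token):
    return token.replace(",", "").replace(".", "", 1).isdigit()

def modifiers_incompatible(mods1, mods2):
    """Incompatibility as handled here: antonymous modifiers, or two different numbers."""
    for m1 in mods1:
        for m2 in mods2:
            if frozenset({m1, m2}) in ANTONYMS:
                return True
            if is_number(m1) and is_number(m2) and m1 != m2:
                return True
    return False

def potentially_coreferring(np1, np2):
    """Two NPs (head, modifiers) from the same document are taken as coreferring
    if their heads are identical and no incompatibility is found between modifiers."""
    (head1, mods1), (head2, mods2) = np1, np2
    return head1 == head2 and not modifiers_incompatible(mods1, mods2)

print(potentially_coreferring(("flight", ["first", "airliner"]),
                              ("flight", ["first", "A380"])))                    # True
print(potentially_coreferring(("distance", ["short"]), ("distance", ["long"])))  # False
print(potentially_coreferring(("dollars", ["500"]), ("dollars", ["300"])))       # False
```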
6.2 Global information

Key terms or prominent pieces of information that appear in the document, typically in the title or the first few sentences, are many times perceived as "globally" known throughout the document. For example, the geographic location of the document's theme, mentioned at the beginning of the document, is assumed to be known from that point on, and will often not be mentioned in further sentences which do refer to that location.

This is a bridging phenomenon that is typically not addressed by available discourse processing tools. To compensate for that, we implemented the following simple method: we identify key terms for each document based on TF-IDF scores, requiring a minimum number of occurrences of the term in the document and giving additional weight to terms in the title. The top-n ranking terms are considered global for that document. Then, each sentence parse tree in the document is augmented by adding the document's global terms as nodes directly attached to the sentence's root node. Thus, an occurrence of a global term in the hypothesis is matched in each of the sentences in the document, regardless of whether the term explicitly appears in the sentence. For example, global terms for the topic discussing the ice melting in the Arctic typically contain a location such as Arctic or Antarctica and terms referring to ice, like permafrost, icecap or iceshelf.

Another method for addressing missing coreference relations is based on the assumption that adjacent sentences often refer to the same entities and events. Thus, when given a sentence for classification, we also consider the text of its preceding sentence. Specifically, when extracting classification features for a given sentence, in addition to the features extracted from the parse tree of the sentence itself, we extract the same set of features (excluding the tree-kernel feature in (Bar-Haim et al., 2008)) from the joint tree composed of the tree representations of the current and previous sentences put together.
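The key-term selection described at the beginning of this subsection can be sketched as follows. This is an illustrative reconstruction rather than the actual code: the precise TF-IDF formula and the way title terms are weighted are assumptions, with defaults chosen to match the configuration reported in Section 7 (top three terms, at least three occurrences, title terms counted twice).

```python
import math
from collections import Counter

def global_terms(doc_tokens, title_tokens, doc_freq, num_docs,
                 top_n=3, min_occurrences=3, title_weight=2):
    """Return the document's 'global' terms: the top_n TF-IDF-ranked terms that occur
    at least min_occurrences times, with title occurrences counted title_weight times."""
    counts = Counter(doc_tokens)
    for t in title_tokens:                 # extra weight for terms appearing in the title
        counts[t] += title_weight - 1
    scores = {}
    for term, tf in counts.items():
        if tf < min_occurrences:
            continue
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        scores[term] = tf * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

doc = "arctic ice melt arctic permafrost ice icecap arctic ice".split()
title = "arctic ice melting".split()
print(global_terms(doc, title, doc_freq={"arctic": 2, "ice": 5, "melt": 3}, num_docs=50))
```

The selected terms would then be attached as extra nodes under the root of every sentence parse tree in the document, so that a global term occurring in the hypothesis is matched in every sentence.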
6.3 Document-level classification

Beyond the discourse references addressed above, further information concerning discourse and document-structure phenomena is available in the Search setting and may contribute to entailment classification. For example, we observed that entailing sentences tend to come in bulks. This reflects a common coherence aspect, where the discussion of a specific topic is typically continuous rather than scattered across the entire document, and is especially apparent in long documents. This locality phenomenon may be useful for entailment classification, since knowing that a sentence entails the hypothesis increases the probability that adjacent sentences entail the hypothesis as well. More generally, for the classification of a given sentence, useful information can be derived from the classification results of other sentences in the document, reflecting other discourse and document-level phenomena.

To that end, we use a meta-classification scheme with a two-phase classification process, where a meta-classifier utilizes entailment classifications of the first classification phase to extract meta-features and determine the final classification decision. This scheme also provides a convenient way to combine scores from multiple classifiers used in the first classification phase. We refer to these as base-classifiers. This scheme and the meta-features we used are detailed hereunder.

Let us write (s, h) for a sentence-hypothesis pair. We denote the (set of pairs in the) development (training) set as D and in the test set as T. We split D into two halves, D1 and D2. We rely on document-level information to determine entailment. Thus, for a given h, following the candidate retrieval stage, we process all pairs corresponding to h paired with each sentence in the documents containing the candidates. These additional pairs are not considered as entailment candidates and are always classified as non-entailing. We write R for the set of candidate pairs and R′ for the set containing both candidates and the abovementioned additional pairs. Note that R ⊆ R′ ⊆ T. We make use of n base-classifiers, C1, ..., Cn, among which C* is a designated classifier with additional roles in the process, as described below. Classifiers may differ, for example, in their classification algorithm. An additional meta-classifier is denoted CM.

The classification scheme is shown as Algorithm 1. We now elaborate on each of these steps.

Algorithm 1: Meta-classification

  Training
  1: Extract features for every (s, h) in D
  2: Train C1, ..., Cn on D1
  3: Classify D2 using C1, ..., Cn
  4: Extract meta-features for D2 using the classification of C1, ..., Cn
  5: Train CM on D2

  Classification
  6: Extract features for every (s, h) in R′
  7: Classify R′ using C1, ..., Cn
  8: Extract meta-features for R
  9: Classify R using CM

At Step 1, features are extracted for every (s, h) pair in the training set by each of the base-classifiers. These include the same features as in the Main task, as well as the features for the joint forest of the current and previous sentence described in Section 6.2. In Steps 2 and 3, we split the training set into two halves (taking half of each topic), train n different classifiers on the first half and then classify the second half of the training set using each of the n classifiers. Given the classification scores of the n base-classifiers for the (s, h) pairs in the second half of the training set, D2, we add in Step 4 the following meta-features to each pair:

• Classification scores: The classification score of each of the n base-classifiers. This allows the meta-classifier to integrate the decisions made by different classifiers.

• Second-closest entailment: Considering the locality phenomenon described above, we add as a feature the distance to the second-closest entailing sentence in the document (including the sentence itself), according to the classification of C*. Formally, let i be the index of the current sentence and J be the set of indices of entailing sentences in the document according to C*. For each j in J we calculate d_i,j = |i − j|, and choose the second smallest d_i,j as d_i. If entailing sentences indeed always come in bulks, then d_i = 1 for all entailing sentences, but d_i > 1 for all non-entailing sentences.

Let us further explain the rationale behind this score. Suppose we computed the distance to the closest entailing sentence rather than to the second-closest one. In that case it is natural not to count the sentence as closest to itself, since doing so disregards the environment of the sentence altogether, eliminating the desired effect. If C* mistakenly classifies a sentence as entailing while no sentence in its environment is entailing, both the closest-entailing scheme (excluding self) and the second-closest scheme (including self) produce the same distance. On the other hand, under the 'closest' scheme, both an entailing sentence at the "edge" of an entailment bulk and the non-entailing sentence just next to it have a distance of 1: suppose that sentences i, ..., i + l constitute a bulk of entailing sentences. Then d_{i−1} = |(i − 1) − i| = 1 and d_i = |i − (i + 1)| = 1. Under our scheme, however, the non-entailing sentence has a distance of 2 while the entailing sentence has a distance of 1, since we consider both the sentence's own classification and its environment's classification. We scale the distance and add the feature score −log(d_i).

• Smoothed entailment: This feature also addresses the locality phenomenon, by smoothing the classification score of sentence i with the scores of adjacent sentences, weighted by their distance from the current sentence i. Let s(i) be the score assigned by C* to sentence i. We add the Smoothed Entailment feature score:

  SE(i) = [ Σ_w b^|w| · s(i + w) ] / [ Σ_w b^|w| ]        (1)

where 0 < b < 1 is a parameter and w is an integer bounded between −N and N, denoting the distance from sentence i.

• 1st sentence entailing title: As shown in (Bensley and Hickl, 2008), the first sentence in a news article typically entails the article's title. We found this phenomenon to hold for the RTE-5 development set as well. We therefore assume that for each document s1 ⇒ s0, where s1 and s0 are the document's first sentence and title respectively. Hence, under entailment transitivity, if s0 ⇒ h then s1 ⇒ h. The corresponding binary feature states whether the sentence being classified is the first sentence of the document AND the title entails the hypothesis according to C*.

• Title entailment: In many texts, and in news articles in particular, the title and the first few sentences are often used to present the entire document's content and may therefore be considered as a summary of the document. Thus, it may be useful to know whether these sentences entail the hypothesis, as an indicator of the general potential of the document to include entailing sentences. Two binary features are added according to the classification of C*, indicating whether the title entails the hypothesis and whether the first sentence entails it.
After adding the meta-features, we train a meta-classifier on this new set of features in Step 5. Test sentences that passed the retrieval module's filtering then go through the same process: features are extracted for them and they are classified by the already trained n classifiers (Steps 6 and 7), meta-features are extracted in Step 8, and a final classification decision is performed by the meta-classifier in Step 9.

7 Search Task - Experiments and Results

We submitted three distinct runs for the Search task, as described below.

Search-BIU1: Our first run determines entailment between a sentence s and a hypothesis h purely based on term coverage of h by s, i.e. by using the retrieval module's output directly (cf. Section 5). For picking the best resource-threshold combination for candidate retrieval, we assessed the performance of various settings for term expansion. These include the use of WordNet, Wiki, XWN, and Dekang Lin's distributional similarity resource (Lin, 1998), as well as unions of these resources and the basic setting where no expansions at all are used. Each expansion setting was assessed with a threshold range of 10%-80% on the development set. Several such settings are shown in Table 3. As seen in the table, the best performing setting in terms of micro-averaged F1 – which is therefore used for Search-BIU1 – was the use of Wiki with a 50% coverage threshold, achieving a slightly better score than using no resources at all.

Table 3: Performance of lexical resources for expansion on the development set, showing the best coverage threshold found for each resource when using the retrieval module to determine entailment. Note that settings using different thresholds are not directly comparable.

  Resource   Min. Coverage   P (%)   R (%)   F1 (%)
  Wiki       50%             35.5    42.8    38.8
  (none)     50%             35.6    41.5    38.3
  XWN        50%             30.5    46.2    36.7
  WordNet    60%             30.8    43.6    36.1
  WN+Wiki    60%             30.3    43.8    35.8
  Lin        80%             22.9    35.2    27.7

Search-BIU2: In this run BIUTEE is used in its standard configuration, i.e. a single classifier is used and features are extracted for each sentence independently, without attending to document-level considerations. Test-set sentences are pre-filtered by the retrieval module using no resources for expansion (we picked this configuration empirically; note that systems may have different optimal retrieval configurations) and with a minimum 50% coverage of the hypothesis. The entailment resources used in this run are: syntactic rules, WordNet, Wiki, Geo, XWN, Abbr, Snow and DateRG. For the classifier, we use the SVMperf package (Joachims, 2006) with a linear kernel. Global information is added by enriching each sentence with the top-three terms from the document, based on the TF-IDF scores (cf. Section 6.2), if they occur at least three times in the document, while title terms are counted twice.

Search-BIU3: Here, our complete system is applied, using the meta-classifier, as described in Section 6.3. The retrieval module's configuration and the set of employed entailment resources are identical to the ones used in Search-BIU2. In this system, we used two base-classifiers (n = 2): SVMperf and Naïve Bayes from the WEKA package (Witten and Frank, 2005), where the first among these is set as our designated classifier C*, which is used for the computation of the document-level features. SVMperf was also used for the meta-classifier. For the smoothed entailment score (cf. Section 6.3), we used b = 0.9 and N = 3, based on tuning on the development set.

The results obtained in each of the above runs are detailed in Table 4. For easier comparison we also show the results of another lexical run, termed Search-BIU1′, where no expansion resources are used, as in Search-BIU2 and Search-BIU3. Hence, Search-BIU1′ can be directly viewed as the candidate retrieval step of the next two runs. The entailment engine in these two runs applies a second filter to the candidates based on the inference classification results, aiming to improve the precision of this initial set. Recall is, therefore, limited by that of the candidate retrieval step.

Table 4: Micro-average results of our Search task runs.

  Run            P (%)   R (%)   F1 (%)
  Search-BIU1    37.03   55.50   44.42
  Search-BIU1′   37.15   53.50   43.85
  Search-BIU2    40.49   47.88   43.87
  Search-BIU3    40.98   51.38   45.59
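For completeness, here is a small sketch of how the micro-averaged precision, recall and F1 reported in Tables 3 and 4 can be computed, and how a coverage threshold could be selected on the development set; it illustrates the evaluation measure and is not the tuning code actually used.

```python
def micro_prf(predictions):
    """Micro-averaged P/R/F1 over (predicted, gold) pairs pooled across all hypotheses."""
    tp = sum(1 for p, g in predictions if p and g)
    fp = sum(1 for p, g in predictions if p and not g)
    fn = sum(1 for p, g in predictions if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(dev_pairs, thresholds):
    """Pick the coverage threshold maximizing micro-averaged F1 on the development set;
    dev_pairs holds (coverage, gold) for every candidate sentence-hypothesis pair."""
    return max(thresholds,
               key=lambda t: micro_prf([(cov >= t, gold) for cov, gold in dev_pairs])[2])

dev_pairs = [(0.7, True), (0.5, True), (0.5, False), (0.3, False), (0.9, True), (0.2, False)]
print(best_threshold(dev_pairs, [t / 10 for t in range(1, 9)]))   # 0.4 on this toy data
```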
Although achieving rather close F1 scores, we note that our submissions' outputs are substantially different from each other, as reflected in the number of sentences classified as entailing: while Search-BIU1 marked 1199 sentences as entailing (1152 for Search-BIU1′), in Search-BIU2 and Search-BIU3 the numbers are 946 and 1003, respectively. Comparing Search-BIU1′ to Search-BIU3 based on Table 4 and these figures, we learn that 149 sentences are removed by the latter, of which 89% are false positives. This directly translates to a 10% relative increase in precision with an approximate 4% relative recall loss. We further learn, by comparing Search-BIU2 and Search-BIU3, that the meta-classification scheme – constituting the difference between the two systems – is helpful, mainly for increasing recall. Which of the meta-features are responsible for the improved performance requires further analysis.

An interesting observation concerning the datasets is obtained by comparing the second line in each of Tables 3 and 4, referring to lexical runs with no expansions, which retrieve sentences based on direct matches between the sentence and hypothesis terms. On the test set, this configuration achieves a recall score higher by 29% relative to the recall obtained on the development set (53.5% vs. 41.5%), with an even slightly higher precision. Apparently, the test set was much more prone to favor lexical methods than the development set. This may contribute to understanding why our complete system achieved only little leverage over the purely lexical run. In any case, it constitutes a bias between the datasets, significantly affecting systems' training and tuning.

We refrained from performing analysis on the Search task's test set as we intend to perform further experiments using this dataset.

8 Conclusions and Future Work

In this work we addressed the RTE Search task, identified key issues of the task, and have put initial solutions in place. We designed a scalable system in which we addressed various document-structure and discourse-based phenomena which are relevant for inference under such settings. A thorough analysis is required to understand the impact of each of our system's components and resources. So is the development of sound algorithms for addressing the discourse phenomena we pointed out.

Our system achieved the highest score among the groups that participated in the challenge, but has surpassed our own baseline by only a small margin. Previous work, e.g. (Roth and Sammons, 2007; Adams et al., 2007), showed that lexical methods constitute a strong baseline for RTE systems. Our own results provide another support for this observation. Still, by applying our inference engine, we were able to improve precision relative to the lexical system, thus improving the overall performance in terms of F1. This constitutes a way to trade off recall and precision depending on one's needs. We believe that further improvement can be achieved by recruiting IR and QA know-how for the retrieval phase and by providing more comprehensive implementations for the ideas we proposed in this paper.

Acknowledgments

This work was partially supported by the Negev Consortium of the Israeli Ministry of Industry, Trade and Labor, the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886, the FIRB-Israel research project N. RBIN045PXH and the Israel Science Foundation grant 1112/08. Jonathan Berant is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.

References

Rod Adams, Gabriel Nicolae, Cristina Nicolae, and Sanda Harabagiu. 2007. Textual entailment through extended lexical overlap and lexico-semantic matching. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Roy Bar-Haim, Ido Dagan, Iddo Greental, and Eyal Shnarch. 2007. Semantic inference at the lexical-syntactic level. In AAAI.

Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal Shnarch, and Idan Szpektor. 2008. Efficient semantic deduction and approximate matching over compact parse forests. In Proceedings of the Text Analysis Conference (TAC).
Roy Bar-Haim, Jonathan Berant, and Ido Dagan. 2009. A compact forest for scalable inference over entailment and paraphrase rules. In Proceedings of EMNLP.

Jeremy Bensley and Andrew Hickl. 2008. Unsupervised resource creation for textual inference applications. In Proceedings of LREC.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, Medea Lo Leggio, and Bernardo Magnini. 2009. Considering discourse references in textual entailment annotation. In Proceedings of the 5th International Conference on Generative Approaches to the Lexicon (GL2009).

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.

Jenny R. Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL.

Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD).

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question answering. Natural Language Engineering, 7(4):343–360.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL.

Shachar Mirkin, Ido Dagan, and Eyal Shnarch. 2009. Evaluating the inferential utility of lexical-semantic resources. In Proceedings of EACL.

Dan Moldovan and Vasile Rus. 2001. Logic form transformation of WordNet and its applicability to question answering. In Proceedings of ACL.

Dan Roth and Mark Sammons. 2007. Semantic and logical inference model for textual entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Eyal Shnarch, Libby Barak, and Ido Dagan. 2009. Extracting lexical reference rules from Wikipedia. In Proceedings of ACL-IJCNLP.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of COLING-ACL.

Idan Szpektor and Ido Dagan. 2007. Learning canonical forms of entailment rules. In Proceedings of RANLP.

Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of SIGIR.

Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco.