Open Information Extraction using Wikipedia
Fei Wu, University of Washington, Seattle, WA, USA — wufei@cs.washington.edu
Daniel S. Weld, University of Washington, Seattle, WA, USA — weld@cs.washington.edu

Abstract

Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform?

This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors — using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

1 Introduction

The problem of information extraction (IE), generating relational data from natural-language text, has received increasing attention in recent years. A large, high-quality repository of extracted tuples can potentially benefit a wide range of NLP tasks such as question answering, ontology learning, and summarization. The vast majority of IE work uses supervised learning of relation-specific examples. For example, the WebKB project (Craven et al., 1998) used labeled examples of the courses-taught-by relation to induce rules for identifying additional instances of the relation. While these methods can achieve high precision and recall, they are limited by the availability of training data and are unlikely to scale to the thousands of relations found in text on the Web.

An alternative paradigm, Open IE, pioneered by the TextRunner system (Banko et al., 2007), aims to handle an unbounded number of relations and run quickly enough to process Web-scale corpora. Domain independence is achieved by extracting the relation name as well as its two arguments. Most open IE systems use self-supervised learning, in which automatic heuristics generate labeled data for training the extractor. For example, TextRunner uses a small set of hand-written rules to heuristically label training examples from sentences in the Penn Treebank.

This paper presents WOE (Wikipedia-based Open Extractor), the first system that autonomously transfers knowledge from random editors' effort of collaboratively editing Wikipedia to train an open information extractor. Specifically, WOE generates relation-specific training examples by matching infobox[1] attribute values to corresponding sentences (as done in Kylin (Wu and Weld, 2007) and Luchs (Hoffmann et al., 2010)), but WOE abstracts these examples to relation-independent training data to learn an unlexicalized extractor, akin to that of TextRunner. WOE can operate in two modes: when restricted to shallow features like part-of-speech (POS) tags, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher. We present a thorough experimental evaluation, making the following contributions:

• We present WOE, a new approach to open IE that uses Wikipedia for self-supervised learning of unlexicalized extractors. Compared with TextRunner (the state of the art) on three corpora, WOE yields between 79% and 90% improved F-measure — generalizing well beyond Wikipedia.

[1] An infobox is a set of tuples summarizing the key attributes of the subject in a Wikipedia article. For example, the infobox in the article on "Sweden" contains attributes like Capital, Population and GDP.
• Using the same learning algorithm and features as TextRunner, we compare four different ways to generate positive and negative training examples with TextRunner's method, concluding that our Wikipedia heuristic is responsible for the bulk of WOE's improved accuracy.

• The biggest win arises from using parser features. Previous work (Jiang and Zhai, 2007) concluded that parser-based features are unnecessary for information extraction, but that work assumed the presence of lexical features. We show that abstract dependency paths are a highly informative feature when performing unlexicalized extraction.

2 Problem Definition

An open information extractor is a function from a document, d, to a set of triples, {⟨arg1, rel, arg2⟩}, where the args are noun phrases and rel is a textual fragment indicating an implicit, semantic relation between the two noun phrases. The extractor should produce one triple for every relation stated explicitly in the text, but is not required to infer implicit facts. In this paper, we assume that all relational instances are stated within a single sentence. Note the difference between open IE and the traditional approaches (e.g., as in WebKB), where the task is to decide whether some pre-defined relation holds between (two) arguments in the sentence.

We wish to learn an open extractor without direct supervision, i.e., without annotated training examples or hand-crafted patterns. Our input is Wikipedia, a collaboratively constructed encyclopedia.[2] As output, WOE produces an unlexicalized and relation-independent open extractor. Our objective is an extractor which generalizes beyond Wikipedia, handling other corpora such as the general Web.

[2] We also use DBpedia (Auer and Lehmann, 2007) as a collection of conveniently parsed Wikipedia infoboxes.

3 Wikipedia-based Open IE

The key idea underlying WOE is the automatic construction of training examples by heuristically matching Wikipedia infobox values and corresponding text; these examples are used to generate an unlexicalized, relation-independent (open) extractor. As shown in Figure 1, WOE has three main components: preprocessor, matcher, and learner.

[Figure 1: Architecture of WOE. The preprocessor (sentence splitting, NLP annotating, synonyms compiling) feeds the matcher (primary entity matching, sentence matching), which produces triples for the learner (a pattern classifier over parser features and a CRF extractor over shallow features).]

3.1 Preprocessor

The preprocessor converts the raw Wikipedia text into a sequence of sentences, attaches NLP annotations, and builds synonym sets for key entities. The resulting data is fed to the matcher, described in Section 3.2, which generates the training set.

Sentence Splitting: The preprocessor first renders each Wikipedia article into HTML, then splits the article into sentences using OpenNLP.

NLP Annotation: As we discuss fully in Section 4 (Experiments), we consider several variations of our system; one version, WOE^parse, uses parser-based features, while another, WOE^pos, uses shallow features like POS tags, which may be more quickly computed. Depending on which version is being trained, the preprocessor uses OpenNLP to supply POS tags and NP-chunk annotations — or uses the Stanford Parser to create a dependency parse. When parsing, we force the hyperlinked anchor texts to be a single token by connecting the words with an underscore; this transformation improves parsing performance in many cases.

Compiling Synonyms: As a final step, the preprocessor builds sets of synonyms to help the matcher find sentences that correspond to infobox relations. This is useful because Wikipedia editors frequently use multiple names for an entity; for example, in the article titled "University of Washington" the token "UW" is widely used to refer to the university. Additionally, attribute values are often described differently within the infobox than they are in surrounding text. Without knowledge of these synonyms, it is impossible to construct good matches. Following (Wu and Weld, 2007; Nakayama and Nishio, 2008), the preprocessor uses Wikipedia redirection pages and backward links to automatically construct synonym sets. Redirection pages are a natural choice, because they explicitly encode synonyms; for example, "USA" is redirected to the article on the "United States." Backward links for a Wikipedia entity such as the "Massachusetts Institute of Technology" are hyperlinks pointing to this entity from other articles; the anchor text of such links (e.g., "MIT") forms another source of synonyms.
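To make the synonym-compiling step concrete, the following sketch shows one way redirect and backlink information could be folded into synonym sets. The input lists (redirects, anchor_texts) and the function name are hypothetical stand-ins for data mined from a Wikipedia dump; the paper does not show the authors' actual implementation.

```python
from collections import defaultdict

# Hypothetical inputs: (alias page -> target article) redirect pairs and
# (target article, anchor text) pairs harvested from backward links.
redirects = [("USA", "United States"), ("U.S.A.", "United States"),
             ("MIT", "Massachusetts Institute of Technology")]
anchor_texts = [("Massachusetts Institute of Technology", "MIT"),
                ("United States", "the United States")]

def build_synonyms(redirects, anchor_texts):
    synonyms = defaultdict(set)
    for alias, target in redirects:       # redirection pages explicitly encode synonyms
        synonyms[target].add(alias)
    for target, anchor in anchor_texts:   # anchor text of backward links is another source
        synonyms[target].add(anchor)
    for target in synonyms:               # an entity is trivially a synonym of itself
        synonyms[target].add(target)
    return synonyms

print(sorted(build_synonyms(redirects, anchor_texts)["United States"]))
# ['U.S.A.', 'USA', 'United States', 'the United States']
```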
3.2 Matcher

The matcher constructs training data for the learner component by heuristically matching attribute-value pairs from Wikipedia articles containing infoboxes with corresponding sentences in the article. Given the article on "Stanford University," for example, the matcher should associate ⟨established, 1891⟩ with the sentence "The university was founded in 1891 by ..." Given a Wikipedia page with an infobox, the matcher iterates through all its attributes looking for a unique sentence that contains references to both the subject of the article and the attribute value; these noun phrases will be annotated arg1 and arg2 in the training set. The matcher considers a sentence to contain the attribute value if the value or its synonym is present. Matching the article subject, however, is more involved.

Matching Primary Entities: In order to match shorthand terms like "MIT" with more complete names, the matcher uses an ordered set of heuristics like those of (Wu and Weld, 2007; Nguyen et al., 2007):

• Full match: strings matching the full name of the entity are selected.

• Synonym set match: strings appearing in the entity's synonym set are selected.

• Partial match: strings matching a prefix or suffix of the entity's name are selected. If the full name contains punctuation, only a prefix is allowed. For example, "Amherst" matches "Amherst, Mass," but "Mass" does not.

• Patterns of "the ⟨type⟩": The matcher first identifies the type of the entity (e.g., "city" for "Ithaca"), then instantiates the pattern to create the string "the city." Since the first sentence of most Wikipedia articles is stylized (e.g., "The city of Ithaca sits ..."), a few patterns suffice to extract most entity types.

• The most frequent pronoun: The matcher assumes that the article's most frequent pronoun denotes the primary entity, e.g., "he" for the page on "Albert Einstein." This heuristic is dropped when "it" is most common, because the word is used in too many other ways.

When there are multiple matches to the primary entity in a sentence, the matcher picks the one which is closest to the matched infobox attribute value in the parser dependency graph.

Matching Sentences: The matcher seeks a unique sentence to match the attribute value. To produce the best training set, the matcher performs three filterings. First, it skips the attribute completely when multiple sentences mention the value or its synonym. Second, it rejects the sentence if the subject and/or attribute value are not heads of the noun phrases containing them. Third, it discards the sentence if the subject and the attribute value do not appear in the same clause (or in parent/child clauses) in the parse tree.

Since Wikipedia's Wikimarkup language is semantically ambiguous, parsing infoboxes is surprisingly complex. Fortunately, DBpedia (Auer and Lehmann, 2007) provides a cleaned set of infoboxes from 1,027,744 articles. The matcher uses this data for attribute values, generating a training dataset with a total of 301,962 labeled sentences.
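The primary-entity heuristics can be read as an ordered cascade that stops at the first match. The sketch below is a deliberately simplified, hypothetical rendering of that cascade (for instance, the partial-match rule is reduced to single-word prefixes and suffixes); it is not the authors' implementation, only an illustration of the ordering.

```python
def find_primary_entity(sentence, entity, synonyms, entity_type=None,
                        frequent_pronoun=None):
    """Return the string in `sentence` taken to denote the article's subject,
    trying the heuristics in the order given in the text, or None."""
    if entity in sentence:                                   # full match
        return entity
    for s in synonyms:                                       # synonym set match
        if s in sentence:
            return s
    if "," in entity:                                        # partial match: only a prefix
        prefix = entity.split(",")[0].strip()                # is allowed with punctuation
        if prefix in sentence:
            return prefix
    else:
        words = entity.split()
        for candidate in (words[0], words[-1]):              # prefix or suffix otherwise
            if candidate in sentence:
                return candidate
    if entity_type and f"the {entity_type}" in sentence:     # "the <type>" pattern
        return f"the {entity_type}"
    if frequent_pronoun and frequent_pronoun != "it":        # most frequent pronoun,
        if frequent_pronoun in sentence.split():             # dropped when it is "it"
            return frequent_pronoun
    return None
```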
3.3 Learning Extractors

We learn two kinds of extractors, one (WOE^parse) using features from dependency-parse trees and the other (WOE^pos) limited to shallow features like POS tags. WOE^parse uses a pattern learner to classify whether the shortest dependency path between two noun phrases indicates a semantic relation. In contrast, WOE^pos (like TextRunner) trains a conditional random field (CRF) to output certain text between noun phrases when the text denotes such a relation. Neither extractor uses individual words or lexical information for features.

3.3.1 Extraction with Parser Features

Despite some evidence that parser-based features have limited utility in IE (Jiang and Zhai, 2007), we hoped dependency paths would improve precision on long sentences.

Shortest Dependency Path as Relation: Unless otherwise noted, WOE uses the Stanford Parser to create dependencies in the "collapsedDependency" format. Dependencies involving prepositions and conjuncts, as well as information about the referent of relative clauses, are collapsed to get direct dependencies between content words. As noted in (de Marneffe and Manning, 2008), this collapsed format often yields simplified patterns which are useful for relation extraction. Consider the sentence:

  Dan was not born in Berkeley.

The Stanford Parser dependencies are:

  nsubjpass(born-4, Dan-1)
  auxpass(born-4, was-2)
  neg(born-4, not-3)
  prep_in(born-4, Berkeley-6)

where each atomic formula represents a binary dependence from dependent token to the governor token.

These dependencies form a directed graph, ⟨V, E⟩, where each token is a vertex in V and E is the set of dependencies. For any pair of tokens, such as "Dan" and "Berkeley", we use the shortest connecting path to represent the possible relation between them:

  Dan --nsubjpass--> born <--prep_in-- Berkeley

We call such a path a corePath.
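A corePath can be computed as a shortest path over the dependency graph while remembering each edge's original orientation. The sketch below illustrates the idea with networkx on a hand-coded dependency list for the example sentence; the real system derives the graph from the Stanford Parser output, which is not shown here.

```python
import networkx as nx

# Toy dependency list for "Dan was not born in Berkeley." in the collapsed
# format described above: (dependency label, governor, dependent).
deps = [
    ("nsubjpass", "born-4", "Dan-1"),
    ("auxpass",   "born-4", "was-2"),
    ("neg",       "born-4", "not-3"),
    ("prep_in",   "born-4", "Berkeley-6"),
]

# Treat the dependencies as an undirected graph so a connecting path may
# traverse edges in either direction, but remember each edge's governor.
g = nx.Graph()
for label, gov, dep in deps:
    g.add_edge(gov, dep, label=label, gov=gov)

def core_path(graph, arg1, arg2):
    """Return the shortest connecting path between two tokens, rendered with
    arrows pointing from dependent to governor (the corePath of the text)."""
    nodes = nx.shortest_path(graph, arg1, arg2)
    pieces = [nodes[0]]
    for a, b in zip(nodes, nodes[1:]):
        data = graph[a][b]
        arrow = f"--{data['label']}-->" if data["gov"] == b else f"<--{data['label']}--"
        pieces += [arrow, b]
    return " ".join(pieces)

print(core_path(g, "Dan-1", "Berkeley-6"))
# Dan-1 --nsubjpass--> born-4 <--prep_in-- Berkeley-6
```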
While we will see that corePaths are useful for indicating when a relation exists between tokens, they don't necessarily capture the semantics of that relation. For example, the path shown above doesn't indicate the existence of negation! In order to capture the meaning of the relation, the learner augments the corePath into a tree by adding all adverbial and adjectival modifiers as well as dependencies like "neg" and "auxpass". We call the result an expandPath. WOE traverses the expandPath with respect to the token order in the original sentence when outputting the final expression of rel.

Building a Database of Patterns: For each of the 301,962 sentences selected and annotated by the matcher, the learner generates a corePath between the tokens denoting the subject and the infobox attribute value. Since we are interested in eventually extracting "subject, relation, object" triples, the learner rejects corePaths that don't start with subject-like dependencies, such as nsubj, nsubjpass, partmod and rcmod. This leads to a collection of 259,046 corePaths.

To combat data sparsity and improve learning performance, the learner further generalizes the corePaths in this set to create a smaller set of generalized-corePaths. The idea is to eliminate distinctions which are irrelevant for recognizing (domain-independent) relations. Lexical words in corePaths are replaced with their POS tags. Further, all noun POS tags and "PRP" are abstracted to "N", all verb POS tags to "V", all adverb POS tags to "RB", and all adjective POS tags to "J". Preposition dependencies such as "prep_in" are generalized to "prep". Take the corePath "Dan --nsubjpass--> born <--prep_in-- Berkeley" for example: its generalized-corePath is "N --nsubjpass--> V <--prep-- N". We call such a generalized-corePath an extraction pattern. In total, WOE builds a database (named DB_p) of 29,005 distinct patterns, and each pattern p is associated with a frequency — the number of matching sentences containing p. Specifically, 311 patterns have f_p ≥ 100 and 3,519 patterns have f_p ≥ 5.
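The abstraction from a lexicalized corePath to an extraction pattern amounts to rewriting tokens to coarse POS classes and collapsing prep_* labels to prep. The sketch below shows this rewrite for the running example; the POS lookup is hard-coded and purely illustrative, since in WOE the tags come from the preprocessor's annotations.

```python
import re

# Hypothetical POS lookup for the running example.
pos = {"Dan": "NNP", "born": "VBN", "Berkeley": "NNP"}

def generalize_token(token):
    tag = pos.get(token, "X")
    if tag.startswith("NN") or tag == "PRP":
        return "N"
    if tag.startswith("VB"):
        return "V"
    if tag.startswith("RB"):
        return "RB"
    if tag.startswith("JJ"):
        return "J"
    return tag

def generalize_dep(label):
    # prep_in, prep_on, ... all collapse to the generic "prep"
    return "prep" if label.startswith("prep") else label

def generalize_core_path(core_path):
    """Abstract a lexicalized corePath into an extraction pattern."""
    out = []
    for piece in core_path.split():
        m = re.match(r"(<--|--)(\w+)(-->|--)$", piece)
        if m:
            out.append(f"{m.group(1)}{generalize_dep(m.group(2))}{m.group(3)}")
        else:
            out.append(generalize_token(piece))
    return " ".join(out)

print(generalize_core_path("Dan --nsubjpass--> born <--prep_in-- Berkeley"))
# N --nsubjpass--> V <--prep-- N
```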
Learning a Pattern Classifier: Given the large number of patterns in DB_p, we assume few valid open extraction patterns are left behind. The learner builds a simple pattern classifier, named WOE^parse, which checks whether the generalized-corePath from a test triple is present in DB_p, and computes the normalized logarithmic frequency as the probability:[3]

  w(p) = max(log(f_p) − log(f_min), 0) / (log(f_max) − log(f_min))

where f_max (54,274 in this paper) is the maximal frequency of a pattern in DB_p, and f_min (set to 1 in this work) is the controlling threshold that determines the minimal frequency of a valid pattern.

[3] How to learn a more sophisticated weighting function is left as a future topic.

Take the previous sentence "Dan was not born in Berkeley" for example. WOE^parse first identifies Dan as arg1 and Berkeley as arg2 based on NP-chunking. It then computes the corePath "Dan --nsubjpass--> born <--prep_in-- Berkeley" and abstracts it to p = "N --nsubjpass--> V <--prep-- N". It then queries DB_p to retrieve the frequency f_p = 31,767 and assigns a probability of 0.95. Finally, WOE^parse traverses the triple's expandPath to output the final expression ⟨Dan, wasNotBornIn, Berkeley⟩. As shown in the experiments on three corpora, WOE^parse achieves an F-measure which is between 79% and 90% greater than TextRunner's.

3.3.2 Extraction with Shallow Features

WOE^parse has a dramatic performance improvement over TextRunner. However, the improvement comes at the cost of speed — TextRunner runs about 30X faster by only using shallow features.
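The normalized logarithmic frequency is straightforward to reproduce. A minimal sketch, using only the constants reported in the text:

```python
import math

# Frequencies reported above: the most frequent pattern occurs in 54,274
# matching sentences, and f_min is set to 1.
F_MAX, F_MIN = 54_274, 1

def pattern_weight(f_p, f_max=F_MAX, f_min=F_MIN):
    """Normalized logarithmic frequency used as the pattern's probability."""
    return max(math.log(f_p) - math.log(f_min), 0.0) / (math.log(f_max) - math.log(f_min))

# Running example: the pattern "N --nsubjpass--> V <--prep-- N" has
# frequency 31,767, giving a probability of roughly 0.95.
print(round(pattern_weight(31_767), 2))  # 0.95
```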
Since high speed can be crucial when processing Web-scale corpora, we additionally learn a CRF extractor WOE^pos based on shallow features like POS tags. In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of hand-written rules to label training examples from the Penn Treebank.

We use the same matching sentence set behind DB_p to generate positive examples for WOE^pos. Specifically, for each matching sentence, we label the subject and infobox attribute value as arg1 and arg2 to serve as the ends of a linear CRF chain. Tokens involved in the expandPath are labeled as rel. Negative examples are generated from random noun-phrase pairs in other sentences when their generalized-corePaths are not in DB_p.

WOE^pos uses the same learning algorithm and selection of features as TextRunner: a two-order CRF chain model is trained with the Mallet package (McCallum, 2002). WOE^pos's features include POS tags, regular expressions (e.g., for detecting capitalization, punctuation, etc.), and conjunctions of features occurring in adjacent positions within six words to the left and to the right of the current word.

As shown in the experiments, WOE^pos achieves an improved F-measure over TextRunner between 15% and 34% on three corpora, and this is mainly due to the increase in precision.
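The labeling scheme for WOE^pos's training sequences, described above, can be illustrated as follows. The token positions and the label names (ARG1, REL, ARG2, O) are hypothetical choices for the sketch; the text only specifies that the subject, the attribute value, and the expandPath tokens receive arg1, arg2, and rel labels.

```python
# Minimal sketch of turning a matched sentence into a labeled CRF training
# sequence. Spans are hard-coded for the running example.
tokens = ["Dan", "was", "not", "born", "in", "Berkeley", "."]
arg1_span, arg2_span = (0, 1), (5, 6)        # "Dan", "Berkeley"
rel_positions = {1, 2, 3, 4}                 # expandPath tokens: was not born in

def label_sequence(tokens, arg1_span, arg2_span, rel_positions):
    labels = []
    for i, _ in enumerate(tokens):
        if arg1_span[0] <= i < arg1_span[1]:
            labels.append("ARG1")
        elif arg2_span[0] <= i < arg2_span[1]:
            labels.append("ARG2")
        elif i in rel_positions:
            labels.append("REL")
        else:
            labels.append("O")
    return list(zip(tokens, labels))

print(label_sequence(tokens, arg1_span, arg2_span, rel_positions))
# [('Dan', 'ARG1'), ('was', 'REL'), ('not', 'REL'), ('born', 'REL'),
#  ('in', 'REL'), ('Berkeley', 'ARG2'), ('.', 'O')]
```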
[Figure 2: WOE^pos performs better than TextRunner, especially on precision. WOE^parse dramatically improves performance, especially on recall. (P/R curves on WSJ, Web, and Wikipedia for WOE^parse, WOE^pos, and TextRunner.)]

4 Experiments

We used three corpora for experiments: WSJ from Penn Treebank, Wikipedia, and the general Web. For each dataset, we randomly selected 300 sentences. Each sentence was examined by two people to label all reasonable triples. These candidate triples are mixed with pseudo-negative ones and submitted to Amazon Mechanical Turk for verification. Each triple was examined by 5 Turkers. We mark a triple's final label as positive when more than 3 Turkers marked it as positive.

4.1 Overall Performance Analysis

In this section, we compare the overall performance of WOE^parse, WOE^pos and TextRunner (shared by the Turing Center at the University of Washington). In particular, we are going to answer the following questions: 1) How do these systems perform against each other? 2) How does performance vary w.r.t. sentence length? 3) How does extraction speed vary w.r.t. sentence length?

Overall Performance Comparison

The detailed P/R curves are shown in Figure 2. To take a closer look, for each corpus we randomly divided the 300 sentences into 5 groups and compared the F-measures of the three systems in Figure 3. We can see that:

• WOE^pos is better than TextRunner, especially on precision. This is due to better training data from Wikipedia via self-supervision. Section 4.2 discusses this in more detail.

• WOE^parse achieves the best performance, especially on recall. This is because the parser features help to handle complicated and long-distance relations in difficult sentences. In particular, WOE^parse outputs 1.42 triples per sentence on average, while WOE^pos outputs 1.05 and TextRunner outputs 0.75.

Note that we measure TextRunner's precision & recall differently than (Banko et al., 2007) did. Specifically, we compute the precision & recall based on all extractions, while Banko et al. counted only concrete triples where arg1 is a proper noun, arg2 is a proper noun or date, and the frequency of rel is over a threshold. Our experiments show that focusing on concrete triples generally improves precision at the expense of recall.[4] Of course, one can apply a concreteness filter to any open extractor in order to trade recall for precision.

[4] For example, consider the Wikipedia corpus. From our 300 test sentences, TextRunner extracted 257 triples (at 72.0% precision) but only extracted 16 concrete triples (with 87.5% precision).
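A small sketch of the evaluation bookkeeping described in this section, under the stated assumptions that a candidate triple becomes gold-positive when more than 3 of its 5 Turker votes are positive and that precision and recall are computed over all extractions; the data structures are illustrative only.

```python
def gold_label(votes):
    """votes: list of 5 booleans from the Turkers for one candidate triple."""
    return sum(votes) > 3

def precision_recall_f(extracted, gold):
    """extracted, gold: sets of (arg1, rel, arg2) triples."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```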
[Figure 3: WOE^pos achieves an F-measure which is between 15% and 34% better than TextRunner's. WOE^parse achieves an improvement between 79% and 90% over TextRunner. The error bar indicates one standard deviation.]

[Figure 4: WOE^parse's F-measure decreases more slowly with sentence length than WOE^pos's and TextRunner's, due to its better handling of difficult sentences using parser features.]

The extraction errors by WOE^parse can be categorized into four classes. We illustrate them with the WSJ corpus. In total, WOE^parse got 85 wrong extractions on WSJ, and they are caused by: 1) incorrect arg1 and/or arg2 from NP-chunking (18.6%); 2) an erroneous dependency parse from the Stanford Parser (11.9%); 3) inaccurate meaning (27.1%) — for example, ⟨she, isNominatedBy, PresidentBush⟩ is wrongly extracted from the sentence "If she is nominated by President Bush ...";[5] 4) a pattern inapplicable to the test sentence (42.4%).

Note that WOE^parse is worse than WOE^pos in the low recall region. This is mainly due to parsing errors (especially on long-distance dependencies), which mislead WOE^parse into extracting false high-confidence triples. WOE^pos doesn't suffer from such parsing errors; therefore it has better precision on high-confidence extractions.

We noticed that TextRunner has a dip point in the low recall region. There are two typical errors responsible for this. A sample error of the first type is ⟨Sources, sold, theCompany⟩ extracted from the sentence "Sources said he sold the company", where "Sources" is wrongly treated as the subject of the object clause. A sample error of the second type is ⟨thisYear, willStarIn, theMovie⟩ extracted from the sentence "Coming up this year, Long will star in the new movie.", where "this year" is wrongly treated as part of a compound subject. Taking the WSJ corpus for example, at the dip point with recall=0.002 and precision=0.059, these two types of errors account for 70% of all errors.

Extraction Performance vs. Sentence Length

We tested how the extractors' performance varies with sentence length; the results are shown in Figure 4. TextRunner and WOE^pos have good performance on short sentences, but their performance deteriorates quickly as sentences get longer. This is because long sentences tend to have complicated and long-distance relations which are difficult for shallow features to capture. In contrast, WOE^parse's performance decreases more slowly w.r.t. sentence length. This is mainly because parser features are more useful for handling difficult sentences and they help WOE^parse to maintain good recall with only a moderate loss of precision.

Extraction Speed vs. Sentence Length

We also tested the extraction speed of the different extractors. We used Java for implementing the extractors, and tested on a Linux platform with a 2.4GHz CPU and 4GB memory. On average, it takes WOE^parse 0.679 seconds to process a sentence. For TextRunner and WOE^pos, it only takes 0.022 seconds — about 30X faster. The detailed extraction speed vs. sentence length is in Figure 5,
showing that TextRunner's and WOE^pos's extraction time grows approximately linearly with sentence length, while WOE^parse's extraction time grows quadratically (R² = 0.935) due to its reliance on parsing.

[5] These kinds of errors might be excluded by monitoring whether sentences contain words such as 'if,' 'suspect,' 'doubt,' etc. We leave this as a topic for the future.
[Figure 5: TextRunner's and WOE^pos's running time seems to grow linearly with sentence length, while WOE^parse's time grows quadratically.]

4.2 Self-supervision with Wikipedia Results in Better Training Data

In this section, we consider how the process of matching Wikipedia infobox values to corresponding sentences results in better training data than the hand-written rules used by TextRunner.

To compare with TextRunner, we tested four different ways to generate training examples from Wikipedia for learning a CRF extractor. Specifically, positive and/or negative examples are selected by TextRunner's hand-written rules (tr for short), by WOE's heuristic of matching sentences with infoboxes (w for short), or randomly (r for short). We use CRF+h1−h2 to denote a particular approach, where "+" means positive samples, "−" means negative samples, and hi ∈ {tr, w, r}. In particular, "+w" results in 221,205 positive examples based on the matching sentence set.[6] All extractors are trained using about the same number of positive and negative examples. In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ dataset in Penn Treebank.

[6] This number is smaller than the total number of corePaths (259,046) because we require arg1 to appear before arg2 in a sentence — as specified by TextRunner.

The CRF extractors are trained using the same learning algorithm and feature selection as TextRunner. The detailed P/R curves are in Figure 6, showing that using WOE's heuristics to label positive examples gives the biggest performance boost. CRF+tr−tr (trained using TextRunner's heuristics) is slightly worse than TextRunner. Most likely, this is because TextRunner's heuristics rely on parse trees to label training examples, and the Stanford parse on Wikipedia is less accurate than the gold parse on WSJ.

[Figure 6: Matching sentences with Wikipedia infoboxes results in better training data than the hand-written rules used by TextRunner. (P/R curves on WSJ, Web, and Wikipedia for CRF+w−w = WOE^pos, CRF+w−tr, CRF+w−r, CRF+tr−tr, and TextRunner.)]

4.3 Design Desiderata of WOE^parse

There are two interesting design choices in WOE^parse: 1) whether to require arg1 to appear before arg2 in the sentence (denoted as 1≺2); 2) whether to allow corePaths to contain prepositional phrase (PP) attachments (denoted as PPa). We tested how they affect the extraction performance; the results are shown in Figure 7.

[Figure 7: Filtering prepositional phrase attachments shows a strong boost to precision, and we see a smaller boost from enforcing a lexical ordering of relation arguments (1≺2).]

We can see that filtering PP attachments gives a large precision boost with a noticeable loss in recall; enforcing a lexical ordering of relation arguments (1≺2) yields a smaller improvement in precision with a small loss in recall. Take the WSJ corpus for example: setting 1≺2 and filtering PP attachments achieves a precision of 0.792 (with recall of 0.558). By relaxing 1≺2 to 1∼2, the precision decreases to 0.773 (with recall of 0.595). By allowing PP attachments and keeping 1≺2, the precision decreases to 0.642 (with recall of 0.687) — in particular, if we use the gold parse, the precision decreases to 0.672 (with recall of 0.685). We set 1≺2 and PP-attachment filtering as defaults in WOE^parse as a logical consequence of our preference for high precision over high recall.

4.3.1 Different parsing options

We also tested how different parsing might affect WOE^parse's performance. We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank. The Stanford Parser is used to derive dependencies from the CJ50 and gold parse trees. Figure 8 shows the detailed P/R curves. We can see that although today's statistical parsers make errors, they have negligible effect on the accuracy of WOE.

[Figure 8: Although today's statistical parsers make errors, they have negligible effect on the accuracy of WOE compared to operation on gold-standard, human-annotated data. (P/R curves on WSJ for WOE^parse with Stanford, CJ50, and gold parses.)]
5 Related Work

Open or Traditional Information Extraction: Most existing work on IE is relation-specific. Occurrence-statistical models (Agichtein and Gravano, 2000; M. Ciaramita, 2005), graphical models (Peng and McCallum, 2004; Poon and Domingos, 2008), and kernel-based methods (Bunescu and Mooney, 2005) have been studied. Snow et al. (Snow et al., 2005) utilize WordNet to learn dependency path patterns for extracting the hypernym relation from text. Some seed-based frameworks are proposed for open-domain extraction (Pasca, 2008; Davidov et al., 2007; Davidov and Rappoport, 2008). These works focus on identifying general relations such as class attributes, while open IE aims to extract relation instances from given sentences.
Another seed-based system, StatSnowball (Zhu et al., 2009), can perform both relation-specific and open IE by iteratively generating weighted extraction patterns. Different from WOE, StatSnowball only employs shallow features and uses L1-normalization to weight patterns. Shinyama and Sekine proposed the "preemptive IE" framework to avoid relation-specificity (Shinyama and Sekine, 2006). They first group documents based on pairwise vector-space clustering, then apply an additional clustering to group entities based on document clusters. The two clustering steps make it difficult to meet the scalability requirement necessary to process the Web. Mintz et al. (Mintz et al., 2009) use Freebase to provide distant supervision for relation extraction. They applied a similar heuristic by matching Freebase tuples with unstructured sentences (Wikipedia articles in their experiments) to create features for learning relation extractors. Using Freebase to match arbitrary sentences instead of matching Wikipedia infoboxes within corresponding articles will potentially increase the number of matched sentences at a cost of accuracy. Also, their learned extractors are relation-specific. Akbik et al. (Akbik and Broß, 2009) annotated 10,000 sentences parsed with LinkGrammar and selected 46 general linkpaths as patterns for relation extraction. In contrast, WOE learns 29,005 general patterns based on an automatically annotated set of 301,962 Wikipedia sentences.
The KNext system (Durme and Schubert, 2008) performs open knowledge extraction via significant heuristics. Its output is knowledge represented as logical statements instead of information represented as segmented text fragments.

Information Extraction with Wikipedia: The YAGO system (Suchanek et al., 2007) extends WordNet using facts extracted from Wikipedia categories. It only targets a limited number of predefined relations. Nakayama et al. (Nakayama and Nishio, 2008) parse selected Wikipedia sentences and perform extraction over the phrase-structure trees based on several handcrafted patterns. Wu and Weld proposed the KYLIN system (Wu and Weld, 2007; Wu et al., 2008), which has the same spirit of matching Wikipedia sentences with infoboxes to learn CRF extractors. However, it only works for relations defined in Wikipedia infoboxes.

Shallow or Deep Parsing: Shallow features, like POS tags, enable fast extraction over large-scale corpora (Davidov et al., 2007; Banko et al., 2007). Deep features are derived from parse trees with the hope of training better extractors (Zhang et al., 2006; Zhao and Grishman, 2005; Bunescu and Mooney, 2005; Wang, 2008). Jiang and Zhai (Jiang and Zhai, 2007) did a systematic exploration of the feature space for relation extraction on the ACE corpus. Their results showed limited advantage of parser features over shallow features for IE. However, our results imply that abstracted dependency path features are highly informative for open IE. There might be several reasons for the different observations. First, Jiang and Zhai's results are tested for traditional IE, where local lexicalized tokens might contain sufficient information to trigger a correct classification. The situation is different when features are completely unlexicalized in open IE. Second, as they noted, many relations defined in the ACE corpus are short-range relations which are easier for shallow features to capture. In practical corpora like the general Web, many sentences contain complicated long-distance relations. As we have shown experimentally, parser features are more powerful in handling such cases.
6 Conclusion

This paper introduces WOE, a new approach to open IE that uses self-supervised learning over unlexicalized features, based on a heuristic match between Wikipedia infoboxes and corresponding text. WOE can run in two modes: a CRF extractor (WOE^pos) trained with shallow features like POS tags, and a pattern classifier (WOE^parse) learned from dependency path patterns. Compared with TextRunner, WOE^pos runs at the same speed but achieves an F-measure which is between 15% and 34% greater on three corpora; WOE^parse achieves an F-measure which is between 79% and 90% higher than that of TextRunner, but runs about 30X slower due to the time required for parsing.

Our experiments uncovered two sources of WOE's strong performance: 1) the Wikipedia heuristic is responsible for the bulk of WOE's improved accuracy, but 2) dependency-parse features are highly informative when performing unlexicalized extraction. We note that this second conclusion disagrees with the findings in (Jiang and Zhai, 2007).

In the future, we plan to run WOE over the billion-document CMU ClueWeb09 corpus to compile a giant knowledge base for distribution to the NLP community. There are several ways to further improve WOE's performance. Other data sources, such as Freebase, could be used to create an additional training dataset via self-supervision. For example, Mintz et al. consider all sentences containing both the subject and object of a Freebase record as matching sentences (Mintz et al., 2009); while they use this data to learn relation-specific extractors, one could also learn an open extractor. We are also interested in merging lexicalized and open extraction methods; the use of some domain-specific lexical features might help to improve WOE's practical performance, but the best way to do this is unclear. Finally, we wish to combine WOE^parse with WOE^pos (e.g., with voting) to produce a system which maximizes precision at low recall.

Acknowledgements

We thank Oren Etzioni and Michele Banko from the Turing Center at the University of Washington for providing the code of their software and useful discussions. We also thank Alan Ritter, Mausam, Peng Dai, Raphael Hoffmann, Xiao Ling, Stefan Schoenmackers, Andrey Kolobov and Daniel Suskin for valuable comments. This material is based upon work supported by the WRF / TJ Cable Professorship, a gift from Google and by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the Air Force Research Laboratory (AFRL).
References

E. Agichtein and L. Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In ICDL.

Alan Akbik and Jürgen Broß. 2009. Wanderlust: Extracting semantic relations from natural language text using dependency grammar patterns. In WWW Workshop.

Sören Auer and Jens Lehmann. 2007. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In ESWC.

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the Web. In Procs. of IJCAI.

Razvan C. Bunescu and Raymond J. Mooney. 2005. Subsequence kernels for relation extraction. In NIPS.

R. Bunescu and R. Mooney. 2005. A shortest path dependency kernel for relation extraction. In HLT/EMNLP.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In ACL.

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 1998. Learning to extract symbolic knowledge from the world wide web. In AAAI.

Dmitry Davidov and Ari Rappoport. 2008. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In ACL.

Dmitry Davidov, Ari Rappoport, and Moshe Koppel. 2007. Fully unsupervised discovery of concept-specific relationships by web mining. In ACL.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. Stanford typed dependencies manual. http://nlp.stanford.edu/downloads/lex-parser.shtml.

Benjamin Van Durme and Lenhart K. Schubert. 2008. Open knowledge extraction using compositional language processing. In STEP.

A. Gangemi and M. Ciaramita. 2005. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In IJCAI.

R. Hoffmann, C. Zhang, and D. Weld. 2010. Learning 5000 relational extractors. In ACL.

Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In HLT/NAACL.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP.

Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2008. Wikipedia link structure and text mining for semantic relation extraction. In CEUR Workshop.

Dat P. T. Nguyen, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Exploiting syntactic and semantic information for relation extraction from Wikipedia. In IJCAI07-TextLinkWS.

Marius Pasca. 2008. Turning web text and search queries into factual knowledge: Hierarchical class attribute extraction. In AAAI.

Fuchun Peng and Andrew McCallum. 2004. Accurate information extraction from research papers using conditional random fields. In HLT-NAACL.

Hoifung Poon and Pedro Domingos. 2008. Joint inference in information extraction. In AAAI.

Y. Shinyama and S. Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In HLT-NAACL.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In NIPS.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge - unifying WordNet and Wikipedia. In WWW.

Mengqiu Wang. 2008. A re-examination of dependency path kernels for relation extraction. In IJCNLP.

Fei Wu and Daniel Weld. 2007. Autonomously semantifying Wikipedia. In CIKM.

Fei Wu, Raphael Hoffmann, and Daniel S. Weld. 2008. Information extraction from Wikipedia: Moving down the long tail. In KDD.

Min Zhang, Jie Zhang, Jian Su, and Guodong Zhou. 2006. A composite kernel to extract relations between entities with both flat and structured features. In ACL.

Shubin Zhao and Ralph Grishman. 2005. Extracting relations with integrated information using kernel methods. In ACL.

Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: A statistical approach to extracting entity relationships. In WWW.