Extraction of Temporal Facts and Events from Wikipedia
Erdal Kuzey, Gerhard Weikum
Max Planck Institute for Informatics
Saarbrücken, Germany
ekuzey,weikum@mpi-inf.mpg.de

Copyright is held by the author/owner(s). TempWeb '12, Apr 16-17 2012, Lyon, France. ACM 978-1-4503-1188-5/12/04.

ABSTRACT
Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts evolve over time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases.
In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge to further grow the knowledge base.
We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text.

General Terms
Algorithms

Categories and Subject Descriptors
H.1.0 [Information Systems]: Models and Principles

Keywords
Temporal Knowledge, Information Extraction, Ontologies

1. INTRODUCTION
The great success of Wikipedia and the progress in information extraction techniques have led to the automatic construction of several large knowledge bases. DBpedia [4], KnowItAll [5], Intelligence In Wikipedia [20], and YAGO [13] are major examples of such knowledge bases. These machine-readable knowledge bases consist of millions of entities, their semantic types, and binary relations between entities. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension of facts, although the vast majority of facts evolve over time or are valid only during a particular time period. It is crucial to know the temporal scope of facts, e.g., birth/death dates of people, publication dates of books, release dates of movies and music albums, dates of important events, timespans of marriages, political positions, job affiliations, etc.
The temporal dimension is particularly important for one-to-one relationships, such as isMarriedTo or isCEOof. People can be married to several spouses, but only at different time intervals. A knowledge base that contains multiple spouses for a given person can only be made consistent by attaching time intervals to the facts. Moreover, the temporal dimension helps to distinguish current from outdated facts. The fact “Jacques Chirac holdsPoliticalPosition President of France” is a correct fact, but not valid anymore. By attaching a time interval, the fact becomes universally valid.
Introducing temporal scopes to knowledge bases enables rich applications [21], such as visualizing timelines of important people or events, time-travel knowledge discovery as of a certain time, inferring the chronological order and perhaps causality of facts and events, analyzing the relatedness of events, question answering, text summarization, etc. Such a time-aware knowledge base can be a great asset for historians, media analysts, social researchers, and other knowledge workers.
Research on this kind of temporal knowledge is very recent. Most related to our objective is the work on T-YAGO [19] and YAGO2 [6], which are predecessors of the work presented in this paper. Both prior projects have extracted temporal facts from Wikipedia, with focus on infobox attributes [6], on one hand, and a broader range of semi-structured elements but with thematic customization and restriction to the football domain, on the other hand [19].
In this paper we make the following contributions:
• We extend the knowledge representation of the T-YAGO ontology for temporal facts and events.
• We develop a rule based information extraction framework which extracts temporal facts and events from semi-structured parts of Wikipedia.
• We introduce a pattern induction method to create
quaternary patterns for temporal fact extraction. We rank patterns by statistical measures.
• We present a pattern-based extraction technique to harvest temporal facts from free text.
• Our collection of more than 1.5 million temporal facts is publicly available at www.mpi-inf.mpg.de/~ekuzey/temporalfacts, for experimental studies, as training data for learning-based approaches, and other applications.

The rest of the paper is organized as follows: Section 2 discusses related work. The knowledge representation model is introduced in Section 3. We explain our approach to temporal fact and event extraction in Sections 4 and 5. Experiments and results are shown in Section 6, and we conclude the paper in Section 7.

2. RELATED WORK
Temporal fact extraction. Although time is a very important dimension for knowledge bases [3], research on temporal fact extraction is very sparse and recent. Only [8, 11, 19, 17, 18, 14, 6] address the problem.
T-YAGO [19] uses regular expressions to extract temporal facts from Wikipedia infoboxes and category names. However, it is highly customized to the special domain of football, and thus fairly limited in both scope and size. Our system, in contrast, is general-purpose and extracts facts from all domains. A reasoning framework for attaching timespans to existing facts is presented in [17]. The PRAVDA system [18] provides a method for harvesting temporal facts from free text. PRAVDA uses textual patterns and positive as well as negative seeds to extract candidates for temporal facts. Then it computes confidence scores for candidates by a label propagation algorithm. PRAVDA exhibits good precision, but its coverage is quite limited. For relations like isMarriedTo or worksFor, it merely extracts a few thousand temporal facts. Another recent approach is the COTS method [14], which employs an integer linear program for temporal scoping of existing facts. This is an expensive machinery and relies on domain-specific constraints. Experiments in [14] are fairly small-scale, and limited to specific relations about politicians and movies. In contrast, our system extracts a large number of temporal facts in a scalable manner and without domain restrictions.
YAGO2 [6] is a recent extension of the YAGO [13] knowledge base with a focus on temporal and spatial knowledge. YAGO2 is similar in spirit to our work. However, it mostly draws from infobox attributes. YAGO2 does not extract facts from free text, nor from titles or lists. The work presented here takes temporal fact extraction a big step further.
Event extraction. An area related to temporal fact extraction is event extraction as introduced in the TempEval Challenge workshops [15, 10]. This task is about identifying event-time and event-event temporal relations. In the TempEval Challenge, only a restricted set of Allen-style [2] temporal relations are used. The state-of-the-art system on this task is TARSQI [16] which detects temporal expressions (e.g., “last Friday” or “a week ago” in news articles) by marking events and generating temporal associations between events. HeidelTime [11] is a rule-based system that uses regular expressions and knowledge resources for the extraction and normalization of temporal expressions. [1] brings machine learning techniques and rule based methods together to improve the recognition and normalization of temporal expressions. The TIE system [8] does not only detect the temporal relationships between times and events, but also uses probabilistic inference to bound the time points of the begin and end of an event.
It is important to note that the concept of event in the above approaches is different from our concept of fact or event, which is defined in Section 3. Rather than augmenting relational facts with temporal validity points or intervals, systems like TARSQI or TIE focus on the relationships between temporal expressions. Here, events are defined to be verb forms with tenses, adjectives, and nouns that describe the temporal aspects of the events. In the example sentence “Obama accepted the award in Oslo last year.”, systems like TARSQI consider the verb “accepted” to be an event, and the adverbial phrase “last year” to be a time point. Furthermore, they capture the relation between “accepted” and “last year”, in a so-called event-time relation.

3. DATA MODEL
The most common knowledge representation model is the Resource Description Framework (RDF, http://www.w3.org/RDF/). In RDF, facts are modeled as subject-property-object (SPO) triples where subject and object are entities (or literals, not considered here) and the property is a binary relation. An example fact is:
Albert_Einstein wasBornIn Ulm.
For representing the validity timepoints or timespans of facts, we adopt and extend the RDF-style model of T-YAGO [19]. T-YAGO uses reification to model higher order facts.
First order vs. Higher order fact: A first order fact is an instance of a binary relation, such as
Bill_Clinton isPresidentOf USA.
A higher order fact has another fact as its subject. That is, a higher order fact is a fact about another fact. An example is
(Bill_Clinton isPresidentOf USA) startedOnDate 20-01-1993.
T-YAGO uses this representation to attach time spans to first order facts (base facts), thus creating higher order facts. It assigns a fact identifier to each fact and the time spans of facts are attached to these facts.
f1: Bill_Clinton isPresidentOf USA.
f2: f1 startedOnDate 20-01-1993
T-YAGO uses three temporal relations to capture the validity time of the facts: startedOnDate, endedOnDate, happenedOnDate. It employs the relations startedOnDate, endedOnDate to represent the start and end points of time intervals for the facts valid for a timespan, and uses the relation happenedOnDate for temporal facts valid only at single time points.
We first extend T-YAGO by introducing the notion of named events.
A named event is an entity which has a certain time period or time point directly attached to itself, such as French Revolution, World War II, Treaty of Lausanne, FIFA World Cup 2006, 37th G8 summit, 23rd Grammy Awards, etc. In general, named events are important historical events, such as wars, conferences, sport events, treaties, etc. Named events are explicit first-class citizens in our data model by specifying their type as T-YAGOEvent.
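The reified representation of Section 3 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class Fact, its field names, and the helper render are hypothetical and not part of T-YAGO. A fact whose subject is another fact is a higher order fact, while a named event carries its temporal relation directly.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Fact:
    """One SPO triple with a fact identifier (hypothetical helper class)."""
    fact_id: str
    subject: Union[str, "Fact"]  # an entity name, or another Fact when reified
    relation: str
    obj: str

    @property
    def is_higher_order(self) -> bool:
        # A higher order fact has another fact as its subject.
        return isinstance(self.subject, Fact)


def render(fact: Fact) -> str:
    """Print a fact in the 'f2: f1 startedOnDate ...' notation used in the text."""
    subject = fact.subject.fact_id if isinstance(fact.subject, Fact) else fact.subject
    return f"{fact.fact_id}: {subject} {fact.relation} {fact.obj}"


# Base fact and its temporal scope, attached through reification.
f1 = Fact("f1", "Bill_Clinton", "isPresidentOf", "USA")
f2 = Fact("f2", f1, "startedOnDate", "20-01-1993")

# A named event: the time point is attached to the entity itself.
e1 = Fact("f3", "French_Revolution", "type", "TYAGOEvent")
e2 = Fact("f4", "French_Revolution", "startedOnDate", "1789-##-##")

for fact in (f1, f2, e1, e2):
    print(render(fact), "(higher order)" if fact.is_higher_order else "")
```

Keeping the fact identifier as an explicit field is what lets a temporal relation point at a whole base fact rather than at an entity.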
Since temporal information is inherent in many facts, we further extend T-YAGO by introducing explicit first order temporal facts via meaningful temporal relations such as wasCreatedOnDate, wasEstablishedOnDate, wasDiscoveredOnDate, etc. In this way, we avoid the complexity of reification and higher order facts, simplifying the notation and usability.
Moreover, the confidence of temporal facts is stored in the knowledge base as meta facts.

4. TEMPORAL FACT AND EVENT EXTRACTION FROM SEMI-STRUCTURED DATA
The temporal fact extraction from Wikipedia consists of two main phases, extraction from semi-structured data and extraction from text. The former is accomplished via harvesting infoboxes, categories, lists, and titles of Wikipedia articles by manually crafted rules to guarantee high precision (discussed in this section). The latter is achieved by harvesting free text of Wikipedia articles by automatically generated quaternary patterns (see Section 5).

4.1 Extraction from Infoboxes
Wikipedia infoboxes are essentially typed records of attribute-value pairs. Infoboxes usually contain the most significant information about the entity described by the article. For example, the infobox of the Wikipedia article about the French Revolution, shown in Table 1, contains information about the name of the event, its location, its date, the participants of the event, and more.

{{Infobox Historical event
| Event Name = The French Revolution
| Image Name = Prise de la Bastille.jpg
| Image Caption = The storming of the Bastille, 14 July 1789
| Participants = French society
| Location = [[France]]
| Date start = 1789
| Date end = 1799
| Result = [[Proclamation of the abolition of the monarchy]]
}}
......
[[Category:French Revolution]]
[[Category:18th-century rebellions]]
[[Category:18th-century revolutions]]
Table 1: Wikipedia infobox for the article French Revolution

A major challenge in harvesting infoboxes is the structural diversity that infoboxes exhibit. Each infobox has a particular type, such as Historical event, Election, Military conflict, Competition, etc. For example, the infobox in Table 1 has type Historical event. During the evolution of Wikipedia, a large variety of templates has been used to represent the same or similar information. This results in different types of infoboxes with different attributes and different units of measurements for representing equivalent information. There are about 3,800 distinct infobox types and each of them has about 20 pairs of attribute and value on average [7]. However, there is a long tail in the distribution of the number of occurrences of the infobox types [7]. Thus, it is plausible to consider only popular infobox types when creating a set of rules for extraction.
In this section, we focus on infoboxes of named events. Our aim is to extract as many named events as possible from infoboxes. The infoboxes are harvested with the help of an attribute map. A complete list of infobox types representing named events and the temporal attributes of the particular infoboxes is shown in Table 2.
Definition: Attribute Map. An attribute map is a function which maps an infobox attribute to a pre-defined relation. An attribute map takes several attributes, such as date, year started, term start, etc. and maps each of them to a target relation, such as happenedOnDate, startedOnDate, etc.
Example event facts that can be extracted from an infobox, such as the one in Table 1, include:
f21: French_Revolution type TYAGOEvent
f22: French_Revolution startedOnDate 1789-##-##.
f23: French_Revolution endedOnDate 1799-##-##.
The process for harvesting event facts from infoboxes is shown in Algorithm 1. The algorithm scans all Wikipedia articles. When it detects an infobox, it checks whether the infobox is a named event infobox. If so, the algorithm creates the base fact showing that the entity is a named event. Then, it goes through all attributes of the infobox. For each attribute, it looks up the attribute map to check whether the attribute is in the map. If the attribute map contains the attribute, the algorithm parses the value of the attribute and attaches the temporal value to the base fact. As an example, if the infobox in Table 1 is given to the algorithm, it will create the event fact
f21: French_Revolution type TYAGOEvent
by detecting the infobox type. Next, the attribute Date_start will be mapped to the relation startedOnDate. A regular-expression-based date parser is used to parse the value of attribute Date_start, yielding the date 1789-##-##. Here we use a slightly modified version of the date parser from [12]. Then, the event fact f22 is created as below.
f22: French_Revolution startedOnDate 1789-##-##.
Then the algorithm continues with the other attributes in the infobox until the last attribute.

Algorithm 1 Harvesting Infoboxes for Named Events
procedure HARVEST_EVENTS(Wikipedia corpus W)
    F ← ∅                                          ▷ facts about named events
    create set of infobox types T for named events
    create attribute map A
    for all w ∈ W do                               ▷ for all articles w
        if w.hasInfobox() and w.getInfoboxType() ∈ T then
            entity ← createEntity(w)               ▷ create the entity associated to article w
            fact ← createFact(entity, type, TYAGOEvent)    ▷ create the event fact with type relation
            F ← F ∪ {fact}                         ▷ store fact
            for all attribute_i ∈ w.infobox do     ▷ for all attributes in infobox of w
                if A.contains(attribute_i) then
                    relation ← A.getTargetRelation(attribute_i)    ▷ the target relation is determined
                    value ← parseValueOf(attribute_i)              ▷ value of attribute_i is parsed
                    if value is in range of relation then
                        F ← F ∪ {createFact(fact, relation, value)}    ▷ the event fact is reified with value and by relation

Infobox Type | Attribute with temporal information
Military Conflict | date
Election | election date
Treaty | date signed
Olympic event | dates
Historical Event | Date
Film | released
Football match | date
UN resolution | date
MMA event | date
Wrestling event | date
Music festival | dates
Awards | year
Attack | date
Accident | dates
Swimming event | dates
Competition | year
Congress | start|end
Tournament | fromdate|todate
Table 2: Event infobox types and attributes. An attribute map contains all the attributes of event infoboxes.

4.2 Extraction from Categories
Among more than 500,000 categories of Wikipedia, about 70,000 categories contain temporal information, such as Conflicts in 2008, 1997 in international relations, Candidates for the French presidential election 2007, etc. These categories are heuristically extracted by employing various regular expressions checking whether a category has a date pattern. The accuracy of these heuristics is high, as shown in our experiments.
Manual inspection shows that temporal categories carry time information about political events, establishments, disestablishments, discoveries, disasters, accidents, publications, births, deaths, etc. In order to connect the temporal information extracted from a category name to the entities belonging to that particular category, we need to determine the correct relation between the entities and the temporal information. For this purpose, we use a category-relation map.
A category-relation map is a function which maps a category name to a pre-defined relation by checking keywords occurring in the category name. To illustrate, the temporal category Establishments in 1999 is mapped to the relation wasEstablishedOnDate, since the category Establishments in 1999 has the keyword establishment. Table 3 shows the category-relation map used in our experiments.
Our system extracts the entities belonging to a temporal category and the temporal information appearing in the category name. Moreover, the relevant relation between the entities and the temporal information is determined by the category-relation map. Finally, the temporal fact is constructed.

Keywords appearing in temporal category names | Target Relation
election, war, battle, conflict, revolution, riot, rebellion, etc | happenedOnDate
olympic, championship, paralympic, sports, games, etc | happenedOnDate
award, prize, etc | happenedOnDate
disaster, explosion, earthquake, etc | happenedOnDate
law, treaty, resolution | happenedOnDate
establishment, commission, open, etc | wasEstablishedOnDate
disestablishment, scuttle, close, etc | wasDestroyedOnDate
discovery, description, notification, introduction, etc | wasDiscoveredOnDate
accident, incident, etc | happenedOnDate
construction, architecture | wasCreatedOnDate
release, song, album, film | wasReleasedOnDate
publication, novel, book, story, essay, poem, etc | wasPublishedOnDate
painting, work | wasCreatedOnDate
opera, musical | wasCreatedOnDate
birth | wasBornOnDate
death | diedOnDate
Table 3: Category-relation map

4.3 Extraction from Titles
There are many Wikipedia articles that contain important dates in their titles; examples are “1941 Coupe de France Final”, “1860 Lebanon conflict”, “Toronto mayoral election 2010”, etc. An important fraction of these articles are about significant historical events such as disasters, battles, conferences, conflicts, elections, sport competitions, etc. Unfortunately, most of them do not have infoboxes for harvesting temporal facts. Thus, the title of the article is a good source to attach a time point to the event.
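The category-relation map of Section 4.2 and the year candidate taken from a name (a category name or an article title) can be sketched together in Python. This is a minimal sketch under assumptions: the map holds only a few keywords from Table 3, keyword matching is done by lowercase substring search, and the helper names (relation_for_category, year_candidate, fact_from_category) as well as the entity Example_Entity are hypothetical. The paper does not specify how keywords are matched.

```python
import re
from typing import Optional, Tuple

# Excerpt of the category-relation map (Table 3). Order matters for substring
# matching: "disestablishment" must be tested before "establishment".
CATEGORY_RELATION_MAP = [
    ("disestablishment", "wasDestroyedOnDate"),
    ("establishment", "wasEstablishedOnDate"),
    ("conflict", "happenedOnDate"),
    ("election", "happenedOnDate"),
    ("birth", "wasBornOnDate"),
    ("death", "diedOnDate"),
]

# Assumption: a standalone four-digit number is a year candidate.
FOUR_DIGITS = re.compile(r"\b(\d{4})\b")


def relation_for_category(category: str) -> Optional[str]:
    """Return the target relation of the first keyword found in the category name."""
    name = category.lower()
    for keyword, relation in CATEGORY_RELATION_MAP:
        if keyword in name:
            return relation
    return None


def year_candidate(name: str) -> Optional[str]:
    """Return a year-only date such as '1999-##-##', or None if no candidate exists."""
    match = FOUR_DIGITS.search(name)
    return f"{match.group(1)}-##-##" if match else None


def fact_from_category(entity: str, category: str) -> Optional[Tuple[str, str, str]]:
    """Build (entity, relation, date) when both a keyword and a year are present."""
    relation = relation_for_category(category)
    date = year_candidate(category)
    if relation is None or date is None:
        return None
    return (entity, relation, date)


print(fact_from_category("Example_Entity", "Establishments in 1999"))
print(year_candidate("1941 Coupe de France Final"))
```

A four-digit match in a title is only a candidate: the sketch would also return a date for a title such as “1644 Rafita”, so the result still needs validation before a fact is created.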
There are titles which seem to contain date patterns, such as “1644 Rafita”, “Skoda 1202”, etc. However, these are false positives as the four-digit numbers refer to names of asteroids, car models, sports teams, etc. In order to overcome this problem, we test whether the categories of the article contain a temporal expression that matches the temporal information in the article title. If a category contains the same four-digit expression as the title, then we assume that this is indeed a year. As our experiments show, this heuristic works very well. This way, we extract many facts for the relation happenedOnDate from titles, which are usually named events.

4.4 Extraction from Lists
Here, we focus on harvesting the date articles in the Wikipedia corpus, such as “2005 (year)”, “25 January”, “15 September 1985”, etc. These articles contain itemized lists about important events, birth and death dates of important people, etc. Each article usually contains three main sections, “events, births, deaths”. Each section contains many sentences or list items about significant entities. The birth and death sections of the articles follow a certain structure, which eases the extraction. The following list, for example, appears in the birth section of the Wikipedia article on the 25th of January (http://en.wikipedia.org/wiki/January_25#Births):
1987 – Maria Kirilenko, Russian tennis player
1989 – Sheryfa Luna, French singer
1989 – Mikako Tabe, Japanese stage and film actress
In summary, by harvesting infoboxes, categories, titles, and lists with hand-crafted extraction rules, we can construct a very large and highly precise temporal knowledge base. In the next section, we discuss how to leverage this knowledge for extracting even more temporal facts from free text.

5. EXTRACTION FROM FREE TEXT
In this section, we present techniques to extract temporal facts from the full text of Wikipedia articles.

5.1 Pattern-based Extraction
We exploit the temporal knowledge base constructed from semi-structured data, in order to further grow the knowledge base. The key idea is to use the existing facts as seeds for generating quaternary patterns, which can then be applied to natural-language text for extracting additional facts. Our techniques largely follow the prior work on pattern induction for information extraction. However, almost all prior work has focused on binary relations. In contrast, our patterns have four arguments as we want to capture basic facts together with their validity timespans.
A quaternary pattern is a pattern that captures a base fact with its temporal scope.
As an example, the quaternary pattern “⟨entity 1⟩ served as ⟨entity 2⟩ from ⟨startDate⟩ to ⟨endDate⟩” can be matched against a full-text article and may produce the following temporal facts:
f1: Bill_Clinton holdsPoliticalPosition President_Of_USA.
f2: f1 startedOnDate 20-01-1993
f3: f1 endedOnDate 20-01-2001
Our pattern-based method consists of two stages:
1. Automatic Pattern Induction: Creation of textual patterns by means of temporal facts that already exist in the knowledge base.
2. Application of Patterns: Applying patterns to free text to create new temporal facts.

5.1.1 Automatic Pattern Induction
The process of generating extraction patterns is bootstrapped by seed facts. Each seed fact for a particular relation has the following format:
“entity 1, entity 2, startDate, endDate”
Sentences in which entity 1, entity 2, startDate, and endDate co-occur are extracted from Wikipedia articles of entity 1 and entity 2. The quaternary patterns are generated from these sentences by the word sequences that appear in between the four arguments matching the seed fact. After all such patterns are created, statistical measures like (frequency-based) support are used to rank patterns, and we take only the ones above a certain threshold.

5.1.2 Application of Patterns
Once we have the quaternary patterns, it is fairly straightforward to apply them to the full text of Wikipedia articles. As an example, the pattern “⟨entity 1⟩ was inaugurated as ⟨entity 2⟩ on ⟨startDate⟩” is created for the relation holdsPoliticalPosition. When this pattern is applied to the sentence “Barack Obama was inaugurated as President of the United States on 20 January 2009.”, it produces the following facts:
f27: Barack_Obama holdsPoliticalPosition President of USA
f28: f27 startedOnDate 20-01-2009.
Note that it is often necessary to check the entities in newly extracted fact candidates as to whether they have the proper types for the domain and range of the specified relation. We ensure the type consistency using the methods described in [13].
The complete procedure of temporal fact extraction is described in Algorithm 2.

6. EXPERIMENTS AND EVALUATION
6.1 Extraction from Semi Structured Text
We downloaded the English version of the Wikipedia XML dump, which is a 23GB text file. Each article is associated with a unique entity, apart from redirect pages, template pages, talk pages, and disambiguation pages. Here, we present the results of the experiments conducted for harvesting semi-structured text from Wikipedia articles. We extracted temporal facts from semi-structured text to create the temporal knowledge base. We ran our extractors over infoboxes, categories, lists, and titles of Wikipedia articles. In all our experiments, we evaluated the precision by using Mechanical Turk over randomly sampled facts.
When harvesting infoboxes, we focused on articles about named events. In order to extract the named events from infoboxes, we developed rules for correct extraction. These rules check the infobox type and decide whether it is a predefined event type, such as war, battle, conference, etc. We invoke these extraction rules as a template classifier as well, since they decide the type of the template of the infobox. Some of these rules are shown in Table 2.
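The type check and attribute lookup performed by these rules (Algorithm 1) can be sketched in Python. This is a minimal sketch under assumptions, not the system's implementation: the infobox is assumed to be already parsed into a dictionary, EVENT_INFOBOX_TYPES and ATTRIBUTE_MAP are small hypothetical excerpts in the spirit of Table 2, and parse_year only recognizes a bare four-digit year, which is far simpler than a regular-expression-based date parser for full dates.

```python
import re
from typing import Dict, List, Optional, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical excerpt of the named-event infobox types.
EVENT_INFOBOX_TYPES = {"historical event", "military conflict", "election", "treaty"}

# Hypothetical excerpt of an attribute map: infobox attribute -> target relation.
ATTRIBUTE_MAP: Dict[str, str] = {
    "date": "happenedOnDate",
    "date start": "startedOnDate",
    "date end": "endedOnDate",
}


def parse_year(value: str) -> Optional[str]:
    """Parse a year-only value into the 'YYYY-##-##' notation, or return None."""
    match = re.search(r"\b(\d{4})\b", value)
    return f"{match.group(1)}-##-##" if match else None


def harvest_infobox(entity: str, infobox_type: str, attributes: Dict[str, str]) -> List[Fact]:
    """Create the event fact and one temporal fact per mapped attribute."""
    # Template classification: only named-event infoboxes are harvested.
    if infobox_type.lower() not in EVENT_INFOBOX_TYPES:
        return []
    facts: List[Fact] = [(entity, "type", "TYAGOEvent")]
    for name, raw_value in attributes.items():
        relation = ATTRIBUTE_MAP.get(name.lower().replace("_", " "))
        if relation is None:
            continue  # attribute carries no temporal information we know of
        date = parse_year(raw_value)
        if date is not None:
            facts.append((entity, relation, date))
    return facts


french_revolution = {
    "Event Name": "The French Revolution",
    "Location": "[[France]]",
    "Date start": "1789",
    "Date end": "1799",
}
for fact in harvest_infobox("French_Revolution", "Historical event", french_revolution):
    print(fact)
```

For brevity the sketch attaches dates to the entity, as in the example facts f22 and f23, rather than reifying the event fact.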
The number of named events extracted from infoboxes and their precision is shown in Table 5.
In addition to temporal facts extracted from Wikipedia infoboxes, we also extracted a large number of temporal facts from Wikipedia categories. For the sake of re-usability, we built a database of Wikipedia categories which contains 534,765 categories. By using regular expressions for dates, we identified 60,582 categories which contain temporal information. We achieved very high precision in selecting this subset of relevant categories. We manually assessed 200 randomly chosen temporal categories, and found that 98% of them do indeed have temporal information.
The evaluation of the temporal fact extraction from categories is shown in Table 5. Moreover, Table 4 shows the distribution of extracted temporal facts for different categories.
In addition to infoboxes and categories, lists in date articles of Wikipedia provided us with a high amount of temporal facts about birth and death dates. We could extract around 83,000 temporal facts from date articles. 18,000 of these facts were already extracted during extraction from categories; so 65,000 new facts were found only in lists.
Finally, we could extract about 60,000 events from article titles. The evaluation for lists and titles is shown in the last two rows of Table 5.

Source | # facts | # samples | Precision
Infoboxes | 100K | 200 | 90.0%
Categories | 1.49M | 240 | 93.2%
Lists | 83K | 100 | 100.0%
Titles | 60K | 100 | 96.0%
Table 5: Evaluation of the facts extracted from semi-structured parts of Wikipedia articles.

6.2 Extraction from Free Text
This section presents results for the extraction from free text. We start by presenting the data collection and the experimental setting, then we discuss the results.
We used a sample of 100 Wikipedia articles about famous politicians. We focused on fact extraction for the relation holdsPoliticalPosition, since holdsPoliticalPosition is one of the relations in YAGO2 which have the least number of facts, namely, around 3,000 facts. We selected 50 politicians as seed entities in order to automatically create patterns for the relation holdsPoliticalPosition.
The pattern induction step produced a list of ranked patterns. Some of the best patterns are shown in Table 8. The automatically generated patterns are instantiated on the test corpus of 100 articles (disjoint from the 50 seed articles), in order to extract base facts and the temporal information qualifying these base facts. All temporal facts extracted from free text are evaluated by using Mechanical Turk. Table 7 shows the results.

6.3 Comparison to Other Temporal Knowledge Bases
We also compared the results of our system to the systems that have similar spirit and comparable sizes, e.g., T-YAGO and YAGO2. T-YAGO exclusively focused on the football domain and gathered around 300,000 temporal facts, but did not report on precision. We compared our results to YAGO2, which harvests temporal facts and events from infoboxes. Table 6 shows that our system outperformed YAGO2 by a large margin, in terms of coverage for temporal facts and named events with time scopes. The precision figures are comparable. Meanwhile our methods have been integrated into YAGO2, so new releases will contain our larger-coverage results.

System | # facts | Precision | # events | Precision
T-YAGO | 300K | n/a | n/a | n/a
YAGO2 | 565K | ∼96% | 3.1K | ∼97%
Our System | 1.74M | ∼94% | 160K | ∼93%
Table 6: Comparison of T-YAGO, YAGO2 and our system in terms of number of temporal facts and number of named events with time scope.

Algorithm 2 Temporal Fact Extraction
procedure HARVEST_TEMPORAL_FACTS(seed facts SF, corpus W, knowledge base KB, relation R)
    P ← INDUCE_PATTERNS(SF, W, KB, R)              ▷ collect patterns
    create and rank pattern vector v(P) using patterns in P
    F ← EXTRACT_FACTS(W, v(P))                     ▷ collect facts
    return all facts F                             ▷ facts as final output

procedure INDUCE_PATTERNS(seed facts SF, corpus W, knowledge base KB, relation R)
    P ← ∅                                          ▷ set of patterns
    args ← ∅                                       ▷ set of argument quadruples
    W2 ← createSubcorpusOfSeedEntities(SF, W)      ▷ get articles of seed facts from the entire corpus
    for all f ∈ SF do                              ▷ for each fact f
        args ← args ∪ {args(f)}                    ▷ get all arguments, quadruples of the fact
    for all arg_i ∈ args do                        ▷ for each quadruple
        P ← P ∪ {createPattern(arg_i, W2)}         ▷ for each satisfying string of W2, create a pattern
    return patterns P

procedure EXTRACT_FACTS(corpus W, pattern vector V)
    F ← ∅                                          ▷ set of new facts
    for all w ∈ W do                               ▷ for all articles in W
        for all p ∈ V do                           ▷ for all patterns
            for all s ∈ w satisfying p do          ▷ for all strings s in w matching the pattern p
                F ← F ∪ {applyPatternAndCreateFact(w, p, s)}    ▷ creating the fact by applying p on s
    return F
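The EXTRACT_FACTS step of Algorithm 2 can be illustrated with a minimal Python sketch for a single pattern. It is a sketch under assumptions: the pattern for the inauguration example of Section 5.1.2 is hand-compiled into a regular expression with named groups, and the regular expression, the group names, and the helpers normalize_date and apply_pattern are hypothetical. The sketch keeps the surface string of the second entity and performs no entity disambiguation or type checking.

```python
import re
from typing import List, Tuple

Fact = Tuple[str, str, str, str]  # (fact id, subject, relation, object)

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hand-written regex standing in for the induced pattern
# "(entity 1) was inaugurated as (entity 2) on (startDate)".
INAUGURATED = re.compile(
    r"(?P<e1>[A-Z][\w .]*?) was inaugurated as (?P<e2>.+?) on "
    r"(?P<start>\d{1,2} [A-Z][a-z]+ \d{4})"
)


def normalize_date(text: str) -> str:
    """Turn '20 January 2009' into the DD-MM-YYYY notation used in the examples."""
    day, month, year = text.split()
    return f"{int(day):02d}-{MONTHS.index(month) + 1:02d}-{year}"


def apply_pattern(sentence: str, relation: str, next_id: int) -> List[Fact]:
    """Return the base fact and its reified start date, or [] if nothing matches."""
    match = INAUGURATED.search(sentence)
    if match is None:
        return []
    base_id = f"f{next_id}"
    subject = match.group("e1").replace(" ", "_")
    base = (base_id, subject, relation, match.group("e2"))
    scope = (f"f{next_id + 1}", base_id, "startedOnDate",
             normalize_date(match.group("start")))
    return [base, scope]


sentence = ("Barack Obama was inaugurated as President of the United States "
            "on 20 January 2009.")
for fact in apply_pattern(sentence, "holdsPoliticalPosition", 27):
    print(fact)
```

Returning the base fact together with its temporal scope mirrors the idea of a quaternary pattern: one match yields both the relation instance and its validity time.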
Temporal Category Types                  Number of Categories   Number of Extracted Temporal Facts
Categories about disasters               1001                   1477
Categories about musicals                570                    2938
Categories about accidents               635                    4864
Categories about laws                    712                    6520
Categories about artistic works          1626                   7998
Categories about discoveries             1341                   13078
Categories about political events        3265                   23172
Categories about constructions           698                    22267
Categories about books                   1759                   39818
Categories about (dis)establishments     10895                  174649
Categories about movies                  1670                   254122
Categories about births and deaths       5904                   941281
Total                                    30076                  1492184

Table 4: Statistics about the temporal categories and about the temporal facts extracted from them

# seeds   # articles   # facts   Precision
50        100          221       67%

Table 7: Evaluation of pattern-based temporal fact extraction for the relation holdsPoliticalPosition.

Automatically Generated Quaternary Patterns
was from to
served as from to
was the serving from until
was inaugurated as on
notably served as from to
was appointed on
was elected in
was sworn in as on
is a politician who was the from to

Table 8: Some quaternary patterns for the relation holdsPoliticalPosition.

6.4 Discussion

We made some interesting observations about Wikipedia categories. Table 4 shows that a huge number of temporal facts can be extracted from Wikipedia categories. Although prior work on knowledge extraction from Wikipedia categories [13, 4, 9] paid great attention to category names, it disregarded the great potential for extracting temporal facts. Our system extracted about 1.5 million facts from around 30,000 categories. As our sampling-based evaluation shows, the precision of our results is 90% or higher. We do have a non-negligible number of false positives in the extracted facts, though. These typically stem from infoboxes that do not obey Wikipedia template standards or from articles with incorrectly assigned categories. Nevertheless, we are highly satisfied with the precision above 90%, and we are working on techniques to counter the remaining error sources.

As for the extraction from free text, we achieved high recall: more than 200 temporal facts for the relation holdsPoliticalPosition from merely 100 articles. However, our precision of around 70% needs further improvement. The main problem at this point is that our ranking of quaternary patterns is too simple, so we do accept some noisy patterns. We are investigating better ways of pruning out bad patterns.

7. CONCLUSION AND FUTURE WORK

We have presented a framework to extract temporal facts from the semi-structured parts of Wikipedia articles. We also introduced an automatic pattern induction method to create patterns for given relations and to rank them based on frequency statistics of patterns. We used patterns to extract temporal facts from text.

Our work leaves ample space for further improvements. Possible directions for future work include:

• Developing a full-fledged temporal ontology. In our current work, we extended the T-YAGO architecture by adding new relations and the notion of named events. However, building a new ontology which is aware of the temporal dimension of entities and facts, the interactions of facts, the ordering and the causality of events, time-lines of events, and life stories of important entities (e.g., persons, organizations, etc.) still remains a widely open area.

• News and other Web sources. We currently use Wikipedia articles for extraction. We further aim to tap into news articles and other high-quality Web sources, since they contain a large amount of temporal information.

• Improving recall and precision. Although our experiments show satisfying precision with a large number of facts, we would like to extract more facts from text with high precision. One of the limiting factors for recall and precision is the difficulty of coreference resolution, named entity recognition and entity disambiguation. We used heuristics to address these challenges; in future releases, we plan to incorporate advanced NLP techniques. Last but not least, we aim to improve our ranking of patterns for temporal fact extraction.

8. REFERENCES

[1] D. Ahn, S. Adafre, and M. de Rijke. Towards Task-Based Temporal Extraction and Recognition. Internationales Begegnungs-und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, 2005.
[2] J. F. Allen. Maintaining Knowledge About Temporal Intervals. In Communications of the ACM, volume 26, pages 832–843. ACM, 1983.
[3] O. Alonso, M. Gertz, and R. Baeza-Yates. Temporal Analysis of Document Collections: Framework and Applications. In Proceedings of the 17th International Conference on String Processing and Information Retrieval, SPIRE '10, pages 290–296. Springer-Verlag, 2010.
[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, ISWC '07/ASWC '07, pages 722–735. Springer-Verlag, 2007.
[5] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-Scale Information Extraction in KnowItAll: (Preliminary Results). In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pages 100–110. ACM, 2004.
[6] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. de Melo, and G. Weikum. YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and Many Languages. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 229–232. ACM, 2011.
[7] E. Kuzey. Extraction of Temporal Facts and Events from Wikipedia. Master's thesis, Saarland University, 2011.
[8] X. Ling and D. S. Weld. Temporal Information Extraction. In Proceedings of the Association for the Advancement of Artificial Intelligence, AAAI '07, pages 1385–1390. AAAI Press, 2010.
[9] V. Nastase, M. Strube, B. Boerschinger, C. Zirn, and A. Elghafari. Wikinet: A Very Large Scale Multi-Lingual Concept Network. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC '10. European Language Resources Association (ELRA), 2010.
[10] J. Pustejovsky and M. Verhagen. SemEval-2010 Task 13: Evaluating Events, Time Expressions, and Temporal Relations (TempEval-2). In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, DEW '09, pages 112–116. ACL, 2009.
[11] J. Strötgen and M. Gertz. HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 321–324. ACL, 2010.
[12] F. Suchanek, G. Ifrim, and G. Weikum. LEILA: Learning to Extract Information by Linguistic Analysis. In Proceedings of the 2nd Workshop on Ontology Learning and Population, COLING '06, pages 18–25. ACL, 2006.
[13] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 697–706. ACM, 2007.
[14] P. P. Talukdar, D. Wijaya, and T. Mitchell. Coupled Temporal Scoping of Relational Facts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 73–82. ACM, 2012.
[15] M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, and J. Pustejovsky. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 75–80. ACL, 2007.
[16] M. Verhagen and J. Pustejovsky. Temporal Processing with the TARSQI Toolkit. In 22nd International Conference on Computational Linguistics: Demonstration Papers, COLING '08, pages 189–192. ACL, 2008.
[17] Y. Wang, M. Yahya, and M. Theobald. Time-Aware Reasoning in Uncertain Knowledge Bases. In Management of Uncertain Data Workshop, 2010.
[18] Y. Wang, B. Yang, L. Qu, M. Spaniol, and G. Weikum. Harvesting Facts From Textual Web Sources by Constrained Label Propagation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 837–846. ACM, 2011.
[19] Y. Wang, M. Zhu, L. Qu, M. Spaniol, and G. Weikum. Timely YAGO: Harvesting, Querying, and Visualizing Temporal Knowledge from Wikipedia. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 697–700. ACM, 2010.
[20] D. S. Weld, F. Wu, E. Adar, S. Amershi, J. Fogarty, R. Hoffmann, K. Patel, and M. Skinner. Intelligence in Wikipedia. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, pages 1609–1614. AAAI Press, 2008.
[21] K. Wong, Y. Xia, W. Li, and C. Yuan. An Overview of Temporal Information Extraction. International Journal of Computer Processing of Oriental Languages, pages 137–152, 2005.