Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project
Tiziano Flati, Daniele Vannella, Tommaso Pasini and Roberto Navigli
Dipartimento di Informatica, Sapienza Università di Roma
{flati,vannella,navigli}@di.uniroma1.it, p.tommaso@gmail.com

Abstract

We present WiBi, an approach to the automatic creation of a bitaxonomy for Wikipedia, that is, an integrated taxonomy of Wikipedia pages and categories. We leverage the information available in either one of the taxonomies to reinforce the creation of the other taxonomy. Our experiments show higher quality and coverage than state-of-the-art resources like DBpedia, YAGO, MENTA, WikiNet and WikiTaxonomy. WiBi is available at http://wibitaxonomy.org.

1 Introduction

Knowledge has unquestionably become a key component of current intelligent systems in many fields of Artificial Intelligence. The creation and use of machine-readable knowledge has not only engaged researchers (Mitchell, 2005; Mirkin et al., 2009; Poon et al., 2010) in developing huge, broad-coverage knowledge bases (Hovy et al., 2013; Suchanek and Weikum, 2013), but it has also attracted big industry players such as Google (Singhal, 2012) and IBM (Ferrucci, 2012), which are moving fast towards large-scale knowledge-oriented systems.

The creation of very large knowledge bases has been made possible by the availability of collaboratively-curated online resources such as Wikipedia and Wiktionary. These resources are increasingly being enriched with new content in many languages and, although they are only partially structured, they provide a great deal of valuable knowledge which can be harvested and transformed into structured form (Medelyan et al., 2009; Hovy et al., 2013). Prominent examples include DBpedia (Bizer et al., 2009), BabelNet (Navigli and Ponzetto, 2012), YAGO (Hoffart et al., 2013) and WikiNet (Nastase and Strube, 2013). The types of semantic relation in these resources range from domain-specific, as in Freebase (Bollacker et al., 2008), to unspecified relations, as in BabelNet. However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources (e.g., linking Wikipedia categories to WordNet synsets, as in YAGO), or coarse-grained, as in DBpedia, whose hypernyms link to a small upper taxonomy.

Current approaches in the literature have mostly focused on the extraction of taxonomies from the network of Wikipedia categories. WikiTaxonomy (Ponzetto and Strube, 2007), the first approach of this kind, is based on the use of heuristics to determine whether is-a relations hold between a category and its subcategories. Subsequent approaches have also exploited heuristics, but have extended them to any kind of semantic relation expressed in the category names (Nastase and Strube, 2013). But while the aforementioned attempts provide structure for categories that supply meta-information for Wikipedia pages, surprisingly little attention has been paid to the acquisition of a full-fledged taxonomy for the Wikipedia pages themselves. For instance, Ruiz-Casado et al. (2005) provide a general vector-based method which, however, is incapable of linking pages which do not have a WordNet counterpart. Higher coverage is provided by de Melo and Weikum (2010) thanks to the use of a set of effective heuristics; however, the approach also draws on WordNet and sense frequency information.

In this paper we address the task of taxonomizing Wikipedia in a way that is fully independent of other existing resources such as WordNet. We present WiBi, a novel approach to the creation of a Wikipedia bitaxonomy, that is, a taxonomy of Wikipedia pages aligned to a taxonomy of categories.
At the core of our approach lies the idea that the information at the page and category levels is mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.

2 WiBi: A Wikipedia Bitaxonomy

We induce a Wikipedia bitaxonomy, i.e., a taxonomy of pages and categories, in 3 phases:

1. Creation of the initial page taxonomy: we first create a taxonomy for the Wikipedia pages by parsing textual definitions, extracting the hypernym(s) and disambiguating them according to the page inventory.

2. Creation of the bitaxonomy: we leverage the hypernyms in the page taxonomy, together with their links to the corresponding categories, so as to induce a taxonomy over Wikipedia categories in an iterative way. At each iteration, the links in the page taxonomy are used to identify category hypernyms and, conversely, the new category hypernyms are used to identify more page hypernyms.

3. Refinement of the category taxonomy: finally, we employ structural heuristics to overcome inherent problems affecting categories.

The output of our three-phase approach is a bitaxonomy of millions of pages and hundreds of thousands of categories for the English Wikipedia.

3 Phase 1: Inducing the Page Taxonomy

The goal of the first phase is to induce a taxonomy of Wikipedia pages. Let P be the set of all the pages and let T_P = (P, E) be the page taxonomy, a directed graph whose nodes are pages and whose edge set E is initially empty (E := ∅). For each p ∈ P our aim is to identify the most suitable generalization p_h ∈ P so that we can create the edge (p, p_h) and add it to E. For instance, given the page APPLE, which represents the fruit meaning of apple, we want to determine that its hypernym is FRUIT and add the hypernym edge connecting the two pages (i.e., E := E ∪ {(APPLE, FRUIT)}). To do this, we perform a syntactic step, in which the hypernyms are extracted from the page's textual definition, and a semantic step, in which the extracted hypernyms are disambiguated according to the Wikipedia inventory.
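As a minimal illustration of the structure this phase builds (the class and method names below are ours, not the paper's), the page taxonomy T_P can be held as a node set plus a growing map from each page to its hypernym pages:

    from collections import defaultdict

    class PageTaxonomy:
        """Page taxonomy T_P = (P, E): nodes are page titles and the
        edge set E maps each page to its hypernym pages."""
        def __init__(self, pages):
            self.pages = set(pages)            # P
            self.hypernyms = defaultdict(set)  # E := the empty set

        def add_is_a(self, page, hypernym_page):
            # E := E ∪ {(p, p_h)}
            assert page in self.pages and hypernym_page in self.pages
            self.hypernyms[page].add(hypernym_page)

    t_p = PageTaxonomy({"Apple", "Fruit"})
    t_p.add_is_a("Apple", "Fruit")  # the APPLE is-a FRUIT edge from the text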
3.1 Syntactic step: hypernym extraction

In the syntactic step, for each page p ∈ P, we extract zero, one or more hypernym lemmas, that is, we output potentially ambiguous hypernyms for the page. The first assumption, which follows the Wikipedia guidelines and is validated in the literature (Navigli and Velardi, 2010; Navigli and Ponzetto, 2012), is that the first sentence of each Wikipedia page p provides a textual definition for the concept represented by p. The second assumption we build upon is the idea that a lexical taxonomy can be obtained by extracting hypernyms from textual definitions. This idea dates back to the early 1970s (Calzolari et al., 1973), with later developments in the 1980s (Amsler, 1981; Calzolari, 1982) and the 1990s (Ide and Véronis, 1993).

To extract hypernym lemmas, we draw on the notion of copula, that is, the relation between the complement of a copular verb and the copular verb itself. Therefore, we apply the Stanford parser (Klein and Manning, 2003) to the definition of a page in order to extract all the dependency relations of the sentence. For example, given the definition of the page JULIA ROBERTS, i.e., "Julia Fiona Roberts is an American actress.", the Stanford parser outputs the set of dependencies shown in Figure 1.

[Figure 1: A dependency tree example with copula: "Julia Fiona Roberts is an American actress" (NNP NNP NNP VBZ DT JJ NN), with the relations nsubj, cop, nn, det and amod.]

The noun involved in the copula relation is actress and thus it is taken as the page's hypernym lemma. However, the extracted hypernym is sometimes overgeneral (one, kind, type, etc.). For instance, given the definition of the page APOLLO, "Apollo is one of the most important and complex of the Olympian deities in ancient Greek and Roman religion [...]", the only copula relation extracted is between is and one. To cope with this problem we use a list of stopwords.¹ When such a term is extracted as hypernym, we replace it with the rightmost noun of the first following noun sequence (e.g., deity in the above example). If the resulting lemma is again a stopword, we repeat the procedure until either a valid hypernym is found or no appropriate hypernym can be found. Finally, to capture multiple hypernyms, we iteratively follow the conj_and and conj_or relations starting from the initially extracted hypernym. For example, consider the definition of ARISTOTLE: "Aristotle was a Greek philosopher and polymath, a student of Plato and teacher of Alexander the Great." Initially, the philosopher hypernym is selected thanks to the copula relation; then, following the conjunction relations, polymath, student and teacher are also extracted as hypernyms. While more sophisticated approaches like Word-Class Lattices (Navigli and Velardi, 2010) could be applied, we found that, in practice, our hypernym extraction approach provides higher coverage, which is critical in our case.

¹ E.g., species, genus, one, etc. Full list available online.
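The syntactic step can be approximated in a few lines over the parser's typed dependencies. This is only a sketch under stated assumptions: tokens are (lemma, POS) pairs, dependencies are (relation, governor index, dependent index) triples in the Stanford style, and the stopword list is a small stand-in for the paper's full list.

    STOP_HYPERNYMS = {"one", "kind", "type", "species", "genus"}

    def extract_hypernym_lemmas(tokens, deps):
        # The copula complement is the governor of the cop relation.
        head = next((g for rel, g, d in deps if rel == "cop"), None)
        if head is None:
            return []
        # Replace overgeneral heads with the rightmost noun of the first
        # following noun sequence, repeating while we land on a stopword.
        while head is not None and tokens[head][0] in STOP_HYPERNYMS:
            i = head + 1
            while i < len(tokens) and not tokens[i][1].startswith("NN"):
                i += 1
            j = i
            while j + 1 < len(tokens) and tokens[j + 1][1].startswith("NN"):
                j += 1
            head = j if i < len(tokens) else None
        if head is None:
            return []
        # Follow conj_and / conj_or to capture coordinated hypernyms.
        hypernyms, frontier = [tokens[head][0]], [head]
        while frontier:
            h = frontier.pop()
            for rel, g, d in deps:
                if g == h and rel in ("conj_and", "conj_or"):
                    hypernyms.append(tokens[d][0])
                    frontier.append(d)
        return hypernyms

    # "Julia Fiona Roberts is an American actress."
    tokens = [("Julia", "NNP"), ("Fiona", "NNP"), ("Roberts", "NNP"),
              ("be", "VBZ"), ("a", "DT"), ("American", "JJ"), ("actress", "NN")]
    deps = [("nn", 2, 0), ("nn", 2, 1), ("nsubj", 6, 2),
            ("cop", 6, 3), ("det", 6, 4), ("amod", 6, 5)]
    print(extract_hypernym_lemmas(tokens, deps))  # ['actress']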
3.2 Semantic step: hypernym disambiguation

Since our aim is to connect pairs of pages via hypernym relations, our second step consists of disambiguating the obtained hypernym lemmas of page p by associating the most suitable page with each hypernym. Following previous work (Ruiz-Casado et al., 2005; Navigli and Ponzetto, 2012), as the inventory for a given lemma we consider the set of pages whose main title is the lemma itself, except for the sense specification in parentheses. For instance, given fruit as the hypernym for APPLE, we would like to link APPLE to FRUIT as opposed to, e.g., FRUIT (BAND) or FRUIT (ALBUM).

3.2.1 Hypernym linkers

To disambiguate hypernym lemmas, we exploit the structural features of Wikipedia through a pipeline of hypernym linkers L = {L_i}, applied in cascade order (cf. Section 3.3.1). We start with the set of page-hypernym pairs H = {(p, h)} as obtained from the syntactic step. The successful application of a linker to a pair (p, h) ∈ H yields a page p_h as the most suitable sense of h, resulting in setting isa(p, h) = p_h. At step i, the i-th linker L_i ∈ L is applied to H and all the hypernyms which the linker could disambiguate are removed from H. This prevents lower-precision linkers from overriding decisions taken by more accurate ones.

We now describe the hypernym linkers. In what follows we denote with p →_h p_h the fact that the definition of a Wikipedia page p contains an occurrence of h linked to page p_h. Note that p_h is not necessarily a sense of h.

Crowdsourced linker: If p →_h p_h, i.e., the hypernym h is found to have been manually linked to p_h in p by Wikipedians, we assign isa(p, h) = p_h. For example, because capital was linked in the BRUSSELS page definition to CAPITAL CITY, we set isa(BRUSSELS, capital) = CAPITAL CITY.

Category linker: Given the set W ⊂ P of Wikipedia pages which have at least one category in common with p, we select the majority sense of h, if there is one, as hyperlinked across all the definitions of pages in W:

    isa(p, h) = argmax_{p_h} Σ_{p′ ∈ W} 1(p′ →_h p_h)

where 1(p′ →_h p_h) is the characteristic function which equals 1 if h is linked to p_h in page p′, and 0 otherwise. For example, the linker sets isa(EGGPLANT, plant) = PLANT because most of the pages associated with TROPICAL FRUIT, a category of EGGPLANT, contain in their definitions the term plant linked to the PLANT page.
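A hedged sketch of the category linker's majority vote follows; `categories_of` and `definition_links` are assumed precomputed inputs (our names), mapping each page to its categories and to the (lemma, target) links found in its definition:

    from collections import Counter

    def category_linker(p, h, categories_of, definition_links):
        # W: pages sharing at least one category with p
        shared = [q for q in definition_links
                  if q != p and categories_of[q] & categories_of[p]]
        # Count, across the definitions of pages in W, the pages that
        # the lemma h is linked to, and return the majority sense.
        votes = Counter(target for q in shared
                        for lemma, target in definition_links[q]
                        if lemma == h)
        return votes.most_common(1)[0][0] if votes else None

    cats = {"Eggplant": {"Tropical fruit"}, "Mango": {"Tropical fruit"},
            "Papaya": {"Tropical fruit"}}
    links = {"Eggplant": set(), "Mango": {("plant", "Plant")},
             "Papaya": {("plant", "Plant")}}
    print(category_linker("Eggplant", "plant", cats, links))  # Plant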
Multiword linker: If p →_m p_h and m is a multiword expression containing the lemma h as one of its words, set isa(p, h) = p_h. For example, we set isa(PROTEIN, compound) = CHEMICAL COMPOUND, as chemical compound is linked to CHEMICAL COMPOUND in the definition of PROTEIN.

Monosemous linker: If h is monosemous in Wikipedia (i.e., there is only a single page p_h for that lemma), link it to its only sense by setting isa(p, h) = p_h. For example, we extract the hypernym businessperson from the definition of MERCHANT and, as it is unambiguous, we link it to BUSINESSPERSON.

Distributional linker: Finally, we provide a distributional approach to hypernym disambiguation. We represent the textual definition of page p as a distributional vector v_p whose components are all the English lemmas in Wikipedia. The value of each component is the occurrence count of the corresponding content word in the definition of p. The goal of this approach is to find the best link for hypernym h of p among the pages h is linked to, across the whole set of definitions in Wikipedia. Formally, for each p_h such that h is linked to p_h in some definition, we define the set of pages P(p_h) whose definitions contain a link to p_h, i.e., P(p_h) = {p′ ∈ P | p′ →_h p_h}. We then build a distributional vector v_{p′} for each p′ ∈ P(p_h) as explained above and create an aggregate vector

    v_{p_h} = Σ_{p′ ∈ P(p_h)} v_{p′}

Finally, we determine the similarity of p to each p_h by calculating the dot product between the two vectors: sim(p, p_h) = v_p · v_{p_h}. If sim(p, p_h) > 0 for any p_h we perform the following association:

    isa(p, h) = argmax_{p_h} sim(p, p_h)

For example, thanks to this linker we set isa(VACUUM CLEANER, device) = MACHINE.
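The distributional linker reduces to sparse-vector aggregation plus a dot product. Again a sketch with assumed inputs (our names): `definitions` maps a page to the content-word lemmas of its definition, `definition_links` to the (lemma, target) links found in it.

    from collections import Counter

    def distributional_linker(p, h, definitions, definition_links):
        v_p = Counter(definitions[p])  # vector of the definition of p
        # One aggregate vector per candidate page that h links to somewhere:
        # the sum of the vectors of all pages whose definition links to it.
        aggregate = {}
        for q, links in definition_links.items():
            for lemma, target in links:
                if lemma == h:
                    aggregate.setdefault(target, Counter()).update(definitions[q])
        best, best_sim = None, 0
        for p_h, v_ph in aggregate.items():
            sim = sum(v_p[w] * v_ph[w] for w in v_p)  # dot product
            if sim > best_sim:
                best, best_sim = p_h, sim
        return best  # None when sim(p, p_h) = 0 for every candidate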
3.3 Page Taxonomy Evaluation

Statistics: We applied the above linkers to the October 2012 English Wikipedia dump. Out of the 3,829,058 total pages, 4,270,232 hypernym lemmas were extracted in the syntactic step for 3,697,113 pages (covering more than 96% of the total). Due to ill-formed definitions, though, it was not always possible to extract the hypernym lemma: for example, 6 APRIL 2010 BAGHDAD BOMBINGS is defined as "A series of bomb explosions destroyed several buildings in Baghdad", which only implicitly provides the hypernym. The semantic step disambiguated 3,718,612 hypernyms for 3,294,562 Wikipedia pages, i.e., covering more than 86% of the English pages with at least one disambiguated hypernym. Figure 2 plots the number and distribution of hypernyms disambiguated by our hypernym linkers.

[Figure 2: Distribution of linked hypernyms.]

Taxonomy quality: To evaluate the quality of our page taxonomy we randomly sampled 1,000 Wikipedia pages. For each page we provided: i) a list of suitable hypernym lemmas for the page, mainly selected from its definition; ii) for each lemma, the correct hypernym page(s). We calculated precision as the average ratio of correct hypernym lemmas (senses) to the total number of lemmas (senses) returned for all the pages in the dataset, recall as the number of correct lemmas (senses) over the total number of lemmas (senses) in the dataset, and coverage as the fraction of pages for which at least one lemma (sense) was returned, independently of its correctness. Results, both at lemma- and sense-level, are reported in Table 1. Not only does our taxonomy show high precision and recall in extracting ambiguous hypernyms, it also disambiguates more than 3/4 of the hypernyms with high precision.

            Prec.   Rec.    Cov.
    Lemma   94.83   90.20   98.50
    Sense   82.77   75.10   89.20

Table 1: Page taxonomy performance.

3.3.1 Hypernym linker order

The optimal order of application of the above linkers is the same as that presented in Section 3.2.1. It was established by selecting the combination, among all possible permutations, which maximized precision on a tuning set of 100 randomly sampled pages, disjoint from our page dataset.
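Under the definitions given above, the three measures can be computed as in this sketch (reading the paper's "average ratio" as a micro-average; `answers` holds the returned hypernyms per page and `gold` the manually validated correct ones, both our names):

    def evaluate(answers, gold):
        returned = sum(len(answers[p]) for p in gold)
        correct = sum(len(answers[p] & gold[p]) for p in gold)
        in_dataset = sum(len(gold[p]) for p in gold)
        precision = correct / returned if returned else 0.0
        recall = correct / in_dataset
        # Coverage counts pages with at least one answer, right or wrong.
        coverage = sum(1 for p in gold if answers[p]) / len(gold)
        return precision, recall, coverage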
4 Phase 2: Inducing the Bitaxonomy

The page taxonomy built in Section 3 will serve as a stable, pivotal input to the second phase, the aim of which is to build our bitaxonomy, that is, a taxonomy of pages and categories. Our key idea is that the generalization-specialization information available in each of the two taxonomies is mutually beneficial. We implement this idea by exploiting one taxonomy to update the other, and vice versa, in an iterative way, until a fixed point is reached. The final output of this phase is, on the one hand, a page taxonomy augmented with additional hypernymy relations and, on the other hand, a category taxonomy which is built from scratch.

4.1 Initialization

Our bitaxonomy B = {T_P, T_C} is a pair consisting of the page taxonomy T_P = (P, E), as obtained in Section 3, and the category taxonomy T_C = (C, ∅), which initially contains all the categories as nodes but does not include any hypernym edge between category nodes. In the following we describe the core algorithm of our approach, which iteratively and mutually populates and refines the edge sets E(T_P) and E(T_C).

4.2 The Bitaxonomy Algorithm

Preliminaries: Before proceeding, we define some basic concepts that will turn out to be useful for presenting our algorithm. We denote by super_T(t) the set of all ancestors of a node t in the taxonomy T (be it T_P or T_C). We further define a verification function t ⇝_T t′ which, in the case of T_C, returns true if there is a path in the Wikipedia category network between t and t′, false otherwise, and, in the case of T_P, returns true if t′ is a sense, i.e., a page, of a hypernym h of t (that is, (t, h) ∈ H, cf. Section 3.2.1). For instance, SPORTSMEN ⇝_{T_C} MEN BY OCCUPATION holds for categories because the former is a sub-category of the latter in Wikipedia, and RADIOHEAD ⇝_{T_P} BAND (MUSIC) holds for pages, because band is a hypernym extracted from the textual definition of RADIOHEAD and BAND (MUSIC) is a sense of band in Wikipedia. Note that, while the super function returns information that we have already learned, i.e., it is in T_P and T_C, the ⇝ operator holds just for candidate is-a relations, as it uses knowledge from Wikipedia itself which is potentially incorrect. For instance, SPORTSMEN ⇝_{T_C} MEN'S SPORTS in the Wikipedia category network, and RADIOHEAD ⇝_{T_P} BAND (RADIO) between the two Wikipedia pages, both hold according to our definition of ⇝, while connecting the wrong hypernym candidates. At the core of our algorithm, explained below, is the mutual leveraging of the super function from one of the two taxonomies (pages or categories) to decide which candidates (for which a ⇝ relation holds) in the other taxonomy are real hypernyms.

Finally, we define the projection operator π, such that π(c), c ∈ C, is the set of pages categorized with c, and π(p), p ∈ P, is the set of categories associated with page p in Wikipedia. For instance, the pages which belong to the category OLYMPIC SPORTS are given by π(OLYMPIC SPORTS) = {BASEBALL, BOXING, ..., TRIATHLON}. Vice versa, π(TRIATHLON) = {MULTISPORTS, OLYMPIC SPORTS, ..., OPEN WATER SWIMMING}. The projection operator π enables us to jump from one taxonomy to the other and expresses the mutual membership relation between pages and categories.
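The projection operator can be realized from the page-category membership table alone. This sketch (all names ours) assumes page and category identifiers live in disjoint namespaces:

    from collections import defaultdict

    def make_projection(page_categories):
        """page_categories: page -> set of its Wikipedia categories."""
        category_pages = defaultdict(set)
        for page, cats in page_categories.items():
            for c in cats:
                category_pages[c].add(page)

        def pi(node):
            # pi(p) = categories of page p; pi(c) = pages categorized with c
            if node in page_categories:
                return page_categories[node]
            return category_pages.get(node, set())

        return pi

    pi = make_projection({"Triathlon": {"Multisports", "Olympic sports"},
                          "Baseball": {"Olympic sports"}})
    print(pi("Olympic sports"))  # {'Triathlon', 'Baseball'}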
Algorithm: We now describe in detail the bitaxonomy algorithm, whose pseudocode is given in Algorithm 1. The algorithm takes as input the two taxonomies, initialized as described in Section 4.1. Starting from the category taxonomy (line 1), the algorithm updates the two taxonomies in turn, until convergence is reached, i.e., no more edges can be added to any side of the bitaxonomy. Let T be the current taxonomy considered at a given moment and T′ its dual taxonomy. The algorithm proceeds by selecting a node t ∈ V(T) for which no hypernym edge (t, t_h) could be found up until that moment (line 3), and then tries to infer such a relation by drawing on the dual taxonomy T′ (lines 5-12).

Algorithm 1: The Bitaxonomy Algorithm

    Input: T_P, T_C
    1:  T := T_C, T′ := T_P
    2:  repeat
    3:    for all t ∈ V(T) s.t. ∄(t, t_h) ∈ E(T) do
    4:      reset count
    5:      for all t′ ∈ π(t) do
    6:        S := super_{T′}(t′)
    7:        for all t′_h ∈ S do
    8:          for all t_h ∈ π(t′_h) do count(t_h)++ end for
    9:        end for
    10:     end for
    11:     t̂_h := argmax_{t_h : t ⇝_T t_h} count(t_h)
    12:     if count(t̂_h) > 0 then E(T) := E(T) ∪ {(t, t̂_h)}
    13:   end for
    14:   swap T and T′
    15: until convergence
    16: return {T, T′}

This is the core of the bitaxonomy algorithm, in which hypernymy knowledge is transferred from one taxonomy to the other. By applying the projection operator π to t, the algorithm considers those nodes t′ aligned to t in the dual taxonomy (line 5) and obtains their hypernyms t′_h using the super_{T′} function (line 6). The nodes reached in T′ act as a clue for discovering the suitable hypernyms for the starting node t ∈ V(T). To perform the discovery, the algorithm projects each such hypernym node t′_h ∈ S and increments the count of each projection t_h ∈ π(t′_h) (line 8). Finally, the node t̂_h ∈ V(T) with maximum count, and such that t ⇝_T t̂_h holds, if one exists, is promoted as hypernym of t and a new hypernym edge (t, t̂_h) is added to E(T) (line 12). The role of the two taxonomies is then swapped and the process is repeated until no more change is possible.
Example: Let us illustrate the algorithm by way of an example. Assume we are in the first iteration (T = T_C) and consider the Wikipedia category t = OLYMPICS (line 3) and its super-categories {MULTI-SPORT EVENTS, SPORT AND POLITICS, INTERNATIONAL SPORTS COMPETITIONS}. This category has 27 pages associated with it (line 5), 23 of which provide a hypernym page in T_P (line 6): e.g., PARALYMPIC GAMES, associated with the category OLYMPICS, is a MULTI-SPORT EVENT and is therefore contained in S. By considering and counting the categories of each page in S (lines 7-8), we end up counting the category MULTI-SPORT EVENTS four times and other categories, such as AWARDS and SWIMSUITS, once. As MULTI-SPORT EVENTS has the highest count and is connected to OLYMPICS by a path in the Wikipedia category network (line 11), the hypernym edge (OLYMPICS, MULTI-SPORT EVENTS) is added to T_C (line 12).
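A compact Python rendering of Algorithm 1 is sketched below, reusing the helpers introduced earlier: `pi` is the projection operator, `super_of[X]` returns the ancestor set of a node in taxonomy X, and `reaches[X]` is the candidate test ⇝ for X; edge sets map each node to its chosen hypernym. All names are ours and the convergence bookkeeping is one possible reading of "until convergence".

    from collections import Counter

    def bitaxonomy(nodes, edges, pi, super_of, reaches):
        """nodes/edges/super_of/reaches are dicts keyed by 'P' and 'C'."""
        current, dual = "C", "P"          # line 1: start from T_C
        stable_rounds = 0
        while stable_rounds < 2:          # stop when neither side changed
            changed = False
            for t in nodes[current]:                      # line 3
                if t in edges[current]:
                    continue                              # hypernym known
                count = Counter()                         # line 4
                for t_dual in pi(t):                      # line 5
                    for t_dual_h in super_of[dual](t_dual):   # lines 6-7
                        count.update(pi(t_dual_h))        # line 8
                # lines 11-12: best candidate for which t ~> t_h holds
                candidates = [th for th in count if reaches[current](t, th)]
                if candidates:
                    edges[current][t] = max(candidates, key=count.__getitem__)
                    changed = True
            stable_rounds = 0 if changed else stable_rounds + 1
            current, dual = dual, current                 # line 14
        return edges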
5 Phase 3: Category taxonomy refinement

As the final phase, we refine and enrich the category taxonomy. The goal of this phase is to provide broader coverage for the category taxonomy T_C created as explained in Section 4. We apply three enrichment heuristics which add hypernyms to those categories c for which no hypernym could be found in phase 2, i.e., ∄c′ s.t. (c, c′) ∈ E(T_C).

Single super-category: As a first structural refinement, we automatically link an uncovered category c to c′ if c′ is the only direct super-category of c in Wikipedia.

Sub-categories: We increase the hypernym coverage by exploiting the sub-categories of each uncovered category c (see Figure 3a; a code sketch of this voting scheme is given after the figure placeholder below). In detail, for each uncovered category c we consider the set sub(c) of all the Wikipedia sub-categories of c (nodes c_1, c_2, ..., c_n in Figure 3a) and then let each category vote, according to its direct hypernym categories in T_C (the vote is as in Algorithm 1). Then we proceed in decreasing order of vote and select the highest-ranking category c′ which is connected to c via a path in T_C, i.e., c ⇝_{T_C} c′. We then pick the direct ancestor c″ of c which lies on the path from c to c′ and add the hypernym edge (c, c″) to E(T_C). For example, consider the category FRENCH TELEVISION PEOPLE; since this category has no associated pages, in phase 2 no hypernym could be found. However, by applying the sub-categories heuristic, we discover that TELEVISION PEOPLE BY COUNTRY is the hypernym most voted by our target category's descendants, such as FRENCH TELEVISION ACTORS and FRENCH TELEVISION DIRECTORS. Since TELEVISION PEOPLE BY COUNTRY is at distance 1 in the Wikipedia category network from FRENCH TELEVISION PEOPLE, we add (FRENCH TELEVISION PEOPLE, TELEVISION PEOPLE BY COUNTRY) to E(T_C).

Super-categories: We then apply a similar heuristic involving super-categories (see Figure 3b). Given an uncovered category c, we consider its direct Wikipedia super-categories and let them vote, according to their hypernym categories in T_C. Then we proceed in decreasing order of vote and select the highest-ranking category c′ which is connected to c in T_C, i.e., c ⇝_{T_C} c′. We then pick the direct ancestor c″ of c which lies on the path from c to c′ and add the edge (c, c″) to E(T_C).

[Figure 3: Heuristic patterns for the coverage refinement of the category taxonomy: (a) the sub-categories heuristic, (b) the super-categories heuristic.]
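The promised sketch of the sub-categories heuristic follows; `subcats`, `tc_hypernym` and `path_to` are assumed helpers (our names), where `path_to(c, c2)` returns a path from c to c2 in the Wikipedia category network restricted by T_C, or None:

    from collections import Counter

    def subcategories_heuristic(c, subcats, tc_hypernym, path_to):
        # Each sub-category votes with its direct hypernym in T_C, if any.
        votes = Counter(tc_hypernym(s) for s in subcats(c)
                        if tc_hypernym(s) is not None)
        # Scan candidates in decreasing order of vote; take the first step
        # of the connecting path as the new direct hypernym c'' of c.
        for candidate, _ in votes.most_common():
            path = path_to(c, candidate)   # e.g. [c, c'', ..., candidate]
            if path and len(path) > 1:
                return path[1]             # edge (c, c'') to add to E(T_C)
        return None

The super-categories heuristic differs only in where the voters come from: the direct Wikipedia super-categories of c instead of sub(c).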
5.1 Bitaxonomy Evaluation

Category taxonomy statistics: We applied phases 2 and 3 to the output of phase 1, which was evaluated in Section 3.3. In Figure 4a we show the increase in category coverage at each iteration throughout the execution of the two phases (1SUP, SUB and SUPER correspond to the three above heuristics of phase 3). The final outcome is a category taxonomy which includes 594,917 hypernymy links between categories, covering more than 96% of the 618,641 categories in the October 2012 English Wikipedia dump. The graph shows the steepest slope in the first iterations of phase 2, which converges around 400k categories at iteration 30, and a significant boost due to phase 3, producing another 175k hypernymy edges, with the super-category heuristic contributing most. 78.90% of the nodes in T_C belong to the same connected component. The average height of the biggest component of T_C is 23.26 edges and the maximum height is 49. We note that the average height of T_C is much greater than that of T_P, which reflects the category taxonomy distinguishing between very subtle classes, such as ALBUMS BY ARTISTS, ALBUMS BY RECORDING LOCATION, etc.

Category taxonomy quality: To estimate the quality of the category taxonomy, we randomly sampled 1,000 categories and, for each of them, we manually associated the super-categories which were deemed to be appropriate hypernyms. Figure 4b shows the performance trend as the algorithm iteratively covers more and more categories. Phase 2 is particularly robust across iterations, as it leads to increased recall while retaining very high precision. As regards phase 3, the super-categories heuristic leads to only a slight precision decrease, while improving recall considerably. Overall, the final taxonomy T_C achieves 85.80% precision, 83.40% recall and 97.20% coverage on our dataset.

[Figure 4: Category taxonomy evaluation.]

Page taxonomy improvement: As a result of phase 2, 141,105 additional hypernymy links were added to the page taxonomy, resulting in an overall 82.99% precision, 77.90% recall and 92.10% coverage, with a non-negligible 3% boost from phase 1 to phase 2 in terms of recall and coverage on our Wikipedia page dataset.

We also calculated some statistics for the resulting taxonomy obtained by aggregating the 3.8M hypernym links in a single directed graph. Overall, 99% of nodes belong to the same connected component, with a maximum height of 29 and an average height on the biggest component of 6.98.
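The connected-component figure above is an ordinary graph statistic; as a sketch, the share of nodes in the largest (weakly) connected component of the aggregated hypernym graph can be computed as follows:

    from collections import defaultdict, deque

    def largest_component_share(edges):
        """edges: iterable of (hyponym, hypernym) pairs."""
        adj, nodes = defaultdict(set), set()
        for u, v in edges:
            adj[u].add(v); adj[v].add(u)   # ignore edge direction
            nodes.update((u, v))
        seen, best = set(), 0
        for start in nodes:
            if start in seen:
                continue
            size, queue = 0, deque([start])
            seen.add(start)
            while queue:                    # BFS over one component
                x = queue.popleft(); size += 1
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y); queue.append(y)
            best = max(best, size)
        return best / len(nodes) if nodes else 0.0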
6 Related Work

Although the extraction of taxonomies from machine-readable dictionaries was already being studied in the early 1970s (Calzolari et al., 1973), pioneering work on large amounts of data only appeared in the 1990s (Hearst, 1992; Ide and Véronis, 1993). Approaches based on hand-crafted patterns and pattern matching techniques have been developed to provide a supertype for the extracted terms (Etzioni et al., 2004; Blohm, 2007; Kozareva and Hovy, 2010; Navigli and Velardi, 2010; Velardi et al., 2013, inter alia). However, these methods do not link terms to existing knowledge resources such as WordNet, whereas those that explicitly link do so by adding new leaves to the existing taxonomy instead of acquiring wide-coverage taxonomies from scratch (Pantel and Ravichandran, 2004; Snow et al., 2006).

The recent upsurge of interest in collaborative knowledge curation has enabled several approaches to large-scale taxonomy acquisition (Hovy et al., 2013). Most approaches initially focused on the Wikipedia category network, an entangled set of generalization-containment relations between Wikipedia categories, to extract the hypernymy taxonomy as a subset of the network. The first approach of this kind was WikiTaxonomy (Ponzetto and Strube, 2007; Ponzetto and Strube, 2011), based on simple, yet effective lightweight heuristics, totaling more than 100k is-a relations. Other approaches, such as YAGO (Suchanek et al., 2008; Hoffart et al., 2013), yield a taxonomical backbone by linking Wikipedia categories to WordNet. However, the categories are linked to the first, i.e., most frequent, sense of the category head in WordNet, involving only leaf categories in the linking.

Interest in taxonomizing Wikipedia pages, instead, developed with DBpedia (Auer et al., 2007), which pioneered the current stream of work aimed at extracting semi-structured information from Wikipedia templates and infoboxes. In DBpedia, entities are mapped to a coarse-grained ontology which is collaboratively maintained and contains only about 270 classes corresponding to popular named entity types, in contrast to our goal of structuring the full set of Wikipedia articles in a larger and finer-grained taxonomy.

A few notable efforts to reconcile the two sides of Wikipedia, i.e., pages and categories, have been put forward very recently. WikiNet (Nastase et al., 2010; Nastase and Strube, 2013) is a project which heuristically exploits different aspects of Wikipedia to obtain a multilingual concept network by deriving not only is-a relations, but also other types of relations. A second project, MENTA (de Melo and Weikum, 2010), creates one of the largest multilingual lexical knowledge bases by interconnecting more than 13M articles in 271 languages. In contrast to our work, its hypernym extraction is supervised, in that decisions are made on the basis of labelled training examples, and it requires a reconciliation step owing to the heterogeneous nature of the hypernyms, something that we only do for categories, due to their noisy network. While WikiNet and MENTA bring together the knowledge available both at the page and category level, like we do, they either achieve low precision and coverage of the taxonomical structure or exhibit overly general hypernyms, as we show in our experiments in the next section.

Our work differs from the others in at least three respects: first, in marked contrast to most other resources, but similarly to WikiNet and WikiTaxonomy, our resource is self-contained and does not depend on other resources such as WordNet; second, we address the taxonomization task on both sides, i.e., pages and categories, by providing an algorithm which mutually and iteratively transfers knowledge from one side of the bitaxonomy to the other; third, we provide a wide-coverage bitaxonomy closer in structure and granularity to a manual WordNet-like taxonomy, in contrast, for example, to DBpedia's flat entity-focused hierarchy.²

² Note that all the competitors on categories have an average height between 1 and 3.69 on their biggest component, while ours is 23.26; on pages their height is between 1.9 and 4.22, while ours is 6.98. Since WordNet's average height is 8.07, we deem WiBi to be the resource structurally closest to WordNet.
7 Comparative Evaluation

7.1 Experimental Setup

We compared our resource (WiBi) against the Wikipedia taxonomies of the major knowledge resources in the literature providing hypernym links, namely DBpedia, WikiNet, MENTA, WikiTaxonomy and YAGO (see Section 6). As datasets, we used our gold standards of 1,000 randomly-sampled pages (see Section 3.3) and categories (see Section 5.1). In order to ensure a level playing field, we detected those pages (categories) which do not exist in any of the above resources and removed them, to ensure full coverage of the dataset across all resources. For each resource we calculated precision by manually marking each hypernym returned for each page (category) as correct or not. As regards recall, we note that in two cases (i.e., DBpedia returning page supertypes from its upper taxonomy, and YAGO linking categories to WordNet synsets) the generalizations are neither pages nor categories, and that MENTA returns heterogeneous hypernyms as mixed sets of WordNet synsets, Wikipedia pages and categories. Given this heterogeneity, standard recall across resources could not be calculated. For this reason we calculated recall as described in Section 3.3.

    Dataset      System      Prec.    Rec.     Cov.
    Pages        WiBi        84.11    79.40    92.57
                 WikiNet     57.29††  71.45††  82.01
                 DBpedia     87.06    51.50††  55.93
                 MENTA       81.52    72.49†   88.92
    Categories   WiBi        85.18    82.88    97.31
                 WikiTax     88.50    54.83††  59.43
                 YAGO        94.13    53.41††  56.74
                 MENTA       87.11    84.63    97.15
                 MENTA−ENT   85.18    71.95††  84.47

Table 2: Page and category taxonomy evaluation. † (††) denotes a statistically significant difference between WiBi and the daggered resource, using the χ² test, p < 0.02 (p < 0.01).

7.2 Results

Wikipedia pages: We first report the results of the knowledge resources which provide page hypernyms, i.e., we compare against WikiNet, DBpedia and MENTA. We use the original outputs from the three resources: the first two are based on dumps which are from the same year as the one used in WiBi (cf. Section 3.3), while MENTA is based on a dump dating back to 2010 (consisting of 3.25M pages and 565k categories). We decided to include the latter for comparison purposes, as it uses knowledge from 271 Wikipedias to build the final taxonomy. However, we recognize its performance might be relatively higher on a 2012 dump.

We show the results on our page hypernym dataset in Table 2 (top). As can be seen, WikiNet obtains the lowest precision, due to the high number of hypernyms provided, many of which are incorrect, with a recall between that of DBpedia and MENTA. WiBi outperforms all other resources with 84.11% precision, 79.40% recall and 92.57% coverage. MENTA seems to be the closest resource to ours; however, we remark that the hypernyms output by MENTA are very heterogeneous: 48% of answers are represented by a WordNet synset, 37% by Wikipedia categories and 15% are Wikipedia pages. In contrast to all other resources, WiBi outputs page hypernyms only.

Wikipedia categories: We then compared all the knowledge resources which deal with categories, i.e., WikiTaxonomy, YAGO and MENTA. For the latter two, the above considerations about the 2012 dump hold, whereas we reimplemented WikiTaxonomy, which was based on a 2009 dump, to run it on the same dump as WiBi. We excluded WikiNet from our comparison because it turned out to have low coverage of categories (i.e., less than 1%).

We show the results on our category dataset in Table 2 (bottom). Despite other systems exhibiting higher precision, WiBi generally achieves higher recall, thanks also to its higher category coverage. YAGO obtains the lowest recall and coverage, because only leaf categories are considered. MENTA is the closest system to ours, obtaining slightly higher precision and recall. Notably, however, MENTA outputs the first WordNet sense of entity for 13.17% of all the given answers, which, despite being correct and accounted for in precision and recall, is uninformative. Since a system which always outputs entity would maximise all three measures, we also calculated the performance for MENTA when discarding entity as an answer; as Table 2 shows (bottom, MENTA−ENT), recall drops to 71.95%. Further analysis, presented below, shows that the specificity of its hypernyms is considerably lower than that of WiBi.
7.3 Analysis of the results

To get further insight into our results we performed two additional analyses of the data. First, we estimated the level of specialization of the hypernyms in the different resources on our two datasets. The idea is that a hypernym should be as specific as possible: answers were scored 0 if correct, and more specific answers were assigned higher scores. When comparing two systems, we select the respective most specific answers a_1, a_2 and say the first system is more specific than the latter whenever score(a_1) > score(a_2). Table 3 shows the results for all the resources and for both the page and category taxonomies: WiBi consistently provides considerably more specific hypernyms than any other resource (middle column).

[Table 3: Specificity comparison of WiBi against each resource X (columns WiBi=X, WiBi>X, WiBi<X); the table body is not recoverable from this transcription.]

A second important aspect that we analyzed was the granularity of each taxonomy, determined by drawing each resource on a bidimensional plane with the number of distinct hypernyms on the x axis and the total number of hypernyms (i.e., edges) in the taxonomy on the y axis. Figures 5a and 5b show the position of each resource for the page and the category taxonomies, respectively. As can be seen, WiBi, as well as the page taxonomy of MENTA, is the resource with the best granularity, as not only does it attain high coverage, but it also provides a larger variety of classes as generalizations of pages and categories. Specifically, WiBi provides over 3M hypernym pages chosen from a range of 94k distinct hypernyms, while others exhibit a considerably smaller range of distinct hypernyms (e.g., DBpedia by design, but also WikiNet, with around 11k distinct page hypernyms). The large variety of classes provided by MENTA, however, is due to its including more than 100k Wikipedia categories (among which, categories about deaths and births alone represent about 2% of the distinct hypernyms). As regards categories, while the number of distinct hypernyms of WiBi and WikiTaxonomy is approximately the same (around 130k), the total number of hypernyms (around 580k for both taxonomies) is distributed over only half of the categories in WikiTaxonomy.

8 Conclusions

In this paper we have presented WiBi, an automatic 3-phase approach to the construction of a bitaxonomy for the English Wikipedia, i.e., a full-fledged, integrated page and category taxonomy: first, using a set of high-precision linkers, the page taxonomy is populated; next, a fixed-point algorithm populates the category taxonomy while enriching the page taxonomy iteratively; finally, the category taxonomy undergoes structural refinements. The coverage, quality and granularity of the bitaxonomy are considerably higher than those of the taxonomies of state-of-the-art resources like DBpedia, YAGO, MENTA, WikiNet and WikiTaxonomy.

Our contributions are three-fold: i) we propose a unified, effective approach to the construction of a Wikipedia bitaxonomy, a richer structure than those produced in the literature; ii) our method for building the bitaxonomy is self-contained, thanks to its independence from external resources (like WordNet) and the virtual absence of supervision, making WiBi replicable on any new version of Wikipedia; iii) the taxonomy provides nearly full coverage of pages and categories, encompassing the entire encyclopedic knowledge in Wikipedia.

We will apply our video games with a purpose (Vannella et al., 2014) to validate WiBi. We also plan to integrate WiBi into BabelNet (Navigli and Ponzetto, 2012), so as to fully taxonomize it, and exploit its high quality for improving semantic predicates (Flati and Navigli, 2013).

Acknowledgments

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234. We thank Luca Telesca for his implementation of WikiTaxonomy and Jim McManus for his comments on the manuscript.
References

Robert A. Amsler. 1981. A Taxonomy for English Nouns and Verbs. In Proceedings of the Association for Computational Linguistics (ACL '81), pages 133-138, Stanford, California, USA.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference joint with the 2nd Asian Semantic Web Conference (ISWC+ASWC 2007), pages 722-735, Busan, Korea.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - a crystallization point for the Web of Data. Web Semantics, 7(3):154-165.

Sebastian Blohm. 2007. Using the web to reduce data sparseness in pattern-based information extraction. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 18-29, Warsaw, Poland. Springer.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the International Conference on Management of Data (SIGMOD '08), pages 1247-1250, New York, NY, USA.

Nicoletta Calzolari, Laura Pecchia, and Antonio Zampolli. 1973. Working on the Italian Machine Dictionary: a Semantic Approach. In Proceedings of the 5th Conference on Computational Linguistics (COLING '73), pages 49-70, Pisa, Italy.

Nicoletta Calzolari. 1982. Towards the organization of lexical definitions on a database structure. In Proceedings of the 9th Conference on Computational Linguistics (COLING '82), pages 61-64, Prague, Czechoslovakia.

Gerard de Melo and Gerhard Weikum. 2010. MENTA: Inducing Multilingual Taxonomies from Wikipedia. In Proceedings of the Conference on Information and Knowledge Management (CIKM '10), pages 1099-1108, New York, NY, USA.

Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in KnowItAll (preliminary results). In Proceedings of the 13th International Conference on World Wide Web (WWW '04), pages 100-110, New York, NY, USA. ACM.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

David A. Ferrucci. 2012. Introduction to "This is Watson". IBM Journal of Research and Development, 56(3):1.

Tiziano Flati and Roberto Navigli. 2013. SPred: Large-scale Harvesting of Semantic Predicates. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1222-1232, Sofia, Bulgaria.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the International Conference on Computational Linguistics (COLING '92), pages 539-545, Nantes, France.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28-61.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194:2-27.

Nancy Ide and Jean Véronis. 1993. Extracting knowledge bases from machine-readable dictionaries: Have we wasted our time? In Proceedings of the Workshop on Knowledge Bases and Knowledge Structures, pages 257-266, Tokyo, Japan.

Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3-10, Vancouver, British Columbia, Canada.

Zornitsa Kozareva and Eduard H. Hovy. 2010. A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '10), pages 1110-1118, Seattle, WA, USA.

Olena Medelyan, David Milne, Catherine Legg, and Ian H. Witten. 2009. Mining meaning from Wikipedia. International Journal of Human-Computer Studies, 67(9):716-754.

Shachar Mirkin, Ido Dagan, and Eyal Shnarch. 2009. Evaluating the inferential utility of lexical-semantic resources. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 558-566, Athens, Greece.

Tom Mitchell. 2005. Reading the Web: A Breakthrough Goal for AI. AI Magazine.

Vivi Nastase and Michael Strube. 2013. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 194:62-85.

Vivi Nastase, Michael Strube, Benjamin Boerschinger, Caecilia Zirn, and Anas Elghafari. 2010. WikiNet: A Very Large Scale Multi-Lingual Concept Network. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC '10), Valletta, Malta.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217-250.

Roberto Navigli and Paola Velardi. 2010. Learning Word-Class Lattices for Definition and Hypernym Extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 1318-1327, Uppsala, Sweden.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004), pages 321-328, Boston, Massachusetts, USA.

Simone Paolo Ponzetto and Michael Strube. 2007. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd Conference on the Advancement of Artificial Intelligence (AAAI '07), pages 1440-1445, Vancouver, B.C., Canada.

Simone Paolo Ponzetto and Michael Strube. 2011. Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence, 175(9-10):1737-1756.

Hoifung Poon, Janara Christensen, Pedro Domingos, Oren Etzioni, Raphael Hoffmann, Chloe Kiddon, Thomas Lin, Xiao Ling, Mausam, Alan Ritter, Stefan Schoenmackers, Stephen Soderland, Dan Weld, Fei Wu, and Congle Zhang. 2010. Machine Reading at the University of Washington. In Proceedings of the 1st International Workshop on Formalisms and Methodology for Learning by Reading, in conjunction with NAACL-HLT 2010, pages 87-95, Los Angeles, California, USA.

Maria Ruiz-Casado, Enrique Alfonseca, and Pablo Castells. 2005. Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. In Advances in Web Intelligence, volume 3528 of Lecture Notes in Computer Science, pages 380-386. Springer.

Amit Singhal. 2012. Introducing the Knowledge Graph: Things, Not Strings. Official Blog (of Google). Retrieved May 18, 2012.

Rion Snow, Dan Jurafsky, and Andrew Ng. 2006. Semantic taxonomy induction from heterogeneous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006), pages 801-808.

Fabian Suchanek and Gerhard Weikum. 2013. Knowledge harvesting from text and Web sources. In IEEE 29th International Conference on Data Engineering (ICDE 2013), pages 1250-1253, Brisbane, Australia. IEEE Computer Society.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2008. YAGO: A large ontology from Wikipedia and WordNet. Journal of Web Semantics, 6(3):203-217.

Daniele Vannella, David Jurgens, Daniele Scarfini, Domenico Toscani, and Roberto Navigli. 2014. Validating and Extending Semantic Knowledge Bases using Video Games with a Purpose. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3):665-707.