2007 DerivingALargeTaxFromWikipedia

Subject Headings: Ontology Structure Learning, Wikipedia Category Network.


We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. As a result we are able to derive a large scale taxonomy containing a large number of subsumption (i.e. isa) relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, and by computing semantic similarity between words in benchmarking datasets.

1. Introduction

The availability of large coverage, machine readable knowledge is a crucial theme for Artificial Intelligence. While advances towards robust statistical inference methods (cf. e.g. Domingos et al. (2006) and Punyakanok et al. (2006)) will certainly improve the computational modeling of intelligence, we believe that crucial advances will also come from rediscovering the deployment of large knowledge bases.

Knowledge bases, however, are expensive to create and time-consuming to maintain. In addition, most of the existing knowledge bases are domain dependent or have a limited and arbitrary coverage – Cyc (Lenat & Guha, 1990) and WordNet (Fellbaum, 1998) being notable exceptions. The field of ontology learning deals with these problems by taking textual input and transforming it into a taxonomy or a proper ontology. However, the learned ontologies are small and mostly domain dependent, and evaluations have revealed a rather poor performance (see Buitelaar et al. (2005) for an extensive overview).

We try to overcome such problems by relying on a wide coverage online encyclopedia developed by a large number of users, namely Wikipedia. We use semi-structured input by taking the category system in Wikipedia as a conceptual network. This provides us with pairs of related concepts whose semantic relation is unspecified. The task of creating a subsumption hierarchy then boils down to distinguishing between isa and notisa relations. We use methods based on connectivity in the network and lexico-syntactic patterns to label the relations between categories. As a result we are able to derive a large scale taxonomy.
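To make the labeling task concrete, here is a minimal sketch of pattern-based labeling of a category pair. It is an illustration under stated assumptions, not the authors' implementation: the pattern lists, the majority-vote rule, the toy corpus, and the names `ISA_PATTERNS`, `NOTISA_PATTERNS`, `count_matches`, and `label_relation` are all hypothetical.

```python
import re

# Hypothetical lexico-syntactic templates (assumptions for illustration).
# {sub} is the candidate subcategory, {sup} the candidate supercategory.
ISA_PATTERNS = [
    r"\b{sup} such as {sub}\b",
    r"\b{sub} and other {sup}\b",
    r"\b{sub} (?:is|are) (?:a |an )?{sup}\b",
]
NOTISA_PATTERNS = [
    r"\b{sub} in {sup}\b",
    r"\b{sub} of {sup}\b",
]


def count_matches(patterns, sub, sup, corpus):
    """Count how often any template, instantiated for (sub, sup), occurs in the corpus."""
    total = 0
    for template in patterns:
        regex = re.compile(
            template.format(sub=re.escape(sub), sup=re.escape(sup)),
            re.IGNORECASE,
        )
        total += sum(len(regex.findall(sentence)) for sentence in corpus)
    return total


def label_relation(sub, sup, corpus):
    """Label a category pair 'isa' only if isa evidence outweighs notisa evidence."""
    isa_hits = count_matches(ISA_PATTERNS, sub, sup, corpus)
    notisa_hits = count_matches(NOTISA_PATTERNS, sub, sup, corpus)
    return "isa" if isa_hits > notisa_hits else "notisa"


# Toy corpus standing in for the very large corpora a real system would query.
corpus = [
    "Fruits such as apples are rich in fibre.",
    "Apples and other fruits are sold here.",
    "Apples in Germany are harvested in autumn.",
]

print(label_relation("apples", "fruits", corpus))
print(label_relation("apples", "Germany", corpus))
```

In this sketch a pair with no evidence either way defaults to notisa, which favors precision; a real system would combine such pattern counts with the connectivity-based evidence mentioned above.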


We described the automatic creation of a large scale domain independent taxonomy. We took Wikipedia’s categories as concepts in a semantic network and labeled the relations between these concepts as isa and notisa relations by using methods based on the connectivity of the network and on applying lexico-syntactic patterns to very large corpora. Both connectivity-based methods and lexico-syntactic patterns ensure a high recall while decreasing the precision. We compared the created taxonomy with ResearchCyc and, via semantic similarity measures, with WordNet. Our Wikipedia-based taxonomy proved to be competitive with the two arguably largest and best developed existing ontologies. We attribute these results to taking already structured and well-maintained knowledge as input.
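As an illustration of how a taxonomy can be used to score semantic similarity between two concepts, the sketch below computes a simple path-based measure over isa links. This is one common family of measures and is an assumption here; the text above does not state which measures were used in the evaluation. The function name `path_similarity` and the toy edges are hypothetical.

```python
from collections import deque


def path_similarity(isa_edges, a, b):
    """Return 1 / (1 + shortest path length) between a and b, or 0.0 if unconnected.

    isa_edges is an iterable of (child, parent) pairs; links are treated as
    undirected when searching for the shortest path.
    """
    graph = {}
    for child, parent in isa_edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    if a not in graph or b not in graph:
        return 0.0

    # Breadth-first search yields the shortest path in an unweighted graph.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1 + dist)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return 0.0


# Toy taxonomy (hypothetical isa links, not taken from the derived resource).
edges = [
    ("apple", "fruit"),
    ("pear", "fruit"),
    ("fruit", "food"),
    ("car", "vehicle"),
]

print(path_similarity(edges, "apple", "pear"))
print(path_similarity(edges, "apple", "car"))
```

Scores from such a measure can then be correlated with human similarity judgments in a benchmarking dataset, which is the general shape of the evaluation described above.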

Our work on deriving a taxonomy is the first step in creating a fully-fledged ontology based on Wikipedia. This will require labeling the generic notisa relations with particular ones such as has-part, has-attribute, etc.


Authors: Simone P. Ponzetto, Michael Strube.
Title: Deriving a Large Scale Taxonomy from Wikipedia.
URL: http://www.eml-research.de/nlp/papers/ponzetto07b.pdf