2008 InformationExtractionFromWikipedia
- (Wu et al., 2008) ⇒ Fei Wu, Raphael Hoffmann, and Daniel S. Weld. (2008). “Information Extraction from Wikipedia: Moving Down the Long Tail.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008) [doi>10.1145/1401890.1401978].
Subject Headings:
Notes
- Goal / vision
- Every Wikipedia article describes one entity from a class that is defined by the type of its infobox (if present). The goal is to convert Wikipedia into a structured format without manually labeled training data (the abstract calls this self-supervised extraction).
- Challenges
- For many entity types, there are not enough articles / entities. Many articles are too short and do not contain enough information to extract from.
- Problem definition
- Given a Wikipedia page / article, identify its infobox / entity class and extract as many attribute values (of that infobox) as possible.
- Approach
- A document classifier is trained to identify the infobox / entity class.
- A sentence classifier is trained to predict which attribute values are contained in a particular sentence of an article belonging to a given infobox class.
- Finally, an attribute extractor is learned to extract the actual attribute values from the sentences predicted to contain these values.
- Use of infoboxes
- In the training phase, infobox types are used to define training data for the document classifier and infobox attribute values are used to define training data for the sentence classifier.
- In the test phase, infoboxes are ignored, i.e. document classification and attribute extraction use only the article text as input.
- Shrinkage Method
- Shrinkage is a general statistical method for improving estimators when training data is limited.
- In this paper, they apply shrinkage as follows.
- They search upwards and downwards in the ontology of infoboxes to aggregate training data from related classes.
- Extracting from the web
- Many attribute values do not appear in the text of the article.
- In order to improve the recall of attribute extraction, they apply the extractors trained from Wikipedia to other web pages.
- Challenge: maintaining the precision of the extractors on lower-quality, non-Wikipedia pages.
- Discussion
- From checking a random sample of Wikipedia pages, I have the impression that for many, if not most, of the attributes the values do not appear in the text (but only in the infobox). There seems to be a great need to consider non-Wikipedia pages.
- On the other hand, for some attributes multiple values appear in the corresponding article, e.g. the headquarters of a company that has moved over time. It seems hard to automatically pick the relevant one from these candidate values.
- While Wikipedia data has little commercial value, it has the advantage of providing a lot of ground truth, e.g. lists of entities of a particular type and infoboxes with correct attribute values. It also contains some ontology over the collection of infoboxes. Can / should we use Wikipedia to get training and test cases for our second application domain, or at least for defining ground truth in our second application domain?
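As a rough illustration of two steps from the notes above (using infobox values to label training sentences, and searching up and down the infobox ontology to pool training data), here is a minimal Python sketch. It is not the authors' implementation. The function names, the substring-matching heuristic, the hop limit, the distance-based `decay` weight, and the toy class hierarchy are all assumptions made for this example.

```python
from collections import deque


def label_sentences(sentences, infobox):
    """Assumed heuristic: a sentence is a positive example for an attribute
    if it contains that attribute's infobox value (case-insensitive)."""
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        attrs = [a for a, v in infobox.items() if v and v.lower() in lowered]
        labeled.append((sentence, attrs))
    return labeled


def related_classes(ontology, start, max_hops):
    """Breadth-first search upwards (parent) and downwards (children).

    `ontology` maps each class name to its parent class (or None for a root).
    Returns a dict mapping each reachable class to its hop distance from start.
    """
    children = {}
    for child, parent in ontology.items():
        if parent is not None:
            children.setdefault(parent, []).append(child)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        cls = queue.popleft()
        if hops[cls] == max_hops:
            continue  # do not expand beyond the hop limit
        neighbours = list(children.get(cls, []))
        parent = ontology.get(cls)
        if parent is not None:
            neighbours.append(parent)
        for n in neighbours:
            if n not in hops:
                hops[n] = hops[cls] + 1
                queue.append(n)
    return hops


def shrinkage_training_set(ontology, examples, target, max_hops=1, decay=0.5):
    """Pool examples from related classes, down-weighting distant ones.

    The weight decay ** distance is an illustrative choice, not taken
    from the paper.
    """
    weighted = []
    for cls, dist in related_classes(ontology, target, max_hops).items():
        for example in examples.get(cls, []):
            weighted.append((example, decay ** dist))
    return weighted


# Hypothetical toy hierarchy and data, for illustration only.
ontology = {"person": None, "performer": "person",
            "actor": "performer", "comedian": "performer"}
examples = {"performer": ["p1"], "actor": ["a1", "a2"], "person": ["x1"]}
pooled = shrinkage_training_set(ontology, examples, "performer")
```

With these toy inputs, a sparse class such as `performer` borrows examples from its parent and children at a reduced weight, so a downstream extractor could be trained on more data than the class provides on its own.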
Cited By
Quotes
Abstract
- Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.