2005 AProbabilisticModelofRedundancy

(Downey et al., 2005) ⇒ Doug Downey, Oren Etzioni, and Stephen Soderland. (2005). “A Probabilistic Model of Redundancy in Information Extraction.” In: Proceedings of the 19th International Joint Conference on Artificial Intelligence.

Subject Headings: Text Segment Confidence Score, KnowItAll System.

Notes

Received a distinguished paper award.

Cited By

Quotes

Abstract

Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without using hand-tagged training examples. A fundamental problem for both UIE and supervised IE is assessing the probability that extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness?

This paper introduces a combinatorial "balls-and-urns" model that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating the model's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by Pointwise Mutual Information (PMI) and the noisy-or model used in previous work. For supervised IE, the model's performance is comparable to that of Support Vector Machines and Logistic Regression.
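The balls-and-urns idea can be made concrete with a small sketch. The code below is not taken from the quoted text. It assumes a single urn, draws with replacement, and known multisets of repetition counts for target and error labels (named `num_c` and `num_e` here, with made-up values in the usage note). Under those assumptions, the probability that a label seen `k` times in `n` draws is a target label is a ratio of binomial-style weights.

```python
def urns_probability(k, n, num_c, num_e):
    """Probability that a label observed k times in n draws is a target label.

    Assumptions (illustrative, not from the quoted text):
      - one urn, draws made with replacement;
      - num_c / num_e list how many balls each target / error label has.
    """
    s = sum(num_c) + sum(num_e)  # total number of balls in the urn

    def weight(r):
        # Likelihood (up to the binomial coefficient, which cancels in the
        # ratio below) that a label with r balls is drawn exactly k times.
        p = r / s
        return p ** k * (1 - p) ** (n - k)

    target = sum(weight(r) for r in num_c)
    error = sum(weight(r) for r in num_e)
    # For very large n the weights can underflow to 0.0; a log-space
    # version would be needed in that regime.
    return target / (target + error)
```

For example, with three target labels of 10 balls each and twenty error labels of 1 ball each, `urns_probability(3, 10, [10, 10, 10], [1] * 20)` is much higher than `urns_probability(1, 10, [10, 10, 10], [1] * 20)`: more repetitions in the same sample make an extraction more credible. Holding `k` fixed and increasing `n` lowers the estimate, which is how sample size enters the model.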

5 Related Work

In contrast to the bulk of previous IE work, our focus is on unsupervised IE (UIE) where URNS substantially outperforms previous methods (Figure 2).

In addition to the noisy-or models we compare against in our experiments, the IE literature contains a variety of heuristics using repetition as an indication of the veracity of extracted information. For example, Riloff and Jones [Riloff and Jones, 1999] rank extractions by the number of distinct patterns generating them, plus a factor for the reliability of the patterns. Our work is intended to formalize these heuristic techniques, and unlike the noisy-or models, we explicitly model the distribution of the target and error sets (our num(C) and num(E)), which is shown to be important for good performance in Section 4.1. The accuracy of the probability estimates produced by the heuristic and noisy-or methods is rarely evaluated explicitly in the IE literature, although most systems make implicit use of such estimates. For example, bootstrap-learning systems start with a set of seed instances of a given relation, which are used to identify extraction patterns for the relation; these patterns are in turn used to extract further instances (e.g., [Riloff and Jones, 1999; Lin et al., 2003; Agichtein and Gravano, 2000]). As this process iterates, random extraction errors result in overly general extraction patterns, leading the system to extract further erroneous instances. The more accurate estimates of extraction probabilities produced by URNS would help prevent this “concept drift.”
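For comparison, the noisy-or baseline mentioned above can be sketched in a few lines. This is the standard noisy-or combination, written under the assumption that each extraction rule has a fixed precision and that every occurrence is an independent piece of evidence; the function and parameter names are illustrative only.

```python
from math import prod


def noisy_or(precisions, counts):
    """Noisy-or estimate that an extraction is correct.

    precisions[m] is the assumed precision of extraction rule m, and
    counts[m] is how many times rule m produced the extraction.
    The extraction is treated as wrong only if every occurrence is wrong.
    """
    return 1.0 - prod((1.0 - p) ** k for p, k in zip(precisions, counts))
```

The function takes no sample size and no information about how target and error labels are distributed, so the same repetition count yields the same estimate whether it came from a small or a very large corpus. It also approaches 1 quickly as counts grow, for instance `noisy_or([0.9], [3])` is already about 0.999.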

Skounakis and Craven (Skounakis and Craven, 2003) develop a probabilistic model for combining evidence from multiple extractions in a supervised setting. Their problem formulation differs from ours, as they classify each occurrence of an extraction, and then use a binomial model along with the false positive and true positive rates of the classifier to obtain the probability that at least one occurrence is a true positive. Similar to the above approaches, they do not explicitly account for sample size n, nor do they model the distribution of target and error extractions.

Culotta and McCallum (Culotta and McCallum, 2004) provide a model for assessing the confidence of extracted information using conditional random fields (CRFs). Their work focuses on assigning accurate confidence values to individual occurrences of an extracted field based on textual features. This is complementary to our focus on combining confidence estimates from multiple occurrences of the same extraction. In fact, each possible feature vector processed by the CRF in (Culotta and McCallum, 2004) can be thought of as a virtual urn [math]\displaystyle{ m }[/math] in our URNS. The confidence output of Culotta and McCallum’s model could then be used to provide the precision [math]\displaystyle{ p_m }[/math] for the urn.

Our work is similar in spirit to BLOG, a language for specifying probability distributions over sets with unknown objects [Milch et al., 2004]. As in our work, BLOG models treat observations as draws from a set of balls in an urn. Whereas BLOG is intended to be a general modeling framework for probabilistic first-order logic, our work is directed at modeling redundancy in IE. In contrast to [Milch et al., 2004], we provide supervised and unsupervised learning methods for our model and experiments demonstrating their efficacy in practice.

References


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2005 AProbabilisticModelofRedundancy	Doug Downey Stephen Soderland Oren Etzioni			A Probabilistic Model of Redundancy in Information Extraction						2005