Self-Supervised Information Extraction Task
A Self-Supervised Information Extraction Task is a Supervised Information Extraction Task that is also a Self-Supervised Learning Task (one that requires a Labeling Pattern).
References
2009
- (Banko, 2009) ⇒ Michele Banko. (2009). “Open Information Extraction for the Web.” PhD Thesis, University of Washington.
- KnowItAll [33] is a state-of-the-art Web extraction system that addresses the automation challenge by learning to label its own training examples, and tackles issues pertaining to corpus heterogeneity by not relying on deep linguistic analysis or entity recognizers. Given a relation, KnowItAll used a set of domain-independent patterns to automatically instantiate relation-specific extraction rules. For example, KnowItAll utilized generic extraction patterns like “<X> is a <Y>” to find a list of candidate members X of the class Y. When this pattern is used for the class Country, for instance, it would match the sentence “Spain is a southwestern European country located on the Iberian Peninsula,” and output Country(Spain).
- KnowItAll’s extraction patterns were applied to Web pages identified via search-engine queries. The resulting extractions were assigned a probability using information-theoretic measures derived from search engine hit counts, providing a method of identifying which instantiations were most likely to be bona fide members of the class. For example, in order to estimate the likelihood that “China” is the name of a country, KnowItAll used automatically generated phrases associated with the class to see if there is a high correlation between the number of documents containing the word “China” and those containing the phrase “countries such as.” Thus KnowItAll was able to confidently label China, France, and India as members of the class Country while correctly knowing that “Garth Brooks is a country singer” does not provide sufficient evidence that “Garth Brooks” is the name of a country [30]. Finally, KnowItAll used a pattern-learning algorithm to acquire relation-specific extraction patterns (e.g. “capital of <country>”) that led it to extract additional countries. Inspired by KnowItAll, the URES Web IE system [71] also successfully utilized high-quality output from baseline KnowItAll to automatically supervise the learning of relation-specific extraction patterns.
- KnowItAll and URES are self-supervised; instead of utilizing hand-tagged training data, each system selects and labels its own training examples and iteratively bootstraps its learning process. Self-supervised systems are a species of unsupervised systems because they require no hand-tagged training examples. However, unlike classical unsupervised systems, self-supervised systems do utilize labeled examples. Instead of relying on hand-tagged data, self-supervised systems autonomously “roll their own” labeled examples.
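The self-labeling loop described in the excerpt above can be illustrated with a minimal Python sketch. It is not KnowItAll's actual implementation: the regular expression, the function names (`instantiate_pattern`, `extract_candidates`, `pmi_score`, `self_label`), the threshold, and all hit counts are hypothetical stand-ins. The score is a simple PMI-style ratio of the hits for the discriminator phrase plus the candidate to the hits for the candidate alone, and a hard-coded dictionary replaces real search-engine queries.

```python
import re

def instantiate_pattern(class_name):
    """Instantiate the generic pattern "<X> is a <Y>" for one class Y (illustrative regex)."""
    # X is a capitalized phrase; up to three modifier words may precede the class name.
    return re.compile(
        r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*) is an? (?:[\w-]+ ){0,3}?"
        + re.escape(class_name.lower()) + r"\b"
    )

def extract_candidates(sentences, class_name):
    """Return candidate members of the class, in order of first appearance."""
    pattern = instantiate_pattern(class_name)
    found = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            if match.group(1) not in found:
                found.append(match.group(1))
    return found

def pmi_score(candidate, discriminator, hits):
    """PMI-style ratio: hits(discriminator + candidate) / hits(candidate)."""
    total = hits.get(candidate, 0)
    if total == 0:
        return 0.0
    return hits.get(f"{discriminator} {candidate}", 0) / total

def self_label(candidates, discriminator, hits, threshold):
    """Label each candidate without hand-tagged data; the output can serve as training examples."""
    return {c: pmi_score(c, discriminator, hits) >= threshold for c in candidates}

SENTENCES = [
    "Spain is a southwestern European country located on the Iberian Peninsula.",
    "Garth Brooks is a country singer.",
    "China is a country in East Asia.",
]

# Hypothetical hit counts standing in for search-engine queries (invented numbers).
HITS = {
    "Spain": 1_000_000, "countries such as Spain": 5_000,
    "China": 2_000_000, "countries such as China": 12_000,
    "Garth Brooks": 300_000, "countries such as Garth Brooks": 0,
}

candidates = extract_candidates(SENTENCES, "Country")
labels = self_label(candidates, "countries such as", HITS, threshold=0.001)
print(candidates)
print(labels)
```

In this sketch the generic pattern alone accepts “Garth Brooks” as a candidate country; only the hit-count check rejects it. The resulting labels play the role of the self-generated training examples that a system of this kind would feed into pattern learning.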