KnowItAll Algorithm
A KnowItAll Algorithm is a Semi-Supervised Named Entity Recognition Algorithm.
- Context:
- It is intended to be scalable and to have high throughput, so that it can draw on the vast amount of information on the Web.
- It is divided into four components: Extractor, Search Engine Interface, Assessor, and Database.
- Its Extractor avoids the use of deep parsing techniques.
- It requires a set of generic extraction patterns, instead of a set of 'seed' instances.
- The ontology (set of predefined topics) of its first version included cities, states, countries, actors, and films.
- See: KnowItAll System.
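The role of generic extraction patterns in the Extractor can be illustrated with a minimal Python sketch. It assumes the single pattern form "<Class> such as <Mem>" quoted in the references below, and it substitutes a crude capitalized-word regular expression for whatever shallow noun-phrase handling the real system uses. The names `GENERIC_PATTERNS`, `PROPER_NOUN`, `instantiate`, and `extract` are hypothetical and do not come from KnowItAll itself.

```python
import re

# Hypothetical generic pattern templates; "{cls}" is the class-name slot.
# Only the "such as" form is taken from the quoted description.
GENERIC_PATTERNS = ["{cls} such as "]

# Assumption: a candidate member is a run of capitalized words. This is a
# crude stand-in for shallow noun-phrase detection (no deep parsing).
PROPER_NOUN = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def instantiate(class_plural):
    """Turn each generic pattern into a class-specific extraction rule."""
    rules = []
    for template in GENERIC_PATTERNS:
        prefix = re.escape(template.format(cls=class_plural))
        rules.append(re.compile(prefix + "(" + PROPER_NOUN + ")"))
    return rules


def extract(text, rules):
    """Apply the rules to text and return candidate members in order found."""
    found = []
    for rule in rules:
        for match in rule.finditer(text):
            candidate = match.group(1)
            if candidate not in found:
                found.append(candidate)
    return found


if __name__ == "__main__":
    city_rules = instantiate("cities")
    page = "They toured cities such as Tel Aviv and cities such as Seattle."
    print(extract(page, city_rules))
```

The candidates returned here are unverified; in the system described above, judging them is the job of the Assessor component.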
References
2006
- (Etzioni et al., 2006) ⇒ Oren Etzioni, Michele Banko, and Michael J. Cafarella. (2006). “Machine Reading.” In: Proceedings of AAAI-2006.
- The KnowItAll Web IE system (Etzioni et al., 2005) took the next step in automation by learning to label its own training examples using only a small set of domain independent extraction patterns, thus being the first published system to carry out unsupervised, domain independent, large-scale extraction from Web pages. … When instantiated for a particular relation, these generic patterns yield relation-specific extraction rules that are then used to learn domain-specific extraction rules. The rules are applied to Web pages, identified via search-engine queries, and the resulting extractions are assigned a probability using mutual-information measures derived from search engine hit counts. For example, KnowItAll utilized generic extraction patterns like “<Class> such as <Mem>” to suggest instantiations of <Mem> as candidate members of the class. Next, KnowItAll used frequency information to identify which instantiations are most likely to be bona-fide members of the class. Thus, it was able to confidently label major cities including Seattle, Tel Aviv, and London as members of the class “Cities” (Downey, Etzioni, and Soderland 2005). Finally, KnowItAll learned a set of relation-specific extraction patterns. … KnowItAll is self supervised--- instead of utilizing hand-tagged training data, the system selects and labels its own training examples, and iteratively bootstraps its learning process. In general, self-supervised systems are a species of unsupervised systems because they require no hand-tagged training examples whatsoever. However, unlike classical unsupervised systems (e.g., clustering) self-supervised systems do utilize labeled examples and do form classifiers whose accuracy can be measured using standard metrics. Instead of relying on hand-tagged data, self-supervised systems autonomously “roll their own” labeled examples.
… While self-supervised, KnowItAll is relation-specific--- it requires a laborious bootstrapping process for each relation of interest, and the set of relations of interest has to be named by the human user in advance.
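The assessment step described in the passage above (scoring extractions with mutual-information measures derived from search-engine hit counts) can be sketched roughly as follows. This is an illustration under stated assumptions, not the system's implementation: the score is taken to be the ratio of hits for a discriminator phrase containing the instance to hits for the instance alone, the hit counts in `TOY_HITS` are invented placeholders rather than real search results, and the names `hits`, `pmi_score`, and `rank` are hypothetical. The raw ratio is not a calibrated probability; turning such scores into probabilities would need a further step that is not shown here.

```python
# Toy hit counts standing in for a search-engine interface.
# ASSUMPTION: these numbers are invented for illustration only.
TOY_HITS = {
    "Seattle": 1000,
    "cities such as Seattle": 50,
    "Banana": 1000,
    "cities such as Banana": 0,
}


def hits(query):
    """Return the (toy) number of search results for a query string."""
    return TOY_HITS.get(query, 0)


def pmi_score(instance, discriminator):
    """Ratio of hits for the discriminator phrase filled with the instance
    to hits for the instance alone (0.0 when the instance has no hits)."""
    total = hits(instance)
    if total == 0:
        return 0.0
    return hits(discriminator.format(mem=instance)) / total


def rank(candidates, discriminator):
    """Order candidates from most to least strongly associated."""
    return sorted(
        candidates,
        key=lambda c: pmi_score(c, discriminator),
        reverse=True,
    )


if __name__ == "__main__":
    phrase = "cities such as {mem}"
    for name in rank(["Banana", "Seattle"], phrase):
        print(name, pmi_score(name, phrase))
```

Normalizing by the instance's own hit count keeps a very frequent but unrelated term from outranking a genuine class member merely because it appears on more pages.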
2005
- (Etzioni et al., 2005) ⇒ Oren Etzioni, Michael J. Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. (2005). “Unsupervised Named-Entity Extraction from the Web: An Experimental Study.” In: Artificial Intelligence, 165(1).
2004
- (Etzioni et al., 2004) ⇒ Oren Etzioni, Michael J. Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. (2004). “Web-scale Information Extraction in KnowItAll.” In: Proceedings of WWW 2004.