2005 UnsupNEExtrFromTheWeb
- (Etzioni et al., 2005) ⇒ Oren Etzioni, Michael J. Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates. (2005). “Unsupervised Named-Entity Extraction from the Web: An Experimental Study.” In: Artificial Intelligence, 165(1).
Subject Headings: Semi-Supervised Named Entity Recognition Algorithm, KnowItAll System, Information Extraction Task, Pointwise Mutual Information and Information Retrieval.
- (Nadeau & Sekine, 2007) ⇒ David Nadeau, and Satoshi Sekine. (2007). “A Survey of Named Entity Recognition and Classification.” In: Lingvisticae Investigationes, 30(1).
- In Oren Etzioni et al. (2005), Pointwise Mutual Information and Information Retrieval (PMI-IR) is used as a feature to assess that a named entity can be classified under a given type. PMI-IR, developed by P. Turney (2001), measures the dependence between two expressions using web queries. A high PMI-IR means that expressions tend to co-occur. Oren Etzioni et al. create features for each candidate entity (e.g., London) and a large number of automatically generated discriminator phrases like “is a city”, “nation of”, etc.
- (EtzioniBC, 2006) ⇒ Oren Etzioni, Michele Banko, and M. J. Cafarella. (2006). “Machine Reading.” In: Proceedings of AAAI-2006.
- The KnowItAll Web IE system (Etzioni et al., 2005) took the next step in automation by learning to label its own training examples using only a small set of domain-independent extraction patterns, thus being the first published system to carry out unsupervised, domain-independent, large-scale extraction from Web pages. … When instantiated for a particular relation, these generic patterns yield relation-specific extraction rules that are then used to learn domain-specific extraction rules. The rules are applied to Web pages, identified via search-engine queries, and the resulting extractions are assigned a probability using mutual-information measures derived from search engine hit counts. For example, KnowItAll utilized generic extraction patterns like “<Class> such as <Mem>” to suggest instantiations of <Mem> as candidate members of the class. Next, KnowItAll used frequency information to identify which instantiations are most likely to be bona-fide members of the class. Thus, it was able to confidently label major cities including Seattle, Tel Aviv, and London as members of the class “Cities” (Downey, Etzioni, and Soderland 2005). Finally, KnowItAll learned a set of relation-specific extraction patterns. … KnowItAll is self-supervised: instead of utilizing hand-tagged training data, the system selects and labels its own training examples, and iteratively bootstraps its learning process. In general, self-supervised systems are a species of unsupervised systems because they require no hand-tagged training examples whatsoever. However, unlike classical unsupervised systems (e.g., clustering), self-supervised systems do utilize labeled examples and do form classifiers whose accuracy can be measured using standard metrics. Instead of relying on hand-tagged data, self-supervised systems autonomously “roll their own” labeled examples.
… While self-supervised, KnowItAll is relation-specific: it requires a laborious bootstrapping process for each relation of interest, and the set of relations of interest has to be named by the human user in advance.
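The extract-then-assess loop described in the quotes above can be illustrated with a small sketch. It is not KnowItAll's implementation: the regular expression for candidate names, the `HIT_COUNTS` table (with invented numbers standing in for search-engine hit counts), the discriminator phrase "city of <Mem>", and all function names are assumptions made for illustration. The score is a plain hit-count ratio in the spirit of PMI-IR; combining such scores into a probability is omitted.

```python
import re

# Hypothetical stand-in for search-engine hit counts; all numbers are invented.
HIT_COUNTS = {
    "seattle": 1000,
    "city of seattle": 200,
    "monday": 2000,
    "city of monday": 2,
}


def instantiate(generic_pattern, class_label):
    """Fill the <Class> slot and turn the <Mem> slot into a capture group."""
    head, tail = generic_pattern.split("<Mem>")
    head = head.replace("<Class>", class_label)
    # Assumption: a candidate member is a run of capitalised words.
    proper_noun = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"
    return re.compile(re.escape(head) + proper_noun + re.escape(tail))


def pmi_score(instance, discriminator, hits):
    """Ratio of hits for the discriminator phrase with the instance
    to hits for the instance alone (0.0 if the instance has no hits)."""
    joint = hits.get(discriminator.replace("<Mem>", instance).lower(), 0)
    alone = hits.get(instance.lower(), 0)
    return joint / alone if alone else 0.0


text = ("He toured cities such as Seattle last year. "
        "Some pages list cities such as Monday by mistake.")

rule = instantiate("<Class> such as <Mem>", "cities")
candidates = rule.findall(text)

for name in candidates:
    print(name, pmi_score(name, "city of <Mem>", HIT_COUNTS))
```

With these invented counts, both "Seattle" and "Monday" are proposed by the pattern, but "Seattle" scores far higher against the discriminator, which is the kind of signal used to separate real class members from extraction errors.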
The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
5. List Extractor
We now present the third method for increasing KNOWITALL’s recall, the List Extractor (LE). Where the methods described earlier extract information from unstructured text on Web pages, LE uses regular page structure to support extraction. LE locates lists of items on Web pages, learns a wrapper on the fly for each list, automatically extracts items from these lists, then sorts the items by the number of lists in which they appear.
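The final step, sorting items by the number of lists in which they appear, might look like the following minimal sketch. The function name and the sample lists are hypothetical, and the alphabetical tie-break is an assumption added only to make the ordering deterministic.

```python
from collections import Counter


def rank_by_list_count(extracted_lists):
    """Order items by how many distinct lists contain them, most frequent first."""
    counts = Counter()
    for items in extracted_lists:
        counts.update(set(items))  # count an item at most once per list
    # Ties are broken alphabetically (an assumption, for determinism).
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical output of three wrappers applied to three different pages.
lists = [
    ["France", "Israel", "Spain"],
    ["France", "Israel"],
    ["France", "Contact Us"],
]

ranking = rank_by_list_count(lists)
print(ranking)
```

Items that recur across many independently extracted lists rise to the top, while one-off noise such as a navigation label stays near the bottom.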
5.4. Example and parameters
We consider a relatively simple example in Fig. 15 in order to see how the algorithm works, and to illustrate the effects of different parameters on precision, recall, overfitting, and generalization. On top we have the 4 seeds used to search and retrieve the HTML document, and below we have the 5 wrappers learned from at least 2 keywords and their bounding lines in the HTML.
The first wrapper, w1, is learned for the whole HTML document, and matches all 4 keywords; w2 is for the body, and is identical to w1, except for the context; w3 has the same wrapper pattern as w1 and w2, contains all keywords, but has a noticeably different and smaller context (just the single table block); w4 is interesting because here we see an example of overfitting. The suffix is too long and will not extract France. We see a similar problem in w5 where the prefix is too long and will not extract Israel.
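The overfitting effect can be reproduced with a toy wrapper learner. The sketch below is an assumption-laden illustration, not the paper's algorithm: the HTML snippet is invented (it is not the document from Fig. 15), a wrapper is taken to be the longest prefix and suffix shared by the first occurrence of every keyword, and the whole document is used as the only context.

```python
import os
import re

# Invented HTML; it is not the page shown in Fig. 15.
HTML = (
    "<table>"
    "<tr><td><b>Israel</b></td><td>Asia</td></tr>"
    "<tr><td><b>China</b></td><td>Asia</td></tr>"
    "<tr><td><b>France</b></td><td>Europe</td></tr>"
    "<tr><td><b>Spain</b></td><td>Europe</td></tr>"
    "</table>"
)


def learn_wrapper(html, keywords):
    """Return (prefix, suffix): the longest text shared immediately before
    and after the first occurrence of each keyword."""
    befores, afters = [], []
    for keyword in keywords:
        start = html.index(keyword)
        befores.append(html[:start][::-1])  # reversed: shared ending becomes shared start
        afters.append(html[start + len(keyword):])
    prefix = os.path.commonprefix(befores)[::-1]
    suffix = os.path.commonprefix(afters)
    return prefix, suffix


def apply_wrapper(html, wrapper):
    """Extract every tag-free string enclosed by the wrapper's prefix and suffix."""
    prefix, suffix = wrapper
    # The suffix is a lookahead so that adjacent rows can share markup.
    pattern = re.escape(prefix) + r"([^<>]+)(?=" + re.escape(suffix) + ")"
    return re.findall(pattern, html)


general = learn_wrapper(HTML, ["Israel", "China", "France", "Spain"])
overfit = learn_wrapper(HTML, ["Israel", "China"])

print(general, apply_wrapper(HTML, general))
print(overfit, apply_wrapper(HTML, overfit))
```

Learned from all four keywords, the suffix stops at the shared markup and every row is extracted. Learned from only the two keywords whose rows both continue with "Asia", the suffix absorbs that extra text, so the wrapper no longer matches France or Spain.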
