2000 SnowballExtractingRelations
- (Agichtein & Gravano, 2000) ⇒ Eugene Agichtein, Luis Gravano. (2000). “Snowball: Extracting Relations from Large Plain-Text Collections.” In: Proceedings of the 5th ACM International Conference on Digital Libraries (DL 2000). doi:10.1145/336597.336644
Subject Headings: Snowball Algorithm, Snowball System, Semi-Supervised Relation Mention Recognition Algorithm, OrganizationHeadquarter Relation, Lexico-Syntactic Pattern Matching Algorithm.
Notes
- Technical Report: http://www.cs.columbia.edu/techreports/cucs-033-99.pdf
Cited By
Quotes
Abstract
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, that in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
1 Introduction
…
2 The Snowball System
In this section we present the Snowball system (Figure 2), which develops key components of the basic DIPRE method. More specifically, Snowball presents a novel technique to generate patterns and extract tuples from text documents (Sections 2.1 and 2.2). Also, Snowball introduces a strategy for evaluating the quality of the patterns and the tuples that are generated in each iteration of the extraction process (Section 2.3). Only those tuples and patterns that are regarded as being “sufficiently reliable” will be kept by Snowball for the following iterations of the system (Section 2.3). These new strategies for generation and filtering of patterns and tuples improve the quality of the extracted tables significantly, as the experimental evaluation in Section 5 will show.
…
2.1 Generating Patterns
A crucial step in the table extraction process is the generationofpatternstofindnew tuplesinthedocuments. Ideally, wewouldlike patternsbothtobeselective, sothatthey donotgenerateincorrecttuples,andtohave highcoverage, sothatthey identifymany new tuples. Inthissection,weintroducea novelwayofgeneratingsuchpatternsfroma setofseedtuplesanda documentcollection.
…
- Definition 1
- A Snowball pattern is a 5-tuple left, tag1, middle, tag2, right>, where tag1 and tag2 are namedentity tags, and left, middle, and right are vectors associating weights with terms.
…
- Definition 2
- The degree of match [math]\displaystyle{ Match(t_P, t_S) }[/math] between two 5-tuples [math]\displaystyle{ t_P = \lt l_P, t_1, m_P, t_2, r_P \gt }[/math] (with tags t1 and t2) and tS =< lS, t'1, mS, t'2, rS > (with tags t'1 and t'2) is defined as: Match(tP, tS) = lP · lS + mP · mS + rP · rS (if the tags match); otherwise zero (0).
…
2.2 Generating Tuples
…
2.3 Evaluating Patterns and Tuples
…
- Definition 3
- The confidence of a pattern P is: Conf (P) = P.positive / (P.positive + P.negative); where P.positive is the number of positive matches for P and P.negative is the number of negative matches.
…
Our definition of confidence of a pattern above is only one among many possibilities. An alternative is to account for a pattern’s coverage in addition to its selectivity. For this, we adopt a metric originally proposed by Riloff [11] to evaluate extraction patterns generated by the Autoslog-TS information extraction system, and define Conf RlogF (P) of pattern P as follows:
…
- Definition 4
- The RlogF confidence of pattern p is: ConfRlogF(p) = p.positive (p.positive + p.negative)·log2(p.positive)
…
- Definition 5
- The confidence of a candidate tuple T is: Conf(T) = 1 − MULT(i =0,P) (1 − {Conf(Pi) · Match(Ci, Pi)}) where P = {Pi} is the set of patterns that generated T and Ci is the context associated with an occurrence of T that matched Pi.
…
Conclusion
…
We only evaluated our techniques on plain text documents, and it would require future work to adopt our methodology to HTML data. While HTML tags can be naturally incorporated into Snowball’s pattern representation, it is problematic to extract named-entity tags from arbitrary HTML documents. State-of-the-art taggers rely on clues from the text surrounding each entity, which may be absent in HTML documents that often rely on visual formatting to convey information. On a related note, we have assumed throughout that the attributes of the relation we extract (i.e., organization and location) correspond to named entities that our tagger can identify accurately. As we mentioned, named-entity taggers like Alembic can be extended to recognize entities that are distinct in a context-independent way (e.g., numbers, dates, proper names). For some other attributes, we will need to extend Snowball so that its pattern generation and matching could be anchored around, say, a noun phrase as opposed to a named entity as in this paper. In the future, we will also generalize Snowball to relations of more than two attributes. Finally, a crucial open problem is how to generalize our tuple and pattern evaluation strategy of Section 2.3 so that it does not rely on an attribute being a key for the relation.
…
References
- 1. Proceedings of the Sixth Message Understanding Conference. Morgan Kaufman, 1995.
- 2. Avrim Blum and Tom M. Mitchell. Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of the 1998 Conference on Computational Learning Theory, 1998.
- 3. (Brin, 1998) ⇒ Sergey Brin. (1998). “Extracting patterns and relations from the World-Wide Web.” In: Proceedings of on the 1998 International Workshop on Web and Databases (WebDB 1998).
- 4. William Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the 1998 ACMInternational Conference on Management of Data (SIGMOD’98), 1998.
- 5. Michael Collins and Yoram Singer. Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
- 6. M. Craven, D. DiPasquo, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 1999.
- 7. David Day, John Aberdeen, Lynette Hirschman, Robyn Kozierok, Patricia Robinson, and Marc Vilain. Mixedinitiative development of language processing systems. In: Proceedings of the Fifth ACL Conference on Applied Natural Language Processing, April 1997.
- 8. D. Fisher, S. Soderland, J. McCarthy, F. Feng, andW. Lehnert. Description of the UMass systems as used for MUC-6. In: Proceedings of the 6th Message Understanding Conference. Columbia, MD, 1995.
- 9. William B. Frakes and Ricardo Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
- 10. Ralph Grishman. Information extraction: Techniques and challenges. In Information Extraction (International Summer School SCIE-97). Springer-Verlag, 1997.
- 11. Ellen Riloff. Automatically generating extraction patterns from untagged text. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044–1049, 1996.
- 12. Ellen Riloff and Rosie Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
- 13. Gerard M. Salton. Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, 1989.
- 14. Roman Yangarber and Ralph Grishman. NYU: Description of the Proteus/PET system as used for MUC-7. In: Proceedings of the Seventh Message Understanding Conference (MUC-7). Morgan Kaufman, 1998.
- 15. David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Cambridge, MA, 1995.
- 16. Jeonghee Yi and Neel Sundaresan. Mining the web for acronyms using the duality of patterns and relations. In: Proceedings of the 1999 Workshop on Web Information and Data Management, 1999.
,
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2000 SnowballExtractingRelations | Eugene Agichtein Luis Gravano | Snowball: Extracting Relations from Large Plain-Text Collections | Proceedings of the 5th ACM International Conference on Digital Libraries | http://www.mathcs.emory.edu/~eugene/papers/dl00.pdf | 10.1145/336597.336644 | 2000 |