2004 TowardsTerascaleKnowledgeAcquisition
- (Pantel et al., 2004) ⇒ Patrick Pantel, Deepak Ravichandran, Eduard Hovy. (2004). “Towards Terascale Knowledge Acquisition.” In: Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004). doi:10.3115/1220355.1220466
Subject Headings: Web-based Information Extraction, Is-A Relation.
Notes
Cited By
2004
- (Ravichandran et al., 2004) ⇒ Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. (2004). “The Terascale Challenge.” In: Proceedings of KDD Workshop on Mining for and from the Semantic Web (MSW-04).
Quotes
Abstract
Although vast amounts of textual data are freely available, many NLP algorithms exploit only a minute percentage of it. In this paper, we study the challenges of working at the terascale. We present an algorithm, designed for the terascale, for mining is-a relations that achieves similar performance to a state-of-the-art linguistically-rich method. We focus on the accuracy of these two systems as a function of processing time and corpus size.
…
Pattern-based approaches
Marti Hearst (1992) was the first to use a pattern-based approach to extract hyponym relations from a raw corpus. She used an iterative process to semi-automatically learn patterns. However, a corpus of 20MB words yielded only 400 examples. Our pattern-based algorithm is very similar to the one used by Hearst. She uses seed examples to manually discover her patterns, whereas we use a minimal edit distance algorithm to automatically discover the patterns.
Riloff and Shepherd (1997) used a semiautomatic method for discovering similar words using a few seed examples by using pattern-based techniques and human supervision. Berland and Charniak (1999) used similar pattern-based techniques and other heuristics to extract meronymy (part-whole) relations. They reported an accuracy of about 55% precision on a corpus of 100,000 words. Girju et al. (2003) improved upon Berland and Charniak’s work using a machine learning filter. Mann (2002) and Fleischman et al. (2003) used part of speech patterns to extract a subset of hyponym relations involving proper nouns.
Our pattern-based algorithm differs from these approaches in two ways. We learn lexico-POS patterns in an automatic way. Also, the patterns are learned with the specific goal of scaling to the terascale (see Table 2).
Scalable pattern-based approach
We propose an algorithm for learning highly scalable lexico-POS patterns. Given two sentences with their surface form and part of speech tags, the algorithm finds the optimal lexico-POS alignment. For example, consider the following 2 sentences:
- 1) Platinum is a precious metal.
- 2) Molybdenum is a metal.
Applying a POS tagger (Brill 1995) gives the following output:
| Surface | Platinum | is | a | precious | metal | . |
| POS | NNP | VBZ | DT | JJ | NN | . |
| Surface | Molybdenum | is | a | metal | . |
| POS | NNP | VBZ | DT | NN | . |
A very good pattern to generalize from the alignment of these two strings would be
| Surface |     | is | a |  | metal | . |
| POS | NNP |  |  |  |  | . |
We use the following notation to denote this alignment: "_NNP is a (*s*) metal.", where "_NNP" represents the POS tag NNP.
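The alignment step can be illustrated with a minimal Python sketch based on a standard edit-distance dynamic program. The excerpt does not give the cost scheme, so the costs here are assumptions made for illustration: identical words align for free, tokens that share only a POS tag align at cost 1 and are generalized to `_TAG`, and a token left unaligned costs 1 and becomes a skip slot `(*s*)`. Tokens that share neither word nor tag are never paired in this sketch. The names `Token`, `pair_cost`, and `align` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface word, POS tag)

SKIP = "(*s*)"
GAP = 1                  # assumed cost of leaving one token unaligned
INF = float("inf")       # pairing tokens that share nothing is disallowed here


def pair_cost(x: Token, y: Token) -> float:
    """Assumed costs: same word 0, same POS tag only 1, otherwise disallowed."""
    if x[0] == y[0]:
        return 0
    if x[1] == y[1]:
        return 1
    return INF


def align(a: List[Token], b: List[Token]) -> List[str]:
    """Return a lexico-POS pattern that generalizes tagged sentences a and b."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * GAP
    for j in range(1, m + 1):
        cost[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + pair_cost(a[i - 1], b[j - 1]),
                cost[i - 1][j] + GAP,
                cost[i][j - 1] + GAP,
            )
    # Trace back from the end to recover the pattern tokens.
    pattern: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + pair_cost(a[i - 1], b[j - 1]):
            x, y = a[i - 1], b[j - 1]
            pattern.append(x[0] if x[0] == y[0] else "_" + x[1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + GAP:
            pattern.append(SKIP)
            i -= 1
        else:
            pattern.append(SKIP)
            j -= 1
    pattern.reverse()
    return pattern


platinum = [("Platinum", "NNP"), ("is", "VBZ"), ("a", "DT"),
            ("precious", "JJ"), ("metal", "NN"), (".", ".")]
molybdenum = [("Molybdenum", "NNP"), ("is", "VBZ"), ("a", "DT"),
              ("metal", "NN"), (".", ".")]

print(" ".join(align(platinum, molybdenum)))
```

On the two example sentences, the cheapest alignment pairs the two proper nouns by tag and leaves "precious" unaligned, which yields the tokens of the pattern quoted above.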
To perform such alignments we introduce two wildcard operators, skip (*s*) and wildcard (*g*). The skip operator represents 0 or 1 instance of any word (similar to the \w* pattern in Perl), while the wildcard operator represents exactly 1 instance of any word (similar to the \w+ pattern in Perl).
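How the two operators behave when a pattern is applied to a tagged sentence can be sketched as follows. This is an illustrative matcher, not the authors' implementation: the function name `matches` and the whole-sentence matching policy are assumptions, and a literal word beginning with an underscore would be misread as a POS tag in this simplified notation.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface word, POS tag)


def matches(pattern: List[str], sent: List[Token]) -> bool:
    """True if the pattern covers the whole tagged sentence."""
    if not pattern:
        return not sent
    head, rest = pattern[0], pattern[1:]
    if head == "(*s*)":
        # Skip operator: consume zero words, or exactly one word.
        return matches(rest, sent) or (bool(sent) and matches(rest, sent[1:]))
    if not sent:
        return False
    word, pos = sent[0]
    if head == "(*g*)":
        ok = True                # wildcard: exactly one word, any word
    elif head.startswith("_"):
        ok = head[1:] == pos     # POS slot such as _NNP
    else:
        ok = head == word        # literal surface word
    return ok and matches(rest, sent[1:])


skip_pattern = ["_NNP", "is", "a", "(*s*)", "metal", "."]
wild_pattern = ["_NNP", "is", "a", "(*g*)", "metal", "."]

platinum = [("Platinum", "NNP"), ("is", "VBZ"), ("a", "DT"),
            ("precious", "JJ"), ("metal", "NN"), (".", ".")]
molybdenum = [("Molybdenum", "NNP"), ("is", "VBZ"), ("a", "DT"),
              ("metal", "NN"), (".", ".")]

print(matches(skip_pattern, platinum), matches(skip_pattern, molybdenum))
print(matches(wild_pattern, platinum), matches(wild_pattern, molybdenum))
```

In this sketch the skip pattern accepts both example sentences, while the wildcard pattern accepts only the one that has a word between "a" and "metal".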
…
References
- Banko, M. and Brill, E. (2001). Mitigating the paucity of data problem. In: Proceedings of HLT-2001. San Diego, CA.
- Berland, M. and Eugene Charniak (1999). Finding parts in very large corpora. In: ACL-1999. pp. 57-64. College Park, MD.
- Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-566.
- Brill, E.; Lin, J.; Banko, M.; Dumais, S.; and Ng, A. (2001). Data-intensive question answering. In: Proceedings of the TREC-10 Conference. pp. 183-189. Gaithersburg, MD.
- Caraballo, S. (1999). Automatic acquisition of a hypernym-labeled noun hierarchy from text. In: Proceedings of ACL-99. pp. 120-126. Baltimore, MD.
- Curran, J. and Moens, M. (2002). Scaling context space. In: Proceedings of ACL-02. pp. 231-238. Philadelphia, PA.
- Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
- Oren Etzioni; Cafarella, M.; Downey, D.; Kok, S.; Popescu, A.M.; Shaked, T.; Soderland, S.; Weld, D. S.; and Yates, A. (2004). Web-scale information extraction in KnowItAll (Preliminary Results). To appear in the Conference on WWW.
- Fleischman, M.; Eduard Hovy; and Echihabi, A. (2003). Offline strategies for online question answering: Answering questions before they are asked. In: Proceedings of ACL-03. pp. 1-7. Sapporo, Japan.
- Girju, R.; Badulescu, A.; and Dan Moldovan (2003). Learning semantic constraints for the automatic discovery of part-whole relations. In: Proceedings of HLT/NAACL-03. pp. 80-87. Edmonton, Canada.
- Harris, Z. (1985). Distributional structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.
- Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In: COLING-92. pp. 539-545. Nantes, France.
- Hindle, D. (1990). Noun classification from predicate-argument structures. In: Proceedings of ACL-90. pp. 268-275. Pittsburgh, PA.
- Dekang Lin (1994). Principar - an efficient, broad-coverage, principle-based parser. In: Proceedings of COLING-94. pp. 42-48. Kyoto, Japan.
- Dekang Lin (1998). Automatic retrieval and clustering of similar words. In: Proceedings of COLING/ACL-98. pp. 768-774. Montreal, Canada.
- Mann, G. S. (2002). Fine-Grained Proper Noun Ontologies for Question Answering. SemaNet 02: Building and Using Semantic Networks, Taipei, Taiwan.
- George A. Miller (1990). WordNet: An online lexical database. International Journal of Lexicography, 3(4).
- Och, F.J. and Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of ACL. pp. 295-302. Philadelphia, PA.
- Patrick Pantel and Dekang Lin (2002). Discovering Word Senses from Text. In: Proceedings of SIGKDD-02. pp. 613-619. Edmonton, Canada.
- Patrick Pantel and Ravichandran, D. (2004). Automatically labeling semantic classes. In: Proceedings of HLT/NAACL-04. pp. 321-328. Boston, MA.
- Ellen Riloff and Shepherd, J. (1997). A corpus-based approach for building semantic lexicons. In: Proceedings of EMNLP-1997.
- Ellen Voorhees. (2003). Overview of the question answering track. In: Proceedings of TREC-12 Conference. NIST, Gaithersburg, MD.
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2004 TowardsTerascaleKnowledgeAcquisition | Eduard Hovy, Patrick Pantel, Deepak Ravichandran | | | Towards Terascale Knowledge Acquisition | | | http://www.isi.edu/natural-language/people/ravichan/papers/coling04.pdf | 10.3115/1220355.1220466 | | |