2009 GeneralizedExpectCritforBootstrap...
- (Bellare & McCallum, 2009) ⇒ Kedar Bellare, and Andrew McCallum. (2009). “Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment.” In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).
Subject Headings:
Notes
Cited By
Quotes
Abstract
Traditionally, machine learning approaches for information extraction require human annotated data that can be costly and time-consuming to produce. However, in many cases, there already exists a database (DB) with schema related to the desired output, and records related to the expected input text. We present a conditional random field (CRF) that aligns tokens of a given DB record and its realization in text. The CRF model is trained using only the available DB and unlabeled text with generalized expectation criteria. An annotation of the text induced from inferred alignments is used to train an information extractor. We evaluate our method on a citation extraction task in which alignments between DBLP database records and citation texts are used to train an extractor. Experimental results demonstrate an error reduction of 35% over a previous state-of-the-art method that uses heuristic alignments.
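The final step described in the abstract, turning record-text alignments into token-level annotations, can be illustrated with a deliberately simple sketch. It uses exact token matching, which is a heuristic alignment of the kind the paper reports improving on, and is not the authors' CRF trained with generalized expectation criteria. The function name `project_labels`, the `"O"` label for unaligned tokens, and the example record are all hypothetical.

```python
from typing import Dict, List


def project_labels(record: Dict[str, str], text_tokens: List[str]) -> List[str]:
    """Label each text token with the first record field that contains it.

    Matching is case-insensitive and ignores surrounding punctuation.
    Tokens that match no field get the label "O" (assumed convention).
    """
    # Pre-compute the lowercase token set of every field in the DB record.
    field_tokens = {
        field: {t.lower() for t in value.split()}
        for field, value in record.items()
    }
    labels = []
    for token in text_tokens:
        key = token.lower().strip(".,;:()")
        label = "O"
        for field, tokens in field_tokens.items():
            if key and key in tokens:
                label = field
                break
        labels.append(label)
    return labels


# Hypothetical DB record and citation text, for illustration only.
record = {
    "author": "Kedar Bellare",
    "title": "Generalized Expectation Criteria",
    "year": "2009",
}
tokens = "K. Bellare . Generalized expectation criteria . EMNLP 2009 .".split()
labels = project_labels(record, tokens)
```

In this toy case the abbreviated first name `K.` is left unlabeled because it does not match `Kedar` exactly, which is the sort of gap a learned alignment model is meant to close.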
References
- Eugene Agichtein and Venkatesh Ganti. 2004. Mining reference tables for automatic text segmentation. In KDD.
- Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In ICDL.
- Kedar Bellare and Andrew McCallum. 2007. Learning extractors from unlabeled text using relevant databases. In IIWeb workshop at AAAI 2007.
- Phil Blunsom and Trevor Cohn. 2006. Discriminative alignment with conditional random fields. In ACL.
- Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In EDBT Workshop, pages 172–183.
- Peter Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19:263–311.
- Sander Canisius and Caroline Sporleder. 2007. Bootstrapping information extraction from field books. In EMNLP-CoNLL.
- M. Chang, L. Ratinov, and D. Roth. 2007. Guiding semi-supervision with constraint-driven learning. In ACL, pages 280–287.
- William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In IJCAI.
- O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence, 165.
- D. Freitag and A. McCallum. 1999. Information extraction with HMM and shrinkage. In AAAI.
- T. Grenager, D. Klein, and C. D. Manning. 2005. Unsupervised learning of field segmentation models for information extraction. In ACL.
- Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. In HLT-NAACL.
- John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, page 282.
- P. Liang, M. I. Jordan, and D. Klein. 2009. Learning semantic correspondences with less supervision. In Association for Computational Linguistics (ACL).
- Gideon S. Mann and Andrew McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of ACL’08, pages 870–878.
- I. R. Mansuri and S. Sarawagi. 2006. Integrating unstructured data into relational databases. In ICDE.
- Andrew McCallum, Kedar Bellare, and Fernando Pereira. 2005. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI.
- Matthew Michelson and Craig A. Knoblock. 2005. Semantic annotation of unstructured and ungrammatical text. In IJCAI, pages 1091–1098.
- Matthew Michelson and Craig A. Knoblock. 2008. Creating relational data from unstructured and ungrammatical data sources. JAIR, 31:543–590.
- Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29.
- Fuchun Peng and A. McCallum. 2004. Accurate information extraction from research papers using conditional random fields. In HLT-NAACL.
- Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech processing. Proceedings of the IEEE, 77(2):257–286.
- Sridhar Ramakrishnan and Sarit Mukherjee. 2004. Taming the unstructured: Creating structured content from partially labeled schematic text sequences. In CoopIS/DOA/ODBASE, volume 2, page 909.
- Sunita Sarawagi and William W. Cohen. 2004. Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods. In KDD, page 89.
- Sunita Sarawagi and William W. Cohen. 2005. Semi-Markov conditional random fields for information extraction. In NIPS.
- Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, pages 1070–1079.
- K. Seymore, A. McCallum, and R. Rosenfeld. 1999. Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI Workshop on ML for IE.
- Benjamin Snyder and Regina Barzilay. 2007. Database-text alignment via structured multi-label classification. In IJCAI.
- Charles Sutton, Michael Sindelar, and Andrew McCallum. 2006. Reducing weight undertraining in structured discriminative learning. In HLT-NAACL.
- Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A discriminative matching approach to word alignment. In HLT-EMNLP, pages 73–80.
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2009 GeneralizedExpectCritforBootstrap... | Kedar Bellare | | | Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment | | | http://www.cs.umass.edu/~kedarb/papers/dbie ge align.pdf | | | |