2004 AccurateIEfromResearchPapersUsingCRFs

Subject Headings: Information Extraction Task, Linear-Chain CRF, CRF Training Algorithm, Regularization, Citation Information Extraction Task, Constraint Information Extraction.

Notes

It applies CRFs to extraction from research paper headers and reference sections.
It reports state-of-the-art accuracy on a standard benchmark dataset.
It compares several regularization methods (Gaussian, Laplace, and hyperbolic-L1 priors).
It was republished as a journal paper:
- (Peng & McCallum, 2006) ⇒ Fuchun Peng, and Andrew McCallum. (2006). “Accurate Information Extraction from Research Papers using Conditional Random Fields.” In: Information Processing & Management, 42(4). doi:10.1016/j.ipm.2005.09.002

With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This article employs conditional random fields (CRFs) for the task of extracting various common fields from the headers and citations of research papers. CRFs provide a principled way for incorporating various local features, external lexicon features and global layout features. The basic theory of CRFs is becoming well understood, but best practices for applying them to real-world data require additional exploration. We make an empirical exploration of several factors, including variations on Gaussian, Laplace and hyperbolic-L1 priors for improved regularization, and several classes of features. Based on CRFs, we further present a novel approach for constraint co-reference information extraction; i.e., improving extraction performance given that we know some citations refer to the same publication. On a standard benchmark dataset, we achieve new state-of-the-art performance, reducing error in average F1 by 36%, and word error rate by 78% in comparison with the previous best SVM results. Accuracy compares even more favorably against HMMs.
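
The three priors named in the abstract can be read as penalty terms subtracted from the CRF log-likelihood during training. The sketch below illustrates the forms these penalties commonly take; the function names, the default hyperparameters (`sigma2`, `alpha`), and the exact parameterization are assumptions made for illustration and may differ from what the paper uses.

```python
import math


def gaussian_penalty(weights, sigma2=1.0):
    """Zero-mean Gaussian prior: sum of w^2 / (2 * sigma2), constants dropped."""
    return sum(w * w for w in weights) / (2.0 * sigma2)


def laplace_penalty(weights, alpha=1.0):
    """Laplace (double-exponential) prior: alpha * sum of |w|, constants dropped."""
    return alpha * sum(abs(w) for w in weights)


def _log_cosh(w):
    """Numerically stable log(cosh(w)) = |w| + log(1 + exp(-2|w|)) - log(2)."""
    a = abs(w)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def hyperbolic_l1_penalty(weights):
    """Hyperbolic-L1 penalty, assumed here to be sum of log((e^w + e^-w) / 2).

    It is smooth at zero like the Gaussian penalty but grows linearly
    for large |w| like the Laplace penalty.
    """
    return sum(_log_cosh(w) for w in weights)


def penalized_objective(log_likelihood, weights, penalty):
    """Training objective to maximize: log-likelihood minus the chosen penalty."""
    return log_likelihood - penalty(weights)


if __name__ == "__main__":
    # Hypothetical feature weights and log-likelihood, for illustration only.
    w = [0.5, -2.0, 0.0, 3.0]
    for name, pen in [("gaussian", gaussian_penalty),
                      ("laplace", laplace_penalty),
                      ("hyperbolic-L1", hyperbolic_l1_penalty)]:
        print(name, penalized_objective(-10.0, w, pen))
```

Under these assumed forms, the Gaussian penalty shrinks all weights smoothly, while the Laplace and hyperbolic-L1 penalties grow only linearly in the weight magnitude and so penalize large weights less severely.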


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 AccurateIEfromResearchPapersUsingCRFs	Fuchun Peng, Andrew McCallum			Accurate Information Extraction from Research Papers using Conditional Random Fields		Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics	http://www.cs.umass.edu/~mccallum/papers/hlt2004.pdf			2004