2007 OpenIEfromTheWeb
- (Banko et al., 2007) ⇒ Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni. (2007). “Open Information Extraction from the Web.” In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007).
Subject Headings: Open Information Extraction Task, TextRunner Algorithm, TextRunner System.
Notes
- See:
  - (Sekine, 2006) ⇒ Satoshi Sekine. (2006). “On-Demand Information Extraction.” In: Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006).
  - (Shinyama & Sekine, 2006) ⇒ Yusuke Shinyama, and Satoshi Sekine. (2006). “Preemptive Information Extraction Using Unrestricted Relation Discovery.” In: Proceedings of the HLT-NAACL Conference (HLT-NAACL 2006).
Cited By
2007
- (Wu & Weld, 2007) ⇒ Fei Wu, and Daniel S. Weld. (2007). “Autonomously Semantifying Wikipedia.” In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007). doi:10.1145/1321440.1321449
Quotes
Abstract
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries.
We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
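Illustration (not part of the quoted abstract): the paradigm the abstract describes, a single pass over a corpus that yields relational tuples which are then indexed for querying, can be sketched in a few lines. This is a toy under stated assumptions and not TEXTRUNNER's actual method: the regular-expression pattern, the names `Extraction`, `extract`, `build_index`, and `query`, and the use of a raw support count in place of the probability the paper assigns to each tuple are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Extraction(NamedTuple):
    """A relational tuple (arg1, relation, arg2)."""
    arg1: str
    relation: str
    arg2: str


# Toy pattern (an assumption, not the paper's extractor): two capitalized
# phrases joined by one to four lowercase words acting as the relation.
_ENTITY = r"[A-Z]\w*(?:\s[A-Z]\w*)*"
_PATTERN = re.compile(rf"({_ENTITY})\s+((?:[a-z]+\s+){{1,4}})({_ENTITY})")


def extract(sentence):
    """Return every tuple the toy pattern finds in one sentence."""
    return [
        Extraction(m.group(1), m.group(2).strip(), m.group(3))
        for m in _PATTERN.finditer(sentence)
    ]


def build_index(corpus):
    """Make one pass over the corpus, counting and indexing tuples.

    The support count is only a stand-in for the per-tuple probability
    mentioned in the abstract; no probability model is implemented here.
    """
    counts = defaultdict(int)
    index = defaultdict(set)
    for sentence in corpus:  # single pass, no relation names given up front
        for tup in extract(sentence):
            counts[tup] += 1
            for field in tup:
                for token in field.lower().split():
                    index[token].add(tup)
    return counts, index


def query(index, term):
    """Look up all tuples that mention the given term in any field."""
    return sorted(index.get(term.lower(), set()))
```

For example, calling `build_index` on a small list of sentences and then `query(index, "tesla")` returns every indexed tuple that mentions that token, without the relation having been named in advance.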
References
- (Sekine, 2006) ⇒ Satoshi Sekine. (2006). “On-Demand Information Extraction.” In: Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006).
- (Shinyama & Sekine, 2006) ⇒ Yusuke Shinyama, and Satoshi Sekine. (2006). “Preemptive Information Extraction Using Unrestricted Relation Discovery.” In: Proceedings of the HLT-NAACL Conference (HLT-NAACL 2006).
- (Culotta & McCallum, 2004) ⇒ Aron Culotta, and Andrew McCallum. (2004). “Confidence Estimation for Information Extraction.” In: Proceedings of the HLT-NAACL Conference (HLT-NAACL 2004).