Information Extraction from Tables Task

See: Information Extraction Task, Structured Data, HTML Table.

References

(Cafarella, Halevy et al., 2008) ⇒ Michael J. Cafarella, Alon Y. Halevy, Yang Zhang, Daisy Zhe Wang, and Eugene Wu. (2008). “Uncovering the Relational Web.” In: Proceedings of WebDB (WebDB 2008).

(Pinto et al., 2003) ⇒ David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. (2003). “Table Extraction Using Conditional Random Fields.” In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR 2003). doi:10.1145/860435.860479
(Liu et al., 2003) ⇒ Bing Liu, Robert L. Grossman, and Yanhong Zhai. (2003). “Mining Data Records in Web Pages.” In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003). doi:10.1145/956750.956826

(Wang & Hu, 2002) ⇒ Yalin Wang, and Jianying Hu. (2002). “A Machine Learning Based Approach for Table Detection on the Web.” In: Proceedings of the Eleventh International World Wide Web Conference (WWW 2002). doi:10.1145/511446.511478
(Cohen et al., 2002) ⇒ William W. Cohen, Matthew Hurst, and Lee S. Jensen. (2002). “A Flexible Learning System for Wrapping Tables and Lists in HTML Documents.” In: Proceedings of the 11th World Wide Web Conference (WWW 2002).
- It proposes the WhizBang Labs Wrapper Learner (WL2) Wrapper-Learning System.
- It uses a two-dimensional geometric representation to describe pages.
- It uses a master learning algorithm.
- It uses an ordered set of "builders" (which are not learning-based) to encode the learning system’s bias. Each builder is associated with a specific, restricted language L.
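
The interplay between an ordered set of builders and a master learning algorithm can be illustrated with a small sketch. This is not the WL2 implementation: the names Builder, prefix_builder, digits_builder, and master_learn are hypothetical, and the two toy "languages" (shared-prefix tests and all-digit tests) are assumptions chosen only to show how a restricted hypothesis language and the ordering of builders might encode bias.

```python
import os
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Predicate = Callable[[str], bool]


@dataclass
class Builder:
    """Hypothetical builder: proposes one hypothesis from its restricted language."""
    name: str
    propose: Callable[[Sequence[str]], Optional[Predicate]]


def prefix_builder(positives: Sequence[str]) -> Optional[Predicate]:
    # Language: "string starts with P". Propose the longest shared prefix.
    prefix = os.path.commonprefix(list(positives))
    if not prefix:
        return None
    return lambda s: s.startswith(prefix)


def digits_builder(positives: Sequence[str]) -> Optional[Predicate]:
    # Language: "string consists only of digits".
    if all(p.isdigit() for p in positives):
        return lambda s: s.isdigit()
    return None


def master_learn(
    builders: List[Builder],
    positives: Sequence[str],
    negatives: Sequence[str],
) -> Optional[Tuple[str, Predicate]]:
    # Try builders in order; the ordering itself acts as a preference (bias).
    for builder in builders:
        hypothesis = builder.propose(positives)
        if hypothesis is not None and not any(hypothesis(n) for n in negatives):
            return builder.name, hypothesis
    return None


BUILDERS = [Builder("prefix", prefix_builder), Builder("digits", digits_builder)]
```

In this sketch, each builder can only express hypotheses in its own narrow language, and the master loop accepts the first proposal that covers the positive examples while excluding the negatives. Reordering BUILDERS changes which hypothesis is preferred when several are consistent.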

(Crescenzi et al., 2001) ⇒ Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. (2001). “RoadRunner: Towards Automatic Data Extraction from Large Web Sites.” In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001).
(Chang & Lui, 2001) ⇒ Chia-Hui Chang, and Shao-Chen Lui. (2001). “IEPAD: Information Extraction Based on Pattern Discovery.” In: Proceedings of the 10th International Conference on World Wide Web (WWW 2001).
- It presents a system that automatically discovers extraction rules from Web pages.
- It utilizes repeated pattern mining and multiple sequence alignment.
- It can automatically identify record boundaries.
- It proposes the IEPAD Algorithm, composed of three components: an extraction rule generator, a pattern viewer, and an extractor module.
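
The idea of finding record boundaries through repeated patterns can be sketched as follows. This is a simplified illustration, not the IEPAD algorithm: it swaps in a brute-force n-gram count over a tag-token encoding of the page and omits the multiple sequence alignment step entirely. The function names tag_tokens and best_repeat, and the scoring rule (prefer the repeat that covers the most tokens), are assumptions made for this example.

```python
import re
from collections import defaultdict


def tag_tokens(html):
    """Encode a page as tag tokens, collapsing each text run to 'TEXT'."""
    tokens = []
    for m in re.finditer(r"<(/?[A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)", html):
        if m.group(1):
            # Keep only the tag name (attributes dropped), e.g. <li class=x> -> <li>
            tokens.append("<" + m.group(1).lower() + ">")
        elif m.group(2).strip():
            tokens.append("TEXT")
    return tokens


def best_repeat(tokens, min_len=2):
    """Return (pattern, starts) for the non-overlapping repeat covering most tokens."""
    best, best_score = ((), []), (0, 0)
    for n in range(min_len, len(tokens) // 2 + 1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            starts = positions[gram]
            # Greedily keep only occurrences that do not overlap the previous one.
            if not starts or i >= starts[-1] + n:
                starts.append(i)
        for gram, starts in positions.items():
            score = (n * len(starts), n)  # coverage first, then pattern length
            if len(starts) >= 2 and score > best_score:
                best, best_score = (gram, starts), score
    return best


page = "<ul><li><b>A</b>1</li><li><b>B</b>2</li><li><b>C</b>3</li></ul>"
pattern, starts = best_repeat(tag_tokens(page))
print(pattern)  # ('<li>', '<b>', 'TEXT', '</b>', 'TEXT', '</li>')
print(starts)   # [1, 7, 13]
```

Each start offset marks the beginning of one candidate data record in the token sequence. A brute-force count like this is quadratic in page length and is only suitable for small inputs.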