2009 ExtractingAndManagingStructuredWebData

Subject Headings: Semi-Structured Data Information Extraction, WebTables System.

Notes

The Web contains a large amount of structured data embedded in natural language text, two-dimensional tables, and other forms. This "Structured Web" of data is vast, messy, and diverse; it also promises new and compelling applications. Unfortunately, existing tools such as search engines and relational databases ignore Structured Web data entirely. This dissertation identifies four design criteria for a successful Structured Web management system. Such systems are:
1. Extraction-Focused - They obtain structured data wherever it can be found.
2. Domain-Independent - They are not tied to one particular topic area.
3. Domain-Scalable - They can effectively manage many domains simultaneously.
4. Computationally-Efficient - They can handle the Web's enormous size.
We also describe three working Structured Web management systems that observe these criteria. TextRunner is an extractor for processing natural language Web text. WebTables extracts and provides applications on top of relations in HTML tables. Finally, Octopus provides integration services over extracted Structured Web data. Together, these three systems demonstrate that managing structured data on the Web is possible today, and also suggest directions for future systems.


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2009 ExtractingAndManagingStructuredWebData	Michael J. Cafarella			Extracting and Managing Structured Web Data			http://turing.cs.washington.edu/papers/cafarella thesis.pdf			2009