2009 ExtractingAndManagingStructuredWebData
- (Cafarella, 2009) ⇒ Michael J. Cafarella. (2009). “Extracting and Managing Structured Web Data.” PhD Thesis, University of Washington.
Subject Headings: Semi-Structured Data Information Extraction, WebTables System.
Notes
Quotes
Abstract
- The Web contains a large amount of structured data embedded in natural language text, two-dimensional tables, and other forms. This "Structured Web" of data is vast, messy, and diverse; it also promises new and compelling applications. Unfortunately, existing tools such as search engines and relational databases ignore Structured Web data entirely. This dissertation identifies four design criteria for a successful Structured Web management system. Such systems are:
- 1. Extraction-Focused - They obtain structured data wherever it can be found.
- 2. Domain-Independent - They are not tied to one particular topic area.
- 3. Domain-Scalable - They can effectively manage many domains simultaneously.
- 4. Computationally-Efficient - They can handle the Web's enormous size.
- We also describe three working Structured Web management systems that observe these criteria. TextRunner is an extractor for processing natural language Web text. WebTables extracts and provides applications on top of relations in HTML tables. Finally, Octopus provides integration services over extracted Structured Web data. Together, these three systems demonstrate that managing structured data on the Web is possible today, and also suggest directions for future systems.
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2009 ExtractingAndManagingStructuredWebData | Michael J. Cafarella | | | Extracting and Managing Structured Web Data | | | http://turing.cs.washington.edu/papers/cafarella thesis.pdf | | | 2009 |