2003 AutomaticAnnotOfContentRichTHMLDocs

(Mukherjee et al., 2003) ⇒ Saikat Mukherjee, Guizhen Yang, and I.V. Ramakrishnan. (2003). “Automatic Annotation of Content-rich HTML Documents: Structural and Semantic Analysis.” In: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003). doi:10.1007/978-3-540-39718-2_34

Subject Headings: Semantic Analysis, HTML Document.

Notes

Cited By

~62 http://scholar.google.com/scholar?q=%22Automatic+Annotation+of+Content-rich+HTML+Documents%3A+Structural+and+Semantic+Analysis%22+2003

2006

(Yang et al., 2006) ⇒ Xin Yang, Peifeng Xiang, and Yuanchun Shi. (2006). “Semantic HTML Page Segmentation using Type Analysis.” In: Proceedings of 2006 1st International Symposium on Pervasive Computing and Applications.
- In (Mukherjee et al., 2003), a framework coupling structural analysis of documents with semantic analysis using a domain ontology is developed to partition HTML documents into unlabeled partition trees by grouping together elements with related semantics. It exploits the key observation that semantically related items exhibit consistency in presentation style and spatial locality, and tries to discover structural recurrence patterns for semantically related items under each subtree through a bottom-up process. However, it has two inherent limitations. First, it uses the specified HTML tag path as the type of each node, making it time-consuming and not suitable for real-time processing. Second, it relies on pattern discovery but overlooks visual cues, and is not accurate enough and can hardly achieve completeness.

2005

(Reeve & Han, 2005) ⇒ Lawrence Reeve, and Hyoil Han. (2005). “Survey of Semantic Annotation Platforms.” In: Proceedings of the 2005 ACM Symposium on Applied Computing [doi:10.1145/1066677.1067049].

Quotes

Abstract

Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically annotating HTML documents with semantic labels. Exploiting a key observation that semantically related items exhibit consistency in presentation style as well as spatial locality in template-based content-rich HTML documents, we have developed a novel framework for automatically partitioning such documents into semantic structures. Our framework tightly couples structural analysis of documents with semantic analysis incorporating domain ontologies and lexical databases such as WordNet. We present experimental evidence of the effectiveness of our techniques on a large collection of HTML documents from various news portals.
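
The key observation in the abstract (semantically related items share presentation style and sit close together) can be illustrated with a minimal sketch. This is not the authors' framework: it omits the ontology and WordNet analysis entirely, and it assumes, purely for illustration, that "presentation style" can be approximated by the root-to-leaf HTML tag path and "spatial locality" by adjacency in document order. The names `TagPathCollector` and `group_by_tag_path` are hypothetical.

```python
from html.parser import HTMLParser
from itertools import groupby

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Record (tag_path, text) for every non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.leaves = []  # (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def group_by_tag_path(html):
    """Group consecutive text nodes that share the same tag path.

    Assumption (illustrative only): same tag path ~ same presentation style,
    and consecutive position ~ spatial locality.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return [
        (path, [text for _, text in run])
        for path, run in groupby(collector.leaves, key=lambda leaf: leaf[0])
    ]


if __name__ == "__main__":
    page = (
        "<div><h2>World</h2><ul>"
        "<li><a href='#'>Story A</a></li><li><a href='#'>Story B</a></li>"
        "</ul><h2>Sports</h2><ul><li><a href='#'>Story C</a></li></ul></div>"
    )
    for path, texts in group_by_tag_path(page):
        print(path, texts)
```

On a template-based news page, each run of items with an identical tag path forms one candidate group, and a differently styled item such as a section heading starts a new run. Assigning semantic labels to those groups is the part that would require the domain ontology and lexical resources described in the abstract.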

References

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2003 AutomaticAnnotOfContentRichTHMLDocs	Saikat Mukherjee Guizhen Yang I.V. Ramakrishnan			Automatic Annotation of Content-rich HTML Documents: Structural and Semantic Analysis		Proceedings of the 2nd International Semantic Web Conference	http://www.ai.sri.com/~yang/papers/iswc2003.pdf	10.1007/978-3-540-39718-2_34		2003