2012 StratifiedKMeansClusteringovera

Subject Headings:

Notes

This paper focuses on the problem of clustering data from a hidden or a deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs.

We have evaluated our methods using two synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both the novel aspects of our work, i.e., the use of stratification (5%-55%), and our representative sampling methods (up to 54%).


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2012 StratifiedKMeansClusteringovera	Gagan Agrawal Tantan Liu			Stratified K-means Clustering over a Deep Web Data Source				10.1145/2339530.2339705		2012