2014 ScalingOutBigDataMissingValueIm

(Anagnostopoulos & Triantafillou, 2014) ⇒ Christos Anagnostopoulos, and Peter Triantafillou. (2014). “Scaling Out Big Data Missing Value Imputations: Pythia Vs. Godzilla.” In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2014). ISBN:978-1-4503-2956-9 doi:10.1145/2623330.2623615

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Big data; clustering; missing value

Abstract

Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine ('Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability are just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamental questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to those of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcases our efficiency, scalability, and accuracy claims.
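
The two pillars named in the abstract (one signature per cohort partition, plus a similarity test that picks which cohorts to engage) can be illustrated with a minimal sketch. This is not the paper's algorithm: the signature (a per-column mean), the similarity measure (Euclidean distance over the observed attributes), the imputation rule (a column mean over the engaged cohorts), and every function name below are assumptions chosen only to make the idea concrete.

```python
import math
from statistics import fmean

def signature(partition):
    """Hypothetical signature: the per-column mean of one cohort's rows."""
    return [fmean(column) for column in zip(*partition)]

def distance(query, sig):
    """Euclidean distance using only the query's observed (non-None) attributes."""
    return math.sqrt(sum((q - s) ** 2 for q, s in zip(query, sig) if q is not None))

def select_cohorts(query, signatures, k):
    """Indices of the k cohorts whose signatures are closest to the query."""
    ranked = sorted(range(len(signatures)), key=lambda i: distance(query, signatures[i]))
    return ranked[:k]

def impute(query, partitions, signatures, k=1):
    """Fill each missing attribute with its mean over the engaged cohorts' rows.

    Stand-in imputation rule; any imputation algorithm could be plugged in here.
    """
    rows = [row for i in select_cohorts(query, signatures, k) for row in partitions[i]]
    return [q if q is not None else fmean([row[j] for row in rows])
            for j, q in enumerate(query)]

# Toy data: two cohorts, each holding a small partition of complete rows.
partitions = [
    [[1.0, 2.0], [1.5, 2.5]],      # cohort 0
    [[10.0, 20.0], [10.5, 20.5]],  # cohort 1
]
signatures = [signature(p) for p in partitions]
query = [10.25, None]              # second attribute is missing

engaged = select_cohorts(query, signatures, k=1)
completed = impute(query, partitions, signatures, k=1)
print(engaged, completed)
```

In this toy setup only cohort 1 is engaged, so the imputation reads two rows instead of all four. That is the kind of saving the abstract argues for when it proposes accessing only a subset of cohorts per request.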
