2015 PetuumANewPlatformforDistribute
- (Xing et al., 2015) ⇒ Eric P. Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. (2015). “Petuum: A New Platform for Distributed Machine Learning on Big Data.” In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2015). ISBN:978-1-4503-3664-2. doi:10.1145/2783258.2783323
Subject Headings: ML Platform.
Notes
Cited By
- http://scholar.google.com/scholar?q=%222015%22+Petuum%3A+A+New+Platform+for+Distributed+Machine+Learning+on+Big+Data
- http://dl.acm.org/citation.cfm?id=2783258.2783323&preflayout=flat#citedby
Quotes
Author Keywords
- Big data; big model; data-parallelism; distributed systems; machine learning; model-parallelism; theory
Abstract
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence; all stem from the optimization-centric nature shared in ML programs' mathematical definitions, and the iterative-convergent behavior of their algorithmic solutions. These properties present unique opportunities for an integrative system design, built on bounded-latency network synchronization and dynamic load-balancing scheduling, which is efficient, programmable, and enjoys provable correctness guarantees. We demonstrate how such a design in light of ML-first principles leads to significant performance improvements versus well-known implementations of several ML programs, allowing them to run in much less time and at considerably larger model sizes, on modestly-sized computer clusters.
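The "bounded-latency network synchronization" mentioned in the abstract refers to a stale-synchronous style of parameter sharing: workers exploit the error tolerance of iterative-convergent ML algorithms by reading slightly stale parameter values instead of synchronizing at every step. The following is a minimal illustrative sketch of that idea, not the actual Petuum API; all class and method names here are hypothetical.

```python
class StaleSyncTable:
    """Toy shared parameter table with a staleness bound s (illustrative only):
    a worker at iteration t may read the table only if every other worker
    has finished at least iteration t - s, so reads reflect all but the
    last s iterations of updates."""

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers   # per-worker iteration counters
        self.params = {}                  # parameter key -> accumulated value

    def inc(self, key, delta):
        # Updates are commutative increments, matching the error-tolerant,
        # additive update structure of many iterative-convergent algorithms.
        self.params[key] = self.params.get(key, 0.0) + delta

    def clock(self, worker):
        # The worker signals completion of one iteration.
        self.clocks[worker] += 1

    def can_read(self, worker):
        # A read is permitted only if the slowest worker is within
        # `staleness` iterations of this worker's clock.
        return min(self.clocks) >= self.clocks[worker] - self.staleness

    def get(self, worker, key):
        if not self.can_read(worker):
            raise RuntimeError("worker must wait for stragglers to catch up")
        return self.params.get(key, 0.0)
```

With `staleness=1`, a worker may run one iteration ahead of the slowest worker and still read freely; two iterations ahead, it must block. Setting `staleness=0` recovers bulk-synchronous execution, which is one way to see why bounded staleness trades a controlled amount of parameter error for reduced synchronization latency.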
References
| Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Eric P. Xing; Qirong Ho; Wei Dai; Jin-Kyu Kim; Jinliang Wei; Seunghak Lee; Xun Zheng; Pengtao Xie; Abhimanu Kumar; Yaoliang Yu | | 2015 | Petuum: A New Platform for Distributed Machine Learning on Big Data | | | | 10.1145/2783258.2783323 | | 2015 |