2012 ResilientDistributedDatasetsAFa: Difference between revisions

Latest revision as of 01:50, 5 January 2021

(Zaharia et al., 2012) ⇒ Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. (2012). “Resilient Distributed Datasets: A Fault-tolerant Abstraction for in-memory Cluster Computing.” In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation.

Subject Headings: Resilient Distributed Dataset Structure.

Notes

Cited By

Quotes

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

References

;

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2012 ResilientDistributedDatasetsAFa	Justin Ma Ion Stoica Matei Zaharia Mosharaf Chowdhury Tathagata Das Ankur Dave Murphy McCauley Michael J. Franklin Scott Shenker			Resilient Distributed Datasets: A Fault-tolerant Abstraction for in-memory Cluster Computing

@@ Line 1: / Line 1: @@
-* ([[2012_ResilientDistributedDatasetsAFa|Zaharia et al., 2012]]) &rArr; [[author::Matei Zaharia]], [[author::Mosharaf Chowdhury]], [[author::Tathagata Das]], [[author::Ankur Dave]], [[author::Justin Ma]], [[author::Murphy McCauley]], [[author::Michael J. Franklin]], [[author::Scott Shenker]], and [[author::Ion Stoica]]. ([[year::2012]]). "[https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf Resilient Distributed Datasets: A Fault-tolerant Abstraction for in-memory Cluster Computing]." In: [[Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation]].
+* ([[2012_ResilientDistributedDatasetsAFa|Zaharia et al., 2012]]) ⇒ [[author::Matei Zaharia]], [[author::Mosharaf Chowdhury]], [[author::Tathagata Das]], [[author::Ankur Dave]], [[author::Justin Ma]], [[author::Murphy McCauley]], [[author::Michael J. Franklin]], [[author::Scott Shenker]], and [[author::Ion Stoica]]. ([[year::2012]]). “[https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf Resilient Distributed Datasets: A Fault-tolerant Abstraction for in-memory Cluster Computing].” In: [[Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation]].
-<B>Subject Headings:</B>
+<B>Subject Headings:</B> [[Resilient Distributed Dataset Structure]].
-==Notes==
+== Notes ==
-==Cited By==
+== Cited By ==
 * http://scholar.google.com/scholar?q=%222012%22+Resilient+Distributed+Datasets%3A+A+Fault-tolerant+Abstraction+for+in-memory+Cluster+Computing
 * http://dl.acm.org/citation.cfm?id=2228298.2228301&preflayout=flat#citedby
-==Quotes==
+== Quotes ==
-===Abstract===
+=== Abstract ===
-[[We]] present [[Resilient Distributed Datasets (RDDs)]], a [[distributed memory abstraction]] that lets programmers perform [[in-memory computation]]s on [[large cluster]]s in a [[fault-tolerant manner]]. </s>
+[[We]] present [[Resilient Distributed Datasets (RDDs)]], a [[distributed memory abstraction]] that lets [[programmer]]s perform [[in-memory computation]]s on [[large cluster]]s in a [[fault-tolerant manner]]. </s>
-[[RDD]]s are motivated by two types of applications that current [[computing frameworks handle inefficiently]]: [[iterative algorithm]]s and [[interactive data mining tool]]s. </s>
+[[RDD]]s are motivated by two types of [[data processing application|application]]s that current [[computing framework]]s handle [[inefficiently]]: [[iterative algorithm]]s and [[interactive data mining tool]]s. </s>
-In both cases, [[keeping data]] in memory can improve performance by an [[order of magnitude]]. </s>
+In both cases, [[keeping data in memory]] can improve performance by an [[order of magnitude]]. </s>
-To achieve fault tolerance [[efficiently]], [[RDD]]s provide a restricted form of [[shared memory]], based on [[coarse-grained transformation]]s rather than [[fine-grained updates to shared state]]. </s>
+To achieve [[fault tolerance]] [[efficiently]], [[RDD]]s provide a restricted form of [[shared memory]], based on [[coarse-grained transformation]]s rather than [[fine-grained update]]s to [[shared state]]. </s>
-However, we show that [[RDD]]s are expressive enough to capture a [[wide class of computation]]s, including recent specialized [[programming model]]s for [[iterative job]]s, such as [[Pregel]], and new applications that these [[model]]s do not capture. </s>
+However, [[we]] show that [[RDD]]s are expressive enough to capture a wide [[class of computation]]s, including recent specialized [[programming model]]s for [[iterative job]]s, such as [[Pregel]], and new [[data-centric application|application]]s that these [[model]]s do not capture. </s>
-[[We]] have implemented [[RDD]]s in a system called [[Spark]], which we [[evaluate]] through a variety of [[user application]]s and [[benchmark]]s. </s>
+[[We]] have implemented [[RDD]]s in a system called [[Spark]], which we [[evaluate]] through a variety of [[user application]]s and [[data processing benchmark|benchmark]]s. </s>
-==References==
+== References ==
 {{#ifanon:|
 * 1. Apache Hive. http://hadoop.apache.org/hive.
 * 2. Scala. http://www.scala-lang.org.
-* 3. Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica, Disk-locality in Datacenter Computing Considered Irrelevant, Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, May 09-11, 2011, Napa, California
+* 3. Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, [[Ion Stoica]], Disk-locality in Datacenter Computing Considered Irrelevant, Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, May 09-11, 2011, Napa, California
 * 4. Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, Rafael Pasquin, Incoop: MapReduce for Incremental Computations, Proceedings of the 2nd ACM Symposium on Cloud Computing, p.1-14, October 26-28, 2011, Cascais, Portugal [http://doi.acm.org/10.1145/2038916.2038923 doi:10.1145/2038916.2038923]
 * 5. Rajendra Bose, James Frew, Lineage Retrieval for Scientific Data Processing: A Survey, ACM Computing Surveys (CSUR), v.37 n.1, p.1-28, March 2005 [http://doi.acm.org/10.1145/1057977.1057978 doi:10.1145/1057977.1057978]
@@ Line 31: / Line 31: @@
 * 8. Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum, FlumeJava: Easy, Efficient Data-parallel Pipelines, Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 05-10, 2010, Toronto, Ontario, Canada [http://doi.acm.org/10.1145/1806596.1806638 doi:10.1145/1806596.1806638]
 * 9. James Cheney, Laura Chiticariu, Wang-Chiew Tan, Provenance in Databases: Why, How, and Where, Foundations and Trends in Databases, v.1 n.4, p.379-474, April 2009 [http://dx.doi.org/10.1561/1900000006 doi:10.1561/1900000006]
-* 10. Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, p.10-10, December 06-08, 2004, San Francisco, CA
+* 10. [[Jeffrey Dean]], Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, p.10-10, December 06-08, 2004, San Francisco, CA
 * 11. Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, June 21-25, 2010, Chicago, Illinois [http://doi.acm.org/10.1145/1851476.1851593 doi:10.1145/1851476.1851593]
 * 12. Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, Li Zhuang, Nectar: Automatic Management of Data and Computation in Datacenters, Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, p.1-8, October 04-06, 2010, Vancouver, BC, Canada
 * 13. Zhenyu Guo, Xi Wang, Jian Tang, Xuezheng Liu, Zhilei Xu, Ming Wu, M. Frans Kaashoek, Zheng Zhang, R2: An Application-level Kernel for Record and Replay, Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, p.193-208, December 08-10, 2008, San Diego, California
 * 14. T. Hastie, R. Tibshirani, and J. Friedman. <i>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</i>. Springer Publishing Company, New York, NY, 2009.
-* 15. Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Bing Su, Wei Lin, Lidong Zhou, Comet: Batched Stream Processing for Data Intensive Distributed Computing, Proceedings of the 1st ACM Symposium on Cloud Computing, June 10-11, 2010, Indianapolis, Indiana, USA [http://doi.acm.org/10.1145/1807128.1807139 doi:10.1145/1807128.1807139]
+* 15. Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Bing Su, Wei Lin, [[Lidong Zhou]], Comet: Batched Stream Processing for Data Intensive Distributed Computing, Proceedings of the 1st ACM Symposium on Cloud Computing, June 10-11, 2010, Indianapolis, Indiana, USA [http://doi.acm.org/10.1145/1807128.1807139 doi:10.1145/1807128.1807139]
 * 16. Allan Heydon, Roy Levin, Yuan Yu, Caching Function Calls Using Precise Dependencies, ACM SIGPLAN Notices, v.35 n.5, p.311-320, May 2000 [http://doi.acm.org/10.1145/358438.349341 doi:10.1145/358438.349341]
-* 17. Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica, Mesos: A Platform for Fine-grained Resource Sharing in the Data Center, Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, March 30-April 01, 2011, Boston, MA
+* 17. Benjamin Hindman, Andy Konwinski, [[Matei Zaharia]], Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, [[Ion Stoica]], Mesos: A Platform for Fine-grained Resource Sharing in the Data Center, Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, March 30-April 01, 2011, Boston, MA
-* 18. Timothy Hunter, Teodor Moldovan, Matei Zaharia, Samy Merzgui, Justin Ma, Michael J. Franklin, Pieter Abbeel, Alexandre M. Bayen, Scaling the Mobile Millennium System in the Cloud, Proceedings of the 2nd ACM Symposium on Cloud Computing, p.1-8, October 26-28, 2011, Cascais, Portugal [http://doi.acm.org/10.1145/2038916.2038944 doi:10.1145/2038916.2038944]
+* 18. Timothy Hunter, Teodor Moldovan, [[Matei Zaharia]], Samy Merzgui, Justin Ma, [[Michael J. Franklin]], [[Pieter Abbeel]], Alexandre M. Bayen, Scaling the Mobile Millennium System in the Cloud, Proceedings of the 2nd ACM Symposium on Cloud Computing, p.1-8, October 26-28, 2011, Cascais, Portugal [http://doi.acm.org/10.1145/2038916.2038944 doi:10.1145/2038916.2038944]
 * 19. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, Dryad: Distributed Data-parallel Programs from Sequential Building Blocks, Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, March 21-23, 2007, Lisbon, Portugal [http://doi.acm.org/10.1145/1272996.1273005 doi:10.1145/1272996.1273005]
 * 20. Steven Y. Ko, Imranul Hoque, Brian Cho, Indranil Gupta, On Availability of Intermediate Data in Cloud Computations, Proceedings of the 12th Conference on Hot Topics in Operating Systems, p.6-6, May 18-20, 2009, Monte Verità, Switzerland
@@ Line 49: / Line 49: @@
 * 26. Daniel Peng, Frank Dabek, Large-scale Incremental Processing Using Distributed Transactions and Notifications, Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, p.1-15, October 04-06, 2010, Vancouver, BC, Canada
 * 27. Russell Power, Jinyang Li, Piccolo: Building Fast, Distributed Programs with Partitioned Tables, Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, p.1-14, October 04-06, 2010, Vancouver, BC, Canada
-* 28. Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, Inc., New York, NY, 2002
+* 28. Raghu Ramakrishnan, [[Johannes Gehrke]], Database Management Systems, McGraw-Hill, Inc., New York, NY, 2002
 * 29. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song, Design and Evaluation of a Real-Time URL Spam Filtering Service, Proceedings of the 2011 IEEE Symposium on Security and Privacy, p.447-462, May 22-25, 2011 [http://dx.doi.org/10.1109/Sp.2011.25 doi:10.1109/Sp.2011.25]
 * 30. John W. Young, A First Order Approximation to the Optimum Checkpoint Interval, Communications of the ACM, v.17 n.9, p.530-531, Sept. 1974 [http://doi.acm.org/10.1145/361147.361115 doi:10.1145/361147.361115]
 * 31. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey, DryadLINQ: A System for General-purpose Distributed Data-parallel Computing Using a High-level Language, Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, p.1-14, December 08-10, 2008, San Diego, California
-* 32. Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, Ion Stoica, Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Proceedings of the 5th European Conference on Computer Systems, April 13-16, 2010, Paris, France [http://doi.acm.org/10.1145/1755913.1755940 doi:10.1145/1755913.1755940]
+* 32. [[Matei Zaharia]], Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, [[Ion Stoica]], Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Proceedings of the 5th European Conference on Computer Systems, April 13-16, 2010, Paris, France [http://doi.acm.org/10.1145/1755913.1755940 doi:10.1145/1755913.1755940]
 * 33. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for in-memory Cluster Computing. Technical Report UCB/EECS-2011-82, EECS Department, UC Berkeley, 2011.

2012 ResilientDistributedDatasetsAFa: Difference between revisions

Latest revision as of 01:50, 5 January 2021

Notes

Cited By

Quotes

Abstract

References

Navigation menu

Search