Data Redistribution Across Partitions Operation
A Data Redistribution Across Partitions Operation is a distributed data structure operation that ...
- Context:
- …
- Example(s):
- …
- Counter-Example(s):
- See: Read-Only Distributed Data Structure, Data Sharding.
References
2018
- https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
- QUOTE: Depending on how you look at Spark (programmer, devop, admin), an RDD is about the content (developer’s and data scientist’s perspective) or how it gets spread out over a cluster (performance), i.e. how many partitions an RDD represents.
A partition (aka split) is a logical chunk of a large distributed data set.
Spark manages data using partitions, which help parallelize distributed data processing with minimal network traffic for sending data between executors.
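As an illustration of what redistributing data across partitions involves, here is a minimal pure-Python sketch that reassigns records from an uneven set of partitions to a new number of partitions by hashing a key. It is not Spark's implementation (in Spark, `RDD.repartition` and `RDD.coalesce` change the partition count); the names `stable_hash`, `redistribute`, and `key_fn` are hypothetical and chosen only for this example.

```python
import zlib
from typing import Any, Callable, List


def stable_hash(key: Any) -> int:
    # Python's built-in hash() is randomized per process for strings,
    # so use CRC32 over the key's repr for a repeatable assignment.
    return zlib.crc32(repr(key).encode("utf-8"))


def redistribute(
    partitions: List[List[Any]],
    num_partitions: int,
    key_fn: Callable[[Any], Any] = lambda record: record,
) -> List[List[Any]]:
    """Reassign every record to one of num_partitions new partitions.

    Records with the same key always land in the same new partition.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    new_partitions: List[List[Any]] = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for record in partition:
            target = stable_hash(key_fn(record)) % num_partitions
            new_partitions[target].append(record)
    return new_partitions


# Skewed input: most records sit in the first partition.
old_partitions = [[1, 2, 3, 4, 5, 6], [7], []]
new_partitions = redistribute(old_partitions, 4)

# Key-based redistribution: records sharing a key are co-located.
pairs = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 4), ("c", 5)]]
by_key = redistribute(pairs, 2, key_fn=lambda record: record[0])
```

In a real cluster each reassigned record may have to travel over the network to the executor that owns its target partition, so redistribution trades network traffic for better balance or key co-location. Hash assignment also does not guarantee equally sized partitions.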