Apache Spark Data Processing Framework
An Apache Spark Data Processing Framework is an open-source, cluster-based data processing platform managed by the Apache Spark Project.
- AKA: Spark Core.
- Context:
- It can (typically) be instantiated into an Apache Spark Cluster Instance (that runs Spark jobs on Spark worker nodes).
- It can (typically) be used to implement a Spark-based System (composed of Spark jobs).
- It can (typically) support Spark Modules including: Spark SQL, Spark MLlib, Spark GraphX, Spark Streaming, ...
- It can (typically) provide Development APIs, such as: Spark APIs (e.g. Spark Scala API, Spark Java API, and Spark Python API); see the sketch after this list.
- It can (typically) include a Spark CLI to run Spark commands (such as spark-submit).
- It can (typically) have an Apache Spark Architecture.
- It can (often) be used as a Cluster-based Data Processing Framework (such as a data analytics cluster computing framework).
- It can (often) be used as a Cluster-based Computing Framework.
- It can support Spark Standalone Mode and Spark Cluster Mode.
- It can use Apache YARN as its cluster manager.
- It can support In-Memory Computing Primitives (to provide in-memory data processing).
- It can allow Data Workers to efficiently execute data processing workloads that require fast, iterative access to datasets.
- It can provide an optimized engine that supports general computation graphs.
- It can provide a Spark-Shell (an extended Scala shell).
- It can operate with an Apache HDFS System.
- It can be installed using an Apache Spark Installation Process.
- …
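As a concrete illustration of the Development API and Spark CLI bullets above, the following minimal Scala sketch uses the public Spark Scala API (SparkSession, textFile, reduceByKey) to count words; the application name and the HDFS input path are placeholders, not part of any particular deployment.

```scala
// Minimal Spark Scala API sketch: word count over a text file.
// The app name and input path are illustrative placeholders.
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .getOrCreate()

    val lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))   // tokenize each line on whitespace
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word across the cluster

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

Such an application is typically packaged as a JAR and launched through the Spark CLI with spark-submit (for example, spark-submit --class WordCountSketch app.jar), with the master and resource settings chosen per cluster.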
- Example(s):
- Apache Spark v0.6.0 (~2012-10-15);
- Apache Spark v1.0.1 (~2014-07-11);
- Apache Spark v1.5.0 (~2015-09-09) [1];
- Apache Spark v1.6.0 (~2016-01-04) [2];
- Apache Spark v2.0.0 (~2016-07-26) [3];
- Apache Spark v2.1.1 (~2017-05-02) [4];
- Apache Spark v2.2.0 (~2017-07-11) [5];
- Apache Spark v2.3.0 (~2018-03-06) [6];
- Apache Spark v2.4.0 (~2018-11-08) [7];
- Apache Spark v2.4.4 (~2019-09-01) [8];
- Apache Spark v3.0.0 (~2020-06-18) [9];
- …
- Counter-Example(s):
- an Apache Hadoop MapReduce Framework;
- a TensorFlow Framework.
- See: AWS EMR, RDD Data Structure, Spark Shark, Spark RDD DS, Fault-Tolerant Computing.
References
- https://databricks.com/blog/2018/11/08/introducing-apache-spark-2-4.html
2019
- https://www.whizlabs.com/blog/apache-spark-alternatives/
- QUOTE: Some of the key features of Apache Spark are
- It is an open-source analytics platform for large-scale processing of huge datasets.
- It works as a high-speed engine with high performance in batch as well as streaming data.
- It has resilient distributed datasets (RDDs), and the in-memory data structure allows Spark to perform functional programming.
- It uses a DAG scheduler along with physical execution engine and the query optimizer.
- It is based on the model of micro-batch with high latency.
- Spark recovers the lost work and avoids duplication of work by processing each record only once.
- Big batch calculations can be carried out by pinning memory as Spark streaming collects data streams in the form of mini batches and runs the batch program.
- Spark is also very easy to use and implement by writing applications in Java, Python, R, Scala, SQL, etc.
- It contains a stack of libraries Spark SQL, MLlib (for machine learning), Spark Streaming, and GraphX.
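The micro-batch point in the quote above can be sketched with the Spark Streaming DStream API, which turns a continuous source into small batches and applies ordinary batch operators to each one; the socket host, port, and 5-second interval below are arbitrary illustrative choices.

```scala
// Spark Streaming (DStream) sketch of the micro-batch model.
// The socket source and the 5-second batch interval are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch")
    val ssc = new StreamingContext(conf, Seconds(5))  // micro-batch interval

    // Each 5-second micro-batch is processed with batch-style operators.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```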
2018
- https://confluence.sie.sony.com/display/BTRDT/DS+Life+Cycle+in+AWS
- QUOTE: At first glance, Spark and TensorFlow share some similarities. Both frameworks can perform distributed operations on large datasets. They take a set of input operations, compile these operations to a DAG, ship the DAG to a pool of executors and execute the DAG on a subset of the data. It seems like a natural extension to try and integrate the two!
But, as your question implies, there are enough differences between these frameworks that integrating the two is counterintuitive. Spark utilizes the RDD primitive for distributing any (serializable) map/reduce-like operation; its framework is heavily optimized for caching distributed datasets and minimizing communication between executors (i.e., minimize shuffles). The ideal Spark paradigm is to have every executor independently executing the same DAG on a subset of your dataset. MLLib also utilizes the RDD structure, providing methods for training large ML models like matrix factorizations over big RDD matrices.
2017a
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Apache_Spark Retrieved:2017-2-2.
- Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
2017b
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Apache_Spark#Overview Retrieved:2017-2-2.
- Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.
It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to Apache Hadoop, a popular MapReduce implementation) may be reduced by several orders of magnitude. Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS), [1] MapR File System (MapR-FS), [2] Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.
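As a hedged sketch of the working-set idea described above, the snippet below caches a synthetic RDD in memory and re-scans it over several iterations, so that only the first action pays the cost of materializing the data; the dataset and the per-iteration computation are invented purely for illustration.

```scala
// Sketch of iterative reuse of a cached RDD (the "working set" pattern).
// The synthetic data and the value computed each iteration are placeholders.
import org.apache.spark.sql.SparkSession

object IterativeRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IterativeRddSketch").getOrCreate()
    val sc = spark.sparkContext

    // Materialize the RDD once and keep its partitions in executor memory.
    val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // Each iteration re-scans the cached RDD instead of re-reading the input.
      estimate = points.map(_ * 0.5).sum() / points.count()
    }
    println(s"final estimate: $estimate")
    spark.stop()
  }
}
```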
2016
- https://en.wikipedia.org/wiki/Apache_Hadoop#Architecture
- Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2)[3] and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed to start Hadoop.
For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to execute code on the node where the data is, and, failing that, on the same rack/switch to reduce backbone traffic. HDFS uses this method when replicating data for data redundancy across multiple racks. This approach reduces the impact of a rack power outage or switch failure; if one of these hardware failures occurs, the data will remain available.[4]
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. These are normally used only in nonstandard applications.[5]
Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and shutdown scripts require that Secure Shell (ssh) be set up between nodes in the cluster.[6]
In a larger cluster, HDFS nodes are managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thereby preventing file-system corruption and loss of data. Similarly, a standalone JobTracker server can manage job scheduling across nodes. When Hadoop MapReduce is used with an alternate file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS are replaced by the file-system-specific equivalents.
- ↑ Figure showing Spark in relation to other open-source Software projects including Hadoop
- ↑ MapR ecosystem support matrix
- ↑ Chouraria, Harsh (21 October 2012). "MR2 and YARN Briefly Explained". Cloudera. http://blog.cloudera.com/blog/2012/10/mr2-and-yarn-briefly-explained/. Retrieved 23 October 2013.
- ↑ "HDFS User Guide". Hadoop.apache.org. http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html. Retrieved 2014-09-04.
- ↑ "Running Hadoop on Ubuntu Linux System(Multi-Node Cluster)". http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/.
- ↑ "Running Hadoop on Ubuntu Linux (Single-Node Cluster)". http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#prerequisites. Retrieved 6 June 2013.
2014a
- http://spark.apache.org/docs/latest/
- Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
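The higher-level tools listed in this quote (Shark has since been subsumed by Spark SQL) sit on top of the same core engine; the hedged sketch below uses the later DataFrame/SQL API, with a placeholder JSON path and column names.

```scala
// Spark SQL sketch: a declarative query over a DataFrame.
// The input path and the `name` column are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()

    val people = spark.read.json("hdfs:///data/people.json")
    people.createOrReplaceTempView("people")

    // Spark's optimizer plans the distributed execution of this query.
    spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name ORDER BY n DESC")
      .show(10)

    spark.stop()
  }
}
```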
2014b
- http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
- Spark provides fast iterative/functional-like capabilities over large data sets, typically by caching data in memory. As opposed to the rest of the libraries mentioned in this documentation, Apache Spark is a computing framework that is not tied to Map/Reduce itself; however, it does integrate with Hadoop, mainly to HDFS.
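The in-memory caching described here is easiest to see interactively in the spark-shell, where a SparkContext is pre-bound as sc; the HDFS glob below is a placeholder path.

```scala
// spark-shell session sketch (the shell provides `sc`); the HDFS path is a placeholder.
val logs = sc.textFile("hdfs:///logs/2014/*.log").cache()

logs.filter(_.contains("ERROR")).count()  // first action reads from HDFS and fills the cache
logs.filter(_.contains("WARN")).count()   // later actions reuse the in-memory partitions
```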
2014c
- http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
- QUOTE: Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone and Mesos:
- YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
- You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
- Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
- Finally, YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
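A minimal sketch of the point above about choosing executor counts under YARN: the configuration keys below (spark.executor.instances, spark.executor.memory) are standard Spark properties, but the values are arbitrary, and in practice they are usually passed as spark-submit flags (e.g. --master yarn --num-executors 4 --executor-memory 2g) rather than set in code.

```scala
// Sketch: running Spark on YARN with an explicit executor count.
// The resource values are arbitrary illustrative choices.
import org.apache.spark.sql.SparkSession

object YarnConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("YarnConfigSketch")
      .master("yarn")                            // use YARN as the cluster manager
      .config("spark.executor.instances", "4")   // choose how many executors to run
      .config("spark.executor.memory", "2g")
      .getOrCreate()

    println(s"running on: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```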
2012
- http://spark.apache.org/docs/1.1.1/running-on-yarn.html
- QUOTE: Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.