Apache Spark Data Processing Framework

From GM-RKB

An Apache Spark Data Processing Framework is an open-source cluster-based data processing platform managed by an Apache Spark Project.



References

https://databricks.com/blog/2018/11/08/introducing-apache-spark-2-4.html

2019

2018

2017a

2017b

2016

  • https://en.wikipedia.org/wiki/Apache_Hadoop#Architecture
    • Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2)[3] and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed to start Hadoop.

      For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to execute code on the node where the data is, and, failing that, on the same rack/switch to reduce backbone traffic. HDFS uses this method when replicating data for data redundancy across multiple racks. This approach reduces the impact of a rack power outage or switch failure; if one of these hardware failures occurs, the data will remain available.[4]

      A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. These are normally used only in nonstandard applications.[5]

      Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and shutdown scripts require that Secure Shell (ssh) be set up between nodes in the cluster.[6]

      In a larger cluster, HDFS nodes are managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thereby preventing file-system corruption and loss of data. Similarly, a standalone JobTracker server can manage job scheduling across nodes. When Hadoop MapReduce is used with an alternate file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS are replaced by the file-system-specific equivalents.
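
The location-awareness idea in the excerpt above (run code on the node that holds the data and, failing that, on the same rack) can be illustrated with a small sketch. This is not Hadoop's scheduler code; the function name `pick_node`, the node and rack names, and the `rack_of` mapping are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def pick_node(
    replica_nodes: Iterable[str],
    free_nodes: Iterable[str],
    rack_of: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Choose a free node for a task, preferring data locality.

    replica_nodes: nodes holding a replica of the task's input block
    free_nodes:    nodes that currently have spare capacity
    rack_of:       hypothetical node -> rack (network switch) mapping
    """
    free = list(free_nodes)
    replicas = set(replica_nodes)

    # 1. Node-local: a free node that already stores the data.
    for node in free:
        if node in replicas:
            return node, "node-local"

    # 2. Rack-local: a free node on the same rack as some replica,
    #    so the transfer stays off the backbone.
    replica_racks = {rack_of[n] for n in replicas if n in rack_of}
    for node in free:
        if rack_of.get(node) in replica_racks:
            return node, "rack-local"

    # 3. Otherwise take any free node; the data crosses racks.
    if free:
        return free[0], "off-rack"
    return None, "none"


# Hypothetical four-node, two-rack topology.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

print(pick_node(["n1", "n3"], ["n3", "n2"], rack_of))  # ('n3', 'node-local')
print(pick_node(["n1"], ["n2", "n4"], rack_of))        # ('n2', 'rack-local')
print(pick_node(["n1"], ["n4"], rack_of))              # ('n4', 'off-rack')
```

The three-step fallback mirrors the order of preference described in the excerpt; a real scheduler would also weigh load, queue priorities, and how long a task has been waiting.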

2014a

2014b

2014c

  • http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
    • QUOTE: Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone and Mesos:
      1. YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on Spark application, without any changes in configuration.
      2. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
      3. Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
      4. Finally, YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
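
As a rough illustration of points 3 and 4 above, the sketch below assembles a `spark-submit` command line that targets YARN with an explicit executor count and, optionally, Kerberos credentials. The flags used (`--master`, `--deploy-mode`, `--num-executors`, `--executor-memory`, `--executor-cores`, `--principal`, `--keytab`) are standard `spark-submit` options; the helper name `build_spark_submit`, the default values, and the application path, principal, and keytab shown are assumptions made for this example.

```python
import shlex
from typing import List, Optional


def build_spark_submit(
    app: str,
    num_executors: int,
    executor_memory: str = "2g",
    executor_cores: int = 1,
    deploy_mode: str = "cluster",
    principal: Optional[str] = None,
    keytab: Optional[str] = None,
) -> List[str]:
    """Build an argv list for submitting a Spark application to YARN."""
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")

    cmd = [
        "spark-submit",
        "--master", "yarn",              # use YARN as the cluster manager
        "--deploy-mode", deploy_mode,    # "cluster" or "client"
        "--num-executors", str(num_executors),  # chosen by the user
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
    ]

    # On a Kerberized cluster, pass credentials only when both are given.
    if principal and keytab:
        cmd += ["--principal", principal, "--keytab", keytab]

    cmd.append(app)
    return cmd


# Hypothetical application path, principal, and keytab location.
argv = build_spark_submit(
    "my_job.py",
    num_executors=4,
    principal="analyst@EXAMPLE.COM",
    keytab="/etc/security/analyst.keytab",
)
print(" ".join(shlex.quote(a) for a in argv))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting problems; the actual submission is left out so the sketch runs without a Spark installation.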

2012