Spark Cluster Manager


A Spark Cluster Manager is a cluster manager for a Spark cluster.



References

2016

  • http://www.agildata.com/apache-spark-cluster-managers-yarn-mesos-or-standalone/
    • QUOTE: Trying to decide which Apache Spark cluster managers are the right fit for your specific use case when deploying a Hadoop Spark Cluster on EC2 can be challenging. This post breaks down the general features of each solution and details the scheduling, HA (High Availability), security and monitoring for each option you have.
      Apache Spark, an engine for large data processing, can be run in distributed mode on a cluster. Spark applications are run as independent sets of processes on a cluster, all coordinated by a central coordinator. This central coordinator can connect with three different cluster managers, Spark’s Standalone, Apache Mesos, and Hadoop YARN (Yet Another Resource Negotiator).
      When running an application in distributed mode on a cluster, Spark uses a master/slave architecture and the central coordinator, also called the driver program, is the main process in your application, running the code that creates a SparkContext object. This driver process is responsible for converting a user application into smaller execution units called tasks. These tasks are then executed by executors which are worker processes that run the individual tasks.
      In a cluster, there is a master and any number of workers. The driver program, which can run in an independent process, or in a worker of the cluster, requests executors from the cluster manager. It then schedules the tasks composing the application on the executors obtained from the cluster manager. The cluster manager is responsible for the scheduling and allocation of resources across the host machines forming the cluster.
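    • A minimal sketch of the driver program described above, in Scala (not from the source; the application name, master URL, and input path are placeholders). It creates the SparkContext, defines the work, and relies on executors obtained from the cluster manager to run the resulting tasks:

        import org.apache.spark.{SparkConf, SparkContext}

        object WordCountDriver {
          def main(args: Array[String]): Unit = {
            // The driver program: creating the SparkContext connects to the cluster manager.
            val conf = new SparkConf()
              .setAppName("word-count-sketch")
              .setMaster("spark://master-host:7077") // placeholder standalone master URL
            val sc = new SparkContext(conf)

            // The driver converts this logical program into tasks; the executors
            // granted by the cluster manager run those tasks.
            val counts = sc.textFile("hdfs:///tmp/input.txt") // placeholder input path
              .flatMap(_.split("\\s+"))
              .map(word => (word, 1))
              .reduceByKey(_ + _)

            counts.take(10).foreach(println)

            // Stopping the SparkContext releases the executors back to the cluster manager.
            sc.stop()
          }
        }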

2015

  • http://spark.apache.org/docs/latest/cluster-overview.html
    • QUOTE: Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

      Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

      There are several useful things to note about this architecture:

      • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
      • Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
      • The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port and spark.fileserver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
      • Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
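    • A small sketch of how the master URL chooses among the three cluster managers named in the quote (illustrative Scala, not from the source; host names, ports, and the pinned driver port are placeholder values):

        import org.apache.spark.{SparkConf, SparkContext}

        object ClusterManagerChoice {
          def main(args: Array[String]): Unit = {
            // The master URL is what selects the cluster manager; the application
            // code itself does not change. Placeholder URLs:
            //   Spark standalone: "spark://master-host:7077"
            //   Apache Mesos:     "mesos://mesos-master:5050"
            //   Hadoop YARN:      "yarn" (cluster details come from the Hadoop config)
            val conf = new SparkConf()
              .setAppName("cluster-manager-demo")
              .setMaster("spark://master-host:7077")
              // The driver must be network addressable from the worker nodes;
              // pinning spark.driver.port (mentioned in the quote above) is one way
              // to make that easier when ports are restricted. 35000 is arbitrary.
              .set("spark.driver.port", "35000")

            val sc = new SparkContext(conf)
            // At this point Spark has acquired executors on nodes in the cluster.
            println(s"Connected with master: ${sc.master}")
            sc.stop()
          }
        }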



  • (StackOverflow, 2015): "What are workers, executors, cores in Spark Standalone cluster?" In: StackOverflow, answered Sep 17 '15 by Marco (edited Mar 26 at 15:58 by Jacek Laskowski).
    • The driver program asks the cluster manager for resources to launch executors.
    • The cluster manager launches executors.
    • Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and take its result if that finishes.
    • When SparkContext.stop() is called from the driver, or if the main method exits or crashes, all the executors are terminated and the cluster resources are released by the cluster manager.
    • When executors are started, they register themselves with the driver, and from then on they communicate with it directly. The workers are in charge of communicating the availability of their resources to the cluster manager.
    • In a YARN cluster you can control the number of executors with --num-executors. In a standalone cluster you will get one executor per worker unless you set spark.executor.cores and a worker has enough cores to hold more than one executor. (As @JacekLaskowski pointed out, --num-executors is no longer in use in YARN: https://github.com/apache/spark/commit/16b6d18613e150c7038c613992d80a7828413e66)
    • You can assign the number of cores per executor with --executor-cores.
    • --total-executor-cores is the max number of executor cores per application (see the configuration sketch below).
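    • A configuration sketch relating the spark-submit flags above to their SparkConf property counterparts (illustrative Scala, not from the source; the specific values are arbitrary examples):

        import org.apache.spark.{SparkConf, SparkContext}

        object ExecutorSizingSketch {
          def main(args: Array[String]): Unit = {
            // Configuration-property counterparts of the spark-submit flags above.
            // The numbers are illustrative, not recommendations.
            val conf = new SparkConf()
              .setAppName("executor-sizing-sketch")
              .set("spark.executor.instances", "4") // what --num-executors controls on YARN
              .set("spark.executor.cores", "2")     // --executor-cores: cores per executor
              .set("spark.cores.max", "8")          // --total-executor-cores: cap per application
                                                    //   (standalone / Mesos coarse-grained mode)
              .set("spark.speculation", "true")     // speculative re-execution of slow tasks,
                                                    //   as described in the answer above

            val sc = new SparkContext(conf)
            // ... application work would go here ...

            // Stopping the context terminates the executors and lets the cluster
            // manager release the resources, as noted above.
            sc.stop()
          }
        }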