PySpark API
A PySpark API is a Spark API that exposes the Spark programming model to Python code.
- Context:
- It can be referenced by a PySpark Code Segment.
- It can define PySpark Classes (illustrated in the sketch below), such as:
  - pyspark.SparkContext (the main entry point for Spark functionality).
  - pyspark.SparkConf (for Spark configuration).
  - pyspark.RDD (for Spark RDDs).
  - the pyspark.sql module (for Spark SQL).
  - the pyspark.streaming module (for Spark Streaming).
  - pyspark.ml (for Spark ML).
  - pyspark.mllib (for Spark MLlib).
- Example(s):
- Counter-Example(s):
- See: Spark SQL, Spark Library.
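A minimal sketch of how several of these classes fit together in one script (the application name, the local[2] master URL, and the sample data are illustrative assumptions; the older SQLContext entry point is used to match the documentation quoted below):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# pyspark.SparkConf: application configuration (name, master URL, tuning options).
conf = SparkConf().setAppName("pyspark-api-sketch").setMaster("local[2]")

# pyspark.SparkContext: the main entry point for Spark functionality.
sc = SparkContext(conf=conf)

# pyspark.RDD: a Resilient Distributed Dataset, built here from a local collection.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

# pyspark.sql: DataFrame and SQL functionality on top of the same context.
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

sc.stop()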
References
2017
- http://datanami.com/2017/05/18/committers-talk-hadoop-3-apache-big-data/
- QUOTE: While Spark gives the customer all kinds of great capabilities, the Python implementation lacks the code portability that exists when working with Spark through Java or Scala. …
… By deploying PySpark in a Docker container under YARN, a developer can get the exact PySpark environment they want without requiring administrators to get involved with detailed configurations. It all gets bundled up in a Docker container, and YARN runs it like any other Hadoop job on the cluster.
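As a hedged sketch only, the snippet below shows one way such a Dockerized PySpark job might be configured programmatically. The Docker image name is invented, and the YARN_CONTAINER_RUNTIME_* environment variables are an assumption based on Hadoop YARN's Docker container runtime; the exact property names depend on how the cluster is configured:

from pyspark import SparkConf, SparkContext

# Assumption: the YARN NodeManagers are configured for the Docker container runtime,
# and "example/pyspark-env:latest" is a purely illustrative image name.
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("dockerized-pyspark-sketch")
        .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
        .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "example/pyspark-env:latest"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).count())  # 10
sc.stop()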
2016
- https://spark.apache.org/docs/0.9.0/python-programming-guide.html
- QUOTE: The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python.
There are a few key differences between the Python and Scala APIs:
- Python is dynamically typed, so RDDs can hold objects of multiple types.
- PySpark does not yet support a few API calls, such as lookup and non-text input files, though these will be added in future releases.
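As an illustration of the first difference, a hedged sketch assuming a running SparkContext named sc (such as the one created in the example above):

# Python's dynamic typing means a single RDD can hold values of different types.
mixed = sc.parallelize([1, "two", 3.0, {"four": 4}])
print(mixed.map(lambda x: type(x).__name__).collect())
# ['int', 'str', 'float', 'dict']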
2016b
- http://spark.apache.org/docs/latest/api/python/
- QUOTE: Core classes:
  - pyspark.SparkContext: Main entry point for Spark functionality.
  - pyspark.RDD: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
  - pyspark.streaming.StreamingContext: Main entry point for Spark Streaming functionality.
  - pyspark.streaming.DStream: A Discretized Stream (DStream), the basic abstraction in Spark Streaming.
  - pyspark.sql.SQLContext: Main entry point for DataFrame and SQL functionality.
  - pyspark.sql.DataFrame: A distributed collection of data grouped into named columns.
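A minimal sketch tying the streaming classes together (localhost:9999 is an assumed test socket, e.g. one fed by nc -lk 9999, and the 1-second batch interval is illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-sketch")

# pyspark.streaming.StreamingContext: main entry point, here with 1-second micro-batches.
ssc = StreamingContext(sc, 1)

# pyspark.streaming.DStream: a discretized stream of RDDs, read here from a TCP socket.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()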
2016c
- http://stackoverflow.com/a/37084862
- QUOTE: As of Spark 1.0, you should launch pyspark applications using spark-submit. While pyspark will launch the interactive shell, spark-submit allows you to easily launch a spark job on various cluster managers.
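For example, a self-contained script (here a hypothetical app.py) could be launched with spark-submit as shown in the comments; the master URLs and resource layout are illustrative:

# Hypothetical app.py, launched with e.g.:
#   spark-submit --master local[4] app.py
#   spark-submit --master yarn --deploy-mode cluster app.py
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="submit-sketch")
    print(sc.parallelize(range(100)).sum())  # 4950
    sc.stop()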
2016d
- https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
- QUOTE: Apache Spark is written in the Scala programming language, which compiles the program code into byte code for the JVM for spark big data processing. The open source community has developed a wonderful utility for spark python big data processing known as PySpark. PySpark helps data scientists interface with Resilient Distributed Datasets in apache spark and python. Py4J is a popular library integrated within PySpark that lets python interface dynamically with JVM objects (RDD's). …
… The PySpark shell is responsible for linking the python API to the spark core and initializing the spark context. …
...
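A simplified sketch of roughly what the PySpark shell does at startup, namely creating a SparkContext (conventionally named sc) whose operations are forwarded to JVM-side Spark objects through Py4J; the master URL and the peek at the internal _jsc attribute are illustrative assumptions:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PySparkShell").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The context holds a Py4J proxy to the JVM-side JavaSparkContext (an internal detail).
print(type(sc._jsc))  # e.g. <class 'py4j.java_gateway.JavaObject'>

sc.stop()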