PySpark API
A PySpark API is a Spark API for Python code.
- Context:
- It can be referenced by a PySpark Code Segment.
- It can define PySpark Classes (see the usage sketch under Example(s) below), such as:
  - pyspark.SparkContext (the entry point to a Spark Context).
  - pyspark.SparkConf (for a Spark Configuration).
  - pyspark.RDD (for a Spark RDD).
  - the pyspark.sql module (for Spark SQL).
  - the pyspark.streaming module (for Spark Streaming).
  - pyspark.ml (for Spark ML).
  - pyspark.mllib (for Spark MLlib).
- Example(s):
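  - A minimal usage sketch (illustrative only; the app name, inline sample data, and local master are assumptions, not taken from the API docs) touching the classes listed under Context:

      # Build a configuration, start a context, and run a small RDD computation.
      from pyspark import SparkConf, SparkContext

      conf = SparkConf().setAppName("pyspark-api-example").setMaster("local[2]")
      sc = SparkContext(conf=conf)           # pyspark.SparkContext: the entry point
      rdd = sc.parallelize([1, 2, 3, 4, 5])  # pyspark.RDD: a resilient distributed dataset
      print(rdd.map(lambda x: x * x).sum())  # 55
      sc.stop()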
- Counter-Example(s):
- See: Spark SQL, Spark Library.
References
2017
- http://datanami.com/2017/05/18/committers-talk-hadoop-3-apache-big-data/
- QUOTE: While Spark gives the customer all kinds of great capabilities, the Python implementation lacks the code portability that exists when working with Spark through Java or Scala. …
… By deploying PySpark in a Docker container under YARN, a developer can get the exact PySpark environment they want without requiring administrators to get involved with detailed configurations. It all gets bundled up in a Docker container, and YARN runs it like any other Hadoop job on the cluster.
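A hedged sketch of the configuration idea described above (not from the article): the YARN_CONTAINER_RUNTIME_* environment variables come from Hadoop YARN's Docker container runtime, and the image name my-pyspark:latest is a placeholder.

    # Ask YARN to run the application master and the executors inside a Docker
    # image that already bundles the desired PySpark environment.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("dockerized-pyspark")
            .setMaster("yarn")
            .set("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
            .set("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "my-pyspark:latest")
            .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
            .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "my-pyspark:latest"))
    sc = SparkContext(conf=conf)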
2016
- https://spark.apache.org/docs/0.9.0/python-programming-guide.html
- QUOTE: The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python.
There are a few key differences between the Python and Scala APIs:
- Python is dynamically typed, so RDDs can hold objects of multiple types.
- PySpark does not yet support a few API calls, such as lookup and non-text input files, though these will be added in future releases.
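A small sketch (not from the guide) of the dynamic-typing point: a single RDD can hold values of several Python types.

    from pyspark import SparkContext

    sc = SparkContext("local", "mixed-types")
    rdd = sc.parallelize([1, "two", 3.0, ("four", 4)])     # int, str, float, tuple in one RDD
    print(rdd.map(lambda x: type(x).__name__).collect())   # ['int', 'str', 'float', 'tuple']
    sc.stop()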
2016b
- http://spark.apache.org/docs/latest/api/python/
- QUOTE: Core classes:
  - pyspark.SparkContext: Main entry point for Spark functionality.
  - pyspark.RDD: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
  - pyspark.streaming.StreamingContext: Main entry point for Spark Streaming functionality.
  - pyspark.streaming.DStream: A Discretized Stream (DStream), the basic abstraction in Spark Streaming.
  - pyspark.sql.SQLContext: Main entry point for DataFrame and SQL functionality.
  - pyspark.sql.DataFrame: A distributed collection of data grouped into named columns.
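A hedged sketch (assuming the Spark 1.x-era API documented above) tying a few of these core classes together:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local", "core-classes")
    sqlContext = SQLContext(sc)                      # pyspark.sql.SQLContext
    rdd = sc.parallelize([Row(name="ada", age=36),   # pyspark.RDD of Row objects
                          Row(name="alan", age=41)])
    df = sqlContext.createDataFrame(rdd)             # pyspark.sql.DataFrame: named columns
    df.filter(df.age > 40).show()
    sc.stop()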
2016c
- http://stackoverflow.com/a/37084862
- QUOTE: As of Spark 1.0, you should launch pyspark applications using spark-submit. While pyspark will launch the interactive shell, spark-submit allows you to easily launch a spark job on various cluster managers.
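A hypothetical minimal script (wordcount.py is an assumed file name) illustrating the distinction: the interactive shell is started with pyspark and provides sc, whereas a batch job like this one is launched with, e.g., spark-submit --master local[2] wordcount.py.

    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="wordcount")
        counts = (sc.parallelize(["a b", "b c", "a"])
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
        print(counts.collect())
        sc.stop()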
2016d
- https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
- QUOTE: Apache Spark is written in the Scala programming language, which compiles the program code into bytecode for the JVM for Spark big data processing. The open source community has developed a wonderful utility for Spark Python big data processing known as PySpark. PySpark helps data scientists interface with Resilient Distributed Datasets in Apache Spark and Python. Py4J is a popular library integrated within PySpark that lets Python interface dynamically with JVM objects (RDDs). …
… The PySpark shell is responsible for linking the Python API to the Spark core and initializing the Spark context. …
...
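A hedged illustration of the Py4J bridge described above (the underscore attributes are PySpark implementation details, shown only to make the Python-to-JVM linkage visible):

    from pyspark import SparkContext

    sc = SparkContext("local", "py4j-bridge")
    print(type(sc._jsc))    # a py4j JavaObject wrapping the JVM-side JavaSparkContext
    print(sc._jvm.java.lang.System.currentTimeMillis())  # a direct JVM call via Py4J
    sc.stop()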