SparkSQL Module
A SparkSQL Module is a Spark module that is a SQL Module.
- Context:
- It can be accessed through a pyspark.sql Module, ...
- It can build on the experience of a Shark System (which it superseded).
- It can support SparkSQL User Defined Functions.
- It can (typically) make use of a SparkSQL Query Optimizer.
- Example(s):
- Counter-Example(s):
- See: Spark API, Spark Shark, SchemaRDD, Resilient Distributed Dataset (RDD).
References
2017
- https://github.com/apache/spark/tree/branch-2.2/sql
- QUOTE: This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
Spark SQL is broken up into four subprojects:
- Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
2017
- https://cdn2.hubspot.net/hubfs/488249/assets/atscale-data-sheet.pdf
- QUOTE: ... AtScale works out-of-the-box with the leading SQL-on-Hadoop engines, such as Impala, SparkSQL, or Hive, and allows them to function as an analytics engine.
2016
- http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
- QUOTE: One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section. When running SQL from within another programming language the results will be returned as a DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.
2015
- http://spark.apache.org/sql/
- QUOTE: Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms.
2015b
- (Armbrust et al., 2015) ⇒ Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. (2015). “Spark SQL: Relational Data Processing in Spark.” In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ISBN:978-1-4503-2758-9 doi:10.1145/2723372.2742797
- QUOTE: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark,
2013
- (Xin et al., 2013) ⇒ Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. (2013). “Shark: SQL and Rich Analytics at Scale.” In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ISBN:978-1-4503-2037-5 doi:10.1145/2463676.2465288