Apache Hive Distributed RDBMS Platform
An Apache Hive Distributed RDBMS Platform is an RDBMS platform released by the Apache Hive Project.
- Context:
- It can be instantiated in an Apache Hive Instance.
- It can (typically) contain a SQL Parser (that supports SQL-like HiveQL Statements).
- It can (typically) support Hive User Defined Functions.
- It can be designed to use a Hadoop Cluster.
- It can be designed for Ad Hoc Analytical Data Queries of Very Large Databases.
- It has a Storage System that is based on the Hadoop Append-only Filesystem.
- It can be accessed by: Hive CLI, Hue Beeswax.
- It can be downloaded from http://www.apache.org/dyn/closer.cgi/hive/.
- Example(s):
- Hive Release 3.1.1 (2018-11-01)
- Hive Release 2.1.0 (2016-06-20)
- Hive Release 1.0.0 (2015-02-04)
- Hive Release 0.13.1 (2014-06-06)
- Hive Release 0.10.0 (2013-01-11).
- …
- Counter-Example(s):
- See: Cloudera Manager, Hive Metastore.
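The Context bullets above (SQL-like HiveQL statements, ad hoc analytical queries over a Hadoop cluster, append-only storage) can be illustrated with a minimal HiveQL sketch; the table name, schema, and HDFS path are hypothetical:

```sql
-- Define a table over tab-delimited files in HDFS (hypothetical schema).
CREATE TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load an existing HDFS file into the table (append-only storage).
LOAD DATA INPATH '/user/hive/staging/page_views.tsv' INTO TABLE page_views;

-- Ad hoc analytical query; Hive compiles it into MapReduce jobs.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```

Statements like these can be issued interactively from the Hive CLI or Hue Beeswax.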
References
2014
- (Huai et al., 2014) ⇒ Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. (2014). “Major Technical Advancements in Apache Hive.” In: Proceedings of 2014 ACM SIGMOD Conference on Management of Data (SIGMOD 2014).
- QUOTE: Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Closely working with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. … First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and run time performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. …
- http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/hive.html
- Hive abstracts Hadoop through a SQL-like language, called HiveQL, so that users can apply data definition and manipulation operations to it, just as with SQL. In Hive, data sets are defined through tables (that expose type information) in which data can be loaded, selected, and transformed through built-in operators or custom/user-defined functions (UDFs).
2011
- https://cwiki.apache.org/confluence/display/Hive
- QUOTE: Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. Hive does not mandate read or written data be in the "Hive format"---there is no such thing. Hive works equally well on Thrift, control delimited, or your specialized data formats.
Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when data sets involved are very small (say a few hundred megabytes). As a result it cannot be compared with systems such as Oracle where analyses are conducted on a significantly smaller amount of data but the analyses proceed much more iteratively with the response times between iterations being less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries. Hive also does not provide any sort of data or query cache to make repeated queries over the same data set faster. Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.
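The UDF/UDAF/UDTF extensibility mentioned in the quote is exposed through HiveQL's function-registration statements; a sketch, where the jar path, function name, and Java class are hypothetical:

```sql
-- Make a hypothetical jar containing a Java UDF available to the session.
ADD JAR /tmp/my_udfs.jar;

-- Register the UDF under a HiveQL-visible name (hypothetical class).
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';

-- Use it like a built-in function.
SELECT normalize_url(page_url) FROM page_views LIMIT 10;
```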
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Apache_Hive
- Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[1] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Hive is also included in Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop compatible file systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes such as the bitmap index.
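The bitmap indexing mentioned above was exposed through HiveQL's CREATE INDEX statement (available in the Hive versions of this era; indexing was later removed in Hive 3). A sketch, with hypothetical table and index names:

```sql
-- Build a bitmap index on a low-cardinality column (hypothetical table).
CREATE INDEX page_views_url_idx
ON TABLE page_views (page_url)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- Populate the index (runs as a MapReduce job).
ALTER INDEX page_views_url_idx ON page_views REBUILD;
```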
By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.
Currently, there are three file formats supported in Hive, which are TEXTFILE, SEQUENCEFILE and RCFILE.
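The file format was declared per table with the STORED AS clause; for example (hypothetical table names):

```sql
-- The same schema stored in each of the three formats supported at the time.
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq  (line STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_rc   (line STRING) STORED AS RCFILE;
```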
- ↑ Venner, Jason (2009). Pro Hadoop. Apress. ISBN 978-1430219422.
2009
- (Thusoo et al., 2009) ⇒ Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. (2009). “Hive: a warehousing solution over a map-reduce framework.” In: Proceedings of the VLDB Endowment, 2(2).
- ABSTRACT: The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.