Apache Arrow Framework
An Apache Arrow Framework is a high-performance cross-system data layer for columnar in-memory analytics.
- Context:
- It can be used to develop Data Analytics Applications.
- It can be supported by an Apache Arrow System, such as one based on PyArrow.
- …
- Example(s):
- an Apache Arrow, v1.0.0, ~2020-07-15 [1].
- an Apache Arrow, v0.8.0, ~2017-12-17 [2].
- …
- Counter-Example(s):
- an Apache Parquet Format.
- a Spark RDD.
- …
- See: AirFlow, Columnar Data Store, Columnar DBMS.
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Apache_Arrow Retrieved:2020-7-6.
- Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
2019
- https://stackoverflow.com/a/56481636
- QUOTE: ... Arrow on the other hand is first and foremost a library providing columnar data structures for in-memory computing. When you read a Parquet file, you can decompress and decode the data into Arrow columnar data structures, so that you can then perform analytics in-memory on the decoded data. Arrow columnar format has some nice properties: random access is O(1) and each value cell is next to the previous and following one in memory, so it's efficient to iterate over.
What about "Arrow files" then? Apache Arrow defines a binary "serialization" protocol for arranging a collection of Arrow columnar arrays (called a "record batch") that can be used for messaging and interprocess communication. You can put the protocol anywhere, including on disk, which can later be memory-mapped or read into memory and sent elsewhere.
This Arrow protocol is designed so that you can "map" a blob of Arrow data without doing any deserialization, so performing analytics on Arrow protocol data on disk can use memory-mapping and pay effectively zero cost. The protocol is used for many things, such as streaming data between Spark SQL and Python for running pandas functions against chunks of Spark SQL data, these are called "pandas udfs". ... In some applications, Parquet and Arrow can be used interchangeably for on-disk data serialization. Some things to keep in mind:
- Parquet is intended for "archival" purposes, meaning if you write a file today, we expect that any system that says they can "read Parquet" will be able to read the file in 5 years or 7 years. We are not yet making this assertion about long-term stability of the Arrow format (though we might in the future).
- Parquet is generally a lot more expensive to read because it must be decoded into some other data structure. Arrow protocol data can simply be memory-mapped.
- Parquet files are often much smaller than Arrow-protocol-on-disk because of the data encoding schemes that Parquet uses. If your disk storage or network is slow, Parquet is going to be a better choice.
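The workflow described in the quote above can be sketched with PyArrow: decode a Parquet file into Arrow columnar structures, run in-memory analytics on the decoded columns, then write the same data out as an Arrow record-batch ("Arrow file") that can later be memory-mapped without a decoding step. This is a minimal, illustrative sketch; the file name trips.parquet and the column name fare are placeholders, not part of the source.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Read a Parquet file: the data is decompressed and decoded into
# Arrow columnar arrays held in memory (hypothetical file name).
table = pq.read_table("trips.parquet")

# In-memory analytics on the decoded columns (hypothetical column name).
mean_fare = pc.mean(table["fare"])

# Serialize the same table as Arrow record batches (the Arrow IPC "file" format).
with ipc.new_file("trips.arrow", table.schema) as writer:
    writer.write_table(table)

# Later: memory-map the Arrow file and access the columns without decoding.
mapped = ipc.open_file(pa.memory_map("trips.arrow")).read_all()
print(mean_fare.as_py(), mapped.num_rows)
```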
2017
- http://arrow.apache.org/
- QUOTE:
- Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout of data also allows for a better use of CPU caches by placing all data relevant to a column operation in as compact of a format as possible.
- The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.
- Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. Implementations in Java, C, C++, and Python are underway, and more languages are expected soon.
- Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics.
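The columnar, vectorized processing described above can be illustrated with PyArrow's compute kernels, which operate on whole columns at once and so lend themselves to cache-friendly, SIMD-style loops in the underlying C++ implementation. This is a small sketch with made-up data, not an official example from the Arrow documentation.

```python
import pyarrow as pa
import pyarrow.compute as pc

# A column of values laid out contiguously in memory (Arrow columnar format).
prices = pa.array([9.99, 14.50, 3.25, 20.00])  # illustrative data

# Vectorized kernels run over the whole column in one call.
discounted = pc.multiply(prices, 0.9)
total = pc.sum(discounted)

# Zero-copy views: slicing re-uses the same underlying buffers rather than copying.
window = prices.slice(1, 2)
print(total.as_py(), window.to_pylist())
```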
2017b
- https://arrow.apache.org/docs/python/parquet.html
- QUOTE: The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala (incubating), and Apache Spark adopting it as a shared standard for high performance data IO.
Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files. We have been concurrently developing the C++ implementation of Apache Parquet, which includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which thus enables reading and writing Parquet files with pandas as well.
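The pandas-to-Parquet path described in the quote above can be sketched as follows: a pandas DataFrame is converted to an Arrow Table, written to Parquet, and read back through Arrow for analysis. The data and the file name weather.parquet are illustrative placeholders.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# pandas DataFrame -> Arrow Table (in-memory, columnar) -> Parquet on disk.
df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.5, 19.0]})  # illustrative data
table = pa.Table.from_pandas(df)
pq.write_table(table, "weather.parquet")  # hypothetical output path

# Parquet on disk -> Arrow Table -> pandas DataFrame for analysis.
round_trip = pq.read_table("weather.parquet").to_pandas()
print(round_trip.head())
```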
2017c
- https://en.wikipedia.org/wiki/List_of_Apache_Software_Foundation_projects
- Apache Arrow: "A high-performance cross-system data layer for columnar in-memory analytics".