Parquet File Format

A Parquet File Format is a columnar file format that supports nested data.



References

2019

  • https://stackoverflow.com/a/56481636
    • QUOTE: ... Parquet is a columnar file format for data serialization. Reading a Parquet file requires decompressing and decoding its contents into some kind of in-memory data structure. It is designed to be space/IO-efficient at the expense of CPU utilization for decoding. It does not provide any data structures for in-memory computing. Parquet is a streaming format which must be decoded from start-to-end, while some "index page" facilities have been added to the storage format recently, in general random access operations are costly. ...
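  • The following is an illustrative sketch (not part of the quoted answer) of decoding a Parquet file into an in-memory data structure with the pyarrow library; the file path and column name are hypothetical.

      import pyarrow.parquet as pq

      # Reading decompresses and decodes the column chunks into an
      # in-memory, column-oriented Arrow table.
      table = pq.read_table("events.parquet")
      print(table.num_rows, table.schema)

      # Reading only the needed columns avoids decoding the rest,
      # which is where the columnar layout pays off.
      subset = pq.read_table("events.parquet", columns=["user_id"])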

2017c

  • https://parquet.apache.org/documentation/latest/
    • QUOTE: We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

      Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

      Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

      Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.
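  • The per-column compression mentioned above can be illustrated with the pyarrow library; a minimal sketch, in which the table, column names, and choice of codecs are purely illustrative.

      import pyarrow as pa
      import pyarrow.parquet as pq

      # A small example table with two columns.
      table = pa.table({
          "user_id": [1, 2, 3],
          "payload": ["a", "b", "c"],
      })

      # The compression codec can be specified per column when writing.
      pq.write_table(
          table,
          "example.parquet",
          compression={"user_id": "snappy", "payload": "gzip"},
      )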

2017d

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
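
The trailer shown above (the file metadata, its 4-byte length, and the closing "PAR1" magic number) can be located with a small amount of I/O at the end of the file. A minimal Python sketch, assuming a well-formed file at a hypothetical path:

    import struct

    def read_footer(path):
        # End of file: [FileMetaData][4-byte little-endian length]["PAR1"]
        with open(path, "rb") as f:
            f.seek(0, 2)
            file_size = f.tell()
            f.seek(file_size - 8)
            footer_len = struct.unpack("<I", f.read(4))[0]
            if f.read(4) != b"PAR1":
                raise ValueError("missing Parquet magic number")
            # The Thrift-encoded FileMetaData sits just before the trailer.
            f.seek(file_size - 8 - footer_len)
            return f.read(footer_len)

    metadata_bytes = read_footer("example.parquet")  # hypothetical path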

2016b

  • https://github.com/Parquet/parquet-format
    • QUOTE: Parquet is a columnar storage format that supports nested data.

       Parquet metadata is encoded using Apache Thrift. The Parquet-format project contains all Thrift definitions that are necessary to create readers and writers for Parquet files.
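  • Rather than decoding those Thrift structures by hand, a library that implements the definitions can expose the metadata directly. A sketch using pyarrow (the file path is hypothetical):

      import pyarrow.parquet as pq

      # pyarrow decodes the Thrift-encoded footer into a FileMetaData object.
      meta = pq.read_metadata("example.parquet")
      print(meta.num_rows, meta.num_row_groups)
      print(meta.schema)                  # schema recorded in the footer
      print(meta.row_group(0).column(0))  # per-column-chunk metadata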

2016

  • Ryan Blue. (2016). “Parquet performance tuning - the missing guide.” Presentation.
    • QUOTE: Topics include:
      • The tools and techniques Netflix uses to analyze Parquet tables
      • How to spot common problems
      • Recommendations for Parquet configuration settings to get the best performance out of your processing platform
      • The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
    • Spark configuration:
      // turn on Parquet push-down, stats filtering, and dictionary filtering
      sqlContext.setConf("parquet.filter.statistics.enabled", "true")
      sqlContext.setConf("parquet.filter.dictionary.enabled", "true")
      sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
