Commit 4b1c72c

emkornfield, alamb, jorisvandenbossche, Fokko, and pitrou authored
GH-534: Clarify versioning and V2 (#535)
Clarify versioning and no restrictions on encodings.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
1 parent 534cecc commit 4b1c72c

2 files changed: 19 additions and 4 deletions

Encodings.md

Lines changed: 5 additions & 2 deletions
@@ -22,6 +22,9 @@ Parquet encoding definitions
 
 This file contains the specification of all supported encodings.
 
+Unless otherwise stated in page or encoding documentation, any encoding can be
+used with any page type.
+
 <a name="PLAIN"></a>
 ### Plain: (PLAIN = 0)
 
@@ -59,8 +62,8 @@ Dictionary page format: the entries in the dictionary using the [plain](#PLAIN)
 Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
 followed by the values encoded using RLE/Bit packed described above (with the given bit width).
 
-Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
-in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
+Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY`
+in a data page and `PLAIN` in a dictionary page for new Parquet files.
 
 <a name="RLE"></a>
 ### Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
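
One way a writer might apply the deprecation note above is to normalize the encoding it records for each page kind. The sketch below is illustrative Python and is not part of this commit. `Encoding` mirrors only a subset of the enum in `parquet.thrift` (the values 2 and 8 come from that file, not from this diff), and `modern_encoding` is a hypothetical helper name.

```python
from enum import IntEnum


class Encoding(IntEnum):
    """Subset of the Parquet encoding enum, for illustration only."""
    PLAIN = 0
    PLAIN_DICTIONARY = 2  # deprecated per the note above
    RLE = 3
    RLE_DICTIONARY = 8


def modern_encoding(encoding: Encoding, is_dictionary_page: bool) -> Encoding:
    """Map the deprecated PLAIN_DICTIONARY value to its replacement.

    Dictionary pages get PLAIN and data pages get RLE_DICTIONARY.
    Any other encoding is returned unchanged.
    """
    if encoding != Encoding.PLAIN_DICTIONARY:
        return encoding
    return Encoding.PLAIN if is_dictionary_page else Encoding.RLE_DICTIONARY


# Usage: a writer that still thinks in terms of PLAIN_DICTIONARY.
data_page_encoding = modern_encoding(Encoding.PLAIN_DICTIONARY, is_dictionary_page=False)
dict_page_encoding = modern_encoding(Encoding.PLAIN_DICTIONARY, is_dictionary_page=True)
```

Readers would presumably still need to accept `PLAIN_DICTIONARY` in existing files; the mapping only concerns what a writer emits for new ones.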

src/main/thrift/parquet.thrift

Lines changed: 14 additions & 2 deletions
@@ -712,9 +712,14 @@ struct DictionaryPageHeader {
 }
 
 /**
- * New page format allowing reading levels without decompressing the data
+ * Alternate page format allowing reading levels without decompressing the data
  * Repetition and definition levels are uncompressed
  * The remaining section containing the data is compressed if is_compressed is true
+ *
+ * Implementation note - this header is not necessarily a strict improvement over
+ * `DataPageHeader` (in particular the original header might provide better compression
+ * in some scenarios). Page indexes require pages to start and end at row boundaries,
+ * regardless of which page header is used.
  **/
 struct DataPageHeaderV2 {
   /** Number of values, including NULLs, in this data page. **/
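
To illustrate the layout the `DataPageHeaderV2` comment describes, here is a minimal Python sketch that is not taken from the commit. It assumes the page body stores repetition levels, then definition levels, then the data section, with the two level byte lengths supplied by the header. `split_v2_page` and its parameter names are hypothetical, and `zlib` merely stands in for whichever codec the column chunk actually uses.

```python
import zlib
from typing import Callable, Tuple


def split_v2_page(
    body: bytes,
    rep_levels_len: int,
    def_levels_len: int,
    is_compressed: bool,
    decompress: Callable[[bytes], bytes],
) -> Tuple[bytes, bytes, bytes]:
    """Split a V2-style page body into (rep levels, def levels, data).

    The level sections are read as-is because they are never compressed.
    Only the trailing data section is passed through the codec.
    """
    levels_len = rep_levels_len + def_levels_len
    if rep_levels_len < 0 or def_levels_len < 0 or levels_len > len(body):
        raise ValueError("level lengths do not fit inside the page body")
    rep_levels = body[:rep_levels_len]
    def_levels = body[rep_levels_len:levels_len]
    data = body[levels_len:]
    if is_compressed:
        data = decompress(data)
    return rep_levels, def_levels, data


# Usage: build a toy page body and take it apart again.
toy_body = b"\x01\x02" + b"\x03" + zlib.compress(b"values")
rep, dfn, data = split_v2_page(toy_body, 2, 1, True, zlib.decompress)
```

Because the levels sit outside the compressed region, a reader can inspect them (for example to count nulls) without paying for decompression of the values.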
@@ -1255,7 +1260,14 @@ union EncryptionAlgorithm {
  * Description for file metadata
  */
 struct FileMetaData {
-  /** Version of this file **/
+  /** Version of this file
+   *
+   * As of December 2025, there is no agreed upon consensus of what constitutes
+   * version 2 of the file. For maximum compatibility with readers, writers should
+   * always populate "1" for version. For maximum compatibility with writers,
+   * readers should accept "1" and "2" interchangeably. All other versions are
+   * reserved for potential future use-cases.
+   */
   1: required i32 version
 
   /** Parquet schema for this file. This schema contains metadata for all the columns.
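
The version guidance in the `FileMetaData` comment can be sketched as a pair of small helpers. This is illustrative Python under stated assumptions, not code from the commit: `version_to_write` and `normalize_read_version` are hypothetical names, and rejecting unknown versions with an exception is one possible policy for the "reserved" values rather than something the comment mandates.

```python
# Versions a reader treats as equivalent, per the FileMetaData comment.
ACCEPTED_READ_VERSIONS = frozenset({1, 2})

# The value a writer populates for maximum reader compatibility.
WRITE_VERSION = 1


def version_to_write() -> int:
    """Return the version a writer should record in the file metadata."""
    return WRITE_VERSION


def normalize_read_version(version: int) -> int:
    """Treat versions 1 and 2 interchangeably and reject anything else.

    Raising on other values is an assumption made for this sketch; the
    comment only says those versions are reserved for future use.
    """
    if version not in ACCEPTED_READ_VERSIONS:
        raise ValueError(f"unsupported Parquet file version: {version}")
    return WRITE_VERSION


# Usage: both accepted values collapse to the same internal version.
same = normalize_read_version(1) == normalize_read_version(2)
```

Collapsing both values to one internal constant keeps the rest of a reader from branching on a field that, per the comment, carries no agreed meaning.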
