13 changes: 11 additions & 2 deletions src/main/thrift/parquet.thrift
@@ -33,7 +33,7 @@ enum Type {
BOOLEAN = 0;
INT32 = 1;
INT64 = 2;
- INT96 = 3; // deprecated, only used by legacy implementations.
+ INT96 = 3; // deprecated, new Parquet writers should not write data in INT96
FLOAT = 4;
DOUBLE = 5;
BYTE_ARRAY = 6;
@@ -1076,12 +1076,21 @@ union ColumnOrder {
* BOOLEAN - false, true
* INT32 - signed comparison
* INT64 - signed comparison
- * INT96 (only used for legacy timestamps) - undefined
+ * INT96 (only used for legacy timestamps) - undefined(+)
* FLOAT - signed comparison of the represented value (*)
* DOUBLE - signed comparison of the represented value (*)
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (+) While the INT96 type has been deprecated, at the time of writing it is
* still used in many legacy systems. If a Parquet implementation chooses
* to write statistics for INT96 columns, it is recommended to order them
* according to the legacy rules:
* - compare the last 4 bytes (days) as a little-endian 32-bit signed integer
* - if equal last 4 bytes, compare the first 8 bytes as a little-endian
* 64-bit signed integer (nanos)
* See https://github.com/apache/parquet-format/issues/502 for more details
*
Comment on lines +1079 to +1093
Contributor

@alamb how about wording this more strongly?

Suggested change
* INT96 (only used for legacy timestamps) - undefined(+)
* FLOAT - signed comparison of the represented value (*)
* DOUBLE - signed comparison of the represented value (*)
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (+) While the INT96 type has been deprecated, at the time of writing it is
* still used in many legacy systems. If a Parquet implementation chooses
* to write statistics for INT96 columns, it is recommended to order them
* according to the legacy rules:
* - compare the last 4 bytes (days) as a little-endian 32-bit signed integer
* - if equal last 4 bytes, compare the first 8 bytes as a little-endian
* 64-bit signed integer (nanos)
* See https://github.com/apache/parquet-format/issues/502 for more details
*
* INT96 (only used for legacy timestamps) - timestamp logical comparison (+)
* FLOAT - signed comparison of the represented value (*)
* DOUBLE - signed comparison of the represented value (*)
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (+) While the INT96 type has been deprecated, at the time of writing it is
* still used in many legacy systems. The only logical type stored in INT96
* is a timestamp defined as 8 bytes signed little-endian nanos followed by
* 4 bytes signed little-endian julian days.
* See https://github.com/apache/parquet-format/issues/502 for more details
*
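
For readers unfamiliar with the layout named in the suggestion above (8 bytes of little-endian signed nanos followed by 4 bytes of little-endian signed Julian days), here is a minimal Python sketch of decoding one value. It is not part of the PR. The function name `decode_int96_timestamp` is hypothetical, and the sketch assumes that the nanos field counts nanoseconds within the day and that the result is treated as UTC; neither assumption is stated in the thread.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def decode_int96_timestamp(value: bytes) -> datetime:
    """Decode a 12-byte INT96 value into a datetime (hypothetical helper).

    Assumes nanos are nanoseconds within the day and that the value is UTC.
    """
    if len(value) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian, 8-byte signed nanos, then 4-byte signed days.
    nanos_of_day, julian_day = struct.unpack("<qi", value)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime only has microsecond resolution, so sub-microsecond
    # nanoseconds are truncated here.
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )
```

The sketch loses sub-microsecond precision because of Python's `datetime`; a real reader would keep the nanoseconds in an integer type.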

Contributor

I actually prefer the original wording as it a) leaves the ordering undefined (and thus isn't really a change to the specification) and b) spells out exactly how to perform the in-the-wild ordering.
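
The in-the-wild ordering described in the proposed comment (compare the last 4 bytes as days, then the first 8 bytes as nanos) can be sketched in Python as below. This is an illustration only, not code from the PR or from any Parquet implementation; the names `int96_sort_key` and `int96_min_max` are hypothetical.

```python
import struct


def int96_sort_key(value: bytes):
    """Return (days, nanos) so that tuple comparison follows the legacy rules."""
    if len(value) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian, first 8 bytes signed nanos, last 4 bytes signed days.
    nanos, days = struct.unpack("<qi", value)
    # Days are compared first; nanos only break ties between equal days.
    return (days, nanos)


def int96_min_max(values):
    """Compute min/max statistics for a non-empty list of INT96 values."""
    return min(values, key=int96_sort_key), max(values, key=int96_sort_key)
```

Note that this ordering differs from an unsigned byte-wise comparison of the raw 12 bytes, because the most significant field (days) is stored last.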

Member

+1 to @etseidl's suggestion. BTW, I think we need to avoid saying "only logical type stored in INT96", since strictly speaking it is neither a Parquet logical type nor a (deprecated) converted type.

Contributor

@etseidl I agree that leaving it undefined has the advantage of not changing the spec. The disadvantage is that readers will have less confidence in the meaning of this field when it exists.

@wgtmac I see INT96 as a weird physical type. Had it been a real physical type, it would have a logical type TIMESTAMP signifying that the first 8 bytes are nanos and the last 4 bytes are Julian days. It could also have a logical type IntType{96, true/false}, which could be used to store numbers greater than the INT64 maximum. In the former case the ordering would be what we propose today, and in the latter case the ordering would be a signed or unsigned comparison of the 96-bit integer. The reality is that no logical type was ever defined for INT96, and its logical type is de facto the timestamp representation.

Contributor Author

I think @etseidl is referring to my original proposed change (not the wording prior to this PR)

> @alamb how about wording this more strongly?

In terms of changing the wording from

-   *   INT96 (only used for legacy timestamps) - undefined(+)
+ * INT96 (only used for legacy timestamps) - timestamp logical comparison (+)

I would personally read this as a change in the spec, one that obligates writers to use the new ordering. I was under the impression we didn't have consensus on making that change.

I prefer leaving the wording as undefined personally.

> The disadvantage is that readers will have less confidence on the meaning of this field when it exists.

This might be a good thing, given that we have instances where old writers wrote it incorrectly, so existing statistics may not be correct.

* (*) Because the sorting order is not specified properly for floating
* point values (relations vs. total ordering) the following
* compatibility rules should be applied when reading statistics: