GH-493: Clarify Bounding Box Behavior in GeospatialStatistics #494
GH-493: Clarify Bounding Box Behavior in GeospatialStatistics #494rdblue merged 14 commits intoapache:masterfrom
Conversation
|
@wgtmac @paleolimbot @zhangfengcdt @Kontinuation Please review. Thank you! |
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for putting this together! The bounding box went through a number of iterations when the higher level changes were being discussed and the language you've added here is at least what I think the original intention was (although this wasn't captured in the comments). I think this is the cleanest way to represent the "zero non-NaN elements" case, but we do have to make it explicit to ensure that readers can trust the statistics to skip (e.g.) all null row groups.
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for the clarification! I think we are talking about the suggested behavior of writer and reader. What about using a table to describe this?
| Values \ Behavior | Writer | Reader |
|---|---|---|
| X and Y | Do not produce bbox if any X or Y is NaN | Omit the bbox if any of xmin/xmax/ymin/ymax is NaN |
| Z | Do not produce zmin/zmax if any Z is NaN | Omit the zmin/zmax if any of zmin/zmax is NaN |
| M | Do not produce mmin/mmax if any M is NaN | Omit the mmin/mmax if any of mmin/mmax is NaN |
By omit we mean that we should make no assumption about the values from the bbox and thus cannot be used for predicate pushdown.
| For `GEOGRAPHY` types, X and Y values are restricted to the canonical ranges of | ||
| [-180, 180] for X and [-90, 90] for Y. | ||
|
|
||
| When `GeospatialStatistics` is present, writers must omit zmin and zmax if and |
There was a problem hiding this comment.
What about putting these between line 95 and 97?
|
|
||
| When `GeospatialStatistics` is present, writers must omit zmin and zmax if and | ||
| only if there are zero non-NaN Z values in the column chunk, and must omit mmin | ||
| and mmax if and only if there are zero non-NaN M values. The bounding box must |
There was a problem hiding this comment.
I think it is better to mention X and Y values first because they are required.
| an indication that all corresponding values are null, and may use this | ||
| information to skip data during predicate evaluation. For example, a reader may | ||
| skip a row group if the bounding box is absent, indicating that all X and Y | ||
| coordinates are null. |
There was a problem hiding this comment.
What is a null value in this case? Is WKB able to encode a null coordinate? I was thinking the binary value is null in this case.
There was a problem hiding this comment.
For example, a reader may
skip a row group if the bounding box is absent, indicating that all X and Y
coordinates are null.
This is counter-intuitive because usually in Parquet we cannot skip the row group if its min/max stats is missing in that we cannot make any assumption about its data. It might be the case that all values are null, or the writer does not implement this feature at all.
There was a problem hiding this comment.
I think that's the idea with this language...we need the absent-ness to be significant so that there is a path for a reader to skip an all empty/null row group, and the other ways of communicating absent-ness were also confusing (use NaNs, use Inf/-Inf).
We can also add another field like optional dimensions_that_have_zero_non_nan_values (with a less verbose name)?
There was a problem hiding this comment.
struct ColumnMetaData {
5: required i64 num_values;
12: optional Statistics statistics;
}
struct Statistics {
3: optional i64 null_count;
}I still think that ColumnMetaData::num_values == ColumnMetaData::statistics.null_count indicates (implicitly) that the bbox should be empty. We need to fix the implementation of unknown sort order (to generate statistics in the parquet-cpp) instead of complicating the spec.
There was a problem hiding this comment.
Ah, I didn't understand that the lack of null count was a C++-specific limitation (and I was remembering a previous draft of the geospatial types that had some very strong language that writers were forbidden to write statistics and readers were required to ignore any that were present). I would be ideal if we could also prune 100% empty (but not null), but I think we can make the null count + geospatial_types to prune in most reasonable cases.
There was a problem hiding this comment.
Would require writing the "empty box" be enough in this case?
Basically:
- If all values are NULL then the null count logic could kick in.
- If there are non-NULL values which are all either empty or "invalid" then write the empty bounding box.
The the intention is that if a bounding box is empty the entire group can be skipped.
Co-authored-by: Gang Wu <ustcwg@gmail.com>
@jiayuasu @wgtmac @paleolimbot |
@mkaravel If writers are instructed to drop the bbox when there is any NaN value, how would you skip the rows if there are some (but not all) NaN values in the bbox? Shouldn't we consider it is malformed?
Ah, I got your point. In total there are five cases:
It seems that we need a clear approach to indicate there are no valid data (the case 3 above). I'm hesitant to use a bbox of all NaNs to represent an empty bbox because it complicates the logic to deal with cases where some values are NaN while others are not. Proposal from @paleolimbot to introduce Is this a common case and the invalid data should be preserved as is? Or can we simplify this by writing null for invalid data to Parquet so we can eliminate case 2 and 3 above? |
|
Thank you for the summary! I like the place we're at right now: readers can confidently skip files or row groups for all cases I envision being important. The perfectionist in me would love to handle the all-empty-but-not-null case (3) but I would be surprised if this was important (if it is, one can demonstrate a realistic scenario and propose to add something to handle that case). My proposal was hacky and I'd love to avoid it 🙂 ...it's a great point that using a "missing" thift element, an NaN, and/or Inf values for this concept are easily misinterpreted because they carry other meaning in a Parquet context. I've updated C++'s GeoStatistics to better communicate the meaning of various combinations of empty, uncalculated, or invalid. |
|
BTW, I also updated the example file in my PR description to add the |
|
@wgtmac I apologize for the typo in my post (fixed now, I forgot to write the word "NaN"), which has created the confusion. The summary is great! Thank you! |
|
I like the breakdown in the summary. For (3) my suggestion was to go for the empty box. I agree with @paleolimbot in both aspects: it would be ideal to handle this in a nicer way, but pragmatically I do not think it matters. I am hesitant modifying user data on-write. Is there a precedent in Parquet where something like that happens? I mean data that is invalid being written as null or omitted? |
If this is a reference to
Assuming that "invalid" means EMPTY or something with NaNs in it, I view this as something that a higher level wrapper (e.g., Sedona, Spark, or GeoPandas) might expose as an option to allow future readers of the file to have more useful statistics, and not something we would handle in any Parquet writer. Some tools generate EMPTYs, some tools generate nulls, and some computations may accidentally put nan NaN in a value. It's Parquet's job (in my opinion) to transport all of those cases faithfully to a tool that can do some validation and fix things if needed based on user input. (I think that's where we are right now!) |
|
I agree with @paleolimbot that the Parquet writer does not change users' input and we will still see all the five cases from my summary. To make bbox of cases (2) and (3) more effective, an application that generates parquet files need to transform empty or invalid geometry features to null values so that readers reading these files will only see cases (1) (4) (5). Adding the concept of empty bbox to the Parquet spec may complicates the reader because it cannot blindly discard bbox with NaN values. |
|
@mkaravel @wgtmac @paleolimbot Hey folks, I updated my PR to reflect the latest discussion result. Can you review? |
Co-authored-by: Gang Wu <ustcwg@gmail.com>
| * X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y | ||
| values. Do not produce a bounding box if all X or all Y values are invalid. |
There was a problem hiding this comment.
| * X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y | |
| values. Do not produce a bounding box if all X or all Y values are invalid. | |
| * X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y | |
| values. Do not produce a bounding box if all X and/or all Y values are invalid (even | |
| if there are valid Z and/or M values). |
There was a problem hiding this comment.
Can we just say NaN instead of invalid?
There was a problem hiding this comment.
What should we do if a geometry feature has a NaN value in its X coordinate? Should we ignore all the X/Y/Z/M coordinates from it, or just ignore that specific NaN value? It seems reasonable to do the latter.
| An invalid geospatial value refers to any of the following cases: | ||
|
|
||
| * `null`: A null value in Parquet. | ||
| * A non-`null` value that are encoded in a valid WKB or bounding box format |
There was a problem hiding this comment.
Perhaps it is better to distinguish invalid bbox and invalid WKB value?
There was a problem hiding this comment.
I think generally it might be better to inline the relevant pieces above (I've made suggestions...it is mostly just NaN). This section could perhaps be used to make non-spatial implementors aware that EMPTY geometries exist and that NaN values in geometries are rare and usually result in errors when an operation is performed (or it could be omitted).
There was a problem hiding this comment.
Agreed. Literally null is a valid value from the perspective of Parquet and is ignored by all kinds of min/max stats.
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for this, and apologies for taking a few days to circle back here. I don't have any objections here, just some clarifications that may be helpful.
| * X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y | ||
| values. Do not produce a bounding box if all X or all Y values are invalid. |
There was a problem hiding this comment.
Can we just say NaN instead of invalid?
| Omit M from the bounding box if all M values are invalid. | ||
|
|
||
| Readers should follow the guidelines below when examining bounding boxes. If | ||
| a bounding box is [invalid](#invalid-geospatial-values), it is treated as a `no |
There was a problem hiding this comment.
| a bounding box is [invalid](#invalid-geospatial-values), it is treated as a `no | |
| a bounding box is contains `NaN` values, it is treated as a `no |
(this would be the only invalid case, correct?
| * A bounding box is present: | ||
| * X and Y: Both X and Y of the bounding box must be present. If either | ||
| is missing, the bounding box is invalid. |
There was a problem hiding this comment.
| * A bounding box is present: | |
| * X and Y: Both X and Y of the bounding box must be present. If either | |
| is missing, the bounding box is invalid. | |
| * A bounding box is present: | |
| * Because xmin, ymin, xmax, and ymax are required fields, they are always present in the bounding box |
...maybe?
| An invalid geospatial value refers to any of the following cases: | ||
|
|
||
| * `null`: A null value in Parquet. | ||
| * A non-`null` value that are encoded in a valid WKB or bounding box format |
There was a problem hiding this comment.
I think generally it might be better to inline the relevant pieces above (I've made suggestions...it is mostly just NaN). This section could perhaps be used to make non-spatial implementors aware that EMPTY geometries exist and that NaN values in geometries are rare and usually result in errors when an operation is performed (or it could be omitted).
| but are not considered valid under this specification, including: | ||
| * `NaN`: Not a Number. For example, `POINT EMPTY` in WKB is represented by a | ||
| `Point` with each ordinate value set to an IEEE-754 quiet NaN value. | ||
| * `Empty geometries`: Geometries explicitly marked as empty in WKB using |
There was a problem hiding this comment.
How to detect Empty geometries when collecting the bbox in Parquet? I still don't know what to do if I were to implement this rule.
There was a problem hiding this comment.
This is really a WKB reader-specific behavior. I added the following line:
While different WKB readers may interpret such values differently, the resulting output should be treated as invalid
| * `Empty geometries`: Geometries explicitly marked as empty in WKB using | ||
| indicators such as `numPoints`, `numRings`, or `numGeometries`. Examples | ||
| include `LINESTRING EMPTY` or `POLYGON EMPTY`. | ||
| * `Out-of-bounds coordinates`: Values that fall outside the valid range |
There was a problem hiding this comment.
Are writers encouraged to drop X values when Out-of-bounds coordinates has been detected or it is the readers' choice to check this rule? Or both?
There was a problem hiding this comment.
I think the writers are required to drop X values when Out-of-bounds coordinates has been detected.
I clarified the readers behavior below
|
I made the following changes:
|
| values for validation. | ||
|
|
||
| * A bounding box is present: | ||
| * X and Y: Both X and Y of the bounding box must be present. |
There was a problem hiding this comment.
| * X and Y: Both X and Y of the bounding box must be present. | |
| * X and Y: Both X and Y of the bounding box must be present. If any X or Y | |
| value is invalid, this bounding box is not reliable and cannot be used. |
There was a problem hiding this comment.
I’m hesitant to merge this because I want to avoid mentioning invalid values in the bounding box; otherwise, we would need to create a separate section to define them.
There was a problem hiding this comment.
I think this one is important for implementations (although I do think you should just say NaN instead of "invalid"). Perhaps a concrete way to phrase this (unless I'm missing something) would be "If any of xmin, ymin, xmax, or ymax is NaN, the bounding box is not reliable and should not be used".
There was a problem hiding this comment.
The guidelines here is for readers to determine whether a bbox is reliable. So I think value is present is not sufficient.
|
|
||
| * A bounding box is present: | ||
| * X and Y: Both X and Y of the bounding box must be present. | ||
| * Z: If Z of the bounding box is missing, readers should not assume |
There was a problem hiding this comment.
| * Z: If Z of the bounding box is missing, readers should not assume | |
| * Z: If Z of the bounding box is missing or contains any invalid value, readers should not assume |
There was a problem hiding this comment.
Maybe "is missing or either zmin or zmax is NaN"?
| * Z: If Z of the bounding box is missing, readers should not assume | ||
| anything about the presence or validity of Z values and may need to | ||
| load individual coordinates for validation. | ||
| * M: If M of the bounding box is missing, readers should not assume |
There was a problem hiding this comment.
| * M: If M of the bounding box is missing, readers should not assume | |
| * M: If M of the bounding box is missing or contains any invalid value, readers should not assume |
There was a problem hiding this comment.
Maybe "is missing or either mmin or mmax is NaN"?
| # Invalid geospatial values | ||
|
|
||
| An invalid geospatial value refers to the coordinate values of a non-`null` | ||
| geospatial instance that are encoded in a valid WKB format, but are not |
There was a problem hiding this comment.
As we have mentioned a valid WKB format, do we need to provide guidelines for invalid WKB format?
There was a problem hiding this comment.
I think it’s very difficult to enumerate all possible invalid WKB formats, and chasing this could easily become another rabbit hole. Since the WKB standard clearly defines what a valid format is, I suggest we leave this out.
There was a problem hiding this comment.
Agreed. Is there any well-known reference for this?
There was a problem hiding this comment.
I think we already provided references to the WKB standard?
There was a problem hiding this comment.
It's a good point that if you're not going to go in to invalid WKB formats, mentioning a valid one is confusing. I think this section is better labeled as "Special geospatial values" since many of the things you mention in it are perfectly valid (and common) cases.
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you again for drafting this language!
| values for validation. | ||
|
|
||
| * A bounding box is present: | ||
| * X and Y: Both X and Y of the bounding box must be present. |
There was a problem hiding this comment.
I think this one is important for implementations (although I do think you should just say NaN instead of "invalid"). Perhaps a concrete way to phrase this (unless I'm missing something) would be "If any of xmin, ymin, xmax, or ymax is NaN, the bounding box is not reliable and should not be used".
| # Invalid geospatial values | ||
|
|
||
| An invalid geospatial value refers to the coordinate values of a non-`null` | ||
| geospatial instance that are encoded in a valid WKB format, but are not |
There was a problem hiding this comment.
It's a good point that if you're not going to go in to invalid WKB formats, mentioning a valid one is confusing. I think this section is better labeled as "Special geospatial values" since many of the things you mention in it are perfectly valid (and common) cases.
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
|
Hi @wgtmac @paleolimbot , I updated the PR again. I replaced most occurrences of |
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for iterating! That last section is clearer to me now. Two minor comments about other things that could be NaN if you're inclined, but looks good to me!
|
|
||
| * A bounding box is present: | ||
| * X and Y: Both X and Y of the bounding box must be present. | ||
| * Z: If Z of the bounding box is missing, readers should not assume |
There was a problem hiding this comment.
Maybe "is missing or either zmin or zmax is NaN"?
| * Z: If Z of the bounding box is missing, readers should not assume | ||
| anything about the presence or validity of Z values and may need to | ||
| load individual coordinates for validation. | ||
| * M: If M of the bounding box is missing, readers should not assume |
There was a problem hiding this comment.
Maybe "is missing or either mmin or mmax is NaN"?
Co-authored-by: Gang Wu <ustcwg@gmail.com>
|
All fixed. Please review again. Thank you! @wgtmac @paleolimbot |
| * `null` instance: Skip it and continue processing the remaining | ||
| geospatial instances. Do not produce a bounding box if all instances are null. | ||
| * Non-`null` instance with [special geospatial values](#special-geospatial-values): | ||
| * X and Y: Skip any special X or Y value and continue processing the |
There was a problem hiding this comment.
I have a question here, do we mean: skip any geospatial instance with special X and Y value
(ie, if a linestring has a thousand coordinates, but only one has bad X,Y value, its enough to skip whole instance?)
There was a problem hiding this comment.
no, what we mean is: if a linestring has a thousand coordinates and only 1 coordinate has a bad Y value, we skip this single Y value, continue aggregating on other Y values in this linestring instance
There was a problem hiding this comment.
ok. sorry i have another confusion in the doc, I know we define special geospatial value for the geospatial instance. But then in the language we say skip special X or Y value, we dont define what is a special coordinate
There was a problem hiding this comment.
Taking a LineString object as an example LINESTRING (0 NaN, 0 1, 1 2):
We have 3 levels here:
- A geospatial instance: this linestring object
- A geospatial coordinate:
- we have 3 geospatial coordinates
0 NaN,0 1,1 2
- we have 3 geospatial coordinates
- A geospatial value:
- For X, we have
0,0,1 - For Y, we have
NaN,1,2
- For X, we have
The special geospatial value in this spec refers to the 3rd level. When calculating bbox, we only skipped that single bad value NaN, and use X (0, 0, 1) and Y (1, 2).
That's why we didn't have a definition for special coordinate (2nd level) because we look at the 3rd level directly
There was a problem hiding this comment.
OK sorry i equated levels 2 and 3 in my head (coordinate and value).
But currently the language of 'special geospatial value' from the link mentions things like 'Nan (point)' , 'empty geometry', and 'invalid WKB' which seems to talk about invalid instances (level 1), which is my confusion. Then the section talks not about those, but about if actual x,v,z,m values (level 3) are invalid. Individual value is invalid only if NaN and outside the range (depending on context), right?
There was a problem hiding this comment.
@szehon-ho I think you’re right. The current wording conflated axis values and coordinates. I’ve updated the special geospatial value section to clarify the distinction.
| remaining X or Y values. Do not produce a bounding box if all X or all Y | ||
| values are special values. | ||
|
|
||
| * Z: Skip any special Z value and continue processing the remaining Z values. |
There was a problem hiding this comment.
nit: should we just omit 'continue processing the remaining...' in all these sections, to reduce the verbosity , as its a bit obvious?
There was a problem hiding this comment.
If you don't have a strong opinion here, I kind of want to keep it verbose. We tend to make it clear that one should only omit individual X/Y/Z/M value and continue using the remaining values, instead of dropping the coordinate or whole geospatial instance.
There was a problem hiding this comment.
ok i think its fine with me. For me It'd probably be more helpful if somehow we had that context (talking about continuing with rest of coordinates in an instance), than it is now. Thinking if we can put this somewhere, but agree not in this line though.
|
@wgtmac Please review. Thank you! |
|
@wgtmac @rdblue The Iceberg clarification PR has been merged. I updated the Parquet PR accordingly. Please review and approve. |
|
Looks great. Thanks for the update, @jiayuasu! |
|
Thanks, everyone! I merged this. |
Rationale for this change
Update the specification to offer clearer instructions on how writers should populate bounding box values in the presence of NaNs, and how readers should interpret missing or incomplete bounding boxes.
What changes are included in this PR?
Added comments in the
Bounding Boxsection of theGeospatial.md.Below is an example illustrating how bounding box statistics behave based on the presence of X and Y values:
All values in this example file are allowed by WKB encoding.
Do these changes have PoC implementations?
Yes, apache/parquet-java#2971 and apache/arrow#45459