Skip to content

8789: [Parquet] Avoid copying LogicalType in ColumnOrder::get_sort_order, deprecate get_logical_type#15

Open
martin-augment wants to merge 5 commits intomainfrom
pr-8789-2025-11-10-12-58-45
Open

8789: [Parquet] Avoid copying LogicalType in ColumnOrder::get_sort_order, deprecate get_logical_type#15
martin-augment wants to merge 5 commits intomainfrom
pr-8789-2025-11-10-12-58-45

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

@martin-augment martin-augment commented Nov 10, 2025

8789: To review by AI


Note

Replace cloning LogicalType access with ref-based APIs (*_logical_type_ref), introduce ColumnOrder::sort_order_for_type, deprecate old cloning getters, and update readers/writers/tests accordingly.

  • API changes:
    • Add ColumnOrder::sort_order_for_type(Option<&LogicalType>, ...) and deprecate ColumnOrder::get_sort_order.
    • Deprecate cloning getters BasicTypeInfo::logical_type() and ColumnDescriptor::logical_type(); prefer *_logical_type_ref.
  • Readers/Writers/Thrift:
    • Use logical_type_ref() throughout (metadata writer, thrift (read/write), schema printer), passing refs instead of cloning.
    • Update column order computation to call sort_order_for_type.
  • Arrow schema conversion/primitive:
    • Convert logic to use logical_type_ref(); fix decimal/time handling by dereferencing fields where needed.
  • Geospatial:
    • Use logical_type_ref() when choosing accumulators.
  • Tests:
    • Update assertions to compare against logical_type_ref(); adjust expectations accordingly across Arrow/serialized readers, schema, geospatial, variant tests.

Written by Cursor Bugbot for commit 0e5c0c7. This will update automatically on new commits. Configure here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 10, 2025

Walkthrough

Migrates the parquet crate's logical type access APIs from owned values to borrowed references. Replaces logical_type() methods returning Option<LogicalType> with logical_type_ref() methods returning Option<&LogicalType>. Adds a new sort_order_for_type method and deprecates get_sort_order.

Changes

Cohort / File(s) Change Summary
Core API Changes
parquet/src/basic.rs, parquet/src/schema/types.rs
Added ColumnOrder::sort_order_for_type() method accepting Option<&LogicalType>; deprecated ColumnOrder::get_sort_order(). Marked BasicTypeInfo::logical_type() and ColumnDescriptor::logical_type() as deprecated with guidance to use logical_type_ref() instead.
Schema Module Updates
parquet/src/schema/printer.rs, parquet/src/schema/types.rs
Replaced logical_type() calls with logical_type_ref() in printer logic and sort order computation. Updated pattern matching to work with borrowed references.
Arrow Schema Integration
parquet/src/arrow/schema/{extension,mod,primitive}.rs
Updated try_add_extension_type, has_extension_type, and schema conversion helpers to use logical_type_ref(). Adjusted all pattern matches on LogicalType variants to dereference fields for Decimal and Time types.
Arrow Reader
parquet/src/arrow/arrow_reader/mod.rs
Updated test assertions to compare against Some(&LogicalType::...) patterns using logical_type_ref().
Metadata Handling
parquet/src/file/metadata/{thrift/mod,writer}.rs
Replaced sort order calculations from get_sort_order() to sort_order_for_type() with logical_type_ref(). Updated Thrift schema writing to use logical_type_ref().cloned() for serialization.
File I/O Tests
parquet/src/file/serialized_reader.rs
Updated test queries from logical_type() to logical_type_ref() with corresponding assertion value updates.
Feature-specific Code
parquet/src/geospatial/accumulator.rs, parquet/src/variant.rs
Updated geospatial accumulator factory and variant type checks to use logical_type_ref() instead of logical_type().
Integration Tests
parquet/tests/geospatial.rs
Updated test code to use logical_type_ref() on both ColumnDescriptor and BasicInfo, with reference-based comparisons and string handling via as_ref().
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-8789-2025-11-10-12-58-45

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.


fn from_int32(info: &BasicTypeInfo, scale: i32, precision: i32) -> Result<DataType> {
match (info.logical_type(), info.converted_type()) {
match (info.logical_type_ref(), info.converted_type()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In from_int32, the arm => match (bit_width, is_signed) is matching references (from logical_type_ref()) against value patterns, which will not match/compile; consider dereferencing before matching or using reference patterns (also applies to similar cases in this function).

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback:The Augment AI reviewer is not correct! Here the types are i32 which implement Copy trait and there is no need to dereference them


fn from_int64(info: &BasicTypeInfo, scale: i32, precision: i32) -> Result<DataType> {
match (info.logical_type(), info.converted_type()) {
match (info.logical_type_ref(), info.converted_type()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In from_int64, the arm => match is_signed { ... } matches a &bool (from logical_type_ref()) against true/false; this likely won’t compile without dereferencing or changing the patterns to reference forms.

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback:The Augment AI reviewer is not correct! Here the types are i64 which implement Copy trait and there is no need to dereference them

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
parquet/src/geospatial/accumulator.rs (1)

35-38: Fix pattern matching on Option<&LogicalType>: missing & causes a type mismatch.

logical_type_ref() returns Option<&LogicalType>. When matching Some(_) patterns, you must match the reference form with &.

Apply this diff:

@@
-    if !matches!(
-        descr.logical_type_ref(),
-        Some(LogicalType::Geometry { .. }) | Some(LogicalType::Geography { .. })
-    ) {
+    if !matches!(
+        descr.logical_type_ref(),
+        Some(&LogicalType::Geometry { .. }) | Some(&LogicalType::Geography { .. })
+    ) {
         return None;
     }
@@
-        if let Some(crate::basic::LogicalType::Geometry { .. }) = _descr.logical_type_ref() {
+        if let Some(&LogicalType::Geometry { .. }) = _descr.logical_type_ref() {
             Box::new(ParquetGeoStatsAccumulator::default())
         } else {
             Box::new(VoidGeoStatsAccumulator::default())
         }
parquet/src/basic.rs (1)

1147-1177: Add tests and migrate deprecated API call sites.

Verification confirms the new sort_order_for_type API is correct and the recommendations are valid:

  • Add test: No direct tests exist for sort_order_for_type; the suggested test syntax is correct (fields are bit_width: i8 and is_signed: bool).
  • Migrate call sites: Found exactly 2 calls to deprecated get_sort_order:
    • Line 2158 in test_column_order_get_logical_type_sort_order
    • Line 2261 in test_column_order_get_converted_type_sort_order

Both call sites are in tests and should be migrated to sort_order_for_type to eliminate deprecation warnings.

parquet/src/arrow/schema/extension.rs (1)

42-61: Fix pattern matching on &LogicalType (compile-time bug).

logical_type_ref() returns Option<&LogicalType>, but the match arms pattern-match as if on owned values. Use reference patterns (e.g., &LogicalType::Uuid) or deref the scrutinee.

Apply this diff:

@@
-    let Some(parquet_logical_type) = parquet_type.get_basic_info().logical_type_ref() else {
+    let Some(parquet_logical_type) = parquet_type.get_basic_info().logical_type_ref() else {
         return Ok(arrow_field);
     };
-    match parquet_logical_type {
+    match parquet_logical_type {
         #[cfg(feature = "variant_experimental")]
-        LogicalType::Variant { .. } => {
+        &LogicalType::Variant { .. } => {
             arrow_field.try_with_extension_type(parquet_variant_compute::VariantType)?;
         }
         #[cfg(feature = "arrow_canonical_extension_types")]
-        LogicalType::Uuid => {
+        &LogicalType::Uuid => {
             arrow_field.try_with_extension_type(arrow_schema::extension::Uuid)?;
         }
         #[cfg(feature = "arrow_canonical_extension_types")]
-        LogicalType::Json => {
+        &LogicalType::Json => {
             arrow_field.try_with_extension_type(arrow_schema::extension::Json::default())?;
         }
         _ => {}
     };
@@
-    let Some(parquet_logical_type) = parquet_type.get_basic_info().logical_type_ref() else {
+    let Some(parquet_logical_type) = parquet_type.get_basic_info().logical_type_ref() else {
         return false;
     };
-    match parquet_logical_type {
+    match parquet_logical_type {
         #[cfg(feature = "variant_experimental")]
-        LogicalType::Variant { .. } => true,
+        &LogicalType::Variant { .. } => true,
         #[cfg(feature = "arrow_canonical_extension_types")]
-        LogicalType::Uuid => true,
+        &LogicalType::Uuid => true,
         #[cfg(feature = "arrow_canonical_extension_types")]
-        LogicalType::Json => true,
+        &LogicalType::Json => true,
         _ => false,
     }

Also applies to: 68-80

🧹 Nitpick comments (1)
parquet/src/file/metadata/writer.rs (1)

159-164: Correct migration in writer.rs; deprecated method still used in tests.

The migration to sort_order_for_type in writer.rs (lines 159-164) is correct. The scan found two test utilities in parquet/src/basic.rs (lines 2158, 2261) still calling the deprecated get_sort_order, which internally delegates to sort_order_for_type. These tests remain functional but could be updated for consistency:

  • Line 2158: check_sort_order testing with LogicalType
  • Line 2261: check_sort_order testing with ConvertedType
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd40a52 and 0e5c0c7.

📒 Files selected for processing (13)
  • parquet/src/arrow/arrow_reader/mod.rs (1 hunks)
  • parquet/src/arrow/schema/extension.rs (2 hunks)
  • parquet/src/arrow/schema/mod.rs (3 hunks)
  • parquet/src/arrow/schema/primitive.rs (8 hunks)
  • parquet/src/basic.rs (1 hunks)
  • parquet/src/file/metadata/thrift/mod.rs (3 hunks)
  • parquet/src/file/metadata/writer.rs (1 hunks)
  • parquet/src/file/serialized_reader.rs (1 hunks)
  • parquet/src/geospatial/accumulator.rs (1 hunks)
  • parquet/src/schema/printer.rs (2 hunks)
  • parquet/src/schema/types.rs (6 hunks)
  • parquet/src/variant.rs (1 hunks)
  • parquet/tests/geospatial.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
parquet/src/arrow/arrow_reader/mod.rs (1)
parquet/src/schema/types.rs (2)
  • logical_type_ref (723-725)
  • logical_type_ref (938-940)
parquet/src/file/metadata/thrift/mod.rs (2)
parquet/src/schema/types.rs (1)
  • sort_order (979-985)
parquet/src/basic.rs (2)
  • sort_order (1239-1245)
  • sort_order_for_type (1148-1178)
parquet/src/arrow/schema/mod.rs (1)
parquet/src/schema/types.rs (2)
  • logical_type_ref (723-725)
  • logical_type_ref (938-940)
parquet/src/file/metadata/writer.rs (2)
parquet/src/schema/types.rs (1)
  • sort_order (979-985)
parquet/src/basic.rs (2)
  • sort_order (1239-1245)
  • sort_order_for_type (1148-1178)
parquet/src/variant.rs (1)
parquet/src/schema/types.rs (3)
  • get_basic_info (95-100)
  • logical_type_ref (723-725)
  • logical_type_ref (938-940)
parquet/src/basic.rs (2)
parquet/src/schema/types.rs (5)
  • logical_type (717-720)
  • logical_type (930-935)
  • converted_type (705-707)
  • converted_type (918-920)
  • physical_type (944-949)
parquet/src/file/statistics.rs (1)
  • physical_type (462-473)
parquet/src/file/serialized_reader.rs (2)
parquet/src/file/metadata/mod.rs (1)
  • schema (563-565)
parquet/src/schema/types.rs (2)
  • logical_type_ref (723-725)
  • logical_type_ref (938-940)
parquet/src/schema/types.rs (1)
parquet/src/basic.rs (1)
  • sort_order_for_type (1148-1178)
parquet/tests/geospatial.rs (3)
parquet/src/file/metadata/mod.rs (1)
  • column_descr (971-973)
parquet/src/schema/types.rs (2)
  • logical_type (717-720)
  • logical_type (930-935)
parquet/src/variant.rs (1)
  • read_metadata (253-257)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: codex
  • GitHub Check: claude-review
🔇 Additional comments (6)
parquet/src/file/serialized_reader.rs (1)

2707-2709: Tests updated to reference semantics look correct.

Asserting against Some(&LogicalType::...) matches the new logical_type_ref() API and avoids cloning. LGTM.

Also applies to: 2711-2713

parquet/src/schema/printer.rs (2)

283-289: Signature update to borrowed logical type is appropriate.

Accepting Option<&LogicalType> avoids needless cloning and aligns with the ref-based API. LGTM.


403-407: Call-sites updated correctly to logical_type_ref().

Both primitive and group branches now pass borrowed logical types to the formatter. Looks good.

Also applies to: 429-433

parquet/src/arrow/arrow_reader/mod.rs (1)

5052-5058: Tests aligned with reference-based API (LGTM).

Assertions against Option<&LogicalType> are correct and consistent with logical_type_ref(). No further action needed.

parquet/src/file/metadata/thrift/mod.rs (2)

1355-1363: Serialize logical_type via cloned() (LGTM).

Using basic_info.logical_type_ref().cloned() correctly preserves ownership semantics for Thrift serialization while avoiding earlier copies on access. Looks good.

Also applies to: 1384-1391


787-793: Correctly compute type-defined sort order without cloning.

The code at lines 787–793 is correct. Switching to ColumnOrder::sort_order_for_type(column.logical_type_ref(), ...) properly avoids cloning LogicalType.

For cleaner code, consider this optional refactor:

-                let sort_order = ColumnOrder::sort_order_for_type(
-                    column.logical_type_ref(),
-                    column.converted_type(),
-                    column.physical_type(),
-                );
-                cos[i] = ColumnOrder::TYPE_DEFINED_ORDER(sort_order);
+                cos[i] = ColumnOrder::TYPE_DEFINED_ORDER(column.sort_order());

The Column type has a sort_order() method (parquet/src/schema/types.rs:979) that encapsulates this exact call.

Note: Deprecated get_sort_order() calls remain in parquet/src/basic.rs (lines 2158, 2261) but are outside this file's scope.

Comment on lines +1135 to +1145
#[deprecated(
since = "57.1.0",
note = "use `ColumnOrder::sort_order_for_type` instead"
)]
pub fn get_sort_order(
logical_type: Option<LogicalType>,
converted_type: ConvertedType,
physical_type: Type,
) -> SortOrder {
// TODO: Should this take converted and logical type, for compatibility?
Self::sort_order_for_type(logical_type.as_ref(), converted_type, physical_type)
}
Copy link
Copy Markdown

@coderabbitai coderabbitai bot Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Deprecation is fine; verify the version string before tagging.

#[deprecated(since = "57.1.0", ..)] is okay now, but please confirm the crate version at release time to avoid stale metadata.

Run to spot-check before cutting a release:


🏁 Script executed:

#!/bin/bash
# Verify any deprecated attrs reference the intended next version
rg -nP 'deprecated\(\s*since\s*=\s*"[^"]+"\s*,\s*note' -C1

Length of output: 4650



Wait, I need to run this with the repo cloned. Let me correct:




🏁 Script executed:

#!/bin/bash
# Check parquet crate version from Cargo.toml
fd -n "Cargo.toml" -x grep -A5 "^\[package\]" {} \; | head -20

# Also search for all deprecated attributes including multiline ones using rg
echo "=== All deprecated attributes in parquet/src/basic.rs ==="
rg -nP '#\[deprecated' parquet/src/basic.rs -A3

Length of output: 604


🏁 Script executed:

#!/bin/bash
# Find and display parquet crate version
find . -path "*/parquet/Cargo.toml" -exec grep -A 5 '^\[package\]' {} \; | head -20

# Check all deprecated attrs in parquet/src/basic.rs with full context
rg -nP '#\[deprecated' parquet/src/basic.rs -A5 --multiline-dotall

Length of output: 654



---

🏁 Script executed:

```shell
#!/bin/bash
# Find workspace version
grep -A 20 '^\[workspace' Cargo.toml | head -30

# Also check if there's a workspace.package section with version
grep -A 50 '^\[workspace.package\]' Cargo.toml | grep version

Length of output: 2255


Version mismatch detected—confirm deprecation since version before release.

The deprecation attribute uses since = "57.1.0" but the workspace version is currently 57.0.0. Either the version should be 57.0.0 (if deprecating in the current release) or confirm this is intentionally forward-dated for a planned 57.1.0 patch release. This requires clarification before cutting any release.

🤖 Prompt for AI Agents
In parquet/src/basic.rs around lines 1135 to 1145, the deprecation attribute on
get_sort_order uses since = "57.1.0" which doesn't match the workspace version
57.0.0; change the attribute to since = "57.0.0" if you intend this deprecation
to apply in the current release, or confirm and document that the forward-dated
"57.1.0" is intentional for a planned patch release (and leave as-is). Ensure
the final change is reflected in the commit message and any release notes.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback: The CodeRabbit AI reviewer is not correct! According to https://crates.io/crates/arrow v57.0.0 is the latest released one (https://crates.io/crates/arrow/57.0.0), so the next one would be 57.1.0 or 58.0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🌐 Web query:

crates.io arrow crate version 57.0.0

💡 Result:

The Apache Arrow Rust crate version 57.0.0 was released and published to crates.io on October 30, 2025. Notable items in the release include a much faster Parquet metadata parser, a new arrow-avro crate, and Parquet Variant/Geometry support. See the release announcement and docs for details. [1][2][3]

Sources:

  • Apache Arrow release announcement (Oct 30, 2025). [1]
  • ARROW_VERSION / docs showing "57.0.0". [2]
  • Docs.rs / crate documentation for arrow (57.0.0). [3]

@martin-augment, you're absolutely correct, and I apologize for the incorrect analysis.

Version 57.0.0 is indeed the latest released version on crates.io, published on October 30, 2025. Using since = "57.1.0" in the deprecation attribute is the correct approach because:

  1. Standard practice: Deprecation attributes should reference the version in which the deprecation will be released, not the current development version
  2. Next release: Since this PR is being developed after the 57.0.0 release, it will be part of either 57.1.0 or 58.0.0
  3. Forward-looking: The since parameter tells users "this was deprecated starting from version X", which will be accurate when the next version is published

The deprecation annotation is properly configured for the upcoming release. My previous comment incorrectly flagged this as a version mismatch when it's actually following best practices. Thank you for the correction!

@github-actions
Copy link
Copy Markdown

  • parquet/src/arrow/async_reader/mod.rs:505 – When the file (or selected row set) is empty num_rows() returns 0, and this new min logic drives batch_size to 0. Later, ReaderFactory::read_row_group always instantiates a RowGroupCache with that batch_size, and if predicate caching is enabled (the default) and a filter is supplied we wrap the readers in CachedArrayReader. CachedArrayReader divides by batch_size in get_batch_id_from_position, so attempting to evaluate the predicate immediately panics (division by zero) or hangs in skip_records. The previous async path kept batch_size unchanged, so empty files were safe. Please keep the old behaviour—or explicitly clamp to a minimum of 1 / skip the cache when num_rows is zero—so we don’t regress empty-file + filter scans.

@martin-augment
Copy link
Copy Markdown
Owner Author

  • parquet/src/arrow/async_reader/mod.rs:505 – When the file (or selected row set) is empty num_rows() returns 0, and this new min logic drives batch_size to 0. Later, ReaderFactory::read_row_group always instantiates a RowGroupCache with that batch_size, and if predicate caching is enabled (the default) and a filter is supplied we wrap the readers in CachedArrayReader. CachedArrayReader divides by batch_size in get_batch_id_from_position, so attempting to evaluate the predicate immediately panics (division by zero) or hangs in skip_records. The previous async path kept batch_size unchanged, so empty files were safe. Please keep the old behaviour—or explicitly clamp to a minimum of 1 / skip the cache when num_rows is zero—so we don’t regress empty-file + filter scans.

value:annoying; category:bug; feedback: The Codex AI reviewer is commenting on files which are not changed in this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants