Skip to content

21149: fix(datasource): keep stats absent when collect_stats is false#295

Open
martin-augment wants to merge 1 commit intomainfrom
pr-21149-2026-03-26-09-53-07
Open

21149: fix(datasource): keep stats absent when collect_stats is false#295
martin-augment wants to merge 1 commit intomainfrom
pr-21149-2026-03-26-09-53-07

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

21149: To review by AI

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the get_statistics_with_limit function where statistics were being collected and populated even when explicitly instructed not to via the collect_stats parameter. The changes ensure that when collect_stats is false, the returned statistics correctly reflect Precision::Absent for all relevant fields, preventing unnecessary computation and providing accurate metadata. Additionally, new tests have been added to thoroughly validate this behavior and the aggregation of statistics when collect_stats is true.

Highlights

  • Conditional Statistics Collection: Ensured that num_rows, total_byte_size, and column_statistics are only populated when the collect_stats flag is true within the get_statistics_with_limit function, preventing unnecessary computation.
  • Column Statistics byte_size: Included byte_size in the column statistics collection when collect_stats is enabled, ensuring more complete statistics are gathered.
  • New Test Cases: Added comprehensive unit tests to verify the behavior of get_statistics_with_limit under various collect_stats and limit scenarios, including cases where statistics should be Precision::Absent or aggregated correctly.
  • Test Helper Functions: Introduced new helper functions (test_schema, make_file_stats, rich_col_stats) to simplify the creation of test data and statistics for improved test readability and maintainability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8cb6972b-1832-4444-bcc1-d87f12a57792

📥 Commits

Reviewing files that changed from the base of the PR and between cf04f01 and 23fa3f6.

📒 Files selected for processing (1)
  • datafusion/datasource/src/statistics.rs

Walkthrough

The pull request modifies get_statistics_with_limit in the statistics module to conditionally initialize statistics based on the collect_stats parameter. When collect_stats is false, statistics fields remain absent; when true, statistics from the first file are copied, including byte_size. An existing test is updated to reflect this behavior change. Three new tests are added to verify statistics collection with and without the collect_stats flag, including cases with limit conditions that produce inexact results.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr-21149-2026-03-26-09-53-07

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Mar 26, 2026

Code Review

Summary

This PR fixes a real bug: when collect_stats = false, the first-file statistics initialization block ran unconditionally, causing num_rows, total_byte_size, and all column statistics to be populated from the first file's data instead of staying Precision::Absent. The subsequent files' loop already had a continue guard, so only the first file's path was broken.


Correctness ✅

The core fix is correct and well-targeted. Wrapping the first-file block in if collect_stats { ... } makes the behavior consistent with the loop's existing if !collect_stats { continue; } guard.

Bonus Fix ✅

Adding col_stats_set[index].byte_size = file_column.byte_size; inside the first-file block is a legitimate fix: previously byte_size was never propagated from the first file even when collect_stats = true. Good catch.

Existing Test Fix ✅

Changing get_statistics_with_limit(files, schema, None, false)true in test_get_statistics_with_limit_casts_first_file_sum_to_sum_type is correct. With collect_stats = false, sum_value would stay Absent and the cast_to_sum_type assertion was never actually exercised by that test. This was masking the bug.


Issues / Observations

1. Limit behavior is silent when collect_stats = false (pre-existing, but worth noting)

When collect_stats = false, num_rows stays Absent after the first file, so conservative_num_rows = usize::MIN = 0. The while-loop is entered but the continue skips both stats accumulation and the limit break check. This means all files are always returned when collect_stats = false, regardless of the limit argument. This is pre-existing behavior, but the PR doesn't document it and there's no test covering this edge case. It may be intentional (can't enforce row limits without knowing row counts), but a comment in the code or a test would help future readers.

2. Misleading test data in collect_stats_false_returns_bare_statistics

make_file_stats(0, 0, rich_col_stats(1, 1, 9, 15, 64)),  // first file: num_rows=0??

Using num_rows=0 for the first file in the collect_stats=false test makes the data look like an unset/empty value rather than a real row count. A non-zero value (e.g. 5) would make it clearer that the stats are being intentionally suppressed, not just happening to be zero.

3. null_count not asserted in the limit test

get_statistics_with_limit_collect_stats_true_limit_marks_inexact checks min_value, max_value, sum_value, and byte_size after hitting the limit, but omits null_count. Files 1 and 2 contribute null_count = 0 + 1 = 1, so the expected value would be Precision::Inexact(1). Minor gap in coverage.

4. Function is deprecated — does the replacement have the same bug?

This function is marked #[deprecated(since = "47.0.0")] in favor of get_files_with_limit + compute_all_files_statistics. It would be worth verifying (or noting in this PR) whether the collect_stats = false path in those replacement functions also correctly keeps statistics absent, or whether the same bug exists there.


Nits

  • file_stats.column_statistics.clone().into_iter() on line 328 — since into_iter() consumes the Vec, the .clone() is needed to avoid moving out of file_stats (which is used to set the file's .statistics). This is fine, but consider cloning only the column_statistics field rather than all of Statistics if this becomes a hot path.
  • Helper functions test_schema, make_file_stats, rich_col_stats are clean and reusable. Good DRY practice.

Verdict

The fix is correct, the test additions significantly improve coverage, and the existing test correction was necessary. The issues above are mostly minor or pre-existing. Addressing point 4 (replacement function parity) would be the most impactful follow-up.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a collect_stats parameter to the get_statistics_with_limit function, enabling conditional collection of detailed statistics. When collect_stats is false, the function now returns bare statistics with Precision::Absent for all fields, while when true, it aggregates statistics including a new byte_size field. New helper functions and comprehensive tests have been added to validate these behaviors, including scenarios with limits. A minor improvement opportunity was noted regarding an unnecessary clone operation on file_stats.column_statistics which could be optimized for efficiency.

num_rows = file_stats.num_rows;
total_byte_size = file_stats.total_byte_size;
for (index, file_column) in
file_stats.column_statistics.clone().into_iter().enumerate()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The .clone() on file_stats.column_statistics is unnecessary here. Since file_stats is an Arc<Statistics>, file_stats.column_statistics is already owned by the Arc. Iterating over &file_stats.column_statistics would avoid an allocation and a deep copy of the vector, improving efficiency. The file_column is then a reference, which is fine for the subsequent assignments.

Suggested change
file_stats.column_statistics.clone().into_iter().enumerate()
file_stats.column_statistics.iter().enumerate()

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Mar 26, 2026

🤖 Augment PR Summary

Summary: Adjusts deprecated get_statistics_with_limit to keep returned summary statistics absent when collect_stats is disabled.

Changes:

  • Gates initialization/aggregation of `num_rows`, `total_byte_size`, and per-column stats on `collect_stats`
  • Ensures first-file column `byte_size` is populated when collecting stats
  • Updates existing tests to pass `collect_stats=true` when validating sum-type casting/merging behavior
  • Adds new unit tests covering `collect_stats=false` (bare/absent statistics) and `collect_stats=true` (aggregation + inexact stats when `limit` truncates)

Technical Notes: The new tests validate both exact aggregation across files and inexact demotion when a limit stops early, aligning the function’s behavior with the ListingTable configuration intent.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

col_stats_set[index].max_value = file_column.max_value;
col_stats_set[index].min_value = file_column.min_value;
col_stats_set[index].sum_value = file_column.sum_value.cast_to_sum_type();
if collect_stats {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With collect_stats=false, num_rows stays Precision::Absent, so the limit logic will treat it as 0 and may end up iterating all files (even when the first file’s file_stats.num_rows already exceeds limit). If limit is still intended to constrain returned FileGroup independently of summary-stat aggregation, this change looks like it could regress that behavior (a targeted test for collect_stats=false + limit would catch it).

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants