21149: fix(datasource): keep stats absent when collect_stats is false#295
martin-augment wants to merge 1 commit into main
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a bug in the
Highlights
No actionable comments were generated in the recent review.
Code Review

Summary
This PR fixes a real bug: when

Correctness ✅
The core fix is correct and well-targeted. Wrapping the first-file block in

Bonus Fix ✅
Adding

Existing Test Fix ✅
Changing

Issues / Observations
1. Limit behavior is silent when
When
2. Misleading test data in
make_file_stats(0, 0, rich_col_stats(1, 1, 9, 15, 64)), // first file: num_rows=0??
Using
3.
4. Function is deprecated: does the replacement have the same bug?
This function is marked

Nits

Verdict
The fix is correct, the test additions significantly improve coverage, and the existing test correction was necessary. The issues above are mostly minor or pre-existing. Addressing point 4 (replacement function parity) would be the most impactful follow-up.
Code Review
This pull request introduces a collect_stats parameter to the get_statistics_with_limit function, enabling conditional collection of detailed statistics. When collect_stats is false, the function now returns bare statistics with Precision::Absent for all fields, while when true, it aggregates statistics including a new byte_size field. New helper functions and comprehensive tests have been added to validate these behaviors, including scenarios with limits. A minor improvement opportunity was noted regarding an unnecessary clone operation on file_stats.column_statistics which could be optimized for efficiency.
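The behavior described in this summary can be illustrated with a minimal sketch. The `Precision` enum and `summarize_rows` function below are simplified, hypothetical stand-ins written for this example; they are not the DataFusion types or the actual body of get_statistics_with_limit. The sketch assumes that an aggregate stays exact only while every input is exact, and that an absent input makes the total absent.

```rust
/// Simplified stand-in for a statistics precision wrapper.
/// This is an illustrative model, not the DataFusion `Precision` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Combine two values: the sum is exact only if both inputs are exact,
    /// inexact if both are known but at least one is inexact, and absent
    /// if either input is absent.
    pub fn add(self, other: Precision) -> Precision {
        match (self, other) {
            (Precision::Exact(a), Precision::Exact(b)) => Precision::Exact(a + b),
            (Precision::Exact(a), Precision::Inexact(b))
            | (Precision::Inexact(a), Precision::Exact(b))
            | (Precision::Inexact(a), Precision::Inexact(b)) => Precision::Inexact(a + b),
            _ => Precision::Absent,
        }
    }
}

/// Hypothetical summary of row counts across files.
/// When `collect_stats` is false, nothing is aggregated and the result is absent.
pub fn summarize_rows(per_file_rows: &[Precision], collect_stats: bool) -> Precision {
    if !collect_stats {
        return Precision::Absent;
    }
    // Assumption for this sketch: an empty file list sums to an exact zero.
    per_file_rows
        .iter()
        .copied()
        .fold(Precision::Exact(0), Precision::add)
}
```

Calling `summarize_rows` with `collect_stats` set to false returns `Precision::Absent` regardless of the per-file values, which mirrors the "keep stats absent" intent in the PR title.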
num_rows = file_stats.num_rows;
total_byte_size = file_stats.total_byte_size;
for (index, file_column) in
    file_stats.column_statistics.clone().into_iter().enumerate()
The .clone() on file_stats.column_statistics is unnecessary here. Since file_stats is an Arc<Statistics>, the vector is owned by the value inside the Arc and can be borrowed through it. Iterating over &file_stats.column_statistics avoids allocating and deep-copying the vector. Note that file_column then becomes a reference, so any subsequent assignment that moves a field out of it (such as max_value or min_value) would likely need to clone that field instead.
Suggested change:
- file_stats.column_statistics.clone().into_iter().enumerate()
+ file_stats.column_statistics.iter().enumerate()
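To make the trade-off in this suggestion concrete, here is a small sketch of borrowing through an Arc instead of cloning the whole vector. `ColStats`, `FileStats`, and `copy_bounds` are hypothetical, pared-down types and names invented for this example; the real DataFusion structs have more fields and different value types.

```rust
use std::sync::Arc;

/// Hypothetical, pared-down column statistics (not the DataFusion struct).
#[derive(Debug, Clone, PartialEq)]
pub struct ColStats {
    pub min_value: Option<String>,
    pub max_value: Option<String>,
}

/// Hypothetical per-file statistics holding one entry per column.
pub struct FileStats {
    pub column_statistics: Vec<ColStats>,
}

/// Copy per-column bounds into `out`, borrowing the vector through the Arc.
/// Assumes `out` has at least as many entries as `column_statistics`;
/// indexing panics otherwise.
pub fn copy_bounds(file_stats: &Arc<FileStats>, out: &mut [ColStats]) {
    // `.iter()` borrows the vector, so no new Vec is allocated here.
    for (index, file_column) in file_stats.column_statistics.iter().enumerate() {
        // `file_column` is `&ColStats`: fields cannot be moved out of it,
        // so each non-Copy field is cloned individually.
        out[index].min_value = file_column.min_value.clone();
        out[index].max_value = file_column.max_value.clone();
    }
}
```

In this sketch the per-field clones remain, so the saving is the vector allocation and any fields that are not copied. Whether that matters in the real function depends on which fields the loop actually reads.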
Augment PR Summary
Summary: Adjusts deprecated
Changes:
Technical Notes: The new tests validate both exact aggregation across files and inexact demotion when a
col_stats_set[index].max_value = file_column.max_value;
col_stats_set[index].min_value = file_column.min_value;
col_stats_set[index].sum_value = file_column.sum_value.cast_to_sum_type();
if collect_stats {
With collect_stats=false, num_rows stays Precision::Absent, so the limit logic will treat it as 0 and may end up iterating all files (even when the first file’s file_stats.num_rows already exceeds limit). If limit is still intended to constrain returned FileGroup independently of summary-stat aggregation, this change looks like it could regress that behavior (a targeted test for collect_stats=false + limit would catch it).
Severity: medium
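The concern above can be reproduced in a simplified model. The `RowCount` enum and `files_kept` function below are hypothetical and written only for this illustration; they assume a loop that stops once the running row count exceeds the limit and that reads an absent count as 0. This is a sketch of the reviewer's reasoning, not the actual DataFusion implementation.

```rust
/// Simplified stand-in for a row-count precision (not the DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowCount {
    Exact(usize),
    Absent,
}

/// Hypothetical limit loop: returns how many files are kept before the
/// running row count is seen to exceed `limit`.
pub fn files_kept(per_file_rows: &[usize], limit: usize, collect_stats: bool) -> usize {
    let mut num_rows = RowCount::Absent;
    let mut kept = 0;
    for rows in per_file_rows {
        // An absent count is read as 0, so the limit never appears reached.
        let seen = match num_rows {
            RowCount::Exact(n) => n,
            RowCount::Absent => 0,
        };
        if seen > limit {
            break;
        }
        kept += 1;
        // The running count is only updated when statistics are collected.
        if collect_stats {
            num_rows = RowCount::Exact(seen + rows);
        }
    }
    kept
}
```

Under these assumptions, three files of 100 rows each with a limit of 10 keep one file when `collect_stats` is true and all three when it is false, which is the kind of case a targeted collect_stats=false + limit test would pin down.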
21149: To review by AI