20968: feat: add approx_top_k aggregate function#289
martin-augment wants to merge 1 commit into main
Conversation
Add a new approx_top_k(expression, k) aggregate function that returns
the approximate top-k most frequent values with their estimated counts,
using the Filtered Space-Saving algorithm.
The implementation uses a capacity multiplier of 3 (matching ClickHouse's
default) and includes an alpha map for improved accuracy by filtering
low-frequency noise before it enters the main summary.
Return type is List(Struct({value: T, count: UInt64})) ordered by count
descending, where T matches the input column type.
Closes apache#20967
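For reviewers unfamiliar with the algorithm family, here is a minimal sketch of the plain Space-Saving core (without the alpha-map filtering layer this PR adds). All names and types are illustrative, not the PR's actual code; the real implementation keys on serialized bytes and uses a capacity multiplier, which this toy omits:

```rust
use std::collections::HashMap;

/// Minimal Space-Saving summary tracking at most `capacity` distinct items.
/// `counts` maps item -> (estimated_count, error), where `error` bounds how
/// much the estimate may overshoot the item's true frequency.
struct SpaceSaving {
    capacity: usize,
    counts: HashMap<String, (u64, u64)>,
}

impl SpaceSaving {
    fn new(capacity: usize) -> Self {
        Self { capacity, counts: HashMap::new() }
    }

    fn insert(&mut self, item: &str) {
        if let Some(entry) = self.counts.get_mut(item) {
            entry.0 += 1;
        } else if self.counts.len() < self.capacity {
            self.counts.insert(item.to_string(), (1, 0));
        } else {
            // Summary is full: evict the minimum counter and let the
            // newcomer inherit its count as an upper-bound error.
            let (evicted, min_count) = self
                .counts
                .iter()
                .map(|(k, &(c, _))| (k.clone(), c))
                .min_by_key(|&(_, c)| c)
                .unwrap();
            self.counts.remove(&evicted);
            self.counts
                .insert(item.to_string(), (min_count + 1, min_count));
        }
    }

    /// Top-k items by estimated count descending (ties broken by item).
    fn top_k(&self, k: usize) -> Vec<(String, u64)> {
        let mut v: Vec<(String, u64)> =
            self.counts.iter().map(|(i, &(c, _))| (i.clone(), c)).collect();
        v.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        v.truncate(k);
        v
    }
}
```

Filtered Space-Saving adds a hash-indexed alpha map in front of this: counts for items not yet in the summary accumulate in the alpha cells, and an item is only admitted once its filtered estimate exceeds the current minimum, which suppresses low-frequency noise.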
CodeRabbit: no actionable comments were generated in the recent review.
Code Review
This pull request introduces a new aggregate function approx_top_k using the Filtered Space-Saving algorithm. The implementation is comprehensive, including support for various data types, distributed execution via state serialization, and extensive testing through unit tests, integration tests, and SQL logic tests. The documentation has also been updated accordingly.
My review focuses on improving the implementation's clarity, correctness, and robustness. I've suggested refactoring the SpaceSavingSummary::new method for better readability, improving the accuracy of the memory size estimation in SpaceSavingSummary::size, and simplifying the ApproxTopKAccumulator by making input_data_type non-optional.
Overall, this is a solid contribution. Addressing the suggested points will further enhance the quality of this new feature.
```rust
fn new(capacity: usize) -> Self {
    Self {
        counters: Vec::new(),
        counter_map: HashMap::new(),
        alpha_map: Vec::new(),
        requested_capacity: 0,
        target_capacity: 0,
    }
    .resized(capacity)
}
```
The new function creates a temporary SpaceSavingSummary and then immediately calls resized. Since resized (lines 165-175) is not used anywhere else, its logic can be inlined into new for clarity and to avoid creating a temporary struct. This also allows removing the resized function.
```rust
fn new(capacity: usize) -> Self {
    let alpha_map_size = Self::compute_alpha_map_size(capacity);
    let target_capacity = std::cmp::max(64, capacity * 2);
    let mut counters = Vec::new();
    counters.reserve(target_capacity);
    Self {
        counters,
        counter_map: HashMap::new(),
        alpha_map: vec![0u64; alpha_map_size],
        requested_capacity: capacity,
        target_capacity,
    }
}
```
value:good-but-wont-fix; category:bug; feedback: The Gemini AI reviewer is correct! The resized() function is used just once, so it could be inlined directly into new() to simplify the logic. But keeping it as a separate function that returns a new/modified Self makes it more flexible for resizing SpaceSavingSummary instances at any later point.
```rust
fn size(&self) -> usize {
    size_of::<Self>()
        + self
            .counters
            .iter()
            .map(|c| size_of::<Counter>() + c.item.len())
            .sum::<usize>()
        // HashMap overhead: each bucket has a key-value pair plus control byte metadata.
        + self.counter_map.capacity()
            * (size_of::<(Vec<u8>, usize)>() + size_of::<u8>())
        + self.alpha_map.capacity() * size_of::<u64>()
}
```
The `size` method likely underestimates the memory usage of the `SpaceSavingSummary`. Specifically:

- It doesn't account for the heap allocation of the `Vec<u8>` keys in `counter_map`. Since `push_counter` clones the `item`, its memory is allocated twice (once in `counters` and once in `counter_map`).
- It uses `c.item.len()`, which might be less than `c.item.capacity()`, under-reporting allocated memory.
- It doesn't account for the capacity of the `counters` and `alpha_map` vectors themselves, only their contents.

An inaccurate `size` method can lead to poor memory management decisions by the query engine. A more accurate implementation is suggested.
```rust
fn size(&self) -> usize {
    use std::mem::size_of;
    let item_bytes: usize = self.counters.iter().map(|c| c.item.capacity()).sum();
    size_of::<Self>()
        + (self.counters.capacity() * size_of::<Counter>())
        // HashMap overhead. The keys are cloned from `counters` so we account for their heap allocation separately.
        + (self.counter_map.capacity() * (size_of::<(Vec<u8>, usize)>() + size_of::<u8>()))
        + (self.alpha_map.capacity() * size_of::<u64>())
        + item_bytes * 2
}
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! To calculate the size of a collection, one has to multiply its capacity by the size of the collection item. Here a Counter contains an item - a Vec<u8>. These items are duplicated in both the counters Vec and the counter_map HashMap, so they should be counted twice.
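To illustrate the len-versus-capacity gap discussed above, here is a standalone toy comparison (illustrative names, not the PR's actual `Counter` or `size` method):

```rust
use std::mem::size_of;

// Toy stand-in for the PR's counter: an owned byte key plus counts.
struct Counter {
    item: Vec<u8>,
    count: u64,
    error: u64,
}

/// Under-estimate: counts only the bytes currently in use by each key.
fn size_by_len(counters: &[Counter]) -> usize {
    counters
        .iter()
        .map(|c| size_of::<Counter>() + c.item.len())
        .sum()
}

/// Closer estimate: counts the heap bytes each key actually reserved.
fn size_by_capacity(counters: &[Counter]) -> usize {
    counters
        .iter()
        .map(|c| size_of::<Counter>() + c.item.capacity())
        .sum()
}
```

A `Vec<u8>` that grew through repeated pushes (or was created with a large `with_capacity`) can hold far more reserved bytes than `len()` reports, which is exactly the gap a memory accountant should see.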
```rust
summary: SpaceSavingSummary,
k: usize,
/// The data type of the input column. `None` until the first batch is seen.
input_data_type: Option<DataType>,
```
The `input_data_type` is an `Option<DataType>`, but it seems it can always be initialized with a `DataType` in the accumulator factory method, since the planner has access to the input expression's type.

By changing `input_data_type` to a non-optional `DataType`, you can simplify the implementation significantly:

- Remove the need for `unwrap_or` and checks for `is_none`.
- Simplify `state()` serialization and `merge_batch()` logic.
- Make the code more robust by ensuring the type is always present.

This would involve changing `ApproxTopKAccumulator` and its related methods to work with `DataType` directly instead of `Option<DataType>`. The comment on line 686 would also need to be updated.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! Only the tests use None as an initial value. The method signature could be changed to drop the Option and always pass a concrete DataType, avoiding the extra logic needed to support both None and Some.
Code Review:
```rust
let k_expr = &args.exprs[1];
let k = k_expr
    .as_any()
    .downcast_ref::<datafusion_physical_expr::expressions::Literal>()
```
k is currently required to be a Literal physical expression; if it’s wrapped in a cast or other foldable constant expression (e.g. CAST(2 AS BIGINT)), planning/execution may fail even though the value is still a literal. It may be worth extracting k via scalar evaluation (like other UDAFs here) rather than downcasting to Literal.
Severity: medium
```rust
    self.find_counter(item).map(|c| (c.count, c.error))
}

/// Get the top-k items sorted by count descending.
```
The doc comment says results are “sorted by count descending”, but top_k orders by (count - error) first via cmp_by_rank, which can differ from the returned count values. That mismatch could lead to surprising ordering for users (and contradict the documented behavior).
Severity: medium
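The mismatch is easy to reproduce with toy data (illustrative names, not the PR's `cmp_by_rank`): an item whose raw count is inflated by a large error can outrank a smaller-but-certain count in one ordering and not the other.

```rust
struct Counter {
    item: &'static str,
    count: u64,
    error: u64,
}

/// Lower-bound "rank": the count the item is guaranteed to have.
fn guaranteed(c: &Counter) -> u64 {
    c.count - c.error
}

/// Order by guaranteed count descending, as rank-based selection does.
fn order_by_rank(mut v: Vec<Counter>) -> Vec<&'static str> {
    v.sort_by(|a, b| guaranteed(b).cmp(&guaranteed(a)));
    v.into_iter().map(|c| c.item).collect()
}

/// Order by raw estimated count descending, as the doc comment promises.
fn order_by_count(mut v: Vec<Counter>) -> Vec<&'static str> {
    v.sort_by(|a, b| b.count.cmp(&a.count));
    v.into_iter().map(|c| c.item).collect()
}
```

With `x` at count 10 / error 6 (guaranteed 4) and `y` at count 8 / error 0 (guaranteed 8), the two orderings disagree, which is the surprise the comment describes.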
```rust
let return_size = std::cmp::min(sorted.len(), k);

if return_size < sorted.len() {
    sorted.select_nth_unstable_by(return_size - 1, |a, b| a.cmp_by_rank(b));
```
Because cmp_by_rank returns Equal for ties, select_nth_unstable_by can choose an arbitrary subset when multiple items have equal rank, making output non-deterministic for tied frequencies. This can also make snapshot-style tests brittle when there are ties.
Severity: medium
Other Locations
datafusion/core/tests/dataframe/dataframe_functions.rs:428
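One conventional fix (a sketch under assumed field names, not the PR's code) is to extend the comparator into a total order, breaking rank ties on the raw count and finally on the item bytes, so that no two distinct items compare Equal:

```rust
use std::cmp::Ordering;

struct Counter {
    item: Vec<u8>,
    count: u64,
    error: u64,
}

/// Rank comparison extended to a total order: guaranteed count first,
/// then raw count, then the item bytes. Selection and sorting become
/// deterministic because distinct items never compare Equal.
fn cmp_by_rank_total(a: &Counter, b: &Counter) -> Ordering {
    (b.count - b.error)
        .cmp(&(a.count - a.error))
        .then_with(|| b.count.cmp(&a.count))
        .then_with(|| a.item.cmp(&b.item))
}
```

With a total order, `select_nth_unstable_by` partitions the same way on every run and across Rust versions, which also de-flakes the snapshot tests mentioned above.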
```rust
    u64::from_le_bytes(bytes[offset..offset + 8].try_into().unwrap()) as usize;
offset += 8;

if offset + alpha_map_size * 8 > bytes.len() {
```
from_bytes uses arithmetic like offset + alpha_map_size * 8 without guarding against usize overflow; malformed/corrupt state could bypass the length check and panic during slicing. It may be worth adding overflow-safe bounds checks/sanity limits when parsing the serialized state.
Severity: low
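An overflow-safe variant of that bounds check can be sketched with checked arithmetic (illustrative helper, not the PR's code): the multiplication and addition each propagate `None` on overflow instead of wrapping past the length check.

```rust
/// Overflow-safe check that `count` 8-byte words starting at `offset`
/// fit within a buffer of `len` bytes; returns the end offset if so.
fn checked_region_end(offset: usize, count: usize, len: usize) -> Option<usize> {
    let bytes = count.checked_mul(8)?;      // None if count * 8 overflows
    let end = offset.checked_add(bytes)?;   // None if offset + bytes overflows
    (end <= len).then_some(end)
}
```

A corrupt `alpha_map_size` near `usize::MAX` then fails the parse cleanly instead of wrapping around and passing the `> bytes.len()` test.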
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```rust
    }

    bytes
}
```
Serialization drops counters without alpha map update
Medium Severity
serialize() selects the top requested_capacity counters and discards the rest, but unlike truncate_if_needed, it does not add the evicted counters' true counts to the alpha_map before writing. In a distributed merge, this silently loses frequency information for items tracked in the buffer zone (between requested_capacity and target_capacity), degrading the accuracy of the Filtered Space-Saving algorithm for subsequent merges.
Additional Locations (1)
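The behavior the reviewer expects from serialization can be sketched like this (simplified to a single alpha cell and `(item, count)` pairs; illustrative code, not the PR's `truncate_if_needed`): before dropping the buffer-zone counters, fold their counts into the alpha filter rather than discarding them.

```rust
/// Truncate `counters` to the `keep` largest entries, folding the
/// evicted counts into `alpha` so their frequency mass is preserved
/// for later merges instead of being silently lost.
fn truncate_with_alpha(
    counters: &mut Vec<(String, u64)>,
    keep: usize,
    alpha: &mut u64,
) {
    counters.sort_by(|a, b| b.1.cmp(&a.1));
    let keep = keep.min(counters.len());
    for (_, count) in counters.drain(keep..) {
        *alpha += count;
    }
}
```

Applying the same folding step in `serialize()` would keep the serialized state consistent with the in-memory truncation path.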
| "); | ||
|
|
||
| Ok(()) | ||
| } |
Tests assert non-deterministic order of tied elements
Low Severity
The test_fn_approx_top_k snapshot expects value 1 as the second result, but values 1 and 100 are tied (both count=1, error=0). Similarly, the .slt WHERE test expects c over b when tied. Since top_k uses select_nth_unstable_by, which doesn't guarantee order for equal elements, these tests depend on unspecified implementation behavior and could break across Rust versions.
Additional Locations (1)
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The multiplication may overflow if the map is very large. Using a saturating multiplication would avoid this issue.


20968: To review by AI