Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .cursor/rules.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

5 changes: 5 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,8 @@ When creating a PR, you MUST follow the [PR template](.github/pull_request_templ

See the [Testing Quick Start](docs/source/contributor-guide/testing.md#testing-quick-start)
for the recommended pre-PR test commands.
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

Original file line number Diff line number Diff line change
Expand Up @@ -22,5 +22,9 @@ mod native;
pub use bytes::BytesDistinctCountAccumulator;
pub use bytes::BytesViewDistinctCountAccumulator;
pub use dict::DictionaryCountAccumulator;
pub use native::Bitmap65536DistinctCountAccumulator;
pub use native::Bitmap65536DistinctCountAccumulatorI16;
pub use native::BoolArray256DistinctCountAccumulator;
pub use native::BoolArray256DistinctCountAccumulatorI8;
pub use native::FloatDistinctCountAccumulator;
pub use native::PrimitiveDistinctCountAccumulator;
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ where
])
}

#[inline(never)]
fn update_batch(&mut self, values: &[ArrayRef]) -> datafusion_common::Result<()> {
if values.is_empty() {
return Ok(());
Expand Down Expand Up @@ -149,6 +150,7 @@ impl<T: ArrowPrimitiveType + Debug> Accumulator for FloatDistinctCountAccumulato
self.values.state()
}

#[inline(never)]
fn update_batch(&mut self, values: &[ArrayRef]) -> datafusion_common::Result<()> {
self.values.update_batch(values)
}
Expand All @@ -165,3 +167,354 @@ impl<T: ArrowPrimitiveType + Debug> Accumulator for FloatDistinctCountAccumulato
size_of_val(self) + self.values.size()
}
}

/// Optimized COUNT DISTINCT accumulator for u8 using a bool array.
/// Uses 256 bytes to track all possible u8 values.
#[derive(Debug)]
pub struct BoolArray256DistinctCountAccumulator {
seen: [bool; 256],
}

impl BoolArray256DistinctCountAccumulator {
pub fn new() -> Self {
Self { seen: [false; 256] }
}

#[inline]
fn count(&self) -> i64 {
self.seen.iter().filter(|&&b| b).count() as i64
}
}

impl Default for BoolArray256DistinctCountAccumulator {
fn default() -> Self {
Self::new()
}
}

impl Accumulator for BoolArray256DistinctCountAccumulator {
#[inline(never)]
fn update_batch(&mut self, values: &[ArrayRef]) -> datafusion_common::Result<()> {
if values.is_empty() {
return Ok(());
}

let arr = as_primitive_array::<arrow::datatypes::UInt8Type>(&values[0])?;
for value in arr.iter().flatten() {
self.seen[value as usize] = true;
}
Comment on lines +203 to +205
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Iterating over the array using iter().flatten() is inefficient for primitive types because it performs an option check for every element. Since this is a performance-oriented optimization, it's better to check null_count() and use values() directly when no nulls are present.

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.seen[value as usize] = true;
            }
        } else {
            for value in arr.iter().flatten() {
                self.seen[value as usize] = true;
            }
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks each item in the array whether it is Some or None. This could be avoided for no-nulls arrays by using the constant check arr.null_count() == 0. This single check could avoid N checks for None, where N is the size of the array.

Ok(())
}

fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> {
if states.is_empty() {
return Ok(());
}

let arr = as_list_array(&states[0])?;
arr.iter().try_for_each(|maybe_list| {
if let Some(list) = maybe_list {
let list = as_primitive_array::<arrow::datatypes::UInt8Type>(&list)?;
for value in list.values().iter() {
self.seen[*value as usize] = true;
}
};
Ok(())
})
}

fn state(&mut self) -> datafusion_common::Result<Vec<ScalarValue>> {
let values: Vec<u8> = self
.seen
.iter()
.enumerate()
.filter_map(|(idx, &seen)| if seen { Some(idx as u8) } else { None })
.collect();

let arr = Arc::new(
PrimitiveArray::<arrow::datatypes::UInt8Type>::from_iter_values(values),
);
Ok(vec![
SingleRowListArrayBuilder::new(arr).build_list_scalar(),
])
}

fn evaluate(&mut self) -> datafusion_common::Result<ScalarValue> {
Ok(ScalarValue::Int64(Some(self.count())))
}

fn size(&self) -> usize {
size_of_val(self) + 256
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In size(), size_of_val(self) already includes the inline [bool; 256] storage, so the extra + 256 double-counts memory. This can skew DataFusion’s memory accounting/limits for COUNT DISTINCT on small integer types.

Severity: medium

Other Locations
  • datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs:331

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The seen field is an array, so it is stored in the stack space and thus is included in the calculated value by size_of_val(). It should not be counted manually too. Prevents overcalculating the size of the accumulator.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The size method is double-counting the memory used by the seen array. Since seen is a fixed-size array within the struct, size_of_val(self) already includes its size (256 bytes).

Suggested change
size_of_val(self) + 256
size_of_val(self)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The seen field is an array, so it is stored in the stack space and thus is included in the calculated value by size_of_val(). It should not be counted manually too. Prevents overcalculating the size of the accumulator.

}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bool array accumulators double-count memory in size()

Low Severity

The size() method for BoolArray256DistinctCountAccumulator and BoolArray256DistinctCountAccumulatorI8 returns size_of_val(self) + 256, but the seen: [bool; 256] field is stored inline in the struct — not on the heap. size_of_val(self) already includes the 256-byte array, so adding 256 double-counts it, reporting ~512 bytes instead of 256. Contrast this with the Bitmap65536 variants, which correctly add 8192 for the heap-allocated Box<[u64; 1024]> that size_of_val only accounts for as a pointer. This overestimate in memory tracking could cause premature spilling or other unnecessary memory-pressure actions.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b564773. Configure here.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The seen field is an array, so it is stored in the stack space and thus is included in the calculated value by size_of_val(). It should not be counted manually too. Prevents overcalculating the size of the accumulator.

}

/// Optimized COUNT DISTINCT accumulator for i8 using a bool array.
/// Uses 256 bytes to track all possible i8 values (mapped to 0..255).
#[derive(Debug)]
pub struct BoolArray256DistinctCountAccumulatorI8 {
seen: [bool; 256],
}

impl BoolArray256DistinctCountAccumulatorI8 {
pub fn new() -> Self {
Self { seen: [false; 256] }
}

#[inline]
fn count(&self) -> i64 {
self.seen.iter().filter(|&&b| b).count() as i64
}
}

impl Default for BoolArray256DistinctCountAccumulatorI8 {
fn default() -> Self {
Self::new()
}
}

impl Accumulator for BoolArray256DistinctCountAccumulatorI8 {
#[inline(never)]
fn update_batch(&mut self, values: &[ArrayRef]) -> datafusion_common::Result<()> {
if values.is_empty() {
return Ok(());
}

let arr = as_primitive_array::<arrow::datatypes::Int8Type>(&values[0])?;
for value in arr.iter().flatten() {
self.seen[value as u8 as usize] = true;
}
Comment on lines +283 to +285
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the u8 accumulator, using iter().flatten() is less efficient than direct slice access when nulls are absent. Checking null_count() allows for a faster path using values().

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.seen[value as u8 as usize] = true;
            }
        } else {
            for value in arr.iter().flatten() {
                self.seen[value as u8 as usize] = true;
            }
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks each item in the array whether it is Some or None. This could be avoided for no-nulls arrays by using the constant check arr.null_count() == 0. This single check could avoid N checks for None, where N is the size of the array.

Ok(())
}

fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> {
if states.is_empty() {
return Ok(());
}

let arr = as_list_array(&states[0])?;
arr.iter().try_for_each(|maybe_list| {
if let Some(list) = maybe_list {
let list = as_primitive_array::<arrow::datatypes::Int8Type>(&list)?;
for value in list.values().iter() {
self.seen[*value as u8 as usize] = true;
}
};
Ok(())
})
}

fn state(&mut self) -> datafusion_common::Result<Vec<ScalarValue>> {
let values: Vec<i8> = self
.seen
.iter()
.enumerate()
.filter_map(
|(idx, &seen)| {
if seen { Some(idx as u8 as i8) } else { None }
},
)
.collect();

let arr = Arc::new(
PrimitiveArray::<arrow::datatypes::Int8Type>::from_iter_values(values),
);
Ok(vec![
SingleRowListArrayBuilder::new(arr).build_list_scalar(),
])
}

fn evaluate(&mut self) -> datafusion_common::Result<ScalarValue> {
Ok(ScalarValue::Int64(Some(self.count())))
}

fn size(&self) -> usize {
size_of_val(self) + 256
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The size method is double-counting the memory used by the seen array. size_of_val(self) already includes the 256 bytes of the fixed-size array.

Suggested change
size_of_val(self) + 256
size_of_val(self)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The seen field is an array, so it is stored in the stack space and thus is included in the calculated value by size_of_val(). It should not be counted manually too. Prevents overcalculating the size of the accumulator.

}
}

/// Optimized COUNT DISTINCT accumulator for u16 using a 65536-bit bitmap.
/// Uses 8KB (1024 x u64) to track all possible u16 values.
#[derive(Debug)]
pub struct Bitmap65536DistinctCountAccumulator {
bitmap: Box<[u64; 1024]>,
}

impl Bitmap65536DistinctCountAccumulator {
pub fn new() -> Self {
Self {
bitmap: Box::new([0; 1024]),
}
}

#[inline]
fn set_bit(&mut self, value: u16) {
let word = (value / 64) as usize;
let bit = value % 64;
self.bitmap[word] |= 1u64 << bit;
}

#[inline]
fn count(&self) -> i64 {
self.bitmap.iter().map(|w| w.count_ones() as i64).sum()
}
}

impl Default for Bitmap65536DistinctCountAccumulator {
fn default() -> Self {
Self::new()
}
}

impl Accumulator for Bitmap65536DistinctCountAccumulator {
#[inline(never)]
fn update_batch(&mut self, values: &[ArrayRef]) -> datafusion_common::Result<()> {
if values.is_empty() {
return Ok(());
}

let arr = as_primitive_array::<arrow::datatypes::UInt16Type>(&values[0])?;
for value in arr.iter().flatten() {
self.set_bit(value);
}
Comment on lines +376 to +378
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Performance can be improved by avoiding the overhead of Option iteration when the array has no nulls.

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.set_bit(value);
            }
        } else {
            for value in arr.iter().flatten() {
                self.set_bit(value);
            }
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks each item in the array whether it is Some or None. This could be avoided for no-nulls arrays by using the constant check arr.null_count() == 0. This single check could avoid N checks for None, where N is the size of the array.

Ok(())
}

fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> {
if states.is_empty() {
return Ok(());
}

let arr = as_list_array(&states[0])?;
arr.iter().try_for_each(|maybe_list| {
if let Some(list) = maybe_list {
let list = as_primitive_array::<arrow::datatypes::UInt16Type>(&list)?;
for value in list.values().iter() {
self.set_bit(*value);
}
};
Ok(())
})
}

fn state(&mut self) -> datafusion_common::Result<Vec<ScalarValue>> {
let mut values = Vec::new();
for (word_idx, &word) in self.bitmap.iter().enumerate() {
if word != 0 {
for bit in 0..64 {
if (word & (1u64 << bit)) != 0 {
values.push((word_idx as u16) * 64 + bit);
}
}
}
}
Comment on lines +401 to +409
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Iterating through all 64 bits for every non-zero word is inefficient, especially for sparse bitmaps. Using trailing_zeros() to find set bits is significantly faster and more idiomatic for bitmap processing.

        for (word_idx, &word) in self.bitmap.iter().enumerate() {
            let mut w = word;
            while w != 0 {
                let bit = w.trailing_zeros();
                values.push((word_idx as u16) * 64 + bit as u16);
                w &= !(1u64 << bit);
            }
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! By using u64::trailing_zeros() the algorithm could be optimized to process only the 1s in the bitmap from back to front. This way it will iterate only the 1s which could lead to a big gain for sparse bitmaps.


let arr = Arc::new(
PrimitiveArray::<arrow::datatypes::UInt16Type>::from_iter_values(values),
);
Ok(vec![
SingleRowListArrayBuilder::new(arr).build_list_scalar(),
])
}

fn evaluate(&mut self) -> datafusion_common::Result<ScalarValue> {
Ok(ScalarValue::Int64(Some(self.count())))
}

fn size(&self) -> usize {
size_of_val(self) + 8192
}
}

/// Optimized COUNT DISTINCT accumulator for i16 using a 65536-bit bitmap.
/// Uses 8KB (1024 x u64) to track all possible i16 values (mapped to 0..65535).
#[derive(Debug)]
pub struct Bitmap65536DistinctCountAccumulatorI16 {
bitmap: Box<[u64; 1024]>,
}

impl Bitmap65536DistinctCountAccumulatorI16 {
pub fn new() -> Self {
Self {
bitmap: Box::new([0; 1024]),
}
}

#[inline]
fn set_bit(&mut self, value: i16) {
let idx = value as u16;
let word = (idx / 64) as usize;
let bit = idx % 64;
self.bitmap[word] |= 1u64 << bit;
}

#[inline]
fn count(&self) -> i64 {
self.bitmap.iter().map(|w| w.count_ones() as i64).sum()
}
}

impl Default for Bitmap65536DistinctCountAccumulatorI16 {
fn default() -> Self {
Self::new()
}
}

impl Accumulator for Bitmap65536DistinctCountAccumulatorI16 {
#[inline(never)]
fn update_batch(&mut self, values: &[ArrayRef]) -> datafusion_common::Result<()> {
if values.is_empty() {
return Ok(());
}

let arr = as_primitive_array::<arrow::datatypes::Int16Type>(&values[0])?;
for value in arr.iter().flatten() {
self.set_bit(value);
}
Comment on lines +470 to +472
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Performance can be improved by avoiding the overhead of Option iteration when the array has no nulls.

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.set_bit(value);
            }
        } else {
            for value in arr.iter().flatten() {
                self.set_bit(value);
            }
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks each item in the array whether it is Some or None. This could be avoided for no-nulls arrays by using the constant check arr.null_count() == 0. This single check could avoid N checks for None, where N is the size of the array.

Ok(())
}

fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> {
if states.is_empty() {
return Ok(());
}

let arr = as_list_array(&states[0])?;
arr.iter().try_for_each(|maybe_list| {
if let Some(list) = maybe_list {
let list = as_primitive_array::<arrow::datatypes::Int16Type>(&list)?;
for value in list.values().iter() {
self.set_bit(*value);
}
};
Ok(())
})
}

fn state(&mut self) -> datafusion_common::Result<Vec<ScalarValue>> {
let mut values = Vec::new();
for (word_idx, &word) in self.bitmap.iter().enumerate() {
if word != 0 {
for bit in 0..64 {
if (word & (1u64 << bit)) != 0 {
values.push(((word_idx as u16) * 64 + bit) as i16);
}
}
}
}
Comment on lines +495 to +503
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Iterating through all 64 bits for every non-zero word is inefficient. Using trailing_zeros() to find set bits is significantly faster.

        for (word_idx, &word) in self.bitmap.iter().enumerate() {
            let mut w = word;
            while w != 0 {
                let bit = w.trailing_zeros();
                values.push(((word_idx as u16) * 64 + bit as u16) as i16);
                w &= !(1u64 << bit);
            }
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! By using u64::trailing_zeros() the algorithm could be optimized to process only the 1s in the bitmap from back to front. This way it will iterate only the 1s which could lead to a big gain for sparse bitmaps.


let arr = Arc::new(
PrimitiveArray::<arrow::datatypes::Int16Type>::from_iter_values(values),
);
Ok(vec![
SingleRowListArrayBuilder::new(arr).build_list_scalar(),
])
}

fn evaluate(&mut self) -> datafusion_common::Result<ScalarValue> {
Ok(ScalarValue::Int64(Some(self.count())))
}

fn size(&self) -> usize {
size_of_val(self) + 8192
}
}
Loading
Loading