
21456: perf: Optimize count distinct using bitmaps instead of hashsets for smaller datatypes #307

Open
martin-augment wants to merge 19 commits into main from pr-21456-2026-04-13-07-38-00

Conversation

@martin-augment
Owner

21456: To review by AI

@coderabbitai

coderabbitai bot commented Apr 13, 2026

Walkthrough

This pull request adds code review guidelines for AI agents and introduces four new specialized accumulator implementations for COUNT DISTINCT operations on small integer types. New structs BoolArray256DistinctCountAccumulator, BoolArray256DistinctCountAccumulatorI8, Bitmap65536DistinctCountAccumulator, and Bitmap65536DistinctCountAccumulatorI16 are implemented in the datafusion/functions-aggregate-common module. The accumulator selection logic in datafusion/functions-aggregate is modified to route UInt8, Int8, UInt16, and Int16 types to these optimized accumulators rather than the generic variant. Public re-exports are added to expose the new accumulator types.


@claude

claude bot commented Apr 13, 2026

Code Review

The approach of using bitmaps/bool-arrays for small integer types in COUNT DISTINCT is sound and can yield real performance gains. Here are the issues I found:


Bug: size() double-counts inline array memory

For BoolArray256DistinctCountAccumulator and BoolArray256DistinctCountAccumulatorI8, the seen field ([bool; 256]) is inline in the struct — it is not heap-allocated. size_of_val(self) already includes those 256 bytes, so adding + 256 double-counts:

// BoolArray256DistinctCountAccumulator::size() — incorrect
fn size(&self) -> usize {
    size_of_val(self) + 256  // size_of_val already includes the 256-byte seen array
}

Should be:

fn size(&self) -> usize {
    size_of_val(self)
}

The Bitmap65536 variants are correct — bitmap: Box<[u64; 1024]> is heap-allocated, so size_of_val(self) only captures the pointer (8 bytes) and + 8192 covers the allocation.
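The inline-vs-heap distinction can be checked directly. A minimal sketch with hypothetical stand-in structs (not the PR's actual types) mirroring the two layouts:

```rust
use std::mem::size_of_val;

// Hypothetical struct with the [bool; 256] stored inline, like the BoolArray256 variants.
struct InlineSeen {
    seen: [bool; 256], // inline: counted by size_of_val
}

// Hypothetical struct with a heap-allocated bitmap, like the Bitmap65536 variants.
struct BoxedBitmap {
    bitmap: Box<[u64; 1024]>, // heap: size_of_val sees only the pointer
}

fn main() {
    let a = InlineSeen { seen: [false; 256] };
    let b = BoxedBitmap { bitmap: Box::new([0u64; 1024]) };

    // The inline array is already inside the struct's own footprint...
    assert!(size_of_val(&a) >= 256);
    // ...while the boxed array contributes only a pointer-sized field,
    // so its 8192-byte allocation must be added manually in size().
    assert!(size_of_val(&b) < 8192);
}
```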


Profiling artifact leaked into production: #[inline(never)]

The PR adds #[inline(never)] to the pre-existing PrimitiveDistinctCountAccumulator::update_batch and FloatDistinctCountAccumulator::update_batch. This looks like a profiling artifact. Forcing no-inlining on hot paths can hurt performance and should not be in a production commit. Please remove those two annotations from the existing accumulators.


Missing tests

No tests are added for any of the four new accumulator types. At minimum, each should have:

  • A basic update_batch + evaluate roundtrip
  • A state + merge_batch roundtrip (verify partial aggregation works correctly)
  • A test with all-null input (should return 0)

Significant code duplication

The four new structs share almost identical structure. The signed/unsigned pairs (u8/i8 and u16/i16) differ only in the type cast for array access. A generic approach with a trait or a macro would reduce ~350 lines of near-identical code to a fraction of that. Consider whether upstream DataFusion would accept a generified version before landing this.
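A macro-based deduplication could look like the following sketch. The macro and type names are hypothetical, and the tracker is reduced to the seen-array core rather than DataFusion's full Accumulator trait:

```rust
// Hypothetical macro: one body generates a tracker per native type; the
// signed/unsigned pairs differ only in the index-mapping closure.
macro_rules! small_int_tracker {
    ($name:ident, $native:ty, $to_index:expr) => {
        struct $name {
            seen: [bool; 256],
        }
        impl $name {
            fn new() -> Self {
                Self { seen: [false; 256] }
            }
            fn insert(&mut self, v: $native) {
                // $to_index maps the native value into 0..256
                self.seen[($to_index)(v)] = true;
            }
            fn count(&self) -> usize {
                self.seen.iter().filter(|&&b| b).count()
            }
        }
    };
}

small_int_tracker!(TrackerU8, u8, |v: u8| v as usize);
small_int_tracker!(TrackerI8, i8, |v: i8| v as u8 as usize);

fn main() {
    let mut t = TrackerI8::new();
    for v in [-128i8, -1, 0, 127, -1] {
        t.insert(v);
    }
    assert_eq!(t.count(), 4); // -1 counted once
}
```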


BoolArray256 should use a bitmap, not [bool; 256]

For u8/i8 (256 possible values), a [u64; 4] bitmap (32 bytes) would:

  • Use 8× less memory than [bool; 256]
  • Count faster via count_ones() (hardware POPCNT), consistent with the u16/i16 implementation

The current count() iterates 256 elements instead of 4 words. If the goal is performance parity with the bitmap-based u16 path, the u8 path should also use a bitmap.
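A minimal sketch of the suggested [u64; 4] bitmap, with hypothetical names:

```rust
// 256 bits in 32 bytes, vs 256 bytes for [bool; 256].
struct Bitmap256 {
    words: [u64; 4],
}

impl Bitmap256 {
    fn new() -> Self {
        Self { words: [0; 4] }
    }
    fn set(&mut self, v: u8) {
        // High 2 bits select the word, low 6 bits select the bit.
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }
    fn count(&self) -> usize {
        // count_ones() compiles to POPCNT where available: 4 words, not 256 bools.
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}

fn main() {
    let mut b = Bitmap256::new();
    for v in [0u8, 63, 64, 255, 255] {
        b.set(v);
    }
    assert_eq!(b.count(), 4);
    assert_eq!(std::mem::size_of::<Bitmap256>(), 32);
}
```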


Naming inconsistency

  • BoolArray256DistinctCountAccumulator vs Bitmap65536DistinctCountAccumulator — different naming schemes for the same concept
  • I8/I16 suffixes for signed variants is unusual; consider a consistent pattern or a single generic struct parameterized on the sign

Minor: .unwrap() in get_count_accumulator

DataType::UInt8 | DataType::Int8 | DataType::UInt16 | DataType::Int16 => {
    get_small_int_accumulator(data_type).unwrap()
}

The unwrap() would panic if get_small_int_accumulator returned an error. Since get_count_accumulator returns Box<dyn Accumulator> (not Result), there is no way to propagate the error. This is only safe because the match arm guarantees the data type is handled — but it is fragile. A #[allow(unreachable_patterns)] + unreachable!() or restructuring to avoid the fallible helper would be cleaner.
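One way to avoid the fallible helper is to make unsupported types unrepresentable. A simplified sketch with hypothetical names (DataFusion's real Accumulator trait and DataType enum are much larger):

```rust
// Hypothetical narrowed enum: only the types the match arm admits exist here,
// so the constructor cannot fail and needs no unwrap()/unreachable!().
#[derive(Clone, Copy)]
enum SmallIntType {
    UInt8,
    Int8,
    UInt16,
    Int16,
}

// Stand-in for DataFusion's Accumulator trait.
trait Accumulator {
    fn name(&self) -> &'static str;
}

struct BoolArrayAcc;
struct BitmapAcc;

impl Accumulator for BoolArrayAcc {
    fn name(&self) -> &'static str { "bool-array" }
}
impl Accumulator for BitmapAcc {
    fn name(&self) -> &'static str { "bitmap" }
}

// Infallible: the type system guarantees every case is handled.
fn small_int_accumulator(t: SmallIntType) -> Box<dyn Accumulator> {
    match t {
        SmallIntType::UInt8 | SmallIntType::Int8 => Box::new(BoolArrayAcc),
        SmallIntType::UInt16 | SmallIntType::Int16 => Box::new(BitmapAcc),
    }
}

fn main() {
    assert_eq!(small_int_accumulator(SmallIntType::Int16).name(), "bitmap");
    assert_eq!(small_int_accumulator(SmallIntType::UInt8).name(), "bool-array");
}
```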


Summary: The core idea is solid, but there is a memory-accounting bug in size() for the bool-array variants, a profiling annotation left in on existing code, no tests, and significant code duplication. Please address at least the size() bug and add tests before merging.

@augmentcode

augmentcode bot commented Apr 13, 2026

🤖 Augment PR Summary

Summary: Improves COUNT(DISTINCT) performance for small integer types by replacing HashSet-based tracking with compact bitset/array tracking.

Changes:

  • Added specialized COUNT DISTINCT accumulators for u8/i8 (256-entry bool array) and u16/i16 (65536-bit bitmap).
  • Exported the new accumulator types from functions-aggregate-common for reuse.
  • Updated COUNT DISTINCT accumulator selection to route small integer types through a cold helper that constructs these specialized accumulators.
  • Kept existing HashSet-based accumulators for larger integer types and other existing specialized implementations (bytes/float/etc.).

Technical Notes:
Uses fixed-size tracking structures (256 bytes or ~8KB) to reduce hashing overhead and improve locality for small domains; preserves existing state/merge representation via single-row List states.

🤖 Was this summary useful? React with 👍 or 👎


@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


}

fn size(&self) -> usize {
size_of_val(self) + 256


In size(), size_of_val(self) already includes the inline [bool; 256] storage, so the extra + 256 double-counts memory. This can skew DataFusion’s memory accounting/limits for COUNT DISTINCT on small integer types.

Severity: medium

Other Locations
  • datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs:331



Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! The seen field is an inline array, so its 256 bytes are already included in the value returned by size_of_val() and must not be added again manually. Fixing this prevents overestimating the size of the accumulator.

),
// Small int types - cold path
DataType::UInt8 | DataType::Int8 | DataType::UInt16 | DataType::Int16 => {
get_small_int_accumulator(data_type).unwrap()


This change routes UInt8/Int8/UInt16/Int16 through new bitmap/bool-array accumulators, but there don’t appear to be unit tests covering these new distinct paths (especially state()/merge_batch() behavior). Adding coverage here would help catch regressions in distinct correctness across partial/final aggregation.

Severity: medium



Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! No test cases for this new code path are added in the pull request, so the new accumulators may not behave as desired and could regress in the future.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces optimized COUNT DISTINCT accumulators for small integer types (u8, i8, u16, i16) using boolean arrays and bitmaps, and updates the count aggregate function to utilize these specialized paths. Additionally, it adds guidelines for AI agents to avoid reviewing specific configuration files or linking to external repositories. The review feedback suggests several performance optimizations, including bypassing Option iteration in update_batch when nulls are absent and using trailing_zeros() for faster bitmap processing, as well as correcting memory size calculations to avoid double-counting fixed-size arrays.

Comment on lines +203 to +205
for value in arr.iter().flatten() {
self.seen[value as usize] = true;
}


medium

Iterating over the array using iter().flatten() is inefficient for primitive types because it performs an option check for every element. Since this is a performance-oriented optimization, it's better to check null_count() and use values() directly when no nulls are present.

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.seen[value as usize] = true;
            }
        } else {
            for value in arr.iter().flatten() {
                self.seen[value as usize] = true;
            }
        }

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks whether each item in the array is Some or None. For arrays without nulls this can be skipped entirely via the constant-time check arr.null_count() == 0, replacing N per-item None checks (where N is the size of the array) with a single one.

}

fn size(&self) -> usize {
size_of_val(self) + 256


medium

The size method is double-counting the memory used by the seen array. Since seen is a fixed-size array within the struct, size_of_val(self) already includes its size (256 bytes).

Suggested change
size_of_val(self) + 256
size_of_val(self)

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The seen field is an inline array, so its 256 bytes are already included in the value returned by size_of_val() and must not be added again manually. Fixing this prevents overestimating the size of the accumulator.

Comment on lines +283 to +285
for value in arr.iter().flatten() {
self.seen[value as u8 as usize] = true;
}


medium

Similar to the u8 accumulator, using iter().flatten() is less efficient than direct slice access when nulls are absent. Checking null_count() allows for a faster path using values().

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.seen[value as u8 as usize] = true;
            }
        } else {
            for value in arr.iter().flatten() {
                self.seen[value as u8 as usize] = true;
            }
        }

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks whether each item in the array is Some or None. For arrays without nulls this can be skipped entirely via the constant-time check arr.null_count() == 0, replacing N per-item None checks (where N is the size of the array) with a single one.

}

fn size(&self) -> usize {
size_of_val(self) + 256


medium

The size method is double-counting the memory used by the seen array. size_of_val(self) already includes the 256 bytes of the fixed-size array.

Suggested change
size_of_val(self) + 256
size_of_val(self)

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The seen field is an inline array, so its 256 bytes are already included in the value returned by size_of_val() and must not be added again manually. Fixing this prevents overestimating the size of the accumulator.

Comment on lines +376 to +378
for value in arr.iter().flatten() {
self.set_bit(value);
}


medium

Performance can be improved by avoiding the overhead of Option iteration when the array has no nulls.

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.set_bit(value);
            }
        } else {
            for value in arr.iter().flatten() {
                self.set_bit(value);
            }
        }

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks whether each item in the array is Some or None. For arrays without nulls this can be skipped entirely via the constant-time check arr.null_count() == 0, replacing N per-item None checks (where N is the size of the array) with a single one.

Comment on lines +401 to +409
for (word_idx, &word) in self.bitmap.iter().enumerate() {
if word != 0 {
for bit in 0..64 {
if (word & (1u64 << bit)) != 0 {
values.push((word_idx as u16) * 64 + bit);
}
}
}
}


medium

Iterating through all 64 bits for every non-zero word is inefficient, especially for sparse bitmaps. Using trailing_zeros() to find set bits is significantly faster and more idiomatic for bitmap processing.

        for (word_idx, &word) in self.bitmap.iter().enumerate() {
            let mut w = word;
            while w != 0 {
                let bit = w.trailing_zeros();
                values.push((word_idx as u16) * 64 + bit as u16);
                w &= !(1u64 << bit);
            }
        }

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! With u64::trailing_zeros() the loop visits only the set bits of each word, lowest bit first, instead of testing all 64 bit positions. Iterating only the 1s can be a big gain for sparse bitmaps.
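The bit-extraction loop can be sketched in isolation; set_bits here is a hypothetical helper, not code from the PR:

```rust
// Visit only the set bits of a word, lowest bit first.
fn set_bits(word: u64) -> Vec<u32> {
    let mut w = word;
    let mut bits = Vec::new();
    while w != 0 {
        let bit = w.trailing_zeros(); // index of the lowest set bit
        bits.push(bit);
        w &= w - 1; // clear the lowest set bit
    }
    bits
}

fn main() {
    // 0b1000_0101 has bits 0, 2, and 7 set; the loop runs 3 times, not 64.
    assert_eq!(set_bits(0b1000_0101), vec![0, 2, 7]);
    assert_eq!(set_bits(0), Vec::<u32>::new());
}
```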

Comment on lines +470 to +472
for value in arr.iter().flatten() {
self.set_bit(value);
}


medium

Performance can be improved by avoiding the overhead of Option iteration when the array has no nulls.

        if arr.null_count() == 0 {
            for &value in arr.values() {
                self.set_bit(value);
            }
        } else {
            for value in arr.iter().flatten() {
                self.set_bit(value);
            }
        }

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The flatten() operation checks whether each item in the array is Some or None. For arrays without nulls this can be skipped entirely via the constant-time check arr.null_count() == 0, replacing N per-item None checks (where N is the size of the array) with a single one.

Comment on lines +495 to +503
for (word_idx, &word) in self.bitmap.iter().enumerate() {
if word != 0 {
for bit in 0..64 {
if (word & (1u64 << bit)) != 0 {
values.push(((word_idx as u16) * 64 + bit) as i16);
}
}
}
}


medium

Iterating through all 64 bits for every non-zero word is inefficient. Using trailing_zeros() to find set bits is significantly faster.

        for (word_idx, &word) in self.bitmap.iter().enumerate() {
            let mut w = word;
            while w != 0 {
                let bit = w.trailing_zeros();
                values.push(((word_idx as u16) * 64 + bit as u16) as i16);
                w &= !(1u64 << bit);
            }
        }

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! With u64::trailing_zeros() the loop visits only the set bits of each word, lowest bit first, instead of testing all 64 bit positions. Iterating only the 1s can be a big gain for sparse bitmaps.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit b564773.


fn size(&self) -> usize {
size_of_val(self) + 256
}


Bool array accumulators double-count memory in size()

Low Severity

The size() method for BoolArray256DistinctCountAccumulator and BoolArray256DistinctCountAccumulatorI8 returns size_of_val(self) + 256, but the seen: [bool; 256] field is stored inline in the struct — not on the heap. size_of_val(self) already includes the 256-byte array, so adding 256 double-counts it, reporting ~512 bytes instead of 256. Contrast this with the Bitmap65536 variants, which correctly add 8192 for the heap-allocated Box<[u64; 1024]> that size_of_val only accounts for as a pointer. This overestimate in memory tracking could cause premature spilling or other unnecessary memory-pressure actions.

Additional Locations (1)

Owner Author


value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The seen field is an inline array, so its 256 bytes are already included in the value returned by size_of_val() and must not be added again manually. Fixing this prevents overestimating the size of the accumulator.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs (1)

275-324: Add signed round-trip coverage before this ships.

The Int8 and Int16 variants rely on remapping through u8/u16 plus custom state() serialization. A focused update_batch -> state -> merge_batch -> evaluate test around -128, -1, 0, 127 and i16::MIN, -1, 0, i16::MAX would lock down the edge values most likely to regress.

Also applies to: 462-511
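The requested round-trip can be illustrated on a stripped-down tracker; I8Tracker and its methods are hypothetical stand-ins for the accumulator API, not DataFusion code:

```rust
// Sketch of the i8 remap-through-u8 round-trip the reviewer wants tested.
struct I8Tracker {
    seen: [bool; 256],
}

impl I8Tracker {
    fn new() -> Self {
        Self { seen: [false; 256] }
    }
    // Analogue of update_batch: remap i8 -> u8 for indexing.
    fn update(&mut self, values: &[i8]) {
        for &v in values {
            self.seen[v as u8 as usize] = true;
        }
    }
    // Analogue of state(): export seen values back as i8 for merging.
    fn state(&self) -> Vec<i8> {
        (0..256usize)
            .filter(|&i| self.seen[i])
            .map(|i| i as u8 as i8) // remap back u8 -> i8
            .collect()
    }
    // Analogue of merge_batch: replay a partial state.
    fn merge(&mut self, state: &[i8]) {
        self.update(state);
    }
    fn evaluate(&self) -> usize {
        self.seen.iter().filter(|&&b| b).count()
    }
}

fn main() {
    // Edge values most likely to regress: -128, -1, 0, 127.
    let mut partial = I8Tracker::new();
    partial.update(&[-128, -1, 0, 127, -1]);

    let mut fin = I8Tracker::new();
    fin.merge(&partial.state()); // state -> merge round-trip
    assert_eq!(fin.evaluate(), 4);
}
```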

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs`
around lines 275 - 324, Add unit tests that exercise signed integer round-trip
for the Int8 and Int16 distinct-count accumulators: specifically create tests
that call update_batch on BoolArray256DistinctCountAccumulatorI8 (and the Int16
counterpart), then call state(), then merge_batch() with that state, and finally
evaluate() to assert correct counts for edge values (-128, -1, 0, 127 for i8 and
i16::MIN, -1, 0, i16::MAX for i16). Ensure tests cover both direct update_batch
inputs and list-encoded states used by merge_batch so the remapping through
u8/u16 and the custom state() serialization is validated end-to-end (use the
functions/methods update_batch, state, merge_batch, evaluate and the accumulator
types named in the diff).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@datafusion/functions-aggregate/src/count.rs`:
- Around line 191-193: The DataType::UInt16 | DataType::Int16 match branch
currently forces creation of the full-bitmap accumulator via
get_small_int_accumulator(data_type).unwrap(), causing eager 8 KiB allocations
per group; change this branch to return the lazy/adaptive accumulator used for
other small-int types (i.e., the accumulator that grows with cardinality and
only materializes a bitmap when needed) instead of the fixed bitmap one so
DISTINCT keeps one growable Accumulator per live group; update the
DataType::UInt16 | DataType::Int16 arm to use the existing lazy/adaptive factory
(or the previous get_small_int_accumulator variant that is lazy) and ensure
types/unwrap usage match the surrounding code.

---

Nitpick comments:
In
`@datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs`:
- Around line 275-324: Add unit tests that exercise signed integer round-trip
for the Int8 and Int16 distinct-count accumulators: specifically create tests
that call update_batch on BoolArray256DistinctCountAccumulatorI8 (and the Int16
counterpart), then call state(), then merge_batch() with that state, and finally
evaluate() to assert correct counts for edge values (-128, -1, 0, 127 for i8 and
i16::MIN, -1, 0, i16::MAX for i16). Ensure tests cover both direct update_batch
inputs and list-encoded states used by merge_batch so the remapping through
u8/u16 and the custom state() serialization is validated end-to-end (use the
functions/methods update_batch, state, merge_batch, evaluate and the accumulator
types named in the diff).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 886dab1a-6bd0-429f-99f2-f9821614f257

📥 Commits

Reviewing files that changed from the base of the PR and between c68fbab and b564773.

📒 Files selected for processing (5)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/functions-aggregate-common/src/aggregate/count_distinct.rs
  • datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
  • datafusion/functions-aggregate/src/count.rs

Comment on lines +191 to +193
DataType::UInt8 | DataType::Int8 | DataType::UInt16 | DataType::Int16 => {
get_small_int_accumulator(data_type).unwrap()
}


⚠️ Potential issue | 🟠 Major

Avoid routing all UInt16/Int16 groups to a fixed 8 KiB accumulator.

The UInt16/Int16 half of this branch now selects accumulators that eagerly allocate the full bitmap. DISTINCT still uses one Accumulator per live group here, so sparse groups go from “grow with cardinality” to “always 8 KiB”. On wide GROUP BY COUNT(DISTINCT ...) plans that is a sizable memory regression and can force early spill/OOM. Please keep the u16/i16 path lazy/adaptive instead of switching every case to the bitmap.
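The lazy/adaptive idea the reviewer describes might be sketched as follows; the enum, threshold, and names are assumptions, not DataFusion's existing implementation:

```rust
// Start each group with a tiny sorted Vec and materialize the 8 KiB bitmap
// only once cardinality justifies it.
enum AdaptiveU16Distinct {
    Small(Vec<u16>),         // a few bytes while the group stays sparse
    Dense(Box<[u64; 1024]>), // full 8 KiB bitmap for high-cardinality groups
}

const UPGRADE_THRESHOLD: usize = 64; // hypothetical cutover point

impl AdaptiveU16Distinct {
    fn new() -> Self {
        AdaptiveU16Distinct::Small(Vec::new())
    }
    fn insert(&mut self, v: u16) {
        match self {
            AdaptiveU16Distinct::Small(vals) => {
                if let Err(pos) = vals.binary_search(&v) {
                    vals.insert(pos, v);
                    if vals.len() > UPGRADE_THRESHOLD {
                        // Cut over to the bitmap only when it pays for itself.
                        let mut bitmap = Box::new([0u64; 1024]);
                        for &x in vals.iter() {
                            bitmap[(x >> 6) as usize] |= 1u64 << (x & 63);
                        }
                        *self = AdaptiveU16Distinct::Dense(bitmap);
                    }
                }
            }
            AdaptiveU16Distinct::Dense(bitmap) => {
                bitmap[(v >> 6) as usize] |= 1u64 << (v & 63);
            }
        }
    }
    fn count(&self) -> usize {
        match self {
            AdaptiveU16Distinct::Small(vals) => vals.len(),
            AdaptiveU16Distinct::Dense(bitmap) => {
                bitmap.iter().map(|w| w.count_ones() as usize).sum()
            }
        }
    }
}

fn main() {
    let mut acc = AdaptiveU16Distinct::new();
    for v in 0..1000u16 {
        acc.insert(v % 500); // 500 distinct values, forces the upgrade
    }
    assert_eq!(acc.count(), 500);
}
```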

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions-aggregate/src/count.rs` around lines 191 - 193, The
DataType::UInt16 | DataType::Int16 match branch currently forces creation of the
full-bitmap accumulator via get_small_int_accumulator(data_type).unwrap(),
causing eager 8 KiB allocations per group; change this branch to return the
lazy/adaptive accumulator used for other small-int types (i.e., the accumulator
that grows with cardinality and only materializes a bitmap when needed) instead
of the fixed bitmap one so DISTINCT keeps one growable Accumulator per live
group; update the DataType::UInt16 | DataType::Int16 arm to use the existing
lazy/adaptive factory (or the previous get_small_int_accumulator variant that is
lazy) and ensure types/unwrap usage match the surrounding code.

@martin-augment
Owner Author

Bug: size() double-counts inline array memory

For BoolArray256DistinctCountAccumulator and BoolArray256DistinctCountAccumulatorI8, the seen field ([bool; 256]) is inline in the struct — it is not heap-allocated. size_of_val(self) already includes those 256 bytes, so adding + 256 double-counts:

// BoolArray256DistinctCountAccumulator::size() — incorrect
fn size(&self) -> usize {
    size_of_val(self) + 256  // size_of_val already includes the 256-byte seen array
}

Should be:

fn size(&self) -> usize {
    size_of_val(self)
}

The Bitmap65536 variants are correct — bitmap: Box<[u64; 1024]> is heap-allocated, so size_of_val(self) only captures the pointer (8 bytes) and + 8192 covers the allocation.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The seen field is an inline array, so its 256 bytes are already included in the value returned by size_of_val() and must not be added again manually. Fixing this prevents overestimating the size of the accumulator.

@martin-augment
Owner Author

Profiling artifact leaked into production: #[inline(never)]

The PR adds #[inline(never)] to the pre-existing PrimitiveDistinctCountAccumulator::update_batch and FloatDistinctCountAccumulator::update_batch. This looks like a profiling artifact. Forcing no-inlining on hot paths can hurt performance and should not be in a production commit. Please remove those two annotations from the existing accumulators.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! #[inline(never)] should be used only when inlining causes problems or when a function needs to be compared against an alternative (e.g. during profiling). The changes in the PR suggest neither, so the attribute should be removed.

@martin-augment
Owner Author

BoolArray256 should use a bitmap, not [bool; 256]

For u8/i8 (256 possible values), a [u64; 4] bitmap (32 bytes) would:

  • Use 8× less memory than [bool; 256]
  • Count faster via count_ones() (hardware POPCNT), consistent with the u16/i16 implementation

The current count() iterates 256 elements instead of 4 words. If the goal is performance parity with the bitmap-based u16 path, the u8 path should also use a bitmap.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! A [u64; 4] bitmap occupies 32 bytes, 8 times less than the 256-byte [bool; 256] array. Counting set bits with count_ones() also compiles down to the POPCNT instruction, making the count faster as well.
