
[WIP] BloomFilter v2 support for Spark's bloom-filter based joins #4360

Draft
mythrocks wants to merge 2 commits into NVIDIA:main from mythrocks:bloom-filter-v2-wip

Conversation

@mythrocks
Collaborator

Description

This commit adds support for the v2 format of the BloomFilters used in Apache Spark 4.1.1 for joins (via apache/spark@a08d8b0).

Background

The v1 format used INT32 values for bit-index calculation, so as the number of items in the bloom filter approaches INT_MAX, collision rates rise. The v2 format uses INT64 values for bit-index calculations, allowing the full bit space to be addressed, which reduces false-positive rates for large filters.

Before this PR, spark-rapids-jni supported only the v1 bloom filter format. Testing spark-rapids on Apache Spark 4.1.1 revealed failures in mixed-mode execution, where bloom filters built on the CPU (in v2 format) were probed on the GPU, which assumed the v1 format.

The changes here should reduce false-positive rates for bloom filters built on join keys with high cardinality (approaching INT_MAX). Support for the v1 format is retained for backward compatibility.
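The bit-index difference described above can be sketched in plain C++. This is an illustrative sketch with hypothetical helper names, not the plugin's actual kernel code: the key point is that a 32-bit combined hash can never address more than about 2^31 bit positions, while the 64-bit variant can address the whole bit array.

```cpp
#include <cassert>
#include <cstdint>

// V1-style: the two hash words are combined in 32-bit arithmetic, so at
// most ~2^31 distinct bit positions are reachable regardless of filter size.
// (Wrapping is emulated via unsigned math to avoid signed-overflow UB.)
inline int64_t bit_index_v1(int32_t h1, int32_t h2, int32_t i, int64_t num_bits)
{
  int32_t combined = static_cast<int32_t>(
    static_cast<uint32_t>(h1) + static_cast<uint32_t>(i) * static_cast<uint32_t>(h2));
  if (combined < 0) { combined = ~combined; }
  return combined % num_bits;
}

// V2-style: the combined hash stays in 64 bits, so every bit of a large
// filter is addressable.
inline int64_t bit_index_v2(int32_t h1, int32_t h2, int32_t i, int64_t num_bits)
{
  int64_t combined = static_cast<int64_t>(h1) * INT32_MAX
                   + static_cast<int64_t>(i) * static_cast<int64_t>(h2);
  if (combined < 0) { combined = ~combined; }
  return combined % num_bits;
}
```

For a filter of 2^40 bits, the v1 index is always confined to the low 2^31 positions, while the v2 index can land anywhere in the array.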

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks marked this pull request as draft March 11, 2026 23:13
@mythrocks mythrocks self-assigned this Mar 11, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds GPU support for Apache Spark 4.1.1's V2 bloom filter format, which uses 64-bit bit-index arithmetic (instead of V1's 32-bit) to reduce false-positive rates for large filters. The implementation extends the existing V1 code path with a new templated GPU kernel (gpu_bloom_filter_put<Version>, bloom_probe_functor<Version>), a wider V2 header struct (16 bytes: version + num_hashes + seed + num_longs vs V1's 12 bytes), and auto-detection of the format in unpack_bloom_filter. Both the C++ library and the JNI/Java layers are updated.
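The 12-byte vs. 16-byte header layouts mentioned above can be written out as structs. Field names and order here are inferred from the summary (version + num_hashes + seed + num_longs) and are guesses; see bloom_filter.hpp for the real definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical V1 header: 12 bytes, no explicit seed.
struct bloom_filter_header_v1 {
  int32_t version;     // == 1
  int32_t num_hashes;  // number of hash functions
  int32_t num_longs;   // bit-array length in 64-bit words
};

// Hypothetical V2 header: 16 bytes, adds an explicit hash seed.
struct bloom_filter_header_v2 {
  int32_t version;     // == 2
  int32_t num_hashes;
  int32_t seed;        // new in V2
  int32_t num_longs;
};
```

Since the version field leads both layouts, a reader can inspect the first four bytes and then pick the header size, which is how the auto-detection in unpack_bloom_filter is described.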

Key changes and issues:

  • BloomFilter.create(numHashes, bits) default silently changed to V2 — the 2-arg convenience overload now produces a V2 filter, which is a breaking change for any caller relying on it to emit a V1-compatible buffer. Consider keeping V1 as the default or removing the overload entirely to force explicit version selection.
  • Narrowing jlong → int cast in BloomFilterJni.cpp: bloom_filter_longs is computed as static_cast<int>((bloomFilterBits + 63) / 64), but bloomFilterBits is a 64-bit jlong. For V2 filters larger than ~137 billion bits the result would silently overflow, undermining V2's large-filter purpose. Both the JNI and bloom_filter_create signatures should be widened to int64_t.
  • Missing BuildAndProbeWithNullsV2 test — the V2 C++ test suite does not cover the case where the probe column has null rows (the bloom_probe_functor<2> path with a null-masked input), leaving a gap relative to the V1 suite.
  • Redundant inner check in unpack_bloom_filter — the CUDF_EXPECTS(read_size >= 16, ...) inside the V2 branch is unreachable because the prior size check already guarantees bloom_filter.size() >= 16, hence read_size == 16.

Confidence Score: 3/5

  • PR is marked WIP and has two correctness issues (JNI narrowing cast, silent API default change to V2) that should be resolved before merging.
  • The core GPU algorithm (V2 64-bit hash with h1*INT32_MAX seed and accumulated h2) is correctly structured and mirrors the Spark Java V2 reference. Big-endian swizzle, header layout, merge validation, and probe paths all look correct. However, the JNI bloom_filter_longs narrowing cast from jlong to int could silently corrupt very large V2 filters—directly contradicting V2's main purpose—and the 2-arg BloomFilter.create() silently switching to V2 is a breaking API change. These two issues lower confidence from 5 to 3 pending resolution.
  • src/main/cpp/src/BloomFilterJni.cpp (narrowing cast) and src/main/java/com/nvidia/spark/rapids/jni/BloomFilter.java (default version change)

Important Files Changed

Filename Overview
src/main/cpp/src/bloom_filter.cu Core implementation of V1/V2 bloom filter put, probe, and merge. V2 uses 64-bit combined hash (h1*INT32_MAX + accumulated h2). Logic is well-structured; minor redundant validation check in unpack_bloom_filter.
src/main/cpp/src/bloom_filter.hpp Adds V1/V2 header structs, version constants, and a helper bloom_filter_header_size_for_version(). API is clean and well-documented; no issues found.
src/main/cpp/src/BloomFilterJni.cpp JNI bridge updated to accept version and seed. Contains a narrowing cast from jlong to int for bloom_filter_longs that will silently overflow for very large V2 filters, undermining V2's large-filter purpose.
src/main/java/com/nvidia/spark/rapids/jni/BloomFilter.java Adds VERSION_1/VERSION_2 constants and versioned create() API. The convenience create(numHashes, bits) overload now defaults to V2, which is a silent breaking change for callers expecting V1 output.
src/main/cpp/tests/bloom_filter.cu Good V2 test coverage mirroring V1 suite, plus a V2WithSeed test. Missing BuildAndProbeWithNullsV2, which leaves the null-masked probe path for V2 untested.
src/test/java/com/nvidia/spark/rapids/jni/BloomFilterTest.java Java tests updated to use the versioned API and adds testBuildAndProbeV1. Coverage is adequate for the Java layer.
src/main/cpp/benchmarks/bloom_filter.cu Benchmark split into separate V1 and V2 variants. Straightforward duplication with version-specific bloom_filter_create calls; no issues.

Sequence Diagram

sequenceDiagram
    participant Spark as Spark 4.1.1 CPU
    participant Java as BloomFilter.java
    participant JNI as BloomFilterJni.cpp
    participant GPU as bloom_filter.cu (GPU)

    Note over Spark,GPU: Build phase (GPU)
    Java->>JNI: creategpu(version, numHashes, bits, seed)
    JNI->>GPU: bloom_filter_create(version, numHashes, longs, seed)
    GPU-->>JNI: list_scalar (V1: 12B header | V2: 16B header | bit array)
    JNI-->>Java: Scalar handle

    Java->>JNI: put(bloomFilter, cv)
    JNI->>GPU: bloom_filter_put(list_scalar, column)
    GPU->>GPU: unpack_bloom_filter → detect V1 or V2
    alt V1
        GPU->>GPU: gpu_bloom_filter_put<1> (32-bit combined hash, loop 1..N)
    else V2
        GPU->>GPU: gpu_bloom_filter_put<2> (64-bit combined hash seeded h1*INT32_MAX, loop 0..N-1)
    end

    Note over Spark,GPU: Probe phase (mixed-mode: CPU-built filter → GPU probe)
    Spark->>Java: serialized bloom filter buffer (V2 format)
    Java->>JNI: probebuffer(addr, len, cv)
    JNI->>GPU: bloom_filter_probe(input, device_span)
    GPU->>GPU: unpack_bloom_filter → read version from first 4 bytes
    alt V1 (version==1, 12B header)
        GPU->>GPU: bloom_probe_functor<1>
    else V2 (version==2, 16B header + seed)
        GPU->>GPU: bloom_probe_functor<2>
    end
    GPU-->>Java: boolean column (true=may-match, false=definitely-not-in-set)

Comments Outside Diff (3)

  1. src/main/java/com/nvidia/spark/rapids/jni/BloomFilter.java, line 1091-1092 (link)

    Default version changed to V2 — potential breaking change

    The no-version create(numHashes, bloomFilterBits) overload now silently delegates to V2 (via VERSION_2). Any existing caller that relied on this convenience API to build a filter to be probed by CPU-side code expecting V1 layout (or vice versa) will now get an incompatible buffer without any error or migration warning.

    If callers in spark-rapids itself always go through the versioned create(version, numHashes, bits, seed) path this is fine, but the PR description says "support for the v1 format is retained, for backward compatibility" — keeping the 2-arg overload as V1 (or removing it entirely so callers are forced to be explicit) would be safer.

  2. src/main/cpp/src/BloomFilterJni.cpp, line 104 (link)

    Narrowing cast from jlong to int for bloom_filter_longs

    bloomFilterBits is a jlong (64-bit), but the result of (bloomFilterBits + 63) / 64 is cast to int (32-bit). For V2 filters, one of the stated goals is supporting larger bloom filters (approaching the limits that caused collisions in V1). A bloomFilterBits value larger than (INT_MAX - 63) * 64 ≈ 137 billion would silently overflow here, producing an incorrect (likely very small or negative) bloom_filter_longs value and a corrupt filter.

    The internal bloom_filter_create signature also accepts int bloom_filter_longs, so both the JNI binding and the C++ API would need to be widened together if large-filter support is intended.
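The overflow is easy to demonstrate in isolation. In this sketch, 'bits' stands in for the jlong bloomFilterBits and the narrowing helper mirrors the shape of the cast described above (the threshold is (INT_MAX - 63) * 64 ≈ 137.4 billion bits):

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the shape of the narrowing cast flagged above: fine for small
// filters, silently wrong past ~137 billion bits.
inline int longs_narrow(int64_t bits) { return static_cast<int>((bits + 63) / 64); }

// The widened version suggested by the review: result stays 64-bit.
inline int64_t longs_wide(int64_t bits) { return (bits + 63) / 64; }
```

At 200 billion bits, longs_wide yields 3,125,000,000 words, which no longer fits in an int, so the narrow and wide results diverge.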

  3. src/main/cpp/tests/bloom_filter.cu, line 921 (link)

    Missing BuildAndProbeWithNullsV2 test

    The V1 suite covers four scenarios: InitializationV1, BuildAndProbeV1, BuildWithNullsAndProbeV1, BuildAndProbeWithNullsV1, and ProbeMergedV1. The V2 suite is missing BuildAndProbeWithNullsV2 — the case where the probe column (not the build column) contains nulls. This is the path that exercises bloom_probe_functor<2> with null-masked input rows, and is particularly important to validate because bloom_filter_probe propagates the input bitmask via cudf::copy_bitmask and any off-by-one in the V2 combined-hash loop when a row is null could go undetected without this test.

Last reviewed commit: 7af89f7

Comment on lines +209 to +212

int32_t seed = 0;
if (version == bloom_filter_version_2) {
CUDF_EXPECTS(read_size >= static_cast<size_t>(bloom_filter_header_v2_size_bytes),
Contributor


Redundant size check inside the V2 branch

read_size >= bloom_filter_header_v2_size_bytes is always true here. The guard immediately above:

CUDF_EXPECTS(bloom_filter.size() >= static_cast<size_t>(hdr_size), "Encountered truncated bloom filter header");

already ensures bloom_filter.size() >= 16 when version == 2. Since read_size = std::min(bloom_filter.size(), 16), it follows that read_size == 16 at this point. The inner CUDF_EXPECTS can never fire and can safely be removed to reduce noise.
