[WIP] BloomFilter v2 support for Spark's bloom-filter based joins #4360
mythrocks wants to merge 2 commits into NVIDIA:main
Conversation
Signed-off-by: MithunR <mithunr@nvidia.com>
Greptile Summary

This PR adds GPU support for Apache Spark 4.1.1's V2 bloom filter format, which uses 64-bit bit-index arithmetic (instead of V1's 32-bit) to reduce false-positive rates for large filters. The implementation extends the existing V1 code path with a new templated GPU kernel (gpu_bloom_filter_put).
Confidence Score: 3/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant Spark as Spark 4.1.1 CPU
    participant Java as BloomFilter.java
    participant JNI as BloomFilterJni.cpp
    participant GPU as bloom_filter.cu (GPU)
    Note over Spark,GPU: Build phase (GPU)
    Java->>JNI: creategpu(version, numHashes, bits, seed)
    JNI->>GPU: bloom_filter_create(version, numHashes, longs, seed)
    GPU-->>JNI: list_scalar (V1: 12B header | V2: 16B header | bit array)
    JNI-->>Java: Scalar handle
    Java->>JNI: put(bloomFilter, cv)
    JNI->>GPU: bloom_filter_put(list_scalar, column)
    GPU->>GPU: unpack_bloom_filter → detect V1 or V2
    alt V1
        GPU->>GPU: gpu_bloom_filter_put<1> (32-bit combined hash, loop 1..N)
    else V2
        GPU->>GPU: gpu_bloom_filter_put<2> (64-bit combined hash seeded h1*INT32_MAX, loop 0..N-1)
    end
    Note over Spark,GPU: Probe phase (mixed-mode: CPU-built filter → GPU probe)
    Spark->>Java: serialized bloom filter buffer (V2 format)
    Java->>JNI: probebuffer(addr, len, cv)
    JNI->>GPU: bloom_filter_probe(input, device_span)
    GPU->>GPU: unpack_bloom_filter → read version from first 4 bytes
    alt V1 (version==1, 12B header)
        GPU->>GPU: bloom_probe_functor<1>
    else V2 (version==2, 16B header + seed)
        GPU->>GPU: bloom_probe_functor<2>
    end
    GPU-->>Java: boolean column (true=may-match, false=definitely-not-in-set)
```
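The version-detection step in the diagram (reading the format version from the first four bytes of the serialized buffer, then treating the header as 12 bytes for V1 or 16 bytes for V2 with a seed) can be sketched on the host as follows. The struct name, field names, and field offsets below are assumptions for illustration, not the actual spark-rapids-jni layout:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side view of the serialized header described in the
// diagram. Offsets are assumed: version(4) | numHashes(4) | numLongs(4)
// for V1, with an extra seed(4) field appended for V2.
struct bloom_filter_info {
  int32_t version;       // 1 or 2, read from the first 4 bytes
  int32_t num_hashes;
  int32_t seed;          // V2 only; left 0 for V1
  size_t header_bytes;   // 12 (V1) or 16 (V2)
};

bloom_filter_info unpack_header(uint8_t const* buf, size_t len)
{
  bloom_filter_info info{};
  assert(len >= 4);  // need at least the version field
  std::memcpy(&info.version, buf, 4);
  info.header_bytes = (info.version == 2) ? 16 : 12;
  assert(len >= info.header_bytes);  // truncated-header guard
  std::memcpy(&info.num_hashes, buf + 4, 4);
  if (info.version == 2) { std::memcpy(&info.seed, buf + 12, 4); }
  return info;
}
```

The real code performs this unpacking on device memory and dispatches to the version-templated functors shown above; the sketch only illustrates the branch on the version field.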
    int32_t seed = 0;
    if (version == bloom_filter_version_2) {
      CUDF_EXPECTS(read_size >= static_cast<size_t>(bloom_filter_header_v2_size_bytes),
Redundant size check inside the V2 branch
read_size >= bloom_filter_header_v2_size_bytes is always true here. The guard immediately above:

    CUDF_EXPECTS(bloom_filter.size() >= static_cast<size_t>(hdr_size), "Encountered truncated bloom filter header");

already ensures bloom_filter.size() >= 16 when version == 2. Since read_size = std::min(bloom_filter.size(), 16), it follows that read_size == 16 at this point. The inner CUDF_EXPECTS can never fire and can safely be removed to reduce noise.
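The reviewer's argument can be restated in isolation. The constant 16 and the min() construction are taken from the comment above; the function name is made up for the demonstration:

```cpp
#include <algorithm>
#include <cstddef>

// V2 header size, per the review comment above.
constexpr std::size_t hdr_v2_bytes = 16;

// Models read_size = std::min(bloom_filter.size(), 16). Once the outer
// guard has established size >= 16 for V2, this is always exactly 16,
// so a subsequent `read_size >= 16` check is tautological.
constexpr std::size_t v2_read_size(std::size_t bloom_filter_size)
{
  return std::min(bloom_filter_size, hdr_v2_bytes);
}
```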
Description
This commit adds support for the v2 format of the BloomFilters used in Apache Spark 4.1.1 for joins (via apache/spark@a08d8b0).
Background
The v1 format used INT32s for bit index calculation. When the number of items in the bloom-filter approaches INT_MAX, one sees a higher rate of collisions. The v2 format uses INT64 values for bit index calculations, allowing the full bit space to be addressed. Apparently, this reduces the false positive rates for large filters.
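As a rough model of the difference, here is a host-side sketch of the two bit-index computations: 32-bit combined-hash arithmetic for v1 versus 64-bit for v2. The exact hash mixing (including the h1*INT32_MAX seed mentioned in the sequence diagram) is simplified away; h1/h2 stand in for the two halves of the key's hash, and none of this is the actual kernel code:

```cpp
#include <cstdint>

// Sketch only: compute the i-th bit index for a key hashed to (h1, h2).
template <int VERSION>
int64_t nth_bit_index(int32_t h1, int32_t h2, int32_t i, int64_t num_bits)
{
  if constexpr (VERSION == 1) {
    // V1: the combined hash wraps in 32 bits (computed in 64 bits here to
    // avoid C++ signed-overflow UB, then truncated like Java int math).
    auto combined = static_cast<int32_t>(
        static_cast<int64_t>(h1) + static_cast<int64_t>(i) * h2);
    if (combined < 0) { combined = ~combined; }
    // num_bits must itself fit in 32 bits for V1.
    return combined % static_cast<int32_t>(num_bits);
  } else {
    // V2: all-64-bit arithmetic lets the filter address far more than
    // INT32_MAX bits, spreading indices over the full bit space.
    int64_t combined = static_cast<int64_t>(h1) + static_cast<int64_t>(i) * h2;
    if (combined < 0) { combined = ~combined; }
    return combined % num_bits;
  }
}
```

The point of the sketch is the type of `combined`: in v1 every index is squeezed through a 32-bit value, so very large filters cannot use their full bit range, while v2 keeps the whole computation in 64 bits.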
Before the fix in this PR, spark-rapids-jni supported only the v1 bloom filter format. Testing spark-rapids on Apache Spark 4.1.1 revealed failures in mixed-mode execution, where bloom filters built on the CPU were probed on the GPU (which assumed the v1 format).

The changes here should allow for a reduced false-positive rate for bloom filters built on join keys with high cardinalities (approaching INT_MAX). Note also that support for the v1 format is retained, for backward compatibility.