Add a protocol buffer decode kernel by thirtiseven · Pull Request #4107 · NVIDIA/spark-rapids-jni

thirtiseven · 2025-12-23T08:57:47Z

WIP

It's not ready for (human) review yet (but soon!). Marked as open just because to trigger greptile once I push new commit.

But human review is greatly appreciated also

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot

Pull request overview

This PR adds a GPU-accelerated protocol buffer decoder with intentionally limited features, focusing on simple scalar field types. The implementation provides a JNI interface for decoding binary protobuf messages into cuDF STRUCT columns.

Key changes:

Implements GPU kernels for decoding protobuf varint, fixed32/64, and length-delimited (string) fields
Adds JNI bindings between Java and CUDA implementation
Provides basic test coverage for INT64 and STRING field types

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
src/main/java/com/nvidia/spark/rapids/jni/ProtobufSimple.java	Java API providing `decodeToStruct()` method with parameter validation
src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java	Basic test case covering varint (INT64) and string decoding with missing fields and null messages
src/main/cpp/src/protobuf_simple.hpp	C++ API declaration with documentation of supported types
src/main/cpp/src/protobuf_simple.cu	CUDA implementation with three specialized kernels for varint, fixed-width, and string extraction
src/main/cpp/src/ProtobufSimpleJni.cpp	JNI bridge translating Java arrays to C++ vectors and invoking decode logic
src/main/cpp/CMakeLists.txt	Build configuration adding new source files to compilation targets

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/main/cpp/src/protobuf.cu

src/main/cpp/src/protobuf_simple.hpp

src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java

src/main/cpp/src/protobuf_simple.cu

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2025-12-23T09:22:25Z

@greptile full review

greptile-apps · 2025-12-23T09:25:51Z

Greptile Summary

This PR adds a multi-pass GPU protobuf decode kernel to spark-rapids-jni, translating a column of raw protobuf bytes (LIST) into a cuDF STRUCT column following a flattened schema descriptor. The implementation spans a Java API layer (Protobuf, ProtobufSchemaDescriptor), a JNI bridge (ProtobufJni.cpp), GPU scan/count kernels (protobuf_kernels.cu), column builder helpers (protobuf_builders.cu), and shared device utilities (protobuf_common.cuh).

The PR has addressed a large number of issues raised in prior review rounds (JNI local-ref leaks, cudaMemsetAsync byte-granularity bug, serialization bypass, shallow clone immutability, fast-path depth-check asymmetry, ERR_OVERFLOW reuse, enum validation in repeated paths, etc.). The remaining issues found in this pass are:

Critical (protobuf_builders.cu:986): validate_enum_values_kernel is launched with total_count (occurrence count) as num_rows but d_row_has_invalid_enum is sized for top-level rows (original_num_rows). When a repeated struct field has more total occurrences than top-level rows (the common case), this produces an out-of-bounds write into device memory. Additionally, even without OOB, the kernel sets row_has_invalid_enum[occurrence_index] instead of row_has_invalid_enum[top_level_row_index], incorrectly nulling the wrong top-level rows. The d_top_row_indices mapping is already computed but not threaded into extract_typed_column.
Style (protobuf_builders.cu:553): The propagation of per-element invalid-enum flags to the per-row array uses a non-atomic store inside a thrust::for_each device lambda. While the value written is always true (monotone), this is technically a data race under the CUDA memory model.
Style (protobuf_builders.cu:1639): The enum-string column building logic (alloc name offsets, launch compute_enum_string_lengths_kernel / copy_enum_string_chars_kernel) is duplicated in four separate functions, with each copy diverging slightly on whether d_row_has_invalid_enum is propagated. Extraction into a shared helper would prevent future copy-paste fix omissions.

Confidence Score: 2/5

Not safe to merge — an out-of-bounds GPU device write exists in the repeated struct + enum validation path that can cause silent memory corruption.
Most previously raised issues are addressed, but a new critical logic bug was found: validate_enum_values_kernel is called with the wrong array size and the wrong index type inside build_repeated_struct_column. For any schema containing a repeated struct with an INT32 child field that has enum validation, GPU memory outside d_row_has_invalid_enum will be written, producing silent corruption or incorrect struct null masks.
src/main/cpp/src/protobuf_builders.cu — specifically the extract_typed_column call sites inside build_repeated_struct_column and build_nested_struct_column (when invoked from a repeated context) that pass d_row_has_invalid_enum without adjusting its size or providing the top-level row index mapping.

Important Files Changed

Filename	Overview
src/main/cpp/src/protobuf_builders.cu	Contains the most critical issue: `validate_enum_values_kernel` is called with occurrence-count as `num_rows` but `d_row_has_invalid_enum` is sized for top-level rows, causing OOB GPU writes in nested repeated struct contexts. Also has significant enum-string logic duplication across four call sites and a non-atomic concurrent write pattern.
src/main/cpp/src/protobuf_kernels.cu	GPU scan and count kernels. Previous review threads addressed depth-check asymmetry, stack-overflow for large field counts, ERR_OVERFLOW reuse, and silent varint corruption. Fixes appear applied in this revision. No new critical issues found.
src/main/cpp/src/protobuf_common.cuh	Device-side utilities including `get_wire_type_size`, `decode_tag`, and extraction kernels. Previously flagged unbounded WT_SGROUP recursion and INT_MAX overflow checks were fixed. The iterative WT_SGROUP loop, proper `fn > INT_MAX` guard in `decode_tag`, and the `extract_typed_column` helper all look correct.
src/main/cpp/src/protobuf.cu	Top-level orchestration: multi-pass decode pipeline, field categorisation, and column assembly. The `cudaMemcpyAsync`-based error initialisation (replacing the buggy `cudaMemsetAsync`) looks correct. `d_row_has_invalid_enum` is sized correctly for top-level rows here; the OOB issue originates in the nested builder functions when that array is passed deeper.
src/main/cpp/src/ProtobufJni.cpp	JNI bridge. Previously flagged null-pointer and local-ref-leak issues on all three object-array loop paths appear fixed. Topology, wire-type, and depth validations are present. No new issues found.
src/main/java/com/nvidia/spark/rapids/jni/ProtobufSchemaDescriptor.java	Schema descriptor with thorough validation. Previously flagged issues (shallow clones, serialization bypass, duplicate field numbers, wire-type and enum length validation) all appear fixed. `readObject` calls the shared `validate()` helper; deep copies protect all nested arrays.

Sequence Diagram

sequenceDiagram
    participant Java as Protobuf.java
    participant JNI as ProtobufJni.cpp
    participant Host as protobuf.cu (host)
    participant Kern as GPU Kernels
    participant Build as protobuf_builders.cu

    Java->>JNI: decodeToStruct(binaryInput, schema...)
    JNI->>JNI: Validate arrays & topology
    JNI->>Host: decode_protobuf_to_struct()

    Host->>Kern: scan_all_fields_kernel (Pass 1: locate scalar fields)
    Host->>Kern: count_repeated_fields_kernel (Pass 1: count repeated)
    Host->>Kern: scan_all_fields_kernel (Pass 1: locate nested messages)

    Host->>Kern: thrust::inclusive_scan (Pass 2: prefix sums for offsets)
    Host->>Kern: scan_all_repeated_occurrences_kernel (Pass 2: record occurrences)

    Host->>Build: build scalar columns (batched extract kernels)
    Host->>Build: build_repeated_scalar/string/struct_column (repeated fields)
    Host->>Build: build_nested_struct_column (nested message fields)

    Build->>Kern: extract_varint/fixed_batched_kernel
    Build->>Kern: validate_enum_values_kernel
    Build->>Kern: scan_repeated_message_children_kernel

    Host->>Kern: valid_if (build struct null mask from d_row_has_invalid_enum)
    Host->>Java: ColumnVector handle (STRUCT column)

Comments Outside Diff (2)

src/main/cpp/src/protobuf_builders.cu, line 986-1006 (link)

OOB write + wrong index in validate_enum_values_kernel for repeated struct context

When extract_typed_column is called here with total_count as num_items, it internally calls validate_enum_values_kernel with num_rows = total_count. That kernel writes row_has_invalid_enum[row] for every row index in [0, total_count - 1].

However, d_row_has_invalid_enum was allocated in decode_protobuf_to_struct with exactly original_num_rows (top-level row count) elements. Because a repeated field typically produces total_count ≥ num_rows occurrences (each row can carry ≥ 1 repetition), this write is out-of-bounds as soon as any row has more than one occurrence.

There is also a second, semantic bug: even when there is no OOB, the kernel writes to d_row_has_invalid_enum[occurrence_index] rather than d_row_has_invalid_enum[occ.row_idx] (the actual top-level row that should be nulled). The correct top-level row mapping is already available in d_top_row_indices computed a few lines earlier.

Both issues affect any schema where a top-level repeated struct contains a child INT32 field with enum_valid_values set.

To fix, the validate_enum_values_kernel call path inside extract_typed_column needs to receive the top-level row index mapping for nested/repeated contexts (e.g. an optional int32_t const* top_row_indices and a top_row_count parameter), or the propagation to d_row_has_invalid_enum should be done in the caller after receiving a per-item invalid flag vector, using d_top_row_indices as the mapping.

The same pattern recurs in build_nested_struct_column where it calls extract_typed_column for scalar children inside repeated occurrences.
src/main/cpp/src/protobuf_builders.cu, line 553-566 (link)

Non-atomic concurrent write to shared d_row_has_invalid_enum entries

Multiple GPU threads that process different elements belonging to the same top-level row will concurrently execute row_invalid[occs[idx].row_idx] = true. Because all writers store the same value (true) the comment is correct that the outcome is deterministic on sm_70+ hardware, but this is technically a data race under the CUDA memory model and is undefined behaviour in C++17/20.

A safer idiom that is standard and portable is:
```
atomicOr(reinterpret_cast<unsigned int*>(&row_invalid[occs[idx].row_idx]),
         static_cast<unsigned int>(1u));
```
bool is one byte on all supported platforms, and reinterpret_cast to unsigned int* would be UB. Instead, store d_row_has_invalid_enum as uint8_t* (or use a separate uint8_t array) so that atomicOr((unsigned char*)ptr, 1u) via inline PTX or the cuda::atomic API compiles correctly.

_{Last reviewed commit: 5357378}

greptile-apps

Additional Comments (6)

src/main/cpp/src/protobuf_simple.cu, line 89-91 (link)

logic: potential overflow: len64 can be up to 2^64-1, but casting to int on line 90 can overflow if len64 > INT_MAX
src/main/cpp/src/protobuf_simple.cu, line 323-324 (link)

logic: potential overflow: len64 can be larger than INT_MAX, but casting to int on line 324 will overflow
src/main/cpp/src/protobuf_simple.cu, line 375-376 (link)

logic: race condition: multiple threads write to *error_flag without atomics, causing undefined behavior when multiple threads encounter errors simultaneously

Then in kernels, use atomicOr(error_flag, 1) instead of *error_flag = 1
src/main/cpp/src/protobuf_simple.cu, line 398-407 (link)

logic: protobuf uses zigzag encoding for signed integers (sint32/sint64), but varint decoding here treats them as unsigned - decoding negative values will produce incorrect results. Are you only supporting unsigned int32/int64, or should zigzag decoding be implemented for signed types?
src/main/cpp/src/protobuf_simple.cu, line 240 (link)

syntax: type punning through reinterpret_cast of incompatible pointer types is undefined behavior in C++
src/main/cpp/src/protobuf_simple.cu, line 248 (link)

syntax: type punning through reinterpret_cast of incompatible pointer types is undefined behavior in C++

_{6 files reviewed, 6 comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2025-12-25T03:43:10Z

@greptile full review

greptile-apps

Additional Comments (2)

src/main/cpp/src/protobuf_simple.cu, line 186 (link)

logic: zigzag decode uses signed right shift which is correct for signed types, but applied to unsigned v
src/main/cpp/src/ProtobufSimpleJni.cpp, line 55 (link)

logic: encodings is constructed from n_type_scales but then out_types is also constructed using n_type_scales[i] as the scale parameter, which would be wrong for non-decimal types where this represents encoding

_{6 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/main/cpp/src/protobuf_simple.cu

src/main/cpp/src/protobuf.cu

src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java

src/main/cpp/src/protobuf.cu

src/main/cpp/src/ProtobufSimpleJni.cpp

src/main/cpp/src/protobuf_simple.cu

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/main/cpp/src/protobuf_simple.hpp

src/main/cpp/src/protobuf_simple.cu

src/main/cpp/src/protobuf.cu

src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java

src/main/cpp/src/protobuf.cu

src/main/java/com/nvidia/spark/rapids/jni/Protobuf.java

src/main/cpp/src/protobuf_simple.cu

src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java

src/main/java/com/nvidia/spark/rapids/jni/Protobuf.java

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

…ven/spark-rapids-jni into protocol_buffer_jni_dev

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-01-12T07:33:48Z

@greptileai full review

greptile-apps

_{4 files reviewed, 4 comments}

_{Edit Code Review Agent Settings | Greptile}

src/main/cpp/src/protobuf.cu

src/main/java/com/nvidia/spark/rapids/jni/Protobuf.java

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

nvauto · 2026-01-19T02:11:57Z

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-05T14:05:28Z

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-05T14:24:47Z

@greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-05T14:48:55Z

@greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-05T15:10:49Z

@greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-05T15:37:26Z

@greptile list all remain issues

greptile-apps · 2026-03-05T15:38:30Z

All review comments appear to be resolved!

Looking at the PR:

✅ All 70+ review threads from previous reviews are marked as resolved and outdated
The PR has 95 total review comments, all of which appear to have been addressed
The PR is currently in draft state

Since all the review feedback has been resolved, the main remaining task is likely to:

Mark the PR as ready for review (remove draft status)
Request review from maintainers
Ensure CI/tests are passing

If you'd like me to check for any specific unresolved issues or review any particular aspect of the code, let me know!

thirtiseven · 2026-03-05T15:41:36Z

@greptile full review again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-06T02:23:12Z

@greptile full review again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-06T03:32:40Z

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-06T04:06:06Z

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-06T07:47:45Z

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-12T03:27:30Z

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-12T06:48:07Z

@greptile review

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2026-03-12T08:11:44Z

@greptile full review again and give comments rather than just update summary. please.

ai draft

4aa7dbb

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from Copilot December 23, 2025 08:57

Copilot started reviewing on behalf of thirtiseven December 23, 2025 08:58 View session

Copilot AI reviewed Dec 23, 2025

View reviewed changes

thirtiseven mentioned this pull request Dec 23, 2025

Add partial support for from_protobuf NVIDIA/spark-rapids#14062

Closed

3 tasks

style

a319d10

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

greptile-apps bot reviewed Dec 23, 2025

View reviewed changes

thirtiseven added 2 commits December 25, 2025 10:58

address comments

458b583

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

5c1bbf4

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from Copilot December 25, 2025 03:43

Copilot started reviewing on behalf of thirtiseven December 25, 2025 03:43 View session

greptile-apps bot reviewed Dec 25, 2025

View reviewed changes

Copilot AI reviewed Dec 25, 2025

View reviewed changes

address comments

0445daa

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven requested a review from Copilot December 25, 2025 06:56

Copilot started reviewing on behalf of thirtiseven December 25, 2025 06:57 View session

Copilot AI reviewed Dec 25, 2025

View reviewed changes

thirtiseven added 6 commits December 25, 2025 15:39

address comments

8ddbf96

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

copyrights

c9eea59

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Merge branch 'main' into protocol_buffer_jni_dev

f27fd6e

update, added more tests

617471d

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Merge branch 'protocol_buffer_jni_dev' of https://github.com/thirtise…

bdfac06

…ven/spark-rapids-jni into protocol_buffer_jni_dev

style

3701cdf

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

greptile-apps bot reviewed Jan 12, 2026

View reviewed changes

address comments

1bf7660

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

63bc13a

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

8316f75

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

8ae9547

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

756010d

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

64de422

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address comments

aae6bca

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

6e1c4d2

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

9c6dd70

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

address greptile comments

6b2f494

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven added 4 commits March 6, 2026 16:26

Lookup table for repeated message child scan

e40d5a7

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

reflection refactor

4b4f6f9

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

reflection refactor

66daed1

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

bug fixes

cd1763b

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

comment address

0213721

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven marked this pull request as ready for review March 12, 2026 07:08

comment address

5357378

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Conversation

thirtiseven commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thirtiseven commented Dec 23, 2025

Uh oh!

greptile-apps bot commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

greptile-apps bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Additional Comments (6)

Uh oh!

thirtiseven commented Dec 25, 2025

Uh oh!

greptile-apps bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Additional Comments (2)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thirtiseven commented Jan 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nvauto commented Jan 19, 2026

Uh oh!

thirtiseven commented Dec 23, 2025 •

edited

Loading

greptile-apps bot commented Dec 23, 2025 •

edited

Loading

greptile-apps bot left a comment •

edited

Loading

greptile-apps bot left a comment •

edited

Loading