Skip to content

Add a protocol buffer decode kernel#4107

Open
thirtiseven wants to merge 74 commits intoNVIDIA:mainfrom
thirtiseven:protocol_buffer_jni_dev
Open

Add a protocol buffer decode kernel#4107
thirtiseven wants to merge 74 commits intoNVIDIA:mainfrom
thirtiseven:protocol_buffer_jni_dev

Conversation

@thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Dec 23, 2025

WIP

It's not ready for (human) review yet (but soon!). Marked as open just because to trigger greptile once I push new commit.

But human review is greatly appreciated also

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a GPU-accelerated protocol buffer decoder with intentionally limited features, focusing on simple scalar field types. The implementation provides a JNI interface for decoding binary protobuf messages into cuDF STRUCT columns.

Key changes:

  • Implements GPU kernels for decoding protobuf varint, fixed32/64, and length-delimited (string) fields
  • Adds JNI bindings between Java and CUDA implementation
  • Provides basic test coverage for INT64 and STRING field types

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
src/main/java/com/nvidia/spark/rapids/jni/ProtobufSimple.java Java API providing decodeToStruct() method with parameter validation
src/test/java/com/nvidia/spark/rapids/jni/ProtobufSimpleTest.java Basic test case covering varint (INT64) and string decoding with missing fields and null messages
src/main/cpp/src/protobuf_simple.hpp C++ API declaration with documentation of supported types
src/main/cpp/src/protobuf_simple.cu CUDA implementation with three specialized kernels for varint, fixed-width, and string extraction
src/main/cpp/src/ProtobufSimpleJni.cpp JNI bridge translating Java arrays to C++ vectors and invoking decode logic
src/main/cpp/CMakeLists.txt Build configuration adding new source files to compilation targets

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile full review

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 23, 2025

Greptile Summary

This PR adds a multi-pass GPU protobuf decode kernel to spark-rapids-jni, translating a column of raw protobuf bytes (LIST) into a cuDF STRUCT column following a flattened schema descriptor. The implementation spans a Java API layer (Protobuf, ProtobufSchemaDescriptor), a JNI bridge (ProtobufJni.cpp), GPU scan/count kernels (protobuf_kernels.cu), column builder helpers (protobuf_builders.cu), and shared device utilities (protobuf_common.cuh).

The PR has addressed a large number of issues raised in prior review rounds (JNI local-ref leaks, cudaMemsetAsync byte-granularity bug, serialization bypass, shallow clone immutability, fast-path depth-check asymmetry, ERR_OVERFLOW reuse, enum validation in repeated paths, etc.). The remaining issues found in this pass are:

  • Critical (protobuf_builders.cu:986): validate_enum_values_kernel is launched with total_count (occurrence count) as num_rows but d_row_has_invalid_enum is sized for top-level rows (original_num_rows). When a repeated struct field has more total occurrences than top-level rows (the common case), this produces an out-of-bounds write into device memory. Additionally, even without OOB, the kernel sets row_has_invalid_enum[occurrence_index] instead of row_has_invalid_enum[top_level_row_index], incorrectly nulling the wrong top-level rows. The d_top_row_indices mapping is already computed but not threaded into extract_typed_column.
  • Style (protobuf_builders.cu:553): The propagation of per-element invalid-enum flags to the per-row array uses a non-atomic store inside a thrust::for_each device lambda. While the value written is always true (monotone), this is technically a data race under the CUDA memory model.
  • Style (protobuf_builders.cu:1639): The enum-string column building logic (alloc name offsets, launch compute_enum_string_lengths_kernel / copy_enum_string_chars_kernel) is duplicated in four separate functions, with each copy diverging slightly on whether d_row_has_invalid_enum is propagated. Extraction into a shared helper would prevent future copy-paste fix omissions.

Confidence Score: 2/5

  • Not safe to merge — an out-of-bounds GPU device write exists in the repeated struct + enum validation path that can cause silent memory corruption.
  • Most previously raised issues are addressed, but a new critical logic bug was found: validate_enum_values_kernel is called with the wrong array size and the wrong index type inside build_repeated_struct_column. For any schema containing a repeated struct with an INT32 child field that has enum validation, GPU memory outside d_row_has_invalid_enum will be written, producing silent corruption or incorrect struct null masks.
  • src/main/cpp/src/protobuf_builders.cu — specifically the extract_typed_column call sites inside build_repeated_struct_column and build_nested_struct_column (when invoked from a repeated context) that pass d_row_has_invalid_enum without adjusting its size or providing the top-level row index mapping.

Important Files Changed

Filename Overview
src/main/cpp/src/protobuf_builders.cu Contains the most critical issue: validate_enum_values_kernel is called with occurrence-count as num_rows but d_row_has_invalid_enum is sized for top-level rows, causing OOB GPU writes in nested repeated struct contexts. Also has significant enum-string logic duplication across four call sites and a non-atomic concurrent write pattern.
src/main/cpp/src/protobuf_kernels.cu GPU scan and count kernels. Previous review threads addressed depth-check asymmetry, stack-overflow for large field counts, ERR_OVERFLOW reuse, and silent varint corruption. Fixes appear applied in this revision. No new critical issues found.
src/main/cpp/src/protobuf_common.cuh Device-side utilities including get_wire_type_size, decode_tag, and extraction kernels. Previously flagged unbounded WT_SGROUP recursion and INT_MAX overflow checks were fixed. The iterative WT_SGROUP loop, proper fn > INT_MAX guard in decode_tag, and the extract_typed_column helper all look correct.
src/main/cpp/src/protobuf.cu Top-level orchestration: multi-pass decode pipeline, field categorisation, and column assembly. The cudaMemcpyAsync-based error initialisation (replacing the buggy cudaMemsetAsync) looks correct. d_row_has_invalid_enum is sized correctly for top-level rows here; the OOB issue originates in the nested builder functions when that array is passed deeper.
src/main/cpp/src/ProtobufJni.cpp JNI bridge. Previously flagged null-pointer and local-ref-leak issues on all three object-array loop paths appear fixed. Topology, wire-type, and depth validations are present. No new issues found.
src/main/java/com/nvidia/spark/rapids/jni/ProtobufSchemaDescriptor.java Schema descriptor with thorough validation. Previously flagged issues (shallow clones, serialization bypass, duplicate field numbers, wire-type and enum length validation) all appear fixed. readObject calls the shared validate() helper; deep copies protect all nested arrays.

Sequence Diagram

sequenceDiagram
    participant Java as Protobuf.java
    participant JNI as ProtobufJni.cpp
    participant Host as protobuf.cu (host)
    participant Kern as GPU Kernels
    participant Build as protobuf_builders.cu

    Java->>JNI: decodeToStruct(binaryInput, schema...)
    JNI->>JNI: Validate arrays & topology
    JNI->>Host: decode_protobuf_to_struct()

    Host->>Kern: scan_all_fields_kernel (Pass 1: locate scalar fields)
    Host->>Kern: count_repeated_fields_kernel (Pass 1: count repeated)
    Host->>Kern: scan_all_fields_kernel (Pass 1: locate nested messages)

    Host->>Kern: thrust::inclusive_scan (Pass 2: prefix sums for offsets)
    Host->>Kern: scan_all_repeated_occurrences_kernel (Pass 2: record occurrences)

    Host->>Build: build scalar columns (batched extract kernels)
    Host->>Build: build_repeated_scalar/string/struct_column (repeated fields)
    Host->>Build: build_nested_struct_column (nested message fields)

    Build->>Kern: extract_varint/fixed_batched_kernel
    Build->>Kern: validate_enum_values_kernel
    Build->>Kern: scan_repeated_message_children_kernel

    Host->>Kern: valid_if (build struct null mask from d_row_has_invalid_enum)
    Host->>Java: ColumnVector handle (STRUCT column)
Loading

Comments Outside Diff (2)

  1. src/main/cpp/src/protobuf_builders.cu, line 986-1006 (link)

    OOB write + wrong index in validate_enum_values_kernel for repeated struct context

    When extract_typed_column is called here with total_count as num_items, it internally calls validate_enum_values_kernel with num_rows = total_count. That kernel writes row_has_invalid_enum[row] for every row index in [0, total_count - 1].

    However, d_row_has_invalid_enum was allocated in decode_protobuf_to_struct with exactly original_num_rows (top-level row count) elements. Because a repeated field typically produces total_count ≥ num_rows occurrences (each row can carry ≥ 1 repetition), this write is out-of-bounds as soon as any row has more than one occurrence.

    There is also a second, semantic bug: even when there is no OOB, the kernel writes to d_row_has_invalid_enum[occurrence_index] rather than d_row_has_invalid_enum[occ.row_idx] (the actual top-level row that should be nulled). The correct top-level row mapping is already available in d_top_row_indices computed a few lines earlier.

    Both issues affect any schema where a top-level repeated struct contains a child INT32 field with enum_valid_values set.

    To fix, the validate_enum_values_kernel call path inside extract_typed_column needs to receive the top-level row index mapping for nested/repeated contexts (e.g. an optional int32_t const* top_row_indices and a top_row_count parameter), or the propagation to d_row_has_invalid_enum should be done in the caller after receiving a per-item invalid flag vector, using d_top_row_indices as the mapping.

    The same pattern recurs in build_nested_struct_column where it calls extract_typed_column for scalar children inside repeated occurrences.

  2. src/main/cpp/src/protobuf_builders.cu, line 553-566 (link)

    Non-atomic concurrent write to shared d_row_has_invalid_enum entries

    Multiple GPU threads that process different elements belonging to the same top-level row will concurrently execute row_invalid[occs[idx].row_idx] = true. Because all writers store the same value (true) the comment is correct that the outcome is deterministic on sm_70+ hardware, but this is technically a data race under the CUDA memory model and is undefined behaviour in C++17/20.

    A safer idiom that is standard and portable is:

    atomicOr(reinterpret_cast<unsigned int*>(&row_invalid[occs[idx].row_idx]),
             static_cast<unsigned int>(1u));

    bool is one byte on all supported platforms, and reinterpret_cast to unsigned int* would be UB. Instead, store d_row_has_invalid_enum as uint8_t* (or use a separate uint8_t array) so that atomicOr((unsigned char*)ptr, 1u) via inline PTX or the cuda::atomic API compiles correctly.

Last reviewed commit: 5357378

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (6)

  1. src/main/cpp/src/protobuf_simple.cu, line 89-91 (link)

    logic: potential overflow: len64 can be up to 2^64-1, but casting to int on line 90 can overflow if len64 > INT_MAX

  2. src/main/cpp/src/protobuf_simple.cu, line 323-324 (link)

    logic: potential overflow: len64 can be larger than INT_MAX, but casting to int on line 324 will overflow

  3. src/main/cpp/src/protobuf_simple.cu, line 375-376 (link)

    logic: race condition: multiple threads write to *error_flag without atomics, causing undefined behavior when multiple threads encounter errors simultaneously

    Then in kernels, use atomicOr(error_flag, 1) instead of *error_flag = 1

  4. src/main/cpp/src/protobuf_simple.cu, line 398-407 (link)

    logic: protobuf uses zigzag encoding for signed integers (sint32/sint64), but varint decoding here treats them as unsigned - decoding negative values will produce incorrect results. Are you only supporting unsigned int32/int64, or should zigzag decoding be implemented for signed types?

  5. src/main/cpp/src/protobuf_simple.cu, line 240 (link)

    syntax: type punning through reinterpret_cast of incompatible pointer types is undefined behavior in C++

  6. src/main/cpp/src/protobuf_simple.cu, line 248 (link)

    syntax: type punning through reinterpret_cast of incompatible pointer types is undefined behavior in C++

6 files reviewed, 6 comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven requested a review from Copilot December 25, 2025 03:43
@thirtiseven
Copy link
Collaborator Author

@greptile full review

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (2)

  1. src/main/cpp/src/protobuf_simple.cu, line 186 (link)

    logic: zigzag decode uses signed right shift which is correct for signed types, but applied to unsigned v

  2. src/main/cpp/src/ProtobufSimpleJni.cpp, line 55 (link)

    logic: encodings is constructed from n_type_scales but then out_types is also constructed using n_type_scales[i] as the scale parameter, which would be wrong for non-decimal types where this represents encoding

6 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptileai full review

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@nvauto
Copy link
Collaborator

nvauto commented Jan 19, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile list all remain issues

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 5, 2026

All review comments appear to be resolved!

Looking at the PR:

  • ✅ All 70+ review threads from previous reviews are marked as resolved and outdated
  • The PR has 95 total review comments, all of which appear to have been addressed
  • The PR is currently in draft state

Since all the review feedback has been resolved, the main remaining task is likely to:

  1. Mark the PR as ready for review (remove draft status)
  2. Request review from maintainers
  3. Ensure CI/tests are passing

If you'd like me to check for any specific unresolved issues or review any particular aspect of the code, let me know!

@thirtiseven
Copy link
Collaborator Author

@greptile full review again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile full review again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile please check again

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile review

@thirtiseven thirtiseven marked this pull request as ready for review March 12, 2026 07:08
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

@greptile full review again and give comments rather than just update summary. please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants