
Support from_protobuf expression #14354

Draft
thirtiseven wants to merge 35 commits into NVIDIA:main from thirtiseven:from_protobuf_nested

Conversation

@thirtiseven (Collaborator)

Fixes #14069.

Description

This is a large PR that supports a (big) subset of the from_protobuf expression.

I will add documentation, performance numbers, and other information to this PR very soon.

I expect this PR to be split into smaller ones that will be merged over time.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this Mar 3, 2026
@thirtiseven (Collaborator Author)

@greptileai full review

@greptile-apps (Contributor)

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR adds GPU acceleration for Spark's from_protobuf expression, implementing a complete decode pipeline: reflection-based proto descriptor analysis, flattened schema construction, schema pruning (only decode fields referenced downstream), ordinal remapping into the pruned output, and JNI dispatch via Protobuf.decodeToStruct. It introduces five new Scala files, a large Python integration test suite, and test infra for automatically downloading spark-protobuf JARs.

The prior review rounds addressed an extensive set of critical bugs (proto3 acceptance on reflection failure, willNotWorkOnGpu fallbacks, hasDefaultValue flag using wrong variable, shim JSON header gaps, BinaryType default value, reference-equality issues in schema deduplication, and many more). The current revision is substantially cleaner.

Key remaining items:

  • invokeBuildDescriptor retry is fragile: The Spark 3.5+ compatibility retry in SparkProtobufCompat only catches ClassCastException/MatchError from the InvocationTargetException. Other runtime exceptions that could arise from the binary-vs-string payload mismatch (e.g., IllegalArgumentException, InvalidProtocolBufferException) escape unhandled and would fail the query instead of falling back to CPU.
  • nestedMsgDesc parameter semantics: In the recursive struct traversal (addFieldWithChildren / addChildFieldsFromStruct), nestedMsgDesc carries the descriptor of the containing message (not the current field's own message), which is counter-intuitive and could cause maintenance errors. A rename and clarifying comment would help.
  • extractFieldInfo loses primary unsupported reason: When checkFieldSupport flags a type-mismatch and defaultValueResult independently returns Left, the actionable type-mismatch message is silently replaced by the default-value reflection error, making diagnostics harder.
  • ENABLE_PROTOBUF_BATCH_MERGE_AFTER_PROJECT defaults to false: The post-project coalesce optimization is permanently disabled until users explicitly flip an internal flag, with no log-level indication that it is off.
  • PR checklists are open: Documentation, performance numbers, and test coverage descriptions are all unchecked. The PR description itself notes this work is intended to be split into smaller pieces before merge.

Confidence Score: 2/5

  • Not yet safe to merge: all three PR checklist items (docs, tests, perf) are unchecked, and the PR description explicitly states it will be split into smaller PRs before landing.
  • The implementation is architecturally sound and a large number of prior critical bugs have been addressed in review iterations. However, the PR is self-described as incomplete (documentation, performance data, and test-coverage descriptions are all TODO), and one logic-level issue remains in the Spark 3.5+ descriptor retry path that could cause queries to fail rather than fall back to CPU. The post-project coalesce optimization is also silently disabled by default. Given the stated intent to split this into smaller PRs and the incomplete checklist, a score of 2 reflects that more work is needed before this is merge-ready.
  • SparkProtobufCompat.scala (retry exception coverage), ProtobufExprShims.scala (recursive struct traversal parameter naming and analyzeRequiredFields guard logic), and basicPhysicalOperators.scala (post-project coalesce default).

Important Files Changed

Filename Overview
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/ProtobufExprShims.scala Core GPU tagging logic for from_protobuf: handles schema analysis, field pruning, ordinal remapping, and flat schema construction. Previous review threads addressed many critical bugs (unsupported field handling, proto3 rejection, willNotWorkOnGpu fallbacks). Remaining concerns: confusingly named nestedMsgDesc parameter in recursive struct traversal, and redundant collectedExprs.isEmpty guard in analyzeRequiredFields.
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/SparkProtobufCompat.scala Reflection-based compatibility shim for Spark's spark-protobuf module. Handles Spark 3.4 vs 3.5+ API differences, proto syntax detection, default-value extraction, and descriptor resolution. The Spark 3.5+ retry in invokeBuildDescriptor only catches ClassCastException and MatchError — other runtime exceptions from the binary-descriptor mismatch could escape and fail the query instead of falling back to CPU.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFromProtobuf.scala GPU expression that drives JNI-level protobuf decoding. Correctly overrides equals/hashCode using java.util.Arrays for array fields, adds a safety-net logging catch for unexpected CudfException in PERMISSIVE mode, and documents that ProtobufSchemaDescriptor is a pure-Java holder requiring no explicit close.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaExtractor.scala Field support analysis and wire-type resolution. analyzeAllFields correctly records reflection failures as isSupported = false rather than returning Left immediately, enabling pruning of unreachable fields. Minor issue: when both checkFieldSupport and defaultValueResult produce errors, the primary type-mismatch reason is silently dropped in favour of the default-value reflection error.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaValidator.scala Flat-schema construction and default-value encoding. encodeDefaultValue now returns Either instead of throwing, propagating type mismatches as CPU-fallback signals. Enum, binary, string, and numeric default values are all handled correctly.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaModel.scala Data model classes for the protobuf schema pipeline. DescriptorBytes.equals/hashCode correctly use java.util.Arrays for content-based comparison, fixing the sameDecodeSemantics false-negative issue from the prior thread.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala Adds post-project coalesce logic for protobuf-projecting GpuProjectExec and GpuProjectAstExec. The feature is gated by ENABLE_PROTOBUF_BATCH_MERGE_AFTER_PROJECT which defaults to false (internal flag), meaning the optimization is silently disabled in production by default.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/complexTypeExtractors.scala Adds GpuGetStructFieldMeta and GpuGetArrayStructFieldsMeta that read PRUNED_ORDINAL_TAG to remap field ordinals into the pruned decoded schema. GpuGetArrayStructFieldsMeta now correctly derives effectiveNumFields from the post-pruning child type.
integration_tests/run_pyspark_from_build.sh Adds automatic download of spark-protobuf and protobuf-java JARs at test time. Version detection auto-reads the bundled Spark JAR, with a per-version fallback table. Previous issues (curl --fail, leading-space classpath, quoting) have been addressed.
integration_tests/src/main/python/spark_init_internal.py Adds _add_driver_classpath helper that merges new JARs into the existing --driver-class-path in PYSPARK_SUBMIT_ARGS. Previous issues (early return, unescaped regex replacement, comma split) are resolved. re is already imported at the top of the file.
integration_tests/src/main/python/protobuf_test.py 3826-line integration test suite covering scalar, nested, repeated, enum, and edge-case protobuf decode scenarios. Test helper gracefully skips when spark-protobuf JVM classes are absent. Previous issues (xfail markers, options-drop on legacy API path) are addressed.
sql-plugin/src/test/scala/com/nvidia/spark/rapids/shims/ProtobufExprShimsSuite.scala Unit tests for the shim layer covering schema validation, default-value encoding, flatten-schema construction, and descriptor-source equality. Good coverage of the error paths added in previous review iterations.

Sequence Diagram

sequenceDiagram
    participant Catalyst as Catalyst Optimizer
    participant Shim as ProtobufExprShims (tagExprForGpu)
    participant Compat as SparkProtobufCompat
    participant Extractor as ProtobufSchemaExtractor
    participant Validator as ProtobufSchemaValidator
    participant GPU as GpuFromProtobuf (doColumnar)
    participant JNI as Protobuf.decodeToStruct (JNI)

    Catalyst->>Shim: tagExprForGpu(ProtobufDataToCatalyst)
    Shim->>Compat: extractExprInfo(expr) → ProtobufExprInfo
    Compat-->>Shim: messageName, descriptorSource, options
    Shim->>Compat: resolveMessageDescriptor(exprInfo) → ProtobufMessageDescriptor
    Compat-->>Shim: ReflectiveMessageDescriptor (via reflection)
    Shim->>Extractor: analyzeAllFields(schema, msgDesc, enumsAsInts)
    Extractor-->>Shim: Map[fieldName → ProtobufFieldInfo]
    Shim->>Shim: analyzeRequiredFields() → Set[requiredFieldNames]
    Note over Shim: Schema pruning: only required fields decoded
    Shim->>Shim: registerPrunedOrdinals() on GetStructField/GetArrayStructFields
    Note over Shim: PRUNED_ORDINAL_TAG set on downstream extractors
    loop for each required field
        Shim->>Validator: toFlattenedFieldDescriptor(path, field, info)
        Validator-->>Shim: FlattenedFieldDescriptor
    end
    Shim->>Validator: validateFlattenedSchema(flatFields)
    Shim->>Shim: convertToGpu() → GpuFromProtobuf

    Catalyst->>GPU: doColumnar(inputBinaryColumn)
    GPU->>JNI: Protobuf.decodeToStruct(input, ProtobufSchemaDescriptor, failOnErrors)
    JNI-->>GPU: cudf.ColumnVector (struct)
    GPU->>GPU: mergeAndSetValidity (apply input nulls)
    GPU-->>Catalyst: decoded StructType column

Comments Outside Diff (1)

  1. sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/SparkProtobufCompat.scala, line 1174-1185 (link)

    DescriptorPath retry catches too narrow a set of exceptions

    The retry for Spark 3.5+ only triggers on ClassCastException or MatchError wrapped in InvocationTargetException. However, depending on how Spark 3.5+'s buildDescriptor validates its Option[Array[Byte]] argument when a String is passed, it could also throw IllegalArgumentException, InvalidProtocolBufferException, or UnsupportedOperationException — none of which are caught here. In that case, the InvocationTargetException propagates out of the Try in resolveMessageDescriptor as a non-recoverable failure, causing the entire query to fail rather than falling back gracefully to CPU.

    Consider catching a broader class of exceptions to defensively cover the version-mismatch case:
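A hedged sketch of what that broader catch might look like. The helper names (`invokeBuildDescriptorWithRetry`, `primary`, `fallback`) are illustrative stand-ins, not the shim's real API; the idea is to retry on any non-fatal cause of the reflective call rather than only `ClassCastException`/`MatchError`:

```scala
import java.lang.reflect.InvocationTargetException
import scala.util.control.NonFatal

// Illustrative sketch, not the shim's real API: broaden the retry trigger so
// any non-fatal cause of the reflective buildDescriptor call falls back to the
// alternate payload shape instead of failing the query.
def invokeBuildDescriptorWithRetry(primary: () => AnyRef,
                                   fallback: () => AnyRef): AnyRef = {
  try {
    primary()
  } catch {
    case ite: InvocationTargetException if Option(ite.getCause).exists(NonFatal(_)) =>
      // Covers IllegalArgumentException, UnsupportedOperationException,
      // InvalidProtocolBufferException, MatchError, and similar causes of a
      // binary-vs-string payload mismatch.
      fallback()
  }
}
```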

Last reviewed commit: 16f2f6e

@thirtiseven (Collaborator Author)

@greptileai full review

@thirtiseven (Collaborator Author)

@greptileai full review again

@thirtiseven (Collaborator Author)

@greptileai full review again

@thirtiseven (Collaborator Author)

@greptileai full review again

@thirtiseven (Collaborator Author)

@greptile please check again

@thirtiseven (Collaborator Author)

@greptile please check again

@thirtiseven (Collaborator Author)

@greptile please check again

@thirtiseven (Collaborator Author)

@greptile please check again

@thirtiseven (Collaborator Author)

@greptile review

@thirtiseven (Collaborator Author)

@greptile review

@thirtiseven (Collaborator Author)

@greptile review

@thirtiseven (Collaborator Author)

@greptile review

@thirtiseven (Collaborator Author)

@greptile review

Comment on lines +544 to +545
}


addFieldWithChildren: nestedMsgDesc semantics are correct but confusingly named

In addChildFieldsFromStruct, each child addFieldWithChildren call receives parentIdx (the index passed into addChildFieldsFromStruct) as its parent index, and the recursive call for a deeper struct child passes currentIdx as the new parentIdx, so parent indices line up correctly at every nesting depth.

The subtlety is in the descriptor argument. When the code does:

case ArrayType(st: StructType, _) if nestedMsgDesc != null =>
  addChildFieldsFromStruct(st, nestedMsgDesc, sf.name, currentIdx, depth, pathPrefix)

nestedMsgDesc is the descriptor of the message that contains sf, not sf's own message descriptor. addChildFieldsFromStruct then calls parentMsgDesc.findField(sf.name) to locate sf within its container and uses parentField.get.messageDescriptor to resolve sf's children. Tracing the recursion (the root msgDesc at the top level, childMsgDesc on recursive calls) shows every lookup resolves against the message that actually contains the field, so the logic is sound.

The recursive semantics of nestedMsgDesc are nonetheless counter-intuitive: at each level it names the containing message rather than the struct just emitted, and the same value is immediately reused one level down. Consider renaming nestedMsgDesc to containingMsgDesc and adding a clarifying comment to avoid future maintenance mistakes.
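A small runnable sketch of the suggested containingMsgDesc convention. The types and names (`MsgDesc`, `addChildFields`) are simplified stand-ins for the shim's classes; the point is that the descriptor parameter always names the message containing the field being processed, and each recursive call hands down the field's own message descriptor as the new container:

```scala
// Illustrative stand-in for a protobuf message descriptor: a name plus its
// message-typed fields (leaf fields omitted for brevity).
final case class MsgDesc(name: String, fields: Map[String, MsgDesc])

// containingMsgDesc is always the descriptor of the message that contains
// fieldName; the recursion passes the child's own descriptor as the next container.
def addChildFields(containingMsgDesc: MsgDesc, fieldName: String,
                   emit: String => Unit): Unit = {
  containingMsgDesc.fields.get(fieldName).foreach { childMsgDesc =>
    emit(s"${containingMsgDesc.name}.$fieldName")
    childMsgDesc.fields.keys.foreach(addChildFields(childMsgDesc, _, emit))
  }
}
```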


Comment on lines 100 to 121
@@ -76,7 +121,7 @@ class GpuProjectExecMeta(
}

ENABLE_PROTOBUF_BATCH_MERGE_AFTER_PROJECT defaults to false, silently disabling the post-project coalesce optimization

The config's default is false and is marked internal(). This means users who deploy protobuf decoding with schema pruning will not benefit from the post-project coalesce even though it is the correct behavior to request a coalesce after producing small projected batches. The intent of this config appears to be a safety valve rather than a permanent opt-in gate.

Consider whether the default should be true once the feature is considered stable, and document in the config's doc string what conditions must hold before enabling it, so operators know when it is safe to flip. Otherwise, production deployments will silently miss the optimization without any indication in the logs that it is disabled.

Comment on lines +74 to +81
isRequired = fieldDescriptor.isRequired,
defaultValue = defaultValue,
enumMetadata = fieldDescriptor.enumMetadata,
isRepeated = fieldDescriptor.isRepeated
)
}
}


extractFieldInfo silently drops the checkFieldSupport unsupported reason when defaultValueResult is Left

When checkFieldSupport returns (false, Some("type mismatch: ..."), ...) but fieldDescriptor.defaultValueResult is also Left("reflection failure for default value"), the function returns Left("reflection failure for default value"). In analyzeAllFields, this Left becomes the unsupportedReason in unsupportedFieldInfo, discarding the type-mismatch reason entirely.

While not a correctness bug (the field is still correctly marked unsupported), the error message surfaced to the user will say "reflection failure" instead of the actionable "type mismatch: Spark X vs Protobuf Y", which hinders debugging.

Consider building the ProtobufFieldInfo directly with the known isSupported/unsupportedReason from checkFieldSupport before inspecting defaultValueResult, so the primary unsupported reason is never lost:

val (isSupported, unsupportedReason, encoding) = checkFieldSupport(...)
val defaultValue = fieldDescriptor.defaultValueResult match {
  case Right(v) => v
  case Left(err) if !isSupported => None  // ignore, field is already unsupported
  case Left(err) => return Left(err)       // only propagate if the field was otherwise OK
}
Right(ProtobufFieldInfo(..., isSupported = isSupported, ...))
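A runnable sketch of this precedence rule. `FieldInfo` and the parameter shapes are illustrative, not the plugin's real `ProtobufFieldInfo`; the point is that a default-value reflection error only propagates when the field was otherwise supported:

```scala
// Illustrative stand-in for the plugin's field-info record.
final case class FieldInfo(name: String,
                           isSupported: Boolean,
                           unsupportedReason: Option[String],
                           defaultValue: Option[Any])

def extractFieldInfo(name: String,
                     supportCheck: (Boolean, Option[String]),
                     defaultValueResult: Either[String, Option[Any]]): Either[String, FieldInfo] = {
  val (isSupported, reason) = supportCheck
  defaultValueResult match {
    case Right(dv) => Right(FieldInfo(name, isSupported, reason, dv))
    // Field is already unsupported: keep the actionable type-mismatch reason.
    case Left(_) if !isSupported => Right(FieldInfo(name, isSupported, reason, None))
    // Only surface the reflection error when the field was otherwise OK.
    case Left(err) => Left(err)
  }
}
```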

Comment on lines +658 to +671
val runtimeOrd = orderedChildren.indexOf(childName)
if (runtimeOrd >= 0) {
gsf.setTagValue(ProtobufExprShims.PRUNED_ORDINAL_TAG, runtimeOrd)
}
}
}
case Some(parentPath) if parentPath.isEmpty =>
val runtimeOrd = topLevelIndices.indexOf(ordinal)
if (runtimeOrd >= 0) {
gsf.setTagValue(ProtobufExprShims.PRUNED_ORDINAL_TAG, runtimeOrd)
}
case _ =>
}
case gasf @ GetArrayStructFields(childExpr, field, _, _, _) =>

analyzeRequiredFields short-circuits on collectedExprs.isEmpty even when fieldReqs is already populated

The guard condition:

if (!safeToPrune || collectedExprs.isEmpty || hasDirectStructRef || fieldReqs.isEmpty) {

evaluates collectedExprs.isEmpty before fieldReqs.isEmpty. Because fieldReqs is populated solely via collectStructFieldReferences calls on the expressions in collectedExprs, it is impossible for collectedExprs to be empty while fieldReqs is non-empty. The check is harmless but misleading — it could suggest that collecting expressions and populating requirements are independent paths. Consider simplifying to just check fieldReqs.isEmpty (which subsumes the collectedExprs.isEmpty case), and adding a comment explaining that an empty fieldReqs covers both "no expressions found" and "no protobuf fields referenced".
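A minimal sketch of the simplified guard, assuming (as stated above) that fieldReqs can only be populated from collectedExprs. All names here are illustrative stand-ins for the shim's locals:

```scala
// fieldReqs is populated solely from collectedExprs, so fieldReqs.isEmpty
// subsumes the collectedExprs.isEmpty check: it covers both "no expressions
// collected" and "no protobuf fields referenced".
def shouldSkipPruning(safeToPrune: Boolean,
                      hasDirectStructRef: Boolean,
                      fieldReqs: Map[String, Set[String]]): Boolean = {
  !safeToPrune || hasDirectStructRef || fieldReqs.isEmpty
}
```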


Development

Successfully merging this pull request may close these issues.

[FEA] Support from_protobuf
