feat(schema): Add support to write shredded variants for HoodieRecordType.SPARK #18036
Conversation
- Add adapter pattern for Spark 3 and 4
- Clean up invariant issue in SparkSqlWriter
- Add cross-engine test
- Add backward-compatibility test for Spark 3.x
- Add cross-engine read for Flink
Force-pushed from b5bfda3 to d5d986c.
```java
StructType shreddedStruct = SparkAdapterSupport$.MODULE$.sparkAdapter()
    .generateVariantShreddingSchema(typedValueDataType, true, false);

// Add metadata to mark this as a shredding struct
StructType markedShreddedStruct = SparkAdapterSupport$.MODULE$.sparkAdapter()
    .addVariantWriteShreddingMetadata(shreddedStruct);
```
Could we combine these?
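For illustration, a hypothetical combined form; the single helper below is an illustrative name, not an existing adapter method:

```java
// Hypothetical: one adapter call that generates the shredding schema and
// attaches the write-shredding metadata in a single step.
// generateVariantWriteShreddingSchema is an illustrative name only.
StructType markedShreddedStruct = SparkAdapterSupport$.MODULE$.sparkAdapter()
    .generateVariantWriteShreddingSchema(typedValueDataType, true, false);
```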
```java
if (fieldHoodieSchema.getType() == HoodieSchemaType.VARIANT) {
  HoodieSchema.Variant variantSchema = (HoodieSchema.Variant) fieldHoodieSchema;
  if (variantSchema.isShredded() && variantSchema.getTypedValueField().isPresent()) {
```
Do we expect a case where isShredded is true but the value field is not present?
Nope. The value field must always be present, from my understanding.
With unstructured data, any given value can either be shredded or not, and shredding is by design a feature on top of variants. The default state of every variant field is therefore unshredded, so the value field has to be there to hold any data that cannot be shredded.
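As a minimal sketch of what that shape looks like (following the Parquet variant shredding layout; the `LongType` for `typed_value` is an arbitrary example):

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sketch of a shredded variant group: `value` stays present to hold any data
// that cannot be shredded into `typed_value`; an unshredded variant carries
// only `metadata` and `value`.
StructType shreddedVariant = new StructType(new StructField[] {
    new StructField("metadata", DataTypes.BinaryType, false, Metadata.empty()),
    new StructField("value", DataTypes.BinaryType, true, Metadata.empty()),
    new StructField("typed_value", DataTypes.LongType, true, Metadata.empty())
});
```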
Can we simplify this condition then?
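For example, if `isShredded()` already guarantees the typed value field, the check could hypothetically collapse to:

```java
// Hypothetical simplification, assuming isShredded() implies
// getTypedValueField() is always present:
if (fieldHoodieSchema.getType() == HoodieSchemaType.VARIANT
    && ((HoodieSchema.Variant) fieldHoodieSchema).isShredded()) {
  // ... handle the shredded variant field
}
```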
```java
    && SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType) shreddedField.dataType())
    && SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantType(originalField.dataType())) {
```
Similarly, could this be a single check?
Done.
The isVariantType(originalField.dataType()) check was redundant because generateShreddedSchema only produces shredding structs (with the metadata that isVariantShreddingStruct looks for) from fields that are already Variant types.
Hence, isVariantShreddingStruct being true already implies the original field was a Variant; I'll remove the isVariantType check.
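The resulting check would then read along these lines:

```java
// After dropping the redundant isVariantType check, the shredding-struct
// marker alone identifies a shredded Variant field (sketch):
if (shreddedField.dataType() instanceof StructType
    && SparkAdapterSupport$.MODULE$.sparkAdapter()
        .isVariantShreddingStruct((StructType) shreddedField.dataType())) {
  // ... create the shredded-variant writer
}
```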
```java
}).toArray(ValueWriter[]::new);

// Check if this field is a shredded Variant (shreddedField has shredding struct, originalField has VariantType)
if (shreddedField.dataType() instanceof StructType
```
Nitpick: can we move this into the makeWriter so all the logic for creating the writer methods is contained in one place?
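A rough sketch of the suggested shape, with all writer-selection logic inside `makeWriter`; the helper names below are illustrative:

```java
// Sketch only: makeShreddedVariantWriter and makeDefaultWriter are
// hypothetical names standing in for the existing writer-creation code.
private ValueWriter makeWriter(StructField shreddedField) {
  DataType dataType = shreddedField.dataType();
  if (dataType instanceof StructType
      && SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType) dataType)) {
    return makeShreddedVariantWriter((StructType) dataType);
  }
  return makeDefaultWriter(dataType);
}
```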
Done, the code definitely looks a lot cleaner now.
Force-pushed from a9565bb to 17f548b, then from 17f548b to 99eadaf.
```java
HoodieSchema parsedSchema = HoodieSchema.parse(schemaString);
return HoodieSchemaUtils.addMetadataFields(parsedSchema, config.getBooleanOrDefault(ALLOW_OPERATION_METADATA_FIELD));
});
// Generate shredded schema if there are shredded Variant columns
```
It would be good to note the behavior when there are no shredded columns, like "falls back to provided schema if no shredded Variant columns are present"
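Something along these lines, for example (variable names are illustrative):

```java
// Generate shredded schema if there are shredded Variant columns; falls back
// to the provided schema if no shredded Variant columns are present.
StructType writeSchema = hasShredding ? shreddedSchema : providedSchema;
```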
```java
DataType dataType = field.dataType();

// Check if this is a Variant field that should be shredded
if (SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantType(dataType)) {
```
Could we rely directly on the provided HoodieSchema here?
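That is, hypothetically keying off the `HoodieSchema` field type rather than the Catalyst `DataType`; the `getField` accessor here is illustrative:

```java
// Hypothetical alternative: decide based on the HoodieSchema field type directly.
HoodieSchema fieldHoodieSchema = hoodieSchema.getField(field.name()); // illustrative accessor
if (fieldHoodieSchema.getType() == HoodieSchemaType.VARIANT) {
  // ... shredding decision driven by the HoodieSchema, not the Spark DataType
}
```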
```java
// Check if this is a Variant field that should be shredded
if (SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantType(dataType)) {
  HoodieSchema fieldHoodieSchema = Option.ofNullable(hoodieSchema)
```
Is it ever possible for the provided schema to be null? Is there a possibility of it being out of sync with the struct?
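If it can be null, a defensive lookup such as the following sketch might apply (the accessor is illustrative):

```java
// Hypothetical null-safe lookup by field name, so the HoodieSchema and the
// Catalyst struct cannot silently diverge by position:
HoodieSchema fieldHoodieSchema = hoodieSchema == null
    ? null
    : hoodieSchema.getField(field.name()); // illustrative accessor
```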
```java
StructField[] shreddedFields = new StructField[fields.length];
boolean hasShredding = false;

for (int i = 0; i < fields.length; i++) {
```
This loop only covers top-level fields. Should we recursively inspect the struct fields, as sketched below?
If it is possible to have nested Variant fields, let's make sure we have a test for it as well.
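A minimal sketch of the recursive variant being suggested; the names are illustrative:

```java
// Recursively rebuild the struct so nested Variant fields are also considered
// for shredding, not just top-level ones. shredVariantField is a hypothetical
// helper standing in for the existing top-level shredding logic.
private static StructType inspectRecursively(StructType schema) {
  StructField[] out = new StructField[schema.fields().length];
  for (int i = 0; i < schema.fields().length; i++) {
    StructField f = schema.fields()[i];
    if (f.dataType() instanceof StructType) {
      out[i] = new StructField(f.name(),
          inspectRecursively((StructType) f.dataType()), f.nullable(), f.metadata());
    } else {
      out[i] = shredVariantField(f); // hypothetical: applies the top-level logic
    }
  }
  return new StructType(out);
}
```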
Describe the issue this Pull Request addresses
Adds read and write support for the Spark `Variant` data type in Hudi. Since `Variant` is exclusive to Spark 4.0+, this PR introduces an adapter pattern to handle schema conversion differences, ensuring backward compatibility with Spark 3.x while enabling semi-structured data support in Spark 4.x.

Closes: #17747
Note0: Please merge this PR first: #17833
Note1: This only covers
HoodieRecordType.SPARK. AVRO will be covered in another PR.Note2:
This PR is purely to add support for writing shredded variants.
There is no end-2-end flow to allow users to enable shredded variant writes as of now. We will address this in a separate PR to make PRs small and manageable for reviews.
The next PR for this will be to add a parquet-config to allow users to enable/disable shredding and also to force shredding on certain columns for testing.
Summary and Changelog
This PR refactors schema converters behind
SparkAdapterto handle version-specific data types.Updated
HoodieRowParquetWriteSupportto allow support for writing shredded variants.Impact
Variantshredded columns.Risk Level
Low
Documentation Update
None
Contributor's checklist