feat: Add Unshredded Variant read & write support #17833
base: master
Conversation
voonhous
left a comment
Added some self-reviews
 * </pre>
 */
@Test
public void testWriteUnshreddedVariant() throws IOException {
These tests aren't really meaningful; they pass without the variant read + write code changes anyway.
Correction, they are meaningful in helping ensure that our definition of Variant is correct.
.../hudi-spark/src/test/scala/org/apache/spark/sql/hudi/feature/index/TestExpressionIndex.scala (resolved)
class TestVariantDataType extends HoodieSparkSqlTestBase {

  test("Test COW Table with Variant Data Type") {
Let's add a MOR test to see if there's anything missing too.
import java.nio.file.{Files, Path}

class TestHoodieFileGroupReaderOnSparkVariant extends SparkAdapterSupport {
This can be removed; I added it for debugging to allow finer-grained control. It is almost identical to TestVariantDataType.scala, except that it instantiates a CloseableInternalRowIterator for row reading.
if (fileStructMap.contains(f.name) && !isDataTypeEqual(requiredType, fileStructMap(f.name))) {
  val readerType = addMissingFields(requiredType, fileStructMap(f.name))
-  implicitTypeChangeInfo.put(new Integer(requiredSchema.fieldIndex(f.name)), org.apache.hudi.common.util.collection.Pair.of(requiredType, readerType))
+  implicitTypeChangeInfo.put(Integer.valueOf(requiredSchema.fieldIndex(f.name)), org.apache.hudi.common.util.collection.Pair.of(requiredType, readerType))
new Integer(...) has been deprecated (and later marked for removal) since Java 9. Using Integer#valueOf instead. This could be a separate PR.
Force-pushed 1858ee3 to 4d76cb4
Force-pushed 936d6cd to 345450a
Force-pushed 3b62c5a to 157fa96
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java (resolved)
...di-spark-client/src/main/java/org/apache/hudi/client/utils/SparkInternalSchemaConverter.java (resolved)
...ent/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkParquetReader.java (resolved)
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSchemaConversionUtils.scala (resolved)
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/InternalSchemaConverter.java (resolved)
List<Types.Field> variantFields = new ArrayList<>();
// Assign field IDs: these are used for schema evolution tracking
// Use negative IDs: indicate these are system-generated for Variant type
// TODO (voon): Check if we can remove the magic numbers?
Who can we reach out to in order to figure out how we should handle this?
Not very sure; the author of this is @xiarixiaoyao, but I don't think they are active anymore.
// Check for precision and scale if the schema has a logical decimal type.
// VARIANT (unshredded) type is excluded because it stores semi-structured data as opaque binary blobs,
// making min/max statistics meaningless
// TODO: For shredded, we are able to store colstats, explore that
Let's make a GH Issue and then link it here so we don't forget
Done
#17988
can you link it inline as well?
Done
hudi-common/src/test/java/org/apache/hudi/common/testutils/ZipTestUtils.java (resolved)
Force-pushed 157fa96 to 97ef2e3
} else if (dataType == DataTypes.BinaryType) {
  return (row, ordinal) -> recordConsumer.addBinary(
      Binary.fromReusedByteArray(row.getBinary(ordinal)));
} else if (SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantType(dataType)) {
Do we need something similar in HoodieAvroWriteSupport?
I don't think so; I have a test that uses HoodieRecordType.{AVRO, SPARK}. It should exercise both write supports, and there are no test failures.
In Avro, Variant is already an Avro record created by HoodieSchema.createVariant, with fields value (bytes) and metadata (bytes).
IIUC, Parquet's AvroWriteSupport handles this automatically since it knows how to convert:
- Avro record -> Parquet group
- Avro bytes -> Parquet binary
HoodieAvroWriteSupport just wraps AvroWriteSupport to add bloom filter support and does not override the write logic.
In the Spark Row path, custom handling is needed because Spark's VariantType requires special handling (createVariantValueWriter) to extract the raw bytes; there is no automatic Spark VariantType -> Parquet conversion from what I can see in our code.
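To illustrate the Avro path, here is a minimal sketch of the record shape described above; the field names/order ("value", "metadata") are assumptions taken from this thread rather than copied from HoodieSchema.createVariant, which may differ in details:

```scala
import org.apache.avro.SchemaBuilder

// Sketch only: approximates the unshredded Variant record shape discussed above.
val unshreddedVariantAvroSchema = SchemaBuilder.record("variant_col")
  .fields()
  .requiredBytes("value")     // raw Variant value blob
  .requiredBytes("metadata")  // raw Variant metadata blob
  .endRecord()

// AvroWriteSupport already maps: Avro record -> Parquet group, Avro bytes -> Parquet binary,
// which is why no custom handling is needed on the Avro write path.
```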
| "CREATE TABLE variant_table (" | ||
| + " id INT," | ||
| + " name STRING," | ||
| + " v ROW<`value` BYTES, metadata BYTES>," |
Is there a way to create the table with the HoodieSchema so the type is annotated as Variant?
Not sure what is meant by this. Do you mean some sort of utility function like HoodieSchema.getVariantTypeSQLStringForFlink that checks whether the current version of Flink supports Variant natively and applies the relevant Variant type?
Yes, something along those lines. Similar to how you have a test with the Variant type in TestVariantDataType.
I just want to clarify one thing: the table on disk will have Variant in the Hudi schema for this table, right?
Yes. It will have a variant logical type in HoodieSchema.
I'm using the physical type here as I am not really sure how to wire this up for both Flink 2.0+ and Flink 1.20; Variant is only supported in Flink 2.0+.
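For reference, a rough sketch of the helper idea floated above; the object and method names are hypothetical, and it assumes native VARIANT exists only on Flink 2.0+:

```scala
import scala.util.Try

// Hypothetical helper (sketch): choose the DDL type string for the variant column
// based on the Flink version the test is running against.
object VariantDdl {
  def variantTypeSqlString(flinkVersion: String): String = {
    val major = flinkVersion.split("\\.").headOption
      .flatMap(s => Try(s.toInt).toOption)
      .getOrElse(1)
    if (major >= 2) "VARIANT"                      // native Variant on Flink 2.0+
    else "ROW<`value` BYTES, metadata BYTES>"      // physical fallback on Flink 1.x
  }
}

// e.g. s"CREATE TABLE variant_table (id INT, name STRING, v ${VariantDdl.variantTypeSqlString("1.20")})"
```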
// Create parquet reader
val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)
val reader = sparkAdapter.createParquetFileReader(
Instead of using a parquetFileReader directly, can we use a FileGroupReader? I'm thinking later it will be useful when MoR is incorporated.
Let me remove this test; I added it to debug a SPECIFIC flow while developing the read/write support. It is not required, as the coverage here is provided by TestVariantDataType.scala.
See: #17833 (comment)
This test may still be useful, so I've pasted it in our issue for future reference.
It can be found here: #17746 (comment)
...c/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSparkVariant.scala (resolved)
...ource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/adapter/BaseSpark3Adapter.scala (resolved)
...ource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/adapter/BaseSpark3Adapter.scala (resolved)
// Handle VariantType comparisons
(requiredType, fileType) match {
  case (_: VariantType, s: StructType) if isVariantPhysicalSchema(s) => Some(true)
Why wouldn't the file's type also be Variant?
Files written before Spark 4.0 (or by older Hudi versions) have the struct representation.
When reading the Parquet file's schema directly (without Spark's logical type inference), we also get the physical struct type.
IIRC, this piece of code is here to address the issue of variant columns being read out as base64 strings:
Our Spark40ParquetReader (and the readers for other Spark versions) implicitly compares the requested schema, i.e. the requestedSchema from the user, against the fileSchema in order to do projection as a small optimization.
Because of the discrepancy between how Variant is represented as a DataType versus a MessageType, HoodieParquetFileFormatHelper builds an implicitTypeChangeInfo map for the unsafeProjection that looks something like this (fieldIndex -> (requestedSchema type, fileSchema type)):
7 -> Pair(VariantType, StructType(size = 2))
This unsafeProjection causes the variant bytes to be read out as:
{"metadata":"AQIAAwdrZXlsaXN0","value":"AgIAAQAHExl2YWx1ZTIDAwACBAYMAQwCDAM="}
instead of the technically equivalent raw-bytes form:
{'value': b'\x02\x02\x00\x01\x00\x07\x13\x19value2\x03\x03\x00\x02\x04\x06\x0c\x01\x0c\x02\x0c\x03', 'metadata': b'\x01\x02\x00\x03\x07keylist'}
The former is a base64 string representation, which impedes further evaluation/representation of the Variant, which should ultimately render (if Variant is supported) as:
{"key":"value2","list":[1,2,3]}
This only affects HoodieRecordType.SPARK.
That's why we need this here in Spark 4.0 and why the fileType is a StructType.
Hope this makes sense, especially with the in-memory snapshot of the unsafeProjection mapping copied above.
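To make the comparison concrete, here is a small sketch of the structural check; the helper name isVariantPhysicalSchema comes from the diff above, but the PR's actual implementation may differ:

```scala
import org.apache.spark.sql.types.{BinaryType, StructType}

// Sketch: the Parquet footer only exposes a group of two binary columns, so the
// physical Variant shape has to be detected structurally, not via a logical type.
def isVariantPhysicalSchema(s: StructType): Boolean =
  s.fields.length == 2 &&
    s.fieldNames.toSet == Set("value", "metadata") &&
    s.fields.forall(_.dataType == BinaryType)
```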
Moving this commit: To:
Force-pushed a88a018 to 18cb139
Rebased
} else if (logicalType == LogicalTypes.uuid()) {
  return UUID;
- } else if (logicalType instanceof VariantLogicalType) {
+ } else if (logicalType == VariantLogicalType.variant()) {
Added the comparison against the singleton here.
}

case other => throw new IncompatibleSchemaException(s"Unsupported HoodieSchemaType: $other")
// VARIANT type (Spark >4.x only), which will be handled via SparkAdapter
For lower Spark versions, do we want to just return the underlying struct?
In the current implementation in Spark 3.5, this is dead code, as the variant column will not have a logicalType of variant; it's a record instead.
The reason why it's dead code is:
- The tableSchema is resolved from schemaSpec.map(s => convertToHoodieSchema(s, tableName)) in HoodieBaseHadoopFsRelationFactory. This converts a struct where the variant column is a struct of metadata and value into HoodieSchema.
- Hence, when the code flow reaches HoodieSparkSchemaConverters, the variant column's HoodieSchema will not have the variant logical type and will not fall into this code path. It will resolve to the RECORD path instead.
This might change if the table has an internalSchema though, which I think we need to investigate. I'll create an issue for this!
Will inline the issue too.
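For illustration, this is roughly what the variant column looks like by the time it reaches HoodieSparkSchemaConverters on Spark 3.5 (field names/order and nullability are assumptions), which is why it resolves via the RECORD path:

```scala
import org.apache.spark.sql.types.{BinaryType, StructField, StructType}

// Sketch: on Spark 3.x the column is just this struct; no Variant logical type is
// attached, so the VARIANT branch above is never taken.
val variantColumnOnSpark3x = StructType(Seq(
  StructField("metadata", BinaryType, nullable = true),
  StructField("value", BinaryType, nullable = true)))
```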
import org.apache.spark.sql.types.{ArrayType, DataType, DateType, DecimalType, DoubleType, FloatType, IntegerType, LongType, MapType, StringType, StructField, StructType, TimestampNTZType}

- object HoodieParquetFileFormatHelper {
+ trait HoodieParquetFileFormatHelperTrait {
Do we need this switch from object to trait?
Nope, reverting.
// Maps VariantType to a group containing 'metadata' and 'value' fields.
// This ensures Spark 4.0 compatibility and supports both Shredded and Unshredded schemas.
// Note: We intentionally omit 'typed_value' for shredded variants as this writer only accesses raw binary blobs.
final byte[][] variantBytes = new byte[2][]; // [0] = value, [1] = metadata
Instead of reusing these byte arrays, can we just return the pair of byte arrays from the variant data to avoid the extra copy?
I've made the code here leaner. Instead of storing the bytes and reading them back, the Parquet writing logic is now passed directly into the consumers.
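A minimal sketch of that leaner shape, assuming the unshredded group layout (value at index 0, metadata at index 1); the actual PR wires this through createVariantValueWriter rather than a standalone method:

```scala
import org.apache.parquet.io.api.{Binary, RecordConsumer}

// Sketch: write the Variant's value/metadata blobs straight into the Parquet group,
// with no intermediate byte[2][] staging buffer.
def writeUnshreddedVariant(consumer: RecordConsumer,
                           valueBytes: Array[Byte],
                           metadataBytes: Array[Byte]): Unit = {
  consumer.startGroup()
  consumer.startField("value", 0)
  consumer.addBinary(Binary.fromConstantByteArray(valueBytes))
  consumer.endField("value", 0)
  consumer.startField("metadata", 1)
  consumer.addBinary(Binary.fromConstantByteArray(metadataBytes))
  consumer.endField("metadata", 1)
  consumer.endGroup()
}
```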
| "CREATE TABLE variant_table (" | ||
| + " id INT," | ||
| + " name STRING," | ||
| + " v ROW<`value` BYTES, metadata BYTES>," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to clarify one thing, the table on disk will have Variant in the Hudi schema for this table, right?
    break;
  case UNION:
    return convertUnion(fieldName, schema, repetition, schemaPath);
  case VARIANT:
Can you update the TestAvroSchemaConverter to cover this branch?
Done
package org.apache.spark.sql.hudi.command

import org.apache.hudi.{DataSourceWriteOptions, SparkAdapterSupport}
import org.apache.hudi.SparkAdapterSupport.sparkAdapter
Remove changes to this file?
Done
// Note: We intentionally omit 'typed_value' for shredded variants as this writer only accesses raw binary blobs.
BiConsumer<SpecializedGetters, Integer> variantWriter = SparkAdapterSupport$.MODULE$.sparkAdapter().createVariantValueWriter(
    dataType,
    valueBytes -> consumeField("value", 0, () -> recordConsumer.addBinary(Binary.fromReusedByteArray(valueBytes))),
My understanding is that the valueBytes are not part of a reused byte array. They are already copied when the variant object is read, so you can skip this copy.
I traced the code a little and I think you're right.
CMIIW if this does not align with your mental model: the Variant is created from org.apache.spark.sql.catalyst.expressions.UnsafeRow#getVariant:
@Override
public VariantVal getVariant(int ordinal) {
  if (isNullAt(ordinal)) return null;
  return VariantVal.readFromUnsafeRow(getLong(ordinal), baseObject, baseOffset);
}
Looking at org.apache.spark.unsafe.types.VariantVal#readFromUnsafeRow, new bytes are allocated for both metadata and value.
So these are essentially copies.
I will change fromReusedByteArray to fromConstantByteArray then.
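A sketch of the intended swap (not the exact PR diff): since readFromUnsafeRow hands back freshly allocated arrays, fromConstantByteArray can wrap them without the defensive copy that fromReusedByteArray implies:

```scala
import org.apache.parquet.io.api.Binary

// Before: Binary.fromReusedByteArray(valueBytes)   // signals the array may be mutated, so Parquet copies
// After:  Binary.fromConstantByteArray(valueBytes) // wraps the already-copied array as-is
def variantBlobToBinary(valueBytes: Array[Byte]): Binary =
  Binary.fromConstantByteArray(valueBytes)
```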
}

override def isVariantType(dataType: DataType): Boolean = {
  import org.apache.spark.sql.types.VariantType
One style question: why are the imports inline with the method instead of at the top of the file?
I was encountering some import issues while debugging and implementing the different version switches. I'll check again whether we can position them at the top of the file!
If it is possible, let's clean up all the places this is done in this file.
Moved them to the top of the file.
    fieldName: String,
    fieldSchema: HoodieSchema,
    repetition: Repetition
): org.apache.parquet.schema.Type = {
Nit: can we import the parquet type?
Done
- Add adapter pattern for Spark 3 and 4
- Cleanup invariant issue in SparkSqlWriter
- Add cross engine test
- Add backward compatibility test for Spark 3.x
- Add cross engine read for Flink
Force-pushed 51c5108 to 8756466
vinothchandar
left a comment
Love to take a pass over this. Is this ready for review?
Yes, it's in a reviewable state; just a few minor stylistic imports to move around.
Describe the issue this Pull Request addresses
Adds read and write support for Spark Variant data types in Hudi. Since Variant is exclusive to Spark 4.0+, this PR introduces an adapter pattern to handle schema conversion differences, ensuring backward compatibility with Spark 3.x while enabling semi-structured data support in Spark 4.x.
Note: Please merge this PR first: #17751
Summary and Changelog
This PR refactors schema converters behind SparkAdapter to handle version-specific data types:
- Refactors HoodieSparkSchemaConverters and SchemaConverters into traits; logic is now delegated via SparkAdapter.
- Adds Variant support mapping Spark VariantType <-> Avro Record (logicalType: variant) <-> Parquet Struct (value/metadata binaries).
- Updates HoodieRowParquetWriteSupport to handle physical writing of Variant structs.
- Updates AvroSchemaUtils to ensure logicalType and metadata are preserved during schema pruning.
Impact
Adds support for reading and writing Variant columns in Hudi tables (COW and MOR).
Risk Level
Low. The adapter pattern isolates the new logic; only tables that use Variant data are affected.
Documentation Update
None
Contributor's checklist