Conversation

@suryaprasanna
Contributor

Describe the issue this Pull Request addresses

This PR enables the use of custom PartitionValueExtractor implementations when reading Hudi tables in Spark, allowing users to define custom logic for extracting partition values from partition paths.
Previously, the PartitionValueExtractor interface was only used during write/sync operations but not during read operations.

Summary and Changelog

Users can now configure custom partition value extractors for read operations using the hoodie.datasource.read.partition.value.extractor.class option, enabling support for non-standard partition path formats.

Changes:

  • Moved PartitionValueExtractor interface from hudi-sync-common to hudi-common for broader accessibility
  • Added PARTITION_VALUE_EXTRACTOR_CLASS config to HoodieTableConfig and DataSourceOptions
  • Updated HoodieSparkUtils.createPartitionSchema() to use PartitionValueExtractor for partition value extraction
  • Updated HoodieFileIndex and related classes to support custom partition value extractors
  • Added test implementation TestCustomSlashPartitionValueExtractor demonstrating date formatting from slash-separated paths (yyyy/mm/dd → yyyy-mm-dd)
  • Updated all existing PartitionValueExtractor implementations to use the relocated interface
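
As a hedged sketch of what such a custom extractor looks like (the minimal `extractPartitionValuesInPath` shape below mirrors Hudi's `PartitionValueExtractor` but is re-declared standalone for illustration; a real implementation would implement the actual interface):

```java
import java.util.Collections;
import java.util.List;

class SlashDatePartitionValueExtractor {
    // Mirrors the shape of Hudi's PartitionValueExtractor.extractPartitionValuesInPath;
    // a real extractor would implement the actual interface instead.
    public List<String> extractPartitionValuesInPath(String partitionPath) {
        // Expect slash-separated date paths like "2024/01/03".
        String[] parts = partitionPath.split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected yyyy/mm/dd, got: " + partitionPath);
        }
        // Collapse the three path segments into a single "yyyy-mm-dd" partition value.
        return Collections.singletonList(String.join("-", parts));
    }
}
```

The reader would then treat the date partition as one logical column instead of three path segments.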

Impact

Public API Changes:

  • New config option: hoodie.datasource.read.partition.value.extractor.class for Spark reads
  • PartitionValueExtractor interface relocated from org.apache.hudi.sync.common.model to org.apache.hudi.hive.sync

User-Facing Changes:
Users can now customize partition value extraction during read operations by providing a custom PartitionValueExtractor implementation

Risk Level

Low - This is an additive change that maintains backward compatibility. When no custom extractor is specified, the default behavior remains unchanged (standard slash-based splitting). The change has been tested with custom partition value extractor implementations.

Documentation Update

Documentation should be updated to include:

  • New config option hoodie.datasource.read.partition.value.extractor.class in the configuration reference (WIP)
  • Example usage of custom PartitionValueExtractor implementations for read operations (WIP)

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jan 13, 2026
@suryaprasanna suryaprasanna force-pushed the use-partition-value-extractors-in-reader-path branch from eb515ef to d5a979e Compare February 2, 2026 02:14
@nsivabalan
Contributor

PartitionValueExtractor interface relocated from org.apache.hudi.sync.common.model to org.apache.hudi.hive.sync

Wouldn't this break existing users who might have their own implementations of PartitionValueExtractor?

We should avoid doing this.

Instead, can we introduce a new interface named SparkPartitionValueExtractor in some other Spark package? If need be, this can extend the existing PartitionValueExtractor as well, and we can use it in the Spark read code paths.
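
The suggestion above could be sketched as follows (the interface names and the method signature are illustrative assumptions; the real Hudi interface differs in package and details):

```java
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for the existing sync-side interface (assumed shape, for illustration).
interface PartitionValueExtractor {
    List<String> extractPartitionValuesInPath(String partitionPath);
}

// New Spark-side interface as proposed in the review: it lives in a Spark package
// and extends the existing contract, so nothing in sync-common has to move.
interface SparkPartitionValueExtractor extends PartitionValueExtractor {
}

// An implementation opts in to the read path by implementing the new interface;
// existing sync-only implementations keep compiling untouched.
class SlashSplittingExtractor implements SparkPartitionValueExtractor {
    @Override
    public List<String> extractPartitionValuesInPath(String partitionPath) {
        return Arrays.asList(partitionPath.split("/"));
    }
}
```

This avoids the binary-compatibility break: users with existing PartitionValueExtractor implementations are unaffected, and only read-path users adopt the new type.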

Array.fill(partitionColumns.length)(UTF8String.fromString(partitionPath))
} else if(usePartitionValueExtractorOnRead && !StringUtils.isNullOrEmpty(partitionValueExtractorClass)) {
try {
val partitionValueExtractor = Class.forName(partitionValueExtractorClass)
Contributor

Can we move this to a private method, to keep this method lean?

Maybe parsePartitionValuesBasedOnPartitionValueExtractor.

Contributor Author

Yes, done.
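
For illustration, the reflective instantiation from the diff above, factored into a helper as the review suggests (names and error handling here are assumptions, not the PR's exact code):

```java
import java.util.Arrays;
import java.util.List;

class ExtractorLoader {
    // Minimal local stand-in for Hudi's PartitionValueExtractor (assumed shape).
    interface PartitionValueExtractor {
        List<String> extractPartitionValuesInPath(String partitionPath);
    }

    // Factored-out helper: load the configured extractor class reflectively
    // and delegate partition-value parsing to it.
    static List<String> parsePartitionValuesWithExtractor(String extractorClass, String partitionPath) {
        try {
            PartitionValueExtractor extractor = (PartitionValueExtractor) Class.forName(extractorClass)
                .getDeclaredConstructor()
                .newInstance();
            return extractor.extractPartitionValuesInPath(partitionPath);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Could not instantiate extractor: " + extractorClass, e);
        }
    }

    // A concrete extractor to exercise the helper with.
    public static class SlashExtractor implements PartitionValueExtractor {
        @Override
        public List<String> extractPartitionValuesInPath(String partitionPath) {
            return Arrays.asList(partitionPath.split("/"));
        }
    }
}
```

Keeping the Class.forName plus cast in one place also gives a single spot to produce a clear error when the configured class name is wrong or does not implement the interface.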

.withDocumentation("Key Generator type to determine key generator class");

public static final ConfigProperty<String> PARTITION_VALUE_EXTRACTOR_CLASS = ConfigProperty
.key("hoodie.datasource.hive_sync.partition_extractor_class")
Contributor

None of the table properties have "hoodie.datasource." as a prefix.

We should define two configs:

hoodie.datasource.hive_sync.partition_extractor_class for the writer property, and
hoodie.table.hive_sync.partition_extractor_class for the table config.

Users should not be able to set the table property directly; they should always set the writer property only, i.e. hoodie.datasource.hive_sync.partition_extractor_class.

Contributor Author

Seems like we are already storing configs with the hoodie.datasource prefix, like the following, so maybe we need to change those as well:
hoodie.datasource.write.drop.partition.columns
hoodie.datasource.write.hive_style_partitioning

Anyway, I have created two different properties: one in HoodieTableConfig and the other in HoodieSyncConfig. I also added validation to make sure people don't give different values for these.
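
The cross-validation mentioned above might look roughly like this (method and error message are illustrative, not the PR's exact code):

```java
// Illustrative check that the writer-side and table-side extractor configs agree
// when both are present; only the writer property is meant to be user-facing.
class ExtractorConfigValidator {
    static void validateExtractorConfigs(String writerValue, String tableValue) {
        if (writerValue != null && tableValue != null && !writerValue.equals(tableValue)) {
            throw new IllegalArgumentException(
                "Conflicting partition extractor classes: writer config has '" + writerValue
                    + "' but table config has '" + tableValue + "'");
        }
    }
}
```

Failing fast here prevents a table whose stored config silently disagrees with what the writer session was told to use.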

.or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));

if (!partitionFieldsOpt.isPresent()) {
return Option.empty();
Contributor

Is this not NonPartitionedExtractor?

Contributor Author

Yes, made the change.

*/

package org.apache.hudi.sync.common.model;
package org.apache.hudi.hive.sync;
Contributor

Let's not change this.

Contributor Author

Yeah, my bad. I am using the same package now, but I need to move the interface to hudi-common, as there is a compile-time dependency now.


val USE_PARTITION_VALUE_EXTRACTOR_ON_READ: ConfigProperty[String] = ConfigProperty
.key("hoodie.datasource.read.partition.value.using.partion-value-extractor-class")
.defaultValue("true")
Contributor

Can we disable this by default?

Contributor

Since we have an infer function, this might get exercised out of the box. Let's keep the out-of-the-box behavior untouched.

Contributor Author

Sure, makes sense.


import java.util

class TestCustomSlashPartitionValueExtractor extends PartitionValueExtractor {
Contributor

same here

Contributor Author

Made the changes.

.markAdvanced()
.withDocumentation("Field in the table to use for determining hive partition columns.");

public static final ConfigProperty<String> META_SYNC_PARTITION_EXTRACTOR_CLASS = ConfigProperty
Contributor

Let's leave this as-is, without any changes.

Contributor Author

Sure, reverting this change.

Seq(7, "a7", 7000, "2024-01-03", "CAN", "ON", "TOR")
)

// Test partition pruning with combined date and state filter
Contributor

The proper way to assert here is: corrupt one of the parquet files in another partition that does not match the predicate. The query will then only succeed if partition pruning really worked; if not, it will hit a FileNotFoundException.
But we can't afford to do this for every query, since one of the data files gets corrupted. So maybe do it for one or two of the partition pruning queries you have here.

Contributor Author

Added the test case with corrupted parquet file.
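
The corruption trick described above can be sketched in plain Java (the Hudi/Spark test harness and real partition paths are omitted; this only shows the file-level corruption step, with illustrative names):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ParquetCorruptor {
    // Overwrite a data file with garbage bytes. A query that still touches this
    // partition will fail to read it, so the test only passes when pruning
    // actually skipped the partition.
    static void corruptFile(Path dataFile) throws IOException {
        byte[] garbage = new byte[]{0x00, 0x01, 0x02, 0x03};
        Files.write(dataFile, garbage); // truncates and replaces the valid content
    }

    // Self-contained demo: create a stand-in data file, corrupt it, report its size.
    static long demoCorruption() {
        try {
            Path p = Files.createTempFile("part-", ".parquet");
            Files.write(p, "valid parquet content".getBytes());
            corruptFile(p);
            return Files.size(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the real test, the corrupted file would live in a partition excluded by the query predicate, so a FileNotFoundException (or parquet read failure) signals that pruning did not happen.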

public static final ConfigProperty<String> PARTITION_VALUE_EXTRACTOR_CLASS = ConfigProperty
.key("hoodie.datasource.hive_sync.partition_extractor_class")
.defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
.withInferFunction(cfg -> {
Contributor

We should not be setting any default here, right? It's OK to have the infer function.

Contributor Author

Sure.


metadataTable.close()
}
}
Contributor

I understand hive-style partitioning and the custom partition value extractor are mutually exclusive.

But can we add a test with a custom partition value extractor and URL encoding enabled?

Contributor Author

Added
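
When URL encoding is enabled, slashes inside a partition value are percent-encoded on disk, so a custom extractor has to decode before splitting. A minimal sketch of that interaction (illustrative class, not the PR's code):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Collections;
import java.util.List;

class UrlEncodedSlashDateExtractor {
    // Decode a percent-encoded partition path, then format yyyy/mm/dd as yyyy-mm-dd,
    // mirroring the slash-date extractor used elsewhere in the PR's tests.
    static List<String> extractPartitionValuesInPath(String encodedPath) {
        try {
            String decoded = URLDecoder.decode(encodedPath, "UTF-8");
            return Collections.singletonList(String.join("-", decoded.split("/")));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

For example, the on-disk segment "2024%2F01%2F03" decodes to "2024/01/03" before being collapsed into the single partition value "2024-01-03".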

@suryaprasanna suryaprasanna force-pushed the use-partition-value-extractors-in-reader-path branch from d1d62f3 to 908fd75 Compare February 9, 2026 07:29
@suryaprasanna suryaprasanna force-pushed the use-partition-value-extractors-in-reader-path branch from febf1da to de1f911 Compare February 9, 2026 17:23
@apache apache deleted a comment from hudi-bot Feb 10, 2026
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
