
[VL] Make the filename written by Iceberg Native consistent with Java #11435

Merged: jinchengchenghh merged 2 commits into apache:main from Zouxxyy:dev/add-write-path2, Jan 30, 2026

Conversation

@Zouxxyy (Contributor) commented Jan 17, 2026

What changes are proposed in this pull request?

Follow the filename format used by Iceberg Java:

// {partitionId:05d}-{taskId}-{operationId}-{fileCount:05d}{suffix}
00000-0-00e4a92d-4b72-4950-b562-818bfe0853e2-0-00001.parquet
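For clarity, a minimal sketch of how that pattern expands. The method name `dataFileName` is illustrative, not the Gluten or Iceberg API; the sample operationId is taken from the example above:

```java
// Hypothetical sketch of the Iceberg Java data-file naming scheme that this PR
// makes the native (Velox) write path match.
public class IcebergFileNames {
    // {partitionId:05d}-{taskId}-{operationId}-{fileCount:05d}{suffix}
    static String dataFileName(int partitionId, long taskId, String operationId,
                               int fileCount, String suffix) {
        return String.format("%05d-%d-%s-%05d%s",
            partitionId, taskId, operationId, fileCount, suffix);
    }

    public static void main(String[] args) {
        System.out.println(dataFileName(
            0, 0L, "00e4a92d-4b72-4950-b562-818bfe0853e2-0", 1, ".parquet"));
        // 00000-0-00e4a92d-4b72-4950-b562-818bfe0853e2-0-00001.parquet
    }
}
```

The zero-padded partition id and file count keep names lexicographically sortable within a task's output.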

How was this patch tested?

@github-actions bot added labels: CORE (works for Gluten Core), VELOX, DATA_LAKE (Jan 17, 2026)
@github-actions

Run Gluten Clickhouse CI on x86


@Zouxxyy force-pushed the dev/add-write-path2 branch from 482382c to 4010907 on January 17, 2026 11:03
@github-actions

Run Gluten Clickhouse CI on x86


@Zouxxyy (Contributor, Author) commented Jan 19, 2026

CC @jinchengchenghh for a look, thanks

@jinchengchenghh (Contributor) left a comment


Thanks for your enhancement!

@jinchengchenghh (Contributor)

The CH CI failure is not related to your PR; let us wait for community feedback. I have raised this issue.

@jinchengchenghh (Contributor)

Run Gluten Clickhouse CI on x86

@jinchengchenghh (Contributor)

Please rebase to fix the CH CI

@Zouxxyy force-pushed the dev/add-write-path2 branch from 8109e01 to 392d44c on January 19, 2026 09:52
@github-actions

Run Gluten Clickhouse CI on x86

  val taskId = context.taskAttemptId()
  val attemptId = context.attemptNumber()
- val dataWriter = factory.createWriter().asInstanceOf[W]
+ val dataWriter = factory.createWriter(partId, taskId).asInstanceOf[W]

@Zouxxyy (Contributor, Author) replied:

I tried to obtain them here:

std::shared_ptr<IcebergWriter> VeloxRuntime::createIcebergWriter(
    RowTypePtr rowType,
    int32_t format,
    const std::string& outputDirectory,
    facebook::velox::common::CompressionKind compressionKind,
    const std::string& operationId,
    std::shared_ptr<const facebook::velox::connector::hive::iceberg::IcebergPartitionSpec> spec,
    const gluten::IcebergNestedField& protoField,
    const std::unordered_map<std::string, std::string>& sparkConfs) {
  GLUTEN_CHECK(taskInfo_.has_value(), "Task info must be set before creating IcebergWriter");
  auto veloxPool = memoryManager()->getLeafMemoryPool();
  auto connectorPool = memoryManager()->getAggregateMemoryPool();
  return std::make_shared<IcebergWriter>(
      rowType,
      format,
      outputDirectory,
      compressionKind,
      taskInfo_->partitionId,
      taskInfo_->taskId,
      operationId,
      spec,
      protoField,
      sparkConfs,
      veloxPool,
      connectorPool);
}

but got this error; perhaps the timing of setting taskInfo_ is inconsistent with when Iceberg's writer is created.

22:18:47.228 ERROR org.apache.spark.task.TaskResources: Task 0 failed by error: 
org.apache.gluten.exception.GlutenException: Task info must be set before creating IcebergWriter
	at org.apache.gluten.execution.IcebergWriteJniWrapper.init(Native Method)
	at org.apache.gluten.connector.write.IcebergDataWriteFactory.getJniWrapper(IcebergDataWriteFactory.scala:103)
	at org.apache.gluten.connector.write.IcebergDataWriteFactory.createWriter(IcebergDataWriteFactory.scala:77)
	at org.apache.spark.sql.datasources.v2.WritingColumnarBatchSparkTask.run(ColumnarWriteToDataSourceV2Exec.scala:49)
	at org.apache.spark.sql.datasources.v2.WritingColumnarBatchSparkTask.run$(ColumnarWriteToDataSourceV2Exec.scala:39)
	at org.apache.spark.sql.datasources.v2.DataWritingColumnarBatchSparkTask$.run(ColumnarWriteToDataSourceV2Exec.scala:93)

Actually, I think my implementation is fine as-is, especially the adjustment to the ColumnarBatchDataWriterFactory interface: it now fully aligns with Spark's DataWriterFactory.

Gluten's ColumnarBatchDataWriterFactory

public interface ColumnarBatchDataWriterFactory extends Serializable {
  DataWriter<ColumnarBatch> createWriter(int partitionId, long taskId);
}

Spark's DataWriterFactory

public interface DataWriterFactory extends Serializable {
  DataWriter<InternalRow> createWriter(int partitionId, long taskId);
}
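To illustrate why the aligned two-argument signature matters, here is a minimal self-contained sketch showing the ids passed at creation time flowing into the Iceberg-style file name. `MockWriter`, `WriterFactoryDemo`, and the `"op-uuid"` operation id are stand-ins, not actual Gluten or Spark types:

```java
import java.io.Serializable;

// Stand-in for Gluten's factory interface; the real one returns DataWriter<ColumnarBatch>.
interface ColumnarBatchDataWriterFactory extends Serializable {
    MockWriter createWriter(int partitionId, long taskId);
}

// Stand-in writer that derives its output file name from the ids it was created with.
class MockWriter {
    final String fileName;
    MockWriter(int partitionId, long taskId, String operationId) {
        // Same pattern as Iceberg Java:
        // {partitionId:05d}-{taskId}-{operationId}-{fileCount:05d}{suffix}
        fileName = String.format("%05d-%d-%s-%05d%s",
            partitionId, taskId, operationId, 0, ".parquet");
    }
}

public class WriterFactoryDemo {
    static final ColumnarBatchDataWriterFactory factory =
        (partitionId, taskId) -> new MockWriter(partitionId, taskId, "op-uuid");

    public static void main(String[] args) {
        // The task-side caller supplies the partition id and task attempt id,
        // mirroring the Scala change above.
        System.out.println(factory.createWriter(3, 42).fileName);
        // 00003-42-op-uuid-00000.parquet
    }
}
```

Because the writer receives partitionId and taskId directly at creation, the native side no longer has to fetch them from task-local state such as taskInfo_, which sidesteps the timing problem described above.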

@Zouxxyy (Contributor, Author) commented:

@jinchengchenghh Can you have a look again, thanks

Contributor:

Looks good!

@jinchengchenghh jinchengchenghh self-requested a review January 21, 2026 11:04
@jinchengchenghh merged commit d2c0630 into apache:main on Jan 30, 2026; 116 of 120 checks passed.