
[VL] Make the filename written by Iceberg Native consistent with Java #11435

Merged: jinchengchenghh merged 2 commits into apache:main from Zouxxyy:dev/add-write-path2, Jan 30, 2026

Conversation

@Zouxxyy (Contributor) commented Jan 17, 2026

What changes are proposed in this pull request?

Follow the filename format used by Iceberg Java:

// {partitionId:05d}-{taskId}-{operationId}-{fileCount:05d}{suffix}
00000-0-00e4a92d-4b72-4950-b562-818bfe0853e2-0-00001.parquet
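For clarity, a minimal sketch of how that pattern expands. The method name `dataFileName` is illustrative, not the Gluten or Iceberg API; the sample operationId is taken from the example above:

```java
// Hypothetical sketch of the Iceberg Java data-file naming scheme that this PR
// makes the native (Velox) write path match.
public class IcebergFileNames {
    // {partitionId:05d}-{taskId}-{operationId}-{fileCount:05d}{suffix}
    static String dataFileName(int partitionId, long taskId, String operationId,
                               int fileCount, String suffix) {
        return String.format("%05d-%d-%s-%05d%s",
            partitionId, taskId, operationId, fileCount, suffix);
    }

    public static void main(String[] args) {
        System.out.println(dataFileName(
            0, 0L, "00e4a92d-4b72-4950-b562-818bfe0853e2-0", 1, ".parquet"));
        // 00000-0-00e4a92d-4b72-4950-b562-818bfe0853e2-0-00001.parquet
    }
}
```

The zero-padded partition id and file count keep names lexicographically sortable within a task's output.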

How was this patch tested?

@github-actions bot added labels: CORE (works for Gluten Core), VELOX, DATA_LAKE (Jan 17, 2026)
@github-actions

Run Gluten Clickhouse CI on x86


@Zouxxyy force-pushed the dev/add-write-path2 branch from 482382c to 4010907 on January 17, 2026 11:03
@github-actions

Run Gluten Clickhouse CI on x86


@Zouxxyy (Contributor, Author) commented Jan 19, 2026

CC @jinchengchenghh for a look, thanks

@jinchengchenghh (Contributor) left a comment


Thanks for your enhancement!

@jinchengchenghh (Contributor)

The CH CI failure is not related to your PR; let us wait for community feedback. I have raised this issue.

@jinchengchenghh (Contributor)

Run Gluten Clickhouse CI on x86

@jinchengchenghh (Contributor)

Please rebase to fix the CH CI

@Zouxxyy force-pushed the dev/add-write-path2 branch from 8109e01 to 392d44c on January 19, 2026 09:52
@github-actions

Run Gluten Clickhouse CI on x86

  val taskId = context.taskAttemptId()
  val attemptId = context.attemptNumber()
- val dataWriter = factory.createWriter().asInstanceOf[W]
+ val dataWriter = factory.createWriter(partId, taskId).asInstanceOf[W]

@Zouxxyy (Contributor, Author) replied:

I tried to obtain them here:

std::shared_ptr<IcebergWriter> VeloxRuntime::createIcebergWriter(
    RowTypePtr rowType,
    int32_t format,
    const std::string& outputDirectory,
    facebook::velox::common::CompressionKind compressionKind,
    const std::string& operationId,
    std::shared_ptr<const facebook::velox::connector::hive::iceberg::IcebergPartitionSpec> spec,
    const gluten::IcebergNestedField& protoField,
    const std::unordered_map<std::string, std::string>& sparkConfs) {
  GLUTEN_CHECK(taskInfo_.has_value(), "Task info must be set before creating IcebergWriter");
  auto veloxPool = memoryManager()->getLeafMemoryPool();
  auto connectorPool = memoryManager()->getAggregateMemoryPool();
  return std::make_shared<IcebergWriter>(
      rowType,
      format,
      outputDirectory,
      compressionKind,
      taskInfo_->partitionId,
      taskInfo_->taskId,
      operationId,
      spec,
      protoField,
      sparkConfs,
      veloxPool,
      connectorPool);
}

but got this error; perhaps the timing of setting taskInfo_ is inconsistent with when Iceberg's writer is created.

22:18:47.228 ERROR org.apache.spark.task.TaskResources: Task 0 failed by error: 
org.apache.gluten.exception.GlutenException: Task info must be set before creating IcebergWriter
	at org.apache.gluten.execution.IcebergWriteJniWrapper.init(Native Method)
	at org.apache.gluten.connector.write.IcebergDataWriteFactory.getJniWrapper(IcebergDataWriteFactory.scala:103)
	at org.apache.gluten.connector.write.IcebergDataWriteFactory.createWriter(IcebergDataWriteFactory.scala:77)
	at org.apache.spark.sql.datasources.v2.WritingColumnarBatchSparkTask.run(ColumnarWriteToDataSourceV2Exec.scala:49)
	at org.apache.spark.sql.datasources.v2.WritingColumnarBatchSparkTask.run$(ColumnarWriteToDataSourceV2Exec.scala:39)
	at org.apache.spark.sql.datasources.v2.DataWritingColumnarBatchSparkTask$.run(ColumnarWriteToDataSourceV2Exec.scala:93)

Actually, I think my implementation is fine as-is, especially the adjustment to the ColumnarBatchDataWriterFactory interface: it now fully aligns with Spark's DataWriterFactory.

Gluten's ColumnarBatchDataWriterFactory

public interface ColumnarBatchDataWriterFactory extends Serializable {
  DataWriter<ColumnarBatch> createWriter(int partitionId, long taskId);
}

Spark's DataWriterFactory

public interface DataWriterFactory extends Serializable {
  DataWriter<InternalRow> createWriter(int partitionId, long taskId);
}
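To illustrate why the aligned two-argument signature matters, here is a minimal self-contained sketch showing the ids passed at creation time flowing into the Iceberg-style file name. `MockWriter`, `WriterFactoryDemo`, and the `"op-uuid"` operation id are stand-ins, not actual Gluten or Spark types:

```java
import java.io.Serializable;

// Stand-in for Gluten's factory interface; the real one returns DataWriter<ColumnarBatch>.
interface ColumnarBatchDataWriterFactory extends Serializable {
    MockWriter createWriter(int partitionId, long taskId);
}

// Stand-in writer that derives its output file name from the ids it was created with.
class MockWriter {
    final String fileName;
    MockWriter(int partitionId, long taskId, String operationId) {
        // Same pattern as Iceberg Java:
        // {partitionId:05d}-{taskId}-{operationId}-{fileCount:05d}{suffix}
        fileName = String.format("%05d-%d-%s-%05d%s",
            partitionId, taskId, operationId, 0, ".parquet");
    }
}

public class WriterFactoryDemo {
    static final ColumnarBatchDataWriterFactory factory =
        (partitionId, taskId) -> new MockWriter(partitionId, taskId, "op-uuid");

    public static void main(String[] args) {
        // The task-side caller supplies the partition id and task attempt id,
        // mirroring the Scala change above.
        System.out.println(factory.createWriter(3, 42).fileName);
        // 00003-42-op-uuid-00000.parquet
    }
}
```

Because the writer receives partitionId and taskId directly at creation, the native side no longer has to fetch them from task-local state such as taskInfo_, which sidesteps the timing problem described above.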

@Zouxxyy (Contributor, Author) commented:

@jinchengchenghh Can you have a look again, thanks

Contributor:

Looks good!

@jinchengchenghh jinchengchenghh self-requested a review January 21, 2026 11:04
@jinchengchenghh merged commit d2c0630 into apache:main on Jan 30, 2026; 116 of 120 checks passed.