
[VL] Support multiple segments per partition in columnar shuffle#11722

Open
guowangy wants to merge 20 commits into apache:main from guowangy:partition-multi-segments

Conversation


@guowangy guowangy commented Mar 9, 2026

What changes are proposed in this pull request?

Introduces multi-segment-per-partition support in the Velox backend columnar shuffle writer, enabling incremental flushing of partition data to the final data file during processing. This reduces peak memory usage without requiring full in-memory buffering or temporary spill files. The implementation reduces total TPC-H (SF6T) latency by ~16% with sort-based shuffle under low memory capacity on a 2-socket Xeon 6960P system.

New index file format (ColumnarIndexShuffleBlockResolver)

Extends IndexShuffleBlockResolver with a new index format supporting multiple (offset, length) segments per partition:

[Partition Index: (N+1) × 8-byte big-endian offsets]
[Segment Data: per-partition list of (data_offset, length) pairs, each 8 bytes]
[1-byte end marker]  ← distinguishes from legacy format (size always multiple of 8)
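The layout above can be sketched as follows. This is a minimal illustration of the format, not the resolver's actual code; serializeIndex and putBE64 are hypothetical names. The key property is that the trailing marker byte makes the total size not a multiple of 8, which is how the new format is told apart from the legacy one:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

namespace {
// Append a 64-bit value in big-endian (network) byte order.
void putBE64(std::vector<uint8_t>& buf, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8) {
    buf.push_back(static_cast<uint8_t>(v >> shift));
  }
}
}  // namespace

// segments[i] holds the (data_offset, length) pairs for partition i.
std::vector<uint8_t> serializeIndex(
    const std::vector<std::vector<std::pair<uint64_t, uint64_t>>>& segments) {
  std::vector<uint8_t> buf;
  // Partition index: (N+1) cumulative counts delimiting each
  // partition's slice of the segment-data area.
  uint64_t cum = 0;
  putBE64(buf, cum);
  for (const auto& p : segments) {
    cum += p.size();
    putBE64(buf, cum);
  }
  // Segment data: per-partition (data_offset, length) pairs, 8 bytes each.
  for (const auto& p : segments) {
    for (const auto& [off, len] : p) {
      putBE64(buf, off);
      putBE64(buf, len);
    }
  }
  buf.push_back(0xFF);  // end marker: size is now not a multiple of 8
  return buf;
}
```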

ColumnarShuffleManager now uses this resolver. Multi-segment mode activates only when external shuffle service, push-based shuffle, and dictionary encoding are all disabled (dictionary encoding requires all batches to be complete before writing).

New I/O abstractions

  • FileSegmentsInputStream — InputStream over non-contiguous (offset, size) file segments; supports zero-copy native reads via read(destAddress, maxSize)
  • FileSegmentsManagedBuffer — ManagedBuffer backed by discontiguous segments; supports nioByteBuffer(), createInputStream(), convertToNetty()
  • DiscontiguousFileRegion — Netty FileRegion mapping a logical range to multiple physical segments for zero-copy network transfer
  • LowCopyFileSegmentsJniByteInputStream — zero-copy JNI wrapper over FileSegmentsInputStream; wired into JniByteInputStreams.create()
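The idea shared by these classes, presenting scattered (offset, size) ranges of one file as a single logical stream, can be sketched as follows. The actual classes live on the JVM side; this C++ SegmentsReader is a hypothetical illustration over an in-memory buffer standing in for the data file:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Presents several non-contiguous (offset, size) ranges of a backing
// buffer as one sequential stream, crossing segment boundaries on read.
class SegmentsReader {
 public:
  SegmentsReader(const uint8_t* file,
                 std::vector<std::pair<size_t, size_t>> segments)
      : file_(file), segments_(std::move(segments)) {}

  // Read up to maxSize bytes into dest; returns the number of bytes read.
  size_t read(uint8_t* dest, size_t maxSize) {
    size_t total = 0;
    while (total < maxSize && seg_ < segments_.size()) {
      auto [off, size] = segments_[seg_];
      size_t n = std::min(size - pos_, maxSize - total);
      std::memcpy(dest + total, file_ + off + pos_, n);
      total += n;
      pos_ += n;
      if (pos_ == size) {  // segment exhausted: advance to the next one
        ++seg_;
        pos_ = 0;
      }
    }
    return total;
  }

 private:
  const uint8_t* file_;                            // backing "data file"
  std::vector<std::pair<size_t, size_t>> segments_;  // (offset, size) list
  size_t seg_ = 0;  // current segment index
  size_t pos_ = 0;  // position within the current segment
};
```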

C++ LocalPartitionWriter changes

  • usePartitionMultipleSegments_ flag + partitionSegments_ vector tracking (start, length) per partition
  • flushCachedPayloads() — incremental flush after each hashEvict
  • writeMemoryPayload() — direct write to final data file during sortEvict
  • writeIndexFile() — serializes the new index at stop time
  • PayloadCache::writeIncremental() — flushes completed (non-active) partitions without touching the in-use partition
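A minimal sketch of the (start, length) bookkeeping described above, with file I/O simulated by a string. SegmentTracker and flushPayload are hypothetical names; only the partitionSegments_ member follows the PR. Each flush appends a payload to the end of the data file and records a new segment for that partition, so one partition accumulates several segments:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct SegmentTracker {
  std::string dataFile;  // stands in for the final shuffle data file
  // partitionSegments_[p] holds the (start, length) segments of partition p.
  std::vector<std::vector<std::pair<uint64_t, uint64_t>>> partitionSegments_;

  explicit SegmentTracker(size_t numPartitions)
      : partitionSegments_(numPartitions) {}

  void flushPayload(size_t partitionId, const std::string& payload) {
    uint64_t start = dataFile.size();
    dataFile += payload;  // incremental append to the final data file
    partitionSegments_[partitionId].emplace_back(start, payload.size());
  }
};
```

At stop time, writeIndexFile() would serialize partitionSegments_ into the index format above instead of a single contiguous (offset, length) per partition.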

JNI/JVM wiring

LocalPartitionWriterJniWrapper and JniWrapper.cc accept a new optional indexFile parameter; ColumnarShuffleWriter passes the temp index file path when multi-segment mode is active.

How was this patch tested?

New unit test suites:

  • ColumnarIndexShuffleBlockResolverSuite — index format read/write, format detection, multi-segment block lookup
  • FileSegmentsInputStreamSuite — sequential reads, multi-segment traversal, skip, zero-copy native reads
  • FileSegmentsManagedBufferSuite — nioByteBuffer, createInputStream, convertToNetty, EOF and mmap edge cases
  • DiscontiguousFileRegionSuite — Netty transfer across discontiguous segments, lazy open
  • LowCopyFileSegmentsJniByteInputStreamTest — JNI wrapper correctness for ByteInputStream

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Mar 9, 2026

github-actions bot commented Mar 9, 2026

Run Gluten Clickhouse CI on x86

@zhouyuan zhouyuan requested a review from marin-ma March 10, 2026 15:43

@marin-ma marin-ma left a comment


@guowangy Thanks for contributing this feature. Please check my comments below.

#endif
}

arrow::Status LocalPartitionWriter::writeIndexFile() {

Can you add some c++ unit tests for the multi-segment partition write?

}

// Helper for big-endian conversion (network order)
#include <arpa/inet.h>

Please move the header to the top, and move htoll into anonymous namespace after the headers.

// For dictionary encoding, the dict only finalizes after all batches are processed,
// and the dict is required to be saved at the head of the partition data.
// So we cannot use multiple segments to save partition data incrementally.
partitionUseMultipleSegments = true

Please add a configuration to enable this feature.


if (usePartitionMultipleSegments_) {
// If multiple segments per partition is enabled, write directly to the final data file.
RETURN_NOT_OK(writeMemoryPayload(partitionId, std::move(inMemoryPayload)));

Can you explain a bit more on how this reduces memory usage? It looks like the memory still only gets reclaimed by OOM and spilling.

RETURN_NOT_OK(payloadCache_->cache(partitionId, std::move(payload)));
}
if (usePartitionMultipleSegments_) {
RETURN_NOT_OK(flushCachedPayloads());

The hashEvict is not only called for spilling. When the evictType is kCache, it tries to cache as much payload in memory as possible to reduce spilling.

And when the evictType is kSpill, the data is written to a spilled data file. Both evict types can occur in the same job. Is evictType == kSpill properly handled for multi-segment writes?

@marin-ma

The implementation reduces total TPC-H (SF6T) latency by ~16% with sort-based shuffle under low memory capacity on a 2-socket Xeon 6960P system.

Can you explain where this improvement mainly comes from?

Currently we follow the same file layout as vanilla Spark, with each partition's output contiguous. One major benefit of that design is reducing small random disk IO on the shuffle reader side. If memory is tight, spills are triggered more frequently, making small per-partition output blocks more likely. In that case this design will not be IO friendly.
