[VL] Support multiple segments per partition in columnar shuffle#11722
guowangy wants to merge 20 commits into apache:main
Conversation
```cpp
arrow::Status LocalPartitionWriter::writeIndexFile() {
```

Can you add some C++ unit tests for the multi-segment partition write?
```cpp
// Helper for big-endian conversion (network order)
#include <arpa/inet.h>
```

Please move the header to the top of the file, and move `htoll` into an anonymous namespace after the headers.
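The suggested refactor might look like the sketch below. The body of `htoll` here is an assumption (the PR's actual implementation is not shown in this thread); only the placement of the include and the anonymous namespace is the point. Note this particular byte-swap sketch assumes a little-endian host.

```cpp
// Hypothetical file layout after the suggested refactor.
#include <arpa/inet.h>  // big-endian conversion, grouped with the other headers

#include <cstdint>

namespace {
// Convert a 64-bit host-order value to network (big-endian) order.
// Internal linkage via the anonymous namespace, as suggested above.
// Assumes a little-endian host, where htonl() performs a byte swap.
uint64_t htoll(uint64_t value) {
  const uint32_t high = htonl(static_cast<uint32_t>(value >> 32));
  const uint32_t low = htonl(static_cast<uint32_t>(value & 0xFFFFFFFFULL));
  return (static_cast<uint64_t>(low) << 32) | high;
}
} // namespace
```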
```cpp
// For Dictionary encoding, the dict only finalizes after all batches are processed,
// and the dict is required to be saved at the head of the partition data,
// so we cannot use multiple segments to save partition data incrementally.
partitionUseMultipleSegments = true
```

Please add a configuration to enable this feature.
```cpp
if (usePartitionMultipleSegments_) {
  // If multiple segments per partition is enabled, write directly to the final data file.
  RETURN_NOT_OK(writeMemoryPayload(partitionId, std::move(inMemoryPayload)));
```

Can you explain a bit more about how this reduces memory usage? It looks like the memory still only gets reclaimed by OOM and spilling.
```cpp
  RETURN_NOT_OK(payloadCache_->cache(partitionId, std::move(payload)));
}
if (usePartitionMultipleSegments_) {
  RETURN_NOT_OK(flushCachedPayloads());
```

`hashEvict` is not only called for spilling. When the evictType is `kCache`, it tries to cache as many payloads in memory as possible to reduce spilling; when the evictType is `kSpill`, the data is written to a spilled data file. Both evict types can occur in the same job. Is `evictType == kSpill` handled properly for multi-segment writes?

Also, can you explain where this improvement mainly comes from? Currently we follow the same file layout as vanilla Spark, with each partition's output contiguous. One major benefit of that design is reducing small random disk IO on the shuffle-reader side. If memory is tight, spilling is triggered more frequently, making it more likely that each partition produces small output blocks. In that case this design will not be IO friendly.
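The reviewer's IO concern can be illustrated with a toy model: each flush writes whatever a partition has buffered so far as a new segment, so the more often memory pressure forces a flush, the more (and smaller) segments a partition ends up with in the final file. The function and numbers below are illustrative assumptions, not Gluten behavior.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A segment is a contiguous byte range in the final data file.
struct Segment {
  int64_t offset;
  int64_t length;
};

// Toy model: split a partition's total bytes into segments, one segment
// per flush, where each flush can buffer at most bytesPerFlush bytes.
std::vector<Segment> segmentsAfterFlushes(int64_t partitionBytes, int64_t bytesPerFlush) {
  std::vector<Segment> segments;
  int64_t offset = 0;
  while (partitionBytes > 0) {
    const int64_t len = std::min(bytesPerFlush, partitionBytes);
    segments.push_back({offset, len});
    offset += len;
    partitionBytes -= len;
  }
  return segments;
}
```

With a 1 MiB partition and 64 KiB buffered per flush, the partition is split into 16 small segments, whereas a contiguous layout serves the same partition with one sequential read.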
### What changes are proposed in this pull request?
Introduces multi-segment-per-partition support in the Velox backend columnar shuffle writer, enabling incremental flushing of partition data to the final data file during processing. This reduces peak memory usage without requiring full in-memory buffering or temporary spill files. With sort-based shuffle under a low memory cap, the change reduces total TPC-H (SF6T) latency by ~16% on a 2-socket Xeon 6960P system.
### New index file format (`ColumnarIndexShuffleBlockResolver`)

Extends `IndexShuffleBlockResolver` with a new index format supporting multiple `(offset, length)` segments per partition. `ColumnarShuffleManager` now uses this resolver. Multi-segment mode activates only when external shuffle service, push-based shuffle, and dictionary encoding are all disabled (dictionary encoding requires all batches to be complete before writing).
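Conceptually, the new index maps each partition to a list of `(offset, length)` pairs into the data file instead of a single contiguous range. The sketch below models only that lookup semantics; the struct and function names are invented for illustration, and the PR's actual on-disk byte layout is not reproduced here.

```cpp
#include <cstdint>
#include <vector>

// One contiguous byte range of a partition within the shuffle data file.
struct Segment {
  int64_t offset;
  int64_t length;
};

// Indexed by partition id; each partition may own several segments.
using SegmentIndex = std::vector<std::vector<Segment>>;

// Total bytes a shuffle reader must fetch for one partition.
int64_t partitionBytes(const SegmentIndex& index, int32_t partitionId) {
  int64_t total = 0;
  for (const auto& seg : index[partitionId]) {
    total += seg.length;
  }
  return total;
}
```

The classic Spark index is the special case where every partition has exactly one segment and consecutive partitions are laid out back to back.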
### New I/O abstractions

- `FileSegmentsInputStream`: an `InputStream` over non-contiguous `(offset, size)` file segments; supports zero-copy native reads via `read(destAddress, maxSize)`
- `FileSegmentsManagedBuffer`: a `ManagedBuffer` backed by discontiguous segments; supports `nioByteBuffer()`, `createInputStream()`, `convertToNetty()`
- `DiscontiguousFileRegion`: a Netty `FileRegion` mapping a logical range to multiple physical segments for zero-copy network transfer
- `LowCopyFileSegmentsJniByteInputStream`: a zero-copy JNI wrapper over `FileSegmentsInputStream`; wired into `JniByteInputStreams.create()`

### C++ `LocalPartitionWriter` changes

- `usePartitionMultipleSegments_` flag plus a `partitionSegments_` vector tracking `(start, length)` per partition
- `flushCachedPayloads()`: incremental flush after each `hashEvict`
- `writeMemoryPayload()`: direct write to the final data file during `sortEvict`
- `writeIndexFile()`: serializes the new index at stop time
- `PayloadCache::writeIncremental()`: flushes completed (non-active) partitions without touching the in-use partition
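The reader-side idea behind `FileSegmentsInputStream` can be sketched in a few lines of C++: serve sequential reads from a list of `(offset, size)` segments, crossing segment boundaries transparently. All names below are illustrative (the real class is a Java `InputStream` over an actual file, not an in-memory buffer).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// A contiguous byte range within the backing "file".
struct Segment {
  size_t offset;
  size_t size;
};

// Minimal model of sequential reads across non-contiguous segments.
class SegmentsReader {
 public:
  SegmentsReader(const std::vector<char>& file, std::vector<Segment> segments)
      : file_(file), segments_(std::move(segments)) {}

  // Copy up to maxSize bytes into dest, spanning segments as needed.
  // Returns the number of bytes actually copied (0 at end of stream).
  size_t read(char* dest, size_t maxSize) {
    size_t copied = 0;
    while (copied < maxSize && segIdx_ < segments_.size()) {
      const Segment& seg = segments_[segIdx_];
      const size_t avail = seg.size - segPos_;
      const size_t n = std::min(avail, maxSize - copied);
      std::memcpy(dest + copied, file_.data() + seg.offset + segPos_, n);
      copied += n;
      segPos_ += n;
      if (segPos_ == seg.size) {  // exhausted this segment; move to the next
        ++segIdx_;
        segPos_ = 0;
      }
    }
    return copied;
  }

 private:
  const std::vector<char>& file_;
  std::vector<Segment> segments_;
  size_t segIdx_ = 0;  // current segment
  size_t segPos_ = 0;  // position within the current segment
};
```

A caller sees one logical stream: reading 8 bytes over segments `(0, 4)` and `(8, 4)` of a 12-byte file returns the first and third 4-byte chunks back to back.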
### JNI/JVM wiring

`LocalPartitionWriterJniWrapper` and `JniWrapper.cc` accept a new optional `indexFile` parameter; `ColumnarShuffleWriter` passes the temp index file path when multi-segment mode is active.

### How was this patch tested?
New unit test suites:
- `ColumnarIndexShuffleBlockResolverSuite`: index format read/write, format detection, multi-segment block lookup
- `FileSegmentsInputStreamSuite`: sequential reads, multi-segment traversal, skip, zero-copy native reads
- `FileSegmentsManagedBufferSuite`: `nioByteBuffer`, `createInputStream`, `convertToNetty`, EOF and mmap edge cases
- `DiscontiguousFileRegionSuite`: Netty transfer across discontiguous segments, lazy open
- `LowCopyFileSegmentsJniByteInputStreamTest`: JNI wrapper correctness for `ByteInputStream`

### Was this patch authored or co-authored using generative AI tooling?