Labels
area: memory 🧠 · priority: low 🟢 · type: performance ⚡
Description
Implement NUMA (Non-Uniform Memory Access) aware memory allocation and access patterns for HAVING operations on large datasets to optimize memory bandwidth and reduce cross-socket memory access latency.
Background
On multi-socket systems, memory access latency varies depending on which CPU accesses which memory bank. NUMA-aware allocation can significantly improve performance for memory-intensive operations like large aggregations in HAVING clauses.
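The core idea can be illustrated with a minimal sketch: if group keys are hash-partitioned into per-node shards, a worker pinned to a given NUMA node only ever touches its own shard, so most accesses stay node-local. The `partitionByKey` helper below is hypothetical, not part of the current codebase.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionByKey assigns each group key to one of nParts shards.
// The shard index is stable for a given key (FNV-1a hash mod nParts),
// so repeated keys always land on the same shard (and thus the same node).
func partitionByKey(keys []string, nParts int) [][]string {
	shards := make([][]string, nParts)
	for _, k := range keys {
		h := fnv.New32a()
		h.Write([]byte(k))
		idx := int(h.Sum32()) % nParts
		shards[idx] = append(shards[idx], k)
	}
	return shards
}

func main() {
	shards := partitionByKey([]string{"us", "eu", "apac", "us", "eu"}, 2)
	for i, s := range shards {
		fmt.Printf("shard %d: %v\n", i, s)
	}
}
```

Because the assignment is deterministic, all rows for one group aggregate on one node, which is exactly the access pattern the placement strategy below tries to preserve.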
Proposed Implementation
1. NUMA-Aware Memory Allocator
```go
type NUMAAllocator struct {
    nodeAllocators map[int]memory.Allocator
    nodeAffinity   map[string]int // Column -> NUMA node
}

// Allocate memory on specific NUMA node
func (n *NUMAAllocator) AllocateOnNode(size int, node int) []byte

// Pin goroutine to NUMA node
func (n *NUMAAllocator) SetThreadAffinity(node int) error
```
2. NUMA-Aware Parallel Execution
```go
// Distribute work based on NUMA topology
type NUMAWorkerPool struct {
    nodesCount     int
    workersPerNode int
    nodeQueues     []chan Task
}

// Schedule work to minimize cross-node access
func (p *NUMAWorkerPool) ScheduleHavingTask(task HavingTask) {
    node := p.determineOptimalNode(task.Data)
    p.nodeQueues[node] <- task
}
```
3. Data Placement Strategy
```go
// Optimize data placement for HAVING operations
type NUMADataPlacement interface {
    // Place frequently accessed columns on same node
    PlaceGroupByColumns(cols []string) error
    // Co-locate aggregation buffers with data
    AllocateAggregationBuffers(groups int) error
    // Monitor and rebalance if needed
    MonitorAccessPatterns() AccessStats
}
```
4. Configuration
```yaml
numa:
  enabled: true
  auto_detect: true
  nodes: 2

  # Memory allocation strategy
  allocation:
    interleave: false   # Use local allocation
    preferred_node: -1  # Auto-select

  # Worker distribution
  workers:
    per_node: 4
    pin_threads: true

  # Monitoring
  monitoring:
    track_migrations: true
    rebalance_threshold: 0.2
```
Implementation Tasks
- Detect NUMA topology at runtime
- Implement NUMA-aware memory allocator
- Create NUMA-aware worker pool for parallel execution
- Add data placement optimization strategies
- Implement thread affinity management
- Add NUMA performance metrics and monitoring
- Create benchmarks on NUMA systems
- Document NUMA configuration options
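For the topology-detection task, Linux exposes the online node set in `/sys/devices/system/node/online` as a range list such as `0-1` or `0,2-3`. A sketch of parsing that format (the `parseNodeList` helper is hypothetical; a real implementation would read the sysfs file and fall back to a single node on other platforms):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNodeList expands the Linux sysfs node-list format (e.g. the
// contents of /sys/devices/system/node/online, such as "0-1" or
// "0,2-3") into an explicit slice of NUMA node IDs.
func parseNodeList(s string) ([]int, error) {
	var nodes []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for n := start; n <= end; n++ {
				nodes = append(nodes, n)
			}
		} else {
			n, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			nodes = append(nodes, n)
		}
	}
	return nodes, nil
}

func main() {
	nodes, _ := parseNodeList("0,2-3")
	fmt.Println(nodes) // [0 2 3]
}
```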
Performance Impact
Expected improvements on NUMA systems:
- 20-40% reduction in memory access latency
- 15-30% improvement in aggregation throughput
- Better scalability on high-core-count systems
- Reduced memory bandwidth contention
Compatibility
- Graceful fallback on non-NUMA systems
- Optional feature with runtime detection
- No performance regression on single-socket systems
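The fallback requirement could be met with a constructor that only returns the NUMA-aware allocator when more than one node was detected. A minimal sketch, assuming a hypothetical `Allocator` interface (the real one would live in the memory package):

```go
package main

import "fmt"

// Allocator is a minimal stand-in for the engine's allocator interface.
type Allocator interface {
	Allocate(size int) []byte
}

type standardAllocator struct{}

func (standardAllocator) Allocate(size int) []byte { return make([]byte, size) }

type numaAllocator struct{ nodes int }

func (a numaAllocator) Allocate(size int) []byte {
	// A real implementation would bind the pages to the caller's node.
	return make([]byte, size)
}

// NewAllocator returns a NUMA-aware allocator only when more than one
// node was detected; single-socket and non-NUMA systems fall back to
// the standard allocator, so those code paths are unchanged.
func NewAllocator(detectedNodes int) Allocator {
	if detectedNodes > 1 {
		return numaAllocator{nodes: detectedNodes}
	}
	return standardAllocator{}
}

func main() {
	fmt.Printf("%T\n", NewAllocator(1)) // main.standardAllocator
	fmt.Printf("%T\n", NewAllocator(2)) // main.numaAllocator
}
```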
Priority
Low - Advanced optimization for specific hardware configurations.
Related
- Extends parallel processing from [High] Implement Parallel Sorting #18
- Part of performance optimization efforts
- Complements memory management improvements from [Medium] Implement Memory Management #7