Implement NUMA-aware memory access for large datasets #139

@paveg

Description

Implement NUMA (Non-Uniform Memory Access)-aware memory allocation and access patterns for HAVING operations on large datasets, to optimize memory bandwidth and reduce cross-socket memory access latency.

Background

On multi-socket systems, memory access latency varies depending on which CPU accesses which memory bank. NUMA-aware allocation can significantly improve performance for memory-intensive operations like large aggregations in HAVING clauses.
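As a starting point, topology detection can be done without any native dependency by reading sysfs. The sketch below (function name `countNUMANodes` is hypothetical, not part of the codebase) counts `/sys/devices/system/node/nodeN` entries on Linux and falls back to a single node elsewhere, which matches the graceful-degradation requirement later in this issue:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// countNUMANodes detects the NUMA topology at runtime on Linux, where each
// node appears as /sys/devices/system/node/nodeN. On non-Linux systems or
// when sysfs is unavailable it reports a single node, so callers can treat
// the machine as non-NUMA without a special code path.
func countNUMANodes() int {
	entries, err := os.ReadDir("/sys/devices/system/node")
	if err != nil {
		return 1 // non-Linux or non-NUMA: behave as a single node
	}
	nodes := 0
	for _, e := range entries {
		name := e.Name()
		// Match "node0", "node1", ... but not "has_cpu", "online", etc.
		if strings.HasPrefix(name, "node") && len(name) > len("node") {
			nodes++
		}
	}
	if nodes == 0 {
		return 1
	}
	return nodes
}

func main() {
	fmt.Println(countNUMANodes())
}
```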

Proposed Implementation

1. NUMA-Aware Memory Allocator

type NUMAAllocator struct {
    nodeAllocators map[int]memory.Allocator
    nodeAffinity   map[string]int  // Column -> NUMA node
}

// Allocate memory on specific NUMA node
func (n *NUMAAllocator) AllocateOnNode(size int, node int) []byte

// Pin goroutine to NUMA node
func (n *NUMAAllocator) SetThreadAffinity(node int) error
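Only the portable fallback path of `AllocateOnNode` is sketched below; the actual node binding (e.g. via `mbind(2)`) would be wired in behind it on Linux. The error message and validation logic are illustrative assumptions, not existing code:

```go
package main

import "fmt"

// NUMAAllocator fallback sketch: when node binding is unavailable
// (non-Linux, or NUMA disabled), AllocateOnNode degrades to an ordinary
// Go allocation. A NUMA-enabled build would mmap the region and bind its
// pages to the requested node before returning it.
type NUMAAllocator struct {
	nodes int // detected node count
}

// AllocateOnNode returns a buffer intended to live on the given node.
// The node index is validated against the detected topology; the
// allocation itself is the portable fallback.
func (n *NUMAAllocator) AllocateOnNode(size, node int) ([]byte, error) {
	if node < 0 || node >= n.nodes {
		return nil, fmt.Errorf("node %d out of range [0,%d)", node, n.nodes)
	}
	return make([]byte, size), nil
}

func main() {
	a := &NUMAAllocator{nodes: 2}
	buf, err := a.AllocateOnNode(1024, 1)
	fmt.Println(len(buf), err)
}
```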

2. NUMA-Aware Parallel Execution

// Distribute work based on NUMA topology
type NUMAWorkerPool struct {
    nodesCount int
    workersPerNode int
    nodeQueues []chan Task
}

// Schedule work to minimize cross-node access
func (p *NUMAWorkerPool) ScheduleHavingTask(task HavingTask) {
    node := p.determineOptimalNode(task.Data)
    p.nodeQueues[node] <- task
}

3. Data Placement Strategy

// Optimize data placement for HAVING operations
type NUMADataPlacement interface {
    // Place frequently accessed columns on the same node
    PlaceGroupByColumns(cols []string) error

    // Co-locate aggregation buffers with data
    AllocateAggregationBuffers(groups int) error

    // Monitor and rebalance if needed
    MonitorAccessPatterns() AccessStats
}
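A toy sketch of the placement bookkeeping (the `placement` type and its behavior are assumptions for illustration): all columns of one GROUP BY clause are pinned to a single node, round-robin across clauses, so a NUMA-aware allocator can later co-locate their aggregation buffers:

```go
package main

import "fmt"

// placement records which node each column is pinned to, so columns that
// are scanned together during a HAVING evaluation are never split across
// sockets.
type placement struct {
	nodes    int
	affinity map[string]int // column -> node
	next     int            // round-robin cursor over nodes
}

func newPlacement(nodes int) *placement {
	return &placement{nodes: nodes, affinity: make(map[string]int)}
}

// PlaceGroupByColumns assigns every column of one clause to the same
// node, advancing round-robin between clauses to spread load.
func (p *placement) PlaceGroupByColumns(cols []string) int {
	node := p.next % p.nodes
	p.next++
	for _, c := range cols {
		p.affinity[c] = node
	}
	return node
}

func main() {
	p := newPlacement(2)
	a := p.PlaceGroupByColumns([]string{"region", "sku"})
	b := p.PlaceGroupByColumns([]string{"user_id"})
	fmt.Println(a, b, p.affinity["region"] == p.affinity["sku"]) // prints 0 1 true
}
```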

4. Configuration

numa:
  enabled: true
  auto_detect: true
  nodes: 2
  
  # Memory allocation strategy
  allocation:
    interleave: false  # Use local allocation
    preferred_node: -1  # Auto-select
    
  # Worker distribution
  workers:
    per_node: 4
    pin_threads: true
    
  # Monitoring
  monitoring:
    track_migrations: true
    rebalance_threshold: 0.2
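One way to surface this in code is a Go struct mirroring the YAML above (field names are hypothetical, and parsing is omitted); the defaults below just restate the sample values:

```go
package main

import "fmt"

// NUMAConfig mirrors the proposed YAML block; zero values are overridden
// by defaultNUMAConfig to match the sample configuration.
type NUMAConfig struct {
	Enabled            bool
	AutoDetect         bool
	Nodes              int
	Interleave         bool    // false = local allocation
	PreferredNode      int     // -1 = auto-select
	WorkersPerNode     int
	PinThreads         bool
	TrackMigrations    bool
	RebalanceThreshold float64
}

// defaultNUMAConfig returns the defaults shown in the YAML sample.
func defaultNUMAConfig() NUMAConfig {
	return NUMAConfig{
		Enabled:            true,
		AutoDetect:         true,
		Nodes:              2,
		Interleave:         false,
		PreferredNode:      -1,
		WorkersPerNode:     4,
		PinThreads:         true,
		TrackMigrations:    true,
		RebalanceThreshold: 0.2,
	}
}

func main() {
	c := defaultNUMAConfig()
	fmt.Println(c.Enabled, c.PreferredNode, c.WorkersPerNode, c.RebalanceThreshold)
}
```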

Implementation Tasks

  • Detect NUMA topology at runtime
  • Implement NUMA-aware memory allocator
  • Create NUMA-aware worker pool for parallel execution
  • Add data placement optimization strategies
  • Implement thread affinity management
  • Add NUMA performance metrics and monitoring
  • Create benchmarks on NUMA systems
  • Document NUMA configuration options

Performance Impact

Expected improvements on NUMA systems:

  • 20-40% reduction in memory access latency
  • 15-30% improvement in aggregation throughput
  • Better scalability on high-core-count systems
  • Reduced memory bandwidth contention

Compatibility

  • Graceful fallback on non-NUMA systems
  • Optional feature with runtime detection
  • No performance regression on single-socket systems

Priority

Low - Advanced optimization for specific hardware configurations.

Metadata

Labels: area: memory 🧠 Memory management and allocation · priority: low 🟢 Low priority / nice-to-have feature · type: performance ⚡ Performance improvements and optimizations