Implement NUMA-aware memory access for large datasets #139

@paveg

Description

Implement NUMA (Non-Uniform Memory Access)-aware memory allocation and access patterns for HAVING operations on large datasets, to optimize memory bandwidth and reduce cross-socket memory access latency.

Background

On multi-socket systems, memory access latency varies depending on which CPU accesses which memory bank. NUMA-aware allocation can significantly improve performance for memory-intensive operations like large aggregations in HAVING clauses.
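As a starting point, topology detection can be done without any native dependency by reading sysfs. The sketch below (function name `countNUMANodes` is hypothetical, not part of the codebase) counts `/sys/devices/system/node/nodeN` entries on Linux and falls back to a single node elsewhere, which matches the graceful-degradation requirement later in this issue:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// countNUMANodes detects the NUMA topology at runtime on Linux, where each
// node appears as /sys/devices/system/node/nodeN. On non-Linux systems or
// when sysfs is unavailable it reports a single node, so callers can treat
// the machine as non-NUMA without a special code path.
func countNUMANodes() int {
	entries, err := os.ReadDir("/sys/devices/system/node")
	if err != nil {
		return 1 // non-Linux or non-NUMA: behave as a single node
	}
	nodes := 0
	for _, e := range entries {
		name := e.Name()
		// Match "node0", "node1", ... but not "has_cpu", "online", etc.
		if strings.HasPrefix(name, "node") && len(name) > len("node") {
			nodes++
		}
	}
	if nodes == 0 {
		return 1
	}
	return nodes
}

func main() {
	fmt.Println(countNUMANodes())
}
```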

Proposed Implementation

1. NUMA-Aware Memory Allocator

type NUMAAllocator struct {
    nodeAllocators map[int]memory.Allocator
    nodeAffinity   map[string]int  // Column -> NUMA node
}

// Allocate memory on specific NUMA node
func (n *NUMAAllocator) AllocateOnNode(size int, node int) []byte

// Pin goroutine to NUMA node
func (n *NUMAAllocator) SetThreadAffinity(node int) error
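Only the portable fallback path of `AllocateOnNode` is sketched below; the actual node binding (e.g. via `mbind(2)`) would be wired in behind it on Linux. The error message and validation logic are illustrative assumptions, not existing code:

```go
package main

import "fmt"

// NUMAAllocator fallback sketch: when node binding is unavailable
// (non-Linux, or NUMA disabled), AllocateOnNode degrades to an ordinary
// Go allocation. A NUMA-enabled build would mmap the region and bind its
// pages to the requested node before returning it.
type NUMAAllocator struct {
	nodes int // detected node count
}

// AllocateOnNode returns a buffer intended to live on the given node.
// The node index is validated against the detected topology; the
// allocation itself is the portable fallback.
func (n *NUMAAllocator) AllocateOnNode(size, node int) ([]byte, error) {
	if node < 0 || node >= n.nodes {
		return nil, fmt.Errorf("node %d out of range [0,%d)", node, n.nodes)
	}
	return make([]byte, size), nil
}

func main() {
	a := &NUMAAllocator{nodes: 2}
	buf, err := a.AllocateOnNode(1024, 1)
	fmt.Println(len(buf), err)
}
```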

2. NUMA-Aware Parallel Execution

// Distribute work based on NUMA topology
type NUMAWorkerPool struct {
    nodesCount int
    workersPerNode int
    nodeQueues []chan Task
}

// Schedule work to minimize cross-node access
func (p *NUMAWorkerPool) ScheduleHavingTask(task HavingTask) {
    node := p.determineOptimalNode(task.Data)
    p.nodeQueues[node] <- task
}

3. Data Placement Strategy

// Optimize data placement for HAVING operations
type NUMADataPlacement interface {
    // Place frequently accessed columns on the same node
    PlaceGroupByColumns(cols []string) error

    // Co-locate aggregation buffers with data
    AllocateAggregationBuffers(groups int) error

    // Monitor and rebalance if needed
    MonitorAccessPatterns() AccessStats
}
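A toy sketch of the placement bookkeeping (the `placement` type and its behavior are assumptions for illustration): all columns of one GROUP BY clause are pinned to a single node, round-robin across clauses, so a NUMA-aware allocator can later co-locate their aggregation buffers:

```go
package main

import "fmt"

// placement records which node each column is pinned to, so columns that
// are scanned together during a HAVING evaluation are never split across
// sockets.
type placement struct {
	nodes    int
	affinity map[string]int // column -> node
	next     int            // round-robin cursor over nodes
}

func newPlacement(nodes int) *placement {
	return &placement{nodes: nodes, affinity: make(map[string]int)}
}

// PlaceGroupByColumns assigns every column of one clause to the same
// node, advancing round-robin between clauses to spread load.
func (p *placement) PlaceGroupByColumns(cols []string) int {
	node := p.next % p.nodes
	p.next++
	for _, c := range cols {
		p.affinity[c] = node
	}
	return node
}

func main() {
	p := newPlacement(2)
	a := p.PlaceGroupByColumns([]string{"region", "sku"})
	b := p.PlaceGroupByColumns([]string{"user_id"})
	fmt.Println(a, b, p.affinity["region"] == p.affinity["sku"]) // prints 0 1 true
}
```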

4. Configuration

numa:
  enabled: true
  auto_detect: true
  nodes: 2
  
  # Memory allocation strategy
  allocation:
    interleave: false  # Use local allocation
    preferred_node: -1  # Auto-select
    
  # Worker distribution
  workers:
    per_node: 4
    pin_threads: true
    
  # Monitoring
  monitoring:
    track_migrations: true
    rebalance_threshold: 0.2
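One way to surface this in code is a Go struct mirroring the YAML above (field names are hypothetical, and parsing is omitted); the defaults below just restate the sample values:

```go
package main

import "fmt"

// NUMAConfig mirrors the proposed YAML block; zero values are overridden
// by defaultNUMAConfig to match the sample configuration.
type NUMAConfig struct {
	Enabled            bool
	AutoDetect         bool
	Nodes              int
	Interleave         bool    // false = local allocation
	PreferredNode      int     // -1 = auto-select
	WorkersPerNode     int
	PinThreads         bool
	TrackMigrations    bool
	RebalanceThreshold float64
}

// defaultNUMAConfig returns the defaults shown in the YAML sample.
func defaultNUMAConfig() NUMAConfig {
	return NUMAConfig{
		Enabled:            true,
		AutoDetect:         true,
		Nodes:              2,
		Interleave:         false,
		PreferredNode:      -1,
		WorkersPerNode:     4,
		PinThreads:         true,
		TrackMigrations:    true,
		RebalanceThreshold: 0.2,
	}
}

func main() {
	c := defaultNUMAConfig()
	fmt.Println(c.Enabled, c.PreferredNode, c.WorkersPerNode, c.RebalanceThreshold)
}
```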

Implementation Tasks

  • Detect NUMA topology at runtime
  • Implement NUMA-aware memory allocator
  • Create NUMA-aware worker pool for parallel execution
  • Add data placement optimization strategies
  • Implement thread affinity management
  • Add NUMA performance metrics and monitoring
  • Create benchmarks on NUMA systems
  • Document NUMA configuration options

Performance Impact

Expected improvements on NUMA systems:

  • 20-40% reduction in memory access latency
  • 15-30% improvement in aggregation throughput
  • Better scalability on high-core-count systems
  • Reduced memory bandwidth contention

Compatibility

  • Graceful fallback on non-NUMA systems
  • Optional feature with runtime detection
  • No performance regression on single-socket systems

Priority

Low - Advanced optimization for specific hardware configurations.

Metadata

Labels: area: memory 🧠 Memory management and allocation · priority: low 🟢 Low priority / nice-to-have feature · type: performance ⚡ Performance improvements and optimizations