Daily Perf Improver: Research and Plan #61


Furnace Performance Research and Improvement Plan

Performance Testing Infrastructure ✅

Current Setup:

  • Benchmarking Framework: BenchmarkDotNet with comprehensive comparison against Python/PyTorch
  • Benchmark Command: dotnet run --project tests/Furnace.Benchmarks/Furnace.Benchmarks.fsproj -c Release --filter "*"
  • Python Benchmarks: dotnet run --project tests/Furnace.Benchmarks.Python/Furnace.Benchmarks.Python.fsproj -c Release --filter "*"
  • Tests: dotnet test --configuration Release
  • Build: dotnet build --configuration Release
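
For orientation, a benchmark in this framework has roughly the following shape. This is a minimal sketch using plain float32 arrays; the real suite under tests/Furnace.Benchmarks exercises Furnace tensors, and the class and member names here are illustrative:

```fsharp
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]                        // report allocations, not just time
type AddBenchmark() =
    let mutable a : float32[] = [||]
    let mutable b : float32[] = [||]

    [<Params(100, 10_000, 1_000_000)>]     // spans the small/large tensor regimes
    member val Size = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        a <- Array.create this.Size 1.0f
        b <- Array.create this.Size 2.0f

    [<Benchmark>]
    member _.Add() = Array.map2 (+) a b    // result is returned so it is not dead code

BenchmarkRunner.Run<AddBenchmark>() |> ignore
```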

Current Performance Characteristics 📊

Backend Performance Analysis (from existing benchmarks):

  1. Reference Backend:
    • Pure F# implementation, runs on any .NET platform
    • Low setup overhead (~1.0μs) but slower execution (~0.0025μs per operation)
    • Better performance for small tensors (<10K elements)
    • 10-50x slower than Torch for large tensor operations
  2. Torch Backend:
    • CPU: Medium setup overhead (~8.0μs), fast execution (~0.0010μs per operation)
    • GPU: High setup overhead (~75.0μs), very fast execution (~0.000056μs per operation)
    • Optimal for large tensor operations (>10K elements)
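
Taken together, these numbers imply a crossover point between the two backends. As a back-of-envelope check (treating the quoted per-operation costs as per-element costs, which is an assumption the figures above do not state explicitly):

```fsharp
// Cost model: totalTime n = setup + n * perOp (all times in microseconds).
// Solve setupA + n * perOpA = setupB + n * perOpB for n.
let crossover (setupA, perOpA) (setupB, perOpB) =
    (setupB - setupA) / (perOpA - perOpB)

let referenceCpu = (1.0, 0.0025)   // ~1.0μs setup, ~0.0025μs per element
let torchCpu     = (8.0, 0.0010)   // ~8.0μs setup, ~0.0010μs per element

// Prints ~4667 elements, the same order of magnitude as the ~10K guidance above.
crossover referenceCpu torchCpu
|> printfn "Torch CPU overtakes Reference at roughly %.0f elements"
```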

Typical Workloads 🎯:

  • Machine Learning: Neural networks, optimization, gradient computation
  • Tensor Operations: Matrix multiplication, element-wise operations, convolutions
  • Scientific Computing: Linear algebra, differentiable programming
  • Data Types: Primarily float32, float64, int32 across CPU/GPU

Performance Bottlenecks Identified 🔍

1. High-Level API Overhead

  • Benchmark data shows the Tensor layer is often 2-3x slower than the RawTensor layer
  • Multiple abstraction layers sit between the user-facing API and the native operations (see the sketch below)
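
To make the layering cost concrete, here is a self-contained sketch of the effect: plain float32 arrays stand in for tensors, and each wrapper adds a virtual call plus argument validation, loosely mimicking the Tensor -> RawTensor -> native indirection. IAdd, wrap, and time are hypothetical names, not Furnace types:

```fsharp
open System.Diagnostics

type IAdd =
    abstract Add : float32[] * float32[] -> float32[]

let direct =
    { new IAdd with
        member _.Add(a, b) = Array.map2 (+) a b }

// Each wrapper adds a virtual call plus argument checking, mimicking
// one extra abstraction layer.
let wrap (inner: IAdd) =
    { new IAdd with
        member _.Add(a, b) =
            if a.Length <> b.Length then invalidArg "b" "shape mismatch"
            inner.Add(a, b) }

let layered = direct |> wrap |> wrap   // two extra abstraction layers

let time label (f: unit -> 'a) =
    let sw = Stopwatch.StartNew()
    for _ in 1 .. 1_000_000 do f () |> ignore
    printfn "%s: %d ms" label sw.ElapsedMilliseconds

let a = Array.create 64 1.0f           // a "small tensor", where overhead dominates
let b = Array.create 64 2.0f
time "direct " (fun () -> direct.Add(a, b))
time "layered" (fun () -> layered.Add(a, b))
```

At this size the per-call overhead is a visible fraction of total time, which is the regime where the 2-3x gap shows up.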

2. Small Tensor Performance

  • The Torch backend is suboptimal for tensors under ~10K elements because of its setup overhead
  • Opportunity for hybrid backend selection based on tensor size (see the sketch below)
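
A minimal sketch of what such dispatch could look like, assuming the ~10K-element crossover above; Backend and chooseBackend are illustrative names, not existing Furnace API:

```fsharp
// Hypothetical size-based backend dispatch.
type Backend =
    | Reference
    | Torch

/// Pick a backend from the element count, using the ~10K-element
/// crossover observed in the benchmarks above.
let chooseBackend (shape: int[]) =
    let numElements = Array.fold (*) 1 shape
    if numElements < 10_000 then Reference else Torch

printfn "%A" (chooseBackend [| 28; 28 |])     // 784 elements    -> Reference
printfn "%A" (chooseBackend [| 512; 512 |])   // 262144 elements -> Torch
```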

3. Known Optimization TODOs (from code analysis):

  • Torch.RawTensor.fs:1118: "TODO - this should be faster"
  • Tensor.fs:795: "TODO: The following can be slow, especially for reverse mode differentiation of the diagonal of a large tensor"
  • Several optimized routines noted as missing in the benchmarks (addWithAlpha, addInPlace); a sketch of their intended shape follows
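
As a sketch of the shape those missing routines could take (written against raw float32 buffers rather than actual RawTensor storage, which is an assumption):

```fsharp
/// r.[i] = a.[i] + alpha * b.[i] in a single pass, avoiding a temporary
/// for (alpha * b) followed by a second pass for the add.
let addWithAlpha (alpha: float32) (a: float32[]) (b: float32[]) : float32[] =
    let r = Array.zeroCreate a.Length
    for i in 0 .. a.Length - 1 do
        r.[i] <- a.[i] + alpha * b.[i]
    r

/// In-place accumulation into a, removing the result allocation entirely
/// (the pattern optimizer update loops want).
let addInPlace (a: float32[]) (b: float32[]) : unit =
    for i in 0 .. a.Length - 1 do
        a.[i] <- a.[i] + b.[i]
```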

4. Memory Access Patterns

  • Analysis shows ~3% overhead per additional field in RawTensor for small tensors (see the sketch below)
  • Potential for memory layout optimization
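
The per-field claim can be sanity-checked in isolation by timing header allocation for a lean versus a fat record; the field sets below are hypothetical, not RawTensor's actual layout:

```fsharp
type Lean =
    { Data: float32[]; Shape: int[] }
type Fat =
    { Data: float32[]; Shape: int[]; Strides: int[]
      DeviceId: int; DtypeId: int; Flags: int }

let data  = Array.zeroCreate<float32> 100   // a "small tensor" payload
let shape = [| 10; 10 |]

let mkLean () : Lean = { Data = data; Shape = shape }
let mkFat  () : Fat  =
    { Data = data; Shape = shape; Strides = [| 10; 1 |]
      DeviceId = 0; DtypeId = 0; Flags = 0 }

let time label (f: unit -> 'a) =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    for _ in 1 .. 10_000_000 do f () |> ignore
    printfn "%s: %d ms" label sw.ElapsedMilliseconds

// For small tensors the header allocation/initialization is a visible
// fraction of total cost, which is where a per-field overhead comes from.
time "2-field header" mkLean
time "6-field header" mkFat
```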

Performance Goals 🎯

Round 1 (Foundation): Low-hanging fruit, 10-30% improvements

  • Implement missing optimized tensor operations (addWithAlpha, addInPlace optimizations)
  • Fix identified performance TODOs in codebase
  • Optimize small tensor code paths in Reference backend
  • Reduce API layer overhead for common operations

Round 2 (Optimization): 20-50% improvements

  • Intelligent backend selection based on tensor size/operation
  • Memory layout optimizations for RawTensor
  • Batch operation optimizations for multiple small tensors
  • Cache optimization for frequently used tensor shapes (see the sketch after this list)
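
For the shape-cache item, a minimal sketch of the idea: derived metadata (here, row-major strides) is computed once per distinct shape and reused; stridesOf and cachedStridesOf are illustrative names:

```fsharp
open System.Collections.Concurrent

/// Row-major strides for a shape, e.g. [|2; 3; 4|] -> [|12; 4; 1|].
let stridesOf (shape: int[]) =
    let s = Array.create shape.Length 1
    for i in shape.Length - 2 .. -1 .. 0 do
        s.[i] <- s.[i + 1] * shape.[i + 1]
    s

let strideCache = ConcurrentDictionary<string, int[]>()

let cachedStridesOf (shape: int[]) =
    // int[] doesn't hash structurally, so key on a cheap string rendering.
    let key = shape |> Array.map string |> String.concat ","
    strideCache.GetOrAdd(key, fun _ -> stridesOf shape)

// A training loop touching [|64; 128|] thousands of times per epoch
// hits the cache after the first call.
cachedStridesOf [| 64; 128 |] |> printfn "%A"   // [|128; 1|]
```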

Round 3 (Advanced): 50%+ improvements

  • SIMD optimizations in the Reference backend for specific operations (see the sketch after this list)
  • Custom GPU kernels for common ML operations
  • Memory pooling for tensor allocation/deallocation
  • Lazy evaluation for tensor operation chains
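
For the SIMD item, the likely direction in the Reference backend is vectorized inner loops via System.Numerics.Vector; a minimal sketch for element-wise add:

```fsharp
open System.Numerics

/// Vectorized element-wise add with a scalar tail loop.
let addSimd (a: float32[]) (b: float32[]) (r: float32[]) =
    let width = Vector<float32>.Count        // e.g. 8 lanes with AVX2
    let mutable i = 0
    while i <= a.Length - width do
        let va = Vector<float32>(a, i)       // load `width` lanes from a
        let vb = Vector<float32>(b, i)
        (va + vb).CopyTo(r, i)               // one vector op for all lanes
        i <- i + width
    while i < a.Length do                    // scalar tail for the remainder
        r.[i] <- a.[i] + b.[i]
        i <- i + 1
```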

Environment Setup for Performance Work 🛠️

Prerequisites:

  • .NET 6.0 SDK
  • TorchSharp backend (CPU/GPU)

Build Steps (automated via .github/actions/daily-perf-improver/build-steps/action.yml):

dotnet restore
dotnet build --configuration Release --no-restore
dotnet test --configuration Release --no-build

Benchmarking Workflow:

  1. Make performance changes
  2. Run benchmarks: dotnet run --project tests/Furnace.Benchmarks/Furnace.Benchmarks.fsproj -c Release
  3. Compare with baseline results in BenchmarkDotNet.Artifacts/results/
  4. For a Python comparison: update tests/Furnace.Benchmarks.Python/ and run it with the Python benchmarks command listed above

Performance Measurement:

  • Wall-clock time measurements (with virtualization caveats)
  • Memory allocation analysis
  • Operation throughput (ops/sec)
  • Comparative analysis vs PyTorch baseline

Engineering Best Practices for Performance Work 🔧

Reliable Performance Testing:

  • Commands complete within ~1 minute, enabling rapid iteration
  • Clear visibility into code paths affecting performance
  • Repeatable benchmark results with statistical significance
  • Version-controlled baseline results for regression detection

Target Metrics:

  • Torch Backend: Reduce API overhead by 20-30%
  • Reference Backend: Improve large tensor performance by 2-3x
  • Hybrid Approach: Automatic backend selection saving 15-25% across mixed workloads
  • Memory: Reduce allocation overhead by 10-20%

Performance Bottleneck Categories 🏗️

CPU-bound: Reference backend mathematical operations, small tensor arithmetic
Memory-bound: Large tensor creation/copying, gradient computation
I/O-bound: GPU data transfer, model loading/saving
API-bound: High-level abstraction overhead, dynamic dispatch

Maintainer Considerations 👥

This plan focuses on:

  • Compatibility: No breaking API changes
  • Maintainability: Clean, well-documented optimizations
  • Testing: Comprehensive benchmarks and regression tests
  • Progressive: Incremental improvements with measured impact

AI-generated content by Daily Perf Improver may contain mistakes.
