Daily Perf Improver: Research and Plan #45

Daily Performance Improvement Research & Plan

Project Overview

Furnace is an F# tensor library with support for differentiable programming, designed for machine learning, probabilistic programming, and optimization. It provides:

  • Nested and mixed-mode differentiation
  • PyTorch familiar naming and idioms
  • Multiple backends: Reference (CPU-only F#), Torch (TorchSharp/LibTorch with CUDA support)
  • Common optimizers, model elements, differentiable probability distributions

Current Performance Testing Infrastructure

Benchmarking Framework

  • BenchmarkDotNet is used for micro-benchmarking
  • Benchmarks are in tests/Furnace.Benchmarks/
  • Python comparison benchmarks in tests/Furnace.Benchmarks.Python/
  • Command to run benchmarks: dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"
  • Current benchmark matrix tests:
    • Tensor sizes: 16, 2048, 65536 elements
    • Data types: int32, float32, float64
    • Devices: cpu, cuda
    • Operations: creation (zeros, ones, rand), basic math (addition, scalar ops), matrix ops (matmul)
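
A minimal BenchmarkDotNet sketch of how such a parameter matrix can be expressed in F# is shown below; the type names, the Tensor type, and the FurnaceImage.rand entry point are assumptions made for illustration and do not reflect the actual BasicTensorOps suite.

module PerfSketch

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
open Furnace   // assumed namespace exposing Tensor and FurnaceImage

[<MemoryDiagnoser>]
type TensorAddBenchmark() =
    let mutable a = Unchecked.defaultof<Tensor>
    let mutable b = Unchecked.defaultof<Tensor>

    // Mirrors the size axis of the existing matrix: 16 / 2048 / 65536 elements.
    [<Params(16, 2048, 65536)>]
    member val Size = 16 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        a <- FurnaceImage.rand([this.Size])   // assumed tensor-creation entry point
        b <- FurnaceImage.rand([this.Size])

    [<Benchmark>]
    member this.Add() = a + b

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<TensorAddBenchmark>() |> ignore
    0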

Current Performance Numbers

From existing benchmark results, the Reference backend significantly outperforms both alternatives on small-tensor workloads:

  • Reference backend is ~100x faster than PyTorch for small tensor operations
  • TorchSharp/Torch backend is ~2-3x faster than PyTorch but ~50x slower than Reference
  • Example (16-element float32 tensors on CPU):
    • PyTorch addition: ~759ms
    • TorchSharp addition: ~523ms
    • Reference addition: ~15ms

Performance Bottlenecks Analysis

1. Backend Layer Inefficiencies

  • TorchSharp Interop Overhead: Heavy cost converting between F# types and TorchSharp C# types
  • Type Conversion: Multiple conversions between Dtype/torch.ScalarType, Device/torch.Device
  • Handle Management: Each tensor operation creates new handles with disposal overhead

2. Tensor Creation Performance

  • Reference backend creates tensors ~100x faster than Torch backend
  • Significant overhead in fromCpuData operations through TorchSharp

3. Memory Management

  • No explicit tensor disposal (issue #19: Explicit disposal discussion) - currently relying on GC; a possible scoped-disposal pattern is sketched after this list
  • TorchSharp tensors accumulate in memory until GC
  • Potential for memory pressure during intensive computation
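
One way explicit disposal (issue #19) could be approached is a scope that tracks intermediate resources and releases them deterministically on exit. The sketch below is a generic illustration assuming backend tensor handles implement IDisposable; it is not an API proposed in the issue.

open System
open System.Collections.Generic

// Illustrative scope that collects IDisposable resources (e.g. backend tensor
// handles) and releases them deterministically instead of waiting for the GC.
type DisposeScope() =
    let tracked = List<IDisposable>()
    // Register a resource with the scope and return it unchanged.
    member _.Track (x: #IDisposable) = tracked.Add(x :> IDisposable); x
    interface IDisposable with
        member _.Dispose() =
            // Dispose in reverse order of creation.
            for i in tracked.Count - 1 .. -1 .. 0 do tracked.[i].Dispose()
            tracked.Clear()

// Usage sketch: intermediates tracked inside the 'use' block are freed when
// the scope ends rather than at some later GC cycle.
// use scope = new DisposeScope()
// let tmp = scope.Track(backendTensorHandle)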

4. Algorithm-Level Optimizations

  • Marsaglia Gaussian generator inefficiency (issue #23: Improve the Marsaglia Gaussian generator) - the second generated sample is discarded (see the sketch after this list)
  • Missing vectorization opportunities in Reference backend
  • No SIMD optimizations for CPU operations
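
The Marsaglia polar method produces two independent normal samples per accepted pair, so the fix tracked in issue #23 amounts to caching the spare value instead of discarding it. A self-contained sketch of that pattern (not the actual Furnace RNG code):

open System

// Illustrative Marsaglia polar sampler that caches the second sample
// instead of discarding it, roughly doubling throughput for normal draws.
type CachedGaussian(rng: Random) =
    let mutable spare = nan
    let mutable hasSpare = false
    member _.Next() =
        if hasSpare then
            hasSpare <- false
            spare
        else
            // Rejection-sample a point inside the unit circle.
            let mutable u = 0.0
            let mutable v = 0.0
            let mutable s = 0.0
            let mutable ok = false
            while not ok do
                u <- rng.NextDouble() * 2.0 - 1.0
                v <- rng.NextDouble() * 2.0 - 1.0
                s <- u * u + v * v
                ok <- s > 0.0 && s < 1.0
            let factor = sqrt (-2.0 * log s / s)
            spare <- v * factor        // keep the second sample for the next call
            hasSpare <- true
            u * factor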

5. Differentiation Overhead

  • Multiple tensor wrapper types (TensorC/TensorF/TensorR) add call overhead
  • Deep nesting in primalDeep operations with recursive calls
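
A deliberately simplified illustration of the wrapper-type recursion follows; the real Furnace types carry considerably more state, and the names below are a miniature for illustration, not the library's definitions.

// Illustrative-only miniature of the nested wrapper types: a constant tensor,
// a forward-mode wrapper (primal + tangent) and a reverse-mode wrapper.
type MiniTensor =
    | C of float[]
    | F of primal: MiniTensor * tangent: MiniTensor
    | R of primal: MiniTensor * adjointId: int

// Reaching the innermost primal requires a recursive descent through every
// nesting level, and each arithmetic op repeats pattern matches like this.
let rec primalDeep t =
    match t with
    | C _ -> t
    | F (p, _) -> primalDeep p
    | R (p, _) -> primalDeep p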

Typical Workloads & Bottlenecks

Machine Learning Workloads

  • Neural network training: Forward/backward passes with large matrix operations
  • Optimization algorithms (Adam, SGD): Many small tensor operations per step
  • Model inference: Mostly matrix multiplications and activations

Performance Characteristics

  • I/O bound: Data loading, model serialization
  • CPU bound: Reference backend operations, automatic differentiation
  • Memory bound: Large tensor operations, gradient accumulation
  • GPU bound: CUDA operations through TorchSharp when available

Performance Goals by Round

Round 1: Low-Hanging Fruit (Target: 20-50% improvement)

  1. Fix Marsaglia Gaussian generator - cache second sample (2x improvement for random normal)
  2. Optimize tensor creation paths - reduce type conversions in TorchSharp backend
  3. Add tensor operation fusion - combine multiple operations to reduce intermediate allocations (see the fusion sketch after this list)
  4. Improve scalar operations - optimize tensor-scalar arithmetic patterns
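
As an illustration of what operation fusion buys at the storage level, the sketch below contrasts a two-pass a*x + b with a single fused loop that avoids the intermediate allocation. Plain float32 arrays are used for clarity; this is not Furnace's tensor implementation.

// Unfused: computing a*x + b in two passes materialises an intermediate array.
let axpyUnfused (a: float32) (x: float32[]) (b: float32[]) =
    let tmp = Array.map (fun v -> a * v) x     // intermediate allocation
    Array.map2 (+) tmp b

// Fused: one pass, one output allocation, no intermediate.
let axpyFused (a: float32) (x: float32[]) (b: float32[]) =
    Array.init x.Length (fun i -> a * x.[i] + b.[i])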

Round 2: Backend Optimizations (Target: 2-5x improvement)

  1. SIMD vectorization for Reference backend hot paths (see the vectorization sketch after this list)
  2. Memory pooling for intermediate tensor allocations
  3. Lazy evaluation for tensor operation chains
  4. In-place operation support for appropriate cases
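
Element-wise kernels over contiguous float32 storage in the Reference backend are natural SIMD candidates. A sketch using System.Numerics.Vector, assuming the backend exposes raw float32[] buffers:

open System.Numerics

// Vectorised element-wise addition over raw float32 buffers.
// Processes Vector<float32>.Count lanes per iteration, then a scalar tail.
let addSimd (x: float32[]) (y: float32[]) (result: float32[]) =
    let width = Vector<float32>.Count
    let n = x.Length
    let mutable i = 0
    while i <= n - width do
        let v = Vector<float32>(x, i) + Vector<float32>(y, i)
        v.CopyTo(result, i)
        i <- i + width
    // Scalar tail for the remaining elements.
    while i < n do
        result.[i] <- x.[i] + y.[i]
        i <- i + 1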

Round 3: Architecture Changes (Target: 5-10x improvement)

  1. Native backend with F# P/Invoke to optimized BLAS/LAPACK (see the P/Invoke sketch after this list)
  2. Tensor operation batching for better GPU utilization
  3. JIT compilation for tensor operation graphs
  4. Custom automatic differentiation with specialized reverse-mode implementation
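
A sketch of what a P/Invoke BLAS binding could look like; the "openblas" library name and the choice of cblas_sgemm are assumptions about whichever BLAS a native backend would ship with, not an existing Furnace binding.

open System.Runtime.InteropServices

module NativeBlas =
    // CBLAS constants: row-major layout, no transposition.
    let CblasRowMajor = 101
    let CblasNoTrans = 111

    // Library name is an assumption; a real backend would resolve the
    // platform-specific BLAS at load time.
    [<DllImport("openblas", CallingConvention = CallingConvention.Cdecl)>]
    extern void cblas_sgemm(int layout, int transA, int transB, int m, int n, int k, float32 alpha, float32[] A, int lda, float32[] B, int ldb, float32 beta, float32[] C, int ldc)

    // C <- A(m x k) * B(k x n), single precision, row-major.
    let matmul (A: float32[]) (B: float32[]) (C: float32[]) m n k =
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0f, A, k, B, n, 0.0f, C, n)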

Build & Testing Commands

Standard Build

dotnet restore
dotnet build --configuration Release --no-restore --verbosity normal
dotnet test --configuration Release --no-build

Benchmark Commands

# Run F# benchmarks
dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"

# Run Python benchmarks (updates source files)
dotnet run --project tests\Furnace.Benchmarks.Python\Furnace.Benchmarks.Python.fsproj -c Release --filter "*"

GPU Testing

dotnet test /p:Furnace_TESTGPU=true

Profiling & Measurement Setup

Micro-benchmarking

  • Use existing BenchmarkDotNet infrastructure
  • Target operations in BasicTensorOps benchmark suite
  • Focus on operations with high iteration counts (workloadSize / tensorSize factor)

Profiling Tools

  • .NET profilers: PerfView, JetBrains dotMemory, Visual Studio Diagnostics
  • Native profilers: Intel VTune (for TorchSharp), perf on Linux
  • Memory profilers: Application Verifier, Debug heap

Performance Measurement Strategy

  • Compare Reference vs TorchSharp backends
  • Measure against PyTorch baselines
  • Test across tensor size matrix: small (16), medium (2048), large (65536)
  • Profile both CPU and GPU (CUDA) execution paths

Maintainer Priorities

Based on existing issues and project direction:

  1. Correctness over performance - ensure mathematical correctness
  2. API stability - minimize breaking changes to public interfaces
  3. Cross-platform support - maintain Linux/macOS/Windows compatibility
  4. Educational use - support learning ML concepts with F#
  5. PyTorch compatibility - maintain familiar API patterns

Concrete Next Steps

Environment Setup for Performance Work

  1. Build project: dotnet build -c Release
  2. Install benchmark dependencies: dotnet tool restore
  3. Run baseline benchmarks to establish current performance
  4. Set up profiling tools (PerfView/dotMemory)
  5. Create performance regression test suite
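
For step 5, one lightweight starting point is a timing guard that compares a measured operation against a stored baseline with a generous tolerance; the baseline values and allowed factor would be project decisions, and the helper below is only a sketch.

open System.Diagnostics

// Runs 'op' repeatedly and fails if the median time exceeds the stored
// baseline by more than the allowed factor. Baselines and thresholds are
// placeholders; a real suite would load them from checked-in data.
let assertNotSlower (name: string) (baselineMs: float) (allowedFactor: float) (iterations: int) (op: unit -> unit) =
    op ()   // warm-up
    let times =
        [| for _ in 1 .. iterations do
               let sw = Stopwatch.StartNew()
               op ()
               yield sw.Elapsed.TotalMilliseconds |]
    let median = (Array.sort times).[times.Length / 2]
    if median > baselineMs * allowedFactor then
        failwithf "%s regressed: median %.2f ms vs baseline %.2f ms" name median baselineMs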

Development Workflow

  1. Identify bottleneck via profiling/benchmarks
  2. Implement optimization in isolated branch
  3. Measure improvement with micro-benchmarks
  4. Run regression tests to ensure correctness
  5. Integrate benchmarks into CI with automated baseline comparison

This plan provides a systematic approach to performance improvement with clear measurement criteria and realistic improvement targets.

AI-generated content by Daily Perf Improver may contain mistakes.
