# Furnace Performance Research and Improvement Plan
## Performance Testing Infrastructure ✅

**Current Setup:**

- Benchmarking Framework: BenchmarkDotNet with comprehensive comparison against Python/PyTorch (see the sketch after this list)
- Benchmark Command: `dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"`
- Python Benchmarks: `dotnet run --project tests\Furnace.Benchmarks.Python\Furnace.Benchmarks.Python.fsproj -c Release --filter "*"`
- Tests: `dotnet test --configuration Release`
- Build: `dotnet build --configuration Release`
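For orientation, a minimal benchmark in the BenchmarkDotNet style might look like the sketch below. It measures a plain `float32` array add as a stand-in for a tensor operation; the `AddBenchmark` name and the parameter sizes are illustrative, not taken from the actual benchmark project.

```fsharp
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

// Measures a plain float32 array add as a stand-in for a tensor op;
// the real benchmark project targets Furnace tensors instead.
[<MemoryDiagnoser>]
type AddBenchmark() =
    let mutable a : float32[] = Array.empty
    let mutable b : float32[] = Array.empty

    // Small vs large sizes straddle the backend crossover discussed below.
    [<Params(1_000, 100_000)>]
    member val N = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        a <- Array.init this.N float32
        b <- Array.init this.N float32

    [<Benchmark>]
    member _.Add() = Array.map2 (+) a b

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<AddBenchmark>() |> ignore
    0
```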
## Current Performance Characteristics 📊

**Backend Performance Analysis** (from existing benchmarks):

- Reference Backend:
  - Pure F# implementation, runs on any .NET platform
  - Low setup overhead (~1.0μs) but slower execution (~0.0025μs per operation)
  - Better performance for small tensors (<10K elements)
  - 10-50x slower than Torch for large tensor operations
- Torch Backend:
  - CPU: Medium setup overhead (~8.0μs), fast execution (~0.0010μs per operation)
  - GPU: High setup overhead (~75.0μs), very fast execution (~0.000056μs per operation)
  - Optimal for large tensor operations (>10K elements; crossover estimated below)
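Taking the setup and per-operation figures above at face value, a back-of-the-envelope linear cost model (assuming total time ≈ setup + n × per-op, not a benchmark result) puts the CPU crossover in the same ballpark as the quoted ~10K-element threshold:

```fsharp
// Linear cost model in μs, using the setup/per-op figures above.
let referenceCost n = 1.0 + 0.0025 * float n   // Reference backend
let torchCpuCost  n = 8.0 + 0.0010 * float n   // Torch CPU backend

// The lines cross where 1.0 + 0.0025n = 8.0 + 0.0010n:
// n = (8.0 - 1.0) / (0.0025 - 0.0010) ≈ 4,667 elements,
// the same order of magnitude as the ~10K rule of thumb above.
let crossover = (8.0 - 1.0) / (0.0025 - 0.0010)
```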
**Typical Workloads 🎯:**
- Machine Learning: Neural networks, optimization, gradient computation
- Tensor Operations: Matrix multiplication, element-wise operations, convolutions
- Scientific Computing: Linear algebra, differentiable programming
- Data Types: Primarily float32, float64, int32 across CPU/GPU
## Performance Bottlenecks Identified 🔍

1. **High-Level API Overhead**
   - Benchmark data shows the Tensor layer is often 2-3x slower than RawTensor
   - Multiple abstraction layers sit between the user API and native operations
2. **Small Tensor Performance**
   - The Torch backend is not optimal for tensors <10K elements due to setup overhead
   - Opportunity for hybrid backend selection based on tensor size
3. **Known Optimization TODOs** (from code analysis)
   - `Torch.RawTensor.fs:1118`: "TODO - this should be faster"
   - `Tensor.fs:795`: "TODO: The following can be slow, especially for reverse mode differentiation of the diagonal of a large tensor"
   - Several missing optimized routines noted in benchmarks (`addWithAlpha`, `addInPlace`)
4. **Memory Access Patterns**
   - Analysis shows ~3% overhead per additional field in RawTensor for small tensors
   - Potential for memory layout optimization
## Performance Goals 🎯

**Round 1 (Foundation):** Low-hanging fruit, 10-30% improvements

- Implement missing optimized tensor operations (`addWithAlpha`, `addInPlace`; see the sketch after this list)
- Fix the identified performance TODOs in the codebase
- Optimize small tensor code paths in the Reference backend
- Reduce API layer overhead for common operations
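A sketch of the shape these missing routines might take (hypothetical signatures, not the actual RawTensor API):

```fsharp
// Hypothetical fused routine: one pass computing a + alpha * b, avoiding
// the scaled temporary that a naive scale-then-add would allocate.
let addWithAlpha (alpha: float32) (a: float32[]) (b: float32[]) : float32[] =
    Array.init a.Length (fun i -> a.[i] + alpha * b.[i])

// Hypothetical in-place variant: mutates dst, so no allocation at all.
let addInPlace (dst: float32[]) (src: float32[]) =
    for i in 0 .. dst.Length - 1 do
        dst.[i] <- dst.[i] + src.[i]
```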
**Round 2 (Optimization):** 20-50% improvements

- Intelligent backend selection based on tensor size/operation (sketched after this list)
- Memory layout optimizations for RawTensor
- Batch operation optimizations for multiple small tensors
- Cache optimization for frequently used tensor shapes
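A minimal sketch of what size-based backend selection could look like; `Backend`, `defaultThreshold`, and `chooseBackend` are illustrative stand-ins, not Furnace types:

```fsharp
// Illustrative policy only; Backend and the threshold are stand-ins for
// whatever Furnace's real backend abstraction provides.
type Backend = Reference | Torch

// Default follows the crossover estimate above: Torch's setup cost
// amortises somewhere below ~10K elements.
let defaultThreshold = 10_000

let chooseBackend (numElements: int) : Backend =
    if numElements < defaultThreshold then Reference else Torch

// chooseBackend 512    -> Reference
// chooseBackend 50_000 -> Torch
```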
**Round 3 (Advanced):** 50%+ improvements

- SIMD optimizations in the Reference backend for specific operations (see the sketch after this list)
- Custom GPU kernels for common ML operations
- Memory pooling for tensor allocation/deallocation
- Lazy evaluation for tensor operation chains
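For the SIMD item, a minimal sketch of a vectorised element-wise add using `System.Numerics.Vector<'T>`; this illustrates the technique, not the planned Reference-backend implementation:

```fsharp
open System.Numerics

// Vectorised element-wise add over float32 arrays: processes
// Vector<float32>.Count lanes per iteration, then a scalar tail.
let addSimd (a: float32[]) (b: float32[]) (result: float32[]) =
    let width = Vector<float32>.Count
    let mutable i = 0
    while i <= a.Length - width do
        let va = Vector<float32>(a, i)
        let vb = Vector<float32>(b, i)
        (va + vb).CopyTo(result, i)
        i <- i + width
    // Scalar tail for lengths not divisible by the vector width
    while i < a.Length do
        result.[i] <- a.[i] + b.[i]
        i <- i + 1
```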
## Environment Setup for Performance Work 🛠️

**Prerequisites:**

- .NET 6.0 SDK
- TorchSharp backend (CPU/GPU)

**Build Steps** (automated via `.github/actions/daily-perf-improver/build-steps/action.yml`):

- `dotnet restore`
- `dotnet build --configuration Release --no-restore`
- `dotnet test --configuration Release --no-build`

**Benchmarking Workflow:**

- Make performance changes
- Run benchmarks: `dotnet run --project tests/Furnace.Benchmarks/Furnace.Benchmarks.fsproj -c Release`
- Compare with baseline results in `BenchmarkDotNet.Artifacts/results/` (see the comparison sketch after this list)
- For Python comparison: update `tests/Furnace.Benchmarks.Python/` and run
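One possible helper for the comparison step: diff the Mean column of two BenchmarkDotNet CSV exports from `BenchmarkDotNet.Artifacts/results/`. The separator and column names here are assumptions about the exporter's output, so treat this as a sketch rather than a drop-in tool:

```fsharp
open System.IO

// Sketch of a baseline-vs-current comparison over BenchmarkDotNet CSV
// exports. Assumes ','-separated files with "Method" and "Mean" columns;
// adjust to the real exporter output.
let loadMeans (path: string) =
    let lines = File.ReadAllLines path
    let header = lines.[0].Split(',')
    let methodIdx = Array.findIndex ((=) "Method") header
    let meanIdx = Array.findIndex ((=) "Mean") header
    lines
    |> Array.skip 1
    |> Array.map (fun line ->
        let cols = line.Split(',')
        cols.[methodIdx], cols.[meanIdx])
    |> Map.ofArray

let report (baselinePath: string) (currentPath: string) =
    let baseline = loadMeans baselinePath
    for KeyValue (name, mean) in loadMeans currentPath do
        match Map.tryFind name baseline with
        | Some old -> printfn "%s: baseline %s -> current %s" name old mean
        | None -> printfn "%s: no baseline (current %s)" name mean
```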
**Performance Measurement:**
- Wall-clock time measurements (with virtualization caveats)
- Memory allocation analysis
- Operation throughput (ops/sec)
- Comparative analysis vs PyTorch baseline
## Engineering Best Practices for Performance Work 🔧

**Reliable Performance Testing Zone:**
- Commands complete within ~1min for rapid iteration
- Clear visibility into code paths affecting performance
- Repeatable benchmark results with statistical significance
- Version-controlled baseline results for regression detection
**Target Metrics:**
- Torch Backend: Reduce API overhead by 20-30%
- Reference Backend: Improve large tensor performance by 2-3x
- Hybrid Approach: Automatic backend selection saving 15-25% across mixed workloads
- Memory: Reduce allocation overhead by 10-20%
## Performance Bottleneck Categories 🏗️

- **CPU-bound:** Reference backend mathematical operations, small tensor arithmetic
- **Memory-bound:** Large tensor creation/copying, gradient computation
- **I/O-bound:** GPU data transfer, model loading/saving
- **API-bound:** High-level abstraction overhead, dynamic dispatch
## Maintainer Considerations 👥
This plan focuses on:
- ✅ Compatibility: No breaking API changes
- ✅ Maintainability: Clean, well-documented optimizations
- ✅ Testing: Comprehensive benchmarks and regression tests
- ✅ Progressive: Incremental improvements with measured impact
AI-generated content by Daily Perf Improver may contain mistakes.