Daily Performance Improvement Research & Plan
Project Overview
Furnace is an F# tensor library with support for differentiable programming, designed for machine learning, probabilistic programming, and optimization. It provides:
- Nested and mixed-mode differentiation
- PyTorch familiar naming and idioms
- Multiple backends: Reference (CPU-only F#), Torch (TorchSharp/LibTorch with CUDA support)
- Common optimizers, model elements, differentiable probability distributions
Current Performance Testing Infrastructure
Benchmarking Framework
- BenchmarkDotNet is used for micro-benchmarking
- Benchmarks are in `tests/Furnace.Benchmarks/`
- Python comparison benchmarks are in `tests/Furnace.Benchmarks.Python/`
- Command to run benchmarks: `dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"`
- Current benchmark matrix tests (see the sketch after this list):
  - Tensor sizes: 16, 2048, 65536 elements
  - Data types: int32, float32, float64
  - Devices: cpu, cuda
  - Operations: creation (zeros, ones, rand), basic math (addition, scalar ops), matrix ops (matmul)
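For orientation, here is a minimal sketch of how one cell of this matrix is typically expressed as a BenchmarkDotNet benchmark in F#. The class and member names are illustrative, and the `Array.map2` body is only a stand-in for the Furnace tensor operation that the real suite measures.

```fsharp
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]
type AdditionSketch() =
    let mutable a : float32[] = [||]
    let mutable b : float32[] = [||]

    // Mirrors the tensor-size axis of the matrix above: 16, 2048, 65536 elements.
    [<Params(16, 2048, 65536)>]
    member val N = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        a <- Array.init this.N float32
        b <- Array.init this.N float32

    // Stand-in body; the real suite calls the Furnace tensor API here instead.
    [<Benchmark>]
    member _.Addition() = Array.map2 (+) a b

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<AdditionSketch>() |> ignore
    0
```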
Current Performance Numbers
From existing benchmark results, the Reference backend significantly outperforms both PyTorch and the TorchSharp backend on these micro-benchmarks:
- Reference backend is ~100x faster than PyTorch for small tensor operations
- TorchSharp/Torch backend is ~2-3x faster than PyTorch but ~50x slower than Reference
- Example (16-element float32 tensors on CPU):
  - PyTorch addition: ~759ms
  - TorchSharp addition: ~523ms
  - Reference addition: ~15ms
Performance Bottlenecks Analysis
1. Backend Layer Inefficiencies
- TorchSharp Interop Overhead: Heavy cost converting between F# types and TorchSharp C# types
- Type Conversion: Multiple conversions between `Dtype`/`torch.ScalarType` and `Device`/`torch.Device` (see the sketch after this list)
- Handle Management: Each tensor operation creates new handles with disposal overhead
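A minimal, self-contained sketch of the "hoist the conversion out of the hot path" idea; the `Dtype` and `ScalarType` definitions below are simplified stand-ins for illustration, not the real Furnace or TorchSharp types.

```fsharp
// Simplified stand-ins, for illustration only.
type Dtype = Float32 | Float64 | Int32
type ScalarType = F32 = 6 | F64 = 7 | I32 = 3

let toScalarType dtype =
    match dtype with
    | Float32 -> ScalarType.F32
    | Float64 -> ScalarType.F64
    | Int32   -> ScalarType.I32

// Resolve the target type once when the raw-tensor wrapper is constructed,
// instead of re-converting on every operation call.
type RawTensorSketch(dtype: Dtype) =
    let scalarType = toScalarType dtype
    member _.ScalarType = scalarType
```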
2. Tensor Creation Performance
- Reference backend creates tensors ~100x faster than Torch backend
- Significant overhead in `fromCpuData` operations through TorchSharp
3. Memory Management
- No explicit tensor disposal (issue #19: Explicit disposal discussion); tensors are reclaimed only by the GC (see the sketch after this list)
- TorchSharp tensors accumulate in memory until GC
- Potential for memory pressure during intensive computation
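One way to bound this, sketched below, is TorchSharp's dispose-scope mechanism (`torch.NewDisposeScope`), assuming the TorchSharp version in use provides it: tensors created inside the scope are released when the scope is disposed rather than waiting for a GC pass. This is illustrative only, not how Furnace currently manages lifetimes.

```fsharp
open TorchSharp

// Sketch: bound native tensor lifetimes with a dispose scope.
let accumulate (steps: int) =
    use scope = torch.NewDisposeScope()
    let ones = torch.ones(1024L)
    let mutable acc = torch.zeros(1024L)
    for _ in 1 .. steps do
        // Each '+' allocates a native tensor that is tracked by the scope.
        acc <- acc + ones
    // Keep only the final result alive past the scope; intermediates are freed on dispose.
    acc.MoveToOuterDisposeScope()
```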
4. Algorithm-Level Optimizations
- Marsaglia Gaussian generator inefficiency (issue #23: Improve the Marsaglia Gaussian generator): the second generated sample is currently discarded (see the sketch after this list)
- Missing vectorization opportunities in Reference backend
- No SIMD optimizations for CPU operations
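A minimal sketch of the caching fix for the generator issue above, assuming a standard Marsaglia polar implementation; the Reference backend's actual generator and RNG plumbing will differ.

```fsharp
// Marsaglia polar method producing two standard normals per accepted point,
// caching the second sample instead of discarding it.
type GaussianSampler(rng: System.Random) =
    let mutable cached = None
    member _.Next() : float =
        match cached with
        | Some z ->
            cached <- None
            z
        | None ->
            // Rejection-sample a point in the unit disc.
            let mutable u = 0.0
            let mutable v = 0.0
            let mutable s = 0.0
            let mutable accepted = false
            while not accepted do
                u <- rng.NextDouble() * 2.0 - 1.0
                v <- rng.NextDouble() * 2.0 - 1.0
                s <- u * u + v * v
                accepted <- s > 0.0 && s < 1.0
            // Transform into two independent standard normal samples.
            let scale = sqrt (-2.0 * log s / s)
            cached <- Some (v * scale)   // keep the second sample for the next call
            u * scale
```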
5. Differentiation Overhead
- Multiple tensor wrapper types (`TensorC`/`TensorF`/`TensorR`) add call overhead
- Deep nesting in `primalDeep` operations with recursive calls
Typical Workloads & Bottlenecks
Machine Learning Workloads
- Neural network training: Forward/backward passes with large matrix operations
- Optimization algorithms (Adam, SGD): Many small tensor operations per step
- Model inference: Mostly matrix multiplications and activations
Performance Characteristics
- I/O bound: Data loading, model serialization
- CPU bound: Reference backend operations, automatic differentiation
- Memory bound: Large tensor operations, gradient accumulation
- GPU bound: CUDA operations through TorchSharp when available
Performance Goals by Round
Round 1: Low-Hanging Fruit (Target: 20-50% improvement)
- Fix Marsaglia Gaussian generator - cache second sample (2x improvement for random normal)
- Optimize tensor creation paths - reduce type conversions in TorchSharp backend
- Add tensor operation fusion - combine multiple operations to reduce intermediate allocations (see the sketch after this list)
- Improve scalar operations - optimize tensor-scalar arithmetic patterns
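A tiny illustration of the fusion idea on raw float32 arrays (standing in for the backend's storage): the fused version makes a single pass and allocates no intermediate.

```fsharp
// Unfused: scalar multiply, then add, with an intermediate array allocated in between.
let mulThenAdd (a: float32[]) (s: float32) (b: float32[]) =
    let tmp = Array.map (fun x -> x * s) a
    Array.map2 (+) tmp b

// Fused: one pass over the data, no temporary.
let fusedMulAdd (a: float32[]) (s: float32) (b: float32[]) =
    Array.init a.Length (fun i -> a.[i] * s + b.[i])
```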
Round 2: Backend Optimizations (Target: 2-5x improvement)
- SIMD vectorization for Reference backend hot paths (see the sketch after this list)
- Memory pooling for intermediate tensor allocations
- Lazy evaluation for tensor operation chains
- In-place operation support for appropriate cases
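A sketch of what SIMD vectorization of an element-wise add could look like using `System.Numerics.Vector<'T>`; the real Reference backend hot path operates on its own raw-storage type, so this is only illustrative.

```fsharp
open System.Numerics

// Vectorized element-wise add over raw float32 arrays.
let addVectorized (a: float32[]) (b: float32[]) (result: float32[]) =
    let width = Vector<float32>.Count
    let mutable i = 0
    // Process 'width' lanes at a time using hardware SIMD where available.
    while i <= a.Length - width do
        let va = Vector<float32>(a, i)
        let vb = Vector<float32>(b, i)
        (va + vb).CopyTo(result, i)
        i <- i + width
    // Scalar tail for any remaining elements.
    while i < a.Length do
        result.[i] <- a.[i] + b.[i]
        i <- i + 1
```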
Round 3: Architecture Changes (Target: 5-10x improvement)
- Native backend with F# P/Invoke to optimized BLAS/LAPACK (see the sketch after this list)
- Tensor operation batching for better GPU utilization
- JIT compilation for tensor operation graphs
- Custom automatic differentiation with specialized reverse-mode implementation
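A sketch of the P/Invoke direction for matmul, binding `cblas_sgemm` directly. The native library name ("openblas") and the CBLAS constants are assumptions; a real backend would need per-platform library probing and error handling.

```fsharp
open System.Runtime.InteropServices

module NativeBlas =
    // Binding to CBLAS sgemm; "openblas" is assumed and would need per-platform
    // resolution (libopenblas.so, openblas.dll, ...).
    [<DllImport("openblas", EntryPoint = "cblas_sgemm")>]
    extern void cblas_sgemm(
        int order, int transA, int transB,
        int m, int n, int k,
        float32 alpha, float32[] a, int lda,
        float32[] b, int ldb,
        float32 beta, float32[] c, int ldc)

    let CblasRowMajor = 101
    let CblasNoTrans  = 111

    /// C <- A (m x k) * B (k x n); all matrices row-major in flat arrays.
    let sgemm m n k (a: float32[]) (b: float32[]) (c: float32[]) =
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0f, a, k, b, n, 0.0f, c, n)
```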
Build & Testing Commands
Standard Build
```
dotnet restore
dotnet build --configuration Release --no-restore --verbosity normal
dotnet test --configuration Release --no-build
```
Benchmark Commands
```
# Run F# benchmarks
dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"

# Run Python benchmarks (updates source files)
dotnet run --project tests\Furnace.Benchmarks.Python\Furnace.Benchmarks.Python.fsproj -c Release --filter "*"
```
GPU Testing
```
dotnet test /p:Furnace_TESTGPU=true
```
Profiling & Measurement Setup
Micro-benchmarking
- Use existing BenchmarkDotNet infrastructure
- Target operations in the `BasicTensorOps` benchmark suite (see the filter example after this list)
- Focus on operations with high iteration counts (the workloadSize / tensorSize factor)
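For instance, BenchmarkDotNet's `--filter` glob can restrict a run to that suite (the suite name here is taken from the text above):

```
dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*BasicTensorOps*"
```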
Profiling Tools
- .NET profilers: PerfView, JetBrains dotMemory, Visual Studio Diagnostics
- Native profilers: Intel VTune (for TorchSharp), perf on Linux
- Memory profilers: Application Verifier, Debug heap
Performance Measurement Strategy
- Compare Reference vs TorchSharp backends
- Measure against PyTorch baselines
- Test across tensor size matrix: small (16), medium (2048), large (65536)
- Profile both CPU and GPU (CUDA) execution paths
Maintainer Priorities
Based on existing issues and project direction:
- Correctness over performance - ensure mathematical correctness
- API stability - minimize breaking changes to public interfaces
- Cross-platform support - maintain Linux/macOS/Windows compatibility
- Educational use - support learning ML concepts with F#
- PyTorch compatibility - maintain familiar API patterns
Concrete Next Steps
Environment Setup for Performance Work
- Build the project: `dotnet build -c Release`
- Install benchmark dependencies: `dotnet tool restore`
- Run baseline benchmarks to establish current performance
- Set up profiling tools (PerfView/dotMemory)
- Create performance regression test suite
Development Workflow
- Identify bottleneck via profiling/benchmarks
- Implement optimization in isolated branch
- Measure improvement with micro-benchmarks
- Run regression tests to ensure correctness
- Integrate benchmark comparison into CI to catch performance regressions
This plan provides a systematic approach to performance improvement with clear measurement criteria and realistic improvement targets.
AI-generated content by Daily Perf Improver may contain mistakes.