Daily Perf Improver: Research and Plan #61


Furnace Performance Research and Improvement Plan

Performance Testing Infrastructure ✅

Current Setup:

  • Benchmarking Framework: BenchmarkDotNet with comprehensive comparison against Python/PyTorch
  • Benchmark Command: dotnet run --project tests/Furnace.Benchmarks/Furnace.Benchmarks.fsproj -c Release --filter "*"
  • Python Benchmarks: dotnet run --project tests/Furnace.Benchmarks.Python/Furnace.Benchmarks.Python.fsproj -c Release --filter "*"
  • Tests: dotnet test --configuration Release
  • Build: dotnet build --configuration Release
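
For orientation, a benchmark in this framework has roughly the following shape. This is a minimal sketch using plain float32 arrays; the real suite under tests/Furnace.Benchmarks exercises Furnace tensors, and the class and member names here are illustrative:

```fsharp
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]                        // report allocations, not just time
type AddBenchmark() =
    let mutable a : float32[] = [||]
    let mutable b : float32[] = [||]

    [<Params(100, 10_000, 1_000_000)>]     // spans the small/large tensor regimes
    member val Size = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        a <- Array.create this.Size 1.0f
        b <- Array.create this.Size 2.0f

    [<Benchmark>]
    member _.Add() = Array.map2 (+) a b    // result is returned so it is not dead code

BenchmarkRunner.Run<AddBenchmark>() |> ignore
```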

Current Performance Characteristics 📊

Backend Performance Analysis (from existing benchmarks):

  1. Reference Backend:
    • Pure F# implementation, runs on any .NET platform
    • Low setup overhead (~1.0μs) but slower execution (~0.0025μs per operation)
    • Better performance for small tensors (<10K elements)
    • 10-50x slower than Torch for large tensor operations
  2. Torch Backend:
    • CPU: Medium setup overhead (~8.0μs), fast execution (~0.0010μs per operation)
    • GPU: High setup overhead (~75.0μs), very fast execution (~0.000056μs per operation)
    • Optimal for large tensor operations (>10K elements)
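
Taken together, these numbers imply a crossover point between the two backends. As a back-of-envelope check (treating the quoted per-operation costs as per-element costs, which is an assumption the figures above do not state explicitly):

```fsharp
// Cost model: totalTime n = setup + n * perOp (all times in microseconds).
// Solve setupA + n * perOpA = setupB + n * perOpB for n.
let crossover (setupA, perOpA) (setupB, perOpB) =
    (setupB - setupA) / (perOpA - perOpB)

let referenceCpu = (1.0, 0.0025)   // ~1.0μs setup, ~0.0025μs per element
let torchCpu     = (8.0, 0.0010)   // ~8.0μs setup, ~0.0010μs per element

// Prints ~4667 elements, the same order of magnitude as the ~10K guidance above.
crossover referenceCpu torchCpu
|> printfn "Torch CPU overtakes Reference at roughly %.0f elements"
```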

Typical Workloads 🎯:

  • Machine Learning: Neural networks, optimization, gradient computation
  • Tensor Operations: Matrix multiplication, element-wise operations, convolutions
  • Scientific Computing: Linear algebra, differentiable programming
  • Data Types: Primarily float32, float64, int32 across CPU/GPU

Performance Bottlenecks Identified 🔍

1. High-Level API Overhead

  • Benchmark data shows the Tensor layer is often 2-3x slower than the RawTensor layer
  • Multiple abstraction layers sit between the user-facing API and the native operations (see the sketch below)
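
To make the layering cost concrete, here is a self-contained sketch of the effect: plain float32 arrays stand in for tensors, and each wrapper adds a virtual call plus argument validation, loosely mimicking the Tensor -> RawTensor -> native indirection. IAdd, wrap, and time are hypothetical names, not Furnace types:

```fsharp
open System.Diagnostics

type IAdd =
    abstract Add : float32[] * float32[] -> float32[]

let direct =
    { new IAdd with
        member _.Add(a, b) = Array.map2 (+) a b }

// Each wrapper adds a virtual call plus argument checking, mimicking
// one extra abstraction layer.
let wrap (inner: IAdd) =
    { new IAdd with
        member _.Add(a, b) =
            if a.Length <> b.Length then invalidArg "b" "shape mismatch"
            inner.Add(a, b) }

let layered = direct |> wrap |> wrap   // two extra abstraction layers

let time label (f: unit -> 'a) =
    let sw = Stopwatch.StartNew()
    for _ in 1 .. 1_000_000 do f () |> ignore
    printfn "%s: %d ms" label sw.ElapsedMilliseconds

let a = Array.create 64 1.0f           // a "small tensor", where overhead dominates
let b = Array.create 64 2.0f
time "direct " (fun () -> direct.Add(a, b))
time "layered" (fun () -> layered.Add(a, b))
```

At this size the per-call overhead is a visible fraction of total time, which is the regime where the 2-3x gap shows up.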

2. Small Tensor Performance

  • The Torch backend is suboptimal for tensors under ~10K elements because of its setup overhead
  • Opportunity for hybrid backend selection based on tensor size (see the sketch below)
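
A minimal sketch of what such dispatch could look like, assuming the ~10K-element crossover above; Backend and chooseBackend are illustrative names, not existing Furnace API:

```fsharp
// Hypothetical size-based backend dispatch.
type Backend =
    | Reference
    | Torch

/// Pick a backend from the element count, using the ~10K-element
/// crossover observed in the benchmarks above.
let chooseBackend (shape: int[]) =
    let numElements = Array.fold (*) 1 shape
    if numElements < 10_000 then Reference else Torch

printfn "%A" (chooseBackend [| 28; 28 |])     // 784 elements    -> Reference
printfn "%A" (chooseBackend [| 512; 512 |])   // 262144 elements -> Torch
```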

3. Known Optimization TODOs (from code analysis):

  • Torch.RawTensor.fs:1118: "TODO - this should be faster"
  • Tensor.fs:795: "TODO: The following can be slow, especially for reverse mode differentiation of the diagonal of a large tensor"
  • Several optimized routines noted as missing in the benchmarks (addWithAlpha, addInPlace); a sketch of their intended shape follows
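
As a sketch of the shape those missing routines could take (written against raw float32 buffers rather than actual RawTensor storage, which is an assumption):

```fsharp
/// r.[i] = a.[i] + alpha * b.[i] in a single pass, avoiding a temporary
/// for (alpha * b) followed by a second pass for the add.
let addWithAlpha (alpha: float32) (a: float32[]) (b: float32[]) : float32[] =
    let r = Array.zeroCreate a.Length
    for i in 0 .. a.Length - 1 do
        r.[i] <- a.[i] + alpha * b.[i]
    r

/// In-place accumulation into a, removing the result allocation entirely
/// (the pattern optimizer update loops want).
let addInPlace (a: float32[]) (b: float32[]) : unit =
    for i in 0 .. a.Length - 1 do
        a.[i] <- a.[i] + b.[i]
```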

4. Memory Access Patterns

  • Analysis shows ~3% overhead per additional field in RawTensor for small tensors (see the sketch below)
  • Potential for memory layout optimization
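
The per-field claim can be sanity-checked in isolation by timing header allocation for a lean versus a fat record; the field sets below are hypothetical, not RawTensor's actual layout:

```fsharp
type Lean =
    { Data: float32[]; Shape: int[] }
type Fat =
    { Data: float32[]; Shape: int[]; Strides: int[]
      DeviceId: int; DtypeId: int; Flags: int }

let data  = Array.zeroCreate<float32> 100   // a "small tensor" payload
let shape = [| 10; 10 |]

let mkLean () : Lean = { Data = data; Shape = shape }
let mkFat  () : Fat  =
    { Data = data; Shape = shape; Strides = [| 10; 1 |]
      DeviceId = 0; DtypeId = 0; Flags = 0 }

let time label (f: unit -> 'a) =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    for _ in 1 .. 10_000_000 do f () |> ignore
    printfn "%s: %d ms" label sw.ElapsedMilliseconds

// For small tensors the header allocation/initialization is a visible
// fraction of total cost, which is where a per-field overhead comes from.
time "2-field header" mkLean
time "6-field header" mkFat
```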

Performance Goals 🎯

Round 1 (Foundation): Low-hanging fruit, 10-30% improvements

  • Implement missing optimized tensor operations (addWithAlpha, addInPlace optimizations)
  • Fix identified performance TODOs in codebase
  • Optimize small tensor code paths in Reference backend
  • Reduce API layer overhead for common operations

Round 2 (Optimization): 20-50% improvements

  • Intelligent backend selection based on tensor size/operation
  • Memory layout optimizations for RawTensor
  • Batch operation optimizations for multiple small tensors
  • Cache optimization for frequently used tensor shapes (see the sketch after this list)
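
For the shape-cache item, a minimal sketch of the idea: derived metadata (here, row-major strides) is computed once per distinct shape and reused; stridesOf and cachedStridesOf are illustrative names:

```fsharp
open System.Collections.Concurrent

/// Row-major strides for a shape, e.g. [|2; 3; 4|] -> [|12; 4; 1|].
let stridesOf (shape: int[]) =
    let s = Array.create shape.Length 1
    for i in shape.Length - 2 .. -1 .. 0 do
        s.[i] <- s.[i + 1] * shape.[i + 1]
    s

let strideCache = ConcurrentDictionary<string, int[]>()

let cachedStridesOf (shape: int[]) =
    // int[] doesn't hash structurally, so key on a cheap string rendering.
    let key = shape |> Array.map string |> String.concat ","
    strideCache.GetOrAdd(key, fun _ -> stridesOf shape)

// A training loop touching [|64; 128|] thousands of times per epoch
// hits the cache after the first call.
cachedStridesOf [| 64; 128 |] |> printfn "%A"   // [|128; 1|]
```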

Round 3 (Advanced): 50%+ improvements

  • SIMD optimizations in the Reference backend for specific operations (see the sketch after this list)
  • Custom GPU kernels for common ML operations
  • Memory pooling for tensor allocation/deallocation
  • Lazy evaluation for tensor operation chains
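
For the SIMD item, the likely direction in the Reference backend is vectorized inner loops via System.Numerics.Vector; a minimal sketch for element-wise add:

```fsharp
open System.Numerics

/// Vectorized element-wise add with a scalar tail loop.
let addSimd (a: float32[]) (b: float32[]) (r: float32[]) =
    let width = Vector<float32>.Count        // e.g. 8 lanes with AVX2
    let mutable i = 0
    while i <= a.Length - width do
        let va = Vector<float32>(a, i)       // load `width` lanes from a
        let vb = Vector<float32>(b, i)
        (va + vb).CopyTo(r, i)               // one vector op for all lanes
        i <- i + width
    while i < a.Length do                    // scalar tail for the remainder
        r.[i] <- a.[i] + b.[i]
        i <- i + 1
```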

Environment Setup for Performance Work 🛠️

Prerequisites:

  • .NET 6.0 SDK
  • TorchSharp backend (CPU/GPU)

Build Steps (automated via .github/actions/daily-perf-improver/build-steps/action.yml):

dotnet restore
dotnet build --configuration Release --no-restore
dotnet test --configuration Release --no-build

Benchmarking Workflow:

  1. Make performance changes
  2. Run benchmarks: dotnet run --project tests/Furnace.Benchmarks/Furnace.Benchmarks.fsproj -c Release
  3. Compare with baseline results in BenchmarkDotNet.Artifacts/results/
  4. For a Python comparison: update tests/Furnace.Benchmarks.Python/ and run it with the Python benchmarks command listed above

Performance Measurement:

  • Wall-clock time measurements (with virtualization caveats)
  • Memory allocation analysis
  • Operation throughput (ops/sec)
  • Comparative analysis vs PyTorch baseline

Engineering Best Practices for Performance Work 🔧

Reliable Performance Testing:

  • Commands complete within ~1 minute, enabling rapid iteration
  • Clear visibility into code paths affecting performance
  • Repeatable benchmark results with statistical significance
  • Version-controlled baseline results for regression detection

Target Metrics:

  • Torch Backend: Reduce API overhead by 20-30%
  • Reference Backend: Improve large tensor performance by 2-3x
  • Hybrid Approach: Automatic backend selection saving 15-25% across mixed workloads
  • Memory: Reduce allocation overhead by 10-20%

Performance Bottleneck Categories 🏗️

CPU-bound: Reference backend mathematical operations, small tensor arithmetic
Memory-bound: Large tensor creation/copying, gradient computation
I/O-bound: GPU data transfer, model loading/saving
API-bound: High-level abstraction overhead, dynamic dispatch

Maintainer Considerations 👥

This plan focuses on:

  • Compatibility: No breaking API changes
  • Maintainability: Clean, well-documented optimizations
  • Testing: Comprehensive benchmarks and regression tests
  • Progressive: Incremental improvements with measured impact

AI-generated content by Daily Perf Improver may contain mistakes.
