Daily Perf Improver: Research and Plan #45

Daily Performance Improvement Research & Plan

Project Overview

Furnace is an F# tensor library with support for differentiable programming, designed for machine learning, probabilistic programming, and optimization. It provides:

  • Nested and mixed-mode differentiation
  • PyTorch familiar naming and idioms
  • Multiple backends: Reference (CPU-only F#), Torch (TorchSharp/LibTorch with CUDA support)
  • Common optimizers, model elements, differentiable probability distributions

Current Performance Testing Infrastructure

Benchmarking Framework

  • BenchmarkDotNet is used for micro-benchmarking
  • Benchmarks are in tests/Furnace.Benchmarks/
  • Python comparison benchmarks in tests/Furnace.Benchmarks.Python/
  • Command to run benchmarks: dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"
  • Current benchmark matrix tests:
    • Tensor sizes: 16, 2048, 65536 elements
    • Data types: int32, float32, float64
    • Devices: cpu, cuda
    • Operations: creation (zeros, ones, rand), basic math (addition, scalar ops), matrix ops (matmul)
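
A minimal BenchmarkDotNet sketch of how such a parameter matrix can be expressed in F# is shown below; the type names, the Tensor type, and the FurnaceImage.rand entry point are assumptions made for illustration and do not reflect the actual BasicTensorOps suite.

module PerfSketch

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
open Furnace   // assumed namespace exposing Tensor and FurnaceImage

[<MemoryDiagnoser>]
type TensorAddBenchmark() =
    let mutable a = Unchecked.defaultof<Tensor>
    let mutable b = Unchecked.defaultof<Tensor>

    // Mirrors the size axis of the existing matrix: 16 / 2048 / 65536 elements.
    [<Params(16, 2048, 65536)>]
    member val Size = 16 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        a <- FurnaceImage.rand([this.Size])   // assumed tensor-creation entry point
        b <- FurnaceImage.rand([this.Size])

    [<Benchmark>]
    member this.Add() = a + b

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<TensorAddBenchmark>() |> ignore
    0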

Current Performance Numbers

From existing benchmark results, the Reference backend significantly outperforms both alternatives on small-tensor workloads:

  • Reference backend is ~100x faster than PyTorch for small tensor operations
  • TorchSharp/Torch backend is ~2-3x faster than PyTorch but ~50x slower than Reference
  • Example (16-element float32 tensors on CPU):
    • PyTorch addition: ~759ms
    • TorchSharp addition: ~523ms
    • Reference addition: ~15ms

Performance Bottlenecks Analysis

1. Backend Layer Inefficiencies

  • TorchSharp Interop Overhead: Heavy cost converting between F# types and TorchSharp C# types
  • Type Conversion: Multiple conversions between Dtype/torch.ScalarType, Device/torch.Device
  • Handle Management: Each tensor operation creates new handles with disposal overhead

2. Tensor Creation Performance

  • Reference backend creates tensors ~100x faster than Torch backend
  • Significant overhead in fromCpuData operations through TorchSharp

3. Memory Management

  • No explicit tensor disposal (issue #19: Explicit disposal discussion) - currently relying on GC; a possible scoped-disposal pattern is sketched after this list
  • TorchSharp tensors accumulate in memory until GC
  • Potential for memory pressure during intensive computation
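
One way explicit disposal (issue #19) could be approached is a scope that tracks intermediate resources and releases them deterministically on exit. The sketch below is a generic illustration assuming backend tensor handles implement IDisposable; it is not an API proposed in the issue.

open System
open System.Collections.Generic

// Illustrative scope that collects IDisposable resources (e.g. backend tensor
// handles) and releases them deterministically instead of waiting for the GC.
type DisposeScope() =
    let tracked = List<IDisposable>()
    // Register a resource with the scope and return it unchanged.
    member _.Track (x: #IDisposable) = tracked.Add(x :> IDisposable); x
    interface IDisposable with
        member _.Dispose() =
            // Dispose in reverse order of creation.
            for i in tracked.Count - 1 .. -1 .. 0 do tracked.[i].Dispose()
            tracked.Clear()

// Usage sketch: intermediates tracked inside the 'use' block are freed when
// the scope ends rather than at some later GC cycle.
// use scope = new DisposeScope()
// let tmp = scope.Track(backendTensorHandle)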

4. Algorithm-Level Optimizations

  • Marsaglia Gaussian generator inefficiency (issue #23: Improve the Marsaglia Gaussian generator) - the second generated sample is discarded (see the sketch after this list)
  • Missing vectorization opportunities in Reference backend
  • No SIMD optimizations for CPU operations
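
The Marsaglia polar method produces two independent normal samples per accepted pair, so the fix tracked in issue #23 amounts to caching the spare value instead of discarding it. A self-contained sketch of that pattern (not the actual Furnace RNG code):

open System

// Illustrative Marsaglia polar sampler that caches the second sample
// instead of discarding it, roughly doubling throughput for normal draws.
type CachedGaussian(rng: Random) =
    let mutable spare = nan
    let mutable hasSpare = false
    member _.Next() =
        if hasSpare then
            hasSpare <- false
            spare
        else
            // Rejection-sample a point inside the unit circle.
            let mutable u = 0.0
            let mutable v = 0.0
            let mutable s = 0.0
            let mutable ok = false
            while not ok do
                u <- rng.NextDouble() * 2.0 - 1.0
                v <- rng.NextDouble() * 2.0 - 1.0
                s <- u * u + v * v
                ok <- s > 0.0 && s < 1.0
            let factor = sqrt (-2.0 * log s / s)
            spare <- v * factor        // keep the second sample for the next call
            hasSpare <- true
            u * factor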

5. Differentiation Overhead

  • Multiple tensor wrapper types (TensorC/TensorF/TensorR) add call overhead
  • Deep nesting in primalDeep operations with recursive calls
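
A deliberately simplified illustration of the wrapper-type recursion follows; the real Furnace types carry considerably more state, and the names below are a miniature for illustration, not the library's definitions.

// Illustrative-only miniature of the nested wrapper types: a constant tensor,
// a forward-mode wrapper (primal + tangent) and a reverse-mode wrapper.
type MiniTensor =
    | C of float[]
    | F of primal: MiniTensor * tangent: MiniTensor
    | R of primal: MiniTensor * adjointId: int

// Reaching the innermost primal requires a recursive descent through every
// nesting level, and each arithmetic op repeats pattern matches like this.
let rec primalDeep t =
    match t with
    | C _ -> t
    | F (p, _) -> primalDeep p
    | R (p, _) -> primalDeep p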

Typical Workloads & Bottlenecks

Machine Learning Workloads

  • Neural network training: Forward/backward passes with large matrix operations
  • Optimization algorithms (Adam, SGD): Many small tensor operations per step
  • Model inference: Mostly matrix multiplications and activations

Performance Characteristics

  • I/O bound: Data loading, model serialization
  • CPU bound: Reference backend operations, automatic differentiation
  • Memory bound: Large tensor operations, gradient accumulation
  • GPU bound: CUDA operations through TorchSharp when available

Performance Goals by Round

Round 1: Low-Hanging Fruit (Target: 20-50% improvement)

  1. Fix Marsaglia Gaussian generator - cache second sample (2x improvement for random normal)
  2. Optimize tensor creation paths - reduce type conversions in TorchSharp backend
  3. Add tensor operation fusion - combine multiple operations to reduce intermediate allocations (see the fusion sketch after this list)
  4. Improve scalar operations - optimize tensor-scalar arithmetic patterns
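
As an illustration of what operation fusion buys at the storage level, the sketch below contrasts a two-pass a*x + b with a single fused loop that avoids the intermediate allocation. Plain float32 arrays are used for clarity; this is not Furnace's tensor implementation.

// Unfused: computing a*x + b in two passes materialises an intermediate array.
let axpyUnfused (a: float32) (x: float32[]) (b: float32[]) =
    let tmp = Array.map (fun v -> a * v) x     // intermediate allocation
    Array.map2 (+) tmp b

// Fused: one pass, one output allocation, no intermediate.
let axpyFused (a: float32) (x: float32[]) (b: float32[]) =
    Array.init x.Length (fun i -> a * x.[i] + b.[i])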

Round 2: Backend Optimizations (Target: 2-5x improvement)

  1. SIMD vectorization for Reference backend hot paths (see the vectorization sketch after this list)
  2. Memory pooling for intermediate tensor allocations
  3. Lazy evaluation for tensor operation chains
  4. In-place operation support for appropriate cases
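
Element-wise kernels over contiguous float32 storage in the Reference backend are natural SIMD candidates. A sketch using System.Numerics.Vector, assuming the backend exposes raw float32[] buffers:

open System.Numerics

// Vectorised element-wise addition over raw float32 buffers.
// Processes Vector<float32>.Count lanes per iteration, then a scalar tail.
let addSimd (x: float32[]) (y: float32[]) (result: float32[]) =
    let width = Vector<float32>.Count
    let n = x.Length
    let mutable i = 0
    while i <= n - width do
        let v = Vector<float32>(x, i) + Vector<float32>(y, i)
        v.CopyTo(result, i)
        i <- i + width
    // Scalar tail for the remaining elements.
    while i < n do
        result.[i] <- x.[i] + y.[i]
        i <- i + 1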

Round 3: Architecture Changes (Target: 5-10x improvement)

  1. Native backend with F# P/Invoke to optimized BLAS/LAPACK (see the P/Invoke sketch after this list)
  2. Tensor operation batching for better GPU utilization
  3. JIT compilation for tensor operation graphs
  4. Custom automatic differentiation with specialized reverse-mode implementation
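
A sketch of what a P/Invoke BLAS binding could look like; the "openblas" library name and the choice of cblas_sgemm are assumptions about whichever BLAS a native backend would ship with, not an existing Furnace binding.

open System.Runtime.InteropServices

module NativeBlas =
    // CBLAS constants: row-major layout, no transposition.
    let CblasRowMajor = 101
    let CblasNoTrans = 111

    // Library name is an assumption; a real backend would resolve the
    // platform-specific BLAS at load time.
    [<DllImport("openblas", CallingConvention = CallingConvention.Cdecl)>]
    extern void cblas_sgemm(int layout, int transA, int transB, int m, int n, int k, float32 alpha, float32[] A, int lda, float32[] B, int ldb, float32 beta, float32[] C, int ldc)

    // C <- A(m x k) * B(k x n), single precision, row-major.
    let matmul (A: float32[]) (B: float32[]) (C: float32[]) m n k =
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0f, A, k, B, n, 0.0f, C, n)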

Build & Testing Commands

Standard Build

dotnet restore
dotnet build --configuration Release --no-restore --verbosity normal
dotnet test --configuration Release --no-build

Benchmark Commands

# Run F# benchmarks
dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"

# Run Python benchmarks (updates source files)
dotnet run --project tests\Furnace.Benchmarks.Python\Furnace.Benchmarks.Python.fsproj -c Release --filter "*"

GPU Testing

dotnet test /p:Furnace_TESTGPU=true

Profiling & Measurement Setup

Micro-benchmarking

  • Use existing BenchmarkDotNet infrastructure
  • Target operations in BasicTensorOps benchmark suite
  • Focus on operations with high iteration counts (workloadSize / tensorSize factor)

Profiling Tools

  • .NET profilers: PerfView, JetBrains dotMemory, Visual Studio Diagnostics
  • Native profilers: Intel VTune (for TorchSharp), perf on Linux
  • Memory profilers: Application Verifier, Debug heap

Performance Measurement Strategy

  • Compare Reference vs TorchSharp backends
  • Measure against PyTorch baselines
  • Test across tensor size matrix: small (16), medium (2048), large (65536)
  • Profile both CPU and GPU (CUDA) execution paths

Maintainer Priorities

Based on existing issues and project direction:

  1. Correctness over performance - ensure mathematical correctness
  2. API stability - minimize breaking changes to public interfaces
  3. Cross-platform support - maintain Linux/macOS/Windows compatibility
  4. Educational use - support learning ML concepts with F#
  5. PyTorch compatibility - maintain familiar API patterns

Concrete Next Steps

Environment Setup for Performance Work

  1. Build project: dotnet build -c Release
  2. Install benchmark dependencies: dotnet tool restore
  3. Run baseline benchmarks to establish current performance
  4. Set up profiling tools (PerfView/dotMemory)
  5. Create performance regression test suite
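
For step 5, one lightweight starting point is a timing guard that compares a measured operation against a stored baseline with a generous tolerance; the baseline values and allowed factor would be project decisions, and the helper below is only a sketch.

open System.Diagnostics

// Runs 'op' repeatedly and fails if the median time exceeds the stored
// baseline by more than the allowed factor. Baselines and thresholds are
// placeholders; a real suite would load them from checked-in data.
let assertNotSlower (name: string) (baselineMs: float) (allowedFactor: float) (iterations: int) (op: unit -> unit) =
    op ()   // warm-up
    let times =
        [| for _ in 1 .. iterations do
               let sw = Stopwatch.StartNew()
               op ()
               yield sw.Elapsed.TotalMilliseconds |]
    let median = (Array.sort times).[times.Length / 2]
    if median > baselineMs * allowedFactor then
        failwithf "%s regressed: median %.2f ms vs baseline %.2f ms" name median baselineMs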

Development Workflow

  1. Identify bottleneck via profiling/benchmarks
  2. Implement optimization in isolated branch
  3. Measure improvement with micro-benchmarks
  4. Run regression tests to ensure correctness
  5. Integrate benchmarks into CI with automated baseline comparison

This plan provides a systematic approach to performance improvement with clear measurement criteria and realistic improvement targets.

AI-generated content by Daily Perf Improver may contain mistakes.
