# Performance Comparison: strided-opteinsum vs OMEinsum.jl

This document summarizes the key optimizations that make strided-opteinsum
faster than OMEinsum.jl on the
[einsum benchmark](https://benchmark.einsum.org/) suite.

Both systems use the same pre-computed contraction paths for a fair comparison
(omeinsum_path mode). Benchmarks are run on AMD EPYC 7713P (Zen 3) and
Apple Silicon M2. See the
[benchmark suite](https://github.com/tensor4all/strided-rs-benchmark-suite)
for full results.

## Summary of Advantages

| Optimization | Biggest impact | Typical speedup |
|---|---|---|
| Lazy permutation + source-order copy | Tensor networks (many small dims) | 1.3–1.7x |
| Parallel copy/permutation (rayon) | 4T tensor networks | 1.7x vs Julia |
| HPTT-based permutation | Writeback and finalize paths | 6x vs naive |
| Buffer reuse | All instances | Reduced allocation overhead |
| faer GEMM | 1T small-to-medium matrices | Competitive with OpenBLAS |

## 1. Lazy Permutation + Source-Order Copy

**The single largest source of speedup**, especially on tensor-network
instances with many small dimensions (e.g. 24 binary dims of size 2).

### Design

OMEinsum.jl eagerly materializes every intermediate permutation via
`permutedims!`. strided-rs instead uses **metadata-only permutation**: between
contraction steps, only the dims/strides arrays are reordered; no data is
copied.

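The idea can be sketched in a few lines. `StridedView` and `permute` below are illustrative names for this sketch, not the actual strided-rs API:

```rust
// Hypothetical sketch of metadata-only permutation: reordering axes touches
// only the dims/strides vectors, so it costs O(rank), not O(elements).
struct StridedView {
    dims: Vec<usize>,    // logical extent of each axis
    strides: Vec<isize>, // element stride of each axis in the backing buffer
}

impl StridedView {
    /// Return a view with axes reordered by `perm`; no element is moved.
    fn permute(&self, perm: &[usize]) -> StridedView {
        StridedView {
            dims: perm.iter().map(|&p| self.dims[p]).collect(),
            strides: perm.iter().map(|&p| self.strides[p]).collect(),
        }
    }
}
```

Only when a later step actually needs contiguous memory does any copy happen.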
When a subsequent step needs contiguous input for GEMM, strided-rs copies
using **source-stride-order iteration** (`copy_strided_src_order`). Because
the source data is physically contiguous (column-major output from the
previous GEMM), iterating in source-stride order yields sequential reads that
exploit the hardware prefetcher, while the scattered writes are absorbed by
the CPU's write-combining buffers.

This beats HPTT's destination-stride-order traversal in this case because:
- HPTT's dst-order iteration causes scattered reads from a cold L3 cache
- For 24 binary dims, HPTT's bilateral fusion still yields ~17 fused dims with
  a 2x2 inner tile and 15 recursion levels, i.e. high per-element overhead

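As a minimal illustration of the access pattern (not the strided-rs implementation), here is a source-order copy of a column-major matrix into a row-major buffer; reads are stride-1, writes are strided:

```rust
// Copy a column-major `rows x cols` matrix into a row-major destination by
// iterating in *source* order: every read is sequential even though the
// writes jump by `cols` elements.
fn copy_src_order(src: &[f64], dst: &mut [f64], rows: usize, cols: usize) {
    // src is column-major: element (r, c) lives at src[c * rows + r].
    // dst is row-major:    element (r, c) lives at dst[r * cols + c].
    for c in 0..cols {
        for r in 0..rows {
            dst[r * cols + c] = src[c * rows + r]; // sequential read, scattered write
        }
    }
}
```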
See [permutation-optimization.md](permutation-optimization.md) for detailed
analysis and bandwidth measurements.

### Impact

`tensornetwork_permutation_light_415` (AMD EPYC, 1T):
- strided-rs (faer): **283 ms**
- OMEinsum.jl: 476 ms
- Speedup: **1.68x**

## 2. Parallel Copy/Permutation

Copy and permutation operations are parallelized via rayon when the
`parallel` feature is enabled:

- `copy_strided_src_order_par` in strided-einsum2: splits outer source-stride
  dimensions across rayon threads (threshold: > 1M elements)
- `strided-kernel/parallel`: parallelizes `copy_into`, `map_into`, and other
  kernel operations via rayon
- `strided-perm/parallel`: parallelizes the outermost HPTT ComputeNode loop

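The splitting strategy can be sketched with `std::thread` in place of rayon so the example stays dependency-free; `copy_outer_par` is a hypothetical name, and the real code only parallelizes above a size threshold:

```rust
use std::thread;

// Sketch: split the outermost dimension across threads, each thread copying
// a disjoint contiguous block. Assumes src.len() is divisible by `outer`.
fn copy_outer_par(src: &[f64], dst: &mut [f64], outer: usize) {
    assert_eq!(src.len() % outer, 0);
    let block = src.len() / outer;
    thread::scope(|s| {
        for (d, sblk) in dst.chunks_mut(block).zip(src.chunks(block)) {
            s.spawn(move || d.copy_from_slice(sblk));
        }
    });
}
```

Scoped threads let each worker borrow its own disjoint slice of the destination, so no locking is needed.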
**Important**: The `parallel` feature must propagate through the full
dependency chain: `strided-opteinsum/parallel` → `strided-einsum2/parallel`
→ `strided-kernel/parallel` → `strided-perm/parallel`. A bug where
`strided-kernel/parallel` was not enabled was fixed in
[840d9b8](https://github.com/tensor4all/strided-rs/commit/840d9b8).

### Impact

`tensornetwork_permutation_light_415` (AMD EPYC, 4T):
- strided-rs (faer): **222 ms**
- OMEinsum.jl: 388 ms
- Speedup: **1.75x**

## 3. HPTT-Based Permutation

The `strided-perm` crate implements a cache-efficient tensor transpose based
on HPTT (Springer et al. 2017):

- Bilateral dimension fusion
- 4x4 (f64) / 8x8 (f32) micro-kernel transpose
- Macro-kernel blocking (BLOCK = 16 for f64)
- Recursive loop nest with a stride-1 fast path

It is used in writeback paths (`finalize_into`), BGEMM packing, and
single-tensor operations where HPTT's blocked approach is effective (warm
cache, regular stride patterns).

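The core blocking idea behind HPTT can be illustrated with a simple tiled 2-D transpose (a sketch, not the strided-perm implementation, which fuses dimensions and uses SIMD micro-kernels):

```rust
const B: usize = 4; // tile size; strided-perm uses 4x4 micro-kernels for f64

// Transpose a row-major `n x m` matrix into a row-major `m x n` destination
// in B x B tiles, so reads and writes both stay within a few cache lines
// per tile instead of streaming one side with a large stride.
fn transpose_blocked(src: &[f64], dst: &mut [f64], n: usize, m: usize) {
    for i0 in (0..n).step_by(B) {
        for j0 in (0..m).step_by(B) {
            // Transpose one tile; min() handles the ragged edge tiles.
            for i in i0..(i0 + B).min(n) {
                for j in j0..(j0 + B).min(m) {
                    dst[j * n + i] = src[i * m + j];
                }
            }
        }
    }
}
```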
See [permutation-optimization.md](permutation-optimization.md) for details.

## 4. Buffer Reuse

strided-opteinsum reuses intermediate buffers across contraction steps via a
thread-local pool (`alloc_col_major_uninit_with_pool`). This avoids repeated
allocation and deallocation of large temporary arrays during multi-step
contractions.

OMEinsum.jl, by contrast, allocates fresh arrays at each step.

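A minimal thread-local pool can look like the following sketch; `acquire`/`release` are hypothetical names, and the real pool hands out uninitialized column-major buffers rather than zeroed `Vec`s:

```rust
use std::cell::RefCell;

// Thread-local free list of buffers. Released buffers are kept and handed
// back out, so a steady-state contraction loop allocates nothing.
thread_local! {
    static POOL: RefCell<Vec<Vec<f64>>> = RefCell::new(Vec::new());
}

/// Take a buffer with capacity for at least `len` elements from the pool,
/// or allocate a fresh one if none fits.
fn acquire(len: usize) -> Vec<f64> {
    POOL.with(|p| {
        let mut pool = p.borrow_mut();
        match pool.iter().position(|b| b.capacity() >= len) {
            Some(i) => {
                let mut buf = pool.swap_remove(i);
                buf.clear();
                buf.resize(len, 0.0); // reuses the existing allocation
                buf
            }
            None => vec![0.0; len],
        }
    })
}

/// Return a buffer to the pool for reuse by a later contraction step.
fn release(buf: Vec<f64>) {
    POOL.with(|p| p.borrow_mut().push(buf));
}
```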
## 5. faer GEMM

The [faer](https://github.com/sarah-quinones/faer-rs) backend provides a
pure-Rust GEMM that is competitive with OpenBLAS, especially for
small-to-medium matrices at 1 thread. On the benchmark suite, faer often
matches or beats OpenBLAS at 1T, and it pulls significantly ahead at 4T when
combined with rayon-parallelized copy operations: faer and rayon share the
same thread pool, avoiding the overhead of running separate OpenMP and rayon
pools.

## Where OMEinsum.jl Still Wins

- `str_nw_mera_closed_120` (opt_size, 1T): Julia 1296 ms vs strided-rs
  1363 ms. MERA networks are GEMM-dominated with large matrices, where
  Julia's OpenBLAS 0.3.29 (via libblastrampoline) is slightly faster.
- Some 4T instances show Julia winning on small-tensor workloads, where
  strided-rs's threading overhead exceeds the parallelism benefit.