Commit 4f1eea5 (1 parent: 840d9b8)

shinaoka and claude committed

docs: add performance comparison vs OMEinsum.jl

Summarize the key optimizations that make strided-opteinsum faster: lazy permutation + source-order copy, parallel copy/permutation, HPTT-based permutation, buffer reuse, and faer GEMM. References permutation-optimization.md for detailed analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 file changed: docs/performance-vs-omeinsum.md (120 additions, 0 deletions)

# Performance Comparison: strided-opteinsum vs OMEinsum.jl

This document summarizes the key optimizations that make strided-opteinsum
faster than OMEinsum.jl on the
[einsum benchmark](https://benchmark.einsum.org/) suite.

Both systems use the same pre-computed contraction paths for fair comparison
(omeinsum_path mode). Benchmarks are run on AMD EPYC 7713P (Zen 3) and
Apple Silicon M2. See the
[benchmark suite](https://github.com/tensor4all/strided-rs-benchmark-suite)
for full results.

## Summary of Advantages

| Optimization | Biggest impact | Typical speedup |
|---|---|---|
| Lazy permutation + source-order copy | Tensor networks (many small dims) | 1.3–1.7x |
| Parallel copy/permutation (rayon) | 4T tensor networks | 1.7x vs Julia |
| HPTT-based permutation | Writeback and finalize paths | 6x vs naive |
| Buffer reuse | All instances | Reduced allocation overhead |
| faer GEMM | 1T small-to-medium matrices | Competitive with OpenBLAS |

## 1. Lazy Permutation + Source-Order Copy

**The single largest source of speedup**, especially on tensor network
instances with many small dimensions (e.g. 24 binary dims of size 2).

### Design

OMEinsum.jl eagerly materializes every intermediate permutation via
`permutedims!`. strided-rs uses **metadata-only permutation**: between
contraction steps, only the dims/strides arrays are reordered; no data is
copied.
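
A minimal sketch of what metadata-only permutation looks like (the types and
names here are illustrative, not the actual strided-rs API): permuting a view
reorders its `dims`/`strides` metadata while the underlying buffer is untouched.

```rust
/// Minimal strided view: `dims` and `strides` describe how the flat
/// buffer is addressed (element strides, column-major by convention).
struct StridedView<'a> {
    data: &'a [f64],
    dims: Vec<usize>,
    strides: Vec<usize>,
}

impl<'a> StridedView<'a> {
    /// Permute axes by reordering metadata only; `data` is never copied.
    fn permuted(&self, perm: &[usize]) -> StridedView<'a> {
        StridedView {
            data: self.data,
            dims: perm.iter().map(|&p| self.dims[p]).collect(),
            strides: perm.iter().map(|&p| self.strides[p]).collect(),
        }
    }

    /// Read one element at a multi-index through the strides.
    fn get(&self, idx: &[usize]) -> f64 {
        let off: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        self.data[off]
    }
}
```

A transposed view of a 2x3 array is then just `v.permuted(&[1, 0])`: an O(ndim)
operation instead of an O(n) copy.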

When a subsequent step needs contiguous input for GEMM, strided-rs copies
using **source-stride-order iteration** (`copy_strided_src_order`). Because
the source data is physically contiguous (column-major output from the
previous GEMM), iterating in source-stride order gives sequential reads that
exploit the hardware prefetcher; the scattered writes are absorbed by
write-combining buffers.
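
The copy step can be sketched as follows, assuming a contiguous column-major
source and a destination addressed through permuted strides (a simplified,
scalar stand-in for `copy_strided_src_order`, not the real implementation):

```rust
/// Copy a contiguous column-major `src` (shape `dims`) into `dst`,
/// where `dst` is addressed through `dst_strides` (a permuted layout).
/// Iteration follows the source's memory order, so reads are sequential.
fn copy_src_order(src: &[f64], dst: &mut [f64], dims: &[usize], dst_strides: &[usize]) {
    let mut idx = vec![0usize; dims.len()];
    for &x in src.iter() {
        // Destination offset through the permuted strides: the write is
        // scattered, and is absorbed by the CPU's write-combining buffers.
        let off: usize = idx.iter().zip(dst_strides).map(|(i, s)| i * s).sum();
        dst[off] = x;
        // Advance the column-major multi-index (first dim fastest).
        for d in 0..dims.len() {
            idx[d] += 1;
            if idx[d] < dims[d] {
                break;
            }
            idx[d] = 0;
        }
    }
}
```

Reads walk `src` front to back; only the writes are scattered.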

This is faster than HPTT (destination-stride-order) for this case because:

- HPTT's dst-order causes scattered reads from cold L3 cache
- For 24 binary dims, HPTT's bilateral fusion yields ~17 fused dims with a
  2x2 inner tile and 15 recursion levels, i.e. high per-element overhead

See [permutation-optimization.md](permutation-optimization.md) for detailed
analysis and bandwidth measurements.

### Impact

`tensornetwork_permutation_light_415` (AMD EPYC, 1T):

- strided-rs (faer): **283 ms**
- OMEinsum.jl: 476 ms
- Speedup: **1.68x**

## 2. Parallel Copy/Permutation

Copy and permutation operations are parallelized via rayon when the
`parallel` feature is enabled:

- `copy_strided_src_order_par` in strided-einsum2: splits outer source-stride
  dimensions across rayon threads (threshold: > 1M elements)
- `strided-kernel/parallel`: parallelizes `copy_into`, `map_into`, and other
  kernel operations via rayon
- `strided-perm/parallel`: parallelizes the outermost HPTT ComputeNode loop
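
The splitting idea can be sketched with `std::thread` standing in for rayon
(the real `copy_strided_src_order_par` splits outer source-stride dimensions
and applies the 1M-element threshold; this sketch just partitions a flat copy):

```rust
use std::thread;

/// Split a contiguous copy across `n_threads` worker threads.
/// Each thread owns a disjoint chunk of `dst`, so no synchronization
/// is needed beyond the scope join. Assumes `n_threads > 0`.
fn parallel_copy(src: &[f64], dst: &mut [f64], n_threads: usize) {
    if src.is_empty() {
        return;
    }
    let chunk = src.len().div_ceil(n_threads);
    thread::scope(|s| {
        for (src_c, dst_c) in src.chunks(chunk).zip(dst.chunks_mut(chunk)) {
            // Each spawned thread copies one disjoint chunk.
            s.spawn(move || dst_c.copy_from_slice(src_c));
        }
    }); // scope joins all threads before returning
}
```

rayon's `par_chunks` version avoids spawning fresh threads per call by reusing
its global pool, which is what makes the shared faer/rayon pool noted below
matter at 4T.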

**Important**: the `parallel` feature must propagate through the full
dependency chain: `strided-opteinsum/parallel` → `strided-einsum2/parallel` →
`strided-kernel/parallel` → `strided-perm/parallel`. A bug where
`strided-kernel/parallel` was not enabled was fixed in
[840d9b8](https://github.com/tensor4all/strided-rs/commit/840d9b8).
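
Feature propagation of this kind is normally expressed by forwarding the
feature in each crate's `Cargo.toml`. A sketch of the chain described above
(manifests abbreviated and hypothetical, dependency declarations omitted):

```toml
# strided-opteinsum/Cargo.toml: forward `parallel` one level down
[features]
parallel = ["strided-einsum2/parallel"]

# strided-einsum2/Cargo.toml: forward to both lower-level crates
[features]
parallel = ["strided-kernel/parallel", "strided-perm/parallel"]
```

A break anywhere in this chain silently disables parallelism in the crates
below it, which is exactly the class of bug fixed in 840d9b8.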

### Impact

`tensornetwork_permutation_light_415` (AMD EPYC, 4T):

- strided-rs (faer): **222 ms**
- OMEinsum.jl: 388 ms
- Speedup: **1.75x**

## 3. HPTT-Based Permutation

The `strided-perm` crate implements a cache-efficient tensor transpose based
on HPTT (Springer et al. 2017):

- Bilateral dimension fusion
- 4x4 (f64) / 8x8 (f32) micro-kernel transpose
- Macro-kernel blocking (BLOCK=16 for f64)
- Recursive loop nest with a stride-1 fast path
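
The micro-kernel at the bottom of this scheme can be sketched in scalar form
(the real kernel uses SIMD shuffles; `lda`/`ldb` are leading dimensions in
elements, and the function names here are illustrative):

```rust
/// 4x4 micro-kernel transpose for f64: the innermost building block of
/// an HPTT-style blocked transpose. Reads a 4x4 tile of `a` (row stride
/// `lda`) and writes its transpose into `b` (row stride `ldb`).
fn transpose_4x4(a: &[f64], lda: usize, b: &mut [f64], ldb: usize) {
    for i in 0..4 {
        for j in 0..4 {
            // b[j][i] = a[i][j]: the tile fits in registers/L1, so both
            // the reads and the writes stay cache-friendly.
            b[j * ldb + i] = a[i * lda + j];
        }
    }
}
```

The macro-kernel walks 16x16 blocks and invokes this tile kernel, and the
recursive loop nest handles the remaining (fused) outer dimensions.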

Used in writeback paths (`finalize_into`), BGEMM packing, and single-tensor
operations where HPTT's blocked approach is effective (warm cache, regular
stride patterns).

See [permutation-optimization.md](permutation-optimization.md) for details.

## 4. Buffer Reuse

strided-opteinsum reuses intermediate buffers across contraction steps via a
thread-local pool (`alloc_col_major_uninit_with_pool`). This avoids repeated
allocation/deallocation of large temporary arrays during multi-step
contractions.

OMEinsum.jl allocates fresh arrays at each step.
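
A thread-local pool of this kind can be sketched as follows (a simplified
analogue of `alloc_col_major_uninit_with_pool`; the real pool hands back
uninitialized column-major buffers, whereas this sketch zero-fills):

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread stash of returned scratch buffers awaiting reuse.
    static POOL: RefCell<Vec<Vec<f64>>> = RefCell::new(Vec::new());
}

/// Fetch a buffer of `len` elements, reusing a pooled allocation with
/// sufficient capacity when one exists, allocating only otherwise.
fn take_buffer(len: usize) -> Vec<f64> {
    POOL.with(|p| {
        let mut pool = p.borrow_mut();
        if let Some(pos) = pool.iter().position(|b| b.capacity() >= len) {
            let mut buf = pool.swap_remove(pos);
            buf.clear();
            buf.resize(len, 0.0); // reuses the existing allocation
            buf
        } else {
            vec![0.0; len]
        }
    })
}

/// Return a buffer to the pool for a later contraction step.
fn put_buffer(buf: Vec<f64>) {
    POOL.with(|p| p.borrow_mut().push(buf));
}
```

Across a multi-step contraction, each intermediate can call `take_buffer` /
`put_buffer` instead of hitting the allocator, which is where the saved
overhead comes from.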

## 5. faer GEMM

The [faer](https://github.com/sarah-quinones/faer-rs) backend provides a
pure-Rust GEMM that is competitive with OpenBLAS, especially for
small-to-medium matrices at 1 thread. On the benchmark suite, faer often
matches or beats OpenBLAS at 1T, and significantly outperforms it at 4T when
combined with rayon-parallelized copy operations, since faer and rayon share
the same thread pool, avoiding the overhead of running separate OpenMP and
rayon pools.

## Where OMEinsum.jl Still Wins

- `str_nw_mera_closed_120` (opt_size, 1T): Julia 1296 ms vs strided-rs
  1363 ms. MERA networks are GEMM-dominated with large matrices, where
  Julia's OpenBLAS 0.3.29 (via libblastrampoline) is slightly faster.
- Some 4T instances show Julia winning on small-tensor workloads, where
  strided-rs's threading overhead exceeds the benefit of the parallelism.
