# strided-rs (HPTT + OpenBLAS) vs OMEinsum.jl Benchmark

File: `docs/hptt-openblas-vs-julia-benchmark.md`

Branch: `perf/src-vs-dst-order`

## Purpose

Verify that strided-rs with HPTT as the default copy strategy (instead of
source-stride-order) remains competitive with OMEinsum.jl. If so, there is no
need for an adaptive copy strategy — HPTT can be the universal default.

## Setup

- **Rust**: strided-opteinsum with `blas` + `hptt-input-copy` features.
Copy elision (`try_fuse_group`) is enabled; when a copy is needed, HPTT
(destination-stride-order) is used. GEMM backend: OpenBLAS 0.3.29.
- **Julia**: OMEinsum.jl v0.9.3 with pre-computed contraction paths
(`omeinsum_path` mode). Julia 1.10.0, BLAS vendor: lbt (OpenBLAS).
- **Hardware**: AMD EPYC 7713P
- **Timing**: median of 15 runs, 3 warmup
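The timing protocol above (3 warmup runs, then the median of 15 timed runs) can be sketched as follows. This is a minimal illustration, not the actual benchmark harness; `median_ms` is a hypothetical helper name.

```rust
use std::time::Instant;

// Run `f` 3 times for warmup, then return the median of 15 timed runs
// in milliseconds. Illustrative sketch of the benchmark methodology.
fn median_ms<F: FnMut()>(mut f: F) -> f64 {
    for _ in 0..3 {
        f(); // warmup: populate caches, trigger lazy init
    }
    let mut samples: Vec<f64> = (0..15)
        .map(|_| {
            let t = Instant::now();
            f();
            t.elapsed().as_secs_f64() * 1e3
        })
        .collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    samples[samples.len() / 2] // 15 samples -> index 7 is the median
}
```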

## Results

### 1T — opt_flops

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 5209 | SKIP | — |
| lm_brackets_4_4d | 19 | 30 | **0.62x** |
| lm_sentence_3_12d | 76 | 80 | **0.95x** |
| lm_sentence_4_4d | 22 | 33 | **0.66x** |
| str_matrix_chain_100 | 14 | 16 | **0.84x** |
| str_mps_varying_200 | 15 | 37 | **0.42x** |
| mera_closed | 1361 | 1386 | **0.98x** |
| mera_open | 880 | 1251 | **0.70x** |
| tn_focus | 455 | 491 | **0.93x** |
| tn_light | 450 | 495 | **0.91x** |

### 1T — opt_size

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 1632 | SKIP | — |
| lm_brackets_4_4d | 20 | 32 | **0.60x** |
| lm_sentence_3_12d | 60 | 92 | **0.66x** |
| lm_sentence_4_4d | 26 | 33 | **0.77x** |
| str_matrix_chain_100 | 12 | 17 | **0.69x** |
| str_mps_varying_200 | 17 | 35 | **0.48x** |
| mera_closed | 1173 | 1286 | **0.91x** |
| mera_open | 914 | 1298 | **0.70x** |
| tn_focus | 449 | 495 | **0.91x** |
| tn_light | 451 | 500 | **0.90x** |

### 4T — opt_flops

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 4092 | SKIP | — |
| lm_brackets_4_4d | 20 | 42 | **0.49x** |
| lm_sentence_3_12d | 53 | 59 | **0.89x** |
| lm_sentence_4_4d | 23 | 46 | **0.51x** |
| str_matrix_chain_100 | 14 | 17 | **0.81x** |
| str_mps_varying_200 | 23 | 59 | **0.39x** |
| mera_closed | 587 | 931 | **0.63x** |
| mera_open | 353 | 798 | **0.44x** |
| tn_focus | 352 | 444 | **0.79x** |
| tn_light | 358 | 446 | **0.80x** |

### 4T — opt_size

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 1202 | SKIP | — |
| lm_brackets_4_4d | 22 | 40 | **0.55x** |
| lm_sentence_3_12d | 39 | 63 | **0.63x** |
| lm_sentence_4_4d | 29 | 49 | **0.59x** |
| str_matrix_chain_100 | 8 | 16 | **0.48x** |
| str_mps_varying_200 | 20 | 44 | **0.45x** |
| mera_closed | 457 | 704 | **0.65x** |
| mera_open | 356 | 796 | **0.45x** |
| tn_focus | 353 | 457 | **0.77x** |
| tn_light | 356 | 464 | **0.77x** |

## Analysis

Rust with HPTT is **equal to or faster than Julia on every instance**, in both
the 1T and 4T configurations.

- **1T**: Rust is 0.42x–0.98x of Julia (2–58% faster across instances)
- **4T**: Rust is 0.39x–0.89x of Julia (11–61% faster across instances)
- The 4T advantage is larger because strided-rs parallelizes both
permutation copies (via rayon) and GEMM (via OpenBLAS threads), while
Julia's OMEinsum only parallelizes GEMM

Even `tn_focus` and `tn_light` — the instances where HPTT is slower than
source-stride-order in isolation (see `src-vs-dst-order-experiment.md`) —
still outperform Julia. Copy elision (`try_fuse_group`) compensates for
HPTT's overhead on these degenerate many-small-dims cases.

## Conclusion

**HPTT can be the default copy strategy** for the `Contract` CPU backend.
There is no need for an adaptive strategy that switches between HPTT and
source-stride-order based on tensor shape. The combination of copy elision +
HPTT is sufficient to match or exceed Julia's OMEinsum.jl on all tested
workloads.

## Notes

- `gm_queen5_5_3` is skipped by Julia due to a `MethodError` (3D+ array
incompatibility in OMEinsum.jl)
- Julia's IQR is generally wider than Rust's, indicating more run-to-run
  variance (likely due to GC pressure)
# Source-Stride-Order vs Destination-Stride-Order (HPTT) Experiment

File: `docs/src-vs-dst-order-experiment.md`

Branch: `perf/src-vs-dst-order`

## Hypothesis

The eager-HPTT experiment showed a 26–31% regression on `mera_open` when
permutations are eagerly materialized. However, that experiment conflated two
factors: **copy elision** (`try_fuse_group`) and **copy strategy** (source-order
vs destination-order). This experiment isolates the copy strategy factor by
disabling copy elision (`force-copy` feature) and comparing source-stride-order
copy vs HPTT (destination-stride-order) copy.

## Setup

Two feature flags added to `strided-einsum2`:

- `force-copy`: Forces `needs_copy = true` in all `prepare_input_*` and
`prepare_output_*` functions, disabling `try_fuse_group` elision.
- `hptt-input-copy`: Switches `prepare_input_owned` from source-stride-order
copy to HPTT (`strided_kernel::copy_into_col_major`).
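The `force-copy` gating might look roughly like this. This is a hypothetical sketch; the real `prepare_input_*` internals in strided-einsum2 may differ, and `needs_copy` here is an illustrative function, not the actual API.

```rust
// Hypothetical sketch: when the `force-copy` feature is enabled, the
// elision check is bypassed and every input is materialized.
fn needs_copy(fusable_without_copy: bool) -> bool {
    if cfg!(feature = "force-copy") {
        return true; // disable `try_fuse_group` elision entirely
    }
    !fusable_without_copy
}
```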

Three configurations benchmarked:

1. **Baseline**: Default (copy elision enabled, src-order when copy needed)
2. **force-copy + src-order**: Copy elision disabled, source-stride-order copy
3. **force-copy + HPTT**: Copy elision disabled, HPTT destination-order copy

## Results (AMD EPYC 7713P, faer, 1T)

### opt_flops

| Instance | Baseline | force+src | force+HPTT | src vs HPTT |
|---|---|---|---|---|
| gm_queen5_5_3 | 5775 ms | 6400 ms (+11%) | 6052 ms (+5%) | HPTT 5% faster |
| lm_brackets_4_4d | 19 ms | 35 ms (+80%) | 24 ms (+23%) | **HPTT 31% faster** |
| lm_sentence_3_12d | 65 ms | 92 ms (+42%) | 78 ms (+19%) | **HPTT 16% faster** |
| lm_sentence_4_4d | 17 ms | 34 ms (+99%) | 24 ms (+41%) | **HPTT 29% faster** |
| str_matrix_chain_100 | 11 ms | 20 ms (+80%) | 14 ms (+25%) | **HPTT 30% faster** |
| str_mps_varying_200 | 16 ms | 21 ms (+31%) | 16 ms (-3%) | **HPTT 25% faster** |
| mera_closed | 1518 ms | 1739 ms (+15%) | 1567 ms (+3%) | **HPTT 10% faster** |
| mera_open | 935 ms | 1129 ms (+21%) | 1142 ms (+22%) | ~same (+1%) |
| tn_focus | 288 ms | 400 ms (+39%) | 568 ms (+97%) | **src 30% faster** |
| tn_light | 289 ms | 401 ms (+39%) | 560 ms (+94%) | **src 28% faster** |

### opt_size

| Instance | Baseline | force+src | force+HPTT | src vs HPTT |
|---|---|---|---|---|
| gm_queen5_5_3 | 1727 ms | 2471 ms (+43%) | 2290 ms (+33%) | HPTT 7% faster |
| lm_brackets_4_4d | 19 ms | 35 ms (+82%) | 20 ms (+5%) | **HPTT 43% faster** |
| lm_sentence_3_12d | 52 ms | 78 ms (+49%) | 59 ms (+14%) | **HPTT 24% faster** |
| lm_sentence_4_4d | 22 ms | 37 ms (+69%) | 25 ms (+18%) | **HPTT 30% faster** |
| str_matrix_chain_100 | 11 ms | 20 ms (+85%) | 15 ms (+36%) | **HPTT 26% faster** |
| str_mps_varying_200 | 14 ms | 23 ms (+68%) | 13 ms (-3%) | **HPTT 43% faster** |
| mera_closed | 1480 ms | 1496 ms (+1%) | 1322 ms (-11%) | **HPTT 12% faster** |
| mera_open | 934 ms | 1108 ms (+19%) | 1086 ms (+16%) | ~same (-2%) |
| tn_focus | 288 ms | 394 ms (+37%) | 541 ms (+88%) | **src 27% faster** |
| tn_light | 289 ms | 387 ms (+34%) | 532 ms (+84%) | **src 27% faster** |

## Analysis

### HPTT is faster for most workloads

Contrary to the initial assumption that source-stride-order is generally
superior, HPTT outperforms source-order on 7 of the 10 instances (5–43%
faster), ties on `mera_open`, and loses only on `tn_focus` and `tn_light`.
HPTT's cache-blocked 2D transpose tiles give better cache utilization when
the data layout has moderate-to-large contiguous blocks.
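The tiling idea can be illustrated with a minimal cache-blocked 2D transpose. This is only a sketch of the core idea; HPTT's actual kernel adds vectorized micro-tiles, multi-dimensional blocking, and threading.

```rust
const TILE: usize = 32;

// Transpose a row-major `rows x cols` matrix into a row-major
// `cols x rows` destination, processing TILE x TILE blocks so that
// both source reads and destination writes stay cache-resident.
fn blocked_transpose(src: &[f64], dst: &mut [f64], rows: usize, cols: usize) {
    for bi in (0..rows).step_by(TILE) {
        for bj in (0..cols).step_by(TILE) {
            for i in bi..(bi + TILE).min(rows) {
                for j in bj..(bj + TILE).min(cols) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```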

### Source-order wins only for many small binary dimensions

The two instances where source-order is faster — `tn_focus` (316 tensors) and
`tn_light` (415 tensors) — have many binary dimensions (size 2). With ~24
dimensions of size 2, HPTT builds ~15 recursion levels with only 2 iterations
each, and the 2×2 inner tile degenerates. The simple odometer loop of
source-order copy handles this case more efficiently.
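The odometer loop in question can be sketched as a simple scalar copy: walk the source in its own memory order and scatter each element through the destination strides. `odometer_copy` is an illustrative name, not the strided-rs kernel.

```rust
// Copy `src` (row-major, shape `dims`) into `dst` laid out according to
// `dst_strides`. The multi-index `idx` advances like an odometer: the
// last dimension ticks fastest, carrying into the next dimension on wrap.
fn odometer_copy(src: &[f64], dst: &mut [f64], dims: &[usize], dst_strides: &[usize]) {
    let ndim = dims.len();
    let mut idx = vec![0usize; ndim];
    for &v in src {
        let off: usize = idx.iter().zip(dst_strides).map(|(&a, &s)| a * s).sum();
        dst[off] = v;
        // advance the odometer
        for d in (0..ndim).rev() {
            idx[d] += 1;
            if idx[d] < dims[d] {
                break;
            }
            idx[d] = 0; // wrap and carry into dimension d-1
        }
    }
}
```

For shapes with many size-2 dimensions this is just a tight loop with cheap carries, which is why it beats HPTT's degenerate 2×2 tiles there.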

### mera_open: copy strategy is irrelevant

`mera_open` shows essentially no difference between source-order and HPTT
(+1% / -2%, within noise). The 21–22% regression vs baseline is entirely
due to copy elision loss. This confirms that the 26–31% regression in the
eager-HPTT experiment was caused by copy elision, not by HPTT's copy strategy.

### Copy elision remains the dominant optimization

All instances are faster with copy elision enabled (baseline) than with either
forced copy strategy. The biggest gaps are on lm_* and str_* instances (up to
99% regression with forced copies). Copy elision (`try_fuse_group`) should
always be the first priority.

## Conclusions

1. **Copy elision (`try_fuse_group`) is the most important optimization** —
responsible for the majority of performance gains across all instances.

2. **HPTT is the better default copy strategy** when copies cannot be avoided.
It outperforms source-order on most workloads thanks to cache-blocked tiling.

3. **Source-order is better for degenerate cases** with many small dimensions
(size 2), where HPTT's recursion structure becomes overhead-heavy.

4. **The optimal `Contract` implementation should use adaptive copy strategy**:
- Always try copy elision first (`try_fuse_group`)
- Use HPTT for general cases
- Consider source-order for tensors with many small dimensions (heuristic
needed)

## Implications for `Contract` CPU backend

The priority order in `contract-as-core-op.md` should be updated:

1. **Copy elision** (`try_fuse_group`) — dominant optimization, always first
2. **HPTT (destination-stride-order)** — default copy strategy when elision fails
3. **Source-stride-order** — fallback for degenerate many-small-dims cases

A simple heuristic for choosing between HPTT and source-order: if the minimum
dimension size after bilateral fusion is ≤ 2 and the number of fused dimensions
is large (e.g., > 10), prefer source-order. Otherwise, use HPTT.
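As a sketch, that heuristic could look like this. The names are hypothetical; an actual integration into the `Contract` backend would operate on the dimension list produced by bilateral fusion.

```rust
#[derive(Debug, PartialEq)]
enum CopyStrategy {
    Hptt,     // destination-stride-order, cache-blocked
    SrcOrder, // source-stride-order odometer loop
}

// Hypothetical heuristic: prefer source-order only for the degenerate
// many-small-dims shape; default to HPTT everywhere else.
fn choose_copy_strategy(fused_dims: &[usize]) -> CopyStrategy {
    let min_dim = fused_dims.iter().copied().min().unwrap_or(1);
    if min_dim <= 2 && fused_dims.len() > 10 {
        CopyStrategy::SrcOrder
    } else {
        CopyStrategy::Hptt
    }
}
```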