# strided-rs (HPTT + OpenBLAS) vs OMEinsum.jl Benchmark

File: `docs/hptt-openblas-vs-julia-benchmark.md`

Branch: `perf/src-vs-dst-order`

## Purpose

Verify that strided-rs with HPTT as the default copy strategy (instead of
source-stride-order) remains competitive with OMEinsum.jl. If so, there is no
need for an adaptive copy strategy — HPTT can be the universal default.

## Setup

- **Rust**: strided-opteinsum with `blas` + `hptt-input-copy` features.
Copy elision (`try_fuse_group`) is enabled; when a copy is needed, HPTT
(destination-stride-order) is used. GEMM backend: OpenBLAS 0.3.29.
- **Julia**: OMEinsum.jl v0.9.3 with pre-computed contraction paths
(`omeinsum_path` mode). Julia 1.10.0, BLAS vendor: lbt (OpenBLAS).
- **Hardware**: AMD EPYC 7713P
- **Timing**: median of 15 runs, 3 warmup
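The timing protocol above (3 warmup runs, then the median of 15 timed runs) can be sketched as follows. This is a minimal illustration, not the actual benchmark harness; `median_ms` is a hypothetical helper name.

```rust
use std::time::Instant;

// Run `f` 3 times for warmup, then return the median of 15 timed runs
// in milliseconds. Illustrative sketch of the benchmark methodology.
fn median_ms<F: FnMut()>(mut f: F) -> f64 {
    for _ in 0..3 {
        f(); // warmup: populate caches, trigger lazy init
    }
    let mut samples: Vec<f64> = (0..15)
        .map(|_| {
            let t = Instant::now();
            f();
            t.elapsed().as_secs_f64() * 1e3
        })
        .collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    samples[samples.len() / 2] // 15 samples -> index 7 is the median
}
```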

## Results

### 1T — opt_flops

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 5209 | SKIP | — |
| lm_brackets_4_4d | 19 | 30 | **0.62x** |
| lm_sentence_3_12d | 76 | 80 | **0.95x** |
| lm_sentence_4_4d | 22 | 33 | **0.66x** |
| str_matrix_chain_100 | 14 | 16 | **0.84x** |
| str_mps_varying_200 | 15 | 37 | **0.42x** |
| mera_closed | 1361 | 1386 | **0.98x** |
| mera_open | 880 | 1251 | **0.70x** |
| tn_focus | 455 | 491 | **0.93x** |
| tn_light | 450 | 495 | **0.91x** |

### 1T — opt_size

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 1632 | SKIP | — |
| lm_brackets_4_4d | 20 | 32 | **0.60x** |
| lm_sentence_3_12d | 60 | 92 | **0.66x** |
| lm_sentence_4_4d | 26 | 33 | **0.77x** |
| str_matrix_chain_100 | 12 | 17 | **0.69x** |
| str_mps_varying_200 | 17 | 35 | **0.48x** |
| mera_closed | 1173 | 1286 | **0.91x** |
| mera_open | 914 | 1298 | **0.70x** |
| tn_focus | 449 | 495 | **0.91x** |
| tn_light | 451 | 500 | **0.90x** |

### 4T — opt_flops

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 4092 | SKIP | — |
| lm_brackets_4_4d | 20 | 42 | **0.49x** |
| lm_sentence_3_12d | 53 | 59 | **0.89x** |
| lm_sentence_4_4d | 23 | 46 | **0.51x** |
| str_matrix_chain_100 | 14 | 17 | **0.81x** |
| str_mps_varying_200 | 23 | 59 | **0.39x** |
| mera_closed | 587 | 931 | **0.63x** |
| mera_open | 353 | 798 | **0.44x** |
| tn_focus | 352 | 444 | **0.79x** |
| tn_light | 358 | 446 | **0.80x** |

### 4T — opt_size

| Instance | Rust (ms) | Julia (ms) | Ratio |
|---|---|---|---|
| gm_queen5_5_3 | 1202 | SKIP | — |
| lm_brackets_4_4d | 22 | 40 | **0.55x** |
| lm_sentence_3_12d | 39 | 63 | **0.63x** |
| lm_sentence_4_4d | 29 | 49 | **0.59x** |
| str_matrix_chain_100 | 8 | 16 | **0.48x** |
| str_mps_varying_200 | 20 | 44 | **0.45x** |
| mera_closed | 457 | 704 | **0.65x** |
| mera_open | 356 | 796 | **0.45x** |
| tn_focus | 353 | 457 | **0.77x** |
| tn_light | 356 | 464 | **0.77x** |

## Analysis

Rust with HPTT is **equal to or faster than Julia on every instance**, in both
the 1T and 4T configurations.

- **1T**: Rust is 0.42x–0.98x of Julia (2–58% faster across instances)
- **4T**: Rust is 0.39x–0.89x of Julia (11–61% faster across instances)
- The 4T advantage is larger because strided-rs parallelizes both
permutation copies (via rayon) and GEMM (via OpenBLAS threads), while
Julia's OMEinsum only parallelizes GEMM

Even `tn_focus` and `tn_light` — the instances where HPTT is slower than
source-stride-order in isolation (see `src-vs-dst-order-experiment.md`) —
still outperform Julia. Copy elision (`try_fuse_group`) compensates for
HPTT's overhead on these degenerate many-small-dims cases.

## Conclusion

**HPTT can be the default copy strategy** for the `Contract` CPU backend.
There is no need for an adaptive strategy that switches between HPTT and
source-stride-order based on tensor shape. The combination of copy elision +
HPTT is sufficient to match or exceed Julia's OMEinsum.jl on all tested
workloads.

## Notes

- `gm_queen5_5_3` is skipped by Julia due to a `MethodError` (3D+ array
incompatibility in OMEinsum.jl)
- Julia's IQR is generally wider than Rust's, indicating more run-to-run
  variance (likely due to GC pressure)
# Source-Stride-Order vs Destination-Stride-Order (HPTT) Experiment

File: `docs/src-vs-dst-order-experiment.md`

Branch: `perf/src-vs-dst-order`

## Hypothesis

The eager-HPTT experiment showed a 26–31% regression on `mera_open` when
permutations are eagerly materialized. However, that experiment conflated two
factors: **copy elision** (`try_fuse_group`) and **copy strategy** (source-order
vs destination-order). This experiment isolates the copy strategy factor by
disabling copy elision (`force-copy` feature) and comparing source-stride-order
copy vs HPTT (destination-stride-order) copy.

## Setup

Two feature flags added to `strided-einsum2`:

- `force-copy`: Forces `needs_copy = true` in all `prepare_input_*` and
`prepare_output_*` functions, disabling `try_fuse_group` elision.
- `hptt-input-copy`: Switches `prepare_input_owned` from source-stride-order
copy to HPTT (`strided_kernel::copy_into_col_major`).
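The `force-copy` gating might look roughly like this. This is a hypothetical sketch; the real `prepare_input_*` internals in strided-einsum2 may differ, and `needs_copy` here is an illustrative function, not the actual API.

```rust
// Hypothetical sketch: when the `force-copy` feature is enabled, the
// elision check is bypassed and every input is materialized.
fn needs_copy(fusable_without_copy: bool) -> bool {
    if cfg!(feature = "force-copy") {
        return true; // disable `try_fuse_group` elision entirely
    }
    !fusable_without_copy
}
```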

Three configurations benchmarked:

1. **Baseline**: Default (copy elision enabled, src-order when copy needed)
2. **force-copy + src-order**: Copy elision disabled, source-stride-order copy
3. **force-copy + HPTT**: Copy elision disabled, HPTT destination-order copy

## Results (AMD EPYC 7713P, faer, 1T)

### opt_flops

| Instance | Baseline | force+src | force+HPTT | src vs HPTT |
|---|---|---|---|---|
| gm_queen5_5_3 | 5775 ms | 6400 ms (+11%) | 6052 ms (+5%) | HPTT 5% faster |
| lm_brackets_4_4d | 19 ms | 35 ms (+80%) | 24 ms (+23%) | **HPTT 31% faster** |
| lm_sentence_3_12d | 65 ms | 92 ms (+42%) | 78 ms (+19%) | **HPTT 16% faster** |
| lm_sentence_4_4d | 17 ms | 34 ms (+99%) | 24 ms (+41%) | **HPTT 29% faster** |
| str_matrix_chain_100 | 11 ms | 20 ms (+80%) | 14 ms (+25%) | **HPTT 30% faster** |
| str_mps_varying_200 | 16 ms | 21 ms (+31%) | 16 ms (-3%) | **HPTT 25% faster** |
| mera_closed | 1518 ms | 1739 ms (+15%) | 1567 ms (+3%) | **HPTT 10% faster** |
| mera_open | 935 ms | 1129 ms (+21%) | 1142 ms (+22%) | ~same (+1%) |
| tn_focus | 288 ms | 400 ms (+39%) | 568 ms (+97%) | **src 30% faster** |
| tn_light | 289 ms | 401 ms (+39%) | 560 ms (+94%) | **src 28% faster** |

### opt_size

| Instance | Baseline | force+src | force+HPTT | src vs HPTT |
|---|---|---|---|---|
| gm_queen5_5_3 | 1727 ms | 2471 ms (+43%) | 2290 ms (+33%) | HPTT 7% faster |
| lm_brackets_4_4d | 19 ms | 35 ms (+82%) | 20 ms (+5%) | **HPTT 43% faster** |
| lm_sentence_3_12d | 52 ms | 78 ms (+49%) | 59 ms (+14%) | **HPTT 24% faster** |
| lm_sentence_4_4d | 22 ms | 37 ms (+69%) | 25 ms (+18%) | **HPTT 30% faster** |
| str_matrix_chain_100 | 11 ms | 20 ms (+85%) | 15 ms (+36%) | **HPTT 26% faster** |
| str_mps_varying_200 | 14 ms | 23 ms (+68%) | 13 ms (-3%) | **HPTT 43% faster** |
| mera_closed | 1480 ms | 1496 ms (+1%) | 1322 ms (-11%) | **HPTT 12% faster** |
| mera_open | 934 ms | 1108 ms (+19%) | 1086 ms (+16%) | ~same (-2%) |
| tn_focus | 288 ms | 394 ms (+37%) | 541 ms (+88%) | **src 27% faster** |
| tn_light | 289 ms | 387 ms (+34%) | 532 ms (+84%) | **src 27% faster** |

## Analysis

### HPTT is faster for most workloads

Contrary to the initial assumption that source-stride-order is generally
superior, HPTT outperforms source-order on 7 of the 10 instances (5–43%
faster), ties on `mera_open`, and loses only on `tn_focus` and `tn_light`.
HPTT's cache-blocked 2D transpose tiles give better cache utilization when
the data layout has moderate-to-large contiguous blocks.
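The tiling idea can be illustrated with a minimal cache-blocked 2D transpose. This is only a sketch of the core idea; HPTT's actual kernel adds vectorized micro-tiles, multi-dimensional blocking, and threading.

```rust
const TILE: usize = 32;

// Transpose a row-major `rows x cols` matrix into a row-major
// `cols x rows` destination, processing TILE x TILE blocks so that
// both source reads and destination writes stay cache-resident.
fn blocked_transpose(src: &[f64], dst: &mut [f64], rows: usize, cols: usize) {
    for bi in (0..rows).step_by(TILE) {
        for bj in (0..cols).step_by(TILE) {
            for i in bi..(bi + TILE).min(rows) {
                for j in bj..(bj + TILE).min(cols) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```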

### Source-order wins only for many small binary dimensions

The two instances where source-order is faster — `tn_focus` (316 tensors) and
`tn_light` (415 tensors) — have many binary dimensions (size 2). With ~24
dimensions of size 2, HPTT builds ~15 recursion levels with only 2 iterations
each, and the 2×2 inner tile degenerates. The simple odometer loop of
source-order copy handles this case more efficiently.
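The odometer loop in question can be sketched as a simple scalar copy: walk the source in its own memory order and scatter each element through the destination strides. `odometer_copy` is an illustrative name, not the strided-rs kernel.

```rust
// Copy `src` (row-major, shape `dims`) into `dst` laid out according to
// `dst_strides`. The multi-index `idx` advances like an odometer: the
// last dimension ticks fastest, carrying into the next dimension on wrap.
fn odometer_copy(src: &[f64], dst: &mut [f64], dims: &[usize], dst_strides: &[usize]) {
    let ndim = dims.len();
    let mut idx = vec![0usize; ndim];
    for &v in src {
        let off: usize = idx.iter().zip(dst_strides).map(|(&a, &s)| a * s).sum();
        dst[off] = v;
        // advance the odometer
        for d in (0..ndim).rev() {
            idx[d] += 1;
            if idx[d] < dims[d] {
                break;
            }
            idx[d] = 0; // wrap and carry into dimension d-1
        }
    }
}
```

For shapes with many size-2 dimensions this is just a tight loop with cheap carries, which is why it beats HPTT's degenerate 2×2 tiles there.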

### mera_open: copy strategy is irrelevant

`mera_open` shows essentially no difference between source-order and HPTT
(+1% / -2%, within noise). The 21–22% regression vs baseline is entirely
due to copy elision loss. This confirms that the 26–31% regression in the
eager-HPTT experiment was caused by copy elision, not by HPTT's copy strategy.

### Copy elision remains the dominant optimization

All instances are faster with copy elision enabled (baseline) than with either
forced copy strategy. The biggest gaps are on lm_* and str_* instances (up to
99% regression with forced copies). Copy elision (`try_fuse_group`) should
always be the first priority.

## Conclusions

1. **Copy elision (`try_fuse_group`) is the most important optimization** —
responsible for the majority of performance gains across all instances.

2. **HPTT is the better default copy strategy** when copies cannot be avoided.
It outperforms source-order on most workloads thanks to cache-blocked tiling.

3. **Source-order is better for degenerate cases** with many small dimensions
(size 2), where HPTT's recursion structure becomes overhead-heavy.

4. **The optimal `Contract` implementation should use adaptive copy strategy**:
- Always try copy elision first (`try_fuse_group`)
- Use HPTT for general cases
- Consider source-order for tensors with many small dimensions (heuristic
needed)

## Implications for `Contract` CPU backend

The priority order in `contract-as-core-op.md` should be updated:

1. **Copy elision** (`try_fuse_group`) — dominant optimization, always first
2. **HPTT (destination-stride-order)** — default copy strategy when elision fails
3. **Source-stride-order** — fallback for degenerate many-small-dims cases

A simple heuristic for choosing between HPTT and source-order: if the minimum
dimension size after bilateral fusion is ≤ 2 and the number of fused dimensions
is large (e.g., > 10), prefer source-order. Otherwise, use HPTT.
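As a sketch, that heuristic could look like this. The names are hypothetical; an actual integration into the `Contract` backend would operate on the dimension list produced by bilateral fusion.

```rust
#[derive(Debug, PartialEq)]
enum CopyStrategy {
    Hptt,     // destination-stride-order, cache-blocked
    SrcOrder, // source-stride-order odometer loop
}

// Hypothetical heuristic: prefer source-order only for the degenerate
// many-small-dims shape; default to HPTT everywhere else.
fn choose_copy_strategy(fused_dims: &[usize]) -> CopyStrategy {
    let min_dim = fused_dims.iter().copied().min().unwrap_or(1);
    if min_dim <= 2 && fused_dims.len() > 10 {
        CopyStrategy::SrcOrder
    } else {
        CopyStrategy::Hptt
    }
}
```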