Benchmark suite for tenferro-rs based on the einsum benchmark (168 standardized einsum problems across 7 categories).
This repository provides:
- A Python pipeline to extract einsum benchmark metadata (shapes, dtypes, contraction paths) into portable JSON
- A Rust benchmark runner using tenferro-einsum
Only metadata is stored — tensors are generated at benchmark time (zero-filled), keeping the repo lightweight.
The structure mirrors strided-rs-benchmark-suite for consistency and cross-comparison.
See tensor4all/tenferro-rs for the full library.
```
tenferro-einsum-benchmark/
├── src/
│   └── main.rs              # Rust benchmark runner (tenferro-einsum)
├── scripts/
│   ├── run_all.sh           # Top-level orchestrator (delegates to run_all_rust.sh)
│   ├── run_all_rust.sh      # Build & run tenferro-einsum + strided-rs (faer)
│   ├── generate_dataset.py  # Filter & export benchmark instances as JSON
│   └── format_results.py    # Parse logs and output markdown tables
├── data/
│   ├── instances/           # Exported JSON metadata (one file per instance)
│   └── results/             # Benchmark logs and markdown results
├── Cargo.toml               # Rust project
└── pyproject.toml           # Python project
```
Requires Python 3.12+ and uv.

```bash
uv sync
```

Requires a local clone of tenferro-rs at `../tenferro-rs`.

```bash
cargo build --release
```

```bash
uv run python scripts/generate_dataset.py
```

This selects instances by category with laptop-scale criteria and saves JSON metadata to `data/instances/`. `rnd_mixed_` instances are excluded (not yet supported by tenferro-einsum).
Selection criteria (per category):
| Category | Prefix | log10(FLOPS) | log2(SIZE) | num_tensors | dtype |
|---|---|---|---|---|---|
| Language model | `lm_` | < 10 | < 25 | ≤ 100 | float64 or complex128 |
| Graphical model | `gm_` | < 10 | < 27 | ≤ 200 | float64 or complex128 |
| Structured | `str_` | < 11 | < 26 | ≤ 200 | float64 or complex128 |
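The criteria above can be expressed as a single filter predicate. This is an illustrative sketch, not the actual `generate_dataset.py` code; the field names (`log10_flops`, `log2_size`, etc.) are assumptions, while the thresholds come from the table.

```python
# Hypothetical sketch of the per-category selection filter.
# Thresholds are from the table above; field names are assumptions.
CRITERIA = {
    "lm_":  {"log10_flops": 10, "log2_size": 25, "num_tensors": 100},
    "gm_":  {"log10_flops": 10, "log2_size": 27, "num_tensors": 200},
    "str_": {"log10_flops": 11, "log2_size": 26, "num_tensors": 200},
}
ALLOWED_DTYPES = {"float64", "complex128"}

def selected(name, log10_flops, log2_size, num_tensors, dtype):
    """Return True if an instance passes the laptop-scale criteria."""
    if dtype not in ALLOWED_DTYPES:
        return False
    for prefix, limits in CRITERIA.items():
        if name.startswith(prefix):
            return (log10_flops < limits["log10_flops"]
                    and log2_size < limits["log2_size"]
                    and num_tensors <= limits["num_tensors"])
    return False  # categories outside the table are not exported
```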
```bash
./scripts/run_all.sh     # 1 thread (default)
./scripts/run_all.sh 4   # 4 threads
```

This script:

- Delegates to `scripts/run_all_rust.sh` (tenferro-einsum + strided-rs faer)
- Sets `OMP_NUM_THREADS` and `RAYON_NUM_THREADS` to the given thread count (default: 1)
- Formats results as a markdown table via `scripts/format_results.py`
- Saves all outputs to `data/results/` with timestamps
Instance JSON files that fail to read or parse are skipped with a warning; the suite continues with the rest. Instances that trigger a backend error are reported as SKIP with the reason on stderr.
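The skip-on-error loading policy can be sketched as follows. This is a Python illustration of the behavior; the actual runner implements it in Rust inside `src/main.rs`.

```python
import json
from pathlib import Path

def load_instances(instance_dir):
    """Load all instance JSON files, skipping unreadable or
    unparsable ones with a warning instead of aborting the suite."""
    instances = []
    for path in sorted(Path(instance_dir).glob("*.json")):
        try:
            with open(path) as f:
                instances.append(json.load(f))
        except (OSError, json.JSONDecodeError) as e:
            print(f"warning: skipping {path.name}: {e}")
    return instances
```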
Requires strided-rs-benchmark-suite at ../strided-rs-benchmark-suite for strided-rs comparison (optional; skipped with a note if not found).
To run the benchmark for one instance only, set the environment variable BENCH_INSTANCE to the instance name:
```bash
BENCH_INSTANCE=gm_queen5_5_3.wcsp cargo run --release
BENCH_INSTANCE=tensornetwork_permutation_light_415 cargo run --release
```

With the full script:

```bash
BENCH_INSTANCE=gm_queen5_5_3.wcsp ./scripts/run_all.sh 1
BENCH_INSTANCE=tensornetwork_permutation_light_415 ./scripts/run_all.sh 4
```

The instance name must match the `name` field in the JSON (i.e. the filename without `.json`). To list available names: `ls data/instances/` → e.g. `gm_queen5_5_3.wcsp.json` → use `gm_queen5_5_3.wcsp`.
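Note that instance names can contain inner dots (e.g. `gm_queen5_5_3.wcsp`), so only the final `.json` suffix is stripped. A minimal sketch of the convention:

```python
from pathlib import Path

def instance_name(json_path):
    """Derive the BENCH_INSTANCE name from an instance file path:
    strip only the trailing .json, keeping any inner dots."""
    return Path(json_path).name.removesuffix(".json")

print(instance_name("data/instances/gm_queen5_5_3.wcsp.json"))
# → gm_queen5_5_3.wcsp
```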
For bottleneck investigation, this repo includes a small binary-only set:
- `bin_matmul_256` (`ij,jk->ik`) — dense matrix multiplication baseline
- `bin_batched_matmul_b32_m64_n64_k64` (`bij,bjk->bik`) — batched GEMM with moderate batch size
- `bin_outer_product_4096` (`i,j->ij`) — outer-product path (sensitive to broadcast/pack overhead)
- `bin_elementwise_mul_2048x2048` (`ij,ij->ij`) — pure elementwise multiplication throughput
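For reference, the equations and operand shapes of the diagnostic cases can be summarized as below. The shapes are inferred from the instance names and should be treated as assumptions; tensors are zero-filled at benchmark time, as elsewhere in the suite.

```python
import math

# Diagnostic binary instances: (einsum equation, input shapes).
# Shapes are inferred from the instance names; treat as assumptions.
BINARY_INSTANCES = {
    "bin_matmul_256":                     ("ij,jk->ik",    [(256, 256), (256, 256)]),
    "bin_batched_matmul_b32_m64_n64_k64": ("bij,bjk->bik", [(32, 64, 64), (32, 64, 64)]),
    "bin_outer_product_4096":             ("i,j->ij",      [(4096,), (4096,)]),
    "bin_elementwise_mul_2048x2048":      ("ij,ij->ij",    [(2048, 2048), (2048, 2048)]),
}

def zero_operands(name):
    """Build zero-filled operands the way the runner does at run time."""
    _, shapes = BINARY_INSTANCES[name]
    return [[0.0] * math.prod(s) for s in shapes]
```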
Recommended invocation for consistent single-thread profiling:
```bash
RAYON_NUM_THREADS=1 OMP_NUM_THREADS=1 \
BENCH_INSTANCE=bin_matmul_256 cargo run --release
```

These are intended for quick, reproducible profiling before running heavier LM/structured/network cases.
To run the suite and format results manually:

```bash
RAYON_NUM_THREADS=1 OMP_NUM_THREADS=1 cargo run --release
uv run python scripts/format_results.py data/results/tenferro_einsum_*.log
```

The `./scripts/run_all.sh` workflow performs the same steps for both backends.
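The `median ± IQR` figures reported in the results tables can be computed with the standard library. This is a sketch of what `format_results.py` does; the exact quantile method used there may differ.

```python
import statistics

def median_iqr(samples_ms):
    """Aggregate a list of per-run timings (ms) as (median, IQR),
    where IQR = Q3 - Q1."""
    med = statistics.median(samples_ms)
    q1, _, q3 = statistics.quantiles(samples_ms, n=4)  # default 'exclusive' method
    return med, q3 - q1
```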
Instances are from the einsum benchmark suite. Selection is per-category (see Export benchmark metadata); dtype is float64 or complex128; tensors are zero-filled at runtime.
| Instance | Category | Tensors | Steps | log10(FLOPS) | log2(SIZE) |
|---|---|---|---|---|---|
| `bin_matmul_256` | Binary (diagnostic) | 2 | 1 | — | — |
| `bin_batched_matmul_b32_m64_n64_k64` | Binary (diagnostic) | 2 | 1 | — | — |
| `bin_outer_product_4096` | Binary (diagnostic) | 2 | 1 | — | — |
| `bin_elementwise_mul_2048x2048` | Binary (diagnostic) | 2 | 1 | — | — |
| `gm_queen5_5_3.wcsp` | Graphical model | 160 | 159 | 9.75 | 26.94 |
| `lm_batch_likelihood_brackets_4_4d` | Language model | 84 | 83 | 8.37 | 18.96 |
| `lm_batch_likelihood_sentence_3_12d` | Language model | 38 | 37 | 9.20 | 20.86 |
| `lm_batch_likelihood_sentence_4_4d` | Language model | 84 | 83 | 8.46 | 18.89 |
| `str_matrix_chain_multiplication_100` | Structured | 100 | 99 | 8.48 | 17.26 |
| `str_mps_varying_inner_product_200` | Structured (MPS) | 200 | 199 | 8.31 | 15.48 |
| `str_nw_mera_closed_120` | Structured (MERA) | 120 | 119 | 10.66 | 25.02 |
| `str_nw_mera_open_26` | Structured (MERA) | 26 | 25 | 10.49 | 25.36 |
| `tensornetwork_permutation_light_415` | Tensor network | 415 | 414 | 9.65 | 24.0 |
| `tensornetwork_permutation_focus_step409_316` | Tensor network (focused) | 316 | 315 | 9.65 | 24.0 |
- Graphical model (gm_*): WCSP / constraint networks; many small 2D factors (e.g. 3×3), full contraction to scalar.
- Language model (lm_*): many small multi-dimensional tensors (3D/4D) with large batch dimensions; many steps with small GEMM kernels.
- Structured — matrix chain (str_matrix_chain_*): large 2D matrices; each step is one large GEMM.
- Structured — MPS (str_mps_*): matrix product state–style networks; varying inner dimensions, many 2D contractions.
- Structured — MERA (str_nw_mera_*): tensor networks from multi-scale entanglement renormalization; many small 3×3-like tensors, heavy contraction.
- Tensor network (tensornetwork_permutation_light_415): 415 tensors extracted from the full TensorNetworkBenchmarks instance via BFS-connected subgraph.
- Tensor network focused (tensornetwork_permutation_focus_step409_316): focused subtree for profiling the late bottleneck steps (408/409); 316 tensors with non-scalar output (rank-18).
Median ± IQR (ms). OMP_NUM_THREADS=1, RAYON_NUM_THREADS=1.
| Instance | tenferro-einsum (ms) | strided-rs faer (ms) |
|---|---|---|
| bin_batched_matmul_b32_m64_n64_k64 | 1.505 ± 0.746 | 1.085 ± 0.366 |
| bin_elementwise_mul_2048x2048 | 1.729 ± 0.106 | 1.315 ± 0.093 |
| bin_matmul_256 | 0.538 ± 0.025 | 0.617 ± 0.016 |
| bin_outer_product_4096 | 4.409 ± 0.880 | 2.114 ± 0.056 |
| gm_queen5_5_3.wcsp | 1366.614 ± 45.159 | 1719.183 ± 10.071 |
| lm_batch_likelihood_brackets_4_4d | 9.831 ± 0.299 | 9.901 ± 0.758 |
| lm_batch_likelihood_sentence_3_12d | 34.342 ± 1.804 | 32.972 ± 0.110 |
| lm_batch_likelihood_sentence_4_4d | 12.180 ± 0.453 | 10.660 ± 0.085 |
| str_matrix_chain_multiplication_100 | 8.580 ± 0.121 | 8.771 ± 0.130 |
| str_mps_varying_inner_product_200 | 7.488 ± 0.166 | 8.781 ± 0.047 |
| str_nw_mera_closed_120 | 860.603 ± 26.606 | 895.693 ± 1.960 |
| str_nw_mera_open_26 | 564.104 ± 13.562 | 553.778 ± 1.552 |
| tensornetwork_permutation_focus_step409_316 | 179.686 ± 1.070 | 165.224 ± 0.533 |
| tensornetwork_permutation_light_415 | 181.119 ± 1.587 | 166.615 ± 0.527 |
Median ± IQR (ms). OMP_NUM_THREADS=1, RAYON_NUM_THREADS=1.
| Instance | tenferro-einsum (ms) | strided-rs faer (ms) |
|---|---|---|
| bin_batched_matmul_b32_m64_n64_k64 | 0.500 ± 0.035 | 0.498 ± 0.001 |
| bin_elementwise_mul_2048x2048 | 1.795 ± 0.164 | 1.214 ± 0.047 |
| bin_matmul_256 | 0.575 ± 0.005 | 0.577 ± 0.017 |
| bin_outer_product_4096 | 3.273 ± 0.080 | 2.119 ± 0.025 |
| gm_queen5_5_3.wcsp | 846.545 ± 7.491 | 667.801 ± 3.780 |
| lm_batch_likelihood_brackets_4_4d | 11.728 ± 0.596 | 10.293 ± 0.056 |
| lm_batch_likelihood_sentence_3_12d | 36.700 ± 3.756 | 34.509 ± 0.129 |
| lm_batch_likelihood_sentence_4_4d | 12.441 ± 0.644 | 11.235 ± 0.051 |
| str_matrix_chain_multiplication_100 | 8.985 ± 0.574 | 8.689 ± 0.044 |
| str_mps_varying_inner_product_200 | 7.365 ± 0.201 | 8.577 ± 0.027 |
| str_nw_mera_closed_120 | 873.335 ± 43.807 | 852.160 ± 16.226 |
| str_nw_mera_open_26 | 567.437 ± 10.611 | 555.803 ± 3.270 |
| tensornetwork_permutation_focus_step409_316 | 182.351 ± 9.294 | 164.994 ± 0.291 |
| tensornetwork_permutation_light_415 | 182.904 ± 1.304 | 166.532 ± 0.352 |
Median ± IQR (ms). OMP_NUM_THREADS=4, RAYON_NUM_THREADS=4.
| Instance | tenferro-einsum (ms) | strided-rs faer (ms) |
|---|---|---|
| bin_batched_matmul_b32_m64_n64_k64 | 1.123 ± 0.336 | 1.113 ± 0.357 |
| bin_elementwise_mul_2048x2048 | 1.597 ± 0.077 | 1.341 ± 0.066 |
| bin_matmul_256 | 0.222 ± 0.028 | 0.200 ± 0.002 |
| bin_outer_product_4096 | 3.388 ± 0.130 | 1.875 ± 0.161 |
| gm_queen5_5_3.wcsp | 1274.833 ± 6.605 | 1828.192 ± 13.416 |
| lm_batch_likelihood_brackets_4_4d | 7.313 ± 0.099 | 8.033 ± 0.078 |
| lm_batch_likelihood_sentence_3_12d | 17.477 ± 1.159 | 17.012 ± 1.098 |
| lm_batch_likelihood_sentence_4_4d | 7.681 ± 0.180 | 8.757 ± 0.171 |
| str_matrix_chain_multiplication_100 | 5.931 ± 0.343 | 6.221 ± 0.234 |
| str_mps_varying_inner_product_200 | 8.081 ± 0.281 | 9.353 ± 0.235 |
| str_nw_mera_closed_120 | 306.992 ± 7.956 | 307.959 ± 3.124 |
| str_nw_mera_open_26 | 196.437 ± 2.539 | 195.246 ± 1.538 |
| tensornetwork_permutation_focus_step409_316 | 87.775 ± 0.415 | 119.616 ± 1.128 |
| tensornetwork_permutation_light_415 | 88.735 ± 2.428 | 121.402 ± 1.925 |
Median ± IQR (ms). OMP_NUM_THREADS=4, RAYON_NUM_THREADS=4.
| Instance | tenferro-einsum (ms) | strided-rs faer (ms) |
|---|---|---|
| bin_batched_matmul_b32_m64_n64_k64 | 0.542 ± 0.008 | 0.508 ± 0.009 |
| bin_elementwise_mul_2048x2048 | 1.549 ± 0.128 | 1.232 ± 0.041 |
| bin_matmul_256 | 0.214 ± 0.078 | 0.236 ± 0.077 |
| bin_outer_product_4096 | 3.106 ± 0.197 | 1.836 ± 0.094 |
| gm_queen5_5_3.wcsp | 443.578 ± 2.611 | 516.313 ± 10.647 |
| lm_batch_likelihood_brackets_4_4d | 7.981 ± 0.308 | 9.226 ± 0.309 |
| lm_batch_likelihood_sentence_3_12d | 18.517 ± 0.992 | 17.840 ± 1.153 |
| lm_batch_likelihood_sentence_4_4d | 7.747 ± 0.170 | 9.661 ± 0.125 |
| str_matrix_chain_multiplication_100 | 6.077 ± 0.357 | 6.223 ± 0.329 |
| str_mps_varying_inner_product_200 | 8.066 ± 0.111 | 9.219 ± 0.106 |
| str_nw_mera_closed_120 | 291.378 ± 2.231 | 312.743 ± 3.406 |
| str_nw_mera_open_26 | 196.197 ± 2.475 | 197.780 ± 1.954 |
| tensornetwork_permutation_focus_step409_316 | 88.133 ± 1.545 | 119.643 ± 1.092 |
| tensornetwork_permutation_light_415 | 89.034 ± 1.891 | 121.721 ± 3.546 |
Notes:
- strided-rs faer uses faer (pure Rust GEMM).
- Both backends use the same pre-computed contraction path for fair comparison.
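Replaying a pre-computed pairwise contraction path can be sketched as below. This is an opt_einsum-style linear-path illustration tracking operand identities only; tenferro-einsum and strided-rs consume the path through their own APIs.

```python
def replay_path(num_tensors, path):
    """Simulate a linear pairwise contraction path: each step contracts
    two operands from the pool, removes them, and appends the result."""
    pool = list(range(num_tensors))  # stand-ins for actual tensors
    next_id = num_tensors
    for i, j in path:
        pool.pop(max(i, j))          # pop larger index first so the
        pool.pop(min(i, j))          # smaller index stays valid
        pool.append(next_id)         # intermediate result of this step
        next_id += 1
    return pool
```

With this convention, an N-tensor network contracting to a single output takes N - 1 steps, matching the Tensors/Steps columns in the instance table (e.g. 160 tensors, 159 steps for `gm_queen5_5_3.wcsp`).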
- Einsum Benchmark — standardized einsum benchmark suite
- ti2-group/einsum_benchmark — Python package
- tensor4all/tenferro-rs — Rust tensor library
- tensor4all/strided-rs-benchmark-suite — sibling benchmark suite for strided-rs
MIT