strided_perm::copy_into is 4x slower than Julia permutedims! for high-rank binary tensors #114
Description
Summary
strided_perm::copy_into is 4x slower than Julia's permutedims! when copying a 24-dimensional binary tensor (all dims = 2, 16M elements) with a scattered permutation.
Benchmark (AMD EPYC 7713P, single thread)
Reproducing step 408 of tensornetwork_permutation_light_415 — the single most expensive step (79% of total time).
| Operation | Rust (`strided_perm`) | Julia (`permutedims!`) |
|---|---|---|
| Copy B (16M f64, scattered 24-dim perm) | 236 ms | 58 ms |
The tensor has 24 dims of size 2 with a fully scattered permutation:

```
dims:    [2, 2, 2, ..., 2]   (24 dims)
strides: [16, 1024, 8388608, 4096, 1048576, 1, 8, 131072, ...]
```
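To make the "fully scattered" structure concrete: because every dim has size 2, each stride is a distinct power of two, `1 << p`, where `p` is the axis position in the source's contiguous layout. A minimal sketch (`strides_for_perm` is a hypothetical helper, not part of the `strided_perm` API):

```rust
// Illustrative only: for an all-binary tensor, the stride of each output
// axis is 1 << p, where p is that axis's position in the contiguous source.
fn strides_for_perm(positions: &[usize]) -> Vec<usize> {
    positions.iter().map(|&p| 1usize << p).collect()
}

fn main() {
    // The first eight strides quoted above correspond to source positions
    // [4, 10, 23, 12, 20, 0, 3, 17]:
    let strides = strides_for_perm(&[4, 10, 23, 12, 20, 0, 3, 17]);
    assert_eq!(
        strides,
        vec![16, 1024, 8388608, 4096, 1048576, 1, 8, 131072]
    );
    println!("{:?}", strides);
}
```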
Reproduction
See micro_bench/step408_bench.rs and micro_bench/step408_fair.jl in the benchmark suite.
Approach: Adopt Julia permutedims! strategy
Julia's Base.permutedims! uses a recursive dimension-stripping approach (see multidimensional.jl). The current strided_perm uses an HPTT-inspired blocked copy with a 4×4 micro-kernel, which is optimized for large dims but carries high per-block overhead for rank-24 tensors where every dim is size 2.
Proposed fix: add a specialized path in strided_perm::copy_into for high-rank, small-dim tensors that mirrors Julia's recursive approach — strip one dimension at a time, recurse on the remaining dims, and choose a cache-friendly traversal order.
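A minimal sketch of the dimension-stripping idea, assuming a dense `&[f64]` source and destination with explicit strides. Names and signatures are illustrative, not the actual `strided_perm` internals; a real implementation would also pick the traversal order for cache locality and flatten the innermost dims into a tight loop:

```rust
// Recursively strip the leading dimension and recurse on the rest.
// When no dims remain, copy a single element at the accumulated offsets.
fn copy_recursive(
    src: &[f64],
    dst: &mut [f64],
    dims: &[usize],
    src_strides: &[usize],
    dst_strides: &[usize],
    src_off: usize,
    dst_off: usize,
) {
    if dims.is_empty() {
        dst[dst_off] = src[src_off];
        return;
    }
    for i in 0..dims[0] {
        copy_recursive(
            src,
            dst,
            &dims[1..],
            &src_strides[1..],
            &dst_strides[1..],
            src_off + i * src_strides[0],
            dst_off + i * dst_strides[0],
        );
    }
}

fn main() {
    // Tiny 2x2x2 example: reverse the axes (a full transpose).
    let src: Vec<f64> = (0..8).map(|x| x as f64).collect();
    let mut dst = vec![0.0; 8];
    let dims = [2, 2, 2];
    let src_strides = [4, 2, 1]; // contiguous source
    let dst_strides = [1, 2, 4]; // axes reversed in the destination
    copy_recursive(&src, &mut dst, &dims, &src_strides, &dst_strides, 0, 0);
    assert_eq!(dst, vec![0.0, 4.0, 2.0, 6.0, 1.0, 5.0, 3.0, 7.0]);
    println!("{:?}", dst);
}
```

The appeal for the rank-24 binary case is that the per-level work is just an add and a two-iteration loop, with none of the blocking bookkeeping that a large-dim micro-kernel needs.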
Context
This is the dominant bottleneck for tensor network contractions with many binary indices. Step 408 alone accounts for 79% of total contraction time. Related: #115, #116.