strided_perm::copy_into is 4x slower than Julia permutedims! for high-rank binary tensors #114
Description
Summary
strided_perm::copy_into is 4x slower than Julia's permutedims! when copying a 24-dimensional binary tensor (all dims = 2, 16M elements) with a scattered permutation.
Benchmark (AMD EPYC 7713P, single thread)
Reproducing step 408 of tensornetwork_permutation_light_415 — the single most expensive step (79% of total time).
| Operation | Rust (`strided_perm`) | Julia (`permutedims!`) |
|---|---|---|
| Copy B (16M f64, scattered 24-dim perm) | 236 ms | 58 ms |
The tensor has 24 dims of size 2 with a fully scattered permutation:

```
dims:    [2, 2, 2, ..., 2]   (24 dims)
strides: [16, 1024, 8388608, 4096, 1048576, 1, 8, 131072, ...]
```
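To make the "fully scattered" structure concrete: because every dim has size 2, each stride is a distinct power of two, `1 << p`, where `p` is the axis position in the source's contiguous layout. A minimal sketch (`strides_for_perm` is a hypothetical helper, not part of the `strided_perm` API):

```rust
// Illustrative only: for an all-binary tensor, the stride of each output
// axis is 1 << p, where p is that axis's position in the contiguous source.
fn strides_for_perm(positions: &[usize]) -> Vec<usize> {
    positions.iter().map(|&p| 1usize << p).collect()
}

fn main() {
    // The first eight strides quoted above correspond to source positions
    // [4, 10, 23, 12, 20, 0, 3, 17]:
    let strides = strides_for_perm(&[4, 10, 23, 12, 20, 0, 3, 17]);
    assert_eq!(
        strides,
        vec![16, 1024, 8388608, 4096, 1048576, 1, 8, 131072]
    );
    println!("{:?}", strides);
}
```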
Reproduction
See micro_bench/step408_bench.rs and micro_bench/step408_fair.jl in the benchmark suite.
Approach: Adopt Julia permutedims! strategy
Julia's Base.permutedims! uses a recursive dimension-stripping approach (see multidimensional.jl). The current strided_perm uses an HPTT-inspired blocked copy with a 4×4 micro-kernel, which is optimized for large dims but carries high per-block overhead for rank-24 tensors where every dim is size 2.
Proposed fix: add a specialized path in strided_perm::copy_into for high-rank, small-dim tensors that mirrors Julia's recursive approach — strip one dimension at a time, recurse on the remaining dims, and choose a cache-friendly traversal order.
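A minimal sketch of the dimension-stripping idea, assuming a dense `&[f64]` source and destination with explicit strides. Names and signatures are illustrative, not the actual `strided_perm` internals; a real implementation would also pick the traversal order for cache locality and flatten the innermost dims into a tight loop:

```rust
// Recursively strip the leading dimension and recurse on the rest.
// When no dims remain, copy a single element at the accumulated offsets.
fn copy_recursive(
    src: &[f64],
    dst: &mut [f64],
    dims: &[usize],
    src_strides: &[usize],
    dst_strides: &[usize],
    src_off: usize,
    dst_off: usize,
) {
    if dims.is_empty() {
        dst[dst_off] = src[src_off];
        return;
    }
    for i in 0..dims[0] {
        copy_recursive(
            src,
            dst,
            &dims[1..],
            &src_strides[1..],
            &dst_strides[1..],
            src_off + i * src_strides[0],
            dst_off + i * dst_strides[0],
        );
    }
}

fn main() {
    // Tiny 2x2x2 example: reverse the axes (a full transpose).
    let src: Vec<f64> = (0..8).map(|x| x as f64).collect();
    let mut dst = vec![0.0; 8];
    let dims = [2, 2, 2];
    let src_strides = [4, 2, 1]; // contiguous source
    let dst_strides = [1, 2, 4]; // axes reversed in the destination
    copy_recursive(&src, &mut dst, &dims, &src_strides, &dst_strides, 0, 0);
    assert_eq!(dst, vec![0.0, 4.0, 2.0, 6.0, 1.0, 5.0, 3.0, 7.0]);
    println!("{:?}", dst);
}
```

The appeal for the rank-24 binary case is that the per-level work is just an add and a two-iteration loop, with none of the blocking bookkeeping that a large-dim micro-kernel needs.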
Context
This is the dominant bottleneck for tensor network contractions with many binary indices. Step 408 alone accounts for 79% of total contraction time. Related: #115, #116.