strided_perm::copy_into is 4x slower than Julia permutedims! for high-rank binary tensors #114

@shinaoka

Description

Summary

strided_perm::copy_into is 4x slower than Julia's permutedims! when copying a 24-dimensional binary tensor (all dims = 2, 16M elements) with a scattered permutation.

Benchmark (AMD EPYC 7713P, single thread)

Reproducing step 408 of tensornetwork_permutation_light_415 — the single most expensive step (79% of total time).

| Operation | Rust (strided_perm) | Julia (permutedims!) |
| --- | --- | --- |
| Copy B (16M f64, scattered 24-dim perm) | 236 ms | 58 ms |

The tensor has 24 dims of size 2 with a fully scattered permutation:

dims:    [2, 2, 2, ..., 2]  (24 dims)
strides: [16, 1024, 8388608, 4096, 1048576, 1, 8, 131072, ...]
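For reference, the scattered strides above are just the row-major strides of the contiguous source tensor reordered by the permutation. A minimal sketch (the helper name `permuted_strides` is hypothetical, not part of strided_perm's API):

```rust
// Hypothetical helper: compute row-major (C-order) strides for `dims`,
// then reorder them so that output axis k carries the stride of source
// axis perm[k].
fn permuted_strides(dims: &[usize], perm: &[usize]) -> Vec<usize> {
    let n = dims.len();
    if n == 0 {
        return Vec::new();
    }
    // Row-major strides of the contiguous source tensor.
    let mut strides = vec![1usize; n];
    for i in (0..n - 1).rev() {
        strides[i] = strides[i + 1] * dims[i + 1];
    }
    // Output axis k reads from source axis perm[k].
    perm.iter().map(|&p| strides[p]).collect()
}

fn main() {
    // All-2 dims as in the issue; the permutation below is illustrative,
    // not the actual step-408 permutation.
    let dims = vec![2usize; 24];
    let perm: Vec<usize> = (0..24).rev().collect(); // full axis reversal
    let strides = permuted_strides(&dims, &perm);
    println!("{:?}", &strides[..4]); // [1, 2, 4, 8]
}
```

With all dims equal to 2, every stride is a power of two, which is why the list above is a shuffle of 2^0 through 2^23.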

Reproduction

See micro_bench/step408_bench.rs and micro_bench/step408_fair.jl in the benchmark suite.

Approach: Adopt Julia permutedims! strategy

Julia's Base.permutedims! uses a recursive dimension-stripping approach (see multidimensional.jl). The current strided_perm uses an HPTT-inspired blocked copy with a 4×4 micro-kernel, which is optimized for large dimensions but incurs high per-call overhead for rank-24 tensors whose dimensions are all size 2.

Proposed fix: Add a specialized path in strided_perm::copy_into for high-rank, small-dim tensors that mirrors Julia's recursive approach — strip one dimension at a time, recurse on the remaining dimensions, and traverse them in a cache-friendly order.
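A minimal sketch of the dimension-stripping recursion, modeled on Julia's permutedims! strategy. The function name and signature are illustrative, not the actual strided_perm internals, and it omits the cache-friendly traversal-order heuristics a real implementation would need:

```rust
// Sketch: peel off the leading output dimension, recurse on the rest,
// and copy a single element at the base of the recursion.
fn copy_recursive(
    src: &[f64],
    dst: &mut [f64],
    dims: &[usize],        // output dims, slowest-varying first
    src_strides: &[usize], // source strides, in output-axis order
    dst_strides: &[usize], // destination strides, in output-axis order
    src_off: usize,
    dst_off: usize,
) {
    match dims {
        // Base case: all dimensions stripped, copy one element.
        [] => dst[dst_off] = src[src_off],
        // Strip the leading dimension and recurse on the remainder.
        [d, rest @ ..] => {
            for i in 0..*d {
                copy_recursive(
                    src,
                    dst,
                    rest,
                    &src_strides[1..],
                    &dst_strides[1..],
                    src_off + i * src_strides[0],
                    dst_off + i * dst_strides[0],
                );
            }
        }
    }
}

fn main() {
    // 2x2 transpose as a smoke test: B[i, j] = A[j, i].
    let a = vec![1.0, 2.0, 3.0, 4.0]; // row-major [[1, 2], [3, 4]]
    let mut b = vec![0.0; 4];
    // Output axis 0 reads source axis 1 (stride 1); axis 1 reads axis 0 (stride 2).
    copy_recursive(&a, &mut b, &[2, 2], &[1, 2], &[2, 1], 0, 0);
    assert_eq!(b, vec![1.0, 3.0, 2.0, 4.0]);
}
```

For all-size-2 dims the loop body runs exactly twice per level, so the overhead per level is tiny compared to setting up a blocked 4×4 micro-kernel; ordering the axes so the innermost recursion touches the smallest strides keeps the traversal cache-friendly.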

Context

This is the dominant bottleneck for tensor network contractions with many binary indices. Step 408 alone accounts for 79% of total contraction time. Related: #115, #116.
