@@ -6,50 +6,62 @@ Cache-efficient tensor permutation / transpose, inspired by
 ## Techniques
 
 1. **Bilateral dimension fusion** -- fuse consecutive dimensions that are
-   contiguous in *both* source and destination stride patterns.
-2. **Cache-aware blocking** -- tile iterations to fit in L1 cache (32 KB).
-3. **Optimal loop ordering** -- place the stride-1 dimension innermost for
-   sequential memory access; sort outer dimensions by descending stride.
-4. **Rank-specialized kernels** -- tight 1D/2D/3D blocked loops with no
-   allocation overhead; generic N-D fallback with pre-allocated odometer.
-5. **Optional Rayon parallelism** (`parallel` feature) -- parallelize the
-   outermost block loop via `rayon::par_iter`.
+   contiguous in *both* source and destination stride patterns
+   (equivalent to HPTT's `fuseIndices`).
+2. **2D micro-kernel transpose** -- 4×4 scalar kernel for f64, 8×8 for f32.
+3. **Macro-kernel blocking** -- BLOCK × BLOCK tile (16 for f64, 32 for f32)
+   processed as a grid of micro-kernel calls, with scalar edge handling.
+4. **Recursive ComputeNode loop nest** -- mirrors HPTT's linked-list loop
+   structure; only stride-1 dims get blocked.
+5. **ConstStride1 fast path** -- when src and dst stride-1 dims coincide,
+   uses memcpy/strided-copy instead of the 2D transpose kernel.
+6. **Optional Rayon parallelism** (`parallel` feature) -- parallelize the
+   outermost ComputeNode dimension via `rayon::par_iter`.
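The bilateral fusion rule above can be sketched as a standalone function. This is an illustrative sketch, not the crate's actual `fuseIndices` port; `fuse_dims` and its signature are hypothetical, and it assumes at least one dimension.

```rust
// Hypothetical sketch of bilateral dimension fusion. Adjacent dims d-1 and d
// are merged only when the outer dim's stride equals size[d] * stride[d] in
// BOTH the source and the destination layout, i.e. the pair is contiguous on
// both sides. Assumes a non-empty dimension list.
fn fuse_dims(
    sizes: &[usize],
    src_strides: &[isize],
    dst_strides: &[isize],
) -> (Vec<usize>, Vec<isize>, Vec<isize>) {
    let (mut sz, mut ss, mut ds) =
        (vec![sizes[0]], vec![src_strides[0]], vec![dst_strides[0]]);
    for d in 1..sizes.len() {
        let last = sz.len() - 1;
        let fusable = ss[last] == src_strides[d] * sizes[d] as isize
            && ds[last] == dst_strides[d] * sizes[d] as isize;
        if fusable {
            // The pair behaves like one dim of the combined size,
            // inheriting the inner (smaller) strides.
            sz[last] *= sizes[d];
            ss[last] = src_strides[d];
            ds[last] = dst_strides[d];
        } else {
            sz.push(sizes[d]);
            ss.push(src_strides[d]);
            ds.push(dst_strides[d]);
        }
    }
    (sz, ss, ds)
}
```

For a fully contiguous identity copy (sizes `[2, 3, 4]`, strides `[12, 4, 1]` on both sides) this collapses everything into a single dim of 24 with stride 1, which a ConstStride1-style path can then handle with a memcpy.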
+
+### TODO
+
+- **SIMD micro-kernels** -- the current scalar 4×4/8×8 kernels rely on LLVM
+  auto-vectorization. Dedicated AVX2/NEON intrinsic kernels could further
+  close the gap with HPTT C++.
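The 2D micro-kernel and macro-kernel blocking described above can be sketched for a square f64 matrix as follows. This is an illustrative standalone version, not the crate's code: `micro_4x4` and `transpose_blocked` are hypothetical names, and the real kernels operate on arbitrary strided layouts rather than a plain `n × n` row-major matrix.

```rust
/// Transpose one 4x4 f64 tile: src has row stride `lds`, dst row stride
/// `ldd`. Plain scalar code that LLVM can auto-vectorize.
fn micro_4x4(src: &[f64], lds: usize, dst: &mut [f64], ldd: usize) {
    for i in 0..4 {
        for j in 0..4 {
            dst[j * ldd + i] = src[i * lds + j];
        }
    }
}

/// Macro-kernel: walk an n x n row-major matrix in BLOCK x BLOCK tiles,
/// each tile processed as a grid of 4x4 micro-kernel calls, with the
/// ragged tile edges handled by a scalar fallback loop.
fn transpose_blocked(src: &[f64], dst: &mut [f64], n: usize) {
    const BLOCK: usize = 16; // f64 tile size from the list above
    for bi in (0..n).step_by(BLOCK) {
        for bj in (0..n).step_by(BLOCK) {
            let bh = (n - bi).min(BLOCK);
            let bw = (n - bj).min(BLOCK);
            let (fh, fw) = (bh / 4 * 4, bw / 4 * 4); // full 4x4 sub-tile area
            for i in (0..fh).step_by(4) {
                for j in (0..fw).step_by(4) {
                    micro_4x4(
                        &src[(bi + i) * n + bj + j..],
                        n,
                        &mut dst[(bj + j) * n + bi + i..],
                        n,
                    );
                }
            }
            // scalar edge handling for the ragged remainder of the tile
            for i in 0..bh {
                for j in 0..bw {
                    if i < fh && j < fw {
                        continue; // already done by the micro-kernel
                    }
                    dst[(bj + j) * n + bi + i] = src[(bi + i) * n + bj + j];
                }
            }
        }
    }
}
```

Blocking keeps both the source rows and the destination columns of a tile resident in L1 while the micro-kernel touches them, which is what separates this from the naive element-by-element loop.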
 
 ## Benchmark Results
 
-Environment: Linux, AMD 64-core server, `RUSTFLAGS="-C target-cpu=native"`.
+Environment: Apple M2, 8 cores, macOS.
 
 All tensors use `f64` (8 bytes). "16M elements" = 128 MB read + 128 MB write.
 
 ### Single-threaded (1T)
 
 | Scenario | strided-perm | naive | Speedup |
 |---|---:|---:|---:|
-| Scattered 24d (16M elems) | 30 ms (9.0 GB/s) | 84 ms (3.2 GB/s) | 2.8x |
-| Contig->contig perm (24d) | 30 ms (8.9 GB/s) | 84 ms (3.2 GB/s) | 2.8x |
-| Small tensor (13d, 8K elems) | 0.023 ms (5.7 GB/s) | 0.039 ms (3.4 GB/s) | 1.7x |
-| 256^3 transpose [2,0,1] | 76 ms (3.6 GB/s) | 73 ms (3.7 GB/s) | ~1x |
-| 256^3 transpose [1,0,2] | 37 ms (7.3 GB/s) | -- | -- |
-| memcpy baseline | 5.8 ms (46 GB/s) | -- | -- |
+| Scattered 24d (16M elems) | 11.0 ms (24 GB/s) | 38 ms (7.0 GB/s) | 3.5x |
+| Contig→contig perm (24d) | 6.0 ms (45 GB/s) | 30 ms (9.1 GB/s) | 5.0x |
+| Small tensor reverse (13d, 8K) | 0.035 ms (3.7 GB/s) | 0.015 ms (8.9 GB/s) | 0.4x |
+| Small tensor cyclic (13d, 8K) | 0.004 ms (29 GB/s) | -- | -- |
+| 256^3 transpose [2,0,1] | 17.1 ms (16 GB/s) | 45 ms (6.0 GB/s) | 2.6x |
+| 256^3 transpose [1,0,2] | 15.0 ms (18 GB/s) | -- | -- |
+| memcpy baseline | 4.5 ms (59 GB/s) | -- | -- |
 
-### Multi-threaded (64T, `parallel` feature)
+### Multi-threaded (8T, `parallel` feature)
 
-| Scenario | 1T | 64T | Speedup |
+| Scenario | 1T | 8T | Speedup |
 |---|---:|---:|---:|
-| Scattered 24d (16M elems) | 30 ms (9.0 GB/s) | 23 ms (11.7 GB/s) | 1.3x |
-| Contig->contig perm (24d) | 30 ms (8.9 GB/s) | 24 ms (11.4 GB/s) | 1.3x |
-| Small tensor (13d, 8K elems) | 0.023 ms | 0.023 ms | 1.0x (below threshold) |
-| 256^3 transpose [2,0,1] | 76 ms (3.6 GB/s) | 4.7 ms (56.8 GB/s) | 16x |
-| 256^3 transpose [1,0,2] | 37 ms (7.3 GB/s) | 4.2 ms (64.1 GB/s) | 8.8x |
+| Scattered 24d (16M elems) | 15.7 ms (17 GB/s) | 7.8 ms (35 GB/s) | 2.0x |
+| Contig→contig perm (24d) | 6.3 ms (43 GB/s) | 6.5 ms (42 GB/s) | ~1x |
+| Small tensor reverse (13d, 8K) | 0.033 ms | 0.033 ms | 1.0x (below threshold) |
+| 256^3 transpose [2,0,1] | 17.0 ms (16 GB/s) | 17.5 ms (15 GB/s) | ~1x |
+| 256^3 transpose [1,0,2] | 15.8 ms (17 GB/s) | 6.3 ms (42 GB/s) | 2.5x |
 
 ### Notes
 
 - **Scattered 24d**: 24 binary dimensions with non-contiguous strides from a
   real tensor-network workload. Parallel improvement is modest because bilateral
   fusion leaves few outer blocks to distribute.
-- **256^3 transpose**: Parallel execution yields dramatic speedup (16x) by
-  exploiting the large L3 cache and memory bandwidth of the 64-core machine.
-  Single-threaded performance is TLB-limited due to stride-65536 access.
+- **Small tensor reverse**: Slower than naive because plan construction overhead
+  dominates at 8K elements. The cyclic permutation fuses to fewer dims and is
+  much faster.
+- **256^3 transpose [2,0,1]**: Parallel speedup is limited because the outermost
+  ComputeNode dimension is small after bilateral fusion.
 - **Small tensor**: Below `MINTHREADLENGTH` (32K elements), the parallel path
   falls back to single-threaded, incurring no overhead.
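The `MINTHREADLENGTH` fallback in the last note amounts to a size cutoff in front of the parallel block loop. A dependency-free sketch is below; it uses `std::thread::scope` in place of the crate's rayon path, and `run_blocks` / `MIN_THREAD_LENGTH` are hypothetical names for illustration only.

```rust
use std::thread;

// 32K-element cutoff quoted in the notes (illustrative constant name).
const MIN_THREAD_LENGTH: usize = 32 * 1024;

/// Apply `f` to each outer block index, going parallel only when the total
/// element count clears the threshold. Below it, the serial loop runs with
/// no thread overhead at all.
fn run_blocks(n_blocks: usize, elems_total: usize, f: impl Fn(usize) + Sync) {
    if elems_total < MIN_THREAD_LENGTH {
        for b in 0..n_blocks {
            f(b); // serial fallback
        }
        return;
    }
    let threads = thread::available_parallelism().map_or(1, |p| p.get());
    thread::scope(|s| {
        for t in 0..threads {
            let f = &f;
            s.spawn(move || {
                // static round-robin split of block indices across threads
                let mut b = t;
                while b < n_blocks {
                    f(b);
                    b += threads;
                }
            });
        }
    });
}
```

This is why the small-tensor rows show exactly 1.0x: the parallel build takes the serial branch, so enabling the `parallel` feature costs nothing for sub-threshold tensors.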
 