Skip to content

Commit 8f3cfee

Browse files
authored
Merge pull request #112 from tensor4all/refactor/hptt-cleanup-and-benchmarks
refactor: rewrite hptt module as 2D micro-kernel architecture
2 parents 329921d + afc65f1 commit 8f3cfee

10 files changed

Lines changed: 1445 additions & 960 deletions

File tree

THIRD-PARTY-LICENSES

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
This file lists third-party works whose algorithms or code influenced this
2+
project. Each entry includes the original license text.
3+
4+
================================================================================
5+
HPTT — High-Performance Tensor Transpose
6+
https://github.com/springer13/hptt
7+
================================================================================
8+
9+
The strided-perm/src/hptt/ module implements an algorithm based on the HPTT
10+
library by Paul Springer, Tong Su, and Paolo Bientinesi. This is an
11+
independent Rust reimplementation; no C++ source code was copied.
12+
13+
Reference:
14+
Paul Springer, Tong Su, and Paolo Bientinesi.
15+
"HPTT: A High-Performance Tensor Transpose C++ Library."
16+
In Proceedings of the 4th ACM SIGPLAN International Workshop on
17+
Libraries, Languages, and Compilers for Array Programming (ARRAY), 2017.
18+
19+
License (BSD-3-Clause):
20+
21+
Copyright 2018 Paul Springer
22+
23+
Redistribution and use in source and binary forms, with or without
24+
modification, are permitted provided that the following conditions are met:
25+
26+
1. Redistributions of source code must retain the above copyright notice,
27+
this list of conditions and the following disclaimer.
28+
29+
2. Redistributions in binary form must reproduce the above copyright notice,
30+
this list of conditions and the following disclaimer in the documentation
31+
and/or other materials provided with the distribution.
32+
33+
3. Neither the name of the copyright holder nor the names of its
34+
contributors may be used to endorse or promote products derived from this
35+
software without specific prior written permission.
36+
37+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
38+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
39+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
40+
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
41+
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
42+
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
43+
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
44+
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
45+
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
46+
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
47+
POSSIBILITY OF SUCH DAMAGE.

coverage-thresholds.json

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,8 @@
11
{
22
"_comment": "Per-file line coverage thresholds (%). Files not listed default to 'default'.",
3-
"default": 80
3+
"default": 80,
4+
"files": {
5+
"strided-perm/src/hptt/execute.rs": 65,
6+
"strided-perm/src/hptt/macro_kernel.rs": 60
7+
}
48
}

strided-perm/README.md

Lines changed: 37 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -6,50 +6,62 @@ Cache-efficient tensor permutation / transpose, inspired by
66
## Techniques
77

88
1. **Bilateral dimension fusion** -- fuse consecutive dimensions that are
9-
contiguous in *both* source and destination stride patterns.
10-
2. **Cache-aware blocking** -- tile iterations to fit in L1 cache (32 KB).
11-
3. **Optimal loop ordering** -- place the stride-1 dimension innermost for
12-
sequential memory access; sort outer dimensions by descending stride.
13-
4. **Rank-specialized kernels** -- tight 1D/2D/3D blocked loops with no
14-
allocation overhead; generic N-D fallback with pre-allocated odometer.
15-
5. **Optional Rayon parallelism** (`parallel` feature) -- parallelize the
16-
outermost block loop via `rayon::par_iter`.
9+
contiguous in *both* source and destination stride patterns
10+
(equivalent to HPTT's `fuseIndices`).
11+
2. **2D micro-kernel transpose** -- 4×4 scalar kernel for f64, 8×8 for f32.
12+
3. **Macro-kernel blocking** -- BLOCK × BLOCK tile (16 for f64, 32 for f32)
13+
processed as a grid of micro-kernel calls, with scalar edge handling.
14+
4. **Recursive ComputeNode loop nest** -- mirrors HPTT's linked-list loop
15+
structure; only stride-1 dims get blocked.
16+
5. **ConstStride1 fast path** -- when src and dst stride-1 dims coincide,
17+
uses memcpy/strided-copy instead of the 2D transpose kernel.
18+
6. **Optional Rayon parallelism** (`parallel` feature) -- parallelize the
19+
outermost ComputeNode dimension via `rayon::par_iter`.
20+
21+
### TODO
22+
23+
- **SIMD micro-kernels** -- the current scalar 4×4/8×8 kernels rely on LLVM
24+
auto-vectorization. Dedicated AVX2/NEON intrinsic kernels could further
25+
close the gap with HPTT C++.
1726

1827
## Benchmark Results
1928

20-
Environment: Linux, AMD 64-core server, `RUSTFLAGS="-C target-cpu=native"`.
29+
Environment: Apple M2, 8 cores, macOS.
2130

2231
All tensors use `f64` (8 bytes). "16M elements" = 128 MB read + 128 MB write.
2332

2433
### Single-threaded (1T)
2534

2635
| Scenario | strided-perm | naive | Speedup |
2736
|---|---:|---:|---:|
28-
| Scattered 24d (16M elems) | 30 ms (9.0 GB/s) | 84 ms (3.2 GB/s) | 2.8x |
29-
| Contig->contig perm (24d) | 30 ms (8.9 GB/s) | 84 ms (3.2 GB/s) | 2.8x |
30-
| Small tensor (13d, 8K elems) | 0.023 ms (5.7 GB/s) | 0.039 ms (3.4 GB/s) | 1.7x |
31-
| 256^3 transpose [2,0,1] | 76 ms (3.6 GB/s) | 73 ms (3.7 GB/s) | ~1x |
32-
| 256^3 transpose [1,0,2] | 37 ms (7.3 GB/s) | -- | -- |
33-
| memcpy baseline | 5.8 ms (46 GB/s) | -- | -- |
37+
| Scattered 24d (16M elems) | 11.0 ms (24 GB/s) | 38 ms (7.0 GB/s) | 3.5x |
38+
| Contig→contig perm (24d) | 6.0 ms (45 GB/s) | 30 ms (9.1 GB/s) | 5.0x |
39+
| Small tensor reverse (13d, 8K) | 0.035 ms (3.7 GB/s) | 0.015 ms (8.9 GB/s) | 0.4x |
40+
| Small tensor cyclic (13d, 8K) | 0.004 ms (29 GB/s) | -- | -- |
41+
| 256^3 transpose [2,0,1] | 17.1 ms (16 GB/s) | 45 ms (6.0 GB/s) | 2.6x |
42+
| 256^3 transpose [1,0,2] | 15.0 ms (18 GB/s) | -- | -- |
43+
| memcpy baseline | 4.5 ms (59 GB/s) | -- | -- |
3444

35-
### Multi-threaded (64T, `parallel` feature)
45+
### Multi-threaded (8T, `parallel` feature)
3646

37-
| Scenario | 1T | 64T | Speedup |
47+
| Scenario | 1T | 8T | Speedup |
3848
|---|---:|---:|---:|
39-
| Scattered 24d (16M elems) | 30 ms (9.0 GB/s) | 23 ms (11.7 GB/s) | 1.3x |
40-
| Contig->contig perm (24d) | 30 ms (8.9 GB/s) | 24 ms (11.4 GB/s) | 1.3x |
41-
| Small tensor (13d, 8K elems) | 0.023 ms | 0.023 ms | 1.0x (below threshold) |
42-
| 256^3 transpose [2,0,1] | 76 ms (3.6 GB/s) | 4.7 ms (56.8 GB/s) | 16x |
43-
| 256^3 transpose [1,0,2] | 37 ms (7.3 GB/s) | 4.2 ms (64.1 GB/s) | 8.8x |
49+
| Scattered 24d (16M elems) | 15.7 ms (17 GB/s) | 7.8 ms (35 GB/s) | 2.0x |
50+
| Contigcontig perm (24d) | 6.3 ms (43 GB/s) | 6.5 ms (42 GB/s) | ~1x |
51+
| Small tensor reverse (13d, 8K) | 0.033 ms | 0.033 ms | 1.0x (below threshold) |
52+
| 256^3 transpose [2,0,1] | 17.0 ms (16 GB/s) | 17.5 ms (15 GB/s) | ~1x |
53+
| 256^3 transpose [1,0,2] | 15.8 ms (17 GB/s) | 6.3 ms (42 GB/s) | 2.5x |
4454

4555
### Notes
4656

4757
- **Scattered 24d**: 24 binary dimensions with non-contiguous strides from a
4858
real tensor-network workload. Parallel improvement is modest because bilateral
4959
fusion leaves few outer blocks to distribute.
50-
- **256^3 transpose**: Parallel execution yields dramatic speedup (16x) by
51-
exploiting the large L3 cache and memory bandwidth of the 64-core machine.
52-
Single-threaded performance is TLB-limited due to stride-65536 access.
60+
- **Small tensor reverse**: Slower than naive because plan construction overhead
61+
dominates at 8K elements. The cyclic permutation fuses to fewer dims and is
62+
much faster.
63+
- **256^3 transpose [2,0,1]**: Parallel speedup is limited because the outermost
64+
ComputeNode dimension is small after bilateral fusion.
5365
- **Small tensor**: Below `MINTHREADLENGTH` (32K elements), the parallel path
5466
falls back to single-threaded, incurring no overhead.
5567

0 commit comments

Comments
 (0)