32 commits
2a19506
test(pytorch-profiling): add rocprofv3 validation script for version1
Oct 28, 2025
59ae9ca
docs(pytorch-profiling): document version1 rocprofv3 test results
Oct 28, 2025
993b853
chore(profiling): ignore profiling artifacts and traces
Oct 28, 2025
409ecf3
chore(profiling): exclude TinyTransformer profiling data
Oct 28, 2025
b497c02
docs(tinytransformer): add version comparison analysis
Oct 28, 2025
8b7372f
feat(profiling): add ROCProfiler automation scripts
Oct 28, 2025
c2f5ad8
feat(profiling): add kernel trace analysis script
Oct 28, 2025
a1e2fdc
feat(profiling): add rocprof-compute profiling scripts
Oct 28, 2025
0c14684
feat(profiling): add rocprof-sys profiling scripts
Oct 28, 2025
a167ad0
chore(profiling): exclude inference_benchmark profiling data
Oct 28, 2025
d3a01c0
docs(inference-benchmark): add profiling scripts guide
Oct 28, 2025
1dc7b2e
feat(profiling): add inference_benchmark profiling scripts
Oct 28, 2025
b9ec2a1
feat(profiling): add kernel trace analysis for inference benchmarks
Oct 28, 2025
afc3820
docs(inference-benchmark): document ROCm 6.x/7.x compatibility
Nov 5, 2025
f531e23
feat(profiling): add ROCm 7.x SQLite database analysis tools
Nov 5, 2025
262108f
refactor(tinytransformer): add ROCm version detection to profiling
Nov 5, 2025
8a34687
feat(inference-benchmark): enhance profiling with ROCm 7.x support
Nov 5, 2025
939b675
refactor(pytorch_microbench): rename inference_benchmark directory
Jan 14, 2026
d92db24
chore(pytorch_microbench): remove custom analysis scripts
Jan 14, 2026
9fc76c9
fix(pytorch_microbench): update profiling scripts per PR review
Jan 14, 2026
cf46839
docs(pytorch_microbench): rewrite README following GhostExchange format
Jan 14, 2026
ad8c1dc
docs(pytorch_microbench): add example outputs from RX 7900 XTX profiling
Jan 14, 2026
d8fb3d5
chore(TinyTransformer): remove custom analysis scripts
Jan 14, 2026
009a174
fix(TinyTransformer): update profiling scripts per PR review
Jan 14, 2026
d2846e1
docs(TinyTransformer): rewrite README following GhostExchange format
Jan 14, 2026
ce0440f
docs(tinytransformer): condense markdown documentation
Jan 14, 2026
db47525
chore(TinyTransformer): remove custom analysis scripts from version2
Jan 14, 2026
afbbe47
refactor(TinyTransformer): update version2 to follow GhostExchange fo…
Jan 14, 2026
c8810cf
chore(TinyTransformer): remove custom analysis scripts from version3
Jan 14, 2026
5795e40
refactor(TinyTransformer): update version3 to follow GhostExchange fo…
Jan 14, 2026
19aed1b
chore(TinyTransformer): remove custom analysis scripts from version4
Jan 14, 2026
84e7019
refactor(TinyTransformer): update version4 to follow GhostExchange fo…
Jan 14, 2026
15 changes: 15 additions & 0 deletions .gitignore
@@ -52,3 +52,18 @@ Mkfile.old
dkms.conf
__pycache__/
*.pyc

# Profiling artifacts
*.pftrace
rocprof_*/
pytorch_profiles/

# Downloaded datasets
MLExamples/PyTorch_Profiling/data/

# Generated profiling traces
MLExamples/PyTorch_Profiling/rocprofv3/single_process/
MLExamples/TinyTransformer/*/counters/
MLExamples/TinyTransformer/*/traces/
MLExamples/TinyTransformer/*/github_issue_test/
MLExamples/inference_benchmark/profiling_results/
227 changes: 227 additions & 0 deletions MLExamples/TinyTransformer/VERSION_COMPARISON.md
@@ -0,0 +1,227 @@
# Version 1 vs Version 2 vs Version 3 vs Version 4 Profiling Comparison

## Executive Summary

All four versions profile successfully with rocprofv3. GitHub issue #1386 ("no device activity") does not reproduce with ROCm 6.4.4 on the RX 7900 XTX.

**Key Finding**: Both version3 (Triton custom kernels) and version4 (PyTorch SDPA + Triton) achieve **4.4x speedup** over version1 baseline, with similar performance characteristics. Version2 (PyTorch fusion) provides minimal gains.

## Test Configuration

- **GPU**: AMD Radeon RX 7900 XTX (gfx1100)
- **ROCm**: 6.4.4
- **Profiler**: rocprofv3
- **Test parameters**: batch-size 8, seq-len 128, num-steps 10
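
For reference, the kernel traces compared below come from rocprofv3's kernel-trace mode. A minimal capture sketch, assuming rocprofv3 is on `PATH`; flag spellings may vary between ROCm releases, and `train.py` is a placeholder for the per-version entry point:

```python
# Hedged sketch of the kernel-trace capture used for this comparison.
# Assumptions: rocprofv3 is on PATH, flag spellings match this ROCm release,
# and "train.py" stands in for the per-version training script.
import subprocess

cmd = [
    "rocprofv3",
    "--kernel-trace",            # record every kernel dispatch
    "--output-format", "csv",    # emit the *_kernel_trace.csv files analyzed below
    "-d", "counters",            # output directory (assumption)
    "--",
    "python", "train.py",
    "--batch-size", "8",
    "--seq-len", "128",
    "--num-steps", "10",
]
subprocess.run(cmd, check=True)
```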

## Profiling Results Comparison

### Trace File Sizes (Runtime Trace)

| Version | Trace Size | Result |
|---------|-----------|---------|
| Version 1 | 44 MB | Success - full device activity captured |
| Version 2 | 41 MB | Success - full device activity captured |
| Version 3 | Not tested | Kernel trace tested instead (3.0 MB) |
| Version 4 | 9.7 MB | Success - full device activity captured |

### Kernel Trace Analysis

| Metric | Version 1 | Version 2 | Version 3 | Version 4 | V3/V4 vs V1 |
|--------|-----------|-----------|-----------|-----------|-------------|
| Total kernel dispatches | 22,284 | 22,479 | 4,727 | 5,493 | -75.4% to -78.8% |
| Unique kernel types | 64 | 55 | 32 | 33 | -48.4% to -50.0% |
| Total GPU time | 346.21 ms | 378.06 ms | 104.49 ms | 103.36 ms | -70.1% to -69.8% |
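
The aggregates above are reductions over the per-dispatch kernel trace CSVs listed under "Files Generated". A minimal pandas sketch of that reduction; the column names (`Kernel_Name`, `Start_Timestamp`, `End_Timestamp`) and nanosecond timestamps are assumptions that may differ between ROCm releases:

```python
# Sketch: summarize a rocprofv3 kernel trace CSV.
# Assumed columns: Kernel_Name, Start_Timestamp, End_Timestamp (nanoseconds);
# exact names may differ between ROCm releases.
import pandas as pd

path = ("version1_pytorch_baseline/counters/counter_20251028_164804/"
        "1f81e102abe6/9544_kernel_trace.csv")
df = pd.read_csv(path)

# Per-dispatch duration in microseconds.
df["duration_us"] = (df["End_Timestamp"] - df["Start_Timestamp"]) / 1_000

print("Total kernel dispatches:", len(df))
print("Unique kernel types:    ", df["Kernel_Name"].nunique())
print("Total GPU time (ms):    ", round(df["duration_us"].sum() / 1_000, 2))

# Top kernels by accumulated GPU time (total us, call count, average us).
top = (df.groupby("Kernel_Name")["duration_us"]
         .agg(total="sum", calls="count", avg="mean")
         .sort_values("total", ascending=False)
         .head(3))
print(top)
```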

### Top 3 Kernels by GPU Time

#### Version 1 (PyTorch Baseline)

1. **GEMM kernel** (Cijk_Alik_Bljk...): 30,658 us (127.74 us avg) - 240 calls
2. **GEMM kernel** (Cijk_Ailk_Bljk...): 29,954 us (124.81 us avg) - 240 calls
3. **GEMM kernel** (Cijk_Alik_Bljk...): 26,641 us (74.00 us avg) - 360 calls

**Total top 3**: 87,253 us (25.2% of total GPU time)

#### Version 2 (PyTorch Fused)

1. **GEMM kernel** (Cijk_Ailk_Bljk...): 54,678 us (455.65 us avg) - 120 calls
2. **GEMM kernel** (Cijk_Alik_Bljk...): 25,482 us (212.35 us avg) - 120 calls
3. **bwd_kernel_fuse**: 24,814 us (206.78 us avg) - 120 calls

**Total top 3**: 104,974 us (27.8% of total GPU time)

#### Version 3 (Triton Custom Kernels)

1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,710 us (123.79 us avg) - 240 calls
2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,442 us (79.01 us avg) - 360 calls
3. **flash_attention_kernel**: 15,557 us (129.64 us avg) - 120 calls

**Total top 3**: 73,709 us (70.5% of total GPU time)

**Note**: Version3's top 3 kernels account for 70.5% of GPU time vs 25-28% in V1/V2, showing much better kernel concentration.

#### Version 4 (PyTorch SDPA + Triton)

1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,641 us (123.50 us avg) - 240 calls
2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,320 us (78.67 us avg) - 360 calls
3. **attn_fwd** (PyTorch SDPA): 13,045 us (108.71 us avg) - 120 calls

**Total top 3**: 71,006 us (68.7% of total GPU time)

**Note**: Version4 uses PyTorch SDPA (`attn_fwd`) instead of custom flash attention, but achieves similar performance to version3.

### Key Observations

1. **Version3 and Version4 achieve similar performance through different approaches**:
- **Version3**: Custom Triton kernels (`flash_attention_kernel`, `rmsnorm_kernel`)
- **Version4**: PyTorch SDPA (`attn_fwd`) with Triton fallbacks
- Both: 75-79% fewer kernel dispatches than version1
- Both: ~50% fewer unique kernel types than version1
- V3 flash attention: 15,557 us (129.64 us avg)
- V4 SDPA attention: 13,045 us (108.71 us avg) - slightly faster!

2. **Version2 fused kernels**:
- `bwd_kernel_fuse` (24,814 us total) - backward pass fusion
- `attn_fwd` (12,639 us total) - attention forward fusion
- These are custom fused operations not present in version1
- 14.1% fewer unique kernel types than version1
- Marginal end-to-end impact (total GPU time is higher than version1; throughput only ~3% better)

3. **Performance progression**:
- Version1: Many small kernels, high launch overhead
- Version2: Some fusion, but still many PyTorch framework kernels
- Version3: Aggressive fusion with custom Triton kernels
- 69.8% reduction in GPU time vs version1
- 72.4% reduction in GPU time vs version2
- 78.8% fewer kernel launches vs version1

4. **Memory efficiency**:
- Version1: 434.3 MB peak memory
- Version2: 434.3 MB peak memory
- Version3: 193.8 MB peak memory (55.4% reduction)
- Triton kernels use significantly less memory

5. **Profiler functionality**:
- rocprofv3 successfully captures all GPU activity on all four versions
- No "no device activity" issue observed
- GitHub issue #1386 likely fixed in ROCm 6.4.4

## Performance Comparison

### Throughput

| Version | Samples/sec | Tokens/sec | Speedup vs V1 |
|---------|-------------|------------|---------------|
| Version 1 | 240.6 | 30,803 | 1.00x (baseline) |
| Version 2 | 247.4 | 31,672 | 1.03x |
| Version 3 | 1,054.8 | 135,014 | **4.38x** |
| Version 4 | 1,054.5 | 134,972 | **4.38x** |

Version3 and Version4 both achieve **4.38x speedup** over version1 and **4.26x speedup** over version2.
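
The tokens/sec and speedup columns follow directly from samples/sec and the seq-len 128 test configuration; small differences from the table come from rounding of the reported samples/sec. A quick arithmetic check:

```python
# Quick consistency check of the throughput table (seq-len 128 from the test config).
seq_len = 128
samples_per_sec = {"v1": 240.6, "v2": 247.4, "v3": 1054.8, "v4": 1054.5}

for name, sps in samples_per_sec.items():
    tokens_per_sec = sps * seq_len             # ~30,800 (v1) ... ~135,000 (v3/v4)
    speedup = sps / samples_per_sec["v1"]      # 1.00x, 1.03x, 4.38x, 4.38x
    print(f"{name}: {tokens_per_sec:,.0f} tokens/sec, {speedup:.2f}x vs v1")
```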

### Batch Processing Time

| Version | Average Batch Time | Speedup vs V1 |
|---------|-------------------|---------------|
| Version 1 | 33.3 ms | 1.00x (baseline) |
| Version 2 | 32.3 ms | 1.03x |
| Version 3 | 7.5 ms | **4.44x** |
| Version 4 | 7.6 ms | **4.38x** |

### Memory Usage

| Version | Peak Memory | Reduction vs V1 |
|---------|-------------|-----------------|
| Version 1 | 434.3 MB | baseline |
| Version 2 | 434.3 MB | 0% |
| Version 3 | 193.8 MB | **55.4%** |
| Version 4 | 193.9 MB | **55.3%** |

Version3 and Version4 both use less than half the memory of version1/version2.
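
How the peak-memory figures are obtained is not shown here; a plausible sketch is PyTorch's allocator statistics, which on ROCm are still exposed through the `torch.cuda` namespace. The `run_training_steps` call below is a hypothetical placeholder:

```python
# Sketch: measuring peak GPU memory around the measured steps.
# PyTorch on ROCm exposes HIP devices through the torch.cuda namespace.
import torch

torch.cuda.reset_peak_memory_stats()
# run_training_steps(model, batches)   # hypothetical placeholder for the workload
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"Peak memory: {peak_mb:.1f} MB")  # e.g. 434.3 MB (v1/v2) vs ~194 MB (v3/v4)
```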

## Fusion Impact Analysis

### Version2 (PyTorch Fused)

Version2 reports these fusion optimizations available:
- QKV Fusion: Available but not active in this run
- Flash Attention: Available but not active in this run
- SwiGLU Fusion: Available but not active in this run
- Torch Compile: Available but failed to activate

The fused kernels observed (`bwd_kernel_fuse`, `attn_fwd`) suggest some fusion is occurring despite the "not active" status. This may be a reporting issue in the code.

**Verdict**: Version2 fusion provides minimal benefit (3% speedup) and may have reporting issues.

### Version3 (Triton Custom Kernels)

Version3 reports active Triton optimizations:
- RMSNorm Kernel: ACTIVE - Fused variance + normalization (1,167 us total, 4.58 us avg)
- Flash Attention Kernel: ACTIVE - Memory-efficient attention (15,557 us total, 129.64 us avg)
- SwiGLU Kernel: ACTIVE (not visible in top kernels, likely very fast)

**Verdict**: Version3 Triton kernels deliver massive performance gains (4.38x speedup) with proper kernel fusion and optimization.
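
For illustration, a fused RMSNorm of the kind version3 reports can be expressed as a single Triton kernel that computes the variance and the normalization in one pass per row, with no intermediate tensors. This is a generic sketch, not version3's actual `rmsnorm_kernel`:

```python
# Illustrative fused RMSNorm in Triton (one program per row); a generic sketch,
# not the actual version3 kernel. Assumes x is contiguous with shape (..., n_cols).
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # Fused: mean of squares and normalization in a single kernel launch.
    inv_rms = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    y = (x * inv_rms * w).to(out_ptr.dtype.element_ty)
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    out = torch.empty_like(x)
    n_cols = x.shape[-1]
    n_rows = x.numel() // n_cols
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps,
                              BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return out
```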

### Version4 (PyTorch SDPA + Triton)

Version4 uses PyTorch's Scaled Dot Product Attention (SDPA) with Triton fallbacks:
- **attn_fwd** (PyTorch SDPA): 13,045 us total, 108.71 us avg
- Slightly faster than V3's custom flash attention (15,557 us)
- Leverages PyTorch's optimized SDPA implementation
- Custom Triton kernels for other operations (RMSNorm, SwiGLU likely present but not in top kernels)
- 16% more kernel dispatches than V3 (5,493 vs 4,727)
- One additional unique kernel type (33 vs 32)

**Verdict**: Version4 achieves identical performance to version3 (4.38x speedup) using PyTorch SDPA instead of custom flash attention. PyTorch SDPA is actually slightly more efficient for attention, but V4 has slightly more overhead elsewhere.
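
At the module level, the attention path that produces the `attn_fwd` dispatches above is a single call into PyTorch's SDPA API; a generic sketch (not version4's exact code), with shapes matching the test configuration:

```python
# Sketch: PyTorch SDPA as used conceptually by version4 (generic, not the exact code).
# Shapes follow the test configuration: batch 8, seq-len 128; heads/head_dim assumed.
import torch
import torch.nn.functional as F

q = torch.randn(8, 8, 128, 64, device="cuda", dtype=torch.float16)  # (B, H, S, D)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch selects a fused attention backend (flash / memory-efficient / math) at runtime.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```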

## Conclusion

1. **rocprofv3 works correctly** on all four versions with ROCm 6.4.4
2. **No reproduction of GitHub issue #1386** - all versions show full device activity

3. **Version3 and Version4 are equivalent winners**:
- Both: **4.38x faster** than version1 baseline
- Both: **4.26x faster** than version2
- Both: **~55% less memory** usage
- Both: **~75-79% fewer** kernel dispatches
- Both: **~70% reduction** in GPU time
- V3 uses custom flash attention, V4 uses PyTorch SDPA
- V4's SDPA is slightly faster (13.0 ms vs 15.6 ms) but has slightly more overhead elsewhere

4. **Version2 provides minimal gains**:
- Only 3% faster than version1
- Same memory usage as version1
- Some fusion, but not well optimized
- May have reporting issues with fusion flags

5. **Performance progression summary**:
- V1 baseline: 240.6 samples/sec, 346 ms GPU time, 434 MB memory
- V2 fused: 247.4 samples/sec, 378 ms GPU time, 434 MB memory (marginal improvement)
- V3 custom Triton: 1,054.8 samples/sec, 104 ms GPU time, 194 MB memory (massive improvement)
- V4 PyTorch SDPA: 1,054.5 samples/sec, 103 ms GPU time, 194 MB memory (equivalent to V3)

6. **Key takeaways**:
- Custom Triton kernels (V3) deliver transformational performance that PyTorch-level fusion (V2) cannot match
- PyTorch SDPA (V4) provides a practical alternative to custom flash attention without sacrificing performance
- For production use, V4 may be preferable because it relies on PyTorch's maintained SDPA implementation
- For maximum control and customization, V3's fully custom Triton approach is ideal

## Files Generated

### Version 1
- Runtime trace: `version1_pytorch_baseline/traces/trace_*/`
- Kernel trace: `version1_pytorch_baseline/counters/counter_20251028_164804/1f81e102abe6/9544_kernel_trace.csv` (11.6 MB)

### Version 2
- Runtime trace: `version2_pytorch_fused/traces/trace_20251028_170752/` (41 MB)
- Runtime trace (50 steps): `version2_pytorch_fused/github_issue_test/test_20251028_172311/` (149 MB)
- Kernel trace: `version2_pytorch_fused/counters/counter_20251028_172429/1f81e102abe6/17496_kernel_trace.csv` (10.8 MB)

### Version 3
- Kernel trace: `version3_triton/counters/counter_20251028_173451/1f81e102abe6/20129_kernel_trace.csv` (3.0 MB)
- Much smaller trace file due to 78.8% fewer kernel dispatches

### Version 4
- Runtime trace: `version4_pytorch_sdpa/traces/trace_20251028_174853/` (9.7 MB)
- Kernel trace: `version4_pytorch_sdpa/counters/counter_20251028_174948/1f81e102abe6/23175_kernel_trace.csv` (3.3 MB)
- Similar trace sizes to version3