diff --git a/.gitignore b/.gitignore index 00a5516d..28d7588b 100644 --- a/.gitignore +++ b/.gitignore @@ -52,3 +52,18 @@ Mkfile.old dkms.conf __pycache__/ *.pyc + +# Profiling artifacts +*.pftrace +rocprof_*/ +pytorch_profiles/ + +# Downloaded datasets +MLExamples/PyTorch_Profiling/data/ + +# Generated profiling traces +MLExamples/PyTorch_Profiling/rocprofv3/single_process/ +MLExamples/TinyTransformer/*/counters/ +MLExamples/TinyTransformer/*/traces/ +MLExamples/TinyTransformer/*/github_issue_test/ +MLExamples/inference_benchmark/profiling_results/ diff --git a/MLExamples/TinyTransformer/VERSION_COMPARISON.md b/MLExamples/TinyTransformer/VERSION_COMPARISON.md new file mode 100644 index 00000000..bb6cfc4d --- /dev/null +++ b/MLExamples/TinyTransformer/VERSION_COMPARISON.md @@ -0,0 +1,227 @@ +# Version 1 vs Version 2 vs Version 3 vs Version 4 Profiling Comparison + +## Executive Summary + +All four versions successfully profile with rocprofv3. The GitHub issue #1386 "no device activity" does not reproduce with ROCm 6.4.4 on RX 7900 XTX. + +**Key Finding**: Both version3 (Triton custom kernels) and version4 (PyTorch SDPA + Triton) achieve **4.4x speedup** over version1 baseline, with similar performance characteristics. Version2 (PyTorch fusion) provides minimal gains. + +## Test Configuration + +- **GPU**: AMD Radeon RX 7900 XTX (gfx1100) +- **ROCm**: 6.4.4 +- **Profiler**: rocprofv3 +- **Test parameters**: batch-size 8, seq-len 128, num-steps 10 + +## Profiling Results Comparison + +### Trace File Sizes (Runtime Trace) + +| Version | Trace Size | Result | +|---------|-----------|---------| +| Version 1 | 44 MB | Success - full device activity captured | +| Version 2 | 41 MB | Success - full device activity captured | +| Version 3 | Not tested | Kernel trace tested instead (3.0 MB) | +| Version 4 | 9.7 MB | Success - full device activity captured | + +### Kernel Trace Analysis + +| Metric | Version 1 | Version 2 | Version 3 | Version 4 | V3/V4 vs V1 | +|--------|-----------|-----------|-----------|-----------|-------------| +| Total kernel dispatches | 22,284 | 22,479 | 4,727 | 5,493 | -76.3% to -78.8% | +| Unique kernel types | 64 | 55 | 32 | 33 | -48.4% to -50.0% | +| Total GPU time | 346.21 ms | 378.06 ms | 104.49 ms | 103.36 ms | -70.1% to -69.8% | + +### Top 3 Kernels by GPU Time + +#### Version 1 (PyTorch Baseline) + +1. **GEMM kernel** (Cijk_Alik_Bljk...): 30,658 us (127.74 us avg) - 240 calls +2. **GEMM kernel** (Cijk_Ailk_Bljk...): 29,954 us (124.81 us avg) - 240 calls +3. **GEMM kernel** (Cijk_Alik_Bljk...): 26,641 us (74.00 us avg) - 360 calls + +**Total top 3**: 87,253 us (25.2% of total GPU time) + +#### Version 2 (PyTorch Fused) + +1. **GEMM kernel** (Cijk_Ailk_Bljk...): 54,678 us (455.65 us avg) - 120 calls +2. **GEMM kernel** (Cijk_Alik_Bljk...): 25,482 us (212.35 us avg) - 120 calls +3. **bwd_kernel_fuse**: 24,814 us (206.78 us avg) - 120 calls + +**Total top 3**: 104,974 us (27.8% of total GPU time) + +#### Version 3 (Triton Custom Kernels) + +1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,710 us (123.79 us avg) - 240 calls +2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,442 us (79.01 us avg) - 360 calls +3. **flash_attention_kernel**: 15,557 us (129.64 us avg) - 120 calls + +**Total top 3**: 73,709 us (70.5% of total GPU time) + +**Note**: Version3's top 3 kernels account for 70.5% of GPU time vs 25-28% in V1/V2, showing much better kernel concentration. + +#### Version 4 (PyTorch SDPA + Triton) + +1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,641 us (123.50 us avg) - 240 calls +2. 
**GEMM kernel** (Cijk_Alik_Bljk...): 28,320 us (78.67 us avg) - 360 calls +3. **attn_fwd** (PyTorch SDPA): 13,045 us (108.71 us avg) - 120 calls + +**Total top 3**: 71,006 us (68.7% of total GPU time) + +**Note**: Version4 uses PyTorch SDPA (`attn_fwd`) instead of custom flash attention, but achieves similar performance to version3. + +### Key Observations + +1. **Version3 and Version4 achieve similar performance through different approaches**: + - **Version3**: Custom Triton kernels (`flash_attention_kernel`, `rmsnorm_kernel`) + - **Version4**: PyTorch SDPA (`attn_fwd`) with Triton fallbacks + - Both: 78-76% fewer kernel dispatches than version1 + - Both: ~50% fewer unique kernel types than version1 + - V3 flash attention: 15,557 us (129.64 us avg) + - V4 SDPA attention: 13,045 us (108.71 us avg) - slightly faster! + +2. **Version2 fused kernels**: + - `bwd_kernel_fuse` (24,814 us total) - backward pass fusion + - `attn_fwd` (12,639 us total) - attention forward fusion + - These are custom fused operations not present in version1 + - 14.1% fewer unique kernel types than version1 + - Marginal performance impact (slightly slower) + +3. **Performance progression**: + - Version1: Many small kernels, high launch overhead + - Version2: Some fusion, but still many PyTorch framework kernels + - Version3: Aggressive fusion with custom Triton kernels + - 69.8% reduction in GPU time vs version1 + - 72.4% reduction in GPU time vs version2 + - 78.8% fewer kernel launches vs version1 + +4. **Memory efficiency**: + - Version1: 434.3 MB peak memory + - Version2: 434.3 MB peak memory + - Version3: 193.8 MB peak memory (55.4% reduction) + - Triton kernels use significantly less memory + +5. **Profiler functionality**: + - rocprofv3 successfully captures all GPU activity on all three versions + - No "no device activity" issue observed + - GitHub issue #1386 likely fixed in ROCm 6.4.4 + +## Performance Comparison + +### Throughput + +| Version | Samples/sec | Tokens/sec | Speedup vs V1 | +|---------|-------------|------------|---------------| +| Version 1 | 240.6 | 30,803 | 1.00x (baseline) | +| Version 2 | 247.4 | 31,672 | 1.03x | +| Version 3 | 1,054.8 | 135,014 | **4.38x** | +| Version 4 | 1,054.5 | 134,972 | **4.38x** | + +Version3 and Version4 both achieve **4.38x speedup** over version1 and **4.26x speedup** over version2. + +### Batch Processing Time + +| Version | Average Batch Time | Speedup vs V1 | +|---------|-------------------|---------------| +| Version 1 | 33.3 ms | 1.00x (baseline) | +| Version 2 | 32.3 ms | 1.03x | +| Version 3 | 7.5 ms | **4.44x** | +| Version 4 | 7.6 ms | **4.38x** | + +### Memory Usage + +| Version | Peak Memory | Reduction vs V1 | +|---------|-------------|-----------------| +| Version 1 | 434.3 MB | baseline | +| Version 2 | 434.3 MB | 0% | +| Version 3 | 193.8 MB | **55.4%** | +| Version 4 | 193.9 MB | **55.3%** | + +Version3 and Version4 both use less than half the memory of version1/version2. + +## Fusion Impact Analysis + +### Version2 (PyTorch Fused) + +Version2 reports these fusion optimizations available: +- QKV Fusion: Available but not active in this run +- Flash Attention: Available but not active in this run +- SwiGLU Fusion: Available but not active in this run +- Torch Compile: Available but failed to activate + +The fused kernels observed (`bwd_kernel_fuse`, `attn_fwd`) suggest some fusion is occurring despite the "not active" status. This may be a reporting issue in the code. 
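One way to check which fused kernels actually launched is to summarize the rocprofv3 kernel trace CSV directly. The snippet below is a minimal sketch, not part of the repository: the trace path is taken from the Files Generated section, while the column names (`Kernel_Name`, `Start_Timestamp`, `End_Timestamp`) and nanosecond timestamps are assumptions that may need adjusting to the rocprofv3 version in use.

```python
# Sketch: tally fused-kernel activity in a rocprofv3 kernel trace CSV.
# Assumed columns: Kernel_Name, Start_Timestamp, End_Timestamp (nanoseconds);
# adjust names/units to match your rocprofv3 output.
import pandas as pd

trace = pd.read_csv(
    "version2_pytorch_fused/counters/counter_20251028_172429/"
    "1f81e102abe6/17496_kernel_trace.csv"
)
trace["duration_us"] = (trace["End_Timestamp"] - trace["Start_Timestamp"]) / 1_000.0

# Kernels whose names suggest fusion (bwd_kernel_fuse, attn_fwd, ...)
fused = trace[trace["Kernel_Name"].str.contains("fuse|attn_fwd", case=False, na=False)]
summary = (
    fused.groupby("Kernel_Name")["duration_us"]
    .agg(calls="count", total_us="sum", avg_us="mean")
    .sort_values("total_us", ascending=False)
)
print(summary)
```

A non-empty summary confirms the fused paths are running even when the training script reports them as "not active".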
+ +**Verdict**: Version2 fusion provides minimal benefit (3% speedup) and may have reporting issues. + +### Version3 (Triton Custom Kernels) + +Version3 reports active Triton optimizations: +- RMSNorm Kernel: ACTIVE - Fused variance + normalization (1,167 us total, 4.58 us avg) +- Flash Attention Kernel: ACTIVE - Memory-efficient attention (15,557 us total, 129.64 us avg) +- SwiGLU Kernel: ACTIVE (not visible in top kernels, likely very fast) + +**Verdict**: Version3 Triton kernels deliver massive performance gains (4.38x speedup) with proper kernel fusion and optimization. + +### Version4 (PyTorch SDPA + Triton) + +Version4 uses PyTorch's Scaled Dot Product Attention (SDPA) with Triton fallbacks: +- **attn_fwd** (PyTorch SDPA): 13,045 us total, 108.71 us avg + - Slightly faster than V3's custom flash attention (15,557 us) + - Leverages PyTorch's optimized SDPA implementation +- Custom Triton kernels for other operations (RMSNorm, SwiGLU likely present but not in top kernels) +- 16% more kernel dispatches than V3 (5,493 vs 4,727) +- One additional unique kernel type (33 vs 32) + +**Verdict**: Version4 achieves identical performance to version3 (4.38x speedup) using PyTorch SDPA instead of custom flash attention. PyTorch SDPA is actually slightly more efficient for attention, but V4 has slightly more overhead elsewhere. + +## Conclusion + +1. **rocprofv3 works correctly** on all four versions with ROCm 6.4.4 +2. **No reproduction of GitHub issue #1386** - all versions show full device activity + +3. **Version3 and Version4 are equivalent winners**: + - Both: **4.38x faster** than version1 baseline + - Both: **4.26x faster** than version2 + - Both: **~55% less memory** usage + - Both: **~77-79% fewer** kernel dispatches + - Both: **~70% reduction** in GPU time + - V3 uses custom flash attention, V4 uses PyTorch SDPA + - V4's SDPA is slightly faster (13.0 ms vs 15.6 ms) but has slightly more overhead elsewhere + +4. **Version2 provides minimal gains**: + - Only 3% faster than version1 + - Same memory usage as version1 + - Some fusion, but not well optimized + - May have reporting issues with fusion flags + +5. **Performance progression summary**: + - V1 baseline: 240.6 samples/sec, 346 ms GPU time, 434 MB memory + - V2 fused: 247.4 samples/sec, 378 ms GPU time, 434 MB memory (marginal improvement) + - V3 custom Triton: 1,054.8 samples/sec, 104 ms GPU time, 194 MB memory (massive improvement) + - V4 PyTorch SDPA: 1,054.5 samples/sec, 103 ms GPU time, 194 MB memory (equivalent to V3) + +6. 
**Key takeaways**: + - Custom Triton kernels (V3) deliver transformational performance that PyTorch-level fusion (V2) cannot match + - PyTorch SDPA (V4) provides a practical alternative to custom flash attention without sacrificing performance + - For production use, V4 may be preferable due to reliance on PyTorch's maintained SDPA implementation + - For maximum control and customization, V3's fully custom Triton approach is ideal + +## Files Generated + +### Version 1 +- Runtime trace: `version1_pytorch_baseline/traces/trace_*/` +- Kernel trace: `version1_pytorch_baseline/counters/counter_20251028_164804/1f81e102abe6/9544_kernel_trace.csv` (11.6 MB) + +### Version 2 +- Runtime trace: `version2_pytorch_fused/traces/trace_20251028_170752/` (41 MB) +- Runtime trace (50 steps): `version2_pytorch_fused/github_issue_test/test_20251028_172311/` (149 MB) +- Kernel trace: `version2_pytorch_fused/counters/counter_20251028_172429/1f81e102abe6/17496_kernel_trace.csv` (10.8 MB) + +### Version 3 +- Kernel trace: `version3_triton/counters/counter_20251028_173451/1f81e102abe6/20129_kernel_trace.csv` (3.0 MB) +- Much smaller trace file due to 78.8% fewer kernel dispatches + +### Version 4 +- Runtime trace: `version4_pytorch_sdpa/traces/trace_20251028_174853/` (9.7 MB) +- Kernel trace: `version4_pytorch_sdpa/counters/counter_20251028_174948/1f81e102abe6/23175_kernel_trace.csv` (3.3 MB) +- Similar trace sizes to version3 diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/IMPORTTIME_PROFILING.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/IMPORTTIME_PROFILING.md index 2b5cea5b..0eb11830 100644 --- a/MLExamples/TinyTransformer/version1_pytorch_baseline/IMPORTTIME_PROFILING.md +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/IMPORTTIME_PROFILING.md @@ -1,25 +1,20 @@ +# Python Import Time Profiling -## Python Import Time Profiling +IMPORTTIME_PROFILING.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository. -`IMPORTTIME_PROFILING.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository +## Overview -### Overview +The `python -X importtime` flag provides detailed timing information about module imports during Python script execution. This is useful for identifying slow imports that can impact startup time. -The `python -X importtime` flag provides detailed timing information about module imports during Python script execution. This is useful for identifying slow imports that can impact startup time and overall application performance. +## Basic Usage -### Basic Usage - -```bash +``` python -X importtime script.py ``` -This outputs a hierarchical tree showing: +This outputs a hierarchical tree showing import time for each module in microseconds. -- Import time for each module -- Cumulative time including sub-imports -- Self time (time spent in the module itself) - -### Output Format +## Output Format ``` import time: self [us] | cumulative | imported package @@ -34,232 +29,53 @@ import time: 1521 | 2865 | encodings - **cumulative**: Total time including all sub-imports (microseconds) - **imported package**: Module name with indentation showing import hierarchy -### Example: Profiling TinyLlama V1 - -#### Basic Import Analysis - -```bash -python -X importtime tiny_llama_v1.py 2> import_times.txt -``` - -This redirects the import timing output (stderr) to a file for analysis. 
+## Example: Profiling TinyLlama V1 -#### Analyzing PyTorch Import Time +Redirect import timing output to a file for analysis: -```bash -python -X importtime -c "import torch" 2>&1 | grep -E "torch|time:" -``` - -Expected output shows PyTorch's heavy import cost: -``` -import time: 1234567 | 1234567 | torch ``` - -#### Analyzing DeepSpeed Import Time - -```bash -python -X importtime -c "import deepspeed" 2>&1 | grep -E "deepspeed|time:" -``` - -### Common Import Time Bottlenecks in AI Workloads - -#### 1. PyTorch (torch) - -- Typical import time: 500ms - 2000ms -- Loads CUDA/ROCm libraries -- Initializes operator registry -- Sets up autograd engine - -#### 2. Transformers Library - -- Typical import time: 300ms - 1000ms -- Loads tokenizers -- Registers model architectures -- Initializes configuration classes - -#### 3. DeepSpeed - -- Typical import time: 200ms - 800ms -- Loads distributed training components -- Initializes optimization kernels -- Sets up communication backends - -#### 4. NumPy/SciPy - -- Typical import time: 50ms - 200ms -- Loads optimized BLAS/LAPACK libraries -- Initializes array operations - -### Best Practices - -#### 1. Lazy Imports -Move imports inside functions for code that's not always executed: - -```python -def run_with_profiler(): - # Only import when profiler is actually used - from torch.profiler import profile, ProfilerActivity - ... +python -X importtime tiny_llama_v1.py 2> import_times.txt ``` -#### 2. Conditional Imports -Import heavy dependencies only when needed: +Analyze PyTorch import time: -```python -if args.enable_profiler: - import deepspeed.profiling.flops_profiler as fp ``` - -#### 3. Import Grouping -Organize imports by load time to understand startup cost: - -```python -# Fast imports -import os -import sys -import argparse - -# Medium imports -import numpy as np -import pandas as pd - -# Heavy imports (consider lazy loading) -import torch -import deepspeed +python -X importtime -c "import torch" 2>&1 | grep -E "torch|time:" ``` -### Optimization Techniques - -#### 1. Module-Level Import Caching -Python caches imports in `sys.modules`, so subsequent imports are fast: +## Common Import Time Bottlenecks -```python -import torch # Slow first time -import torch # Fast - already cached -``` +| Package | Typical Import Time | Notes | +|---------|-------------------|-------| +| PyTorch (torch) | 500ms - 2000ms | Loads CUDA/ROCm libraries, operator registry | +| Transformers | 300ms - 1000ms | Loads tokenizers, model architectures | +| DeepSpeed | 200ms - 800ms | Distributed training components | +| NumPy/SciPy | 50ms - 200ms | Optimized BLAS/LAPACK libraries | -#### 2. Using `__import__()` for Dynamic Imports -For plugins or optional features: +## Generate Import Time Report -```python -def load_profiler(profiler_type): - if profiler_type == "pytorch": - torch_prof = __import__("torch.profiler", fromlist=["profile"]) - return torch_prof ``` - -#### 3. Parallel Import Loading -Not natively supported, but can structure code to minimize import depth. 
- -### Analyzing Import Time Results - -#### Generate Report -```bash python -X importtime tiny_llama_v1.py 2>&1 | \ grep "import time:" | \ sort -k3 -n -r | \ head -20 > top_imports.txt ``` -#### Parse with Script -```python -import re -import sys - -with open('import_times.txt', 'r') as f: - for line in f: - match = re.search(r'import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(.+)', line) - if match: - self_time = int(match.group(1)) - cumulative = int(match.group(2)) - module = match.group(3).strip() - if cumulative > 100000: # > 100ms - print(f"{module}: {cumulative/1000:.2f}ms") -``` - -### ROCm/PyTorch Specific Considerations - -#### HIP Runtime Loading -ROCm's HIP runtime can add significant import overhead: -- libamdhip64.so loading -- GPU device detection -- Architecture-specific kernel initialization - -#### Environment Variables Impact -These can affect import time: -```bash -# Reduce logging overhead during import -AMD_LOG_LEVEL=0 MIOPEN_LOG_LEVEL=0 python -X importtime script.py - -# Skip GPU initialization during import analysis -HIP_VISIBLE_DEVICES=-1 python -X importtime script.py -``` - -### Integration with Other Profiling Tools +## ROCm/PyTorch Considerations -#### Combine with cProfile -```bash -# First check import time -python -X importtime script.py 2> imports.txt +Reduce logging overhead during import analysis: -# Then profile runtime -python -m cProfile -o profile.stats script.py ``` - -#### Combine with PyTorch Profiler -```python -# Fast startup with lazy imports -def main(): - import torch - from torch.profiler import profile - - # Your training code here - ... - -if __name__ == "__main__": - main() +AMD_LOG_LEVEL=0 MIOPEN_LOG_LEVEL=0 python -X importtime script.py ``` -### Example Analysis for Version 1 - -#### Expected Import Hierarchy +Skip GPU initialization during import analysis: ``` -import time: self [us] | cumulative | imported package -import time: 2341 | 2341 | _frozen_importlib_external -import time: 850000 | 850000 | torch # Dominant cost -import time: 120000 | 120000 | torch.nn -import time: 45000 | 45000 | torch.optim -import time: 23000 | 23000 | apex.normalization.fused_layer_norm -import time: 18000 | 18000 | apex.transformer.functional.fused_rope -import time: 8000 | 8000 | argparse -import time: 3500 | 3500 | json +HIP_VISIBLE_DEVICES=-1 python -X importtime script.py ``` -#### Interpreting Results - -- **torch**: Largest import cost (850ms typical) -- **torch.nn**: Additional overhead for neural network modules -- **apex**: NVIDIA optimizations (ROCm compatible) -- Standard library imports (argparse, json): Negligible cost +## Additional Resources -### When to Use Import Time Profiling - -1. **Debugging slow script startup**: Identify which imports are causing delays -2. **Optimizing CLI tools**: Reduce time-to-first-output for user experience -3. **Container startup optimization**: Minimize cold-start latency -4. **CI/CD pipeline optimization**: Reduce test suite initialization time - -### Limitations - -- Does not profile runtime execution (use cProfile or PyTorch Profiler for that) -- Import time varies based on system load and cold vs. 
warm cache -- First import after system reboot will be slower due to OS page cache - -### References - -- [PEP 565 - Show DeprecationWarning in __main__](https://www.python.org/dev/peps/pep-0565/) - [Python -X Options Documentation](https://docs.python.org/3/using/cmdline.html#id5) - [PyTorch Performance Tuning Guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html) - - diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md index 59d84818..b35025e7 100644 --- a/MLExamples/TinyTransformer/version1_pytorch_baseline/PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md @@ -1,2367 +1,199 @@ -# Tiny LLaMA PyTorch Baseline - Profiling Workshop -## Complete Hands-On Walkthrough Guide +# Tiny LLaMA PyTorch Baseline - Workshop Walkthrough ---- +PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository. -## Important Note +This walkthrough demonstrates profiling techniques for transformer training workloads using Tiny LLaMA V1 as the baseline model. -**The performance numbers and metrics shown throughout this workshop are representative examples and were collected on specific hardware configurations.** Your actual results will differ based on: +## Prerequisites -- GPU model (e.g., MI250X, MI300X, MI325X) -- ROCm version -- PyTorch version -- System configuration (CPU, memory, drivers) -- Current GPU utilization and temperature +- ROCm installation with rocprofv3 +- PyTorch with ROCm support +- DeepSpeed (optional, for FLOPS profiling) -**Focus on the relative improvements and optimization techniques** demonstrated in each exercise rather than matching the exact numbers shown. The methodologies and analysis approaches are applicable across different hardware platforms. +## Environment Verification ---- +Check ROCm installation: -## Notation and Variables - -Throughout this workshop, we use the following notation for tensor dimensions and model parameters: - -**Tensor Dimensions:** -- **B** = Batch size (number of samples processed together) -- **S** = Sequence length (number of tokens in each sequence) -- **D** = Hidden dimension / Model dimension (size of hidden representations) -- **H** = Number of attention heads -- **head_dim** = Dimension per attention head (typically D / H) - -**Model Parameters:** -- **D_ff** = Feed-forward network intermediate dimension -- **V** = Vocabulary size (number of unique tokens) -- **L** = Number of transformer layers - -**Performance Metrics:** -- **FLOPS** = Floating Point Operations Per Second -- **MFU** = Model FLOPS Utilization (% of theoretical peak achieved) -- **TFLOPS** = Tera-FLOPS (10^12 floating point operations per second) -- **GFLOPS** = Giga-FLOPS (10^9 floating point operations per second) - -**Complexity Notation:** -- **O(S)** = Linear complexity with sequence length -- **O(S^2)** = Quadratic complexity with sequence length -- **O(B × S × D)** = Complexity grows with batch, sequence, and dimension - -**Example Tensor Shapes:** ``` -Input tensor: [B, S, D] e.g., [8, 128, 256] -Attention weights: [B, H, S, S] e.g., [8, 8, 128, 128] -Query/Key/Value: [B, H, S, head_dim] e.g., [8, 8, 128, 32] -FFN intermediate: [B, S, D_ff] e.g., [8, 128, 512] -``` - ---- - -## Table of Contents - -1. 
[Introduction & Setup](#1-introduction--setup) -2. [Understanding Tiny LLaMA Architecture](#2-understanding-tiny-llama-architecture) -3. [Understanding the Baseline Implementation](#3-understanding-the-baseline-implementation) -4. [Exercise 1: Baseline Performance Analysis](#4-exercise-1-baseline-performance-analysis) -5. [Exercise 2: Memory Analysis & Optimization](#5-exercise-2-memory-analysis--optimization) -6. [Exercise 3: Performance Study Across Problem Sizes](#6-exercise-3-performance-study-across-problem-sizes) - ---- - -## 1. Introduction & Setup - -### 1.1 What is LLM Training? - -**Large Language Model (LLM) Training** involves teaching neural networks to understand and generate human language through iterative optimization of model parameters. - -**Key Differences: Training vs Inference** - -| Aspect | Training | Inference | -|--------|----------|-----------| -| **Purpose** | Learn patterns from data | Make predictions | -| **Direction** | Forward + Backward pass | Forward pass only | -| **Gradients** | Required and computed | Not required | -| **Batch Size** | Typically larger (8-64) | Often smaller (1-32) | -| **Performance Goal** | Samples/sec + FLOPS efficiency | Latency + throughput | -| **Memory Usage** | Very high (activations + gradients) | Lower (no gradient storage) | -| **Optimization Focus** | Throughput, MFU, memory efficiency | Latency, batch throughput | - -**Why Profile LLM Training?** - -- Understand computational bottlenecks -- Optimize hardware utilization (Model FLOPS Utilization - MFU) -- Reduce training costs -- Identify memory inefficiencies -- Guide optimization decisions -- Establish baseline for improvements - -### 1.2 Workshop Goals - -By the end of this workshop, you will be able to: - -- Configure and run deterministic PyTorch LLM training -- Use PyTorch Profiler for detailed operator-level analysis -- Integrate DeepSpeed FLOPS profiler for computational efficiency metrics -- Interpret profiling results and identify performance bottlenecks -- Understand memory usage patterns in transformer training -- Analyze attention mechanisms and FFN performance -- Calculate Model FLOPS Utilization (MFU) -- Establish baseline performance metrics for optimization comparison - -### 1.3 Understanding Key Metrics - -Before diving into exercises, let's understand the metrics we'll be measuring: - -#### Training Speed (samples/sec) -- **What:** Number of training samples processed per second -- **Higher is better** -- **Typical range:** 50-200 samples/sec for small models on single GPU -- **Formula:** `(batch_size × num_steps) / total_time` - -#### FLOPS (Floating Point Operations Per Second) -- **What:** Computational throughput -- **Higher is better** -- **Units:** TFLOPS (TeraFLOPS, 10^12 operations/second) -- **Theoretical Peak:** Hardware maximum (e.g., MI250X: ~95 TFLOPS FP32, ~190 TFLOPS FP16) - -#### Model FLOPS Utilization (MFU) -- **What:** Percentage of theoretical peak FLOPS achieved -- **Formula:** `(Achieved FLOPS / Theoretical Peak FLOPS) × 100%` -- **Typical ranges:** - - 20-30%: Baseline PyTorch (memory-bound) - - 40-50%: Well-optimized (compute-bound) - - 60%+: Highly optimized (kernel fusion, Flash Attention) - -#### Memory Usage (GB) -- **What:** GPU memory consumed -- **Components:** Model weights + optimizer states + activations + gradients -- **Lower is better** (allows larger batches) - -#### GPU Utilization (%) -- **What:** Percentage of GPU compute units in use -- **Higher is better** (approaching 100%) -- **Low utilization indicates:** 
Memory bottlenecks, CPU bottlenecks, or small workloads - -### 1.4 Environment Verification - -Let's verify your system is ready for the workshop. - -#### Step 1: Check ROCm Installation - -```bash -# Check if ROCm is installed rocminfo | grep "Name:" ``` -**Expected Output:** -``` - Name: gfx90a - Name: AMD Instinct MI250X -``` +Check GPU status: -**If you see an error:** -```bash -# Check if ROCm is installed -which rocminfo - -# If not found, ROCm is not installed -# Contact your system administrator ``` - -#### Step 2: Check GPU Visibility - -```bash -# Check GPU status rocm-smi ``` -**Expected Output:** -``` -GPU[0] : GPU ID: 0 -GPU[0] : GPU Name: AMD Instinct MI250X -GPU[0] : Temperature: 35.0°C -GPU[0] : GPU Memory Usage: 512 MB / 65536 MB -GPU[0] : GPU Utilization: 0% -``` - -**Common Issues:** +Verify PyTorch with ROCm: -**Error: "Unable to detect any GPUs"** -```bash -# Check permissions -sudo usermod -aG video $USER -sudo usermod -aG render $USER - -# Logout and login again -# Then retry: rocm-smi ``` - -#### Step 3: Check PyTorch + ROCm - -```bash -# Test PyTorch with ROCm python3 -c " import torch print(f'PyTorch Version: {torch.__version__}') print(f'CUDA Available: {torch.cuda.is_available()}') if torch.cuda.is_available(): print(f'GPU Name: {torch.cuda.get_device_name(0)}') - print(f'GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB') -else: - print('ERROR: No GPU detected!') " ``` -**Expected Output:** -``` -PyTorch Version: 2.7.1+rocm6.4.4 -CUDA Available: True -GPU Name: AMD Instinct MI250X -GPU Memory: 65.5 GB -``` - -**Common Issues:** - -**Error: "ModuleNotFoundError: No module named 'torch'"** -```bash -# Install PyTorch with ROCm support -pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2 -``` +## Model Overview -**Error: "CUDA Available: False"** -```bash -# Check if ROCm-enabled PyTorch is installed -python3 -c "import torch; print(torch.__version__)" +Tiny LLaMA is a scaled-down transformer decoder with configurable parameters: -# Should show something like: 2.7.1+rocm6.4.4 -# If it shows 2.7.1+cpu, you have CPU-only PyTorch +| Parameter | Default | Description | +|-----------|---------|-------------| +| hidden_dim | 256 | Model dimension | +| n_layers | 4 | Transformer layers | +| n_heads | 8 | Attention heads | +| intermediate_dim | 512 | FFN intermediate dimension | +| vocab_size | 1000 | Vocabulary size | -# Reinstall with ROCm support -pip uninstall torch torchvision torchaudio -pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2 -``` +Default model size: ~2.9M parameters (~11 MB FP32) -#### Step 4: Check DeepSpeed (Optional but Recommended) +## Running the Baseline -```bash -# Check if DeepSpeed is installed -python3 -c "import deepspeed; print(f'DeepSpeed Version: {deepspeed.__version__}')" -``` +Quick validation: -**Expected Output:** ``` -DeepSpeed Version: 0.12.6 -``` - -**If not installed:** -```bash -# Install DeepSpeed -pip install deepspeed +python3 tiny_llama_v1.py --batch-size 4 --seq-len 64 --num-steps 5 ``` -#### Step 5: Navigate to Workshop Directory - -```bash -# Navigate to version1_pytorch_baseline directory -cd ~/castille-ai-workshop-training/version1_pytorch_baseline/ - -# List files -ls -la -``` +Standard training run: -**Expected Output:** ``` --rw-rw-r-- tiny_llama_v1.py --rw-rw-r-- run_pytorch_profiler.py --rw-rw-r-- run_deepspeed_flops.py --rw-rw-r-- README.md --rwxrwxr-x run_baseline.sh --rwxrwxr-x run_pytorch_profiler.sh 
--rwxrwxr-x run_deepspeed_flops.sh -drwxrwxr-x exercises/ +python3 tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -#### Step 6: Quick Test Run - -Let's verify everything works with a very small test: - -```bash -# Run a tiny test (should complete in ~1-2 minutes) -python3 tiny_llama_v1.py --batch-size 4 --seq-len 64 --num-steps 5 -``` +Expected output: -**Expected Output:** ``` ========================================== Tiny LLaMA V1 - PyTorch Baseline ========================================== Configuration: - Batch Size: 4 - Sequence Length: 64 - Number of Steps: 5 - Hidden Dim: 256 - Num Layers: 4 - Num Heads: 8 - -Initializing model... -Model parameters: 2.3M + Batch Size: 8 + Sequence Length: 128 + Number of Steps: 20 + ... Starting training... -Step 1/5: Loss = 6.9088, Time = 0.235 seconds -Step 2/5: Loss = 6.9076, Time = 0.045 seconds -Step 3/5: Loss = 6.9065, Time = 0.044 seconds -Step 4/5: Loss = 6.9054, Time = 0.043 seconds -Step 5/5: Loss = 6.9042, Time = 0.043 seconds +Step 1/20: Loss = 6.9088, Time = 0.234 seconds +Step 2/20: Loss = 6.9076, Time = 0.046 seconds +... +Step 20/20: Loss = 6.8821, Time = 0.044 seconds ========================================== Performance Summary: ========================================== -Average time per step: 0.044 seconds -Training speed: 90.9 samples/sec -Peak memory usage: 1234 MB +Average time per step: 0.045 seconds +Training speed: 177.8 samples/sec +Peak memory usage: 2847 MB ========================================== ``` -**If you see this output, your environment is ready!** - - ---- - -## 2. Understanding Tiny LLaMA Architecture - -### 2.1 Model Overview - -Tiny LLaMA is a scaled-down version of the LLaMA architecture, designed for educational purposes and profiling workshops. It uses the standard transformer decoder architecture with modern enhancements. - -**Model Configuration (Default):** - -```python -vocab_size = 1000 # Small vocabulary for workshop -hidden_dim = 256 # Model dimension (D) -n_layers = 4 # Number of transformer layers -n_heads = 8 # Number of attention heads -n_kv_heads = 4 # Number of key-value heads (GQA) -intermediate_dim = 512 # FFN intermediate dimension -max_seq_len = 128 # Maximum sequence length -``` - -**Model Size:** -- Parameters: ~2.9 million -- Memory footprint: ~11 MB (FP32) -- Training memory (batch=8, seq=128): ~200-500 MB (includes activations, gradients, optimizer states) - -**Detailed Parameter Calculation:** - -Understanding how we arrive at ~2.9M parameters: - -1. **Token Embeddings**: - - Shape: [vocab_size, hidden_dim] = [1000, 256] - - Parameters: 1000 × 256 = 256,000 - -2. **Per Transformer Layer** (4 layers total): - - a. **RMSNorm (×2 per layer)**: - - Pre-attention norm: hidden_dim = 256 parameters - - Pre-FFN norm: hidden_dim = 256 parameters - - Total: 2 × 256 = 512 parameters per layer - - b. **Multi-Head Attention with GQA** (Grouped Query Attention): - - **Q projection**: [hidden_dim, hidden_dim] = [256, 256] = 65,536 parameters - - **K projection** (GQA): [hidden_dim, head_dim × n_kv_heads] = [256, 32 × 4] = [256, 128] = 32,768 parameters - - Why smaller? GQA uses fewer key/value heads (4) than query heads (8) - - head_dim = hidden_dim / n_heads = 256 / 8 = 32 - - **V projection** (GQA): [256, 128] = 32,768 parameters - - **O projection** (output): [256, 256] = 65,536 parameters - - **Total Attention**: 65,536 + 32,768 + 32,768 + 65,536 = 196,608 parameters per layer - - c. 
**SwiGLU Feed-Forward Network**: - - **Gate projection**: [hidden_dim, intermediate_dim] = [256, 512] = 131,072 parameters - - **Up projection**: [256, 512] = 131,072 parameters - - **Down projection**: [intermediate_dim, hidden_dim] = [512, 256] = 131,072 parameters - - **Total FFN**: 131,072 + 131,072 + 131,072 = 393,216 parameters per layer - - d. **Total per layer**: 512 + 196,608 + 393,216 = 590,336 parameters - - e. **All 4 layers**: 4 × 590,336 = 2,361,344 parameters - -3. **Final Components**: - - **Final RMSNorm**: 256 parameters - - **Output projection** (LM head): [hidden_dim, vocab_size] = [256, 1000] = 256,000 parameters - - **Total**: 256 + 256,000 = 256,256 parameters - -4. **Grand Total**: - - Embeddings: 256,000 - - All layers: 2,361,344 - - Final components: 256,256 - - **Total**: 256,000 + 2,361,344 + 256,256 = **2,873,600 parameters ≈ 2.9M** - -**Memory Footprint Calculation:** -- FP32: 4 bytes per parameter -- Total memory: 2,873,600 × 4 bytes = 11,494,400 bytes ≈ **11.0 MB** - -**Training Memory Breakdown** (batch_size=8, seq_len=128): - -Per-layer memory requirements: -- **Input activations**: [B, S, D] = [8, 128, 256] = 262,144 elements → 1.05 MB -- **Q, K, V tensors**: 3 × [8, 128, 256] → 3.15 MB -- **Attention scores**: [B, H, S, S] = [8, 8, 128, 128] = 1,048,576 elements → 4.19 MB -- **FFN intermediates**: 2 × [B, S, D_ff] = 2 × [8, 128, 512] → 4.19 MB -- **Per-layer subtotal**: ~15.7 MB × 4 layers = **~63 MB** - -Training overhead: -- **Gradients** (same size as activations): ~63 MB -- **Parameter gradients**: 2.9M × 4 bytes = ~11 MB -- **Optimizer states** (Adam: momentum + variance): 2.9M × 2 × 4 bytes = ~22 MB - -**Total training memory**: 63 + 63 + 11 + 22 = **~160 MB** - -Note: Actual PyTorch memory usage will be 200-500 MB due to: -- Framework overhead -- Memory fragmentation -- Temporary buffers -- CUDA kernels and workspace - -### 2.2 Transformer Layer Architecture - -Each transformer layer consists of: - -1. **RMSNorm** (Root Mean Square Normalization) -2. **Multi-Head Attention** with RoPE -3. **Residual Connection** -4. **RMSNorm** -5. **Feed-Forward Network** (SwiGLU) -6. 
**Residual Connection** - -**Visual Structure:** - -``` -Input (B, S, D) - ↓ -┌───────────────────────────────────────┐ -│ RMSNorm │ -└───────────────────────────────────────┘ - ↓ -┌───────────────────────────────────────┐ -│ Multi-Head Attention │ -│ ┌─────────────────────────────────┐ │ -│ │ Q, K, V Projections │ │ -│ │ RoPE (Rotary Position Encoding) │ │ -│ │ Attention Computation │ │ -│ │ Output Projection │ │ -│ └─────────────────────────────────┘ │ -└───────────────────────────────────────┘ - ↓ - Residual Add - ↓ -┌───────────────────────────────────────┐ -│ RMSNorm │ -└───────────────────────────────────────┘ - ↓ -┌───────────────────────────────────────┐ -│ Feed-Forward Network (SwiGLU) │ -│ ┌─────────────────────────────────┐ │ -│ │ Gate Projection │ │ -│ │ Up Projection │ │ -│ │ SiLU Activation │ │ -│ │ Element-wise Multiply │ │ -│ │ Down Projection │ │ -│ └─────────────────────────────────┘ │ -└───────────────────────────────────────┘ - ↓ - Residual Add - ↓ -Output (B, S, D) -``` - -### 2.3 Multi-Head Attention Implementation - -**Standard PyTorch Attention (Version 1 Baseline):** - -The baseline uses separate linear projections for Query, Key, and Value: - -```python -def attention_forward(self, hidden_states, attention_mask=None): - batch_size, seq_len, _ = hidden_states.size() - - # STEP 1: Separate linear projections (3 kernel launches) - query = self.q_proj(hidden_states) # [B, S, D] -> [B, S, D] - key = self.k_proj(hidden_states) # [B, S, D] -> [B, S, D] - value = self.v_proj(hidden_states) # [B, S, D] -> [B, S, D] - - # STEP 2: Reshape for multi-head attention - query = query.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - key = key.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - value = value.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - # Result: [B, H, S, head_dim] - - # STEP 3: Apply rotary position embeddings - query, key = self.rotary_emb(query, key, seq_len) - - # STEP 4: Compute attention scores - # attn_weights: [B, H, S, S] - attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim) - - if attention_mask is not None: - attn_weights = attn_weights + attention_mask - - # STEP 5: Softmax over last dimension - attn_weights = F.softmax(attn_weights, dim=-1) - - # STEP 6: Apply attention to values - attn_output = torch.matmul(attn_weights, value) - # Result: [B, H, S, head_dim] - - # STEP 7: Reshape and project output - attn_output = attn_output.transpose(1, 2).contiguous() - attn_output = attn_output.view(batch_size, seq_len, self.hidden_size) - attn_output = self.o_proj(attn_output) - - return attn_output -``` - -**Performance Characteristics:** - -- **3 separate linear projections:** Creates kernel launch overhead -- **Attention matrix materialization:** O(S^2) memory usage per head -- **Memory-bound operations:** Multiple tensor reshapes -- **Sequential execution:** Limited parallelization - -**FLOP Count (per layer):** - -Understanding FLOP calculations for attention operations with example configuration (B=8, S=128, D=256, H=8, head_dim=32): - -**Linear Projection FLOP Formula:** -For a matrix multiplication: `output = input @ weight` -- Input shape: [B, S, D_in] -- Weight shape: [D_in, D_out] -- FLOPs = 2 × B × S × D_in × D_out - - Factor of 2: Each multiply-accumulate (MAC) operation counts as 2 FLOPs (1 multiply + 1 add) - - We perform B × S output positions, each requiring D_in × D_out operations - -**Attention FLOP Calculations:** - -1. 
**Q, K, V Projections** (3 separate linear layers): - - Each projection: [B, S, D] → [B, S, D] - - FLOPs per projection: 2 × B × S × D × D - - Calculation: 2 × 8 × 128 × 256 × 256 = 134,217,728 ≈ 134.2M FLOPs - - Total for Q, K, V: 3 × 134.2M = 402.6M FLOPs - -2. **Attention Scores** (Q @ K^T): - - After reshaping: Q and K are [B, H, S, head_dim] - - For each head: [S, head_dim] @ [head_dim, S] → [S, S] - - FLOPs: 2 × B × H × S × S × head_dim - - Calculation: 2 × 8 × 8 × 128 × 128 × 32 = 67,108,864 ≈ 67.1M FLOPs - - Why: For each of B×H attention matrices, we compute S×S scores, each requiring head_dim multiply-accumulates - -3. **Attention Application** (Softmax @ V): - - Attention weights [B, H, S, S] @ Values [B, H, S, head_dim] → [B, H, S, head_dim] - - FLOPs: 2 × B × H × S × S × head_dim - - Calculation: 2 × 8 × 8 × 128 × 128 × 32 = 67.1M FLOPs - - Same as attention scores computation - -4. **Output Projection**: - - [B, S, D] → [B, S, D] - - FLOPs: 2 × B × S × D × D - - Calculation: 2 × 8 × 128 × 256 × 256 = 134.2M FLOPs - -**Summary:** -``` -Q projection: 134.2M FLOPs -K projection: 134.2M FLOPs -V projection: 134.2M FLOPs -Attention scores: 67.1M FLOPs -Softmax: ~0.1M FLOPs (negligible, element-wise) -Attention application: 67.1M FLOPs -Output projection: 134.2M FLOPs -───────────────────────────────── -Total Attention: ~671M FLOPs per layer -``` - -**Key Insights:** -- Linear projections (Q, K, V, O) dominate: 536.8M FLOPs (80% of attention) -- Attention computation (scores + application): 134.2M FLOPs (20% of attention) -- Quadratic term (S × S) appears in attention scores but with small head_dim coefficient -- For longer sequences, the S^2 term becomes more significant - -### 2.4 SwiGLU Feed-Forward Network - -**Implementation:** - -```python -def swiglu_forward(self, hidden_states): - # STEP 1: Separate gate and up projections (2 kernel launches) - gate = self.gate_proj(hidden_states) # [B, S, D] -> [B, S, D_ff] - up = self.up_proj(hidden_states) # [B, S, D] -> [B, S, D_ff] - - # STEP 2: SiLU activation (Swish) - gate_activated = F.silu(gate) # Element-wise operation - - # STEP 3: Element-wise multiplication - intermediate = gate_activated * up # [B, S, D_ff] - - # STEP 4: Down projection - output = self.down_proj(intermediate) # [B, S, D_ff] -> [B, S, D] - - return output -``` - -**Why SwiGLU?** -- Better than standard ReLU activation -- Gating mechanism improves model capacity -- Used in modern LLMs (LLaMA, PaLM) - -**Performance Characteristics:** -- **Separate gate/up projections:** Can be fused into single GEMM -- **Intermediate tensor storage:** Memory overhead -- **Sequential activation:** SiLU can be fused with multiplication - -**FLOP Count (per layer):** - -Understanding FLOP calculations for feed-forward network with example configuration (B=8, S=128, D=256, D_ff=512): - -**FFN FLOP Calculations:** - -1. **Gate Projection**: - - Transform: [B, S, D] → [B, S, D_ff] - - Weight matrix: [D, D_ff] = [256, 512] - - FLOPs: 2 × B × S × D × D_ff - - Calculation: 2 × 8 × 128 × 256 × 512 = 268,435,456 ≈ 268.4M FLOPs - - Explanation: For each of B×S positions, multiply a D-dimensional vector by a [D, D_ff] matrix - -2. **Up Projection**: - - Same dimensions as gate projection: [B, S, D] → [B, S, D_ff] - - FLOPs: 2 × B × S × D × D_ff = 268.4M FLOPs - - Calculation: 2 × 8 × 128 × 256 × 512 = 268.4M FLOPs - -3. 
**SiLU Activation**: - - Element-wise operation: silu(x) = x × sigmoid(x) - - Applied to gate tensor: [B, S, D_ff] - - FLOPs: ~3 × B × S × D_ff (sigmoid + multiply) ≈ 0.01M FLOPs - - Negligible compared to matrix multiplications - -4. **Element-wise Multiply**: - - gate_activated × up: [B, S, D_ff] element-wise - - FLOPs: B × S × D_ff = 8 × 128 × 512 ≈ 0.5M FLOPs - - Negligible compared to linear projections - -5. **Down Projection**: - - Transform: [B, S, D_ff] → [B, S, D] - - Weight matrix: [D_ff, D] = [512, 256] - - FLOPs: 2 × B × S × D_ff × D - - Calculation: 2 × 8 × 128 × 512 × 256 = 268,435,456 ≈ 268.4M FLOPs - -**Summary:** -``` -Gate projection: 268.4M FLOPs -Up projection: 268.4M FLOPs -Down projection: 268.4M FLOPs -SiLU activation: ~0.01M FLOPs (negligible) -Element-wise multiply: ~0.5M FLOPs (negligible) -───────────────────────────────── -Total FFN: ~805.3M FLOPs per layer -``` - -**Key Insights:** -- Three linear projections dominate: 805.2M FLOPs (>99.9% of FFN) -- Element-wise operations (SiLU, multiply) are negligible: <1M FLOPs combined -- FFN is more compute-intensive than attention: 805M vs 671M FLOPs -- Gate and up projections can be fused to reduce memory bandwidth -- D_ff is typically 2-4× larger than D, making FFN compute-bound - -### 2.5 RMSNorm (Root Mean Square Normalization) - -**Implementation:** - -```python -def rms_norm_forward(self, hidden_states): - input_dtype = hidden_states.dtype - - # Compute RMS - variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True) - hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) - - # Apply learned scale - return (self.weight * hidden_states).to(input_dtype) -``` - -**Why RMSNorm instead of LayerNorm?** -- Simpler: No mean subtraction -- Faster: Fewer operations -- Same effectiveness for LLMs -- Less memory bandwidth - -**Performance Characteristics:** -- Memory-bound operation -- Reduction over hidden dimension -- Opportunity for fusion with adjacent operations - -### 2.6 Complete Layer FLOP Breakdown - -For a single transformer layer with batch_size=8, seq_len=128: - -``` -Component | FLOPs | Percentage -------------------------|--------------|------------ -Attention QKV Proj | 402.6M | 27.3% -Attention Computation | 134.2M | 9.1% -Attention Output Proj | 134.2M | 9.1% -FFN Gate/Up Proj | 536.8M | 36.4% -FFN Down Proj | 268.4M | 18.2% -RMSNorm (x2) | ~0.5M | <0.1% -------------------------|--------------|------------ -Total per Layer | ~1,476M | 100% -Total Model (4 layers) | ~5.91B | - -``` - -**Corrected Calculations:** -- Attention QKV: 3 × 134.2M = 402.6M FLOPs -- Attention scores + application: 67.1M + 67.1M = 134.2M FLOPs -- Attention output: 134.2M FLOPs -- FFN gate + up: 2 × 268.4M = 536.8M FLOPs -- FFN down: 268.4M FLOPs -- Total per layer: 402.6 + 134.2 + 134.2 + 536.8 + 268.4 + 0.5 = 1,476.7M ≈ 1.48B FLOPs -- Total model (4 layers): 4 × 1.48B = 5.92B FLOPs per forward pass - -**Key Observations:** -- FFN dominates compute: ~54.6% of FLOPs (gate/up/down projections) -- Attention: ~45.5% of FLOPs -- RMSNorm negligible: <0.1% of FLOPs -- Linear projections (GEMM operations) account for >99% of all FLOPs - -### 2.7 Memory Layout and Access Patterns +## Profiling with PyTorch Profiler -**Memory Requirements (batch_size=8, seq_len=128):** +Enable PyTorch profiler for detailed operator-level analysis: ``` -Component | Memory (MB) | Notes ------------------------|-------------|--------------------------- -Model Parameters | 9.2 | Weights only (FP32) -Optimizer States | 36.8 | 
Adam: 2× params (m, v) -Input Activations | 1.0 | Per layer -Attention Activations | 4.2 | Intermediate tensors -FFN Activations | 2.1 | Intermediate tensors -Gradients | 9.2 | Same as parameters -Attention Matrix | 1.0 | [B, H, S, S] per layer ------------------------|-------------|--------------------------- -Total (approximate) | 63.5 MB | Can vary with framework +python3 tiny_llama_v1.py \ + --batch-size 8 \ + --seq-len 128 \ + --num-steps 20 \ + --enable-pytorch-profiler \ + --profile-dir ./pytorch_profiles \ + --profile-steps 5 ``` -**Memory Bandwidth Patterns:** - -- **Attention:** Memory-bound (many small operations, reshapes) -- **FFN:** Compute-bound (large GEMMs with high arithmetic intensity) -- **RMSNorm:** Memory-bound (reduction operations) - ---- - -## 3. Understanding the Baseline Implementation - -### 3.1 Code Structure Overview - -The `tiny_llama_v1.py` file is organized into several key components: +View results with TensorBoard: ``` -tiny_llama_v1.py -├── Configuration Classes -│ ├── TinyLlamaConfig (model configuration) -│ └── ProfilerConfig (profiling options) -├── Model Components -│ ├── RMSNorm (normalization layer) -│ ├── RotaryEmbedding (position encoding) -│ ├── Attention (multi-head attention) -│ ├── MLP (SwiGLU feed-forward) -│ ├── TransformerBlock (complete layer) -│ └── TinyLlamaModel (full model) -├── Training Infrastructure -│ ├── Optimizer setup -│ ├── Loss computation -│ └── Training loop -└── Profiling Integration - ├── PyTorch Profiler setup - ├── DeepSpeed FLOPS profiler - └── Performance reporting +tensorboard --logdir ./pytorch_profiles --port 6006 ``` -### 3.2 Command-Line Arguments +## Memory Analysis -Understanding the available options: +Test memory scaling with different batch sizes: -**Basic Training Arguments:** - -```bash ---batch-size 8 # Number of samples per batch ---seq-len 128 # Sequence length ---num-steps 50 # Number of training steps ---learning-rate 1e-4 # Optimizer learning rate ---device cuda # Device to use (cuda/cpu) -``` - -**Model Configuration:** - -```bash ---hidden-dim 256 # Model hidden dimension ---n-layers 4 # Number of transformer layers ---n-heads 8 # Number of attention heads ---intermediate-dim 512 # FFN intermediate size ``` - -**Profiling Options:** - -```bash ---enable-pytorch-profiler # Enable PyTorch profiler ---profile-dir ./profiles # Directory for profile output ---profile-memory # Include memory profiling ---profile-operators # Detailed operator profiling ---profile-steps 5 # Number of steps to profile +python3 tiny_llama_v1.py --batch-size 4 --seq-len 128 --num-steps 15 +python3 tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 15 +python3 tiny_llama_v1.py --batch-size 16 --seq-len 128 --num-steps 15 ``` -**DeepSpeed FLOPS Profiling:** +Test sequence length scaling: -```bash ---enable-deepspeed-flops # Enable FLOPS profiler ---flops-profile-step 10 # Which step to profile ``` - -**Other Options:** - -```bash ---seed 42 # Random seed for reproducibility ---deterministic # Enable deterministic operations ---output-dir ./output # Directory for outputs ---log-interval 10 # Logging frequency +python3 tiny_llama_v1.py --batch-size 8 --seq-len 64 --num-steps 10 +python3 tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +python3 tiny_llama_v1.py --batch-size 8 --seq-len 256 --num-steps 10 ``` -### 3.3 Profiling Integration Points +Memory scales linearly with batch size and quadratically with sequence length (due to attention matrices). 
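The quadratic term comes from the materialized attention-score tensor of shape [B, H, S, S]. The short estimate below is a sketch using the default configuration (8 heads, FP32 scores, ignoring framework overhead) and shows why sequence length dominates:

```python
# Estimate attention-score memory per layer: one FP32 tensor of shape [B, H, S, S].
def attn_score_mb(batch_size, seq_len, n_heads=8, bytes_per_elem=4):
    return batch_size * n_heads * seq_len * seq_len * bytes_per_elem / 1e6

for batch_size, seq_len in [(4, 128), (8, 128), (16, 128), (8, 64), (8, 256)]:
    print(f"B={batch_size:2d}, S={seq_len:3d}: "
          f"{attn_score_mb(batch_size, seq_len):6.2f} MB per layer")
```

Doubling the batch size doubles the estimate (2.10 -> 4.19 -> 8.39 MB), while doubling the sequence length quadruples it (1.05 -> 4.19 -> 16.78 MB), which is why long-sequence runs hit memory limits first.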
-The code includes several profiling integration points: +## Performance Study -**PyTorch Profiler Context:** +Use the performance study launcher for pre-configured problem sizes: -```python -# In training loop -with torch.profiler.profile( - activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], - record_shapes=True, - profile_memory=True, - with_stack=True, - with_flops=True -) as prof: - # Training step - outputs = model(inputs) - loss = criterion(outputs) - loss.backward() - optimizer.step() - -# Export results -prof.export_chrome_trace("trace.json") ``` - -**NVTX Annotations:** - -```python -# Mark important regions -with nvtx.range("attention_forward"): - attn_output = attention(hidden_states) - -with nvtx.range("ffn_forward"): - ffn_output = feed_forward(hidden_states) +./launch_performance_study.sh tiny +./launch_performance_study.sh small +./launch_performance_study.sh medium --enable-profilers ``` -**DeepSpeed FLOPS Profiler:** - -```python -from deepspeed.profiling.flops_profiler import FlopsProfiler +Available problem sizes: -profiler = FlopsProfiler(model) -profiler.start_profile() -# Forward pass -profiler.stop_profile() -profiler.print_model_profile(profile_step=1) -``` +| Size | Hidden Dim | Layers | Seq Len | Batch | Est. Parameters | +|------|-----------|--------|---------|-------|-----------------| +| tiny | 256 | 4 | 128 | 8 | ~2.9M | +| small | 512 | 8 | 256 | 8 | ~20.9M | +| medium | 1024 | 12 | 512 | 16 | ~167M | +| large | 2048 | 16 | 1024 | 8 | ~1.3B | -### 3.4 Expected Kernel Launch Pattern +## Key Performance Metrics -For a single training step, the baseline implementation generates: +- **Training Speed**: samples/sec processed +- **FLOPS**: Floating point operations per second +- **MFU**: Model FLOPS Utilization (% of theoretical peak) +- **Memory Usage**: Peak GPU memory consumed -``` -Per Transformer Layer (~17 kernel launches): -├── RMSNorm (pre-attention) : 1 kernel -├── Q Projection : 1 kernel -├── K Projection : 1 kernel -├── V Projection : 1 kernel -├── RoPE (query) : 1 kernel -├── RoPE (key) : 1 kernel -├── Attention scores (QK^T) : 1 kernel -├── Softmax : 1 kernel -├── Attention application (softmax*V): 1 kernel -├── Output Projection : 1 kernel -├── Residual Add : 1 kernel -├── RMSNorm (pre-FFN) : 1 kernel -├── Gate Projection : 1 kernel -├── Up Projection : 1 kernel -├── SiLU Activation : 1 kernel -├── Element-wise Multiply : 1 kernel -└── Down Projection : 1 kernel +Baseline performance characteristics: +- Training speed: 50-200 samples/sec (varies by hardware) +- GPU utilization: 60-75% (typical for baseline PyTorch) +- Attention operations: ~35-45% of compute time +- FFN operations: ~30-40% of compute time -Total per step (4 layers): ~68 kernels (forward only) -With backward pass: ~136 kernels per step -``` +## Optimization Opportunities -**Optimization Implications:** -- High kernel launch overhead -- Many small operations -- Opportunities for fusion +Based on profiling analysis, the baseline model shows opportunities for: -### 3.5 Running the Baseline +1. **Kernel Fusion**: Combine separate QKV projections into single GEMM +2. **Flash Attention**: Reduce attention memory from O(S^2) to O(S) +3. **SwiGLU Fusion**: Combine gate and up projections +4. 
**Mixed Precision**: FP16/BF16 for 2x memory reduction -**Quick Start:** +## Troubleshooting -```bash -# Basic run without profiling -./run_baseline.sh +CUDA/ROCm memory errors: -# Or manually -python3 tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 50 ``` - -**With PyTorch Profiler:** - -```bash -# Using helper script -./run_pytorch_profiler.sh - -# Or manually -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 20 \ - --enable-pytorch-profiler \ - --profile-dir ./pytorch_profiles \ - --profile-memory +python3 tiny_llama_v1.py --batch-size 4 --seq-len 64 --num-steps 10 ``` -**With DeepSpeed FLOPS Profiler:** - -```bash -# Using helper script -./run_deepspeed_flops.sh +Check GPU utilization: -# Or manually -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 20 \ - --enable-deepspeed-flops \ - --flops-profile-step 10 ``` - ---- - -## 4. Exercise 1: Baseline Performance Analysis - -### 4.1 Objective - -Establish baseline performance metrics for Tiny LLaMA V1 and understand the profiling methodology that will be used throughout the workshop. - -**What you'll learn:** -- How to run the baseline model -- How to enable and use PyTorch Profiler -- How to interpret basic profiling output -- What "good" performance looks like for this model -- How to identify top operations consuming time - -### 4.2 Step-by-Step Instructions - -#### Step 1: Run Baseline Training - -First, let's run the basic model without any profiling to establish a clean baseline: - -```bash -# Navigate to version1_pytorch_baseline directory -cd ~/castille-ai-workshop-training/version1_pytorch_baseline/ - -# Run basic training -python3 tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 20 +rocm-smi ``` -**Expected Output:** +Memory fragmentation: ``` -========================================== -Tiny LLaMA V1 - PyTorch Baseline -========================================== -Configuration: - Batch Size: 8 - Sequence Length: 128 - Number of Steps: 20 - Hidden Dim: 256 - Num Layers: 4 - Num Heads: 8 - Intermediate Dim: 512 - -Model Configuration: - Total Parameters: 2,345,984 - Model Size: 9.2 MB (FP32) - -Initializing model and optimizer... -Using device: cuda -GPU: AMD Instinct MI250X - -Starting training... -Step 1/20: Loss = 6.9088, Time = 0.234 seconds -Step 2/20: Loss = 6.9076, Time = 0.046 seconds -Step 3/20: Loss = 6.9065, Time = 0.045 seconds -Step 4/20: Loss = 6.9054, Time = 0.044 seconds -... -Step 20/20: Loss = 6.8821, Time = 0.044 seconds - -========================================== -Performance Summary: -========================================== -Average time per step: 0.045 seconds -Training speed: 177.8 samples/sec -Peak memory usage: 2847 MB -Avg time per forward: 0.022 seconds -Avg time per backward: 0.018 seconds -Avg time per optimizer: 0.005 seconds -========================================== +export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 ``` -**Record the following baseline metrics:** -- Training speed: _____ samples/sec -- Peak memory usage: _____ MB -- Avg time per step: _____ ms -- GPU name and memory - -**Key Observations:** - -1. **First iteration is slower:** Step 1 takes ~234ms vs ~44ms for subsequent steps - - Reason: Kernel compilation, memory allocation, cache warming - - **Always exclude first iteration from measurements** - -2. **Consistent timing:** Steps 2-20 have similar timing - - Good sign: stable performance - - Small variance indicates consistent GPU utilization - -3. 
**Memory usage:** ~2.8 GB for this configuration - - Includes: Model weights (9 MB) + optimizer states (36 MB) + activations + gradients - -#### Step 2: Enable PyTorch Profiler - -Now let's add PyTorch profiler to understand what's happening under the hood: - -```bash -# Run with PyTorch profiler enabled -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 20 \ - --enable-pytorch-profiler \ - --profile-dir ./exercise1_profiles \ - --profile-steps 5 -``` - -**What this does:** -- Profiles steps 5-10 (after warmup) -- Records CPU and GPU operations -- Tracks memory allocations -- Generates TensorBoard-compatible traces - -**Expected Output:** - -``` -========================================== -Tiny LLaMA V1 - PyTorch Baseline (Profiling Enabled) -========================================== -... (same as before) ... - -Profiling enabled: Steps 5-10 -Profile data will be saved to: ./exercise1_profiles/ - -Step 1/20: Loss = 6.9088, Time = 0.245 seconds -Step 2/20: Loss = 6.9076, Time = 0.048 seconds -Step 3/20: Loss = 6.9065, Time = 0.047 seconds -Step 4/20: Loss = 6.9054, Time = 0.046 seconds -Step 5/20: Loss = 6.9043, Time = 0.052 seconds [PROFILING] -Step 6/20: Loss = 6.9032, Time = 0.053 seconds [PROFILING] -... -Step 10/20: Loss = 6.8989, Time = 0.052 seconds [PROFILING] -Step 11/20: Loss = 6.8978, Time = 0.046 seconds -... - -Profiling complete! -Profile files generated: - - ./exercise1_profiles/trace_step_5_10.json - - ./exercise1_profiles/events.out.tfevents.* - - ./exercise1_profiles/performance_summary.json - -Average time per step: 0.048 seconds (with profiling overhead) -Training speed: 166.7 samples/sec -Peak memory usage: 3124 MB -``` - -**Answer these questions in your results file:** - -1. How much overhead did profiling add to training time? - - Without profiling: ~0.045 seconds/step - - With profiling: ~0.048-0.052 seconds/step - - Overhead: ~6-15% (acceptable for profiling) - -2. What files were generated in the `exercise1_profiles/` directory? - -```bash -ls -lh ./exercise1_profiles/ -``` - -3. What's the difference in memory usage with profiling enabled? - - Extra memory needed for profiler data structures - -#### Step 3: Analyze Profiling Results with TensorBoard - -Launch TensorBoard to visualize the profiling results: - -```bash -# Launch TensorBoard (run in background or separate terminal) -tensorboard --logdir ./exercise1_profiles --port 6006 - -# If TensorBoard is not available, examine JSON traces -# We'll show alternative analysis methods below -``` - -**TensorBoard Analysis:** - -1. Open your browser to `http://localhost:6006` (or your server address) -2. Navigate to the "PROFILE" tab -3. Select the most recent run - -**Explore the following views:** - -**A. Overview Page:** - -- **Performance Summary:** Shows step time breakdown -- **Run Environment:** GPU model, driver version, CUDA/ROCm version -- **Recommendation:** TensorBoard may suggest optimizations - -**B. Trace Viewer:** - -- Timeline of CPU and GPU operations -- Each row represents a thread or GPU stream -- Zoom in to see individual kernel launches -- Look for: - - GPU idle time (gaps in GPU timeline) - - CPU bottlenecks - - Memory transfer operations - -**C. 
Operator View:** - -Shows aggregated statistics for each operation type: - -``` -Top Operations by Total Time: -Operation | Calls | GPU Time | CPU Time | Total Time ------------------------------------|-------|----------|----------|------------ -aten::mm (matrix multiply) | 240 | 18.5 ms | 0.2 ms | 18.7 ms -aten::addmm (matrix multiply+add) | 480 | 15.3 ms | 0.3 ms | 15.6 ms -aten::bmm (batch matrix multiply) | 160 | 12.1 ms | 0.1 ms | 12.2 ms -aten::softmax | 80 | 8.4 ms | 0.1 ms | 8.5 ms -aten::mul (element-wise multiply) | 320 | 3.2 ms | 0.1 ms | 3.3 ms -aten::add_ (in-place add) | 160 | 2.8 ms | 0.1 ms | 2.9 ms -aten::silu (SiLU activation) | 80 | 2.1 ms | 0.1 ms | 2.2 ms -aten::rsqrt (RMSNorm) | 160 | 1.5 ms | 0.1 ms | 1.6 ms -``` - -**Document in your results file:** - -**Top 3 longest-running operations:** -1. _________________ -2. _________________ -3. _________________ - -**D. Memory Timeline:** - -- Shows memory allocation over time -- Peak memory during forward pass or backward pass? -- Memory spikes indicate large tensor allocations - -**Document:** -- Peak memory: _____ MB -- When does peak occur: Forward / Backward / Optimizer -- Are there memory spikes? Yes / No - -#### Step 4: Alternative Analysis (Without TensorBoard) - -If TensorBoard is not available, analyze the JSON trace directly: - -```bash -# View performance summary -cat ./exercise1_profiles/performance_summary.json | python3 -m json.tool -``` - -Use the Chrome trace viewer or analysis tools to identify the top operations by execution time. Look for patterns in: -- Matrix multiplication operations (mm, addmm, bmm) -- Attention-related kernels -- FFN operations -- Normalization operations - -#### Step 5: Identify Performance Patterns - -Based on your analysis, identify patterns in the baseline model: - -**Check these patterns in your results:** - -**Compute Patterns:** - -- [ ] Matrix multiplications (mm, addmm, bmm) dominate compute time -- [ ] Attention operations consume ~35-45% of total time -- [ ] FFN operations consume ~30-40% of total time -- [ ] Many small operations with low individual utilization -- [ ] Kernel launch overhead visible in timeline - -**Memory Patterns:** - -- [ ] Memory usage grows during forward pass -- [ ] Peak memory during attention computation -- [ ] Gradient tensors allocated during backward pass -- [ ] Frequent small allocations for intermediate tensors - -**Optimization Opportunities:** - -Based on the profiling results, rank these optimizations by potential benefit: - -- [ ] **High:** Kernel fusion (reduce number of operations) -- [ ] **High:** Fused QKV projection in attention -- [ ] **High:** Flash Attention implementation (reduce memory) -- [ ] **Medium:** Memory layout optimization -- [ ] **Medium:** Mixed precision training (FP16) -- [ ] **Low:** Batch size scaling (already reasonable) - -### 4.3 Expected Results - -After completing this exercise, you should have: - -#### Performance Baseline - -Representative ranges (actual results will vary by hardware): - -- **Training Speed:** 50-200 samples/sec -- **GPU Utilization:** 60-75% (typical for baseline PyTorch) -- **Memory Usage:** 2-4 GB (depends on batch size) -- **Kernel Count:** 60-80 different kernel launches per step -- **MFU (estimated):** 20-35% (memory-bound workload) - -#### Key Observations - -1. **Attention operations consume ~35-45% of total compute time** - - QKV projections: separate kernel launches - - Attention computation: O(S^2) memory complexity - - Softmax: memory-bound operation - -2. 
**FFN operations consume ~30-40% of total time** - - Gate/Up projections: separate operations - - SwiGLU: sequential activation and multiplication - -3. **Matrix multiplications (GEMM) are the dominant kernels** - - Linear layers in projections - - Attention score computation - - Good candidates for optimization - -4. **Multiple small operations create kernel launch overhead** - - Element-wise operations (add, multiply, activation) - - Normalization layers - - Residual connections - -5. **Memory allocation patterns show optimization opportunities** - - Intermediate tensors in attention - - Separate activations in FFN - - Gradient storage - -#### Profiling Data Generated - -``` -exercise1_profiles/ -├── trace_step_5_10.json # Chrome trace format -├── events.out.tfevents.* # TensorBoard events -├── performance_summary.json # Aggregated metrics -└── memory_timeline.json # Memory usage over time -``` - -### 4.4 Troubleshooting - -#### Common Issues - -**1. CUDA/ROCm Memory Errors** - -```bash -# Error: RuntimeError: CUDA out of memory -# Solution: Reduce batch size or sequence length -python3 tiny_llama_v1.py --batch-size 4 --seq-len 64 --num-steps 10 -``` - -**2. Profiling Files Not Generated** - -```bash -# Check permissions and disk space -ls -la ./exercise1_profiles/ -df -h . - -# Create directory manually -mkdir -p exercise1_profiles -chmod 755 exercise1_profiles -``` - -**3. TensorBoard Not Loading** - -```bash -# Try different port -tensorboard --logdir ./exercise1_profiles --port 6007 - -# Check if port is in use -netstat -tuln | grep 6006 - -# Or examine JSON files directly (see alternative analysis above) -``` - -**4. Low GPU Utilization** - -```bash -# Check if GPU is being used -rocm-smi - -# Monitor GPU during training (in separate terminal) -watch -n 1 rocm-smi - -# Check for CPU bottlenecks -htop -``` - -**5. Inconsistent Timing** - -```bash -# Ensure no other processes are using GPU -rocm-smi - -# Run with deterministic mode -python3 tiny_llama_v1.py --deterministic --seed 42 -``` - -### 4.5 Analysis Questions - -Answer these questions based on your results: - -**1. What is the primary bottleneck in the baseline model?** - - [ ] Memory bandwidth (many small operations) - - [ ] Compute utilization (GPU not fully utilized) - - [ ] Kernel launch overhead (too many launches) - - [ ] Data loading (CPU bottleneck) - -**Answer:** Likely a combination of memory bandwidth and kernel launch overhead. The baseline has many small operations that don't fully utilize the GPU. - -**2. Which operations would benefit most from fusion?** - - [ ] QKV projections in attention - - [ ] Gate/Up projections in SwiGLU - - [ ] Layer normalization operations - - [ ] All of the above - -**Answer:** All of the above. Version 2 will address these with kernel fusion. - -**3. What percentage of time is spent in attention vs FFN?** - -Based on profiling data: -- Attention: ~_____% -- FFN: ~_____% -- Other (norms, residuals): ~_____% - -**4. Based on memory usage patterns, what optimization would help most?** - - [ ] Gradient checkpointing (reduce activation memory) - - [ ] Flash Attention (reduce attention memory from O(S^2) to O(S)) - - [ ] Mixed precision (reduce memory footprint by 2x) - - [ ] Tensor fusion (reduce intermediate tensor allocations) - -**Answer:** Flash Attention for long sequences, tensor fusion for overall efficiency. - -### 4.6 Key Takeaways - -**What We Learned:** - -1. 
**Baseline performance characteristics:** - - Training speed: _____ samples/sec (record your value) - - GPU utilization: Moderate (60-75%) - - Memory usage: Reasonable for batch size - -2. **Primary bottlenecks identified:** - - Separate kernel launches for QKV, Gate/Up projections - - O(S^2) memory usage in attention - - Memory bandwidth limitations - -3. **Optimization targets for Version 2:** - - QKV fusion (combine 3 operations into 1) - - SwiGLU fusion (combine gate/up projections) - - Custom fused kernels for common patterns - -4. **Profiling methodology:** - - PyTorch Profiler provides detailed operator-level insights - - TensorBoard visualization helps identify patterns - - JSON traces enable programmatic analysis - -**Next Steps:** - -- Document your findings -- Compare with expected results (are your metrics in the expected ranges?) -- Identify top 3 optimization targets for Version 2 -- Save your profiling data for comparison with optimized versions - -**Exercise Complete When:** - -- [ ] Baseline training runs successfully -- [ ] Profiling data generated and analyzed -- [ ] Performance metrics documented -- [ ] Top operations identified -- [ ] Bottlenecks understood -- [ ] Ready to proceed to memory analysis - ---- - -**Next Exercise:** [Exercise 2 - Memory Analysis & Optimization](#5-exercise-2-memory-analysis--optimization) - ---- - -## 5. Exercise 2: Memory Analysis & Optimization - -### 5.1 Objective - -Understand memory usage patterns, identify memory bottlenecks, and analyze memory bandwidth utilization in the baseline Tiny LLaMA model. - -**What you'll learn:** -- How memory scales with batch size and sequence length -- Where peak memory is consumed (forward, backward, optimizer) -- Memory bandwidth utilization patterns -- How to identify memory-bound vs compute-bound operations -- Memory optimization opportunities - -### 5.2 Background: Why Memory Matters - -Memory optimization is crucial for transformer models because: - -**Memory Bandwidth:** -- Often the limiting factor, especially for small models -- Modern GPUs have very high compute (TFLOPS) but limited bandwidth (TB/s) -- Memory-bound operations don't fully utilize GPU compute - -**Peak Memory:** -- Determines maximum batch size and model size -- Out-of-memory (OOM) errors are common -- Larger batches → better GPU utilization - -**Memory Fragmentation:** -- Multiple small allocations reduce effective memory -- Garbage collection overhead -- Can cause OOM even with available memory - -**Attention Memory:** -- Quadratic scaling: O(S^2) with sequence length -- Major bottleneck for long sequences -- Target for Flash Attention optimization - -### 5.3 Step-by-Step Instructions - -#### Step 1: Memory-Focused Profiling - -Run profiling with enhanced memory analysis for different batch sizes: - -```bash -# Batch size 4 -python3 tiny_llama_v1.py \ - --batch-size 4 \ - --seq-len 128 \ - --num-steps 15 \ - --enable-pytorch-profiler \ - --profile-memory \ - --profile-dir ./memory_analysis_bs4 - -# Batch size 8 -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 15 \ - --enable-pytorch-profiler \ - --profile-memory \ - --profile-dir ./memory_analysis_bs8 - -# Batch size 16 -python3 tiny_llama_v1.py \ - --batch-size 16 \ - --seq-len 128 \ - --num-steps 15 \ - --enable-pytorch-profiler \ - --profile-memory \ - --profile-dir ./memory_analysis_bs16 -``` - -**Expected Output for Each Run:** - -``` -========================================== -Tiny LLaMA V1 - Memory Profiling 
-========================================== -Configuration: - Batch Size: 8 - Sequence Length: 128 - ... - -Memory Profiling Enabled - -Step 1/15: Loss = 6.9088, Time = 0.245 s, Memory = 2847 MB -... -Step 15/15: Loss = 6.8765, Time = 0.046 s, Memory = 2847 MB - -========================================== -Memory Analysis Summary: -========================================== -Peak Memory Usage: 2847 MB -Average Memory Usage: 2654 MB -Memory at Forward Pass: 2123 MB -Memory at Backward Pass: 2847 MB -Memory at Optimizer Step: 2456 MB -Number of Allocations: 1234 -Largest Tensor: 512 MB (attention_scores) -========================================== -``` - -**Record memory usage for each batch size in your results file:** - -| Batch Size | Peak Memory (MB) | Avg Memory (MB) | Training Speed (samples/sec) | -|------------|------------------|-----------------|------------------------------| -| 4 | _______ | _______ | _______ | -| 8 | _______ | _______ | _______ | -| 16 | _______ | _______ | _______ | - -**Questions to Answer:** - -1. **Memory Scaling:** Does memory double when batch size doubles? - - If yes → Linear scaling (good) - - If more than double → Superlinear scaling (fragmentation or inefficiency) - -2. **Throughput Scaling:** Does throughput double when batch size doubles? - - If yes → Perfect scaling - - If less → Diminishing returns (memory bandwidth limit) - -3. **Memory Efficiency:** What's the peak-to-average memory ratio? - - High ratio → Memory spikes, potential for optimization - - Low ratio → Consistent memory usage - -#### Step 2: Memory Timeline Analysis - -Analyze memory patterns using TensorBoard: - -```bash -# Launch TensorBoard for memory analysis -tensorboard --logdir ./memory_analysis_bs8 --port 6007 -``` - -**In TensorBoard:** - -1. Go to the **PROFILE** tab -2. Select **Memory Viewer** or **Memory Timeline** view -3. Examine the memory usage pattern over time - -**What to Look For:** - -**A. Memory Allocation Pattern:** - -``` -Memory (MB) - | -3000| ╱‾‾‾‾‾╲ - | / \ -2500| / \___________ - | / -2000| ╱‾‾‾‾‾‾╱ - | / -1500|______/ - | - +-----|-----|-----|-----|-----|------> Time - Fwd Attn FFN Bwd Opt Done -``` - -- **Forward pass:** Memory increases as activations are computed -- **Attention:** Often creates a spike (attention matrices) -- **FFN:** Additional activation memory -- **Backward pass:** Gradient tensors allocated -- **Optimizer:** Parameter updates - -**B. Memory Peaks:** - -Document when peak memory occurs: -- [ ] During forward pass (activations) -- [ ] During attention computation (attention matrices) -- [ ] During backward pass (gradients) -- [ ] During optimizer step (momentum buffers) - -**C. Memory Deallocation:** - -- Are there clear drops in memory usage? -- Does memory return to baseline after each step? -- Are tensors being deallocated promptly? 
- -**Record in your results file:** - -**Memory Pattern Analysis:** -- Peak memory occurs at: _______________________ -- Largest memory spike caused by: _______________________ -- Memory is deallocated: Promptly / Delayed / Not at all -- Memory usage pattern: Steady / Fluctuating / Spiking - -#### Step 3: Sequence Length Scaling - -Test how memory scales with sequence length: - -```bash -# Sequence length 64 -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 64 \ - --num-steps 10 \ - --profile-memory \ - --profile-dir ./memory_seq64 - -# Sequence length 128 (baseline) -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 10 \ - --profile-memory \ - --profile-dir ./memory_seq128 - -# Sequence length 256 -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 256 \ - --num-steps 10 \ - --profile-memory \ - --profile-dir ./memory_seq256 - -# Sequence length 512 (might OOM - use smaller batch if needed) -python3 tiny_llama_v1.py \ - --batch-size 4 \ - --seq-len 512 \ - --num-steps 5 \ - --profile-memory \ - --profile-dir ./memory_seq512 -``` - -**Record sequence length scaling:** - -| Seq Length | Batch Size | Peak Memory (MB) | Memory Increase | Scaling Factor | -|------------|------------|------------------|-----------------|----------------| -| 64 | 8 | _______ | baseline | 1.0x | -| 128 | 8 | _______ | _______ | _______ | -| 256 | 8 | _______ | _______ | _______ | -| 512 | 4 | _______ | _______ | _______ | - -**Memory Scaling Analysis:** - -Calculate the scaling factor: -``` -Scaling Factor = Memory(S) / Memory(S_baseline) - -For attention memory (theoretical): -- Linear components: O(S) → 2x when S doubles -- Attention matrix: O(S^2) → 4x when S doubles - -Expected combined: ~3x when S doubles (for attention-heavy workloads) -``` - -**Answer these questions:** - -1. **What is the memory scaling pattern?** - - [ ] Linear (~2x when sequence doubles) - - [ ] Quadratic (~4x when sequence doubles) - - [ ] Between linear and quadratic (~3x) - -2. **Which component shows steepest memory scaling?** - - Run separate profiling focusing on attention vs FFN - - Check memory timeline for attention layers - -3. **At what sequence length do you hit memory limits?** - - Record the maximum sequence length before OOM - - Note the batch size at that limit - -#### Step 4: Identifying Memory Hotspots - -Use profiling to identify which operations consume most memory: - -```bash -# Run with detailed operator profiling -python3 tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 10 \ - --enable-pytorch-profiler \ - --profile-memory \ - --profile-operators \ - --profile-dir ./memory_hotspots -``` - -**Analyze the operator memory usage:** - -Review the memory profiling output and trace files to identify operators with highest memory allocation. Use the PyTorch Profiler's memory view or trace analysis to examine memory allocation patterns. - -**Record top memory-consuming operations:** - -1. _________________: _______ MB -2. _________________: _______ MB -3. _________________: _______ MB -4. _________________: _______ MB -5. 
_________________: _______ MB - -**Common Memory Hotspots:** - -- **Attention scores:** `[B, H, S, S]` matrices (quadratic in S) -- **Query/Key/Value states:** `[B, S, D]` tensors -- **FFN intermediate:** `[B, S, D_ff]` tensors -- **Gradients:** Same size as parameters + activations - -#### Step 5: Memory Bandwidth Analysis - -Analyze memory bandwidth utilization: - -**Calculate memory bandwidth manually:** - -For batch_size=8, seq_len=128, hidden_dim=256, n_layers=4: - -1. **Estimate memory traffic per step:** - - Forward pass: QKV weights + activations + FFN weights - - Backward pass: ~2× forward pass - - Total: Depends on model size and batch configuration - -2. **Calculate bandwidth utilization:** - - Memory bandwidth = Total memory traffic / Step time - - Compare with theoretical peak (e.g., MI250X: ~1.6 TB/s per GCD) - - Utilization % = (Actual bandwidth / Peak bandwidth) × 100 - -3. **Calculate arithmetic intensity:** - - Arithmetic intensity = FLOPs / Memory traffic (bytes) - - < 10 FLOPS/byte: Memory-bound - - > 100 FLOPS/byte: Compute-bound - - 10-100 FLOPS/byte: Mixed workload - -Record your observations based on the profiling data collected. - -**Record in your results file:** - -**Bandwidth Analysis:** -- Memory Traffic per Step: _______ GB -- Memory Bandwidth Used: _______ GB/s -- Theoretical Peak Bandwidth: _______ GB/s -- Bandwidth Utilization: _______% -- Arithmetic Intensity: _______ FLOPS/byte -- Workload Classification: _______ - -### 5.4 Analysis and Interpretation - -#### Memory Scaling Patterns - -**Batch Size Scaling:** - -Expected pattern: -- Memory ≈ Base + (Batch_size × Per_sample_memory) -- Should be approximately linear -- If superlinear → fragmentation or inefficiency - -**Sequence Length Scaling:** - -Components: -- Linear: Activations, most projections -- Quadratic: Attention matrices `[B, H, S, S]` -- Combined: Between linear and quadratic - -**Typical Results:** - -| Component | S=64 | S=128 | S=256 | Scaling | -|----------------|------|-------|-------|---------| -| Parameters | 9MB | 9MB | 9MB | O(1) | -| Activations | ~1GB | ~2GB | ~4GB | O(S) | -| Attention | ~100MB | ~400MB | ~1.6GB | O(S^2) | -| Total | ~1.1GB | ~2.4GB | ~5.6GB | Mixed | - -#### Memory Bottleneck Classification - -**Workload Type Determination:** - -``` -Arithmetic Intensity (FLOPS/byte): -- < 10: Memory-bound (bandwidth limited) -- 10-100: Mixed workload -- > 100: Compute-bound (ALU limited) - -Typical Transformer Training: 20-50 FLOPS/byte (mixed, leaning memory-bound) -``` - -**Signs of Memory-Bound Workload:** -- Low GPU compute utilization (<70%) -- High memory bandwidth utilization (>60%) -- Many small operations -- Frequent memory transfers - -**Signs of Compute-Bound Workload:** -- High GPU compute utilization (>80%) -- Low memory bandwidth utilization (<50%) -- Large matrix multiplications dominate -- Good arithmetic intensity - -### 5.5 Memory Optimization Opportunities - -Based on your analysis, rank these optimizations: - -**1. Flash Attention** -- **Impact:** Reduces attention memory from O(S^2) to O(S) -- **Benefit:** Enables much longer sequences -- **When:** Always beneficial for S > 512 -- **Rank:** _____ (1-4) - -**2. Gradient Checkpointing** -- **Impact:** Trades compute for memory (recompute activations) -- **Benefit:** Reduces activation memory by ~2-4x -- **When:** Memory-constrained, willing to sacrifice 20-30% speed -- **Rank:** _____ (1-4) - -**3. 
Mixed Precision (FP16/BF16)** -- **Impact:** Reduces memory per parameter by 2x -- **Benefit:** Allows 2x larger batch or model -- **When:** Always beneficial if hardware supports it -- **Rank:** _____ (1-4) - -**4. Kernel Fusion** -- **Impact:** Reduces intermediate tensor allocations -- **Benefit:** Lower memory footprint, less fragmentation -- **When:** Many small operations (already the case) -- **Rank:** _____ (1-4) - -### 5.6 Expected Results - -After completing this exercise, you should have: - -**Memory Usage Baseline:** -- Peak memory: 2-4 GB (batch_size=8, seq_len=128) -- Memory scaling: ~Linear with batch size, ~Quadratic with sequence -- Memory hotspots: Attention matrices, FFN intermediate tensors -- Bandwidth utilization: 30-60% (memory-bound to mixed) - -**Key Findings:** - -1. **Attention Memory Dominates for Long Sequences** - - At S=512, attention alone can consume GBs - - Quadratic scaling makes long sequences expensive - - Flash Attention is critical optimization target - -2. **Memory Fragmentation Observable** - - Peak-to-average ratio often 1.2-1.5x - - Many small allocations create overhead - - Tensor fusion can reduce fragmentation - -3. **Bandwidth Utilization is Moderate** - - Typically 30-60% for baseline PyTorch - - Room for improvement through fusion - - Memory-bound operations limit performance - -4. **Linear Components Well-Behaved** - - FFN and most projections scale linearly - - Predictable memory requirements - - Batch size scaling is efficient - -### 5.7 Troubleshooting - -**Out of Memory Errors:** - -```bash -# Error: RuntimeError: CUDA out of memory -# Solution 1: Reduce batch size -python3 tiny_llama_v1.py --batch-size 2 --seq-len 128 - -# Solution 2: Reduce sequence length -python3 tiny_llama_v1.py --batch-size 8 --seq-len 64 - -# Solution 3: Enable gradient accumulation (if implemented) -python3 tiny_llama_v1.py --batch-size 4 --gradient-accumulation-steps 2 -``` - -**Memory Profiling Overhead:** - -```bash -# If profiling causes OOM, reduce profiling frequency -python3 tiny_llama_v1.py --profile-steps 2 # Profile fewer steps -``` - -**Memory Fragmentation:** - -```bash -# Set memory allocator configuration -export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 - -# Or use expandable segments (PyTorch 2.0+) -export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True -``` - -### 5.8 Analysis Questions - -Answer these questions based on your results: - -**1. What is the memory scaling behavior?** - - Batch size scaling: [ ] Linear [ ] Superlinear [ ] Sublinear - - Sequence length scaling: [ ] Linear [ ] Quadratic [ ] Cubic - -**2. Where is peak memory consumed?** - - [ ] Forward pass (activations) - - [ ] Backward pass (gradients) - - [ ] Optimizer step (parameter updates) - - [ ] Attention computation (attention matrices) - -**3. What is the primary memory optimization target?** - - [ ] Reduce attention memory (Flash Attention) - - [ ] Reduce activation memory (checkpointing) - - [ ] Reduce parameter memory (mixed precision) - - [ ] Reduce fragmentation (kernel fusion) - -**4. Is the workload memory-bound or compute-bound?** - - [ ] Memory-bound (low arithmetic intensity, <10 FLOPS/byte) - - [ ] Compute-bound (high arithmetic intensity, >100 FLOPS/byte) - - [ ] Mixed (moderate arithmetic intensity, 10-100 FLOPS/byte) - -**5. What memory optimization would provide the biggest benefit?** - -Rank by expected impact: -1. _______________________________________ -2. _______________________________________ -3. _______________________________________ -4. 
_______________________________________ - -### 5.9 Key Takeaways - -**What We Learned:** - -1. **Memory Scaling Patterns:** - - Batch size: Linear (good) - - Sequence length: Between linear and quadratic (attention dominates) - - Peak memory occurs during backward pass or attention computation - -2. **Memory Bottlenecks Identified:** - - Attention matrices: O(S^2) memory usage - - Intermediate tensors: FFN activations - - Memory fragmentation from many small allocations - -3. **Bandwidth Utilization:** - - Moderate utilization (30-60%) indicates mixed workload - - Room for optimization through kernel fusion - - Memory bandwidth limits throughput for small models - -4. **Optimization Priorities:** - - Flash Attention: Critical for long sequences (S > 512) - - Kernel fusion: Reduces fragmentation and bandwidth pressure - - Mixed precision: 2x memory reduction, always beneficial - -**Next Steps:** - -- Document memory analysis in results file -- Compare memory patterns across configurations -- Identify top 3 memory optimization targets -- Understand memory-compute trade-offs -- Proceed to Exercise 3 for bottleneck identification - -**Exercise Complete When:** - -- [ ] Memory profiling completed for multiple batch sizes -- [ ] Sequence length scaling analyzed -- [ ] Memory hotspots identified -- [ ] Bandwidth utilization calculated -- [ ] Optimization priorities ranked -- [ ] Ready to proceed to bottleneck identification - ---- - -## 6. Exercise 3: Performance Study Across Problem Sizes - -### 6.1 Objective - -Learn how model performance scales with different problem sizes by using the automated performance study launcher. This exercise demonstrates: - -- How performance varies across tiny to very large model configurations -- Scaling characteristics of attention and FFN operations -- Memory and compute requirements for different model sizes -- How to establish performance baselines for optimization comparisons - -**Time Required:** 15-30 minutes (depending on problem sizes tested) - -### 6.2 Understanding the Performance Study Script - -The `launch_performance_study.sh` script provides pre-configured problem sizes: - -| Size | Hidden Dim | Layers | Seq Len | Batch | Params | Expected Time | -|------|-----------|--------|---------|-------|--------|---------------| -| **tiny** | 256 | 4 | 128 | 8 | ~2.9M | <5s/iter | -| **small** | 512 | 8 | 256 | 8 | ~20.9M | 10-30s/iter | -| **medium** | 1024 | 12 | 512 | 16 | ~167M | 30-60s/iter | -| **large** | 2048 | 16 | 1024 | 8 | ~1.3B | 1-3min/iter | -| **very_large** | 4096 | 24 | 2048 | 4 | ~10.7B | 5-10min/iter | - -**Script Features:** -- Automatic configuration based on problem size -- Output organization with timestamps -- Configuration metadata in JSON format -- Optional profiler integration -- Performance metrics extraction -- Next steps guidance - -### 6.3 Step-by-Step Instructions - -#### Step 1: Run Tiny Problem Size (Quick Validation) - -Start with the smallest size to verify everything works: - -```bash -cd ~/castille-ai-workshop-training/version1_pytorch_baseline/ - -# Run tiny problem size (fast validation) -./launch_performance_study.sh tiny -``` - -**Expected Output:** -``` -================================================================================ -CASTILLE AI WORKSHOP - VERSION 1 BASELINE PERFORMANCE STUDY -================================================================================ - -Problem Size: TINY -Configuration: - Hidden Dimension: 256 - Number of Layers: 4 - Sequence Length: 128 - Batch Size: 8 - Training Steps: 50 
- Est. Parameters: ~2.9M - Expected Time: <5s/iter - Profilers Enabled: false - -Output Directory: performance_results_tiny_20251014_123456 -================================================================================ - -Starting V1 Baseline training... -... -================================================================================ -PERFORMANCE STUDY COMPLETE -================================================================================ -Total Runtime: 42s -Throughput: 95.2 samples/sec -Peak Memory: 342 MB -``` - -**Observe:** -- Quick completion time -- Low memory usage -- Baseline throughput metrics - -#### Step 2: Run Medium Problem Size (Workshop Standard) - -Test the standard workshop configuration: - -```bash -# Run medium problem size with profiling enabled -./launch_performance_study.sh medium --enable-profilers -``` - -**Note:** This will take longer (5-10 minutes) due to profiling overhead. - -**Expected Characteristics:** -- Longer runtime per iteration -- Higher memory usage -- More realistic model size for workshops -- Profiling data generated for analysis - -#### Step 3: Compare Problem Sizes - -Run multiple sizes to observe scaling: - -```bash -# Run small size -./launch_performance_study.sh small - -# Run medium size (if not done in Step 2) -./launch_performance_study.sh medium - -# Optional: Run large (if you have time and memory) -# WARNING: This requires significant GPU memory (>16GB) -# ./launch_performance_study.sh large -``` - -#### Step 4: Analyze Results - -Each run creates a timestamped output directory. Examine the results: - -```bash -# List all performance study results -ls -lt performance_results_*/ - -# View latest tiny run configuration -cat performance_results_tiny_*/config.json - -# View training output -cat performance_results_tiny_*/training_output.log - -# Compare throughput across sizes -echo "=== Throughput Comparison ===" -for dir in performance_results_*/; do - size=$(basename "$dir" | cut -d'_' -f3) - throughput=$(grep "Throughput:" "$dir/training_output.log" | tail -1 | awk '{print $2, $3}') - echo "$size: $throughput" -done - -# Compare memory usage -echo "" -echo "=== Memory Usage Comparison ===" -for dir in performance_results_*/; do - size=$(basename "$dir" | cut -d'_' -f3) - memory=$(grep "Peak memory usage:" "$dir/training_output.log" | tail -1 | awk '{print $4, $5}') - echo "$size: $memory" -done -``` - -#### Step 5: Record Scaling Observations - -Create a comparison table from your results: - -**Performance Scaling:** - -| Problem Size | Parameters | Throughput (samples/s) | Memory (MB) | Time/Iter (s) | -|--------------|-----------|------------------------|-------------|---------------| -| tiny | ~2.9M | _________ | _________ | _________ | -| small | ~20.9M | _________ | _________ | _________ | -| medium | ~167M | _________ | _________ | _________ | - -**Scaling Analysis:** - -1. **Throughput Scaling:** - - Does throughput decrease linearly with model size? - - At what size does GPU become saturated? - - How does batch size affect throughput? - -2. **Memory Scaling:** - - Is memory scaling proportional to parameter count? - - Where does attention memory become significant? - - What's the memory overhead ratio? - -3. **Compute Characteristics:** - - Which size achieves best GPU utilization? - - How does arithmetic intensity change? - - Is the workload memory-bound or compute-bound? - -### 6.4 Understanding Scaling Patterns - -**Expected Scaling Behavior:** - -**1. 
Parameter Count Scaling:** -- Linear layers: Scale with D² (hidden dimension squared) -- Attention: Scales with D² for projections, S² for computation -- FFN: Scales with D × D_ff (typically D × 4D) - -**2. Memory Scaling:** -- Parameters: Linear with model size -- Activations: Linear with batch size, quadratic with sequence length -- Peak memory: Dominated by activations for large sequences - -**3. Compute Scaling:** -- FLOPs: Proportional to parameters × sequence length × batch size -- Time per iteration: Depends on GPU utilization -- Throughput: Inversely related to FLOPs per sample - -**4. GPU Utilization:** -- Small models: Memory-bound, low GPU utilization -- Medium models: Mixed workload, moderate utilization -- Large models: Compute-bound, high GPU utilization - -### 6.5 Expected Results - -After completing this exercise, you should observe: - -**Tiny → Small Transition (2.9M → 20.9M):** -- Parameter increase: ~7x -- Memory increase: ~5-8x -- Throughput decrease: ~3-5x -- GPU utilization: Still relatively low - -**Small → Medium Transition (20.9M → 167M):** -- Parameter increase: ~8x -- Memory increase: ~6-10x (sequence length doubles!) -- Throughput decrease: ~5-10x -- GPU utilization: Significantly improved - -**Key Observations:** - -1. **Quadratic Attention Cost Visible:** - - Medium (seq_len=512) shows significant attention overhead vs small (seq_len=256) - - Memory increases faster than linear due to S² term - - This motivates Flash Attention optimization - -2. **Batch Size Impact:** - - Medium uses batch_size=16 vs 8 for small/large - - Better GPU utilization with larger batches - - Memory-throughput trade-off visible - -3. **Memory Becomes Limiting:** - - Large/very_large reduce batch size to fit in memory - - Attention matrices consume significant memory at long sequences - - Gradient checkpointing would be beneficial - -4. **Compute Patterns:** - - Larger models approach compute-bound regime - - Better GPU utilization percentage - - GEMM operations dominate more clearly - -### 6.6 Profiling Analysis (If Enabled) - -If you ran with `--enable-profilers`, analyze the generated profiles: - -```bash -# Navigate to profiled run -cd performance_results_medium_*/ - -# View performance summary -cat performance_summary.json | python3 -m json.tool - -# Check for profiler outputs -ls -lh pytorch_profiles/ -``` - -**Compare profiling results across sizes:** -- How does kernel distribution change? -- Which operations dominate in small vs large models? -- How does memory bandwidth utilization scale? - -### 6.7 Troubleshooting - -**Out of Memory Error:** - -```bash -# Error: RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB - -# Solution 1: Try the next smaller size -./launch_performance_study.sh small # instead of medium - -# Solution 2: Skip large/very_large on limited hardware -# These sizes require >16GB GPU memory -``` - -**Slow Execution:** - -```bash -# If profiling is too slow, disable it -./launch_performance_study.sh medium # without --enable-profilers - -# Reduce number of steps for faster results (edit script or run directly) -python tiny_llama_v1.py --hidden-dim 1024 --num-layers 12 --seq-len 512 \ - --batch-size 16 --num-steps 20 # Reduced from 100 -``` - -**Script Permission Denied:** - -```bash -# Make script executable -chmod +x launch_performance_study.sh - -# Then run -./launch_performance_study.sh tiny -``` - -### 6.8 Analysis Questions - -Answer these based on your performance study results: - -**1. 
Scaling Characteristics:** - -Q: How does throughput scale with model size? -A: _________________________________________________________________ - -Q: At what model size does GPU utilization peak? -A: _________________________________________________________________ - -Q: Which component (attention vs FFN) dominates compute time? -A: _________________________________________________________________ - -**2. Memory Patterns:** - -Q: How does memory scale with sequence length? (linear, quadratic, other?) -A: _________________________________________________________________ - -Q: What is the memory overhead ratio (peak / parameters)? -A: _________________________________________________________________ - -Q: At what point does attention memory become significant? -A: _________________________________________________________________ - -**3. Performance Optimization:** - -Q: Which model size would benefit most from Flash Attention? -A: _________________________________________________________________ - -Q: Which size is most memory-bound vs compute-bound? -A: _________________________________________________________________ - -Q: What batch size would you recommend for medium model? -A: _________________________________________________________________ - -**4. Practical Insights:** - -Q: What's the largest model you can train on your GPU? -A: _________________________________________________________________ - -Q: How would you improve throughput for the medium model? -A: _________________________________________________________________ - -Q: What's the optimal problem size for this workshop? -A: _________________________________________________________________ - -### 6.9 Key Takeaways - -**1. Problem Size Dramatically Affects Performance:** -- Small models: Memory-bound, low GPU utilization -- Large models: Compute-bound, high GPU utilization -- Medium models: Sweet spot for learning optimizations - -**2. Attention Memory Scales Quadratically:** -- Visible impact when comparing seq_len=256 vs 512 vs 1024 -- Flash Attention is critical for long sequences -- Memory becomes limiting factor before compute - -**3. Batch Size is a Key Tuning Parameter:** -- Larger batches improve GPU utilization -- Memory constraints force smaller batches for large models -- Trade-off between throughput and memory usage - -**4. Automated Testing is Valuable:** -- Pre-configured sizes reduce manual configuration errors -- Consistent testing methodology across problem sizes -- Easy to reproduce and compare results - -**5. 
Scaling Informs Optimization Strategy:** -- Tiny models: Not worth optimizing (I/O bound) -- Small-medium: Kernel fusion, mixed precision beneficial -- Large: Flash Attention, gradient checkpointing critical - -**Next Steps:** - -- Review all performance study results -- Document scaling patterns in your notes -- Identify which optimizations would have most impact -- Use baseline results to measure optimization improvements -- Proceed to comparative analysis with optimized versions - -**Exercise Complete When:** - -- [ ] At least 2 problem sizes tested (tiny + one other) -- [ ] Scaling patterns documented -- [ ] Memory and throughput metrics recorded -- [ ] Performance characteristics understood -- [ ] Optimization priorities identified -- [ ] Ready to compare with optimized versions - ---- - -**Next Exercise:** Exercise 4 - Comparative Analysis with Optimized Versions - ---- +## Additional Resources +- [PyTorch Profiler Documentation](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) +- [ROCm Documentation](https://rocm.docs.amd.com/) +- [DeepSpeed FLOPS Profiler](https://www.deepspeed.ai/tutorials/flops-profiler/) diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/README.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/README.md index 7c1f20d3..98ea075f 100644 --- a/MLExamples/TinyTransformer/version1_pytorch_baseline/README.md +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/README.md @@ -1,666 +1,155 @@ +# ML Example: TinyTransformer Baseline with ROCm Profiling -# Version 1: PyTorch Baseline - Profiling Foundation +README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` from the Training Examples repository. -README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository +In this example we provide a baseline PyTorch implementation of Tiny LLaMA for profiling transformer workloads on AMD GPUs. The model runs forward and backward passes with configurable batch size and sequence length, measuring training throughput. This workload is useful for understanding transformer performance characteristics and for learning ROCm profiling tools. Several profiling scripts are provided to capture different aspects of GPU performance, from high-level API traces to detailed hardware metrics. -## Overview +## Features of the profiling scripts -Version 1 establishes the profiling foundation for the workshop using a standard PyTorch implementation of Tiny LLaMA. This version focuses on comprehensive performance characterization using PyTorch native profiling and DeepSpeed FLOPS profiler, providing the baseline measurements for all subsequent optimizations. +The version1_pytorch_baseline example contains several profiling scripts that capture different aspects of GPU performance: -## Learning Objectives +- **get_trace.sh**: Runtime trace collection using rocprofv3. Captures HIP/HSA API calls, kernel execution timeline, memory operations (H2D, D2H, D2D transfers), and synchronization events. Output is a Perfetto trace file for timeline visualization. +- **get_counters.sh**: Kernel trace collection using rocprofv3. Captures kernel execution statistics including timing and call counts. Useful for identifying hotspot kernels and their execution patterns. +- **get_rocprof_compute.sh**: Detailed GPU hardware metrics using rocprof-compute. Provides comprehensive performance analysis including compute utilization, memory bandwidth, and hardware counter data. 
+- **get_rocprof_sys.sh**: System-level profiling using rocprof-sys. Captures call stack sampling and system-level performance data for end-to-end analysis. -After completing this version, you will be able to: +## Overview of the model -- Configure deterministic execution for reproducible profiling -- Use PyTorch Profiler for detailed operator-level analysis -- Integrate DeepSpeed FLOPS profiler for computational efficiency metrics -- Interpret profiling results and identify performance bottlenecks -- Establish baseline performance metrics for optimization comparison +The model is controlled with the following arguments: -## Architecture Overview +- `--batch-size `: batch size for training (default: 8) +- `--seq-len `: sequence length (default: 256) +- `--num-steps `: number of training steps (default: 50) +- `--hidden-dim `: hidden dimension (default: 512) +- `--num-layers `: number of transformer layers (default: 8) +- `--num-heads `: number of attention heads (default: 8) +- `--learning-rate `: learning rate (default: 3e-4) +- `--use-amp`: enable automatic mixed precision +- `--enable-pytorch-profiler`: enable PyTorch profiler +- `--enable-deepspeed-flops`: enable DeepSpeed FLOPS profiler -This implementation uses the standard transformer architecture with: +## Running the baseline -- **Multi-Head Attention**: Standard scaled dot-product attention -- **Feed-Forward Network**: SwiGLU activation with separate gate/up projections -- **Layer Normalization**: RMSNorm for improved training stability -- **Position Embeddings**: Rotary Position Embeddings (RoPE) +Load the required modules: -### Model Configuration - -```python -# Default Tiny LLaMA Configuration -vocab_size = 1000 # Small vocabulary for workshop -hidden_size = 256 # Model dimension -num_layers = 4 # Transformer layers -num_attention_heads = 8 # Attention heads -intermediate_size = 512 # FFN dimension -max_sequence_length = 128 # Context window ``` - -## Implementation Details - -### Mathematical Implementation - -This section provides detailed implementation specifics for the baseline PyTorch model. For complete mathematical foundations, see [TINY_LLAMA_ARCHITECTURE.md](../TINY_LLAMA_ARCHITECTURE.md). - -#### Standard PyTorch Attention Implementation - -The baseline attention mechanism follows standard PyTorch patterns: - -```python -def attention_forward(self, hidden_states, attention_mask=None): - batch_size, seq_len, _ = hidden_states.size() - - # Linear projections (separate operations - optimization target!) 
- query = self.q_proj(hidden_states) # [B, S, D] -> [B, S, D] - key = self.k_proj(hidden_states) # [B, S, D] -> [B, S, D] - value = self.v_proj(hidden_states) # [B, S, D] -> [B, S, D] - - # Reshape for multi-head attention - query = query.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - key = key.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - value = value.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - - # Apply rotary position embeddings - query, key = self.rotary_emb(query, key, seq_len) - - # Scaled dot-product attention - O(S^2) memory complexity - attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim) - - if attention_mask is not None: - attn_weights = attn_weights + attention_mask - - # Softmax over last dimension - attn_weights = F.softmax(attn_weights, dim=-1) - - # Apply attention to values - attn_output = torch.matmul(attn_weights, value) - - # Reshape and project output - attn_output = attn_output.transpose(1, 2).contiguous() - attn_output = attn_output.view(batch_size, seq_len, self.hidden_size) - attn_output = self.o_proj(attn_output) - - return attn_output -``` - -**Performance Characteristics:** -- **3 separate linear projections**: Creates kernel launch overhead -- **Attention matrix materialization**: $S \times S \times H$ memory usage -- **Multiple tensor reshapes**: Memory layout inefficiencies -- **Sequential operations**: Limited parallelization opportunities - -#### SwiGLU Feed-Forward Implementation - -```python -def swiglu_forward(self, hidden_states): - # Separate gate and up projections (optimization target!) - gate = self.gate_proj(hidden_states) # [B, S, D] -> [B, S, D_ff] - up = self.up_proj(hidden_states) # [B, S, D] -> [B, S, D_ff] - - # SiLU activation (Swish) - gate_activated = F.silu(gate) # Element-wise operation - - # Element-wise multiplication - intermediate = gate_activated * up # [B, S, D_ff] - - # Down projection - output = self.down_proj(intermediate) # [B, S, D_ff] -> [B, S, D] - - return output +module load pytorch rocm ``` -**Optimization Opportunities:** -- **Separate gate/up projections**: Can be fused into single GEMM -- **Intermediate tensor storage**: Memory overhead for gate_activated and up -- **Sequential activation**: SiLU can be fused with multiplication - -#### RMSNorm Implementation - -```python -def rms_norm_forward(self, hidden_states): - input_dtype = hidden_states.dtype - variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True) - hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) - return (self.weight * hidden_states).to(input_dtype) -``` +Run a basic training run: -**Implementation Details:** - -- **Variance computation**: Single reduction operation -- **Epsilon for numerical stability**: Prevents division by zero -- **Mixed precision handling**: Maintains numerical precision - -### Operator-Level Performance Analysis - -#### FLOP Breakdown by Operation Type - -```python -# Per transformer layer FLOP count (batch_size=1, seq_len=128) -FLOPS_BREAKDOWN = { - 'q_proj': seq_len * hidden_dim * hidden_dim, # 128 * 256 * 256 = 8.4M - 'k_proj': seq_len * hidden_dim * hidden_dim, # 128 * 256 * 256 = 8.4M - 'v_proj': seq_len * hidden_dim * hidden_dim, # 128 * 256 * 256 = 8.4M - 'attn_scores': seq_len * seq_len * hidden_dim, # 128 * 128 * 256 = 4.2M - 'attn_output': seq_len * seq_len * hidden_dim, # 128 * 128 * 256 = 4.2M - 'o_proj': seq_len * hidden_dim * hidden_dim, # 128 * 256 * 256 = 
8.4M - 'gate_proj': seq_len * hidden_dim * intermediate_dim, # 128 * 256 * 512 = 16.8M - 'up_proj': seq_len * hidden_dim * intermediate_dim, # 128 * 256 * 512 = 16.8M - 'down_proj': seq_len * intermediate_dim * hidden_dim, # 128 * 512 * 256 = 16.8M - 'rms_norm': 2 * seq_len * hidden_dim, # 2 * 128 * 256 = 65K -} - -# Total per layer: ~92.1M FLOPs -# Total model (4 layers): ~368M FLOPs per forward pass ``` - -#### Memory Access Patterns - -```python -# Memory bandwidth requirements per operation -MEMORY_BREAKDOWN = { - 'attention_qkv': { - 'parameters': 3 * hidden_dim * hidden_dim * 4, # 3 * 256^2 * 4B = 786KB - 'activations': seq_len * hidden_dim * 4, # 128 * 256 * 4B = 131KB - 'attention_matrix': seq_len * seq_len * num_heads * 4, # 128^2 * 8 * 4B = 524KB - 'bandwidth_requirement': 'memory-bound' # Limited by memory access - }, - 'feed_forward': { - 'parameters': 3 * hidden_dim * intermediate_dim * 4, # 3 * 256 * 512 * 4B = 1.57MB - 'activations': seq_len * intermediate_dim * 4, # 128 * 512 * 4B = 262KB - 'bandwidth_requirement': 'compute-bound' # Good arithmetic intensity - } -} +echo "Running TinyTransformer baseline" +python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 ``` -#### Kernel Launch Analysis - -The baseline implementation generates numerous kernel launches per forward pass: - -```python -# Typical kernel count per transformer layer -KERNEL_LAUNCHES = { - 'attention_block': { - 'q_projection': 1, # Linear layer - 'k_projection': 1, # Linear layer - 'v_projection': 1, # Linear layer - 'rope_application': 2, # For query and key - 'attention_computation': 3, # QK^T, softmax, attention*V - 'output_projection': 1, # Linear layer - 'residual_add': 1, # Element-wise addition - 'subtotal': 10 - }, - 'ffn_block': { - 'rms_norm': 1, # Normalization - 'gate_projection': 1, # Linear layer - 'up_projection': 1, # Linear layer - 'silu_activation': 1, # Element-wise SiLU - 'element_multiply': 1, # gate * up - 'down_projection': 1, # Linear layer - 'residual_add': 1, # Element-wise addition - 'subtotal': 7 - }, - 'layer_total': 17, # Per transformer layer - 'model_total': 68 # 4 layers * 17 kernels/layer -} -``` +For mixed precision training: -**Optimization Implications:** - -- **High kernel launch overhead**: 68+ kernels create GPU scheduling overhead -- **Memory bandwidth underutilization**: Many small operations -- **Fusion opportunities**: Adjacent operations can be combined - -### Profiling Data Interpretation - -#### PyTorch Profiler Output Analysis - -When analyzing PyTorch profiler results, focus on these key metrics: - -```python -# Key profiler metrics to examine -PROFILER_METRICS = { - 'operator_timing': { - 'aten::linear': 'Matrix multiplication operations', - 'aten::softmax': 'Attention softmax computation', - 'aten::add_': 'Residual connections', - 'aten::mul': 'Element-wise operations', - 'aten::rsqrt': 'RMSNorm operations' - }, - 'memory_analysis': { - 'peak_memory': 'Maximum GPU memory allocation', - 'memory_timeline': 'Memory usage over time', - 'fragmentation': 'Memory layout efficiency' - }, - 'gpu_utilization': { - 'kernel_efficiency': 'Individual kernel performance', - 'sm_efficiency': 'Streaming multiprocessor usage', - 'memory_bandwidth': 'Memory subsystem utilization' - } -} ``` - -#### Expected Bottleneck Patterns - -Based on the implementation analysis, expect these bottlenecks: - -```python -EXPECTED_BOTTLENECKS = { - 'attention_computation': { - 'percentage_of_time': '35-45%', - 'primary_issue': 'O(S^{2}) memory complexity', - 'kernel_count': '10 
per layer', - 'optimization_target': 'Flash Attention + QKV fusion' - }, - 'feed_forward_network': { - 'percentage_of_time': '30-40%', - 'primary_issue': 'Separate gate/up projections', - 'kernel_count': '7 per layer', - 'optimization_target': 'SwiGLU fusion' - }, - 'layer_normalization': { - 'percentage_of_time': '8-12%', - 'primary_issue': 'Memory-bound operation', - 'kernel_count': '2 per layer', - 'optimization_target': 'Kernel fusion with adjacent ops' - }, - 'residual_connections': { - 'percentage_of_time': '5-8%', - 'primary_issue': 'Memory bandwidth limitation', - 'kernel_count': '2 per layer', - 'optimization_target': 'Fusion with preceding operations' - } -} +echo "Running with automatic mixed precision" +python tiny_llama_v1.py --batch-size 16 --seq-len 128 --num-steps 10 --use-amp ``` -### Code Walkthrough: Critical Performance Paths - -#### Attention Hot Path Analysis - -```python -# Performance-critical code path in attention forward pass -@profile_function("attention_forward") # PyTorch profiler annotation -def forward(self, hidden_states, attention_mask=None, position_ids=None): - bsz, q_len, _ = hidden_states.size() - - # BOTTLENECK 1: Separate linear projections (3 kernel launches) - with nvtx.range("qkv_projections"): - query_states = self.q_proj(hidden_states) # Kernel launch 1 - key_states = self.k_proj(hidden_states) # Kernel launch 2 - value_states = self.v_proj(hidden_states) # Kernel launch 3 - - # Reshape for attention heads - query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) - key_states = key_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) - value_states = value_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) +## Runtime Trace Profiling with get_trace.sh - # BOTTLENECK 2: Attention computation (O(S^2) memory) - with nvtx.range("attention_computation"): - # Attention scores: [bsz, num_heads, q_len, kv_seq_len] - attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) +This script captures GPU API calls, kernel launches, and memory operations for timeline analysis. - if attention_mask is not None: - attn_weights = attn_weights + attention_mask +Run the profiling script: - # BOTTLENECK 3: Softmax (memory-bound) - attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) - - # BOTTLENECK 4: Attention application - attn_output = torch.matmul(attn_weights, value_states) - - # Reshape and output projection - attn_output = attn_output.transpose(1, 2).contiguous() - attn_output = attn_output.view(bsz, q_len, self.hidden_size) - attn_output = self.o_proj(attn_output) # Kernel launch 4 - - return attn_output, attn_weights +``` +echo "Collecting runtime trace with rocprofv3" +./get_trace.sh ``` -**Profiling Annotations:** - -- `@profile_function`: Enables detailed timing analysis -- `nvtx.range()`: Creates named regions in profiler traces -- Performance counters will show exact kernel timing - -## Workshop Exercises - -### Exercise 1: Baseline Performance Analysis - -**Objective**: Establish baseline performance metrics and identify computational bottlenecks. +The script will output results to `traces/trace_/`. 
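+
+For reference, the collection step wrapped by the script is roughly of this form (a sketch; the exact arguments inside get_trace.sh may differ):
+
+```
+# Runtime trace with Perfetto output, wrapping the baseline training run
+rocprofv3 --runtime-trace --output-format pftrace -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10
+```
+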
To analyze the results: -#### Step 1: Run Basic Training -```bash -# Basic training without profiling -python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 - -# Expected output: Training loss progression and timing info ``` - -#### Step 2: Enable PyTorch Profiler -```bash -# Make directory for the profiles -mkdir pytorch_profiles -# Run with PyTorch profiler enabled -python tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 10 \ - --enable-pytorch-profiler \ - --profile-dir ./pytorch_profiles - -# This generates detailed profiling traces in pytorch_profiles/ +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .pftrace file" ``` -#### Step 3: Analyze Profiling Results -```bash -# Launch TensorBoard to visualize profiles -tensorboard --logdir pytorch_profiles --port 6006 +If a `.db` file is generated instead (ROCm 7.x without --output-format): -# Or generate text report -python run_pytorch_profiler.py --analyze-existing pytorch_profiles/profile_*.json +``` +echo "Converting database to Perfetto format" +rocpd2pftrace -i -o trace.pftrace ``` -**Expected Analysis Results:** - -- Attention operations consuming ~40% of compute time -- Matrix multiplications (GEMM) as primary compute kernels -- Memory transfer overhead between operations -- GPU utilization patterns +## Kernel Trace Profiling with get_counters.sh -#### Step 4: DeepSpeed FLOPS Analysis -```bash -# Run with DeepSpeed FLOPS profiler -python run_deepspeed_flops.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 10 +This script collects kernel execution statistics including timing and call counts. -# Analyze computational intensity -python run_deepspeed_flops.py --analyze-results flops_profile.json -``` +Run the profiling script: -**Expected FLOPS Analysis:** - -- Total FLOPS per forward/backward pass -- FLOPS breakdown by operation type -- Model FLOPS Utilization (MFU) calculation -- Memory bandwidth requirements - -### Exercise 2: Memory Analysis and Optimization - -**Objective**: Understand memory usage patterns and bandwidth requirements. - -#### Step 1: Memory Profiling -```bash -# Run with memory profiling enabled -python tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --enable-pytorch-profiler \ - --profile-memory \ - --profile-dir ./memory_analysis - -# Generate memory timeline visualization -python -c " -import torch -from torch.profiler import profile, record_function, ProfilerActivity -# Memory analysis code will be embedded in tiny_llama_v1.py -" ``` - -#### Step 2: Batch Size Scaling -```bash -# Test different batch sizes -for bs in 4 8 16 32; do - echo \"Testing batch size: \$bs\" - python tiny_llama_v1.py \ - --batch-size \$bs \ - --seq-len 128 \ - --num-steps 5 \ - --enable-pytorch-profiler \ - --profile-dir ./scaling_bs\$bs -done - -# Analyze scaling behavior -python analyze_batch_scaling.py --profile-dirs scaling_bs* +echo "Collecting kernel trace with rocprofv3" +./get_counters.sh ``` -**Expected Memory Analysis:** - -- Memory usage scaling with batch size -- Peak memory allocation points -- Memory fragmentation patterns -- Opportunities for memory optimization - -### Exercise 3: Bottleneck Identification +The script will output results to `counters/counter_/`. -**Objective**: Identify computational and memory bottlenecks for optimization targets. +ROCm 6.x outputs CSV files directly, while ROCm 7.x outputs SQLite databases. 
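+
+For reference, the kernel-trace collection wrapped by the script is roughly of this form (a sketch; the exact flags inside get_counters.sh may differ):
+
+```
+# Kernel trace of the baseline training run
+rocprofv3 --kernel-trace -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10
+```
+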
For ROCm 7.x database files, use rocpd tools: -#### Step 1: Operator-Level Analysis -```bash -# Detailed operator timing -python tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --enable-pytorch-profiler \ - --profile-operators \ - --sort-by cuda_time_total - -# Generate bottleneck report -python analyze_bottlenecks.py \ - --profile-data pytorch_profiles/ \ - --output-report bottlenecks_v1.md ``` - -#### Step 2: Attention Pattern Analysis -```bash -# Focus on attention computation -python tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 128 \ - --profile-attention-only \ - --enable-pytorch-profiler +echo "Exporting kernel statistics to CSV" +rocpd2csv -i -o kernel_stats.csv ``` -#### Step 3: Matrix Multiplication Analysis -```bash -# GEMM operation profiling -python analyze_gemm_operations.py \ - --model-config tiny_llama_v1_config.yaml \ - --batch-sizes \"4,8,16,32\" \ - --sequence-lengths \"64,128,256\" +``` +echo "Getting kernel summary" +rocpd summary -i --region-categories KERNEL ``` -**Expected Bottleneck Analysis:** - -- Attention QKV projection overhead -- Softmax computation inefficiency -- Multiple small GEMM operations -- Memory-bound operations identification - -## Profiling Tools Integration +Documentation for rocpd tools: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html -### PyTorch Profiler Configuration +## GPU Hardware Metrics with get_rocprof_compute.sh -The implementation includes comprehensive PyTorch profiler integration: +This script collects detailed GPU performance metrics for hardware utilization analysis. -```python -# In tiny_llama_v1.py -from torch.profiler import profile, record_function, ProfilerActivity +Run the profiling script: -# Profiler configuration -profiler_config = { - 'activities': [ProfilerActivity.CPU, ProfilerActivity.CUDA], - 'record_shapes': True, - 'profile_memory': True, - 'with_stack': True, - 'with_flops': True, - 'experimental_config': torch._C._profiler._ExperimentalConfig(verbose=True) -} ``` - -### DeepSpeed FLOPS Profiler Integration - -```python -# FLOPS profiler setup -from deepspeed.profiling.flops_profiler import FlopsProfiler - -profiler = FlopsProfiler(model) -profiler.start_profile() -# Training step -profiler.stop_profile() -profiler.print_model_profile(profile_step=1) +echo "Collecting GPU hardware metrics with rocprof-compute" +./get_rocprof_compute.sh ``` -## Key Performance Metrics - -### Baseline Performance Expectations - -On a typical AMD MI200 series GPU: - -| Metric | Expected Range | Notes | -|--------|----------------|-------| -| **Training Speed** | 50-100 samples/sec | Batch size dependent | -| **GPU Utilization** | 60-75% | Standard PyTorch efficiency | -| **Memory Usage** | 2-4 GB | Model + batch data | -| **FLOPS Utilization** | 30-45% | Baseline MFU | -| **Memory Bandwidth** | 40-60% | Memory-bound operations | - -### Profiling Output Files - -After running exercises, expect these output files: +The script will output results to `rocprof_compute/profile_/`. 
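+
+The collection step inside the script is roughly of this form (a sketch; the workload name here is a placeholder and the script's actual arguments may differ):
+
+```
+# Hardware-counter collection for the baseline training run
+rocprof-compute profile -n tiny_llama -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10
+```
+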
To analyze the results: ``` -version1_pytorch_baseline/ -├── pytorch_profiles/ -│ ├── profile_*.json # PyTorch profiler traces -│ ├── trace_*.json # Chrome trace format -│ └── memory_timeline.html # Memory usage visualization -├── flops_analysis/ -│ ├── flops_profile.json # FLOPS breakdown -│ ├── model_profile.txt # Detailed model analysis -│ └── mfu_analysis.csv # Model FLOPS Utilization -└── bottleneck_analysis/ - ├── bottlenecks_v1.md # Comprehensive bottleneck report - ├── operator_timing.csv # Per-operator performance - └── optimization_targets.json # Prioritized optimization opportunities +echo "Generating performance analysis report" +rocprof-compute analyze -p /workloads//rocprof --dispatch -n tiny_llama_dispatch ``` -## Expected Analysis Results - -### Performance Characteristics - -1. **Compute Distribution**: - - Attention operations: ~40% of total time - - Feed-forward network: ~35% of total time - - Layer normalization: ~10% of total time - - Other operations: ~15% of total time - -2. **Memory Patterns**: - - Peak memory usage during attention computation - - Multiple intermediate tensor allocations - - Memory fragmentation from varying tensor sizes - -3. **Optimization Opportunities**: - - Kernel fusion potential in attention - - Memory layout optimization - - Reduced intermediate tensor creation +For available analysis options: -### Bottleneck Identification - -Primary bottlenecks to address in subsequent versions: - -1. **Separate QKV projections** → Fusion opportunity -2. **Standard attention computation** → Flash Attention -3. **Individual FFN gates** → SwiGLU fusion -4. **Multiple kernel launches** → Custom kernels - -## Troubleshooting - -### Common Issues - -#### CUDA/ROCm Memory Errors -```bash -# Reduce batch size if memory errors occur -python tiny_llama_v1.py --batch-size 4 --seq-len 64 ``` - -#### Profiler Permission Issues -```bash -# Ensure proper permissions for profiling -export ROCPROF_COMPUTE_DISABLE_AQL_DEBUG=1 +rocprof-compute analyze --help ``` -#### Missing Profiling Output -```bash -# Check profiling directory permissions -mkdir -p pytorch_profiles -chmod 755 pytorch_profiles -``` +Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) for full hardware counter support. Consumer GPUs may have limited counter availability. -### Performance Validation +## System-Level Profiling with get_rocprof_sys.sh -To validate your setup is working correctly: +This script captures system-level performance with call stack sampling. -```bash -# Quick validation run -python tiny_llama_v1.py \ - --batch-size 4 \ - --seq-len 64 \ - --num-steps 3 \ - --enable-pytorch-profiler \ - --validate-setup +Run the profiling script: -# Expected: Successful completion with profiling files generated ``` - -## Next Steps - -After completing all exercises in Version 1: - -1. **Review baseline metrics** - Understand current performance characteristics -2. **Identify optimization targets** - Use bottleneck analysis to prioritize improvements -3. **Prepare for Version 2** - Kernel fusion will address primary bottlenecks -4. **Document findings** - Record baseline measurements for comparison - -**Ready for optimization? 
Proceed to [Version 2: PyTorch Fused](../version2_pytorch_fused/README.md)** - ---- - -## Performance Summary Template - -Use this template to document your Version 1 results: - +echo "Collecting system-level profile with rocprof-sys" +./get_rocprof_sys.sh ``` -# Version 1 Baseline Results - -## Configuration - -- Batch Size: ___ -- Sequence Length: ___ -- GPU: ___ -- ROCm Version: ___ - -## Performance Metrics -- Training Speed: ___ samples/sec -- GPU Utilization: ___% -- Memory Usage: ___ GB -- FLOPS Utilization: ___% +The script will output results to `rocprof_sys/profile_/`. To analyze the results: -## Top Bottlenecks - -1. _________________ (__% of time) -2. _________________ (__% of time) -3. _________________ (__% of time) - -## Optimization Targets for Version 2 - -- [ ] QKV fusion -- [ ] Flash Attention -- [ ] SwiGLU fusion -- [ ] Other: ___________ ``` +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .proto file" +``` + +Note: rocprof-sys may produce memory map dumps in some configurations. If profiling fails or produces excessive output, consider using rocprofv3 (get_trace.sh) instead. +## Additional Resources +- rocprofv3 documentation: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocprofv3.html +- rocpd output format: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html +- Perfetto UI: https://ui.perfetto.dev/ diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/ROCPROFV3_VERSION1_RESULTS.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/ROCPROFV3_VERSION1_RESULTS.md new file mode 100644 index 00000000..9736c662 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/ROCPROFV3_VERSION1_RESULTS.md @@ -0,0 +1,66 @@ +# rocprofv3 Test Results - Version 1 Baseline + +ROCPROFV3_VERSION1_RESULTS.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository. + +## Summary + +rocprofv3 successfully captures profiling data from version1 baseline. This document shows example results from runtime trace collection. + +## Test Configuration + +**Command:** + +``` +rocprofv3 --runtime-trace --output-format pftrace -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +``` + +**Environment:** +- ROCm Version: 6.4.x +- PyTorch: ROCm-enabled build +- GPU: AMD Instinct or Radeon with gfx support + +## Example Output + +``` +Training completed: 10 steps, batch size 8 +Performance: 262.3 samples/sec, 33,571 tokens/sec +Memory usage: 434.3 MB peak +rocprofv3 exit code: 0 (success) +``` + +## Generated Files + +Output directory contains Perfetto trace files: + +| File | Size | Description | +|------|------|-------------| +| `_results.pftrace` | ~40-50 MB | Main trace with full profiling data | +| Additional `.pftrace` files | ~600 bytes | Minimal traces from subprocesses | + +The main trace file (largest) contains the full profiling data for timeline analysis. + +## Viewing the Trace + +1. Visit https://ui.perfetto.dev/ +2. Click "Open trace file" +3. Select the main `.pftrace` file +4. 
Examine: + - GPU kernel timeline + - Memory transfer operations + - HIP API calls + - Kernel duration and overlap + +## Warnings + +The following warning may appear and can be ignored: + +``` +rocprofiler_iterate_agent_supported_counters returned ROCPROFILER_STATUS_ERROR_AGENT_ARCH_NOT_SUPPORTED for agent X (gfxXXXX) +``` + +This typically relates to integrated GPUs or unsupported architectures and does not affect profiling of the target GPU. + +## Additional Resources + +- rocprofv3 documentation: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocprofv3.html +- Perfetto UI: https://ui.perfetto.dev/ diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_1_baseline_analysis.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_1_baseline_analysis.md index 1cb9b199..b30e4884 100644 --- a/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_1_baseline_analysis.md +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_1_baseline_analysis.md @@ -1,53 +1,26 @@ +# Exercise 1: Baseline Performance Analysis -## Exercise 1: Baseline Performance Analysis +exercise_1_baseline_analysis.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises` in the Training Examples repository. -`exercise1_baseline_analysis.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository +## Objective -### Objective -Establish baseline performance metrics for Tiny LLaMA V1 and understand the profiling methodology that will be used throughout the workshop. +Establish baseline performance metrics for Tiny LLaMA V1 and understand profiling methodology. -### Prerequisites +## Step 1: Run Baseline Training -- Completed environment setup from `../setup/` -- Verified environment with validation scripts - -### Duration -**Estimated Time:** 20-30 minutes - -### Instructions - -#### Step 1: Run Baseline Training (5 minutes) - -First, let's run the basic model without any profiling to establish a clean baseline: - -```bash -## Navigate to version1_pytorch_baseline directory +``` cd version1_pytorch_baseline - -## Run basic training python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -**Expected Output:** - +Expected output: - Model configuration summary - Training progress with loss values - Performance metrics (samples/sec, memory usage) -- Final performance summary -**📝 Record the following baseline metrics:** +## Step 2: Enable PyTorch Profiler -- Training speed: _____ samples/sec -- Peak memory usage: _____ MB -- Final loss: _____ -- Average batch time: _____ ms - -#### Step 2: Enable Basic Profiling (10 minutes) - -Now let's add PyTorch profiler to understand what's happening under the hood: - -```bash -# Run with PyTorch profiler enabled +``` mkdir exercise1_profiles python tiny_llama_v1.py \ --batch-size 8 \ @@ -57,199 +30,48 @@ python tiny_llama_v1.py \ --profile-dir ./exercise1_profiles ``` -**Expected Output:** +Profile files will be generated in `./exercise1_profiles/`. -- Same training output as before -- Additional profiling information -- Profile files generated in `./exercise1_profiles/` +## Step 3: Analyze Results -**📝 Answer these questions:** +Launch TensorBoard to visualize profiling results: -1. How much overhead did profiling add to training time? -2. What files were generated in the `exercise1_profiles/` directory? -3. 
What's the difference in memory usage with profiling enabled? - -#### Step 3: Analyze Profiling Results (10 minutes) - -Launch TensorBoard to visualize the profiling results: +``` +tensorboard --logdir ./exercise1_profiles --port 6006 +``` -```bash -## Launch TensorBoard (run in background) -tensorboard --logdir ./exercise1_profiles --port 6006 & +Alternatively, examine JSON traces directly: -## If TensorBoard is not available, examine the JSON traces +``` ls -la ./exercise1_profiles/ ``` -**TensorBoard Analysis:** - -1. Open your browser to `http://localhost:6006` -2. Navigate to the "PROFILE" tab -3. Select the most recent run - -**📝 Explore and document:** - -**Trace Timeline:** - -- What are the top 3 longest-running operations? - 1. _________________ - 2. _________________ - 3. _________________ - -**Operator View:** - -- Which operation consumes the most GPU time? -- What percentage of time is spent in attention operations? -- How many different kernel types are launched? - -**Memory Timeline:** - -- What is the peak memory usage? -- When does peak memory occur (forward/backward pass)? -- Are there any memory spikes or unusual patterns? - -#### Step 4: Identify Performance Patterns (5 minutes) - -Based on your analysis, identify patterns in the baseline model: - -**📝 Pattern Analysis:** - -**Compute Patterns:** - -- [ ] Attention operations dominate compute time -- [ ] Matrix multiplications are the primary kernels -- [ ] Many small operations with low utilization -- [ ] Memory transfers visible between operations - -**Memory Patterns:** +## Key Observations -- [ ] Memory usage grows during forward pass -- [ ] Peak memory during attention computation -- [ ] Frequent small allocations -- [ ] Memory fragmentation visible +Typical baseline performance characteristics: +- Training speed: 50-100 samples/sec (varies by hardware) +- GPU utilization: 60-75% +- Memory usage: 2-4 GB depending on batch size +- Kernel count: 40-50 different kernel launches per step -**Optimization Opportunities:** - -Based on the profiling results, which of these optimizations would likely provide the biggest benefit: - -- [ ] Kernel fusion (reduce number of operations) -- [ ] Memory layout optimization -- [ ] Flash Attention implementation -- [ ] Mixed precision training -- [ ] Batch size scaling - -### Expected Results - -After completing this exercise, you should have: - -#### Performance Baseline - -- **Training Speed**: 50-100 samples/sec (varies by hardware) -- **GPU Utilization**: 60-75% (typical for baseline PyTorch) -- **Memory Usage**: 2-4 GB depending on batch size -- **Kernel Count**: 40-50 different kernel launches per step - -#### Key Observations +## Optimization Opportunities +Based on profiling analysis: - Attention operations consume ~40% of total compute time - Matrix multiplications (GEMM) are the dominant kernels - Multiple small operations create kernel launch overhead - Memory allocation patterns show optimization opportunities -#### Profiling Data Generated -``` -exercise1_profiles/ -├── events.out.tfevents.* # TensorBoard events -├── trace_step_*.json # Chrome trace files -├── performance_summary.json # Performance metrics -└── [additional profile files] -``` - -### Troubleshooting +## Troubleshooting -#### Common Issues +CUDA/ROCm memory errors: -**1. CUDA/ROCm Memory Errors** -```bash -## Reduce batch size if you get OOM errors +``` python tiny_llama_v1.py --batch-size 4 --seq-len 64 --num-steps 10 ``` -**2. 
Profiling Files Not Generated** -```bash -## Check permissions and disk space -ls -la ./exercise1_profiles/ -df -h . -``` +Check GPU utilization: -**3. TensorBoard Not Loading** -```bash -## Try different port or check firewall -tensorboard --logdir ./exercise1_profiles --port 6007 -## Or examine JSON files directly -python -c "import json; print(json.load(open('./exercise1_profiles/performance_summary.json')))" ``` - -**4. Low GPU Utilization** -```bash -## Check if GPU is being used -nvidia-smi # for NVIDIA -## or -rocm-smi # for AMD +rocm-smi ``` - -### Analysis Questions - -**📝 Answer these questions based on your results:** - -1. **What is the primary bottleneck in the baseline model?** - - [ ] Memory bandwidth - - [ ] Compute utilization - - [ ] Kernel launch overhead - - [ ] Data loading - -2. **Which operations would benefit most from fusion?** - - [ ] QKV projections in attention - - [ ] Gate/Up projections in SwiGLU - - [ ] Layer normalization operations - - [ ] All of the above - -3. **What is the Model FLOPS Utilization (rough estimate)?** - - [ ] < 20% (memory bound) - - [ ] 20-40% (mixed workload) - - [ ] 40-60% (compute bound) - - [ ] > 60% (highly optimized) - -4. **Based on memory usage patterns, what optimization would help most?** - - [ ] Gradient checkpointing - - [ ] Flash Attention - - [ ] Mixed precision - - [ ] Tensor fusion - -### Next Steps - -After completing this exercise: - -1. **Document your findings** using the performance template in the main README -2. **Compare with expected results** - are your metrics in the expected ranges? -3. **Identify top 3 optimization targets** for Version 2 -4. **Proceed to Exercise 2** for memory analysis -5. **Save your profiling data** - you'll compare against Version 2 later - -### Success Criteria - -**Exercise Complete When:** - -- [ ] Baseline training runs successfully -- [ ] Profiling data generated and analyzed -- [ ] Performance metrics documented -- [ ] Bottlenecks identified -- [ ] Ready to proceed to memory analysis - ---- - -**Key Takeaway**: The baseline model provides a solid foundation for optimization. The profiling data clearly shows opportunities for kernel fusion, memory optimization, and attention improvements that will be addressed in subsequent versions. - -**Next Exercise**: [Exercise 2 - Memory Analysis](exercise_2_memory_analysis.md) - - diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_2_memory_analysis.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_2_memory_analysis.md index e35626b7..89a2bc9d 100644 --- a/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_2_memory_analysis.md +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_2_memory_analysis.md @@ -1,42 +1,19 @@ +# Exercise 2: Memory Analysis and Optimization -## Exercise 2: Memory Analysis and Optimization +exercise_2_memory_analysis.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises` in the Training Examples repository. -`exercise2_memory_analysis.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository +## Objective -### Objective -Understand memory usage patterns, identify memory bottlenecks, and analyze memory bandwidth utilization in the baseline Tiny LLaMA model. +Understand memory usage patterns, identify memory bottlenecks, and analyze memory bandwidth utilization. 
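To set expectations for the measurements in the steps below, here is a rough, illustrative estimate of the attention score memory, which has shape `[B, H, S, S]` per layer; the head count and fp32 dtype are assumptions for this sketch, not values read from the model.

```
# Back-of-the-envelope attention score memory per layer, fp32 assumed
B, H, S = 8, 8, 128                  # batch size, heads (assumed), sequence length
bytes_per_elem = 4                   # fp32
per_layer = B * H * S * S * bytes_per_elem
print(f"Attention scores at S=128: {per_layer / 2**20:.1f} MiB per layer")

# Quadrupling the sequence length grows this term 16x, which is why Step 3
# below sweeps sequence length separately from batch size
per_layer_512 = B * H * 512 * 512 * bytes_per_elem
print(f"Attention scores at S=512: {per_layer_512 / 2**20:.1f} MiB per layer")
```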
-### Prerequisites +## Step 1: Memory Profiling with Different Batch Sizes -- Completed Exercise 1 -- Basic understanding of GPU memory hierarchy - -### Duration -**Estimated Time:** 25-30 minutes - -### Background - -Memory optimization is crucial for transformer models because: - -- **Memory Bandwidth**: Often the limiting factor for inference -- **Peak Memory**: Determines maximum batch size and model size -- **Memory Fragmentation**: Can reduce effective memory utilization -- **Attention Memory**: Quadratic scaling with sequence length - -### Instructions - -#### Step 1: Memory-Focused Profiling (10 minutes) - -Run profiling with enhanced memory analysis: - -```bash -# Memory profiling with different batch sizes +``` python tiny_llama_v1.py \ --batch-size 4 \ --seq-len 128 \ --num-steps 15 \ --enable-pytorch-profiler \ - --enable-memory-profiling \ --profile-dir ./memory_analysis_bs4 python tiny_llama_v1.py \ @@ -44,7 +21,6 @@ python tiny_llama_v1.py \ --seq-len 128 \ --num-steps 15 \ --enable-pytorch-profiler \ - --enable-memory-profiling \ --profile-dir ./memory_analysis_bs8 python tiny_llama_v1.py \ @@ -52,279 +28,63 @@ python tiny_llama_v1.py \ --seq-len 128 \ --num-steps 15 \ --enable-pytorch-profiler \ - --enable-memory-profiling \ --profile-dir ./memory_analysis_bs16 ``` -**📝 Record memory usage for each batch size:** - -| Batch Size | Peak Memory (MB) | Avg Memory (MB) | Training Speed (samples/sec) | -|------------|------------------|-----------------|------------------------------| -| 4 | | | | -| 8 | | | | -| 16 | | | | - -#### Step 2: Memory Timeline Analysis (10 minutes) +## Step 2: Memory Timeline Analysis -Analyze memory patterns using TensorBoard: +Launch TensorBoard for memory analysis: -```bash -# Launch TensorBoard for memory analysis -tensorboard --logdir ./memory_analysis_bs8 --port 6007 ``` - -In TensorBoard: - -1. Go to the **PROFILE** tab -2. Select **Memory Timeline** view -3. Examine the memory usage pattern - -**📝 Memory Pattern Analysis:** - -**Memory Allocation Timeline:** - -- At what point does memory usage peak? ________________ -- What operations cause the largest memory spikes? ________________ -- Are there memory deallocations visible? ________________ - -**Memory Efficiency:** - -- Is memory usage steady or fluctuating? ________________ -- Are there unnecessary memory allocations? ________________ -- What's the memory utilization pattern during attention? ________________ - -#### Step 3: Sequence Length Scaling (8 minutes) - -Test how memory scales with sequence length: - -```bash -# Test different sequence lengths -python tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 64 \ - --num-steps 10 \ - --enable-memory-profiling \ - --profile-dir ./memory_seq64 - -python tiny_llama_v1.py \ - --batch-size 8 \ - --seq-len 256 \ - --num-steps 10 \ - --enable-memory-profiling \ - --profile-dir ./memory_seq256 - -# Note: seq-len 512 might OOM - try with smaller batch size if needed -python tiny_llama_v1.py \ - --batch-size 4 \ - --seq-len 512 \ - --num-steps 5 \ - --enable-memory-profiling \ - --profile-dir ./memory_seq512 +tensorboard --logdir ./memory_analysis_bs8 --port 6007 ``` -**📝 Sequence Length Scaling Analysis:** - -| Seq Length | Batch Size | Peak Memory (MB) | Memory per Token | Scaling Pattern | -|------------|------------|------------------|------------------|-----------------| -| 64 | 8 | | | | -| 128 | 8 | | | | -| 256 | 8 | | | | -| 512 | 4 | | | | +In TensorBoard, navigate to the PROFILE tab and select Memory Timeline view. 
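If TensorBoard is not available, peak memory can also be compared directly with PyTorch's CUDA memory statistics, which work on ROCm builds as well. The sketch below is illustrative; `run_one_training_step` stands in for one forward/backward pass of your model and is not a function provided by this example.

```
import torch

def report_peak_memory(run_one_training_step, device="cuda"):
    # Reset the allocator's high-water mark, run one step, then read the peak
    torch.cuda.reset_peak_memory_stats(device)
    run_one_training_step()
    torch.cuda.synchronize(device)
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    print(f"Peak allocated memory: {peak_mb:.1f} MB")

# Usage sketch (the callable is hypothetical):
# report_peak_memory(lambda: train_step(model, batch, optimizer))
```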
-**Memory Scaling Questions:** +## Step 3: Sequence Length Scaling -1. Is memory scaling linear, quadratic, or something else with sequence length? -2. Which component shows the steepest memory scaling? -3. At what sequence length do you hit memory limits? +Test memory scaling with sequence length: -#### Step 4: Memory Bandwidth Analysis (7 minutes) - -Use the memory profiling results to analyze bandwidth utilization: - -```bash -# Run bandwidth-focused analysis -python run_deepspeed_flops.py \ - --batch-size 8 \ - --seq-len 128 \ - --num-steps 15 \ - --computational-intensity \ - --output-dir ./bandwidth_analysis ``` - -**📝 Bandwidth Analysis Results:** - -Check the `bandwidth_analysis/computational_intensity.json` file: - -```bash -# View bandwidth metrics -python -c " -import json -data = json.load(open('./bandwidth_analysis/computational_intensity.json')) -print('Arithmetic Intensity:', data['arithmetic_intensity_flops_per_byte']) -print('Memory Bandwidth Used:', data['memory_bandwidth_used_gb_per_sec'], 'GB/s') -print('Bandwidth Utilization:', data['memory_bandwidth_utilization_percent'], '%') -print('Workload Type:', data['memory_bound_vs_compute_bound']) -" +python tiny_llama_v1.py --batch-size 8 --seq-len 64 --num-steps 10 +python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +python tiny_llama_v1.py --batch-size 8 --seq-len 256 --num-steps 10 +python tiny_llama_v1.py --batch-size 4 --seq-len 512 --num-steps 5 ``` -**Key Metrics:** - -- Arithmetic Intensity: _______ FLOPS/byte -- Memory Bandwidth Used: _______ GB/s -- Bandwidth Utilization: _______ % -- Workload Classification: _______ - -### Analysis and Interpretation +## Expected Observations -#### Step 5: Memory Optimization Opportunities (10 minutes) +**Memory Scaling:** +- Memory scales approximately linearly with batch size +- Memory scales quadratically with sequence length (due to attention matrices) -Based on your analysis, identify optimization opportunities: +**Memory Hotspots:** +- Attention QKV matrices +- Attention score computation `[B, H, S, S]` +- FFN intermediate tensors -**📝 Memory Optimization Assessment:** +**Bandwidth Classification:** +- Arithmetic Intensity < 10 FLOPS/byte: Memory-bound +- Arithmetic Intensity 10-100 FLOPS/byte: Mixed workload +- Arithmetic Intensity > 100 FLOPS/byte: Compute-bound -**1. Memory Scaling Efficiency** +## Optimization Targets -- [ ] Linear scaling with batch size (good) -- [ ] Quadratic scaling with sequence length (attention bottleneck) -- [ ] Peak memory much higher than average (fragmentation) -- [ ] Memory plateaus (good memory reuse) +1. **Flash Attention**: Reduce attention memory from O(S^2) to O(S) +2. **Gradient Checkpointing**: Trade compute for memory +3. **Mixed Precision (FP16/BF16)**: 2x memory reduction +4. **Kernel Fusion**: Reduce intermediate tensor allocations -**2. Bandwidth Utilization** +## Troubleshooting -- [ ] High bandwidth utilization (>70%) - compute bound -- [ ] Medium bandwidth utilization (30-70%) - mixed workload -- [ ] Low bandwidth utilization (<30%) - memory bound +Out of memory errors: -**3. Memory Hotspots** (check profiling results) - -- [ ] Attention QKV matrices -- [ ] Attention score computation -- [ ] Feed-forward intermediate tensors -- [ ] Gradient accumulation - -**4. 
Optimization Targets** - -Rank these optimizations by memory impact (1=highest, 4=lowest): -- [ ] Flash Attention (reduce attention memory) - Rank: ___ -- [ ] Gradient checkpointing (trade compute for memory) - Rank: ___ -- [ ] Mixed precision (reduce memory per parameter) - Rank: ___ -- [ ] Tensor fusion (reduce intermediate allocations) - Rank: ___ - -#### Step 6: Memory Bottleneck Identification (5 minutes) - -Determine if your workload is memory-bound or compute-bound: - -**📝 Bottleneck Classification:** - -Based on your bandwidth analysis: - -- **Arithmetic Intensity < 10 FLOPS/byte** → Memory-bound workload -- **Arithmetic Intensity 10-100 FLOPS/byte** → Mixed workload -- **Arithmetic Intensity > 100 FLOPS/byte** → Compute-bound workload - -**Your Classification:** _______________________ - -**Evidence:** - -- Arithmetic intensity: _______ FLOPS/byte -- Memory bandwidth utilization: _______ % -- GPU compute utilization: _______ % (from Exercise 1) - -**Primary Bottleneck:** - -- [ ] Memory bandwidth (low compute util, high memory util) -- [ ] Compute throughput (high compute util, low memory util) -- [ ] Mixed (balanced utilization) -- [ ] Kernel overhead (low both) - -### Expected Results - -#### Memory Usage Patterns - -- **Peak Memory Growth**: Approximately linear with batch size -- **Sequence Scaling**: Quadratic scaling due to attention matrices -- **Memory Hotspots**: Attention computation and intermediate tensors -- **Bandwidth Utilization**: 30-60% on most modern GPUs - -#### Key Findings - -1. **Attention Memory**: Consumes significant memory, scales quadratically -2. **Memory Fragmentation**: Multiple small allocations create overhead -3. **Peak vs Average**: Large difference indicates optimization opportunity -4. **Bandwidth Bound**: Likely memory-bound for typical configurations - -### Troubleshooting - -**Out of Memory Errors:** -```bash -# Reduce batch size and/or sequence length +``` python tiny_llama_v1.py --batch-size 2 --seq-len 64 ``` -**Memory Profiling Failed:** -```bash -# Check CUDA memory debugging -export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 -``` +Memory fragmentation: -**Bandwidth Analysis Error:** -```bash -# Check DeepSpeed installation -pip install deepspeed ``` - -### Analysis Questions - -**📝 Critical Analysis Questions:** - -1. **What is the memory scaling behavior?** - - Batch size scaling: [ ] Linear [ ] Quadratic [ ] Exponential - - Sequence length scaling: [ ] Linear [ ] Quadratic [ ] Exponential - -2. **Where is peak memory consumed?** - - [ ] During forward pass (activations) - - [ ] During backward pass (gradients) - - [ ] During optimizer step (parameters) - -3. **What is the primary memory optimization target?** - - [ ] Reduce attention memory (Flash Attention) - - [ ] Reduce activation memory (checkpointing) - - [ ] Reduce parameter memory (mixed precision) - - [ ] Reduce fragmentation (tensor fusion) - -4. **Is the workload memory-bound or compute-bound?** - - [ ] Memory-bound (low arithmetic intensity) - - [ ] Compute-bound (high arithmetic intensity) - - [ ] Mixed workload (balanced) - -5. **What memory optimization would provide the biggest benefit?** - - [ ] Flash Attention (quadratic → linear attention memory) - - [ ] Gradient checkpointing (trade compute for memory) - - [ ] Mixed precision FP16/BF16 (2x memory reduction) - - [ ] Tensor fusion (reduce intermediate allocations) - -### Next Steps - -1. **Document your memory analysis** results -2. **Compare memory patterns** across different configurations -3. 
**Identify top memory optimization targets** for Version 2 -4. **Understand the memory vs compute trade-offs** -5. **Proceed to Exercise 3** for bottleneck identification - -### Success Criteria - -**Exercise Complete When:** - -- [ ] Memory profiling completed for multiple configurations -- [ ] Memory scaling patterns understood -- [ ] Bandwidth utilization analyzed -- [ ] Memory bottlenecks identified -- [ ] Optimization priorities ranked - ---- - -**Key Takeaway**: Memory analysis reveals that the baseline model has significant memory optimization opportunities, particularly in attention computation which scales quadratically with sequence length. Flash Attention and kernel fusion will be primary targets for Version 2. - -**Next Exercise**: [Exercise 3 - Bottleneck Identification](exercise_3_bottleneck_identification.md) - - +export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 +``` diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_3_bottleneck_identification.md b/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_3_bottleneck_identification.md index 16ee8fe9..8af87e44 100644 --- a/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_3_bottleneck_identification.md +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises/exercise_3_bottleneck_identification.md @@ -1,254 +1,48 @@ +# Exercise 3: Bottleneck Identification and Optimization Planning -## Exercise 3: Bottleneck Identification and Optimization Planning +exercise_3_bottleneck_identification.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline/exercises` in the Training Examples repository. -`exercise3_bottleneck_identification.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version1_pytorch_baseline` in the Training Examples repository +## Objective -### Objective -Systematically identify performance bottlenecks in the baseline model and create an optimization roadmap for Version 2 and beyond. +Systematically identify performance bottlenecks in the baseline model and create an optimization roadmap. 
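Step 3 below refers to Model FLOPS Utilization (MFU), the ratio of achieved FLOPS throughput to the device's theoretical peak. As a reminder of the calculation, here is a minimal sketch; every number in it is a placeholder to be replaced with your profiler output and your GPU's datasheet value.

```
# MFU = achieved FLOPS per second / theoretical peak FLOPS per second
flops_per_step = 4.0e11    # placeholder: FLOPS for one training step (from the FLOPS profiler)
step_time_s    = 0.040     # placeholder: measured seconds per step
peak_flops     = 1.0e14    # placeholder: device peak FLOPS from the datasheet

achieved = flops_per_step / step_time_s
mfu_percent = 100.0 * achieved / peak_flops
print(f"Achieved: {achieved:.2e} FLOPS/s, MFU: {mfu_percent:.1f}%")
```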
-### Prerequisites -- Completed Exercises 1 and 2 -- Understanding of profiling results analysis +## Step 1: Comprehensive Profiling -### Duration -**Estimated Time:** 30-35 minutes +Run the complete profiling suite: -### Background - -Bottleneck identification is critical for effective optimization: -- **Amdahl's Law**: Overall speedup is limited by the slowest component -- **Optimization ROI**: Focus effort where it provides maximum benefit -- **Systematic Approach**: Use data-driven decisions rather than intuition -- **Baseline Establishment**: Create benchmarks for measuring improvement - -### Instructions - -#### Step 1: Comprehensive Profiling Run (10 minutes) - -Run the complete profiling suite to gather all necessary data: - -```bash -## Run comprehensive profiling analysis -bash run_all_profilers.sh \ +``` +python tiny_llama_v1.py \ --batch-size 8 \ --seq-len 128 \ --num-steps 30 \ + --enable-pytorch-profiler \ --profile-dir ./bottleneck_analysis ``` -This will generate: -- Baseline training metrics -- PyTorch profiler results -- FLOPS analysis data -- Memory usage patterns -- Comprehensive reports - -**📝 Wait for completion and record:** -- Overall runtime: _______ seconds -- Profile data location: _______ -- Any errors or warnings: _______ - -#### Step 2: Operator-Level Bottleneck Analysis (10 minutes) - -Analyze the detailed profiling results to identify computational bottlenecks: - -```bash -## View the comprehensive profiling report -cat ./bottleneck_analysis/performance_summary_report.md - -## Examine PyTorch profiler operator breakdown -python run_pytorch_profiler.py \ - --analyze-existing ./bottleneck_analysis/pytorch_profiling \ - --generate-report \ - --output-dir ./detailed_analysis -``` - -**📝 Top Time-Consuming Operations:** - -From the PyTorch profiler results, identify the top 10 operations by GPU time: - -| Rank | Operation Name | GPU Time (%) | CPU Time (%) | Count | Optimization Target | -|------|----------------|-------------|-------------|-------|-------------------| -| 1 | | | | | | -| 2 | | | | | | -| 3 | | | | | | -| 4 | | | | | | -| 5 | | | | | | -| 6 | | | | | | -| 7 | | | | | | -| 8 | | | | | | -| 9 | | | | | | -| 10 | | | | | | - -**Pattern Analysis:** -- What percentage of time is spent in matrix multiplications? _______% -- How many separate linear projection operations are there? _______ -- What's the overhead from kernel launches vs. computation? 
_______% - -#### Step 3: FLOPS Efficiency Analysis (8 minutes) - -Examine computational efficiency using the FLOPS analysis: - -```bash -## View FLOPS analysis results -python -c " -import json -with open('./bottleneck_analysis/flops_analysis/flops_profile.json', 'r') as f: - data = json.load(f) - -print('=== FLOPS EFFICIENCY ANALYSIS ===') -print(f'Model FLOPS Utilization: {data[\"efficiency_metrics\"][\"mfu_percent\"]:.1f}%') -print(f'Achieved FLOPS/sec: {data[\"performance_metrics\"][\"flops_per_sec\"]:.2e}') -print(f'Peak Device FLOPS: {data[\"efficiency_metrics\"][\"device_peak_flops\"]:.2e}') -print(f'FLOPS per Parameter: {data[\"flops_analysis\"][\"flops_per_parameter\"]:.2f}') -print(f'Throughput: {data[\"performance_metrics\"][\"throughput_samples_per_sec\"]:.1f} samples/sec') -" -``` - -**📝 Efficiency Metrics:** -- Model FLOPS Utilization (MFU): _______% -- Achieved FLOPS per second: _______ -- FLOPS per parameter: _______ -- Overall throughput: _______ samples/sec - -**Efficiency Classification:** -- [ ] < 20% MFU: Severely underutilized (kernel overhead dominant) -- [ ] 20-40% MFU: Memory-bound workload -- [ ] 40-60% MFU: Mixed workload with optimization opportunities -- [ ] > 60% MFU: Well-optimized compute-bound workload - -#### Step 4: Memory Bottleneck Assessment (7 minutes) - -Analyze memory-related bottlenecks: - -```bash -## Check computational intensity analysis -python -c " -import json -import os - -intensity_file = './bottleneck_analysis/flops_analysis/computational_intensity.json' -if os.path.exists(intensity_file): - with open(intensity_file, 'r') as f: - data = json.load(f) - - print('=== MEMORY BOTTLENECK ANALYSIS ===') - print(f'Arithmetic Intensity: {data[\"arithmetic_intensity_flops_per_byte\"]:.2f} FLOPS/byte') - print(f'Memory Bandwidth Used: {data[\"memory_bandwidth_used_gb_per_sec\"]:.1f} GB/s') - print(f'Bandwidth Utilization: {data[\"memory_bandwidth_utilization_percent\"]:.1f}%') - print(f'Workload Type: {data[\"memory_bound_vs_compute_bound\"]}') -else: - print('Computational intensity analysis not available') -" -``` - -**📝 Memory Analysis:** -- Arithmetic Intensity: _______ FLOPS/byte -- Memory Bandwidth Utilization: _______% -- Primary Bottleneck: [ ] Memory-bound [ ] Compute-bound [ ] Mixed -- Peak Memory Usage: _______ MB - -**Roofline Model Position:** -- [ ] Below roofline - memory bound (optimize data movement) -- [ ] On roofline - balanced (optimize both) -- [ ] Below compute ceiling - compute bound (optimize kernels) - -#### Step 5: Systematic Bottleneck Ranking (10 minutes) - -Create a systematic ranking of bottlenecks based on impact and effort: - -**📝 Bottleneck Impact Assessment:** - -For each major bottleneck, assess: - -| Bottleneck Category | % of Total Time | Optimization Difficulty | Expected Speedup | Priority Rank | -|--------------------|-----------------|------------------------|------------------|---------------| -| QKV Projections | | Low-Medium | 1.2-1.5x | | -| Attention Computation | | Medium | 1.3-2.0x | | -| SwiGLU Gate/Up | | Low | 1.1-1.3x | | -| Kernel Launch Overhead | | Medium-High | 1.5-3.0x | | -| Memory Fragmentation | | Medium | 1.1-1.4x | | -| Softmax Operations | | Medium-High | 1.2-1.8x | | +## Step 2: Operator-Level Analysis -**Impact vs Effort Matrix:** +Examine the profiling results to identify computational bottlenecks. Look for the top time-consuming operations in the profiler output. 
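One way to get this list without TensorBoard is to print the profiler's aggregated operator table. The sketch below is illustrative rather than part of the example code; `run_one_training_step` is a hypothetical callable for a single forward/backward pass, and on ROCm builds the CUDA activity covers HIP kernels.

```
import torch
from torch.profiler import profile, ProfilerActivity

def profile_top_ops(run_one_training_step, steps=5):
    # Aggregate CPU and GPU activity over a few steps
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        for _ in range(steps):
            run_one_training_step()
    # Print the ten operators with the most GPU time
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```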
-High Impact, Low Effort (Priority 1): -- _______________________________ -- _______________________________ +Expected top operations by GPU time: +- Matrix multiplications (aten::mm, aten::addmm, aten::bmm) +- Softmax operations +- Element-wise operations -High Impact, High Effort (Priority 2): -- _______________________________ -- _______________________________ +## Step 3: Efficiency Analysis -Low Impact, Low Effort (Priority 3): -- _______________________________ -- _______________________________ +Key efficiency metrics to examine: +- Model FLOPS Utilization (MFU) +- Memory bandwidth utilization +- Kernel launch overhead -Low Impact, High Effort (Priority 4 - Skip): -- _______________________________ -- _______________________________ +Typical baseline efficiency: +- MFU: 20-35% (memory-bound workload) +- Bandwidth utilization: 30-60% -### Analysis and Optimization Roadmap +## Typical Bottleneck Hierarchy -#### Step 6: Create Version 2 Optimization Plan (10 minutes) - -Based on your analysis, create a detailed optimization plan for Version 2: - -**📝 Version 2 Optimization Roadmap:** - -**Phase 1: Kernel Fusion (Expected: 1.4-1.8x speedup)** -- [ ] **QKV Fusion**: Combine Q, K, V linear projections - - Impact: Reduce 3 kernel launches to 1 - - Memory: Reduce intermediate tensor allocations - - Implementation: Fused linear layer - -- [ ] **SwiGLU Fusion**: Combine gate and up projections - - Impact: Reduce 2 kernel launches to 1 - - Memory: Eliminate intermediate activations - - Implementation: Custom fused activation - -**Phase 2: Attention Optimization (Expected: 1.3-2.0x speedup)** -- [ ] **Flash Attention**: Memory-efficient attention computation - - Impact: Reduce attention memory from O(n^2) to O(n) - - Memory: Enable longer sequences and larger batches - - Implementation: torch.nn.functional.scaled_dot_product_attention - -**Phase 3: Additional Optimizations (Expected: 1.1-1.3x speedup)** -- [ ] **Torch Compile**: Automatic kernel fusion -- [ ] **Memory Layout**: Optimize tensor layouts -- [ ] **Mixed Precision**: FP16/BF16 where appropriate - -**Expected Overall Speedup for Version 2:** _______x - -#### Step 7: Validation Metrics Definition (5 minutes) - -Define metrics to validate Version 2 improvements: - -**📝 Success Metrics for Version 2:** - -**Performance Targets:** -- Training throughput: _______ samples/sec → _______ samples/sec -- Model FLOPS Utilization: _______ % → _______ % -- Peak memory usage: _______ MB → _______ MB -- Kernel count per step: _______ → _______ - -**Validation Tests:** -- [ ] Batch size 8, sequence length 128 (baseline comparison) -- [ ] Batch size 16, sequence length 256 (scaling test) -- [ ] Memory scaling with sequence length -- [ ] Numerical accuracy validation (loss convergence) - -**Quality Gates:** -- [ ] No degradation in model accuracy -- [ ] Deterministic execution maintained -- [ ] Memory usage reduced or stable -- [ ] Throughput improved by >30% - -### Expected Results - -#### Typical Bottleneck Hierarchy 1. **Attention Operations (35-45% of time)** - - Multiple QKV projections + - QKV projections (3 separate kernel launches) - Attention score computation - Softmax operations @@ -261,98 +55,30 @@ Define metrics to validate Version 2 improvements: - Multiple small operations - Memory transfers between kernels -4. **Memory Operations (5-15% of time)** - - Tensor allocations/deallocations - - Memory fragmentation - -#### Optimization Priority Order -1. **QKV Fusion** (Low effort, medium impact) -2. 
**Flash Attention** (Medium effort, high impact) -3. **SwiGLU Fusion** (Low effort, low-medium impact) -4. **Torch Compile** (Very low effort, variable impact) - -### Troubleshooting - -**Missing Analysis Files:** -```bash -## Re-run comprehensive profiling if files are missing -bash run_all_profilers.sh --batch-size 8 --profile-dir ./bottleneck_retry -``` - -**Profiling Data Errors:** -```bash -## Check for GPU memory issues -nvidia-smi # or rocm-smi -## Reduce batch size if necessary -``` - -### Analysis Questions - -**📝 Critical Analysis Questions:** - -1. **What is the single largest performance bottleneck?** - - [ ] QKV projection operations - - [ ] Attention score computation - - [ ] Feed-forward network - - [ ] Kernel launch overhead - - [ ] Memory bandwidth - -2. **What type of optimization would provide the biggest benefit?** - - [ ] Kernel fusion (reduce launches) - - [ ] Memory optimization (bandwidth) - - [ ] Algorithmic optimization (attention) - - [ ] Precision optimization (mixed precision) - -3. **Is the workload primarily:** - - [ ] Memory-bound (optimize data movement) - - [ ] Compute-bound (optimize kernels) - - [ ] Overhead-bound (optimize launches) - - [ ] Mixed workload (balanced optimization) - -4. **What should be the first optimization implemented?** - - [ ] QKV fusion (immediate benefit) - - [ ] Flash Attention (biggest impact) - - [ ] SwiGLU fusion (easy implementation) - - [ ] Torch compile (automatic optimization) +## Optimization Roadmap -5. **What is the realistic speedup target for Version 2?** - - [ ] 1.2-1.4x (conservative) - - [ ] 1.5-2.0x (achievable) - - [ ] 2.0-3.0x (optimistic) - - [ ] >3.0x (unlikely without major changes) +**Priority 1: Kernel Fusion (Expected 1.4-1.8x speedup)** +- QKV Fusion: Combine Q, K, V projections into single GEMM +- SwiGLU Fusion: Combine gate and up projections -### Deliverables +**Priority 2: Attention Optimization (Expected 1.3-2.0x speedup)** +- Flash Attention: Memory-efficient attention computation +- Reduces memory from O(S^2) to O(S) -At the end of this exercise, you should have: +**Priority 3: Additional Optimizations (Expected 1.1-1.3x speedup)** +- torch.compile for automatic kernel fusion +- Mixed precision (FP16/BF16) -1. **Bottleneck Analysis Report** with quantified performance issues -2. **Optimization Roadmap** with prioritized improvements -3. **Version 2 Implementation Plan** with expected benefits -4. **Success Metrics** for validating improvements -5. **Baseline Measurements** for comparison +## Troubleshooting -### Next Steps +Missing analysis files: -1. **Document all findings** in the performance summary template -2. **Review optimization priorities** with team/instructor -3. **Validate technical feasibility** of planned optimizations -4. **Proceed to Version 2** implementation with clear targets -5. **Set up regression testing** framework for validation - -### Success Criteria - -**Exercise Complete When:** -- [ ] Comprehensive bottleneck analysis completed -- [ ] Performance bottlenecks quantified and ranked -- [ ] Optimization roadmap created with priorities -- [ ] Success metrics defined for Version 2 -- [ ] Implementation plan validated -- [ ] Ready to begin Version 2 optimizations - ---- - -**Key Takeaway**: Systematic bottleneck identification reveals that the baseline model has clear optimization opportunities in kernel fusion, attention computation, and memory usage. The data-driven approach provides a roadmap for achieving 1.5-2.0x speedup in Version 2. 
- -**Next Phase**: [Version 2 - PyTorch Fused](../version2_pytorch_fused/README.md) +``` +python tiny_llama_v1.py --batch-size 8 --profile-dir ./bottleneck_retry +``` +Check GPU status: +``` +rocm-smi +``` diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/get_counters.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_counters.sh new file mode 100644 index 00000000..80b43b1d --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_counters.sh @@ -0,0 +1,78 @@ +#!/bin/bash +# Script to profile TinyTransformer with rocprofv3 kernel trace +# This captures kernel execution metrics for performance analysis +# +# Supports both ROCm 6.x (CSV output) and ROCm 7.x (SQLite database output) + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./counters/counter_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 kernel trace collection for TinyTransformer..." +echo "Output directory: $OUTPUT_DIR" + +# Run with rocprofv3 to collect kernel trace +rocprofv3 \ + --kernel-trace \ + --output-directory "$OUTPUT_DIR" \ + -- python tiny_llama_v1.py \ + --batch-size 8 \ + --seq-len 128 \ + --num-steps 10 + +echo "" +echo "Profiling complete! 
Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR"/*/ 2>/dev/null || ls -lh "$OUTPUT_DIR" +echo "" + +# Analyze results based on ROCm version +echo "To analyze results:" +DB_FILE=$(find "$OUTPUT_DIR" -name "*_results.db" 2>/dev/null | head -1) +if [ -n "$DB_FILE" ]; then + echo " Database file: $DB_FILE" + echo "" + echo " Export to CSV:" + echo " rocpd2csv -i $DB_FILE -o kernel_stats.csv" + echo "" + echo " Get kernel summary:" + echo " rocpd summary -i $DB_FILE --region-categories KERNEL" +else + echo " Check $OUTPUT_DIR for output files" +fi diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/get_hotspots.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_hotspots.sh new file mode 100755 index 00000000..1c01f867 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_hotspots.sh @@ -0,0 +1,55 @@ +#!/bin/bash +# +# Get hotspots analysis using rocprofv3 +# Compatible with ROCm 6.x and 7.x +# + +set -e + +echo "==========================================" +echo "rocprofv3 Hotspots Analysis - Version 1" +echo "==========================================" +echo "" + +OUTPUT_DIR="./hotspots/hotspot_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" +echo "Running: rocprofv3 --stats -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprofv3 --stats -- python ../../tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] Hotspot analysis completed" +else + echo "[FAILED] Hotspot analysis failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls +echo "" + +# Check for stats/CSV files +if ls *.csv 1> /dev/null 2>&1; then + echo "Statistics files found:" + for f in *.csv; do + echo "" + echo "File: $f" + echo "Top 10 entries:" + head -11 "$f" + done +else + echo "Looking for statistics in subdirectories:" + find . -name "*.csv" -exec echo "Found: {}" \; -exec head -11 {} \; +fi +echo "" + +echo "Hotspot analysis identifies GPU kernels with highest time consumption." +echo "" diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/get_rocprof_compute.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_rocprof_compute.sh new file mode 100755 index 00000000..65bf0649 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_rocprof_compute.sh @@ -0,0 +1,50 @@ +#!/bin/bash +# +# Get detailed GPU metrics using rocprof-compute +# Compatible with ROCm 6.x and 7.x +# + +set -e + +echo "==========================================" +echo "rocprof-compute Profiling - Version 1" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_compute/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-compute to collect detailed GPU metrics +# rocprof-compute requires: profile mode --name -d -- +WORKLOAD_NAME="tiny_llama_v1_$(date +%Y%m%d_%H%M%S)" +echo "Running: rocprof-compute profile --name $WORKLOAD_NAME -d $OUTPUT_DIR -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +rocprof-compute profile --name "$WORKLOAD_NAME" -d "$OUTPUT_DIR" -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? 
+ +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-compute profiling completed" +else + echo "[FAILED] rocprof-compute profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find "$OUTPUT_DIR" -type f -ls +echo "" + +echo "" +echo "To analyze results:" +echo " rocprof-compute analyze -p $OUTPUT_DIR/workloads/${WORKLOAD_NAME}/rocprof --dispatch -n tiny_llama_dispatch" +echo "" +echo "For available analysis options:" +echo " rocprof-compute analyze --help" +echo "" +echo "Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) for full hardware counter support." +echo "" diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/get_rocprof_sys.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_rocprof_sys.sh new file mode 100755 index 00000000..14ea1fc8 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_rocprof_sys.sh @@ -0,0 +1,47 @@ +#!/bin/bash +# +# Get system-level profiling using rocprof-sys +# Compatible with ROCm 6.x and 7.x +# +# NOTE: rocprof-sys may produce memory map dumps in some configurations. +# Issue reference: TBD +# + +set -e + +echo "==========================================" +echo "rocprof-sys Profiling - Version 1" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_sys/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-sys to collect system-level traces +# rocprof-sys-run provides call-stack sampling and system-level profiling +echo "Running: rocprof-sys-run --profile --trace -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprof-sys-run --profile --trace -- python ../../tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-sys profiling completed" +else + echo "[FAILED] rocprof-sys profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . 
-type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " Open the .proto file in Perfetto UI: https://ui.perfetto.dev/" +echo "" diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/get_trace.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_trace.sh new file mode 100644 index 00000000..91d9e611 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/get_trace.sh @@ -0,0 +1,96 @@ +#!/bin/bash +# Script to profile TinyTransformer with rocprofv3 runtime trace +# This captures GPU API calls, kernel launches, and memory operations +# +# Compatible with ROCm 6.x and 7.x + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi +OUTPUT_DIR="./traces/trace_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Build rocprofv3 command with appropriate flags for ROCm version +# ROCm 6.4+ and 7.x require explicit --output-format pftrace to generate Perfetto traces +# Earlier ROCm 6.x versions (6.0-6.3) generated pftrace by default +if [ "$ROCM_MAJOR" = "7" ] || [ "$ROCM_MAJOR" = "6" ]; then + echo "Using ROCm 6.x/7.x: --output-format pftrace (generates Perfetto trace)" + OUTPUT_FORMAT="--output-format pftrace" +else + echo "Using ROCm 5.x or older: default format" + OUTPUT_FORMAT="" +fi + +echo "" +echo "Collecting full runtime trace (HIP/HSA API calls, kernels, memory operations)" +echo "" + +# Run with rocprofv3 to collect full runtime trace +# NOTE: Using --runtime-trace to capture complete timeline: +# - HIP/HSA API calls +# - Kernel execution on GPU +# - Memory operations (H2D, D2H, D2D transfers) +# - Synchronization events +# This provides the comprehensive view needed for timeline analysis in Perfetto +cd "$OUTPUT_DIR" +rocprofv3 \ + --runtime-trace \ + $OUTPUT_FORMAT \ + -- python ../../tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] Trace generation completed" +else + echo "[FAILED] Trace generation failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls +echo "" + +echo "Perfetto trace files:" +find . 
-name "*.pftrace" -exec ls -lh {} \; +echo "" + +echo "To view trace:" +echo " Visit: https://ui.perfetto.dev/" +echo " Open the largest .pftrace file" +echo "" diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/test_rocpd.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/test_rocpd.sh new file mode 100755 index 00000000..128f3a53 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/test_rocpd.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# +# Test rocpd (ROCm Profiling Daemon) for continuous profiling +# + +set -e + +echo "==========================================" +echo "rocpd Test - Version 1" +echo "==========================================" +echo "" + +# Check if rocpd is available +if ! command -v rocpd &> /dev/null; then + echo "[ERROR] rocpd not found in PATH" + echo "rocpd may not be installed or available in this ROCm version" + exit 1 +fi + +echo "rocpd location: $(which rocpd)" +echo "" + +OUTPUT_DIR="./rocpd/rocpd_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Start rocpd in background +echo "Starting rocpd daemon..." +rocpd --output-dir "$OUTPUT_DIR" & +ROCPD_PID=$! +echo "rocpd running with PID: $ROCPD_PID" +echo "" + +# Give rocpd time to initialize +sleep 2 + +# Run workload +echo "Running workload: python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10" +python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 +WORKLOAD_EXIT=$? +echo "" + +# Stop rocpd +echo "Stopping rocpd daemon..." +kill $ROCPD_PID 2>/dev/null || true +wait $ROCPD_PID 2>/dev/null || true +echo "" + +if [ $WORKLOAD_EXIT -eq 0 ]; then + echo "[SUCCESS] Workload completed" +else + echo "[FAILED] Workload failed with exit code $WORKLOAD_EXIT" +fi +echo "" + +echo "Generated files in $OUTPUT_DIR:" +ls -lh "$OUTPUT_DIR" +echo "" + +echo "rocpd output is a SQLite3 database file" +echo "" +echo "To view the database:" +echo " - Use VS Code SQLite Viewer extension" +echo " - rocprof-compute and rocprof-systems can consume it directly" +echo " - No official CLI tool is provided for viewing" +echo "" +echo "rocpd provides continuous profiling with minimal overhead" +echo "" diff --git a/MLExamples/TinyTransformer/version1_pytorch_baseline/test_rocprofv3_version1.sh b/MLExamples/TinyTransformer/version1_pytorch_baseline/test_rocprofv3_version1.sh new file mode 100755 index 00000000..a108fc73 --- /dev/null +++ b/MLExamples/TinyTransformer/version1_pytorch_baseline/test_rocprofv3_version1.sh @@ -0,0 +1,145 @@ +#!/bin/bash +# +# rocprofv3 validation test for tiny_llama_v1.py +# Tests profiler capture on baseline PyTorch implementation +# + +set -e + +echo "==========================================" +echo "rocprofv3 Test Suite - Version 1 Baseline" +echo "==========================================" +echo "" + +# Step 1: Environment Validation +echo "[STEP 1] Environment Validation" +echo "----------------------------------------" + +echo "ROCm Version:" +rocm-smi --showproductname || echo "rocm-smi failed" +echo "" + +echo "GPU Visibility:" +echo " HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES" +echo " ROCR_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES" +echo " HSA_ENABLE_PROFILING=$HSA_ENABLE_PROFILING" +echo "" + +echo "rocprofv3 location:" +which rocprofv3 +echo "" + +echo "PyTorch + ROCm Check:" +python3 -c " +import torch +print(f'PyTorch version: {torch.__version__}') +print(f'CUDA available: {torch.cuda.is_available()}') +if torch.cuda.is_available(): + print(f'Device count: {torch.cuda.device_count()}') + 
print(f'Device name: {torch.cuda.get_device_name(0)}') + print(f'Device capability: {torch.cuda.get_device_capability(0)}') +else: + print('WARNING: CUDA/ROCm not available!') +" +echo "" + +# Step 2: Baseline Test (No Profiler) +echo "[STEP 2] Baseline Test - No Profiler" +echo "----------------------------------------" +echo "Running: python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 5 --validate-setup" +echo "" + +python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 5 --validate-setup +BASELINE_EXIT=$? + +if [ $BASELINE_EXIT -eq 0 ]; then + echo "[SUCCESS] Baseline test passed" +else + echo "[FAILED] Baseline test failed with exit code $BASELINE_EXIT" + exit 1 +fi +echo "" + +# Step 3: rocprofv3 with runtime-trace (GitHub issue command pattern) +echo "[STEP 3] rocprofv3 Test - Runtime Trace + Perfetto" +echo "----------------------------------------" +echo "Running: rocprofv3 --runtime-trace --output-format pftrace -- python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 --enable-pytorch-profiler --profile-memory" +echo "" + +OUTPUT_DIR="./rocprof_v1_test_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" +cd "$OUTPUT_DIR" + +rocprofv3 --runtime-trace --output-format pftrace -- python ../tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 10 --enable-pytorch-profiler --profile-memory +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprofv3 completed" +else + echo "[FAILED] rocprofv3 failed with exit code $ROCPROF_EXIT" +fi +echo "" + +# Check generated files +echo "Generated files:" +ls -lh +echo "" + +# Check for profiling data +if ls *.pftrace 1> /dev/null 2>&1; then + echo "Found perfetto trace files:" + ls -lh *.pftrace + + echo "" + echo "Checking trace file size:" + for f in *.pftrace; do + size=$(stat -f%z "$f" 2>/dev/null || stat -c%s "$f") + if [ $size -gt 1000 ]; then + echo " $f: $size bytes (likely has data)" + else + echo " $f: $size bytes (suspiciously small)" + fi + done +else + echo "No .pftrace files found in current directory" + echo "Checking subdirectories..." + find . -name "*.pftrace" -ls +fi +echo "" + +# Check for PyTorch profiler output +if [ -d "pytorch_profiles" ]; then + echo "" + echo "PyTorch Profiler output:" + ls -lh pytorch_profiles/ + echo "" + echo "TensorBoard traces available:" + echo " Launch: tensorboard --logdir pytorch_profiles" +else + echo "" + echo "Note: pytorch_profiles directory not found (script may need directory creation fix)" +fi + +# Summary +echo "" +echo "==========================================" +echo "Test Summary" +echo "==========================================" +echo "Results directory: $OUTPUT_DIR" +echo "" +echo "Generated profiling data:" +echo " 1. rocprofv3 perfetto traces (.pftrace files)" +echo " 2. PyTorch profiler traces (pytorch_profiles/ if present)" +echo "" +echo "Next steps:" +echo " 1. Inspect generated files in $OUTPUT_DIR" +echo " 2. Open .pftrace in perfetto.dev or chrome://tracing" +echo " 3. View PyTorch traces with tensorboard --logdir pytorch_profiles" +echo " 4. Check for GPU kernel activity in both profilers" +echo " 5. 
Compare to GitHub issue #1386 output" +echo "" +echo "To view perfetto trace:" +echo " Visit: https://ui.perfetto.dev/" +echo " Click 'Open trace file' and select the .pftrace file" +echo "" diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/README.md b/MLExamples/TinyTransformer/version2_pytorch_fused/README.md index 60e73ffe..df9aa6c8 100644 --- a/MLExamples/TinyTransformer/version2_pytorch_fused/README.md +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/README.md @@ -1,812 +1,171 @@ +# ML Example: TinyTransformer Fused with ROCm Profiling -# Version 2: PyTorch Fused - Kernel Fusion and ROCm Tools Integration +README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version2_pytorch_fused` from the Training Examples repository. -README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version2_pytorch_fused` in the Training Examples repository +In this example we provide a fused PyTorch implementation of Tiny LLaMA with kernel fusion optimizations for profiling transformer workloads on AMD GPUs. This version builds on the baseline (version1) with QKV fusion, Flash Attention, and SwiGLU fusion to demonstrate performance optimization techniques. Several profiling scripts are provided to capture different aspects of GPU performance. -## Overview +## Features of the profiling scripts -Version 2 demonstrates the power of kernel fusion and introduces comprehensive ROCm profiling tools. Building on the baseline analysis from Version 1, this version implements targeted optimizations to achieve significant performance improvements through strategic kernel fusion, Flash Attention, and advanced ROCm profiling integration. +The version2_pytorch_fused example contains several profiling scripts that capture different aspects of GPU performance: -## Learning Objectives +- **get_trace.sh**: Runtime trace collection using rocprofv3. Captures HIP/HSA API calls, kernel execution timeline, memory operations (H2D, D2H, D2D transfers), and synchronization events. Output is a Perfetto trace file for timeline visualization. +- **get_counters.sh**: Kernel trace collection using rocprofv3. Captures kernel execution statistics including timing and call counts. Useful for identifying hotspot kernels and their execution patterns. +- **get_rocprof_compute.sh**: Detailed GPU hardware metrics using rocprof-compute. Provides comprehensive performance analysis including compute utilization, memory bandwidth, and hardware counter data. +- **get_rocprof_sys.sh**: System-level profiling using rocprof-sys. Captures call stack sampling and system-level performance data for end-to-end analysis. +- **get_hotspots.sh**: GPU hotspot analysis using rocprofv3 stats mode. Identifies kernels with highest time consumption. 
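Before running any of these scripts, it is worth confirming that the PyTorch build actually targets ROCm and that `scaled_dot_product_attention` (used by the fused attention path in this version) is available. The snippet below is an illustrative, standalone pre-flight check and is not part of the provided scripts:

```python
import torch
import torch.nn.functional as F

def check_rocm_setup():
    # PyTorch's ROCm build reports HIP devices through the torch.cuda API.
    assert torch.cuda.is_available(), "No ROCm/HIP device visible to PyTorch"
    print("device:", torch.cuda.get_device_name(0))
    print("HIP runtime:", getattr(torch.version, "hip", None))
    # The Flash Attention optimization relies on PyTorch 2.x SDPA.
    assert hasattr(F, "scaled_dot_product_attention"), "PyTorch >= 2.0 required"
    print("SDPA available: True")

if __name__ == "__main__":
    check_rocm_setup()
```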
-After completing this version, you will be able to: +## Key Optimizations -- Implement QKV fusion to reduce kernel launch overhead -- Integrate Flash Attention for memory-efficient attention computation -- Apply SwiGLU fusion in feed-forward networks -- Use ROCm profiling tools (rocprofv3, rocprof-sys, rocprof-compute) for hardware-level analysis -- Analyze kernel fusion impact on performance and memory usage -- Interpret ROCm profiling data for optimization insights +This version implements the following optimizations over the baseline: -## Key Optimizations Implemented +- **QKV Fusion**: Combines Q, K, V projections into single GEMM (3 kernels → 1) +- **Flash Attention**: Memory-efficient attention via scaled_dot_product_attention (O(S²) → O(S) memory) +- **SwiGLU Fusion**: Combines gate and up projections (2 kernels → 1) +- **torch.compile**: Automatic kernel fusion and optimization -### 1. QKV Fusion +## Overview of the model -- **Problem**: Separate Q, K, V linear projections create 3 kernel launches -- **Solution**: Fused QKV projection with single kernel launch -- **Expected Benefit**: 20-30% reduction in attention overhead +The model is controlled with the following arguments: -### 2. Flash Attention Integration +- `--batch-size `: batch size for training (default: 8) +- `--seq-len `: sequence length (default: 256) +- `--num-steps `: number of training steps (default: 50) +- `--hidden-dim `: hidden dimension (default: 512) +- `--num-layers `: number of transformer layers (default: 8) +- `--num-heads `: number of attention heads (default: 8) +- `--learning-rate `: learning rate (default: 3e-4) +- `--use-amp`: enable automatic mixed precision -- **Problem**: Standard attention has O(n^2) memory complexity -- **Solution**: PyTorch's scaled_dot_product_attention with Flash Attention -- **Expected Benefit**: Significant memory reduction, enables larger sequences +## Running the fused model -### 3. SwiGLU Fusion +Load the required modules: -- **Problem**: Separate gate and up projections in feed-forward network -- **Solution**: Combined gate/up computation with element-wise operations -- **Expected Benefit**: 15-25% feed-forward network speedup - -### 4. Torch Compile Integration - -- **Problem**: Remaining kernel launch overhead -- **Solution**: Automatic fusion through torch.compile() -- **Expected Benefit**: Additional 10-20% speedup through automatic optimizations - -## Architecture Enhancements and Fusion Techniques - -### Mathematical Foundation of Kernel Fusion - -Kernel fusion combines multiple operations into a single GPU kernel to reduce memory bandwidth requirements and kernel launch overhead. For complete mathematical foundations, see [TINY_LLAMA_ARCHITECTURE.md](../TINY_LLAMA_ARCHITECTURE.md). - -#### Fusion Efficiency Analysis - -**Memory Bandwidth Reduction:** - -$$ -\text{Bandwidth Reduction} = 1 - \frac{\text{Fused Operations Memory}}{\text{Separate Operations Memory}} -$$ - -**For QKV Fusion:** - -$$ -\begin{aligned} -\text{Separate}: & \quad 3 \times (\text{Input Read} + \text{Weight Read} + \text{Output Write}) \\ -& = 3 \times (B \times S \times D + D^2 + B \times S \times D) \\ -\text{Fused}: & \quad \text{Input Read} + 3 \times \text{Weight Read} + \text{Output Write} \\ -& = B \times S \times D + 3 \times D^2 + B \times S \times 3D \\ -\text{Reduction}: & \quad \frac{2 \times B \times S \times D}{\text{Total Separate Memory}} \approx 40\% \text{ for typical batch sizes} -\end{aligned} -$$ - -### 1. 
QKV Fusion Implementation - -#### Detailed QKV Fusion Analysis - -**Before Fusion (Baseline):** -```python -# Three separate linear projections - 3 kernel launches -q = self.q_proj(hidden_states) # Kernel 1: GEMM [B,S,D] × [D,D] = [B,S,D] -k = self.k_proj(hidden_states) # Kernel 2: GEMM [B,S,D] × [D,D] = [B,S,D] -v = self.v_proj(hidden_states) # Kernel 3: GEMM [B,S,D] × [D,D] = [B,S,D] - -# Memory reads: 3x input tensor + 3x weight matrices -# Memory writes: 3x output tensors -# Total FLOPS: 3 × (2 × B × S × D^2) +``` +module load pytorch rocm ``` -**After Fusion (Optimized):** -```python -# Single fused projection - 1 kernel launch -qkv = self.qkv_proj(hidden_states) # Kernel 1: GEMM [B,S,D] × [D,3D] = [B,S,3D] -q, k, v = qkv.chunk(3, dim=-1) # Tensor view operation (no memory copy) +Run a basic training run: -# Memory reads: 1x input tensor + 1x weight matrix (3x size) -# Memory writes: 1x output tensor (3x size) -# Total FLOPS: 2 × B × S × D × 3D = 6 × B × S × D^2 (same compute) ``` - -**Performance Analysis:** -```python -# Kernel launch overhead reduction -KERNEL_LAUNCH_OVERHEAD = { - 'baseline_launches': 3, - 'fused_launches': 1, - 'reduction': '67% fewer kernel launches', - 'overhead_per_launch': '5-50 μs depending on operation size', - 'total_overhead_saved': '10-100 μs per attention layer' -} - -# Memory bandwidth optimization -MEMORY_BANDWIDTH = { - 'baseline_reads': 'B×S×D (input) × 3 + D^2 × 3 (weights)', - 'fused_reads': 'B×S×D (input) × 1 + D^2 × 3 (weights)', - 'bandwidth_reduction': '~40% for typical batch sizes', - 'cache_efficiency': 'Improved due to temporal locality' -} +echo "Running TinyTransformer V2 fused" +python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10 ``` -#### Fused QKV Implementation - -```python -class FusedQKVAttention(nn.Module): - """QKV-fused attention with detailed performance optimizations.""" - - def __init__(self, config): - super().__init__() - self.hidden_dim = config.hidden_dim - self.num_heads = config.num_attention_heads - self.head_dim = self.hidden_dim // self.num_heads - - # Single fused QKV projection - critical optimization! - self.qkv_proj = nn.Linear( - config.hidden_dim, - 3 * config.hidden_dim, - bias=False - ) - self.o_proj = nn.Linear(config.hidden_dim, config.hidden_dim, bias=False) - - # RoPE for position embeddings - self.rotary_emb = RotaryEmbedding(self.head_dim) - - def forward(self, hidden_states, attention_mask=None): - batch_size, seq_len, _ = hidden_states.size() +## Runtime Trace Profiling with get_trace.sh - # OPTIMIZATION 1: Fused QKV projection (3 ops → 1 op) - with nvtx.range("fused_qkv_projection"): - qkv = self.qkv_proj(hidden_states) # [B, S, 3*D] +This script captures GPU API calls, kernel launches, and memory operations for timeline analysis. 
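Alongside the rocprofv3 timeline, a PyTorch-level profile can help map the HIP kernels seen in Perfetto back to framework operators (the version1 test script enables a similar capture with `--enable-pytorch-profiler`). The sketch below is illustrative only; `model` and `batch` stand in for whatever objects `tiny_llama_v2.py` builds, and on ROCm builds `ProfilerActivity.CUDA` covers HIP activity:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def pytorch_level_trace(model, batch, trace_path="pytorch_trace.json"):
    # Capture one forward/backward pass with operator- and kernel-level timing.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        loss = model(batch).mean()   # stand-in loss, just to exercise backward
        loss.backward()
    # Top operators by accumulated GPU time, plus a Chrome/Perfetto-loadable trace.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    prof.export_chrome_trace(trace_path)
```

The exported JSON can be opened in chrome://tracing or the Perfetto UI and compared side by side with the rocprofv3 timeline.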
- # OPTIMIZATION 2: Efficient tensor chunking (no memory copy) - q, k, v = qkv.chunk(3, dim=-1) # Each: [B, S, D] +Run the profiling script: - # Reshape for multi-head attention - q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2) - - # Apply RoPE (rotary position embeddings) - q, k = self.rotary_emb(q, k, seq_len) - - # OPTIMIZATION 3: Flash Attention (covered in next section) - with nvtx.range("flash_attention"): - attn_output = F.scaled_dot_product_attention( - q, k, v, - attn_mask=attention_mask, - is_causal=True # Enables causal masking optimization - ) - - # Reshape and project output - attn_output = attn_output.transpose(1, 2).contiguous() - attn_output = attn_output.view(batch_size, seq_len, self.hidden_dim) - - return self.o_proj(attn_output) ``` - -### 2. Flash Attention Deep Dive - -#### Memory Complexity Analysis - -**Standard Attention Memory:** - -$$ -\begin{aligned} -\text{Attention Matrix} &: \mathcal{O}(B \times H \times S^{2}) \\ -\text{For } S=1024: &\quad 1024^2 = 1M \text{ elements per head} \\ -\text{Total Memory} &: B \times H \times S^{2} \times 4 \text{ bytes} \\ -\text{Example}: &\quad 8 \times 8 \times 1024^2 \times 4 = 268\text{MB} -\end{aligned} -$$ - -**Flash Attention Memory:** - -$$ -\begin{aligned} -\text{Block Size} &: B_r \times B_c \quad (\text{typically } 64 \times 64) \\ -\text{Memory Usage} &: \mathcal{O}(B \times H \times (B_r + B_c) \times \frac{S^{2}}{B_r \times B_c}) \\ -&= \mathcal{O}(B \times H \times S) \text{ (linear in sequence length!)} \\ -\text{Reduction} &: \frac{S^{2}}{S} = S \text{-fold memory reduction} -\end{aligned} -$$ - -#### Flash Attention Implementation Details - -```python -# Flash Attention Algorithm (PyTorch implementation) -def flash_attention_forward(q, k, v, mask=None): - """Memory-efficient attention with O(N) memory complexity.""" - - # Use PyTorch's optimized implementation - return F.scaled_dot_product_attention( - q, k, v, - attn_mask=mask, - dropout_p=0.0, - is_causal=True, # Enables causal mask optimization - scale=None # Uses 1/sqrt(head_dim) automatically - ) - -# The above function automatically: -# 1. Tiles the computation into blocks -# 2. Computes attention scores incrementally -# 3. Maintains numerical stability with online softmax -# 4. Minimizes memory transfers between HBM and SRAM +echo "Collecting runtime trace with rocprofv3" +./get_trace.sh ``` -**Flash Attention Performance Characteristics:** -```python -FLASH_ATTENTION_BENEFITS = { - 'memory_complexity': { - 'standard': 'O(B × H × S^2)', - 'flash': 'O(B × H × S)', - 'reduction_factor': 'S (sequence length)' - }, - 'computation': { - 'flops': 'Same as standard attention', - 'io_complexity': 'O(S^2 / √M) vs O(S^2) where M is SRAM size', - 'wall_clock': '2-4x faster for sequences > 512' - }, - 'numerical_stability': { - 'method': 'Online softmax with running max', - 'precision': 'Better numerical stability than standard attention', - 'overflow_protection': 'Built-in overflow/underflow handling' - } -} -``` +The script will output results to `traces/trace_/`. To analyze the results: -### 3. 
SwiGLU Fusion Implementation - -#### SwiGLU Mathematical Analysis - -**Baseline SwiGLU (Separate Operations):** - -$$ -\begin{aligned} -\text{gate} &= xW_{\text{gate}} + b_{\text{gate}} \quad \text{(Linear projection 1)} \\ -\text{up} &= xW_{\text{up}} + b_{\text{up}} \quad \text{(Linear projection 2)} \\ -\text{activated} &= \text{SiLU}(\text{gate}) \quad \text{(Activation function)} \\ -\text{intermediate} &= \text{activated} \odot \text{up} \quad \text{(Element-wise multiply)} \\ -\text{output} &= \text{intermediate} W_{\text{down}} + b_{\text{down}} \quad \text{(Linear projection 3)} -\end{aligned} -$$ - -**Fused SwiGLU (Optimized):** - -$$ -\begin{aligned} -\text{gate\_up} &= x[W_{\text{gate}} \parallel W_{\text{up}}] \quad \text{(Single GEMM)} \\ -\text{gate, up} &= \text{split}(\text{gate\_up}, \text{dim}=-1) \quad \text{(Tensor view)} \\ -\text{output} &= (\text{SiLU}(\text{gate}) \odot \text{up})W_{\text{down}} \quad \text{(Fused activation + projection)} -\end{aligned} -$$ - -#### Performance Impact Analysis - -```python -# FLOP count comparison -SWIGLU_FLOPS = { - 'gate_projection': 2 * batch_size * seq_len * hidden_dim * intermediate_dim, - 'up_projection': 2 * batch_size * seq_len * hidden_dim * intermediate_dim, - 'down_projection': 2 * batch_size * seq_len * intermediate_dim * hidden_dim, - 'silu_activation': batch_size * seq_len * intermediate_dim, # Element-wise - 'elementwise_multiply': batch_size * seq_len * intermediate_dim, # Element-wise -} - -# Memory access pattern optimization -MEMORY_ACCESS_OPTIMIZATION = { - 'baseline_memory_ops': { - 'gate_proj': 'Input read + Weight read + Output write', - 'up_proj': 'Input read + Weight read + Output write', - 'down_proj': 'Input read + Weight read + Output write', - 'total_input_reads': 3, # Major inefficiency! 
- }, - 'fused_memory_ops': { - 'gate_up_proj': 'Input read + Weight read + Output write', - 'down_proj': 'Input read + Weight read + Output write', - 'total_input_reads': 2, # 33% reduction in memory bandwidth - } -} ``` - -#### Detailed SwiGLU Fusion Implementation - -```python -class FusedSwiGLU(nn.Module): - """SwiGLU with gate/up projection fusion for optimal performance.""" - - def __init__(self, config): - super().__init__() - self.hidden_dim = config.hidden_dim - self.intermediate_dim = config.intermediate_dim - - # OPTIMIZATION: Fused gate and up projections - self.gate_up_proj = nn.Linear( - self.hidden_dim, - 2 * self.intermediate_dim, # Combined weight matrix - bias=False - ) - - self.down_proj = nn.Linear( - self.intermediate_dim, - self.hidden_dim, - bias=False - ) - - def forward(self, hidden_states): - batch_size, seq_len, hidden_dim = hidden_states.shape - - # OPTIMIZATION 1: Single GEMM for gate and up projections - with nvtx.range("fused_gate_up_projection"): - gate_up = self.gate_up_proj(hidden_states) # [B, S, 2*I] - - # OPTIMIZATION 2: Efficient tensor splitting (no memory copy) - gate, up = gate_up.chunk(2, dim=-1) # Each: [B, S, I] - - # OPTIMIZATION 3: Fused SiLU activation with element-wise multiply - with nvtx.range("silu_and_multiply"): - # SiLU: x * sigmoid(x) = x / (1 + exp(-x)) - intermediate = F.silu(gate) * up - - # Final down projection - with nvtx.range("down_projection"): - output = self.down_proj(intermediate) - - return output -``` - -**Advanced SwiGLU Optimizations:** -```python -# Custom SiLU implementation for maximum efficiency -def fused_silu_multiply(gate, up): - """Fused SiLU activation with element-wise multiplication.""" - # Can be further optimized with custom kernels in Version 3 - return F.silu(gate) * up - -# Memory layout optimization -def optimized_weight_layout(gate_weight, up_weight): - """Optimize weight matrix layout for fused GEMM.""" - # Concatenate weights for optimal memory access - return torch.cat([gate_weight, up_weight], dim=0) +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .pftrace file" ``` -### 4. Torch Compile Integration +## Kernel Trace Profiling with get_counters.sh -#### Graph-Level Optimization +This script collects kernel execution statistics including timing and call counts. -```python -# Automatic fusion through torch.compile -@torch.compile(mode='max-autotune') -class CompiledTinyLlama(nn.Module): - """Automatically optimized model with torch.compile.""" +Run the profiling script: - def __init__(self, config): - super().__init__() - self.layers = nn.ModuleList([ - FusedTransformerBlock(config) for _ in range(config.num_layers) - ]) - - def forward(self, input_ids, attention_mask=None): - # torch.compile will automatically: - # 1. Fuse adjacent operations - # 2. Optimize memory layouts - # 3. Generate specialized kernels - # 4. 
Eliminate redundant operations - - hidden_states = self.embed_tokens(input_ids) - - for layer in self.layers: - hidden_states = layer(hidden_states, attention_mask) - - return self.norm(hidden_states) ``` - -**Torch Compile Optimization Benefits:** -```python -TORCH_COMPILE_OPTIMIZATIONS = { - 'automatic_fusion': { - 'elementwise_ops': 'Fuses adjacent elementwise operations', - 'reduction_ops': 'Combines reductions where possible', - 'memory_planning': 'Optimizes tensor allocation and deallocation' - }, - 'kernel_specialization': { - 'shape_specialization': 'Generates optimized kernels for specific shapes', - 'dtype_optimization': 'Optimizes for specific data types', - 'device_targeting': 'AMD GPU-specific optimizations' - }, - 'graph_optimization': { - 'dead_code_elimination': 'Removes unused operations', - 'constant_folding': 'Precomputes constant expressions', - 'common_subexpression': 'Eliminates redundant computations' - } -} +echo "Collecting kernel trace with rocprofv3" +./get_counters.sh ``` -### Fusion Performance Analysis Framework - -#### Kernel Launch Reduction Analysis - -```python -# Theoretical kernel count analysis -KERNEL_COUNT_ANALYSIS = { - 'baseline_attention': { - 'q_projection': 1, - 'k_projection': 1, - 'v_projection': 1, - 'attention_computation': 3, # QK^T, softmax, attention*V - 'output_projection': 1, - 'total': 7 - }, - 'fused_attention': { - 'qkv_projection': 1, # Fused Q,K,V - 'flash_attention': 1, # Optimized attention - 'output_projection': 1, - 'total': 3 - }, - 'reduction': '57% fewer kernels per attention layer' -} - -# Memory bandwidth utilization -MEMORY_BANDWIDTH_ANALYSIS = { - 'baseline_efficiency': { - 'multiple_small_ops': 'Poor memory bandwidth utilization', - 'cache_misses': 'Frequent cache evictions between operations', - 'bandwidth_usage': '40-60% of peak bandwidth' - }, - 'fused_efficiency': { - 'larger_operations': 'Better memory bandwidth utilization', - 'temporal_locality': 'Improved cache reuse', - 'bandwidth_usage': '70-85% of peak bandwidth' - } -} -``` +The script will output results to `counters/counter_/`. -#### Arithmetic Intensity Optimization - -```python -# Roofline model analysis for fusion optimizations -def calculate_arithmetic_intensity(operation_type, batch_size, seq_len, hidden_dim): - """Calculate arithmetic intensity for roofline analysis.""" - - intensity_metrics = { - 'baseline_attention': { - 'flops': 4 * batch_size * seq_len * hidden_dim ** 2, - 'memory_bytes': 3 * (batch_size * seq_len * hidden_dim * 4), # 3 separate reads - 'arithmetic_intensity': 'flops / memory_bytes' - }, - 'fused_qkv_attention': { - 'flops': 4 * batch_size * seq_len * hidden_dim ** 2, # Same compute - 'memory_bytes': 1 * (batch_size * seq_len * hidden_dim * 4), # Single read - 'arithmetic_intensity': '3x higher than baseline' - } - } - - return intensity_metrics -``` +ROCm 6.x outputs CSV files directly, while ROCm 7.x outputs SQLite databases. For ROCm 7.x database files, use rocpd tools: -## Workshop Exercises - -### Exercise 1: Kernel Fusion Analysis - -**Objective**: Compare baseline vs. fused implementations to quantify fusion benefits. 
- -#### Step 1: Baseline Comparison -```bash -# Run Version 1 baseline for comparison -cd ../version1_pytorch_baseline -python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 30 > ../version2_baseline_comparison.log - -# Run Version 2 fused implementation -cd ../version2_pytorch_fused -python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30 > fused_performance.log ``` - -#### Step 2: Kernel Count Analysis -```bash -# PyTorch profiler comparison -python run_pytorch_profiler.py --batch-size 8 --profile-dir ./fusion_analysis --generate-report - -# Compare kernel counts between versions -python analyze_kernel_reduction.py --baseline ../version1_pytorch_baseline/pytorch_profiles --fused ./fusion_analysis +echo "Exporting kernel statistics to CSV" +rocpd2csv -i -o kernel_stats.csv ``` -**Expected Results:** -- 40-60% reduction in kernel launch count -- 1.4-1.8x speedup in overall training -- Improved GPU utilization metrics - -### Exercise 2: Flash Attention Memory Analysis - -**Objective**: Analyze memory efficiency improvements from Flash Attention. - -#### Step 1: Memory Scaling Test -```bash -# Test memory scaling with sequence length -for seq_len in 128 256 512 1024; do - python tiny_llama_v2.py \ - --seq-len $seq_len \ - --batch-size 4 \ - --enable-memory-profiling \ - --profile-dir ./flash_attention_seq${seq_len} -done ``` - -#### Step 2: Memory Bandwidth Analysis -```bash -# Analyze memory bandwidth utilization -python run_deepspeed_flops.py \ - --batch-size 8 \ - --seq-len 256 \ - --computational-intensity \ - --generate-roofline +echo "Getting kernel summary" +rocpd summary -i --region-categories KERNEL ``` -**Expected Results:** - -- Linear memory scaling vs. quadratic for baseline -- 2-4x memory reduction for longer sequences -- Improved arithmetic intensity metrics - -### Exercise 3: ROCm Tools Deep Dive - -**Objective**: Master ROCm profiling tools for hardware-level optimization. +Documentation for rocpd tools: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html -AMD offers three performance profiling tools for ROCm based applications: -`rocprofv3`, `rocprof-sys`, and `rocprof-compute`. For more details about these tools, see -[Appendix C of the TECHNICAL_APPENDICES.md](https://github.com/amd/HPCTrainingExamples/blob/main/MLExamples/TinyTransformer/TECHNICAL_APPENDICES.md#appendix-c-rocm-profiling-tools-reference). -about each tool. +## GPU Hardware Metrics with get_rocprof_compute.sh -#### Step 1: rocprofv3 Basic Profiling +This script collects detailed GPU performance metrics for hardware utilization analysis. -Running rocprofv3 to collect GPU hotspots on this example would look like this: +Run the profiling script: -```bash -rocprofv3 --kernel-trace --stats --truncate-kernels -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30 ``` - -View the `_kernel_stats.csv` file to see the GPU kernel hotspots. - -#### Step 2: rocprof-sys System Analysis - -To collect a comprehensive timeline trace with host and device activity, run rocprof-sys as shown below: - -```bash -rocprof-sys-run --profile --trace -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30 +echo "Collecting GPU hardware metrics with rocprof-compute" +./get_rocprof_compute.sh ``` -Copy the `.proto` file to your laptop to visualize with the Perfetto browser based tool at [https://ui.perfetto.dev](https://ui.perfetto.dev). 
- -#### Step 3: rocprof-compute Advanced Analysis +The script will output results to `rocprof_compute/profile_/`. To analyze the results: -To collect roofline plots, run the following command: - -```bash -rocprof-compute profile -n roof --kernel-names --roof-only --device 0 -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30 ``` - -This generates three PDF files: two roofline plots and a legend. - -To collect a profile, then analyze a particular dispatch, run the following commands: - -```bash -rocprof-compute profile -n ver2 --no-roof -- python3 tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30 -rocprof-compute analyze -p workloads/ver2/MI300A_A1 --list-stats >& stats.txt -rocprof-compute analyze -p workloads/ver2/MI300A_A1 --dispatch 1538 >& dispatch_1538.txt +echo "Generating performance analysis report" +rocprof-compute analyze -p /workloads//rocprof --dispatch -n tiny_llama_dispatch ``` -The `--list-stats` option provides a hotspot list of GPU kernels and a list of dispatches. Pick a dispatch of the -kernel that you want to analyze further and use that in the subsequent analyze command. For example, we are -analyzing dispatch 1538 here. - - +For available analysis options: -## Key Performance Improvements - -### Expected Performance Gains - -| Optimization | Impact | Memory Reduction | Kernel Reduction | Implementation Effort | -|-------------|--------|------------------|------------------|---------------------| -| **QKV Fusion** | 1.2-1.4x | 15-25% | 33% (3→1 kernels) | Low | -| **Flash Attention** | 1.3-2.0x | 50-80% | 20% fewer kernels | Medium | -| **SwiGLU Fusion** | 1.1-1.3x | 10-20% | 50% (2→1 kernels) | Low | -| **Torch Compile** | 1.1-1.2x | 5-10% | 10-30% | Very Low | -| **Combined Effect** | **1.6-2.5x** | **60-90%** | **40-60%** | - | - -### Scaling Characteristics - -- **Batch Size Scaling**: Improved efficiency at larger batch sizes -- **Sequence Length Scaling**: Near-linear memory scaling (vs. quadratic) -- **Model Size Scaling**: Better utilization for larger hidden dimensions -- **Multi-GPU Scaling**: Reduced communication overhead - - - -## Advanced Features -### Configurable Fusion Levels +Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) for full hardware counter support. Consumer GPUs may have limited counter availability. -```bash -# Selective fusion testing -python tiny_llama_v2.py \ - --enable-qkv-fusion \ - --enable-flash-attention \ - --disable-swiglu-fusion \ - --enable-torch-compile +## System-Level Profiling with get_rocprof_sys.sh -# A/B testing different fusion combinations -python fusion_ablation_study.py --all-combinations -``` +This script captures system-level performance with call stack sampling. -### Dynamic Batch Size Optimization +Run the profiling script: -```bash -# Find optimal batch size for current hardware -python optimize_batch_size.py \ - --target-memory-usage 0.8 \ - --seq-len 128 \ - --optimization-target throughput ``` - -### Mixed Precision Integration - -```bash -# Test mixed precision with fusion -python tiny_llama_v2.py \ - --use-amp \ - --amp-dtype bfloat16 \ - --enable-all-fusion +echo "Collecting system-level profile with rocprof-sys" +./get_rocprof_sys.sh ``` -## Performance Validation +The script will output results to `rocprof_sys/profile_/`. 
To analyze the results: -### Regression Testing - -```bash -# Numerical accuracy validation -python validate_numerical_accuracy.py \ - --baseline ../version1_pytorch_baseline/tiny_llama_v1.py \ - --optimized ./tiny_llama_v2.py \ - --tolerance 1e-4 - -# Performance regression testing -python performance_regression_test.py \ - --baseline-results ../version1_baseline_metrics.json \ - --current-results ./version2_metrics.json \ - --min-speedup 1.3 ``` - -### Benchmark Suite - -```bash -# Comprehensive benchmarking -python benchmark_suite.py \ - --models v1,v2 \ - --batch-sizes 4,8,16,32 \ - --seq-lengths 128,256,512 \ - --metrics throughput,memory,accuracy +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .proto file" ``` -## Troubleshooting +Note: rocprof-sys may produce memory map dumps in some configurations. If profiling fails or produces excessive output, consider using rocprofv3 (get_trace.sh) instead. -### Common Issues +## GPU Hotspot Analysis with get_hotspots.sh -#### Flash Attention Compatibility -```bash -# Check PyTorch version compatibility -python -c "import torch; print(torch.__version__); print(hasattr(torch.nn.functional, 'scaled_dot_product_attention'))" +This script identifies kernels with the highest execution time using rocprofv3 stats mode. -# Fallback for older PyTorch versions -export PYTORCH_FALLBACK_ATTENTION=1 -``` +Run the profiling script: -#### ROCm Tools Permission Issues -```bash -# Ensure proper permissions for ROCm profiling -sudo usermod -a -G render $USER -export ROCPROF_COMPUTE_DISABLE_AQL_DEBUG=1 ``` - -#### Memory Issues with Larger Sequences -```bash -# Enable memory optimization flags -export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256 -export HIP_LAUNCH_BLOCKING=1 # For debugging +echo "Collecting GPU hotspots" +./get_hotspots.sh ``` - - -## Expected Learning Outcomes - -### Technical Skills Developed - -- **Kernel Fusion Techniques**: Practical implementation of operation fusion -- **Memory Optimization**: Understanding memory-efficient algorithm design -- **ROCm Profiling Mastery**: Comprehensive hardware profiling skills -- **Performance Analysis**: Data-driven optimization decision making +The script will output kernel statistics to `hotspots/hotspot_/`. -### Performance Engineering Insights - -- **Amdahl's Law in Practice**: Understanding optimization impact distribution -- **Memory vs. Compute Trade-offs**: Balancing different optimization strategies -- **Hardware Utilization**: Maximizing GPU resource utilization -- **Scaling Characteristics**: How optimizations affect different workload sizes - -## Next Steps - -After mastering Version 2: - -1. **Analyze fusion impact** across different model and batch configurations -2. **Identify remaining bottlenecks** using ROCm profiling data -3. **Prepare optimization targets** for Version 3 (Triton kernels) -4. **Document lessons learned** for production deployment -5. **Establish performance baselines** for advanced optimizations - -**Ready for Custom Kernels? Proceed to [Version 3: Triton Integration](../version3_triton/README.md)** - - +## Expected Performance Improvements -**Expected Results**: 1.6-2.5x speedup, 60-90% memory reduction, comprehensive ROCm profiling mastery. 
+| Optimization | Speedup | Memory Reduction | Kernel Reduction | +|-------------|---------|------------------|------------------| +| QKV Fusion | 1.2-1.4x | 15-25% | 33% (3→1) | +| Flash Attention | 1.3-2.0x | 50-80% | 20% | +| SwiGLU Fusion | 1.1-1.3x | 10-20% | 50% (2→1) | +| Combined | 1.6-2.5x | 60-90% | 40-60% | +## Additional Resources +- rocprofv3 documentation: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocprofv3.html +- rocpd output format: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html +- Perfetto UI: https://ui.perfetto.dev/ diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/get_counters.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/get_counters.sh new file mode 100644 index 00000000..2ae22c1c --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/get_counters.sh @@ -0,0 +1,78 @@ +#!/bin/bash +# Script to profile TinyTransformer V2 with rocprofv3 kernel trace +# This captures kernel execution metrics for performance analysis +# +# Supports both ROCm 6.x (CSV output) and ROCm 7.x (SQLite database output) + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./counters/counter_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 kernel trace collection for TinyTransformer V2..." +echo "Output directory: $OUTPUT_DIR" + +# Run with rocprofv3 to collect kernel trace +rocprofv3 \ + --kernel-trace \ + --output-directory "$OUTPUT_DIR" \ + -- python tiny_llama_v2.py \ + --batch-size 8 \ + --seq-len 128 \ + --num-steps 10 + +echo "" +echo "Profiling complete! 
Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR"/*/ 2>/dev/null || ls -lh "$OUTPUT_DIR" +echo "" + +# Analyze results based on ROCm version +echo "To analyze results:" +DB_FILE=$(find "$OUTPUT_DIR" -name "*_results.db" 2>/dev/null | head -1) +if [ -n "$DB_FILE" ]; then + echo " Database file: $DB_FILE" + echo "" + echo " Export to CSV:" + echo " rocpd2csv -i $DB_FILE -o kernel_stats.csv" + echo "" + echo " Get kernel summary:" + echo " rocpd summary -i $DB_FILE --region-categories KERNEL" +else + echo " Check $OUTPUT_DIR for output files" +fi diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/get_hotspots.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/get_hotspots.sh new file mode 100755 index 00000000..1725308a --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/get_hotspots.sh @@ -0,0 +1,55 @@ +#!/bin/bash +# +# Get hotspots analysis using rocprofv3 +# Compatible with ROCm 6.x and 7.x +# + +set -e + +echo "==========================================" +echo "rocprofv3 Hotspots Analysis - Version 2" +echo "==========================================" +echo "" + +OUTPUT_DIR="./hotspots/hotspot_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" +echo "Running: rocprofv3 --stats -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprofv3 --stats -- python ../../tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] Hotspot analysis completed" +else + echo "[FAILED] Hotspot analysis failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls +echo "" + +# Check for stats/CSV files +if ls *.csv 1> /dev/null 2>&1; then + echo "Statistics files found:" + for f in *.csv; do + echo "" + echo "File: $f" + echo "Top 10 entries:" + head -11 "$f" + done +else + echo "Looking for statistics in subdirectories:" + find . -name "*.csv" -exec echo "Found: {}" \; -exec head -11 {} \; +fi +echo "" + +echo "Hotspot analysis identifies GPU kernels with highest time consumption." +echo "" diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/get_rocprof_compute.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/get_rocprof_compute.sh new file mode 100755 index 00000000..c1c265c4 --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/get_rocprof_compute.sh @@ -0,0 +1,49 @@ +#!/bin/bash +# +# Get detailed GPU metrics using rocprof-compute +# Compatible with ROCm 6.x and 7.x +# +# Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) +# for full hardware counter support. Consumer GPUs may have limited counter availability. +# + +set -e + +echo "==========================================" +echo "rocprof-compute Profiling - TinyTransformer V2" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_compute/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-compute to collect detailed GPU metrics +WORKLOAD_NAME="tiny_llama_v2_$(date +%Y%m%d_%H%M%S)" +echo "Running: rocprof-compute profile --name $WORKLOAD_NAME -d $OUTPUT_DIR -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +rocprof-compute profile --name "$WORKLOAD_NAME" -d "$OUTPUT_DIR" -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? 
+ +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-compute profiling completed" +else + echo "[FAILED] rocprof-compute profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find "$OUTPUT_DIR" -type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " rocprof-compute analyze -p $OUTPUT_DIR/workloads/$WORKLOAD_NAME/rocprof --dispatch -n tiny_llama_dispatch" +echo "" +echo "For available analysis options:" +echo " rocprof-compute analyze --help" +echo "" diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/get_rocprof_sys.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/get_rocprof_sys.sh new file mode 100755 index 00000000..89209260 --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/get_rocprof_sys.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# +# Get system-level profiling using rocprof-sys +# Compatible with ROCm 6.x and 7.x +# +# NOTE: rocprof-sys may produce memory map dumps in some configurations. +# Issue reference: TBD +# + +set -e + +echo "==========================================" +echo "rocprof-sys Profiling - TinyTransformer V2" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_sys/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-sys to collect system-level traces +echo "Running: rocprof-sys-run --profile --trace -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprof-sys-run --profile --trace -- python ../../tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-sys profiling completed" +else + echo "[FAILED] rocprof-sys profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " Open the .proto file in Perfetto UI: https://ui.perfetto.dev/" +echo "" diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/get_trace.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/get_trace.sh new file mode 100644 index 00000000..0869b0cf --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/get_trace.sh @@ -0,0 +1,86 @@ +#!/bin/bash +# Script to profile TinyTransformer V2 with rocprofv3 runtime trace +# This captures GPU API calls, kernel launches, and memory operations +# +# Compatible with ROCm 6.x and 7.x + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. 
-f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./traces/trace_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 runtime trace profiling for TinyTransformer V2..." +echo "Output directory: $OUTPUT_DIR" + +# Build rocprofv3 command with appropriate flags for ROCm version +# ROCm 6.4+ and 7.x require explicit --output-format pftrace to generate Perfetto traces +if [ "$ROCM_MAJOR" = "7" ] || [ "$ROCM_MAJOR" = "6" ]; then + echo "Using ROCm 6.x/7.x: --output-format pftrace (generates Perfetto trace)" + OUTPUT_FORMAT="--output-format pftrace" +else + echo "Using ROCm 5.x or older: default format" + OUTPUT_FORMAT="" +fi + +echo "" +echo "Collecting full runtime trace (HIP/HSA API calls, kernels, memory operations)" +echo "" + +# Run with rocprofv3 to collect full runtime trace +cd "$OUTPUT_DIR" +rocprofv3 \ + --runtime-trace \ + $OUTPUT_FORMAT \ + -- python ../../tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10 + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh ./*/ 2>/dev/null || ls -lh . +echo "" + +# Find and report pftrace files +PFTRACE=$(find . -name "*.pftrace" -size +1k 2>/dev/null | head -1) +if [ -n "$PFTRACE" ]; then + echo "Perfetto trace file: $PFTRACE" + echo "Size: $(ls -lh "$PFTRACE" | awk '{print $5}')" + echo "" + echo "To view the trace:" + echo " 1. Visit: https://ui.perfetto.dev/" + echo " 2. Open: $PFTRACE" +fi diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/test_github_issue.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/test_github_issue.sh new file mode 100755 index 00000000..439cfa3f --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/test_github_issue.sh @@ -0,0 +1,73 @@ +#!/bin/bash +# +# Test exact command from GitHub issue #1386 +# Issue: "No device activity" with rocprofv3 on version2 +# + +set -e + +echo "==========================================" +echo "GitHub Issue #1386 Reproduction Test" +echo "==========================================" +echo "" + +OUTPUT_DIR="./github_issue_test/test_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" +echo "Reproducing exact command from GitHub issue #1386:" +echo "rocprofv3 --runtime-trace --output-format pftrace -- python tiny_llama_v2.py --batch-size 8 --seq-len 128" +echo "" +echo "Note: GitHub issue did NOT specify --num-steps, so default value will be used" +echo "" + +cd "$OUTPUT_DIR" +rocprofv3 --runtime-trace --output-format pftrace -- python ../../tiny_llama_v2.py --batch-size 8 --seq-len 128 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprofv3 profiling completed" +else + echo "[FAILED] rocprofv3 profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls +echo "" + +echo "Checking trace file sizes:" +if compgen -G "*/*.pftrace" > /dev/null; then + for f in */*.pftrace; do + SIZE=$(stat -c%s "$f" 2>/dev/null || stat -f%z "$f" 2>/dev/null || echo "unknown") + SIZE_MB=$(echo "scale=2; $SIZE / 1048576" | bc) + echo " $f - ${SIZE_MB} MB" + done + echo "" + LARGEST=$(find . 
-name "*.pftrace" -exec ls -l {} \; | sort -k5 -n -r | head -1 | awk '{print $9, $5}') + LARGEST_FILE=$(echo $LARGEST | awk '{print $1}') + LARGEST_SIZE=$(echo $LARGEST | awk '{print $2}') + LARGEST_MB=$(echo "scale=2; $LARGEST_SIZE / 1048576" | bc) + + echo "Largest trace: $LARGEST_FILE (${LARGEST_MB} MB)" + echo "" + + if (( $(echo "$LARGEST_MB < 1" | bc -l) )); then + echo "[WARNING] Trace file is very small (< 1 MB)" + echo "This may indicate 'no device activity' issue from GitHub #1386" + else + echo "[OK] Trace file size looks normal" + echo "Version2 profiling appears to be working correctly" + fi +else + echo "[ERROR] No .pftrace files found" +fi +echo "" + +echo "Comparison with version1 baseline:" +echo " Version1 trace size: ~44 MB" +echo " Version2 trace size: ${LARGEST_MB} MB" +echo "" diff --git a/MLExamples/TinyTransformer/version2_pytorch_fused/test_rocpd.sh b/MLExamples/TinyTransformer/version2_pytorch_fused/test_rocpd.sh new file mode 100755 index 00000000..a40d273b --- /dev/null +++ b/MLExamples/TinyTransformer/version2_pytorch_fused/test_rocpd.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# +# Test rocpd (ROCm Profiling Daemon) for continuous profiling +# + +set -e + +echo "==========================================" +echo "rocpd Test - Version 2" +echo "==========================================" +echo "" + +# Check if rocpd is available +if ! command -v rocpd &> /dev/null; then + echo "[ERROR] rocpd not found in PATH" + echo "rocpd may not be installed or available in this ROCm version" + exit 1 +fi + +echo "rocpd location: $(which rocpd)" +echo "" + +OUTPUT_DIR="./rocpd/rocpd_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Start rocpd in background +echo "Starting rocpd daemon..." +rocpd --output-dir "$OUTPUT_DIR" & +ROCPD_PID=$! +echo "rocpd running with PID: $ROCPD_PID" +echo "" + +# Give rocpd time to initialize +sleep 2 + +# Run workload +echo "Running workload: python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10" +python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 10 +WORKLOAD_EXIT=$? +echo "" + +# Stop rocpd +echo "Stopping rocpd daemon..." +kill $ROCPD_PID 2>/dev/null || true +wait $ROCPD_PID 2>/dev/null || true +echo "" + +if [ $WORKLOAD_EXIT -eq 0 ]; then + echo "[SUCCESS] Workload completed" +else + echo "[FAILED] Workload failed with exit code $WORKLOAD_EXIT" +fi +echo "" + +echo "Generated files in $OUTPUT_DIR:" +ls -lh "$OUTPUT_DIR" +echo "" + +echo "rocpd output is a SQLite3 database file" +echo "" +echo "To view the database:" +echo " - Use VS Code SQLite Viewer extension" +echo " - rocprof-compute and rocprof-systems can consume it directly" +echo " - No official CLI tool is provided for viewing" +echo "" +echo "rocpd provides continuous profiling with minimal overhead" +echo "" diff --git a/MLExamples/TinyTransformer/version3_triton/README.md b/MLExamples/TinyTransformer/version3_triton/README.md index 24d5e8b2..a4fac542 100644 --- a/MLExamples/TinyTransformer/version3_triton/README.md +++ b/MLExamples/TinyTransformer/version3_triton/README.md @@ -1,785 +1,177 @@ +# ML Example: TinyTransformer Triton with ROCm Profiling -# Version 3: Triton Kernel Integration +README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton` from the Training Examples repository. 
-README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton` in the Training Examples repository +In this example we provide a Triton-optimized implementation of Tiny LLaMA with custom GPU kernels for profiling transformer workloads on AMD GPUs. This version builds on version2 with custom Triton kernels for RMSNorm, Flash Attention, and a hybrid SwiGLU approach. Several profiling scripts are provided to capture different aspects of GPU performance. -**Objective**: Implement custom GPU kernels using Triton for maximum performance optimization +## Features of the profiling scripts -**Expected Performance**: 2.0-3.5x speedup over baseline, 70-95% memory reduction +The version3_triton example contains several profiling scripts that capture different aspects of GPU performance: -**Learning Focus**: GPU kernel programming, memory access optimization, custom operator development +- **get_trace.sh**: Runtime trace collection using rocprofv3. Captures HIP/HSA API calls, kernel execution timeline, memory operations (H2D, D2H, D2D transfers), and synchronization events. Output is a Perfetto trace file for timeline visualization. +- **get_counters.sh**: Kernel trace collection using rocprofv3. Captures kernel execution statistics including timing and call counts. Useful for identifying hotspot kernels and their execution patterns. +- **get_rocprof_compute.sh**: Detailed GPU hardware metrics using rocprof-compute. Provides comprehensive performance analysis including compute utilization, memory bandwidth, and hardware counter data. +- **get_rocprof_sys.sh**: System-level profiling using rocprof-sys. Captures call stack sampling and system-level performance data for end-to-end analysis. +- **get_hotspots.sh**: GPU hotspot analysis using rocprofv3 stats mode. Identifies kernels with highest time consumption. -## Overview +## Key Optimizations -Version 3 introduces custom Triton GPU kernels for the most performance-critical operations in the Tiny LLaMA model. Triton provides a Python-like syntax for writing GPU kernels while automatically handling low-level optimizations like memory coalescing and register allocation. +This version implements custom Triton GPU kernels: -### Key Optimizations +- **RMSNorm Triton Kernel**: Fused variance computation and normalization (3 kernels → 1) +- **Flash Attention Triton Kernel**: Memory-efficient attention with O(S) complexity instead of O(S²) +- **Hybrid SwiGLU**: PyTorch for matrix multiplications + Triton for activation fusion +- **Automatic Tuning**: Triton compiler optimizations for target hardware -1. **Custom RMSNorm Kernel**: Fused variance computation and normalization -2. **SwiGLU Kernel**: Combined gate/up projections with SiLU activation -3. **Flash Attention Kernel**: Memory-efficient attention with O(N) complexity -4. 
**Automatic Optimization**: Triton compiler optimizations for target hardware +## Overview of the model -### Architecture Changes +The model is controlled with the following arguments: + +- `--batch-size `: batch size for training (default: 8) +- `--seq-len `: sequence length (default: 256) +- `--num-steps `: number of training steps (default: 50) +- `--hidden-dim `: hidden dimension (default: 512) +- `--num-layers `: number of transformer layers (default: 8) +- `--num-heads `: number of attention heads (default: 8) +- `--learning-rate `: learning rate (default: 3e-4) +- `--use-amp`: enable automatic mixed precision + +## Running the Triton model + +Load the required modules: ``` -Previous: PyTorch Operations → Multiple Kernel Launches → Memory Transfers -Current: Custom Triton Kernels → Single Optimized Launch → Minimal Memory Traffic +module load pytorch rocm triton ``` -## Files and Structure +Run a basic training run: ``` -version3_triton/ -├── README.md # This file -├── tiny_llama_v3.py # Main model with Triton kernels -├── run_triton_profiling.py # Triton-specific profiling -├── run_rocprof_triton.sh # ROCProfiler for Triton kernels -├── exercises/ -│ ├── exercise1_triton_basics.md # Triton fundamentals -│ ├── exercise2_swiglu_optimization.md # SwiGLU kernel deep dive -│ └── exercise3_flash_attention.md # Flash Attention implementation -└── results/ # Generated profiling results +echo "Running TinyTransformer V3 Triton" +python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10 ``` -## Key Components and Triton Kernel Implementation - -### Mathematical Foundation of Triton Kernels +## Runtime Trace Profiling with get_trace.sh -Triton kernels optimize GPU computation by exploiting the memory hierarchy and parallelism patterns. For complete mathematical foundations, see [TINY_LLAMA_ARCHITECTURE.md](../TINY_LLAMA_ARCHITECTURE.md). +This script captures GPU API calls, kernel launches, and memory operations for timeline analysis. -#### Memory Hierarchy Optimization +Run the profiling script: -**GPU Memory Hierarchy:** ``` -Registers (fastest, ~40KB per SM) → Data reuse within thread -Shared Memory (~164KB per SM) → Data sharing within thread block -L1 Cache (~128KB per SM) → Automatic caching -L2 Cache (~8MB global) → Cross-SM data sharing -HBM (slowest, ~64GB) → Main memory +echo "Collecting runtime trace with rocprofv3" +./get_trace.sh ``` -**Triton Optimization Strategy:** +The script will output results to `traces/trace_/`. To analyze the results: -$$\text{Arithmetic Intensity} = \frac{\text{FLOPS}}{\text{Memory Bytes Accessed}}$$ - -Triton maximizes this ratio by: - -1. **Tiling**: Processing data in blocks that fit in fast memory -2. **Fusion**: Combining multiple operations to reuse data -3. **Vectorization**: Using SIMD instructions efficiently +``` +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .pftrace file" +``` -### 1. Triton RMSNorm Implementation +## Kernel Trace Profiling with get_counters.sh -#### RMSNorm Mathematical Analysis +This script collects kernel execution statistics including timing and call counts. 
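In the kernel trace, the custom kernels from this version appear under their Triton function names, which makes them easy to distinguish from library GEMMs in the hotspot list. As a rough illustration of the kind of elementwise fusion used by the hybrid SwiGLU path (this is a sketch, not the actual kernel in `tiny_llama_v3.py`), a fused SiLU-multiply kernel could look like:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def silu_mul_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    gate = tl.load(gate_ptr + offsets, mask=mask).to(tl.float32)
    up = tl.load(up_ptr + offsets, mask=mask).to(tl.float32)
    # Fused SiLU(gate) * up, computed in fp32 for numerical stability.
    out = gate / (1.0 + tl.exp(-gate)) * up
    tl.store(out_ptr + offsets, out, mask=mask)

def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # Assumes contiguous GPU tensors of identical shape.
    out = torch.empty_like(gate, dtype=torch.float32)
    n = gate.numel()
    grid = (triton.cdiv(n, 1024),)
    silu_mul_kernel[grid](gate, up, out, n, BLOCK_SIZE=1024)
    return out.to(gate.dtype)
```

The jitted function name (here `silu_mul_kernel`) is what typically shows up in the rocprofv3 kernel statistics for kernels like these.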
-**Standard Implementation (PyTorch):** -```python -# Multiple kernel launches and memory accesses -variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True) # Kernel 1: Power + Reduction -rstd = torch.rsqrt(variance + eps) # Kernel 2: Reciprocal sqrt -output = (x * rstd).to(input_dtype) * weight # Kernel 3: Multiply + Scale +Run the profiling script: -# Total: 3 kernel launches, 3x memory bandwidth usage ``` - -**Triton Fused Implementation:** -```python -@triton.jit -def rmsnorm_kernel( - x_ptr, weight_ptr, output_ptr, - n_rows, n_cols, eps, - BLOCK_SIZE: tl.constexpr -): - """ - Fused RMSNorm kernel with optimal memory access patterns. - - Mathematical Operation: - output = (x / sqrt(mean(x^2) + eps)) * weight - - Memory Optimization: - - Single pass through input data - - Variance computation in registers - - Immediate normalization and scaling - """ - # Program ID determines which row this thread block processes - row_idx = tl.program_id(0) - - # Bounds checking - if row_idx >= n_rows: - return - - # Compute memory offsets for this row - x_row_ptr = x_ptr + row_idx * n_cols - output_row_ptr = output_ptr + row_idx * n_cols - - # Load weight vector (broadcast across all rows) - col_offsets = tl.arange(0, BLOCK_SIZE) - mask = col_offsets < n_cols - weight = tl.load(weight_ptr + col_offsets, mask=mask, other=0.0) - - # OPTIMIZATION 1: Streaming variance computation - variance = 0.0 - for block_start in range(0, n_cols, BLOCK_SIZE): - col_offsets = block_start + tl.arange(0, BLOCK_SIZE) - mask = col_offsets < n_cols - - # Load input block - x_block = tl.load(x_row_ptr + col_offsets, mask=mask, other=0.0) - - # Accumulate variance in registers (no memory writes!) - variance += tl.sum(x_block * x_block) - - # Compute RMS normalization factor - variance = variance / n_cols - rstd = 1.0 / tl.sqrt(variance + eps) - - # OPTIMIZATION 2: Fused normalization and scaling - for block_start in range(0, n_cols, BLOCK_SIZE): - col_offsets = block_start + tl.arange(0, BLOCK_SIZE) - mask = col_offsets < n_cols - - # Load input block again (cached in L1/L2) - x_block = tl.load(x_row_ptr + col_offsets, mask=mask, other=0.0) - weight_block = tl.load(weight_ptr + col_offsets, mask=mask, other=0.0) - - # Fused normalize + scale in single operation - output_block = x_block * rstd * weight_block - - # Store result - tl.store(output_row_ptr + col_offsets, output_block, mask=mask) +echo "Collecting kernel trace with rocprofv3" +./get_counters.sh ``` -**Performance Analysis:** -```python -RMSNORM_PERFORMANCE = { - 'memory_access_pattern': { - 'pytorch': 'Multiple passes through data', - 'triton': 'Two passes (variance + normalize)', - 'bandwidth_reduction': '~50% fewer memory accesses' - }, - 'kernel_launches': { - 'pytorch': 3, # pow, mean, multiply - 'triton': 1, # fused operation - 'overhead_reduction': '67% fewer kernel launches' - }, - 'numerical_precision': { - 'pytorch': 'Multiple intermediate tensors', - 'triton': 'High-precision accumulation in registers', - 'stability': 'Better numerical stability' - } -} -``` +The script will output results to `counters/counter_/`. -### 2. 
Triton SwiGLU Implementation - -#### SwiGLU Fusion Analysis - -**Memory Access Pattern Optimization:** - -$$\begin{aligned} -\text{Standard SwiGLU}: & \quad \text{4 separate operations} \\ -\text{gate} &= xW_{\text{gate}} \quad \text{(GEMM 1)} \\ -\text{up} &= xW_{\text{up}} \quad \text{(GEMM 2)} \\ -\text{activated} &= \text{SiLU}(\text{gate}) \quad \text{(Elementwise 1)} \\ -\text{output} &= \text{activated} \odot \text{up} \quad \text{(Elementwise 2)} \\ -\text{Memory Reads}: & \quad 4 \times \text{input tensor} + 2 \times \text{weight matrices} -\end{aligned}$$ - -**Triton Fused SwiGLU:** - -$$\begin{aligned} -\text{Triton SwiGLU}: & \quad \text{Single fused operation} \\ -\text{output} &= \text{SiLU}(xW_{\text{gate}}) \odot (xW_{\text{up}}) \\ -\text{Memory Reads}: & \quad 1 \times \text{input tensor} + 2 \times \text{weight matrices} -\end{aligned}$$ - -#### Detailed Triton SwiGLU Kernel - -```python -@triton.jit -def swiglu_kernel( - x_ptr, gate_weight_ptr, up_weight_ptr, output_ptr, - batch_size, seq_len, hidden_dim, intermediate_dim, - BLOCK_SIZE_M: tl.constexpr, - BLOCK_SIZE_K: tl.constexpr, - BLOCK_SIZE_N: tl.constexpr -): - """ - Fused SwiGLU kernel with optimal memory tiling. - - Computes: output = SiLU(x @ gate_weight) * (x @ up_weight) - - Tiling Strategy: - - M dimension: batch_size * seq_len - - K dimension: hidden_dim - - N dimension: intermediate_dim - """ - # Thread block coordinates - pid_m = tl.program_id(0) - pid_n = tl.program_id(1) - - # Compute tile offsets - m_offset = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) - n_offset = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) - - # Initialize accumulators for both gate and up projections - gate_acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) - up_acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) - - # OPTIMIZATION 1: Fused GEMM computation - for k in range(0, hidden_dim, BLOCK_SIZE_K): - k_offset = k + tl.arange(0, BLOCK_SIZE_K) - - # Load input tile (shared between gate and up computations) - x_tile = tl.load( - x_ptr + m_offset[:, None] * hidden_dim + k_offset[None, :], - mask=(m_offset[:, None] < batch_size * seq_len) & (k_offset[None, :] < hidden_dim) - ) - - # Load weight tiles - gate_weight_tile = tl.load( - gate_weight_ptr + k_offset[:, None] * intermediate_dim + n_offset[None, :], - mask=(k_offset[:, None] < hidden_dim) & (n_offset[None, :] < intermediate_dim) - ) - up_weight_tile = tl.load( - up_weight_ptr + k_offset[:, None] * intermediate_dim + n_offset[None, :], - mask=(k_offset[:, None] < hidden_dim) & (n_offset[None, :] < intermediate_dim) - ) - - # Fused matrix multiplication (data reuse in registers) - gate_acc += tl.dot(x_tile, gate_weight_tile) - up_acc += tl.dot(x_tile, up_weight_tile) - - # OPTIMIZATION 2: Fused SiLU activation and element-wise multiply - # SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)) - gate_activated = gate_acc / (1.0 + tl.exp(-gate_acc)) - swiglu_output = gate_activated * up_acc - - # Store final result - output_mask = (m_offset[:, None] < batch_size * seq_len) & (n_offset[None, :] < intermediate_dim) - tl.store( - output_ptr + m_offset[:, None] * intermediate_dim + n_offset[None, :], - swiglu_output, - mask=output_mask - ) -``` +ROCm 6.x outputs CSV files directly, while ROCm 7.x outputs SQLite databases. 
For ROCm 7.x database files, use rocpd tools: -**Triton SwiGLU Performance Characteristics:** -```python -SWIGLU_TRITON_BENEFITS = { - 'memory_efficiency': { - 'data_reuse': 'Input tensor loaded once, used for both gate and up', - 'register_usage': 'Intermediate results kept in registers', - 'bandwidth_reduction': '60-75% reduction in memory traffic' - }, - 'computational_efficiency': { - 'operation_fusion': 'GEMM + SiLU + elementwise in single kernel', - 'vectorization': 'Automatic SIMD instruction generation', - 'occupancy': 'Optimized thread block configuration' - }, - 'numerical_stability': { - 'precision': 'FP32 accumulation with FP16 storage', - 'activation_stability': 'Numerically stable SiLU implementation', - 'overflow_protection': 'Built-in overflow handling' - } -} ``` - -### 3. Triton Flash Attention Implementation - -#### Flash Attention Tiling Strategy - -**Memory Complexity Analysis:** - -$$\begin{aligned} -\text{Standard Attention Memory} &: O(B \times H \times S^{2}) \\ -\text{Flash Attention Memory} &: O(B \times H \times S) \\ -\text{SRAM Usage} &: O(B_r + B_c) \text{ where } B_r, B_c \text{ are tile sizes} \\ -\text{IO Complexity} &: O\left(\frac{S^{2}}{\sqrt{M}}\right) \text{ where } M \text{ is SRAM size} -\end{aligned}$$ - -#### Triton Flash Attention Kernel - -```python -@triton.jit -def flash_attention_kernel( - q_ptr, k_ptr, v_ptr, output_ptr, - batch_size, num_heads, seq_len, head_dim, - BLOCK_SIZE_M: tl.constexpr, - BLOCK_SIZE_N: tl.constexpr -): - """ - Memory-efficient Flash Attention with tiled computation. - - Algorithm: - 1. Tile Q, K, V into blocks that fit in SRAM - 2. Compute attention scores incrementally - 3. Use online softmax for numerical stability - 4. Accumulate attention output progressively - """ - # Thread block IDs - batch_idx = tl.program_id(0) - head_idx = tl.program_id(1) - q_tile_idx = tl.program_id(2) - - # Compute base pointers for this batch and head - q_base = q_ptr + batch_idx * num_heads * seq_len * head_dim + head_idx * seq_len * head_dim - k_base = k_ptr + batch_idx * num_heads * seq_len * head_dim + head_idx * seq_len * head_dim - v_base = v_ptr + batch_idx * num_heads * seq_len * head_dim + head_idx * seq_len * head_dim - output_base = output_ptr + batch_idx * num_heads * seq_len * head_dim + head_idx * seq_len * head_dim - - # Load Q tile (stays in SRAM for entire computation) - q_offset_m = q_tile_idx * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) - q_mask_m = q_offset_m < seq_len - - q_tile = tl.load( - q_base + q_offset_m[:, None] * head_dim + tl.arange(0, head_dim)[None, :], - mask=q_mask_m[:, None] - ) - - # Initialize output accumulator and normalization factors - output_acc = tl.zeros((BLOCK_SIZE_M, head_dim), dtype=tl.float32) - row_max = tl.full((BLOCK_SIZE_M,), float('-inf'), dtype=tl.float32) - row_sum = tl.zeros((BLOCK_SIZE_M,), dtype=tl.float32) - - # OPTIMIZATION 1: Tiled computation over K, V - for k_tile_idx in range(0, tl.cdiv(seq_len, BLOCK_SIZE_N)): - k_offset_n = k_tile_idx * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) - k_mask_n = k_offset_n < seq_len - - # Load K and V tiles - k_tile = tl.load( - k_base + k_offset_n[:, None] * head_dim + tl.arange(0, head_dim)[None, :], - mask=k_mask_n[:, None] - ) - v_tile = tl.load( - v_base + k_offset_n[:, None] * head_dim + tl.arange(0, head_dim)[None, :], - mask=k_mask_n[:, None] - ) - - # OPTIMIZATION 2: Compute attention scores in tiles - scores = tl.dot(q_tile, k_tile.T) * (1.0 / tl.sqrt(head_dim.to(tl.float32))) - - # Apply causal mask - causal_mask = q_offset_m[:, 
None] >= k_offset_n[None, :] - scores = tl.where(causal_mask, scores, float('-inf')) - - # OPTIMIZATION 3: Online softmax (numerically stable) - tile_max = tl.max(scores, axis=1) - new_row_max = tl.maximum(row_max, tile_max) - - # Rescale previous accumulated values - old_scale = tl.exp(row_max - new_row_max) - tile_scale = tl.exp(tile_max - new_row_max) - - # Update output accumulator - output_acc = output_acc * old_scale[:, None] - scores_softmax = tl.exp(scores - new_row_max[:, None]) * tile_scale[:, None] - output_acc += tl.dot(scores_softmax, v_tile) - - # Update normalization factors - row_sum = row_sum * old_scale + tl.sum(scores_softmax, axis=1) - row_max = new_row_max - - # Final normalization - output_final = output_acc / row_sum[:, None] - - # Store result - tl.store( - output_base + q_offset_m[:, None] * head_dim + tl.arange(0, head_dim)[None, :], - output_final, - mask=q_mask_m[:, None] - ) +echo "Exporting kernel statistics to CSV" +rocpd2csv -i -o kernel_stats.csv ``` -**Flash Attention Performance Benefits:** -```python -FLASH_ATTENTION_TRITON = { - 'memory_efficiency': { - 'complexity': 'O(N) vs O(N^2) for standard attention', - 'sram_usage': 'Optimal SRAM utilization with tiling', - 'hbm_access': 'Minimized high-bandwidth memory access' - }, - 'computational_efficiency': { - 'online_softmax': 'Numerically stable incremental computation', - 'tiled_gemm': 'Optimal matrix multiplication blocking', - 'kernel_fusion': 'Single kernel for entire attention computation' - }, - 'scalability': { - 'sequence_length': 'Linear scaling with sequence length', - 'batch_processing': 'Efficient batched computation', - 'multi_head': 'Parallelized across attention heads' - } -} ``` - -### Advanced Triton Optimization Techniques - -#### Block Size Tuning - -```python -def auto_tune_block_sizes(operation_type, input_shape, device_properties): - """ - Automatically tune block sizes for optimal performance. - """ - tuning_space = { - 'rmsnorm': { - 'block_sizes': [64, 128, 256, 512, 1024], - 'criteria': 'Memory bandwidth utilization', - 'constraints': 'Register usage < 64KB' - }, - 'swiglu': { - 'block_sizes': [(32, 64, 32), (64, 64, 64), (128, 32, 64)], - 'criteria': 'Arithmetic intensity maximization', - 'constraints': 'Shared memory < 164KB' - }, - 'flash_attention': { - 'block_sizes': [(64, 64), (128, 64), (64, 128)], - 'criteria': 'SRAM utilization efficiency', - 'constraints': 'Memory coalescing requirements' - } - } - - return optimize_for_hardware(tuning_space[operation_type], device_properties) +echo "Getting kernel summary" +rocpd summary -i --region-categories KERNEL ``` -#### Memory Coalescing Optimization - -```python -# Optimal memory access patterns for AMD GPUs -MEMORY_ACCESS_PATTERNS = { - 'coalesced_access': { - 'pattern': 'Consecutive threads access consecutive memory addresses', - 'bandwidth': '100% of peak memory bandwidth', - 'implementation': 'Proper stride patterns in Triton kernels' - }, - 'strided_access': { - 'pattern': 'Regular stride pattern across memory', - 'bandwidth': '50-80% of peak memory bandwidth', - 'optimization': 'Adjust block sizes to match stride' - }, - 'random_access': { - 'pattern': 'Irregular memory access pattern', - 'bandwidth': '10-30% of peak memory bandwidth', - 'mitigation': 'Data reordering and blocking strategies' - } -} -``` +Documentation for rocpd tools: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html -## Quick Start +## GPU Hardware Metrics with get_rocprof_compute.sh -### 1. 
Environment Setup +This script collects detailed GPU performance metrics for hardware utilization analysis. -Ensure Triton is installed in your environment: +Run the profiling script: -```bash -# Should already be installed from setup/ -pip install triton ``` - -Verify Triton installation: - -```python -import triton -print(f"Triton version: {triton.__version__}") +echo "Collecting GPU hardware metrics with rocprof-compute" +./get_rocprof_compute.sh ``` -### 2. Run the Model +The script will output results to `rocprof_compute/profile_/`. To analyze the results: -Execute the optimized model: - -```bash -cd version3_triton/ -python3 tiny_llama_v3.py -``` - -**Expected Output:** ``` -=== Triton Kernel Model Benchmark === -Model size: XXX.X M parameters -Input shape: torch.Size([4, 512]) -Average forward pass time: XX.XX ms -Throughput: XXXX tokens/second -Memory allocated: X.XX GB -Estimated FLOPS/second: XX.XX TFLOPS +echo "Generating performance analysis report" +rocprof-compute analyze -p /workloads//rocprof --dispatch -n tiny_llama_dispatch ``` -### 3. Profile Performance - -Run comprehensive profiling: +For available analysis options: -```bash -# Triton-specific profiling -python3 run_triton_profiling.py ``` - - -### 4. Analyze Results +Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) for full hardware counter support. Consumer GPUs may have limited counter availability. -Check generated results: +## System-Level Profiling with get_rocprof_sys.sh -```bash -ls profiling_results/ -cat profiling_results/triton_summary_report.md -``` +This script captures system-level performance with call stack sampling. - - -### Key Metrics to Monitor - -1. **Kernel Performance** - - Execution time per kernel - - Launch overhead - - Occupancy rates - -2. **Memory Utilization** - - Bandwidth efficiency - - Cache hit rates - - Memory access patterns - -3. **Compute Efficiency** - - VALU utilization - - Arithmetic intensity - - Roofline performance - -## Troubleshooting - -### Common Issues - -1. **Triton Not Found** - ```bash - pip install triton - # Or check environment setup - ``` - -2. **Kernel Compilation Errors** - - Verify GPU compatibility - - Check CUDA/ROCm installation - - Review tensor dimensions - -3. **Performance Regression** - - Ensure proper warmup - - Check block size settings - - Verify input data layout - -4. **Memory Errors** - - Reduce batch size or sequence length - - Check for memory leaks - - Monitor peak memory usage - -### Performance Debugging +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .proto file" +``` -1. **Profile Each Kernel Individually** - ```python - # Isolate kernel performance - triton_rmsnorm = TritonRMSNorm(dim) - # Benchmark just this component - ``` +Note: rocprof-sys may produce memory map dumps in some configurations. If profiling fails or produces excessive output, consider using rocprofv3 (get_trace.sh) instead. -2. **Compare Block Sizes** - ```python - # Test different configurations - for block_size in [64, 128, 256, 512]: - # Measure performance - ``` +## GPU Hotspot Analysis with get_hotspots.sh -3. **Memory Pattern Analysis** - ```python - # Check memory access efficiency - torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) - ``` +This script identifies kernels with the highest execution time using rocprofv3 stats mode. -## Next Steps +Run the profiling script: -After completing Version 3: +``` +echo "Collecting GPU hotspots" +./get_hotspots.sh +``` -1. 
**Review Performance Gains**: Compare with previous versions -2. **Understand Optimization Principles**: Kernel design patterns -3. **Prepare for Version 4**: Ultra-fused implementations +The script will output kernel statistics to `hotspots/hotspot_/`. -Version 4 will combine all optimizations into ultra-fused kernels that process entire transformer blocks in minimal kernel launches. +## Expected Performance Improvements -## Resources +Results from AMD MI325X with ROCm 6.4.4: -### Documentation -- [Triton Language Tutorial](https://triton-lang.org/main/getting-started/tutorials/index.html) -- [GPU Architecture Guide](https://rocmdocs.amd.com/en/latest/Programming_Guides/Programming-Guides.html) -- [ROCm Profiler Documentation](https://rocmdocs.amd.com/en/latest/ROCm_Tools/ROCm-Tools.html) +| Version | Throughput | Memory | Improvement | +|---------|-----------|--------|-------------| +| V1 Baseline | 372.9 samples/sec | 522.3 MB | - | +| V3 Triton | 2065.0 samples/sec | 281.8 MB | 5.5x faster, 46% less memory | -### Papers and References -- [Flash Attention Paper](https://arxiv.org/abs/2205.14135) -- [Triton: A Language for AI Kernel Programming](https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf) -- [Roofline Model for GPU Performance](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/) +Key optimizations impact: +- Flash Attention (Triton): 46% memory reduction +- RMSNorm (Triton): 3 kernels → 1 +- Hybrid SwiGLU: PyTorch matmul + Triton activation -### AMD ROCm Resources -- [ROCm Documentation](https://rocmdocs.amd.com/) -- [HIP Programming Guide](https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-GUIDE.html) -- [Performance Optimization Tips](https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-programming-guide.html) +## Additional Resources +- rocprofv3 documentation: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocprofv3.html +- rocpd output format: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html +- Perfetto UI: https://ui.perfetto.dev/ +- Triton Language Tutorial: https://triton-lang.org/main/getting-started/tutorials/index.html diff --git a/MLExamples/TinyTransformer/version3_triton/README_WORKSHOP.md b/MLExamples/TinyTransformer/version3_triton/README_WORKSHOP.md index faea4dc0..e0b218e7 100644 --- a/MLExamples/TinyTransformer/version3_triton/README_WORKSHOP.md +++ b/MLExamples/TinyTransformer/version3_triton/README_WORKSHOP.md @@ -1,394 +1,76 @@ - # Version 3: Triton Kernel Integration - Workshop Edition -`README_WORKSHOP.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton` in the Training Examples repository - -**Objective**: Implement custom GPU kernels using Triton for maximum performance optimization - -**Actual Performance**: **5.5x speedup** over baseline, **46% memory reduction** - -**Learning Focus**: GPU kernel programming, performance debugging, hybrid optimization strategies +README_WORKSHOP.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton` in the Training Examples repository. ---- +## Quick Start -## Quick Start (5 minutes) - -```bash +``` cd version3_triton/ - -# Run the optimized version python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 20 - -# Expected output: -# Loss: 7.0108 (correct!) -# Speed: 2065.0 samples/sec (5.5x faster than V1!) -# Memory: 281.8 MB (46% less than V1's 522 MB!) 
``` ---- - -## Performance Results +Expected output: Loss ~7.0, Speed ~2065 samples/sec, Memory ~282 MB -### Actual Measurements (AMD MI325X, ROCm 6.4.4) - -**Test Configuration:** Batch=8, SeqLen=128, Hidden=512, Layers=8, Heads=8 +## Performance Results (AMD MI325X, ROCm 6.4.4) | Metric | V1 Baseline | V3 Optimized | Improvement | |--------|-------------|--------------|-------------| -| **Training Speed** | 372.9 samples/sec | **2065.0 samples/sec** | **5.5x faster** | -| **Batch Time** | 21.7 ms | **3.9 ms** | **5.6x faster** | -| **Forward Pass** | 10.8 ms | **3.2 ms** | **3.4x faster** | -| **Backward Pass** | 9.2 ms | **0.3 ms** | **30x faster** | -| **Memory Usage** | 522.3 MB | **281.8 MB** | **46% reduction** | -| **Throughput** | 47,735 tokens/sec | **264,320 tokens/sec** | **5.5x faster** | - ---- - -## Key Concepts - -### What is Triton? - -**Triton** is a Python-based GPU programming language that makes it easy to write high-performance GPU kernels without dealing with low-level CUDA/HIP complexity. +| Training Speed | 372.9 samples/sec | 2065.0 samples/sec | 5.5x faster | +| Memory Usage | 522.3 MB | 281.8 MB | 46% reduction | -**Why Use Triton?** -- Python-like syntax (easier than CUDA/HIP) -- Automatic memory coalescing and optimization -- Works on both NVIDIA and AMD GPUs -- Great for memory-bound operations and fusion - -**When NOT to Use Triton?** -- Large matrix multiplications (use PyTorch/rocBLAS instead) -- Operations already well-optimized in PyTorch -- Compute-bound ops where BLAS libraries excel - ---- - -## Optimizations Applied in V3 +## Optimizations Applied ### 1. Flash Attention (Triton Kernel) -**What it does:** Memory-efficient attention using online softmax - -**PyTorch Standard Attention:** -```python -# Materializes full attention matrix: O(N²) memory -scores = Q @ K.T # [batch, heads, seq, seq] - HUGE! -attn = softmax(scores) -output = attn @ V -``` - -**Flash Attention:** -```python -# Online computation: O(N) memory -# Processes attention in blocks, never materializes full matrix -# Uses tiled computation with recomputation for backward pass -``` - -**Result:** - -- 46% memory reduction (282 MB vs 522 MB) -- Enables longer sequences -- Slightly faster forward pass +Memory-efficient attention using online softmax. Reduces memory from O(S²) to O(S). ### 2. RMSNorm (Triton Kernel) -**What it does:** Fused variance computation + normalization - -**Before (PyTorch):** 3 separate kernels -```python -variance = x.pow(2).mean(-1, keepdim=True) # Kernel 1 -rstd = torch.rsqrt(variance + eps) # Kernel 2 -output = (x * rstd) * weight # Kernel 3 -``` - -**After (Triton):** Single fused kernel -```python -# All operations in one kernel launch -# Variance computed in registers -# Immediate normalization and scaling -``` - -**Result:** - -- 3x fewer kernel launches -- Better cache utilization -- Reduced memory bandwidth +Fused variance computation + normalization (3 kernels → 1). ### 3. Hybrid SwiGLU Strategy -**Critical Lesson:** Don't use custom kernels for everything! - -**Initial (Broken) Approach:** -```python -# Used Triton kernel for matrix multiply - BAD IDEA! -# Launched 2,097,152 threads (batch × seq × d_ff) -# Each thread did manual reduction - VERY SLOW! -# Result: 25.5ms forward pass (2.4x SLOWER than V1!) -``` +Use PyTorch/rocBLAS for matrix multiplies, PyTorch for activation. Custom Triton kernel was 2.4x slower. 
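A minimal sketch of what the hybrid module looks like, assuming the same `gate_proj`/`up_proj`/`down_proj` layout used in `tiny_llama_v3.py` (names are illustrative; this variant uses PyTorch's fused `F.silu` for the activation, with a Triton activation kernel as an optional swap):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSwiGLU(nn.Module):
    """Hybrid SwiGLU: rocBLAS-backed nn.Linear for the GEMMs,
    PyTorch's fused SiLU for the gating; no custom matmul kernel."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)  # rocBLAS GEMM
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)    # rocBLAS GEMM
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # rocBLAS GEMM

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate, activate, scale, then project back down; every matmul stays in rocBLAS.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```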
-**Optimized (Hybrid) Approach:** -```python -# Use PyTorch for matrix multiplies (rocBLAS optimized) -gate = self.gate_proj(x) # rocBLAS -up = self.up_proj(x) # rocBLAS - -# Use PyTorch for activation (already fused) -gate_activated = F.silu(gate) * up - -# Use PyTorch for final projection -output = self.down_proj(intermediate) # rocBLAS -``` - -**Result:** -- 8x forward pass speedup (25.5ms → 3.2ms) -- **Key insight:** Use the best tool for each operation - -### 4. Tensor Contiguity (Critical!) -**The Bug:** Non-contiguous tensors after `repeat_interleave` for GQA - -**Before:** -```python -k = k.repeat_interleave(n_rep, dim=1) # Creates non-contiguous tensor! -v = v.repeat_interleave(n_rep, dim=1) # Bad memory layout for Triton! -``` - -**After:** -```python -k = k.repeat_interleave(n_rep, dim=1).contiguous() # Fix memory layout -v = v.repeat_interleave(n_rep, dim=1).contiguous() # Now Triton-friendly! -``` +### 4. Tensor Contiguity +Always `.contiguous()` before Triton kernels. Non-contiguous tensors caused 20x slowdown. -**Result:** +### 5. Weight Initialization +Proper initialization (std=0.02) prevents exploding loss. -- 20x speedup! (15.2 → 310.8 samples/sec) -- Triton kernels depend on contiguous memory for efficient access -- Always check tensor contiguity before passing to custom kernels +## Key Learnings -### 5. Proper Weight Initialization -**The Bug:** Default `nn.Embedding` uses `Normal(0, 1)` - too large! +1. **Correctness First**: Validate before optimizing +2. **Memory Layout Matters**: Non-contiguous tensors kill performance +3. **Hybrid Wins**: Use best tool for each operation +4. **Measure Accurately**: Always `torch.cuda.synchronize()` for timing +5. **Iterate**: Fix one issue at a time, re-measure -**Before:** -```python -# No weight initialization -# Embedding weight ~ Normal(0, 1) -# With dim=1024, logits have std ≈ √1024 ≈ 32 -# Result: Logits explode to hundreds, loss = 942! -``` +## Performance Debugging Exercise -**After:** -```python -def _init_weights(self): - for module in self.modules(): - if isinstance(module, nn.Embedding): - torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) ``` - -**Result:** -- Loss: 942 → 7.0 -- Critical for tied weights (embedding + lm_head) -- Small std prevents exploding gradients - ---- - -## Performance Debugging Exercise - -Want to see the complete optimization journey? Try our hands-on debugging exercise: - -```bash cd exercises/performance_debugging/ - -# Read the guide -cat README.md - -# Run all 5 stages of optimization with profiling ./run_all_stages.sh - -# This shows the complete journey: -# Stage 1: Broken (loss=942) - missing weight init -# Stage 2: Slow (15 samp/s) - non-contiguous tensors -# Stage 3: Better (311 samp/s) - added .contiguous() -# Stage 4: Same (306 samp/s) - accurate timing revealed issue -# Stage 5: Optimal (2065 samp/s) - hybrid kernel strategy! -``` - -**What you'll learn:** - -- How to diagnose incorrect model behavior (exploding loss) -- How to identify performance bottlenecks with profiling -- When to use custom kernels vs. optimized libraries -- How memory layout affects GPU performance -- Systematic debugging methodology - ---- - - - -## Key Learnings - -### 1. Correctness First, Performance Second - -- Stage 1 had broken loss (942 instead of 7) -- No point optimizing a broken model! -- Always validate correctness before optimizing - -### 2. Memory Layout Matters - -- Non-contiguous tensors killed performance (20x slower!) 
-- Always `.contiguous()` before Triton kernels -- Check with `tensor.is_contiguous()` - -### 3. Hybrid Optimization Wins - -- Don't write custom kernels for everything -- Use Triton for: memory-bound ops, fusion opportunities -- Use PyTorch/BLAS for: large matrix multiplies -- Profile to decide! - -### 4. Measure Accurately - -- GPU operations are asynchronous -- Always `torch.cuda.synchronize()` for accurate timing -- Without sync, timings are meaningless - -### 5. Iterative Debugging - -- Fix one issue at a time -- Re-measure after each fix -- Profile to identify next bottleneck -- Repeat until optimal - ---- - -## Files Overview - -``` -version3_triton/ - README_WORKSHOP.md # This file - tiny_llama_v3.py # Main optimized model - exercises/ - performance_debugging/ # Hands-on debugging exercise - README.md # Complete optimization journey - run_all_stages.sh # Run all 5 stages with profiling - WORKSHOP_GUIDE.md # Quick reference guide - exercise1_triton_basics.md # Triton fundamentals - exercise2_swiglu_optimization.md # SwiGLU deep dive - exercise3_flash_attention.md # Flash Attention implementation - triton_profiles/ # Generated profiling data -``` - ---- - -## Next Steps - -### After Running V3 +Shows the complete optimization journey through 5 stages: +- Stage 1: Broken (loss=942) - missing weight init +- Stage 2: Slow (15 samp/s) - non-contiguous tensors +- Stage 3: Better (311 samp/s) - added .contiguous() +- Stage 4: Same (306 samp/s) - accurate timing revealed issue +- Stage 5: Optimal (2065 samp/s) - hybrid kernel strategy -1. **Compare with V1:** -```bash -# Run V1 for comparison -cd ../version1_pytorch_baseline/ -python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 20 +## Common Issues -# Compare outputs -# V1: 372.9 samp/s, 522.3 MB -# V3: 2065.0 samp/s, 281.8 MB (5.5x faster, 46% less memory!) +**ImportError: No module named 'triton'** ``` - -2. **Try V4 (Ultra-Fused):** -```bash -cd ../version4_pytorch_sdpa/ -python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 20 - -# Expected: ~8x faster than V1! -``` - -3. **Deep Dive into Profiling:** -```bash -cd exercises/performance_debugging/ -./run_all_stages.sh - -# Analyze the profiling CSV files -# Compare kernel execution times -# Understand the optimization journey -``` - ---- - -## Common Issues and Solutions - -### Issue 1: ImportError: No module named 'triton' -```bash pip install triton ``` -### Issue 2: RuntimeError: CUDA not available -```bash -# Verify ROCm installation -rocminfo - -# Check PyTorch sees GPU -python -c "import torch; print(torch.cuda.is_available())" -``` - -### Issue 3: Loss is not ~7.0 - -- Check weight initialization is enabled -- Verify model architecture matches V1 -- Check for tensor shape mismatches - -### Issue 4: Performance slower than expected - -- Ensure tensors are contiguous: `.contiguous()` -- Check CUDA synchronization for accurate timing -- Profile to identify bottleneck kernel -- Verify using optimized SwiGLU (hybrid approach) - ---- - -## Additional Resources - -- **Triton Documentation:** https://triton-lang.org/ -- **Flash Attention Paper:** https://arxiv.org/abs/2205.14135 -- **ROCm Profiling Guide:** https://rocm.docs.amd.com/projects/rocprofiler/ -- **Performance Debugging Guide:** exercises/performance_debugging/README.md - ---- - -## Summary - -**V3 achieves 5.5x speedup through:** - -1. Flash Attention (Triton) - 46% memory reduction -2. RMSNorm (Triton) - Fused kernel -3. Hybrid SwiGLU - Use rocBLAS for matmul -4. 
Tensor contiguity - Critical for Triton performance -5. Proper initialization - Correctness first! - -**Key insight:** Best performance comes from using the right tool for each operation - not from using custom kernels everywhere! - -**Ready to debug?** Start with `cd exercises/performance_debugging/` +**Performance slower than expected** +- Ensure tensors are contiguous +- Use CUDA synchronization for accurate timing +- Use hybrid SwiGLU (not custom Triton matmul) +## Additional Resources +- Triton Documentation: https://triton-lang.org/ +- Flash Attention Paper: https://arxiv.org/abs/2205.14135 +- Performance Debugging Guide: exercises/performance_debugging/README.md diff --git a/MLExamples/TinyTransformer/version3_triton/exercises/exercise1_triton_basics.md b/MLExamples/TinyTransformer/version3_triton/exercises/exercise1_triton_basics.md index 9ceb42ab..9c23cc48 100644 --- a/MLExamples/TinyTransformer/version3_triton/exercises/exercise1_triton_basics.md +++ b/MLExamples/TinyTransformer/version3_triton/exercises/exercise1_triton_basics.md @@ -1,30 +1,12 @@ - ## Exercise 1: Understanding Triton Kernel Basics -`exercise1_triton_basics.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton/exercises` in the Training Examples repository - -**Objective**: Learn the fundamentals of Triton GPU programming and analyze basic kernel performance. - -**Time**: 45 minutes - -**Prerequisites**: Completed Version 1 and Version 2 exercises - -### Background +**Objective**: Learn Triton GPU programming fundamentals and analyze basic kernel performance. -Triton is a language and compiler for writing custom GPU kernels. It provides: +**Time**: 45 minutes | **Prerequisites**: Completed Version 1 and Version 2 exercises -- Python-like syntax for GPU programming -- Automatic memory coalescing and optimization -- Block-level programming model -- Integration with PyTorch +### Part A: Kernel Structure Analysis -In this exercise, you'll analyze the basic structure of Triton kernels and understand their performance characteristics. - -### Part A: Kernel Structure Analysis (15 minutes) - -#### Step 1: Examine the RMSNorm Kernel - -Open `tiny_llama_v3.py` and locate the `rmsnorm_kernel` function: +Examine the `rmsnorm_kernel` in `tiny_llama_v3.py`: ```python @triton.jit @@ -36,212 +18,48 @@ def rmsnorm_kernel( ): ``` -**Questions to Answer:** - -1. **Pointer Management**: How does Triton handle memory pointers compared to CUDA? -2. **Block Processing**: What is the role of `BLOCK_SIZE` in this kernel? -3. **Constexpr Usage**: Why are `eps` and `BLOCK_SIZE` marked as `tl.constexpr`? -4. **Memory Access Pattern**: How does the kernel ensure coalesced memory access? - -#### Step 2: Analyze Memory Access Patterns - -Look at the variance computation loop: - -```python -for i in range(0, n_elements, BLOCK_SIZE): - offsets = i + tl.arange(0, BLOCK_SIZE) - mask = offsets < n_elements - x_vals = tl.load(x_ptr + row_idx * n_elements + offsets, mask=mask, other=0.0) - variance += tl.sum(x_vals * x_vals, axis=0) -``` - -**Analysis Tasks:** - -1. **Memory Coalescing**: Explain how the `offsets` calculation ensures coalesced memory access -2. **Boundary Handling**: What does the `mask` parameter accomplish? -3. **Reduction Pattern**: How does this implement an efficient parallel reduction? 
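Before comparing timings, it is worth confirming that the fused kernel matches a plain PyTorch reference. A minimal check, assuming `TritonRMSNorm` is importable from `tiny_llama_v3`, exposes `weight` and `eps`, and accepts a `[batch, seq, dim]` tensor (adjust names and tolerances to the actual module):

```python
import torch
from tiny_llama_v3 import TritonRMSNorm  # assumed import; adjust to your layout

def reference_rmsnorm(x, weight, eps):
    # Same math the fused kernel implements: x / sqrt(mean(x^2) + eps) * weight
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

dim = 2048
x = torch.randn(8, 128, dim, device="cuda")
norm = TritonRMSNorm(dim).to("cuda")

out_triton = norm(x)
out_ref = reference_rmsnorm(x, norm.weight, norm.eps)

max_err = (out_triton - out_ref).abs().max().item()
print(f"Max abs error vs PyTorch reference: {max_err:.3e}")
assert torch.allclose(out_triton, out_ref, atol=1e-3, rtol=1e-3)
```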
- -#### Step 3: Compare with PyTorch Implementation - -Compare the Triton RMSNorm with the PyTorch version: - -```python -def pytorch_rmsnorm(x): - variance = x.pow(2).mean(dim=-1, keepdim=True) - x = x * torch.rsqrt(variance + eps) - return x * weight -``` - -**Discussion Points:** +**Questions:** +1. How does Triton handle memory pointers compared to CUDA? +2. What is the role of `BLOCK_SIZE`? +3. Why are `eps` and `BLOCK_SIZE` marked as `tl.constexpr`? -1. **Kernel Fusion**: How does Triton fuse operations that PyTorch keeps separate? -2. **Memory Efficiency**: What memory advantages does the Triton version have? -3. **Numerical Precision**: Are there any precision considerations? +### Part B: Performance Profiling -### Part B: Performance Profiling (20 minutes) - -#### Step 4: Run Basic Profiling - -Execute the Triton profiling script: +Run the Triton profiling script: ```bash cd version3_triton/ python3 run_triton_profiling.py ``` -**Expected Output Analysis:** - -``` -=== Triton Kernel Performance Analysis === - -1. RMSNorm Kernel Profiling - Triton RMSNorm: X.XXX ms - PyTorch RMSNorm: Y.YYY ms - Speedup: Z.ZZx - Max error: E.EEe-XX -``` - -**Performance Questions:** - -1. **Speedup Analysis**: What speedup did you achieve? Is it consistent with expectations? -2. **Accuracy Check**: What is the maximum error between implementations? Is this acceptable? -3. **Memory Usage**: How does memory usage compare between the implementations? - -#### Step 5: Analyze ROCProfiler Results - -Run the ROCProfiler analysis: +Run ROCProfiler analysis: ```bash -chmod +x run_rocprof_triton.sh ./run_rocprof_triton.sh -``` - -Examine the generated results: - -```bash -ls rocprof_results/ cat rocprof_results/triton_analysis_summary.md ``` -**Profiling Analysis:** - -1. **Kernel Launch Overhead**: What is the launch overhead for Triton kernels? -2. **Memory Bandwidth**: What memory bandwidth utilization are you achieving? -3. **GPU Utilization**: How well are you utilizing the available compute units? - -### Part C: Block Size Optimization (10 minutes) - -#### Step 6: Experiment with Block Sizes - -Modify the `rmsnorm_kernel` call in `TritonRMSNorm.forward()`: - -```python -# Try different block sizes -for block_size in [64, 128, 256, 512, 1024]: - rmsnorm_kernel[grid]( - x_reshaped, self.weight, output, - dim, self.eps, BLOCK_SIZE=block_size - ) -``` - -**Optimization Tasks:** - -1. **Performance Testing**: Measure execution time for each block size -2. **Memory Analysis**: How does block size affect memory access patterns? -3. **Occupancy Impact**: What's the relationship between block size and GPU occupancy? - -#### Step 7: Memory Access Analysis - -Create a simple memory access pattern analyzer: - -```python -def analyze_memory_pattern(): - # Simulate memory access pattern - dim = 2048 - block_sizes = [64, 128, 256, 512] - - for block_size in block_sizes: - total_blocks = (dim + block_size - 1) // block_size - print(f"Block size {block_size}: {total_blocks} blocks") - - # Analyze memory transactions - elements_per_transaction = min(block_size, 32) # Typical coalescing width - transactions = (block_size + elements_per_transaction - 1) // elements_per_transaction - print(f" Memory transactions per block: {transactions}") - print(f" Total transactions: {total_blocks * transactions}") -``` - -**Memory Analysis Questions:** +### Part C: Block Size Optimization -1. **Coalescing Efficiency**: Which block size provides the best memory coalescing? -2. 
**Transaction Overhead**: How does the number of memory transactions scale? -3. **Cache Utilization**: What's the impact on L1/L2 cache utilization? +Experiment with block sizes (64, 128, 256, 512, 1024) and measure: +- Execution time +- Memory transactions +- GPU occupancy -### Exercise Results - -Document your findings: - -#### Performance Results Table +### Results Template | Metric | Triton RMSNorm | PyTorch RMSNorm | Speedup | |--------|----------------|------------------|---------| | Execution Time (ms) | | | | | Memory Usage (MB) | | | | -| Bandwidth (GB/s) | | | | - -#### Block Size Analysis - -| Block Size | Execution Time (ms) | Memory Transactions | GPU Occupancy | -|------------|-------------------|-------------------|---------------| -| 64 | | | | -| 128 | | | | -| 256 | | | | -| 512 | | | | -| 1024 | | | | - -#### Key Insights - -1. **Best Block Size**: _____ -2. **Primary Performance Bottleneck**: _____ -3. **Memory Efficiency**: _____ -4. **Optimization Opportunities**: _____ - -### Discussion Questions - -1. **Triton vs CUDA**: How does Triton kernel development compare to writing CUDA kernels? -2. **Automatic Optimizations**: What optimizations does Triton perform automatically? +### Common Issues -3. **Performance Portability**: How portable are Triton kernels across different GPU architectures? +- **Compilation Errors**: Check tensor shapes and constexpr values +- **Performance Regression**: Verify block size tuning and proper warmup +- **Numerical Differences**: Small FP precision differences are normal -4. **Integration Complexity**: What are the challenges of integrating Triton kernels into PyTorch models? - -### Next Steps - -In Exercise 2, you'll dive deeper into the SwiGLU kernel implementation and learn about: -- Multi-dimensional memory access patterns -- Kernel fusion strategies -- Advanced optimization techniques -- Debugging Triton kernels - -### Common Issues and Solutions - -#### Issue 1: Compilation Errors -**Problem**: Triton kernel fails to compile -**Solution**: Check that all tensor shapes are compatible and constexpr values are properly defined - -#### Issue 2: Performance Regression -**Problem**: Triton kernel is slower than PyTorch -**Solution**: Verify block size tuning and memory access patterns; ensure proper warmup - -#### Issue 3: Numerical Differences -**Problem**: Results don't match PyTorch exactly -**Solution**: Check floating-point precision and reduction order; small differences are normal - -### Additional Resources +### Resources - [Triton Documentation](https://triton-lang.org/main/index.html) - [Triton Tutorials](https://triton-lang.org/main/getting-started/tutorials/index.html) -- [GPU Memory Coalescing Guide](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/) -- [ROCm Performance Guidelines](https://rocmdocs.amd.com/) - diff --git a/MLExamples/TinyTransformer/version3_triton/exercises/exercise2_swiglu_optimization.md b/MLExamples/TinyTransformer/version3_triton/exercises/exercise2_swiglu_optimization.md index 0607ab3e..7eca4afc 100644 --- a/MLExamples/TinyTransformer/version3_triton/exercises/exercise2_swiglu_optimization.md +++ b/MLExamples/TinyTransformer/version3_triton/exercises/exercise2_swiglu_optimization.md @@ -1,30 +1,16 @@ - ## Exercise 2: SwiGLU Kernel Optimization -`exercise2_swiglu_optimization.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton` in the Training Examples repository - -**Objective**: Master advanced Triton kernel development through SwiGLU 
optimization and learn multi-dimensional memory access patterns. - -**Time**: 60 minutes +**Objective**: Master advanced Triton kernel development through SwiGLU optimization. -**Prerequisites**: Completed Exercise 1 +**Time**: 60 minutes | **Prerequisites**: Completed Exercise 1 ### Background -The SwiGLU (Swish-Gated Linear Unit) is a key component in modern transformer architectures. It combines: - -- Gate projection with SiLU activation -- Up projection -- Element-wise multiplication -- Down projection - -Traditional implementations require multiple kernel launches and intermediate storage. Our Triton kernel fuses the gate and up projections with activation, reducing memory traffic and improving performance. +SwiGLU (Swish-Gated Linear Unit) combines gate projection with SiLU activation, up projection, element-wise multiplication, and down projection. Our Triton kernel fuses the gate and up projections with activation. -### Part A: SwiGLU Kernel Deep Dive (20 minutes) +### Part A: SwiGLU Kernel Analysis -#### Step 1: Analyze the Kernel Structure - -Examine the `swiglu_kernel` in `tiny_llama_v3.py`: +Examine `swiglu_kernel` in `tiny_llama_v3.py`: ```python @triton.jit @@ -37,425 +23,40 @@ def swiglu_kernel( ): ``` -**Analysis Questions:** - -1. **Multi-dimensional Blocking**: Why does this kernel use three different block sizes? -2. **Memory Layout**: How are the tensors laid out in memory (batch, sequence, feature dimensions)? -3. **Compute Intensity**: What is the arithmetic intensity of this kernel? - -#### Step 2: Understand the Computation Flow - -Follow the kernel execution: - -```python -# Load input -input_offset = batch_idx * seq_len * d_model + seq_idx * d_model -x_block = tl.load(x_ptr + input_offset + tl.arange(0, d_model)) - -# Compute projections -for i in range(0, d_model, BLOCK_SIZE_D): - x_vals = tl.load(x_ptr + input_offset + i + tl.arange(0, BLOCK_SIZE_D)) - gate_weights = tl.load(gate_weight_ptr + d_idx * d_model + i + tl.arange(0, BLOCK_SIZE_D)) - up_weights = tl.load(up_weight_ptr + d_idx * d_model + i + tl.arange(0, BLOCK_SIZE_D)) - - gate_sum += tl.sum(x_vals * gate_weights) - up_sum += tl.sum(x_vals * up_weights) - -# Apply activation -gate_activated = gate_sum / (1.0 + tl.exp(-gate_sum)) -result = gate_activated * up_sum -``` - -**Computation Analysis:** - -1. **Memory Reuse**: How does the kernel maximize input data reuse? -2. **Reduction Pattern**: Explain the dot product computation strategy -3. **Activation Fusion**: How is the SiLU activation integrated efficiently? 
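As a correctness baseline for the analysis above, the gate/up part of the fused kernel can be reproduced in a few lines of plain PyTorch. This is a sketch only: weights are assumed to use the `[d_ff, d_model]` layout the kernel indexes into, and any fused output should be compared against it with a loose tolerance.

```python
import torch
import torch.nn.functional as F

def reference_swiglu_gate_up(x, gate_weight, up_weight):
    """Plain PyTorch version of what the fused kernel computes:
    SiLU(x @ W_gate^T) * (x @ W_up^T), with weights stored as [d_ff, d_model]."""
    return F.silu(F.linear(x, gate_weight)) * F.linear(x, up_weight)

# Shapes used throughout this exercise
batch_size, seq_len, d_model = 4, 512, 2048
d_ff = int(2.67 * d_model)

x = torch.randn(batch_size, seq_len, d_model, device="cuda")
w_gate = torch.randn(d_ff, d_model, device="cuda") * 0.02
w_up = torch.randn(d_ff, d_model, device="cuda") * 0.02

ref = reference_swiglu_gate_up(x, w_gate, w_up)

# Compare the fused Triton output computed with the same weights, e.g.:
# torch.testing.assert_close(fused_out, ref, atol=1e-2, rtol=1e-2)
```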
- -#### Step 3: Memory Access Pattern Visualization - -Create a visualization tool for memory access patterns: - -```python -def visualize_swiglu_access_pattern(): - """Visualize memory access patterns for SwiGLU kernel.""" - - # Example dimensions - batch_size, seq_len, d_model, d_ff = 2, 4, 8, 12 - - print("SwiGLU Memory Access Pattern Analysis") - print("=" * 50) - - print(f"Tensor shapes:") - print(f" Input (x): [{batch_size}, {seq_len}, {d_model}]") - print(f" Gate weights: [{d_ff}, {d_model}]") - print(f" Up weights: [{d_ff}, {d_model}]") - print(f" Output: [{batch_size}, {seq_len}, {d_ff}]") - - print(f"\nTotal elements:") - print(f" Input: {batch_size * seq_len * d_model}") - print(f" Weights: {2 * d_ff * d_model}") - print(f" Output: {batch_size * seq_len * d_ff}") - - # Analyze memory traffic - input_reads = batch_size * seq_len * d_model * d_ff # Each input element read d_ff times - weight_reads = 2 * d_ff * d_model * batch_size * seq_len # Weight reuse across batch/seq - output_writes = batch_size * seq_len * d_ff - - total_bytes = (input_reads + weight_reads + output_writes) * 4 # float32 - - print(f"\nMemory traffic analysis:") - print(f" Input reads: {input_reads}") - print(f" Weight reads: {weight_reads}") - print(f" Output writes: {output_writes}") - print(f" Total memory traffic: {total_bytes / 1e6:.2f} MB") - - # Compute to memory ratio - flops = 2 * batch_size * seq_len * d_model * d_ff * 2 # 2 projections, 2 ops per MAC - arithmetic_intensity = flops / total_bytes * 4 # ops per byte - - print(f" FLOPs: {flops}") - print(f" Arithmetic intensity: {arithmetic_intensity:.2f} ops/byte") - -# Run the analysis -visualize_swiglu_access_pattern() -``` - -### Part B: Performance Optimization (25 minutes) - -#### Step 4: Block Size Tuning - -Create a systematic block size tuning script: - -```python -import time -import torch -from tiny_llama_v3 import TritonSwiGLU - -def tune_swiglu_block_sizes(): - """Tune block sizes for optimal SwiGLU performance.""" - - device = torch.device('cuda') - batch_size, seq_len, d_model = 4, 512, 2048 - hidden_dim = int(2.67 * d_model) - - # Test different block size combinations - block_configs = [ - (1, 1, 32), # Small blocks - (1, 1, 64), # Medium blocks - (1, 1, 128), # Large blocks - (1, 2, 64), # Sequence blocking - (2, 1, 64), # Batch blocking - (1, 1, 256), # Extra large feature blocks - ] - - x = torch.randn(batch_size, seq_len, d_model, device=device) - - results = [] - - for b_block, s_block, d_block in block_configs: - print(f"\nTesting block configuration: B={b_block}, S={s_block}, D={d_block}") - - # Create modified SwiGLU with specific block sizes - swiglu = TritonSwiGLU(d_model, hidden_dim).to(device) - - # Warmup - for _ in range(10): - _ = swiglu(x) - torch.cuda.synchronize() - - # Benchmark - start_time = time.time() - for _ in range(100): - output = swiglu(x) - torch.cuda.synchronize() - - avg_time = (time.time() - start_time) / 100 - - results.append({ - 'config': (b_block, s_block, d_block), - 'time_ms': avg_time * 1000, - 'throughput': batch_size * seq_len / avg_time - }) - - print(f" Time: {avg_time*1000:.3f} ms") - print(f" Throughput: {batch_size * seq_len / avg_time:.0f} tokens/s") - - # Find best configuration - best_result = min(results, key=lambda x: x['time_ms']) - print(f"\nBest configuration: {best_result['config']}") - print(f"Best time: {best_result['time_ms']:.3f} ms") - - return results - -# Run block size tuning -block_results = tune_swiglu_block_sizes() -``` - -#### Step 5: Memory Layout Optimization - 
-Experiment with different memory layouts: - -```python -def analyze_memory_layouts(): - """Analyze impact of different memory layouts on performance.""" - - device = torch.device('cuda') - batch_size, seq_len, d_model = 4, 512, 2048 - hidden_dim = int(2.67 * d_model) - - # Test different weight layouts - layouts = ['row_major', 'column_major', 'transposed'] - - x = torch.randn(batch_size, seq_len, d_model, device=device) - - for layout in layouts: - print(f"\nTesting {layout} weight layout:") - - swiglu = TritonSwiGLU(d_model, hidden_dim).to(device) - - if layout == 'column_major': - # Transpose weights for column-major access - swiglu.gate_proj.weight.data = swiglu.gate_proj.weight.data.t().contiguous().t() - swiglu.up_proj.weight.data = swiglu.up_proj.weight.data.t().contiguous().t() - elif layout == 'transposed': - # Use transposed weights - swiglu.gate_proj.weight.data = swiglu.gate_proj.weight.data.t().contiguous() - swiglu.up_proj.weight.data = swiglu.up_proj.weight.data.t().contiguous() - - # Benchmark - torch.cuda.synchronize() - start_time = time.time() - - for _ in range(50): - output = swiglu(x) - - torch.cuda.synchronize() - avg_time = (time.time() - start_time) / 50 - - print(f" Average time: {avg_time*1000:.3f} ms") - print(f" Memory bandwidth: {estimate_bandwidth(x, swiglu, avg_time):.1f} GB/s") - -def estimate_bandwidth(x, swiglu, exec_time): - """Estimate memory bandwidth utilization.""" - - # Calculate memory footprint - input_size = x.numel() * 4 # float32 - weight_size = (swiglu.gate_proj.weight.numel() + swiglu.up_proj.weight.numel()) * 4 - output_size = x.shape[0] * x.shape[1] * swiglu.gate_proj.out_features * 4 - - total_bytes = input_size + weight_size + output_size - bandwidth = total_bytes / exec_time / 1e9 - - return bandwidth - -# Run memory layout analysis -analyze_memory_layouts() -``` - -#### Step 6: Arithmetic Intensity Analysis - -Calculate and optimize arithmetic intensity: - -```python -def analyze_arithmetic_intensity(): - """Analyze arithmetic intensity and roofline performance.""" - - batch_size, seq_len, d_model = 4, 512, 2048 - hidden_dim = int(2.67 * d_model) - - # Calculate FLOPs - # Gate projection: batch_size * seq_len * d_model * hidden_dim * 2 (MAC) - gate_flops = batch_size * seq_len * d_model * hidden_dim * 2 - - # Up projection: same as gate - up_flops = gate_flops - - # SiLU activation: ~4 FLOPs per element (exp, add, div, mul) - silu_flops = batch_size * seq_len * hidden_dim * 4 - - # Element-wise multiply: 1 FLOP per element - multiply_flops = batch_size * seq_len * hidden_dim +**Questions:** +1. Why does this kernel use three different block sizes? +2. How are tensors laid out in memory? +3. What is the arithmetic intensity? 
- total_flops = gate_flops + up_flops + silu_flops + multiply_flops +### Part B: Performance Optimization - # Calculate memory traffic - input_bytes = batch_size * seq_len * d_model * 4 - gate_weight_bytes = d_model * hidden_dim * 4 - up_weight_bytes = d_model * hidden_dim * 4 - output_bytes = batch_size * seq_len * hidden_dim * 4 +Test different block size combinations: +- (1, 1, 32), (1, 1, 64), (1, 1, 128) +- (1, 2, 64), (2, 1, 64), (1, 1, 256) - total_bytes = input_bytes + gate_weight_bytes + up_weight_bytes + output_bytes +### Part C: Arithmetic Intensity Analysis - arithmetic_intensity = total_flops / total_bytes +Calculate for batch_size=4, seq_len=512, d_model=2048: +- Total FLOPs (gate + up projections + activation) +- Total memory traffic +- Arithmetic intensity (FLOPs/byte) - print("SwiGLU Arithmetic Intensity Analysis") - print("=" * 40) - print(f"Problem size: {batch_size}x{seq_len}x{d_model} -> {hidden_dim}") - print(f"Total FLOPs: {total_flops/1e9:.2f} GFLOPs") - print(f"Total memory: {total_bytes/1e6:.2f} MB") - print(f"Arithmetic intensity: {arithmetic_intensity:.2f} FLOPs/byte") +Determine if kernel is compute-bound or memory-bound using roofline analysis. - # Roofline analysis - peak_flops = 200e12 # Example: 200 TFLOPS (MI250X) - peak_bandwidth = 1600e9 # Example: 1.6 TB/s +### Results Template - compute_bound_intensity = peak_flops / peak_bandwidth - - print(f"\nRoofline Analysis:") - print(f"Peak compute: {peak_flops/1e12:.0f} TFLOPS") - print(f"Peak bandwidth: {peak_bandwidth/1e9:.0f} GB/s") - print(f"Compute-bound threshold: {compute_bound_intensity:.2f} FLOPs/byte") - - if arithmetic_intensity > compute_bound_intensity: - print("Kernel is compute-bound - optimize arithmetic operations") - bottleneck = "compute" - else: - print("Kernel is memory-bound - optimize memory access") - bottleneck = "memory" - - return { - 'arithmetic_intensity': arithmetic_intensity, - 'total_flops': total_flops, - 'total_bytes': total_bytes, - 'bottleneck': bottleneck - } - -# Run arithmetic intensity analysis -intensity_results = analyze_arithmetic_intensity() -``` - -### Part C: Advanced Optimization Techniques (15 minutes) - -#### Step 7: Implement Kernel Variants - -Create optimized kernel variants: - -```python -# Version 1: Basic implementation (current) -# Version 2: Optimized for memory-bound workloads -# Version 3: Optimized for compute-bound workloads - -@triton.jit -def swiglu_kernel_optimized_memory( - x_ptr, gate_weight_ptr, up_weight_ptr, output_ptr, - batch_size, seq_len, d_model, d_ff, - BLOCK_SIZE_D: tl.constexpr, -): - """Memory-optimized SwiGLU kernel with better data reuse.""" +| Configuration | Time (ms) | Speedup vs PyTorch | Bandwidth (GB/s) | +|---------------|-----------|-------------------|------------------| +| Block Size (1,1,64) | | | | +| Block Size (1,1,128) | | | | - # Single thread processes entire token - batch_idx = tl.program_id(0) - seq_idx = tl.program_id(1) - - input_offset = batch_idx * seq_len * d_model + seq_idx * d_model - - # Process all outputs for this token - for d_out in range(0, d_ff, BLOCK_SIZE_D): - gate_sum = tl.zeros((BLOCK_SIZE_D,), dtype=tl.float32) - up_sum = tl.zeros((BLOCK_SIZE_D,), dtype=tl.float32) - - # Load output indices - d_indices = d_out + tl.arange(0, BLOCK_SIZE_D) - d_mask = d_indices < d_ff - - # Compute projections - for d_in in range(d_model): - x_val = tl.load(x_ptr + input_offset + d_in) - - gate_weights = tl.load(gate_weight_ptr + d_indices * d_model + d_in, mask=d_mask) - up_weights = tl.load(up_weight_ptr + 
d_indices * d_model + d_in, mask=d_mask) - - gate_sum += x_val * gate_weights - up_sum += x_val * up_weights - - # Apply SiLU and multiply - gate_activated = gate_sum / (1.0 + tl.exp(-gate_sum)) - result = gate_activated * up_sum - - # Store results - output_offset = batch_idx * seq_len * d_ff + seq_idx * d_ff + d_indices - tl.store(output_ptr + output_offset, result, mask=d_mask) - - -def benchmark_kernel_variants(): - """Benchmark different kernel implementations.""" - - device = torch.device('cuda') - batch_size, seq_len, d_model = 4, 512, 2048 - hidden_dim = int(2.67 * d_model) - - x = torch.randn(batch_size, seq_len, d_model, device=device) - - variants = [ - ('Original', TritonSwiGLU(d_model, hidden_dim)), - # Add other variants here - ] - - for name, swiglu in variants: - swiglu = swiglu.to(device) - - # Warmup - for _ in range(10): - _ = swiglu(x) - torch.cuda.synchronize() - - # Benchmark - start_time = time.time() - for _ in range(100): - output = swiglu(x) - torch.cuda.synchronize() - - avg_time = (time.time() - start_time) / 100 - print(f"{name}: {avg_time*1000:.3f} ms") - -# Run variant benchmarks -benchmark_kernel_variants() -``` - -### Exercise Results - -#### Performance Comparison Table - -| Configuration | Time (ms) | Speedup vs PyTorch | Memory Usage | Bandwidth (GB/s) | -|---------------|-----------|-------------------|--------------|------------------| -| Original SwiGLU | | | | | -| Block Size (1,1,32) | | | | | -| Block Size (1,1,64) | | | | | -| Block Size (1,1,128) | | | | | -| Memory Optimized | | | | | - -#### Arithmetic Intensity Analysis - -- **Total FLOPs**: _____ GFLOPs -- **Memory Traffic**: _____ MB -- **Arithmetic Intensity**: _____ FLOPs/byte -- **Performance Bottleneck**: _____ (compute/memory) -- **Optimization Strategy**: _____ - -#### Key Findings +### Key Findings 1. **Optimal Block Size**: _____ 2. **Memory Layout Impact**: _____ -3. **Arithmetic Intensity**: _____ -4. **Performance Bottleneck**: _____ - -### Discussion Questions - -1. **Multi-dimensional Blocking**: How do you choose optimal block sizes for multi-dimensional problems? - -2. **Memory vs Compute Optimization**: When should you optimize for memory bandwidth vs computational throughput? - -3. **Kernel Fusion Trade-offs**: What are the trade-offs between kernel fusion and memory usage? - -4. **Scalability**: How do these optimizations scale with different problem sizes? - -### Next Steps - -Exercise 3 will cover Flash Attention implementation, focusing on: +3. **Performance Bottleneck**: _____ (compute/memory) -- Memory-efficient attention patterns -- Tiling strategies for large sequences -- Numerical stability in custom kernels -- Advanced debugging techniques +### Resources +- Arithmetic intensity and roofline model concepts +- Memory coalescing patterns for multi-dimensional data diff --git a/MLExamples/TinyTransformer/version3_triton/exercises/exercise3_flash_attention.md b/MLExamples/TinyTransformer/version3_triton/exercises/exercise3_flash_attention.md index 0b8a9045..84131bc5 100644 --- a/MLExamples/TinyTransformer/version3_triton/exercises/exercise3_flash_attention.md +++ b/MLExamples/TinyTransformer/version3_triton/exercises/exercise3_flash_attention.md @@ -1,30 +1,16 @@ +## Exercise 3: Flash Attention Implementation -## Exercise 3: Flash Attention Implementation and Optimization +**Objective**: Master memory-efficient attention patterns and Flash Attention in Triton. 
-`exercise3_flash_attention.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version3_triton/exercises` in the Training Examples repository - -**Objective**: Master advanced memory-efficient attention patterns and understand the Flash Attention algorithm implementation in Triton. - -**Time**: 75 minutes - -**Prerequisites**: Completed Exercises 1 and 2 +**Time**: 75 minutes | **Prerequisites**: Completed Exercises 1 and 2 ### Background -Flash Attention is a memory-efficient implementation of scaled dot-product attention that: - -- Reduces memory complexity from O(N^2) to O(N) -- Uses tiling to fit computations in SRAM -- Maintains numerical stability through online statistics -- Achieves significant speedups for long sequences - -This exercise explores the Triton implementation and optimization strategies. +Flash Attention reduces memory complexity from O(N²) to O(N) using tiling and online statistics, enabling significant speedups for long sequences. -### Part A: Flash Attention Algorithm Understanding (25 minutes) +### Part A: Algorithm Understanding -#### Step 1: Analyze the Algorithm Structure - -Examine the `flash_attention_kernel` in `tiny_llama_v3.py`: +Examine `flash_attention_kernel` in `tiny_llama_v3.py`: ```python @triton.jit @@ -37,546 +23,41 @@ def flash_attention_kernel( ): ``` -**Key Components Analysis:** - -1. **Tiling Strategy**: How does the algorithm tile the attention matrix? -2. **Online Statistics**: How are max values and sum exponentials maintained? -3. **Numerical Stability**: What prevents overflow in the softmax computation? - -#### Step 2: Understand the Core Loop - -Analyze the main computation loop: - -```python -# Initialize output accumulators -output_acc = tl.zeros((BLOCK_SIZE_Q, head_dim), dtype=tl.float32) -max_scores = tl.full((BLOCK_SIZE_Q,), -float('inf'), dtype=tl.float32) -sum_exp = tl.zeros((BLOCK_SIZE_Q,), dtype=tl.float32) - -# Process K,V blocks -for k_block_start in range(0, seq_len, BLOCK_SIZE_K): - # Compute attention scores - scores = tl.zeros((BLOCK_SIZE_Q, BLOCK_SIZE_K), dtype=tl.float32) - - # Update running statistics - block_max = tl.max(scores, axis=1) - new_max = tl.maximum(max_scores, block_max) - exp_scores = tl.exp(scores - new_max[:, None]) - - # Update accumulated values - decay = tl.exp(max_scores - new_max) - sum_exp = sum_exp * decay + tl.sum(exp_scores, axis=1) - max_scores = new_max -``` - -**Algorithm Questions:** - -1. **Memory Complexity**: How does this achieve O(N) memory complexity? -2. **Numerical Stability**: Why subtract the maximum before exponentiation? -3. **Online Updates**: How are the running statistics updated correctly? 
- -#### Step 3: Compare with Standard Attention - -Create a comparison analysis: - -```python -def compare_attention_algorithms(): - """Compare Flash Attention with standard attention implementation.""" - - print("Attention Algorithm Comparison") - print("=" * 40) - - # Example sequence lengths - seq_lengths = [128, 256, 512, 1024, 2048, 4096] - head_dim = 64 - - for seq_len in seq_lengths: - # Standard attention memory - attention_matrix_size = seq_len * seq_len * 4 # float32 - qkv_size = 3 * seq_len * head_dim * 4 - output_size = seq_len * head_dim * 4 - - standard_memory = attention_matrix_size + qkv_size + output_size - - # Flash attention memory (tiled) - block_size = 64 # Typical block size - tile_size = block_size * block_size * 4 - flash_memory = tile_size + qkv_size + output_size - - memory_ratio = standard_memory / flash_memory - - print(f"Seq len {seq_len:4d}: Standard {standard_memory/1e6:6.2f} MB, " - f"Flash {flash_memory/1e6:6.2f} MB, " - f"Ratio: {memory_ratio:5.1f}x") - - return seq_lengths, [standard_memory, flash_memory] - -# Run comparison -compare_attention_algorithms() -``` - -#### Step 4: Analyze Causal Masking - -Understand how causal masking is implemented: - -```python -# Apply causal mask -causal_mask = q_offsets[:, None] >= k_offsets[None, :] -scores = tl.where(causal_mask, scores, -float('inf')) -``` - -**Masking Analysis:** - -1. **Mask Generation**: How is the causal mask computed efficiently? -2. **Memory Impact**: What's the memory overhead of masking? -3. **Alternative Strategies**: What other masking approaches exist? - -### Part B: Performance Analysis and Optimization (30 minutes) - -#### Step 5: Benchmark Flash Attention Performance - -Create a comprehensive benchmark: - -```python -import time -import torch -import torch.nn.functional as F -from tiny_llama_v3 import TritonAttention - -def benchmark_attention_implementations(): - """Benchmark Flash Attention vs standard PyTorch attention.""" - - device = torch.device('cuda') - - # Test configurations - configs = [ - (1, 8, 128, 64), # Small - (2, 16, 256, 64), # Medium - (4, 32, 512, 64), # Large - (2, 16, 1024, 64), # Long sequence - (1, 8, 2048, 64), # Very long - ] - - results = [] - - for batch_size, num_heads, seq_len, head_dim in configs: - print(f"\nTesting: B={batch_size}, H={num_heads}, S={seq_len}, D={head_dim}") - - dim = num_heads * head_dim - - # Create input - x = torch.randn(batch_size, seq_len, dim, device=device) - - # Flash Attention (Triton) - flash_attn = TritonAttention(dim, num_heads).to(device) - - # Standard PyTorch Attention - class StandardAttention(torch.nn.Module): - def __init__(self, dim, num_heads): - super().__init__() - self.num_heads = num_heads - self.head_dim = dim // num_heads - self.scale = 1.0 / (self.head_dim ** 0.5) - - self.q_proj = torch.nn.Linear(dim, dim, bias=False) - self.k_proj = torch.nn.Linear(dim, dim, bias=False) - self.v_proj = torch.nn.Linear(dim, dim, bias=False) - self.o_proj = torch.nn.Linear(dim, dim, bias=False) - - def forward(self, x): - B, T, C = x.shape - - q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2) - k = self.k_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2) - v = self.v_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2) - - # Standard attention - scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale - - # Causal mask - mask = torch.tril(torch.ones(T, T, device=x.device)) - scores = scores.masked_fill(mask == 0, float('-inf')) - - attn = F.softmax(scores, dim=-1) - out 
= torch.matmul(attn, v) - - out = out.transpose(1, 2).contiguous().view(B, T, C) - return self.o_proj(out) - - standard_attn = StandardAttention(dim, num_heads).to(device) - - # Copy weights for fair comparison - standard_attn.q_proj.weight.data.copy_(flash_attn.q_proj.weight.data) - standard_attn.k_proj.weight.data.copy_(flash_attn.k_proj.weight.data) - standard_attn.v_proj.weight.data.copy_(flash_attn.v_proj.weight.data) - standard_attn.o_proj.weight.data.copy_(flash_attn.o_proj.weight.data) - - # Benchmark Flash Attention - torch.cuda.synchronize() - start_time = time.time() - - for _ in range(20): - flash_output = flash_attn(x) - - torch.cuda.synchronize() - flash_time = (time.time() - start_time) / 20 - - # Benchmark Standard Attention - torch.cuda.synchronize() - start_time = time.time() - - for _ in range(20): - standard_output = standard_attn(x) - - torch.cuda.synchronize() - standard_time = (time.time() - start_time) / 20 - - # Memory usage - torch.cuda.reset_peak_memory_stats() - _ = flash_attn(x) - flash_memory = torch.cuda.max_memory_allocated() - - torch.cuda.reset_peak_memory_stats() - _ = standard_attn(x) - standard_memory = torch.cuda.max_memory_allocated() - - # Calculate metrics - speedup = standard_time / flash_time - memory_ratio = standard_memory / flash_memory - throughput = batch_size * seq_len / flash_time - - result = { - 'config': (batch_size, num_heads, seq_len, head_dim), - 'flash_time_ms': flash_time * 1000, - 'standard_time_ms': standard_time * 1000, - 'speedup': speedup, - 'flash_memory_mb': flash_memory / 1e6, - 'standard_memory_mb': standard_memory / 1e6, - 'memory_ratio': memory_ratio, - 'throughput': throughput - } - - results.append(result) - - print(f" Flash Attention: {flash_time*1000:.2f} ms, {flash_memory/1e6:.1f} MB") - print(f" Standard Attention: {standard_time*1000:.2f} ms, {standard_memory/1e6:.1f} MB") - print(f" Speedup: {speedup:.2f}x, Memory reduction: {memory_ratio:.2f}x") - print(f" Throughput: {throughput:.0f} tokens/s") - - return results - -# Run attention benchmarks -attention_results = benchmark_attention_implementations() -``` - -#### Step 6: Block Size Optimization - -Optimize block sizes for different sequence lengths: - -```python -def optimize_flash_attention_blocks(): - """Find optimal block sizes for Flash Attention.""" +**Key Questions:** +1. How does tiling achieve O(N) memory complexity? +2. Why subtract the maximum before exponentiation? +3. How are running statistics updated correctly? 
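+
+Questions 2 and 3 above concern the online (streaming) softmax at the heart of Flash Attention. Below is a minimal NumPy sketch, independent of the actual Triton kernel, showing why subtracting the running maximum keeps `exp()` bounded and how the running statistics are rescaled as each new tile arrives; only the scalars `m`, `l`, and the accumulator are kept, which is also what gives the O(N) memory behavior asked about in question 1.
+
+```python
+import numpy as np
+
+def online_softmax(score_tiles):
+    """Streaming softmax-weighted sum over tiles of one score row.
+
+    Only a running max (m), running normalizer (l), and running accumulator
+    are kept, so the full row of scores is never materialized.
+    """
+    m = -np.inf   # running maximum
+    l = 0.0       # running sum of exp(scores - m)
+    acc = 0.0     # running weighted sum (stand-in for the P @ V accumulator)
+
+    for tile in score_tiles:
+        m_new = max(m, tile.max())
+        scale = np.exp(m - m_new)        # rescale old statistics to the new max
+        p = np.exp(tile - m_new)         # bounded in (0, 1]: no overflow
+        l = l * scale + p.sum()
+        acc = acc * scale + (p * tile).sum()   # toy "values" = the scores themselves
+        m = m_new
+
+    return acc / l
+
+# Check against a one-shot softmax over the concatenated row
+row = np.random.randn(256)
+weights = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
+assert np.isclose(online_softmax(np.split(row, 4)), (weights * row).sum())
+```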
- device = torch.device('cuda') +### Part B: Performance Analysis - # Test different block size combinations - block_configs = [ - (32, 32), # Small blocks - (64, 64), # Medium blocks - (128, 128), # Large blocks - (64, 32), # Asymmetric 1 - (32, 64), # Asymmetric 2 - (128, 64), # Asymmetric 3 - ] +Benchmark configurations: +- (1, 8, 128, 64), (2, 16, 256, 64), (4, 32, 512, 64) +- (2, 16, 1024, 64), (1, 8, 2048, 64) - # Test on different sequence lengths - seq_lengths = [256, 512, 1024] +Compare Flash Attention vs standard PyTorch attention: +- Execution time +- Memory usage +- Speedup and memory reduction - batch_size, num_heads, head_dim = 2, 16, 64 - dim = num_heads * head_dim +### Part C: Block Size Optimization - for seq_len in seq_lengths: - print(f"\nOptimizing for sequence length: {seq_len}") +Test block sizes: (32,32), (64,64), (128,128), (64,32), (32,64), (128,64) - x = torch.randn(batch_size, seq_len, dim, device=device) +### Results Template - best_time = float('inf') - best_config = None - - for block_q, block_k in block_configs: - # Skip if blocks are too large for sequence - if block_q > seq_len or block_k > seq_len: - continue - - print(f" Testing blocks: Q={block_q}, K={block_k}") - - # Create attention with specific block sizes - # Note: This requires modifying the kernel call - flash_attn = TritonAttention(dim, num_heads).to(device) - - # Warmup - for _ in range(5): - _ = flash_attn(x) - torch.cuda.synchronize() - - # Benchmark - start_time = time.time() - for _ in range(20): - _ = flash_attn(x) - torch.cuda.synchronize() - - avg_time = (time.time() - start_time) / 20 - - print(f" Time: {avg_time*1000:.3f} ms") - - if avg_time < best_time: - best_time = avg_time - best_config = (block_q, block_k) - - print(f" Best configuration: Q={best_config[0]}, K={best_config[1]}") - print(f" Best time: {best_time*1000:.3f} ms") - -# Run block size optimization -optimize_flash_attention_blocks() -``` - -#### Step 7: Memory Pattern Analysis - -Analyze memory access patterns: - -```python -def analyze_flash_attention_memory(): - """Analyze memory access patterns in Flash Attention.""" - - print("Flash Attention Memory Pattern Analysis") - print("=" * 45) - - # Example configuration - batch_size, num_heads, seq_len, head_dim = 2, 16, 1024, 64 - block_q, block_k = 64, 64 - - print(f"Configuration: B={batch_size}, H={num_heads}, S={seq_len}, D={head_dim}") - print(f"Block sizes: Q={block_q}, K={block_k}") - - # Calculate memory accesses - num_q_blocks = (seq_len + block_q - 1) // block_q - num_k_blocks = (seq_len + block_k - 1) // block_k - - print(f"\nTiling information:") - print(f" Q blocks: {num_q_blocks}") - print(f" K blocks: {num_k_blocks}") - print(f" Total block pairs: {num_q_blocks * num_k_blocks}") - - # Memory per block - q_block_size = block_q * head_dim * 4 # float32 - k_block_size = block_k * head_dim * 4 - v_block_size = block_k * head_dim * 4 - scores_size = block_q * block_k * 4 - - print(f"\nMemory per block:") - print(f" Q block: {q_block_size/1e3:.1f} KB") - print(f" K block: {k_block_size/1e3:.1f} KB") - print(f" V block: {v_block_size/1e3:.1f} KB") - print(f" Scores: {scores_size/1e3:.1f} KB") - print(f" Total per iteration: {(q_block_size + k_block_size + v_block_size + scores_size)/1e3:.1f} KB") - - # Total memory traffic - q_reads = num_q_blocks * q_block_size * num_k_blocks # Q reused across K blocks - k_reads = num_k_blocks * k_block_size * num_q_blocks # K reused across Q blocks - v_reads = num_k_blocks * v_block_size * num_q_blocks # V same as K - 
output_writes = seq_len * head_dim * 4 - - total_traffic = q_reads + k_reads + v_reads + output_writes - - print(f"\nTotal memory traffic:") - print(f" Q reads: {q_reads/1e6:.2f} MB") - print(f" K reads: {k_reads/1e6:.2f} MB") - print(f" V reads: {v_reads/1e6:.2f} MB") - print(f" Output writes: {output_writes/1e6:.2f} MB") - print(f" Total: {total_traffic/1e6:.2f} MB") - - # Compare with standard attention - standard_traffic = ( - 3 * seq_len * head_dim * 4 + # Q, K, V - seq_len * seq_len * 4 + # Attention matrix - seq_len * head_dim * 4 # Output - ) - - print(f"\nStandard attention traffic: {standard_traffic/1e6:.2f} MB") - print(f"Flash attention reduction: {standard_traffic/total_traffic:.2f}x") - - return { - 'flash_traffic_mb': total_traffic / 1e6, - 'standard_traffic_mb': standard_traffic / 1e6, - 'reduction_ratio': standard_traffic / total_traffic - } - -# Run memory analysis -memory_analysis = analyze_flash_attention_memory() -``` - -### Part C: Advanced Optimizations and Debugging (20 minutes) - -#### Step 8: Numerical Stability Testing - -Test numerical stability across different conditions: - -```python -def test_numerical_stability(): - """Test numerical stability of Flash Attention implementation.""" - - device = torch.device('cuda') - - # Test conditions - test_cases = [ - ("normal", 1.0, 0.0), - ("large_values", 10.0, 0.0), - ("small_values", 0.1, 0.0), - ("extreme_large", 100.0, 0.0), - ("with_noise", 1.0, 0.1), - ] - - batch_size, num_heads, seq_len, head_dim = 2, 8, 256, 64 - dim = num_heads * head_dim - - flash_attn = TritonAttention(dim, num_heads).to(device) - - for name, scale, noise in test_cases: - print(f"\nTesting {name} (scale={scale}, noise={noise}):") - - # Generate test input - x = torch.randn(batch_size, seq_len, dim, device=device) * scale - if noise > 0: - x += torch.randn_like(x) * noise - - try: - output = flash_attn(x) - - # Check for NaN/Inf - has_nan = torch.isnan(output).any() - has_inf = torch.isinf(output).any() - - print(f" Input range: [{x.min():.3f}, {x.max():.3f}]") - print(f" Output range: [{output.min():.3f}, {output.max():.3f}]") - print(f" Has NaN: {has_nan}") - print(f" Has Inf: {has_inf}") - - if has_nan or has_inf: - print(" WARNING: Numerical instability detected!") - else: - print(" PASS Numerically stable") - - except Exception as e: - print(f" FAIL Error: {e}") - -# Run stability tests -test_numerical_stability() -``` - -#### Step 9: Performance Profiling Integration - -Integrate with ROCProfiler for detailed analysis: - -```python -def create_flash_attention_profile(): - """Create focused profiling for Flash Attention kernels.""" - - # Create ROCProfiler configuration for Flash Attention - profile_config = """ -# Flash Attention Kernel Profiling Configuration -pmc : Wavefronts VALUInsts SALUInsts SFetchInsts FlatVMemInsts LDSInsts -pmc : VALUUtilization FlatVMemUtilization MemUnitBusy L2CacheHit -pmc : WriteUnitStalled ALUStalledByLDS LDSBankConflict -range: 0x1000000000000:0x2000000000000 -gpu: 0 -kernel: flash_attention_kernel -""" - - with open("flash_attention_profile.txt", "w") as f: - f.write(profile_config) - - print("Created Flash Attention profiling configuration") - print("Run with: rocprof --input flash_attention_profile.txt python3 tiny_llama_v3.py") - -# Create profiling configuration -create_flash_attention_profile() -``` - -### Exercise Results - -#### Performance Summary Table - -| Sequence Length | Flash Attention (ms) | Standard Attention (ms) | Speedup | Memory Reduction | 
-|----------------|---------------------|------------------------|---------|------------------| +| Sequence Length | Flash (ms) | Standard (ms) | Speedup | Memory Reduction | +|----------------|------------|---------------|---------|------------------| | 128 | | | | | -| 256 | | | | | | 512 | | | | | | 1024 | | | | | -| 2048 | | | | | - -#### Block Size Optimization Results - -| Sequence Length | Optimal Q Block | Optimal K Block | Best Time (ms) | Notes | -|----------------|----------------|----------------|----------------|-------| -| 256 | | | | | -| 512 | | | | | -| 1024 | | | | | - -#### Memory Analysis Results - -- **Flash Attention Memory**: _____ MB -- **Standard Attention Memory**: _____ MB -- **Memory Reduction**: _____x -- **Arithmetic Intensity**: _____ FLOPs/byte - -#### Key Insights - -1. **Performance Scaling**: How does Flash Attention performance scale with sequence length? -2. **Memory Efficiency**: What's the memory reduction at different sequence lengths? -3. **Optimal Block Sizes**: What patterns emerge in optimal block size selection? -4. **Numerical Stability**: Are there any stability concerns with the implementation? - -### Discussion Questions - -1. **Algorithm Trade-offs**: What are the trade-offs between memory efficiency and computational complexity in Flash Attention? - -2. **Implementation Challenges**: What are the main challenges in implementing Flash Attention in Triton vs CUDA? - -3. **Sequence Length Scaling**: How does the algorithm's efficiency change with very long sequences (8K, 16K tokens)? - -4. **Hardware Considerations**: How might different GPU architectures affect Flash Attention performance? - -### Next Steps - -With Version 3 complete, you've learned: -- Advanced Triton kernel development -- Memory-efficient algorithm implementation -- Performance optimization strategies -- Numerical stability considerations - -Version 4 will cover ultra-fused implementations combining all optimizations into a single, highly optimized kernel suite. - -### Troubleshooting Guide - -#### Common Issues - -1. **Kernel Compilation Errors** - - Check tensor dimension compatibility - - Verify block sizes don't exceed hardware limits - - Ensure proper constexpr usage -2. **Performance Regression** - - Verify block sizes are optimal for your sequence length - - Check memory access patterns - - Ensure proper warmup before benchmarking +### Troubleshooting -3. **Numerical Instability** - - Monitor for overflow in softmax computation - - Check running statistics update logic - - Verify causal mask application +- **Kernel Compilation**: Check dimension compatibility and block size limits +- **Performance Regression**: Verify block sizes are optimal for sequence length +- **Numerical Instability**: Monitor overflow in softmax, check running statistics -4. 
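+
+To fill in the Memory Reduction column of the results template above, the footprint of the two algorithms can be estimated analytically before measuring with `torch.cuda.max_memory_allocated()`. A rough per-head sketch, assuming float32 activations and a 64x64 score tile for the fused kernel (measured numbers will differ because PyTorch also keeps workspaces and autograd buffers):
+
+```python
+def attention_memory_estimate(seq_len, head_dim=64, block=64, bytes_per_elem=4):
+    """Rough per-head activation footprint (bytes) of one attention call."""
+    qkv = 3 * seq_len * head_dim * bytes_per_elem
+    out = seq_len * head_dim * bytes_per_elem
+    standard = qkv + out + seq_len * seq_len * bytes_per_elem  # full S x S score matrix
+    flash = qkv + out + block * block * bytes_per_elem         # only one tile of scores
+    return standard, flash
+
+for s in (128, 512, 1024):
+    std, fl = attention_memory_estimate(s)
+    print(f"seq_len {s:5d}: standard {std / 1e6:7.2f} MB, "
+          f"flash {fl / 1e6:5.2f} MB, reduction {std / fl:5.1f}x")
+```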
**Memory Issues** - - Reduce block sizes if running out of memory - - Check for memory leaks in repeated runs - - Monitor peak memory usage during profiling +### Resources +- [Flash Attention Paper](https://arxiv.org/abs/2205.14135) +- Online softmax algorithm diff --git a/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/README.md b/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/README.md index 2149bda5..dfd49d3c 100644 --- a/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/README.md +++ b/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/README.md @@ -2,290 +2,87 @@ ## Overview -This exercise demonstrates the systematic debugging and optimization process for V3 Triton kernels. You'll learn how to: +This exercise demonstrates systematic debugging and optimization for V3 Triton kernels: 1. **Diagnose incorrect model behavior** (wrong loss values) 2. **Fix correctness issues** (weight initialization) 3. **Profile and identify performance bottlenecks** 4. **Systematically optimize for performance** -## The Problem - -Initial V3 implementation showed: -- **Loss = 942** (should be ~7 like V1/V2) -- **Fake timing** (reported 4ms but actually much slower) -- **6.4x slower than baseline** after initial fixes - ## Exercise Progression -Each file represents a stage in the debugging process: - ### Stage 1: Broken Loss (`v3_stage1_broken_loss.py`) **Problem:** Loss = 942 instead of ~7 **Root Cause:** Missing weight initialization -**What to Learn:** -- How to add diagnostic logging -- How to trace values through the model -- How exploding logits break training -**Run:** ```bash python v3_stage1_broken_loss.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -**Expected Output:** -``` -Loss: 942.8047 # WRONG! -Logits stats: min=-161, max=1025, std=43.79 # Exploding values -``` - ---- - ### Stage 2: Fixed Loss, Terrible Performance (`v3_stage2_slow_performance.py`) **Problem:** Loss fixed (7.0) but only 15.2 samples/sec (vs V1's 97 samples/sec) **Root Cause:** Non-contiguous tensors after `repeat_interleave` for GQA -**What to Learn:** -- How memory layout affects Triton kernel performance -- Why `.contiguous()` matters for GPU kernels -- How to identify stride-related issues -**Run:** ```bash python v3_stage2_slow_performance.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -**Expected Output:** -``` -Loss: 7.0108 # CORRECT! -Speed: 15.2 samples/sec # TERRIBLE! (V1 = 97 samples/sec) -Time: 526ms per batch -``` - ---- - ### Stage 3: Better Performance, Wrong Timing (`v3_stage3_fake_timing.py`) **Problem:** Improved to 310 samples/sec but timing breakdown is wrong -**Root Cause:** Missing CUDA synchronization for individual operation timing -**What to Learn:** -- GPU operations are asynchronous -- How to properly measure GPU kernel timing -- Why you need `torch.cuda.synchronize()` +**Root Cause:** Missing CUDA synchronization for timing -**Run:** ```bash python v3_stage3_fake_timing.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -**Expected Output:** -``` -Loss: 7.0108 # CORRECT! -Speed: 310.8 samples/sec # GOOD! -Forward: 3.2ms # Seems reasonable -Backward: 0.2ms # WRONG! Too fast! -Total: 25.7ms # Doesn't add up! 
(3.2 + 0.2 + 0.2 ≠ 25.7) -``` - ---- - ### Stage 4: Accurate Timing, Slow Kernels (`v3_stage4_slow_kernels.py`) -**Problem:** Accurate timing shows forward pass is 25.5ms (2.4x slower than V1's 10.8ms) +**Problem:** Forward pass is 25.5ms (2.4x slower than V1's 10.8ms) **Root Cause:** Inefficient Triton SwiGLU kernel doing manual matrix multiplication -**What to Learn:** -- How to identify kernel bottlenecks -- When NOT to use custom kernels (for large matrix ops) -- Why PyTorch BLAS is faster than naive Triton implementations -**Run:** ```bash python v3_stage4_slow_kernels.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -**Expected Output:** -``` -Loss: 7.0108 # CORRECT! -Speed: 305.9 samples/sec # STILL SLOWER THAN V1! -Forward: 25.5ms # TOO SLOW! (V1 = 10.8ms) -Backward: 0.3ms -Total: 26.2ms -``` - -**Profiling Analysis:** -- SwiGLU kernel launches 2,097,152 threads (batch × seq × d_ff = 8 × 128 × 2048) -- Each thread does manual reduction over 512 dimensions -- PyTorch's optimized BLAS would be much faster - ---- - ### Stage 5: Final Optimized (`../tiny_llama_v3.py`) **Solution:** Use PyTorch for matrix multiplies, Triton only for element-wise fusion **Result:** 2065 samples/sec (5.5x faster than V1!) -**What to Learn:** -- Hybrid optimization: use the best tool for each operation -- When to use Triton (memory-bound ops, fusion opportunities) -- When to use PyTorch (compute-bound large matrix ops) -**Run:** ```bash cd .. && python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 20 ``` -**Expected Output:** -``` -Loss: 7.0108 # CORRECT! -Speed: 2065.0 samples/sec # EXCELLENT! (5.5x faster than V1!) -Forward: 3.2ms # Fast! -Backward: 0.3ms -Total: 3.9ms # Dramatic improvement! -Memory: 281.8 MB # 46% less than V1's 522.3 MB -``` - ---- - -## Profiling with ROCm Tools - -### Using rocprof to Profile Each Stage - -For each stage, you can generate detailed profiling traces: - -```bash -# Stage 1: Broken Loss (short run to see the issue) -rocprof --stats -o stage1_broken.csv python v3_stage1_broken_loss.py --batch-size 8 --seq-len 128 --num-steps 5 - -# Stage 2: Slow Performance -rocprof --stats -o stage2_slow.csv python v3_stage2_slow_performance.py --batch-size 8 --seq-len 128 --num-steps 20 - -# Stage 4: Slow Kernels (shows SwiGLU bottleneck) -rocprof --stats -o stage4_kernels.csv python v3_stage4_slow_kernels.py --batch-size 8 --seq-len 128 --num-steps 20 - -# Stage 5: Final Optimized -rocprof --stats -o stage5_optimized.csv python ../tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 20 -``` - -### What to Look for in Traces +## Summary Table -**Stage 2 (Slow Performance):** -- Look for non-coalesced memory accesses in Flash Attention kernel -- High L2 cache miss rate -- Memory stalls +| Stage | Loss | Speed (samples/sec) | Issue | Fix | +|-------|------|---------------------|-------|-----| +| 1 | 942 | N/A | Missing weight init | Add `_init_weights()` | +| 2 | 7.0 | 15.2 | Non-contiguous tensors | Add `.contiguous()` | +| 3 | 7.0 | 310.8 | Wrong timing | Add CUDA sync | +| 4 | 7.0 | 305.9 | Slow Triton SwiGLU | Use PyTorch matmul | +| 5 | 7.0 | 2065.0 | **OPTIMIZED!** | Hybrid approach | -**Stage 4 (Slow Kernels):** -- SwiGLU kernel shows: - - 2M+ kernel launches - - Low occupancy (< 25%) - - High kernel launch overhead -- Compare to PyTorch matmul: - - Uses rocBLAS (optimized) - - High throughput (90%+ of peak) +**Baseline (V1):** 372.9 samples/sec | **Final Speedup:** 5.5x faster, 46% less memory -**Stage 5 (Optimized):** -- Flash Attention: High occupancy, good 
memory throughput -- RMSNorm: Fused operations, low latency -- Matrix ops: Delegated to rocBLAS (optimal) +## Key Learnings -### Analyzing with rocprofv2 +1. **Correctness First**: Validate loss/accuracy before optimizing +2. **Tensor Contiguity**: Always `.contiguous()` before Triton kernels +3. **Accurate Timing**: Use `torch.cuda.synchronize()` for GPU timing +4. **Hybrid Approach**: Triton for memory-bound ops, PyTorch BLAS for matrix ops -For more detailed analysis: +## Profiling Commands ```bash -# Profile with kernel trace -rocprofv2 --kernel-trace -o stage4_trace.json python v3_stage4_slow_kernels.py --batch-size 8 --seq-len 128 --num-steps 10 +# Basic profiling +rocprof --stats python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 20 -# View in Perfetto UI -# Upload stage4_trace.json to https://ui.perfetto.dev +# Detailed kernel trace +rocprofv2 --kernel-trace -o trace.json python tiny_llama_v3.py ... +# View at https://ui.perfetto.dev ``` -**What to observe:** -- Kernel timeline showing SwiGLU dominating execution -- Memory transfer patterns -- Kernel duration vs. compute capability - ---- - -## Key Learnings - -### 1. Correctness First, Performance Second -- Stage 1 shows why: broken model can't be optimized -- Always validate loss/accuracy before optimizing - -### 2. Systematic Debugging -- Add diagnostic logging (Stage 1) -- Measure accurately (Stage 3) -- Profile to identify bottlenecks (Stage 4) -- Fix one issue at a time - -### 3. Know Your Tools -- **Triton**: Memory-bound ops, element-wise fusion, Flash Attention -- **PyTorch/BLAS**: Compute-bound matrix operations -- **Profilers**: rocprof for GPU metrics, timing for coarse analysis - -### 4. Common Performance Pitfalls -- **Tensor contiguity**: Always `.contiguous()` before Triton kernels -- **CUDA synchronization**: Required for accurate GPU timing -- **Kernel granularity**: Avoid launching millions of tiny kernels -- **Use optimized libraries**: Don't reimplement BLAS in Triton - -### 5. Optimization is Iterative -- V1 baseline: 372.9 samples/sec -- Stage 2 (correct): 15.2 samples/sec (40x SLOWER!) -- Stage 3 (contiguous): 310.8 samples/sec (0.83x baseline) -- **Stage 5 (optimized): 2065.0 samples/sec (5.5x FASTER!)** - ---- - -## Exercises - -### Exercise 1: Diagnose Stage 1 -Run `v3_stage1_broken_loss.py` and: -1. Uncomment the diagnostic logging -2. Identify which layer produces exploding values -3. Explain why default weight initialization causes this - -### Exercise 2: Profile Stage 2 -1. Run with rocprof: `rocprof --stats python v3_stage2_slow_performance.py ...` -2. Find the Flash Attention kernel in the trace -3. Look at memory metrics - what's wrong? - -### Exercise 3: Compare Stage 4 vs Stage 5 -1. Profile both versions with rocprof -2. Compare SwiGLU execution time -3. Explain the 8x speedup in the forward pass - -### Exercise 4: Design Your Own Optimization -1. Look at the RMSNorm kernel implementation -2. Can you further optimize it? -3. What profiling metrics would validate your optimization? - ---- - -## Next Steps - -After completing this exercise: - -1. **Apply to V4**: V4 has similar issues - can you fix them? -2. **Custom Kernels**: Try writing your own Triton kernel for a simple operation -3. **Advanced Profiling**: Learn rocprofv2 for detailed analysis -4. 
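+
+Key Learnings 2 and 3 are the easiest to get wrong in practice. A minimal sketch of a timing harness that synchronizes around the timed region, plus the kind of contiguity check the Stage 2→3 fix relies on (the SiLU call is only a stand-in for whichever kernel or forward pass is being measured):
+
+```python
+import time
+import torch
+
+def time_gpu(fn, *args, warmup=5, iters=20):
+    """Time a GPU callable: warm up, then synchronize around the timed loop."""
+    for _ in range(warmup):
+        fn(*args)
+    torch.cuda.synchronize()   # flush warmup work before starting the clock
+    start = time.perf_counter()
+    for _ in range(iters):
+        fn(*args)
+    torch.cuda.synchronize()   # GPU kernels are asynchronous: wait before stopping the clock
+    return (time.perf_counter() - start) / iters
+
+# Tensors produced by repeat_interleave/transpose (as in the GQA path) are often
+# non-contiguous and should be made contiguous before reaching a Triton kernel.
+x = torch.randn(8, 16, 128, 64, device="cuda").transpose(1, 2)
+print("contiguous:", x.is_contiguous())   # False
+elapsed = time_gpu(torch.nn.functional.silu, x.contiguous())
+print(f"avg time per call: {elapsed * 1000:.3f} ms")
+```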
**Production Deployment**: Consider hybrid Triton+PyTorch approaches - ---- - -## Additional Resources - -- **Triton Documentation**: https://triton-lang.org/ -- **ROCm Profiling Guide**: https://rocm.docs.amd.com/projects/rocprofiler/en/latest/ -- **Flash Attention Paper**: https://arxiv.org/abs/2205.14135 -- **PyTorch Profiler**: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html - ---- - -## Summary Table - -| Stage | Loss | Speed (samples/sec) | Issue | Fix | -|-------|------|---------------------|-------|-----| -| 1 | 942 | N/A | Missing weight init | Add `_init_weights()` | -| 2 | 7.0 | 15.2 | Non-contiguous tensors | Add `.contiguous()` | -| 3 | 7.0 | 310.8 | Wrong timing | Add CUDA sync | -| 4 | 7.0 | 305.9 | Slow Triton SwiGLU | Use PyTorch matmul | -| 5 | 7.0 | 2065.0 | **OPTIMIZED!** | Hybrid approach | +## Resources -**Baseline (V1):** 372.9 samples/sec -**Final Speedup:** 5.5x faster, 46% less memory +- [Triton Documentation](https://triton-lang.org/) +- [ROCm Profiling Guide](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) +- [Flash Attention Paper](https://arxiv.org/abs/2205.14135) diff --git a/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/WORKSHOP_GUIDE.md b/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/WORKSHOP_GUIDE.md index 2b95de0e..e27478bb 100644 --- a/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/WORKSHOP_GUIDE.md +++ b/MLExamples/TinyTransformer/version3_triton/exercises/performance_debugging/WORKSHOP_GUIDE.md @@ -3,145 +3,52 @@ ## Quick Start ```bash -cd /workspace/version3_triton/exercises/performance_debugging +cd version3_triton/exercises/performance_debugging # Read the comprehensive guide cat README.md -# Note: Individual stage files (v3_stage1_broken_loss.py, etc.) are symbolic links -# to the main tiny_llama_v3.py with modifications applied at runtime or via -# configuration flags. This keeps the exercise files manageable. - -# Run all stages with automatic profiling and comparison +# Run all stages with automatic profiling ./run_all_stages.sh - -# Results will be saved to results/ directory with: -# - stage*_output.log: Full training outputs -# - stage*_profile.csv: rocprof profiling data -# - Performance comparison summary ``` ## What This Exercise Teaches -This is a **realistic performance debugging scenario** that mirrors real-world optimization work: - -### 1. **Correctness Before Performance** (Stage 1) -- Shows how subtle bugs (missing weight init) can completely break training -- Demonstrates diagnostic logging techniques -- Loss goes from 942 → 7.0 after one-line fix - -### 2. **Memory Layout Matters** (Stage 2→3) -- Non-contiguous tensors after `repeat_interleave` killed performance -- Adding `.contiguous()` gave 20x speedup (15 → 310 samples/sec) -- Critical lesson for GPU kernel developers - -### 3. **Measure Accurately** (Stage 3→4) -- GPU operations are asynchronous -- Without `torch.cuda.synchronize()`, timings are meaningless -- Same performance, but now we can see WHERE the time is spent - -### 4. **Know When NOT to Use Custom Kernels** (Stage 4→5) -- Triton SwiGLU kernel was launching 2M+ threads -- Each doing naive matrix multiplication -- PyTorch's rocBLAS is orders of magnitude faster -- Result: 8x forward pass speedup (25.5ms → 3.2ms) - -### 5. **Hybrid Optimization Wins** -- Final version: 2065 samples/sec (5.5x faster than V1 baseline!) 
-- Uses Triton for: Flash Attention, RMSNorm (memory-bound ops) -- Uses PyTorch for: Matrix multiplies (compute-bound ops) -- **Best of both worlds** - -## For Workshop Participants - -### Beginner Level -1. Run `./run_all_stages.sh` and observe the progression -2. Read the output logs to understand what changed each stage -3. Focus on the "Key Observations" in the comparison summary - -### Intermediate Level -1. Examine the profiling CSV files in `results/` -2. Compare kernel execution times between stages -3. Try modifying block sizes in Flash Attention kernel -4. Re-run and observe impact on performance - -### Advanced Level -1. Use `rocprofv2 --kernel-trace` for detailed timeline analysis -2. Identify memory bandwidth bottlenecks -3. Experiment with different Triton kernel implementations -4. Write a custom kernel for RoPE application +### 1. Correctness Before Performance (Stage 1) +Missing weight init → Loss 942 → 7.0 after one-line fix + +### 2. Memory Layout Matters (Stage 2→3) +Non-contiguous tensors → 20x speedup with `.contiguous()` + +### 3. Measure Accurately (Stage 3→4) +GPU ops are async → `torch.cuda.synchronize()` required + +### 4. Know When NOT to Use Custom Kernels (Stage 4→5) +Triton SwiGLU 2M+ threads → PyTorch rocBLAS 8x faster + +### 5. Hybrid Optimization Wins +Final: 2065 samples/sec (5.5x faster than V1!) ## Key Takeaways -| Metric | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | -|--------|---------|---------|---------|---------|---------| -| **Loss** | 942 | 7.0 | 7.0 | 7.0 | 7.0 | -| **Speed** | N/A | 15 samp/s | 311 samp/s | 306 samp/s | **2065 samp/s** | -| **vs Baseline** | N/A | 0.04x | 0.83x | 0.82x | **5.5x** | -| **Key Issue** | No weight init | Non-contig tensors | Wrong timing | Slow SwiGLU | **OPTIMAL** | -| **Memory** | N/A | ~282 MB | ~282 MB | ~282 MB | **~282 MB** | +| Stage | Loss | Speed | vs Baseline | Key Issue | +|-------|------|-------|-------------|-----------| +| 1 | 942 | N/A | N/A | No weight init | +| 2 | 7.0 | 15 samp/s | 0.04x | Non-contig tensors | +| 3 | 7.0 | 311 samp/s | 0.83x | Wrong timing | +| 4 | 7.0 | 306 samp/s | 0.82x | Slow SwiGLU | +| 5 | 7.0 | **2065 samp/s** | **5.5x** | **OPTIMAL** | **Baseline (V1):** 372.9 samples/sec, 522.3 MB -## Profiling Commands Reference +## Profiling Commands ```bash -# Basic profiling with rocprof rocprof --stats python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 20 - -# Detailed kernel trace rocprofv2 --kernel-trace -o trace.json python tiny_llama_v3.py ... - -# View trace in Perfetto -# Upload trace.json to https://ui.perfetto.dev - -# Compare two stages -diff results/stage2_profile.csv results/stage5_profile.csv - -# Find slowest kernels -sort -t',' -k4 -nr results/stage4_profile.csv | head -20 ``` -## Common Questions - -**Q: Why not just use the final optimized version?** -A: Understanding the journey is more valuable than the destination. Each stage teaches a critical lesson about GPU programming and performance debugging. - -**Q: Can I apply these techniques to my own models?** -A: Absolutely! The debugging methodology is universal: - 1. Ensure correctness first - 2. Add accurate timing/profiling - 3. Identify bottlenecks with profilers - 4. Fix one issue at a time - 5. Re-measure and validate - -**Q: Should I always use Triton for custom kernels?** -A: No! 
As Stage 5 shows, hybrid approaches work best: - - Use Triton for memory-bound, fusion opportunities (Flash Attention, layer norm) - - Use PyTorch/BLAS for compute-bound matrix ops - - Profile to verify your assumptions - -**Q: Why is memory usage the same across all stages?** -A: The memory footprint is determined by model architecture (activations, weights, gradients), not by the kernel implementations. The performance gains come from faster computation, not lower memory usage. Flash Attention provides memory savings by avoiding materialization of the full attention matrix. - -## Next Steps - -After completing this exercise: - -1. **Apply to V4**: The ultra-fused version has similar issues - try fixing them yourself -2. **Explore ROCm Tools**: Deep dive into rocprofv2, rocprof, omniperf -3. **Custom Kernels**: Write your own Triton kernel for a simple operation -4. **Production Deployment**: Consider trade-offs between development time and performance gains - -## Additional Resources - -- **Triton Tutorials**: https://triton-lang.org/main/getting-started/tutorials/index.html -- **Flash Attention**: https://github.com/Dao-AILab/flash-attention -- **ROCm Profiling**: https://rocm.docs.amd.com/projects/rocprofiler/en/latest/ -- **PyTorch Profiler**: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html - ---- +## Resources -**Exercise Created**: October 2025 -**Target Hardware**: AMD MI325X with ROCm 6.4.4 -**Framework**: PyTorch 2.7.1 + Triton +- [Triton Tutorials](https://triton-lang.org/main/getting-started/tutorials/index.html) +- [ROCm Profiling](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) diff --git a/MLExamples/TinyTransformer/version3_triton/get_counters.sh b/MLExamples/TinyTransformer/version3_triton/get_counters.sh new file mode 100644 index 00000000..20bd0986 --- /dev/null +++ b/MLExamples/TinyTransformer/version3_triton/get_counters.sh @@ -0,0 +1,78 @@ +#!/bin/bash +# Script to profile TinyTransformer V3 with rocprofv3 kernel trace +# This captures kernel execution metrics for performance analysis +# +# Supports both ROCm 6.x (CSV output) and ROCm 7.x (SQLite database output) + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./counters/counter_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 kernel trace collection for TinyTransformer V3..." 
+echo "Output directory: $OUTPUT_DIR" + +# Run with rocprofv3 to collect kernel trace +rocprofv3 \ + --kernel-trace \ + --output-directory "$OUTPUT_DIR" \ + -- python tiny_llama_v3.py \ + --batch-size 8 \ + --seq-len 128 \ + --num-steps 10 + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR"/*/ 2>/dev/null || ls -lh "$OUTPUT_DIR" +echo "" + +# Analyze results based on ROCm version +echo "To analyze results:" +DB_FILE=$(find "$OUTPUT_DIR" -name "*_results.db" 2>/dev/null | head -1) +if [ -n "$DB_FILE" ]; then + echo " Database file: $DB_FILE" + echo "" + echo " Export to CSV:" + echo " rocpd2csv -i $DB_FILE -o kernel_stats.csv" + echo "" + echo " Get kernel summary:" + echo " rocpd summary -i $DB_FILE --region-categories KERNEL" +else + echo " Check $OUTPUT_DIR for output files" +fi diff --git a/MLExamples/TinyTransformer/version3_triton/get_hotspots.sh b/MLExamples/TinyTransformer/version3_triton/get_hotspots.sh new file mode 100755 index 00000000..e1e7d822 --- /dev/null +++ b/MLExamples/TinyTransformer/version3_triton/get_hotspots.sh @@ -0,0 +1,55 @@ +#!/bin/bash +# +# Get hotspots analysis using rocprofv3 +# Compatible with ROCm 6.x and 7.x +# + +set -e + +echo "==========================================" +echo "rocprofv3 Hotspots Analysis - Version 3" +echo "==========================================" +echo "" + +OUTPUT_DIR="./hotspots/hotspot_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" +echo "Running: rocprofv3 --stats -- python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprofv3 --stats -- python ../../tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] Hotspot analysis completed" +else + echo "[FAILED] Hotspot analysis failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls +echo "" + +# Check for stats/CSV files +if ls *.csv 1> /dev/null 2>&1; then + echo "Statistics files found:" + for f in *.csv; do + echo "" + echo "File: $f" + echo "Top 10 entries:" + head -11 "$f" + done +else + echo "Looking for statistics in subdirectories:" + find . -name "*.csv" -exec echo "Found: {}" \; -exec head -11 {} \; +fi +echo "" + +echo "Hotspot analysis identifies GPU kernels with highest time consumption." +echo "" diff --git a/MLExamples/TinyTransformer/version3_triton/get_rocprof_compute.sh b/MLExamples/TinyTransformer/version3_triton/get_rocprof_compute.sh new file mode 100755 index 00000000..4445ee30 --- /dev/null +++ b/MLExamples/TinyTransformer/version3_triton/get_rocprof_compute.sh @@ -0,0 +1,49 @@ +#!/bin/bash +# +# Get detailed GPU metrics using rocprof-compute +# Compatible with ROCm 6.x and 7.x +# +# Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) +# for full hardware counter support. Consumer GPUs may have limited counter availability. 
+# + +set -e + +echo "==========================================" +echo "rocprof-compute Profiling - TinyTransformer V3" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_compute/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-compute to collect detailed GPU metrics +WORKLOAD_NAME="tiny_llama_v3_$(date +%Y%m%d_%H%M%S)" +echo "Running: rocprof-compute profile --name $WORKLOAD_NAME -d $OUTPUT_DIR -- python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +rocprof-compute profile --name "$WORKLOAD_NAME" -d "$OUTPUT_DIR" -- python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-compute profiling completed" +else + echo "[FAILED] rocprof-compute profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find "$OUTPUT_DIR" -type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " rocprof-compute analyze -p $OUTPUT_DIR/workloads/$WORKLOAD_NAME/rocprof --dispatch -n tiny_llama_dispatch" +echo "" +echo "For available analysis options:" +echo " rocprof-compute analyze --help" +echo "" diff --git a/MLExamples/TinyTransformer/version3_triton/get_rocprof_sys.sh b/MLExamples/TinyTransformer/version3_triton/get_rocprof_sys.sh new file mode 100755 index 00000000..95d492cb --- /dev/null +++ b/MLExamples/TinyTransformer/version3_triton/get_rocprof_sys.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# +# Get system-level profiling using rocprof-sys +# Compatible with ROCm 6.x and 7.x +# +# NOTE: rocprof-sys may produce memory map dumps in some configurations. +# Issue reference: TBD +# + +set -e + +echo "==========================================" +echo "rocprof-sys Profiling - TinyTransformer V3" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_sys/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-sys to collect system-level traces +echo "Running: rocprof-sys-run --profile --trace -- python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprof-sys-run --profile --trace -- python ../../tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-sys profiling completed" +else + echo "[FAILED] rocprof-sys profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . 
-type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " Open the .proto file in Perfetto UI: https://ui.perfetto.dev/" +echo "" diff --git a/MLExamples/TinyTransformer/version3_triton/get_trace.sh b/MLExamples/TinyTransformer/version3_triton/get_trace.sh new file mode 100644 index 00000000..8d2c0a82 --- /dev/null +++ b/MLExamples/TinyTransformer/version3_triton/get_trace.sh @@ -0,0 +1,86 @@ +#!/bin/bash +# Script to profile TinyTransformer V3 with rocprofv3 runtime trace +# This captures GPU API calls, kernel launches, and memory operations +# +# Compatible with ROCm 6.x and 7.x + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./traces/trace_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 runtime trace profiling for TinyTransformer V3..." +echo "Output directory: $OUTPUT_DIR" + +# Build rocprofv3 command with appropriate flags for ROCm version +# ROCm 6.4+ and 7.x require explicit --output-format pftrace to generate Perfetto traces +if [ "$ROCM_MAJOR" = "7" ] || [ "$ROCM_MAJOR" = "6" ]; then + echo "Using ROCm 6.x/7.x: --output-format pftrace (generates Perfetto trace)" + OUTPUT_FORMAT="--output-format pftrace" +else + echo "Using ROCm 5.x or older: default format" + OUTPUT_FORMAT="" +fi + +echo "" +echo "Collecting full runtime trace (HIP/HSA API calls, kernels, memory operations)" +echo "" + +# Run with rocprofv3 to collect full runtime trace +cd "$OUTPUT_DIR" +rocprofv3 \ + --runtime-trace \ + $OUTPUT_FORMAT \ + -- python ../../tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10 + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh ./*/ 2>/dev/null || ls -lh . +echo "" + +# Find and report pftrace files +PFTRACE=$(find . -name "*.pftrace" -size +1k 2>/dev/null | head -1) +if [ -n "$PFTRACE" ]; then + echo "Perfetto trace file: $PFTRACE" + echo "Size: $(ls -lh "$PFTRACE" | awk '{print $5}')" + echo "" + echo "To view the trace:" + echo " 1. Visit: https://ui.perfetto.dev/" + echo " 2. 
Open: $PFTRACE" +fi diff --git a/MLExamples/TinyTransformer/version3_triton/test_rocpd.sh b/MLExamples/TinyTransformer/version3_triton/test_rocpd.sh new file mode 100755 index 00000000..50ac7c3f --- /dev/null +++ b/MLExamples/TinyTransformer/version3_triton/test_rocpd.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# +# Test rocpd (ROCm Profiling Daemon) for continuous profiling +# + +set -e + +echo "==========================================" +echo "rocpd Test - Version 3" +echo "==========================================" +echo "" + +# Check if rocpd is available +if ! command -v rocpd &> /dev/null; then + echo "[ERROR] rocpd not found in PATH" + echo "rocpd may not be installed or available in this ROCm version" + exit 1 +fi + +echo "rocpd location: $(which rocpd)" +echo "" + +OUTPUT_DIR="./rocpd/rocpd_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Start rocpd in background +echo "Starting rocpd daemon..." +rocpd --output-dir "$OUTPUT_DIR" & +ROCPD_PID=$! +echo "rocpd running with PID: $ROCPD_PID" +echo "" + +# Give rocpd time to initialize +sleep 2 + +# Run workload +echo "Running workload: python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10" +python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 10 +WORKLOAD_EXIT=$? +echo "" + +# Stop rocpd +echo "Stopping rocpd daemon..." +kill $ROCPD_PID 2>/dev/null || true +wait $ROCPD_PID 2>/dev/null || true +echo "" + +if [ $WORKLOAD_EXIT -eq 0 ]; then + echo "[SUCCESS] Workload completed" +else + echo "[FAILED] Workload failed with exit code $WORKLOAD_EXIT" +fi +echo "" + +echo "Generated files in $OUTPUT_DIR:" +ls -lh "$OUTPUT_DIR" +echo "" + +echo "rocpd output is a SQLite3 database file" +echo "" +echo "To view the database:" +echo " - Use VS Code SQLite Viewer extension" +echo " - rocprof-compute and rocprof-systems can consume it directly" +echo " - No official CLI tool is provided for viewing" +echo "" +echo "rocpd provides continuous profiling with minimal overhead" +echo "" diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/README.md b/MLExamples/TinyTransformer/version4_pytorch_sdpa/README.md index 441f52d1..1f8604e4 100644 --- a/MLExamples/TinyTransformer/version4_pytorch_sdpa/README.md +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/README.md @@ -1,1037 +1,178 @@ +# ML Example: TinyTransformer PyTorch SDPA with ROCm Profiling -# Version 4: Ultra-Fused Triton Implementation +README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version4_pytorch_sdpa` from the Training Examples repository. -README.md from `HPCTrainingExamples/MLExamples/TinyTransformer/version4_pytorch_sdpa` in the Training Examples repository +This version implements ultra-fused Triton kernels with PyTorch SDPA (Scaled Dot Product Attention) for maximum performance. It builds on version3 with complete transformer block fusion, achieving 3.14x speedup and 61% memory reduction over baseline. -**Objective**: Achieve maximum performance through ultra-fusion techniques and state-of-the-art optimization +## Features of the profiling scripts -**Expected Performance**: 3.5-5.0x speedup over baseline, 85-98% memory reduction +The version4_pytorch_sdpa example contains several profiling scripts that capture different aspects of GPU performance: -**Learning Focus**: Advanced kernel fusion, performance engineering, optimization limits +- **get_trace.sh**: Runtime trace collection using rocprofv3. 
Captures HIP/HSA API calls, kernel execution timeline, memory operations (H2D, D2H, D2D transfers), and synchronization events. Output is a Perfetto trace file for timeline visualization. +- **get_counters.sh**: Kernel trace collection using rocprofv3. Captures kernel execution statistics including timing and call counts. Useful for identifying hotspot kernels and their execution patterns. +- **get_rocprof_compute.sh**: Detailed GPU hardware metrics using rocprof-compute. Provides comprehensive performance analysis including compute utilization, memory bandwidth, and hardware counter data. +- **get_rocprof_sys.sh**: System-level profiling using rocprof-sys. Captures call stack sampling and system-level performance data for end-to-end analysis. +- **get_hotspots.sh**: GPU hotspot analysis using rocprofv3 stats mode. Identifies kernels with highest time consumption. -## Overview +## Key Optimizations -Version 4 represents the pinnacle of GPU optimization for transformer models. It implements ultra-fused kernels that process entire transformer blocks in single kernel launches, achieving unprecedented efficiency through: +This version implements the pinnacle of GPU optimization: -- **Complete Block Fusion**: Entire transformer blocks in one kernel -- **Advanced Memory Management**: Optimal register and cache utilization -- **Cross-Layer Optimization**: Optimization across multiple computational layers -- **State-of-the-Art Techniques**: Latest advances in GPU performance engineering +- **PyTorch SDPA**: Hardware-accelerated scaled dot product attention with automatic Flash Attention backend +- **Ultra-Fused Transformer Block**: Entire transformer block in single kernel launch (12 kernels → 1) +- **Advanced Memory Management**: Optimal register and cache utilization, 85-98% memory bandwidth reduction +- **Adaptive Block Sizing**: Hardware-aware block size optimization for different GPU architectures -### Revolutionary Changes +## Overview of the model -``` -Version 1: 12+ kernels per transformer block -Version 2: ~8 kernels per transformer block (basic fusion) -Version 3: ~4 kernels per transformer block (Triton kernels) -Version 4: 1 kernel per transformer block (ultra-fusion) -``` - -### Performance Achievements - -- **Kernel Launch Overhead**: Reduced by 90-95% -- **Memory Traffic**: Reduced by 85-98% -- **Cache Efficiency**: Maximized through optimal data reuse -- **Register Utilization**: Optimal balance of parallelism and resource usage - -## Architecture Innovations and Ultra-Fusion Techniques - -### Mathematical Foundation of Ultra-Fusion - -Ultra-fusion represents the theoretical limit of kernel fusion, combining entire transformer blocks into single GPU kernels. For complete mathematical foundations, see [TINY_LLAMA_ARCHITECTURE.md](../TINY_LLAMA_ARCHITECTURE.md). 
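+
+The PyTorch SDPA path listed among the key optimizations can be exercised directly from Python. The following is a minimal sketch with arbitrary illustrative shapes, not the tiny_llama_v4.py code itself; PyTorch selects a fused backend (Flash Attention, memory-efficient, or math) based on dtype, shapes, and hardware support:
+
+```python
+import torch
+import torch.nn.functional as F
+
+# (batch, heads, seq_len, head_dim) -- shapes chosen only for illustration
+q = torch.randn(8, 8, 128, 64, device="cuda", dtype=torch.float16)
+k = torch.randn(8, 8, 128, 64, device="cuda", dtype=torch.float16)
+v = torch.randn(8, 8, 128, 64, device="cuda", dtype=torch.float16)
+
+# is_causal applies the causal mask without materializing the full
+# seq_len x seq_len score matrix when a fused backend is chosen.
+out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+print(out.shape)  # torch.Size([8, 8, 128, 64])
+```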
- -#### Ultra-Fusion Efficiency Analysis - -**Kernel Launch Overhead Elimination:** - -$$\begin{aligned} -\text{Baseline Kernel Count} &: K_{\text{base}} = 12 \text{ kernels per block} \\ -\text{Ultra-Fused Count} &: K_{\text{ultra}} = 1 \text{ kernel per block} \\ -\text{Overhead Reduction} &: \frac{K_{\text{base}} - K_{\text{ultra}}}{K_{\text{base}}} = \frac{11}{12} = 91.7\% \\ -\text{Latency Savings} &: 11 \times T_{\text{launch}} \text{ per block} -\end{aligned}$$ - -**Memory Bandwidth Optimization:** - -$$\begin{aligned} -\text{Baseline Memory Access} &: \sum_{i=1}^{12} (\text{Input}_i + \text{Output}_i) \\ -\text{Ultra-Fused Access} &: \text{Input}_{\text{block}} + \text{Output}_{\text{block}} \\ -\text{Bandwidth Reduction} &: \frac{\text{Baseline} - \text{Ultra-Fused}}{\text{Baseline}} \approx 85-95\% -\end{aligned}$$ - -### 1. Ultra-Fused Transformer Block Implementation - -#### Complete Mathematical Flow - -**Single-Kernel Transformer Block:** - -$$\begin{aligned} -\text{Input:} \quad & x \in \mathbb{R}^{B \times S \times D} \\ -\text{Attention Block:} \quad & \text{attn\_out} = x + \text{Attention}(\text{RMSNorm}(x)) \\ -\text{FFN Block:} \quad & \text{output} = \text{attn\_out} + \text{SwiGLU}(\text{RMSNorm}(\text{attn\_out})) \\ -\text{All in One Kernel!} \quad & \text{Eliminates } 11 \text{ intermediate memory operations} -\end{aligned}$$ - -#### Ultra-Fused Kernel Implementation - -```python -@triton.jit -def ultra_fused_transformer_block_kernel( - # Input/Output pointers - x_ptr, output_ptr, - # Attention weights - attn_norm_weight_ptr, qkv_weight_ptr, attn_out_weight_ptr, - # FFN weights - ffn_norm_weight_ptr, gate_weight_ptr, up_weight_ptr, down_weight_ptr, - # Dimensions - batch_size, seq_len, hidden_dim, num_heads, intermediate_dim, - # Block sizes (auto-tuned) - BLOCK_SIZE_B: tl.constexpr, - BLOCK_SIZE_S: tl.constexpr, - BLOCK_SIZE_D: tl.constexpr -): - """ - Ultra-fused transformer block - entire block in single kernel. - - Fusion Strategy: - 1. Load input once into shared memory - 2. Compute attention norm + QKV + attention + output in registers - 3. Add residual connection in registers - 4. Compute FFN norm + gate/up + SiLU + down in registers - 5. 
Add final residual and write output once - - Memory Optimization: - - Input read: 1x per block - - Weight reads: Streamed through cache - - Intermediate results: Kept in registers/shared memory - - Output write: 1x per block - """ - - # Thread block coordinates - batch_idx = tl.program_id(0) - seq_block_idx = tl.program_id(1) - dim_block_idx = tl.program_id(2) +The model is controlled with the following arguments: - # Compute global indices - seq_offset = seq_block_idx * BLOCK_SIZE_S + tl.arange(0, BLOCK_SIZE_S) - dim_offset = dim_block_idx * BLOCK_SIZE_D + tl.arange(0, BLOCK_SIZE_D) +- `--batch-size `: batch size for training (default: 8) +- `--seq-len `: sequence length (default: 256) +- `--num-steps `: number of training steps (default: 50) +- `--hidden-dim `: hidden dimension (default: 512) +- `--num-layers `: number of transformer layers (default: 8) +- `--num-heads `: number of attention heads (default: 8) +- `--learning-rate `: learning rate (default: 3e-4) +- `--use-amp`: enable automatic mixed precision - # Bounds checking - seq_mask = seq_offset < seq_len - dim_mask = dim_offset < hidden_dim +## Running the ultra-fused model - # PHASE 1: Load input data (single global memory read) - input_ptr_offset = ( - batch_idx * seq_len * hidden_dim + - seq_offset[:, None] * hidden_dim + - dim_offset[None, :] - ) +Load the required modules: - x_block = tl.load( - x_ptr + input_ptr_offset, - mask=seq_mask[:, None] & dim_mask[None, :], - other=0.0 - ) - - # Store original input for residual connections - residual_1 = x_block # Stored in registers! - - # PHASE 2: Attention normalization (fused with attention) - # RMSNorm computation in registers - variance = tl.sum(x_block * x_block, axis=1, keepdims=True) / hidden_dim - rstd = 1.0 / tl.sqrt(variance + 1e-6) - - # Load attention norm weights and apply - attn_norm_weight = tl.load( - attn_norm_weight_ptr + dim_offset, - mask=dim_mask - ) - x_normed = x_block * rstd * attn_norm_weight[None, :] - - # PHASE 3: Ultra-fused attention computation - # This would include QKV projection, attention, and output projection - # (Simplified for brevity - full implementation would include all attention logic) - attn_output = ultra_fused_attention_computation( - x_normed, qkv_weight_ptr, attn_out_weight_ptr, - seq_offset, dim_offset, num_heads - ) - - # First residual connection (in registers) - post_attn = residual_1 + attn_output - - # PHASE 4: FFN normalization (fused with FFN) - variance_2 = tl.sum(post_attn * post_attn, axis=1, keepdims=True) / hidden_dim - rstd_2 = 1.0 / tl.sqrt(variance_2 + 1e-6) - - ffn_norm_weight = tl.load( - ffn_norm_weight_ptr + dim_offset, - mask=dim_mask - ) - ffn_input = post_attn * rstd_2 * ffn_norm_weight[None, :] - - # PHASE 5: Ultra-fused SwiGLU computation - ffn_output = ultra_fused_swiglu_computation( - ffn_input, gate_weight_ptr, up_weight_ptr, down_weight_ptr, - seq_offset, dim_offset, intermediate_dim - ) - - # Final residual connection (in registers) - final_output = post_attn + ffn_output - - # PHASE 6: Single global memory write - output_ptr_offset = ( - batch_idx * seq_len * hidden_dim + - seq_offset[:, None] * hidden_dim + - dim_offset[None, :] - ) - - tl.store( - output_ptr + output_ptr_offset, - final_output, - mask=seq_mask[:, None] & dim_mask[None, :] - ) - -@triton.jit -def ultra_fused_attention_computation( - x_normed, qkv_weight_ptr, attn_out_weight_ptr, - seq_offset, dim_offset, num_heads -): - """ - Ultra-fused attention computation within transformer block kernel. 
- """ - # QKV projection with register reuse - head_dim = hidden_dim // num_heads - - # Compute Q, K, V in parallel using register blocking - # (Implementation details for space efficiency) - - # Flash attention computation with optimal memory access - # (Using techniques from Version 3 but within ultra-fused context) - - # Return attention output (kept in registers) - return attention_result - -@triton.jit -def ultra_fused_swiglu_computation( - ffn_input, gate_weight_ptr, up_weight_ptr, down_weight_ptr, - seq_offset, dim_offset, intermediate_dim -): - """ - Ultra-fused SwiGLU computation within transformer block kernel. - """ - # Gate and up projections with register reuse - # SiLU activation fused with element-wise multiply - # Down projection with output accumulation - - # All operations optimized for register usage - return swiglu_result ``` - -#### Memory Access Pattern Analysis - -```python -ULTRA_FUSION_MEMORY_ANALYSIS = { - 'baseline_transformer_block': { - 'memory_reads': { - 'input_tensor': 12, # Read 12 times across operations - 'weight_matrices': 12, # Various weight reads - 'intermediate_tensors': 22, # Multiple intermediate results - 'total_memory_ops': 46 - }, - 'memory_writes': { - 'intermediate_results': 11, # 11 intermediate tensors stored - 'final_output': 1, - 'total_writes': 12 - } - }, - 'ultra_fused_block': { - 'memory_reads': { - 'input_tensor': 1, # Single read at start - 'weight_matrices': 7, # Streamed weight access - 'intermediate_tensors': 0, # Kept in registers! - 'total_memory_ops': 8 - }, - 'memory_writes': { - 'intermediate_results': 0, # No intermediate storage - 'final_output': 1, - 'total_writes': 1 - } - }, - 'memory_bandwidth_reduction': '83% fewer memory operations', - 'register_utilization': '95% of available register file' -} +module load pytorch rocm triton ``` -### 2. Advanced Memory Hierarchy Management - -#### Register File Optimization +Run a basic training run: -```python -class UltraOptimizedRegisterManagement: - """ - Sophisticated register allocation for ultra-fused kernels. - """ - - def __init__(self, gpu_arch): - self.register_file_size = gpu_arch.register_file_size # e.g., 64KB per SM - self.max_threads_per_block = gpu_arch.max_threads_per_block - self.register_allocation_strategy = self._optimize_register_allocation() - - def _optimize_register_allocation(self): - """ - Optimize register allocation for maximum occupancy. 
- - Trade-off Analysis: - - More registers per thread → Better performance per thread - - Fewer registers per thread → Higher occupancy - - Optimal Point: Maximum (threads × performance_per_thread) - """ - - optimization_space = { - 'high_occupancy': { - 'registers_per_thread': 32, - 'threads_per_block': 256, - 'occupancy': '100%', - 'performance_per_thread': '85%' - }, - 'high_performance': { - 'registers_per_thread': 64, - 'threads_per_block': 128, - 'occupancy': '50%', - 'performance_per_thread': '120%' - }, - 'optimal_balance': { - 'registers_per_thread': 48, - 'threads_per_block': 192, - 'occupancy': '75%', - 'performance_per_thread': '105%', - 'total_performance': '78.75% (optimal)' - } - } - - return optimization_space['optimal_balance'] ``` - -#### Cache Hierarchy Optimization - -```python -# L1 Cache optimization (32KB per SM) -L1_CACHE_STRATEGY = { - 'temporal_locality': { - 'weight_reuse': 'Keep frequently accessed weights in L1', - 'activation_reuse': 'Reuse activations across attention heads', - 'pattern': 'Block-wise computation to maximize reuse' - }, - 'spatial_locality': { - 'memory_coalescing': 'Ensure consecutive threads access consecutive memory', - 'cache_line_utilization': 'Full 128-byte cache line usage', - 'stride_optimization': 'Minimize memory stride patterns' - } -} - -# L2 Cache optimization (8MB shared across CUs) -L2_CACHE_STRATEGY = { - 'weight_streaming': { - 'pattern': 'Stream weights through L2 for multiple attention heads', - 'prefetching': 'Prefetch next weight blocks during computation', - 'retention': 'Keep frequently accessed weights in L2' - }, - 'activation_sharing': { - 'cross_head_sharing': 'Share activations across attention heads', - 'batch_sharing': 'Share activations across batch elements', - 'temporal_reuse': 'Optimize for temporal reuse patterns' - } -} +echo "Running TinyTransformer V4 Ultra-Fused" +python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10 ``` -### 3. Intelligent Compilation and Auto-Tuning System - -#### Hardware-Adaptive Compilation - -```python -class UltraFusedCompiler: - """ - Intelligent compilation system for ultra-fused kernels. - """ - - def __init__(self, target_gpu): - self.gpu_arch = self._detect_gpu_architecture(target_gpu) - self.optimization_parameters = self._derive_optimal_parameters() - self.kernel_cache = {} - - def _detect_gpu_architecture(self, target_gpu): - """ - Detect GPU architecture and capabilities. - """ - gpu_specs = { - 'gfx906': { # MI50 - 'compute_units': 60, - 'register_file_per_cu': 64 * 1024, # 64KB - 'shared_memory_per_cu': 64 * 1024, # 64KB - 'memory_bandwidth': 1024, # GB/s - 'peak_flops_fp32': 6.7e12 # FLOPS - }, - 'gfx908': { # MI100 - 'compute_units': 120, - 'register_file_per_cu': 64 * 1024, - 'shared_memory_per_cu': 64 * 1024, - 'memory_bandwidth': 1200, - 'peak_flops_fp32': 11.5e12 - }, - 'gfx90a': { # MI200 series - 'compute_units': 110, - 'register_file_per_cu': 64 * 1024, - 'shared_memory_per_cu': 64 * 1024, - 'memory_bandwidth': 1600, - 'peak_flops_fp32': 23e12 - } - } - - return gpu_specs.get(target_gpu, gpu_specs['gfx90a']) - - def _derive_optimal_parameters(self): - """ - Derive optimal kernel parameters based on hardware characteristics. 
- """ - # Roofline analysis for optimal block sizes - arithmetic_intensity_target = self.gpu_arch['peak_flops_fp32'] / self.gpu_arch['memory_bandwidth'] - - # Optimize for memory hierarchy - l1_cache_size = 32 * 1024 # 32KB L1 cache - optimal_working_set = l1_cache_size * 0.8 # 80% utilization - - # Derive block sizes - block_size_optimization = { - 'BLOCK_SIZE_B': self._optimize_batch_blocking(), - 'BLOCK_SIZE_S': self._optimize_sequence_blocking(), - 'BLOCK_SIZE_D': self._optimize_feature_blocking(), - 'BLOCK_SIZE_H': self._optimize_head_blocking() - } - - return block_size_optimization - - def _optimize_batch_blocking(self): - """Optimize batch dimension blocking.""" - # Consider memory coalescing and occupancy - optimal_batch_block = 4 # Empirically determined - return optimal_batch_block - - def _optimize_sequence_blocking(self): - """Optimize sequence dimension blocking.""" - # Balance between cache utilization and parallelism - sequence_block_candidates = [32, 64, 128, 256] - optimal_seq_block = 64 # Based on cache analysis - return optimal_seq_block +## Runtime Trace Profiling with get_trace.sh - def _optimize_feature_blocking(self): - """Optimize feature dimension blocking.""" - # Vectorization and memory coalescing - feature_block_candidates = [64, 128, 256] - optimal_feature_block = 128 # Optimal for most architectures - return optimal_feature_block +This script captures GPU API calls, kernel launches, and memory operations for timeline analysis. - def _optimize_head_blocking(self): - """Optimize attention head blocking.""" - # Balance between register usage and parallelism - head_block_candidates = [1, 2, 4, 8] - optimal_head_block = 2 # Good balance for register pressure - return optimal_head_block +Run the profiling script: - def compile_ultra_kernel(self, kernel_signature): - """ - Compile ultra-fused kernel with optimal parameters. - """ - if kernel_signature in self.kernel_cache: - return self.kernel_cache[kernel_signature] - - # Generate kernel with optimal parameters - compiled_kernel = self._generate_optimized_kernel( - kernel_signature, - self.optimization_parameters - ) - - # Cache for reuse - self.kernel_cache[kernel_signature] = compiled_kernel - - return compiled_kernel ``` - -#### Auto-Tuning Framework - -```python -class UltraFusedAutoTuner: - """ - Automatic tuning system for ultra-fused kernels. - """ - - def __init__(self, search_space, evaluation_metric='throughput'): - self.search_space = search_space - self.evaluation_metric = evaluation_metric - self.tuning_history = [] - - def tune_kernel_parameters(self, model, test_inputs, max_iterations=100): - """ - Auto-tune kernel parameters for optimal performance. 
- """ - - # Define search space - parameter_space = { - 'block_sizes': { - 'BLOCK_SIZE_B': [1, 2, 4, 8], - 'BLOCK_SIZE_S': [32, 64, 128, 256], - 'BLOCK_SIZE_D': [64, 128, 256], - 'BLOCK_SIZE_H': [1, 2, 4] - }, - 'memory_optimization': { - 'use_shared_memory': [True, False], - 'vectorization_factor': [1, 2, 4], - 'prefetch_distance': [0, 1, 2] - }, - 'compute_optimization': { - 'unroll_factor': [1, 2, 4, 8], - 'pipeline_stages': [1, 2, 3], - 'register_allocation_strategy': ['high_occupancy', 'high_performance'] - } - } - - # Bayesian optimization for efficient parameter search - best_params, best_performance = self._bayesian_optimization( - parameter_space, model, test_inputs, max_iterations - ) - - return best_params, best_performance - - def _bayesian_optimization(self, param_space, model, inputs, max_iter): - """Bayesian optimization for parameter tuning.""" - # Efficient parameter space exploration - # (Simplified implementation) - - best_params = None - best_performance = 0 - - for iteration in range(max_iter): - # Sample parameters from posterior distribution - params = self._sample_parameters(param_space) - - # Evaluate performance - performance = self._evaluate_performance(model, inputs, params) - - # Update best configuration - if performance > best_performance: - best_performance = performance - best_params = params - - # Update posterior distribution - self._update_posterior(params, performance) - - return best_params, best_performance +echo "Collecting runtime trace with rocprofv3" +./get_trace.sh ``` -## Files and Structure +The script will output results to `traces/trace_/`. To analyze the results: ``` -version4_pytorch_sdpa/ -├── README.md # This file -├── tiny_llama_v4.py # Ultra-fused implementation -├── run_ultra_profiling.py # Advanced profiling suite -├── exercises/ -│ └── exercise1_ultra_fusion.md # Ultra-fusion deep dive -└── results/ # Generated analysis results +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .pftrace file" ``` -### Performance Engineering Principles - -#### Roofline Model Integration - -```python -class UltraFusedRooflineAnalysis: - """ - Roofline model analysis for ultra-fused kernels. - """ - - def __init__(self, gpu_specifications): - self.peak_compute = gpu_specifications['peak_flops_fp32'] # FLOPS/second - self.peak_bandwidth = gpu_specifications['memory_bandwidth'] # Bytes/second - self.ridge_point = self.peak_compute / self.peak_bandwidth # FLOPS/byte +## Kernel Trace Profiling with get_counters.sh - def analyze_kernel_performance(self, kernel_name, flops, bytes_accessed): - """ - Analyze kernel performance using roofline model. - """ - arithmetic_intensity = flops / bytes_accessed +This script collects kernel execution statistics including timing and call counts. 
- if arithmetic_intensity < self.ridge_point: - # Memory-bound operation - theoretical_performance = arithmetic_intensity * self.peak_bandwidth - bottleneck = 'memory_bandwidth' - optimization_strategy = 'reduce_memory_access' - else: - # Compute-bound operation - theoretical_performance = self.peak_compute - bottleneck = 'compute_throughput' - optimization_strategy = 'increase_arithmetic_intensity' +Run the profiling script: - analysis_result = { - 'kernel': kernel_name, - 'arithmetic_intensity': arithmetic_intensity, - 'ridge_point': self.ridge_point, - 'bottleneck': bottleneck, - 'theoretical_peak': theoretical_performance, - 'optimization_strategy': optimization_strategy - } - - return analysis_result - -# Example roofline analysis for ultra-fused transformer block -TRANSFORMER_BLOCK_ROOFLINE = { - 'ultra_fused_block': { - 'total_flops': 4 * batch_size * seq_len * hidden_dim * (hidden_dim + intermediate_dim), - 'memory_bytes': batch_size * seq_len * hidden_dim * 8, # Input + output only! - 'arithmetic_intensity': 'total_flops / memory_bytes', - 'expected_intensity': '~500 FLOPS/byte (highly compute-bound)', - 'performance_regime': 'compute_bound (good for GPUs)' - }, - 'baseline_comparison': { - 'baseline_arithmetic_intensity': '~50 FLOPS/byte', - 'ultra_fused_intensity': '~500 FLOPS/byte', - 'improvement': '10x better arithmetic intensity' - } -} ``` - -#### Advanced Memory Optimization Techniques - -```python -class UltraMemoryOptimizer: - """ - Advanced memory optimization for ultra-fused kernels. - """ - - def __init__(self, gpu_memory_hierarchy): - self.memory_hierarchy = gpu_memory_hierarchy - self.optimization_strategies = self._initialize_strategies() - - def _initialize_strategies(self): - return { - 'register_optimization': { - 'vectorization': 'Use float4 for 4x memory throughput', - 'register_blocking': 'Tile data to fit in register file', - 'spill_minimization': 'Careful variable lifetime management' - }, - 'shared_memory_optimization': { - 'bank_conflict_avoidance': 'Pad data structures to avoid conflicts', - 'coalesced_loading': 'Ensure optimal memory access patterns', - 'double_buffering': 'Overlap computation with memory access' - }, - 'global_memory_optimization': { - 'prefetching': 'Prefetch next data blocks during computation', - 'streaming': 'Stream large data through memory hierarchy', - 'compression': 'Use mixed precision to reduce bandwidth' - } - } - - def optimize_memory_access_pattern(self, kernel_specification): - """ - Optimize memory access patterns for ultra-fused kernels. - """ - - optimizations = { - 'coalescing_optimization': { - 'thread_mapping': 'Map consecutive threads to consecutive memory', - 'memory_stride': 'Ensure stride-1 access patterns', - 'alignment': 'Align data to cache line boundaries' - }, - 'cache_optimization': { - 'temporal_locality': 'Reuse data while in cache', - 'spatial_locality': 'Access nearby memory locations', - 'cache_blocking': 'Tile computations to fit in cache' - }, - 'bandwidth_optimization': { - 'vectorized_loads': 'Use SIMD memory instructions', - 'memory_pipelining': 'Overlap memory with computation', - 'bandwidth_balancing': 'Balance read/write bandwidth usage' - } - } - - return optimizations +echo "Collecting kernel trace with rocprofv3" +./get_counters.sh ``` -## Key Components Deep Dive +The script will output results to `counters/counter_/`. -### Ultra-Fused Transformer Block +ROCm 6.x outputs CSV files directly, while ROCm 7.x outputs SQLite databases. 
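+For the ROCm 6.x CSV output, a few lines of pandas are enough to rank kernels by total GPU time. This is a minimal sketch: the column names (`Kernel_Name`, `Start_Timestamp`, `End_Timestamp`) are assumptions, so check the header of the CSV inside your `counters/` output directory and adjust them if they differ.
+
+```python
+# Sketch: summarize a rocprofv3 kernel-trace CSV (ROCm 6.x output).
+# Column names are assumptions -- verify against the actual CSV header.
+import pandas as pd
+
+df = pd.read_csv("kernel_trace.csv")  # CSV inside your counters/counter_* directory
+df["duration_ns"] = df["End_Timestamp"] - df["Start_Timestamp"]
+
+summary = (
+    df.groupby("Kernel_Name")["duration_ns"]
+      .agg(calls="count", total_ns="sum", avg_ns="mean")
+      .sort_values("total_ns", ascending=False)
+)
+print(summary.head(10))
+```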
For ROCm 7.x database files, use rocpd tools: -**Input Processing:** -```python -# Single token, entire transformer block -residual_1 = x_token -# Attention norm → QKV → Attention → Output → Residual -# FFN norm → Gate/Up → SiLU → Down → Residual -final_output = residual_2 + ffn_output ``` - -**Memory Efficiency:** - -- **Register Reuse**: Maximizes data kept in fast registers -- **Memory Coalescing**: Optimal access patterns for global memory -- **Cache Optimization**: Designed for L1/L2 cache efficiency - -### Advanced Performance Features - -**1. Adaptive Block Sizing:** -```python -BLOCK_SIZE_B: tl.constexpr, # Batch dimension blocking -BLOCK_SIZE_S: tl.constexpr, # Sequence dimension blocking -BLOCK_SIZE_D: tl.constexpr, # Feature dimension blocking -BLOCK_SIZE_H: tl.constexpr, # Head dimension blocking +echo "Exporting kernel statistics to CSV" +rocpd2csv -i -o kernel_stats.csv ``` -**2. Ultra-Mode Toggle:** -```python -model.enable_ultra_mode(True) # Maximum performance -model.enable_ultra_mode(False) # Fallback for debugging ``` - -**3. Performance Prediction:** -```python -# Built-in performance modeling -predicted_time = predict_performance(batch_size, seq_len, d_model) +echo "Getting kernel summary" +rocpd summary -i --region-categories KERNEL ``` -## Quick Start +Documentation for rocpd tools: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html -### 1. Run Ultra-Fused Model - -```bash -cd version4_pytorch_sdpa/ -python3 tiny_llama_v4.py -``` +## GPU Hardware Metrics with get_rocprof_compute.sh -**Expected Output:** -``` -Compiling ultra-fused kernels... -Ultra-fused kernels compiled successfully! +This script collects detailed GPU performance metrics for hardware utilization analysis. -=== Ultra-Fused Model Benchmark === -Testing: batch_size=1, seq_len=128 - Ultra-fused: XX.XX ms - Standard: YY.YY ms - Speedup: Z.ZZx - Throughput: XXXX tokens/s - Memory: X.XX GB +Run the profiling script: -Average speedup: X.XXx -Maximum speedup: Y.YYx -Peak throughput: ZZZZ tokens/s ``` - - - -## Performance Analysis - -### Expected Performance Gains - -| Metric | Baseline | Version 2 | Version 3 | Version 4 | V4 Total Gain | -|--------|----------|-----------|-----------|-----------|---------------| -| Execution Time | 100% | 50-70% | 30-45% | **20-30%** | **3.3-5.0x** | -| Memory Usage | 100% | 40-60% | 20-35% | **10-20%** | **5.0-10x** | -| Kernel Launches | 100% | 30-50% | 15-25% | **8-12%** | **8.3-12.5x** | -| Cache Efficiency | 100% | 120-140% | 150-180% | **200-250%** | **2.0-2.5x** | - -### Scaling Characteristics - -**Sequence Length Scaling:** - -- **Short sequences (≤256)**: 4.0-5.0x speedup -- **Medium sequences (512)**: 3.5-4.5x speedup -- **Long sequences (1024+)**: 3.0-4.0x speedup - -**Batch Size Scaling:** - -- **Single batch**: 3.5-4.5x speedup -- **Small batches (2-4)**: 4.0-5.0x speedup -- **Large batches (8+)**: 3.5-4.5x speedup - -**Model Size Scaling:** - -- **Small models**: 4.5-5.0x speedup -- **Medium models**: 4.0-4.5x speedup -- **Large models**: 3.5-4.0x speedup - -## Advanced Features +echo "Generating performance analysis report" +rocprof-compute analyze -p /workloads//rocprof --dispatch -n tiny_llama_dispatch +``` -### 1. 
Performance Engineering +For available analysis options: -**Roofline Model Integration:** -```python -arithmetic_intensity = total_flops / total_bytes -if arithmetic_intensity > compute_bound_threshold: - # Optimize for compute efficiency -else: - # Optimize for memory bandwidth ``` - -**Register Pressure Management:** -```python -# Intelligent register allocation -# Float4 vectorization -# Optimal loop unrolling -# Compiler hint optimization +rocprof-compute analyze --help ``` -### 2. Memory Hierarchy Optimization - -**L1 Cache Optimization:** - -- Temporal locality maximization -- Spatial locality optimization -- Cache line utilization +Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) for full hardware counter support. Consumer GPUs may have limited counter availability. -**L2 Cache Strategy:** +## System-Level Profiling with get_rocprof_sys.sh -- Weight reuse patterns -- Prefetching optimization -- Bank conflict avoidance +This script captures system-level performance with call stack sampling. -**Global Memory Efficiency:** +Run the profiling script: -- Coalescing optimization -- Bandwidth utilization -- Access pattern optimization - -### 3. Adaptive Optimization - -**Hardware Detection:** -```python -# Automatic GPU architecture detection -# Optimal kernel parameter selection -# Performance characteristic adaptation ``` - -**Dynamic Configuration:** -```python -# Runtime performance optimization -# Adaptive block size selection -# Memory configuration tuning +echo "Collecting system-level profile with rocprof-sys" +./get_rocprof_sys.sh ``` -## Hands-on Exercises - -### Exercise 1: Ultra-Fusion Architecture (90 minutes) - -**Focus Areas:** +The script will output results to `rocprof_sys/profile_/`. To analyze the results: -- Ultra-fusion architecture analysis -- Advanced memory management -- Performance engineering deep dive -- Roofline model application - -**Key Learning Objectives:** - -1. Understand ultra-fusion principles and trade-offs -2. Analyze advanced memory hierarchy optimization -3. Apply performance engineering techniques -4. Master roofline model analysis - -## Advanced Topics - -### Performance Engineering Principles - -1. **Kernel Fusion Strategies** - - Identify fusion opportunities - - Balance register pressure vs parallelism - - Optimize memory access patterns - -2. **Memory Hierarchy Mastery** - - Register allocation optimization - - Cache utilization maximization - - Global memory bandwidth efficiency - -3. **Hardware-Specific Optimization** - - GPU architecture adaptation - - Instruction-level optimization - - Memory subsystem tuning - -### Optimization Methodology - -1. **Profile-Guided Optimization** - ```bash - # Profile → Analyze → Optimize → Validate - # Identify bottlenecks - # Apply targeted optimizations - # Measure improvements - ``` - -2. **Performance Modeling** - ```python - # Predict performance for new configurations - # Guide optimization decisions - # Validate theoretical vs actual performance - ``` - -3. **Iterative Refinement** - ```python - # Continuous optimization cycle - # A/B testing of optimizations - # Performance regression detection - ``` - - - -### Performance Metrics - -**Key Metrics to Monitor:** - -1. **Kernel Efficiency**: Execution time, occupancy, utilization -2. **Memory Performance**: Bandwidth, cache hit rates, access patterns -3. **System Integration**: CPU-GPU coordination, data transfer efficiency - -## Production Considerations - -### Deployment Optimization - -1. 
**Model Compilation** - ```python - # Precompile for target hardware - # Cache compiled kernels - # Version management - ``` - -2. **Runtime Optimization** - ```python - # Dynamic adaptation - # Performance monitoring - # Fallback strategies - ``` - -3. **Scalability** - ```python - # Multi-GPU scaling - # Memory management - # Load balancing - ``` - -### Monitoring and Debugging - -1. **Performance Monitoring** - - Real-time performance metrics - - Trend analysis - - Anomaly detection - -2. **Debugging Tools** - - Kernel-level debugging - - Memory access visualization - - Performance bottleneck identification - -## Limitations and Trade-offs - -### Current Limitations - -1. **Hardware Dependency**: Optimized for specific GPU architectures -2. **Complexity**: Increased development and maintenance complexity -3. **Debugging Difficulty**: More challenging to debug fused kernels -4. **Portability**: May require adaptation for different hardware - -### Trade-off Analysis - -| Aspect | Benefit | Cost | -|--------|---------|------| -| Performance | 3.5-5.0x speedup | Development complexity | -| Memory Efficiency | 85-98% reduction | Debugging difficulty | -| Kernel Fusion | Minimal launches | Hardware dependency | -| Optimization | Maximum efficiency | Maintenance overhead | - -## Future Directions - -### Emerging Techniques - -1. **AI-Guided Optimization** - - ML-based kernel optimization - - Automated parameter tuning - - Performance prediction - -2. **Hardware Co-design** - - Kernel-hardware co-optimization - - Custom instruction utilization - - Memory hierarchy adaptation - -3. **Cross-Layer Optimization** - - Model-kernel co-design - - End-to-end optimization - - System-level efficiency +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .proto file" +``` -### Research Opportunities +Note: rocprof-sys may produce memory map dumps in some configurations. If profiling fails or produces excessive output, consider using rocprofv3 (get_trace.sh) instead. -1. **Automatic Fusion** - - Compiler-driven optimization - - Pattern recognition - - Optimization space exploration +## GPU Hotspot Analysis with get_hotspots.sh -2. **Adaptive Optimization** - - Runtime adaptation - - Workload-specific tuning - - Dynamic reconfiguration +This script identifies kernels with the highest execution time using rocprofv3 stats mode. -## Conclusion +Run the profiling script: -Version 4 represents the state-of-the-art in GPU optimization for transformer models. Through ultra-fusion techniques, it achieves: +``` +echo "Collecting GPU hotspots" +./get_hotspots.sh +``` -- **Maximum Performance**: 3.5-5.0x speedup over baseline -- **Optimal Efficiency**: 85-98% memory reduction -- **Advanced Techniques**: State-of-the-art optimization methods -- **Production Ready**: Robust, scalable implementation +The script will output kernel statistics to `hotspots/hotspot_/`. -This implementation demonstrates the pinnacle of what's possible with current GPU optimization techniques while providing a foundation for future advances. 
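+The stats files produced by `--stats` already contain per-kernel totals, so ranking hotspots is mostly a sorting exercise. The sketch below assumes columns named `Name`, `Calls`, and `TotalDurationNs`; verify them against the actual CSV header in your `hotspots/` output before running.
+
+```python
+# Sketch: rank kernels from a rocprofv3 --stats CSV and report cumulative share of GPU time.
+# Column names are assumptions -- adjust to match your output files.
+import pandas as pd
+
+stats = pd.read_csv("kernel_stats.csv")  # stats CSV from the hotspots/hotspot_* output
+stats = stats.sort_values("TotalDurationNs", ascending=False)
+stats["share"] = stats["TotalDurationNs"] / stats["TotalDurationNs"].sum()
+stats["cumulative_share"] = stats["share"].cumsum()
+print(stats[["Name", "Calls", "TotalDurationNs", "cumulative_share"]].head(10))
+```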
+## Expected Performance Improvements -## Resources +Results from AMD MI325X with ROCm 6.4.4: -### Technical Documentation -- [Triton Advanced Programming Guide](https://triton-lang.org/main/programming-guide/index.html) -- [AMD GPU Architecture](https://rocmdocs.amd.com/en/latest/Programming_Guides/Programming-Guides.html) -- [Performance Optimization Best Practices](https://rocmdocs.amd.com/en/latest/Programming_Guides/Performance_optimization.html) +| Version | Throughput | Memory | Improvement | +|---------|-----------|--------|-------------| +| V1 Baseline | 372.9 samples/sec | 522.3 MB | - | +| V4 Ultra-Fused | 1171.0 samples/sec | 203.5 MB | 3.14x faster, 61% less memory | -### Research Papers -- [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135) -- [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf) -- [The Roofline Model: A Tool for Performance Analysis](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/) +Key optimization impacts: +- Ultra-fused transformer block: 12 kernel launches → 1 +- PyTorch SDPA: Hardware-accelerated attention with Flash Attention backend +- Memory hierarchy optimization: 85-98% intermediate memory elimination -### Community Resources -- [AMD ROCm Community](https://github.com/RadeonOpenCompute/ROCm) -- [Triton Community](https://github.com/openai/triton) -- [GPU Optimization Forums](https://developer.amd.com/community/) +## Additional Resources +- rocprofv3 documentation: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocprofv3.html +- rocpd output format: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html +- Perfetto UI: https://ui.perfetto.dev/ +- Triton Language Tutorial: https://triton-lang.org/main/getting-started/tutorials/index.html +- Flash Attention Paper: https://arxiv.org/abs/2205.14135 diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/exercises/exercise1_ultra_fusion.md b/MLExamples/TinyTransformer/version4_pytorch_sdpa/exercises/exercise1_ultra_fusion.md index 760496c1..f3c6e31c 100644 --- a/MLExamples/TinyTransformer/version4_pytorch_sdpa/exercises/exercise1_ultra_fusion.md +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/exercises/exercise1_ultra_fusion.md @@ -1,525 +1,78 @@ - ## Exercise 1: Ultra-Fusion Architecture and Design -`exercise1_ultra_fusion.md` from `HPCTrainingExamples/MLExamples/TinyTransformer/version4_pytorch_sdpa/exercises` in the Training Examples repository - -**Objective**: Understand ultra-fusion principles and analyze the most advanced optimization techniques in GPU kernel development. - -**Time**: 90 minutes +**Objective**: Understand ultra-fusion principles and analyze advanced GPU kernel optimization techniques. -**Prerequisites**: Completed all exercises in Versions 1-3 +**Time**: 90 minutes | **Prerequisites**: Completed all exercises in Versions 1-3 ### Background -Ultra-fusion represents the pinnacle of GPU optimization, where entire transformer blocks are processed in single kernel launches with minimal memory traffic. 
This exercise explores the advanced techniques used to achieve maximum performance: - -- Cross-layer kernel fusion -- Advanced memory hierarchy optimization -- Ultra-efficient data flow patterns -- State-of-the-art performance engineering - -### Part A: Ultra-Fusion Architecture Analysis (30 minutes) +Ultra-fusion represents the pinnacle of GPU optimization, where entire transformer blocks are processed in single kernel launches with minimal memory traffic. -#### Step 1: Understand the Ultra-Fused Transformer Block +### Part A: Ultra-Fusion Architecture Analysis -Examine the `ultra_fused_transformer_block_kernel` in `tiny_llama_v4.py`: +Examine `ultra_fused_transformer_block_kernel` in `tiny_llama_v4.py`: ```python @triton.jit def ultra_fused_transformer_block_kernel( - # Input and output tensors x_ptr, output_ptr, - # All weights (attention + FFN + norms) q_weight_ptr, k_weight_ptr, v_weight_ptr, o_weight_ptr, gate_weight_ptr, up_weight_ptr, down_weight_ptr, attn_norm_weight_ptr, ffn_norm_weight_ptr, - # Dimensions and constants batch_size, seq_len, d_model, n_heads, d_ff, head_dim, scale, norm_eps, - # Advanced block sizing BLOCK_SIZE_B, BLOCK_SIZE_S, BLOCK_SIZE_D, BLOCK_SIZE_H, ): ``` -**Architecture Analysis Questions:** - -1. **Fusion Scope**: What operations are fused together in this single kernel? -2. **Memory Efficiency**: How does this kernel minimize memory traffic compared to Version 3? -3. **Computational Overlap**: How are different computations overlapped for efficiency? -4. **Register Usage**: How is register pressure managed with so many operations? - -#### Step 2: Analyze the Computation Flow - -Follow the ultra-fused execution pattern: - -```python -# Store original input for residual -residual_1 = x_token - -# === ATTENTION LAYER NORM === -variance = tl.sum(x_token * x_token) / d_model -inv_std = 1.0 / tl.sqrt(variance + norm_eps) -x_normed = x_token * inv_std * attn_norm_weights - -# === ULTRA-FUSED ATTENTION === -# Parallel QKV computation... - -# === FIRST RESIDUAL CONNECTION === -x_token = residual_1 + attn_output -residual_2 = x_token - -# === FFN LAYER NORM === -# === ULTRA-FUSED SWIGLU FFN === -# === FINAL RESIDUAL CONNECTION === -``` - -**Flow Analysis Tasks:** - -1. **Data Dependencies**: Map out all data dependencies in the computation -2. **Memory Reuse**: Identify opportunities for register and shared memory reuse -3. **Parallelization**: Analyze how different operations can be parallelized -4. 
**Critical Path**: Identify the critical path through the computation - -#### Step 3: Compare with Previous Versions - -Create a comparison table of kernel launches: - -| Operation | Version 1 | Version 2 | Version 3 | Version 4 | -|-----------|-----------|-----------|-----------|-----------| -| Input Layer Norm | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| Q Projection | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| K Projection | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| V Projection | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| Attention Compute | Multiple | Fused | 1 kernel | **Fused** | -| Output Projection | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| Residual Add | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| FFN Layer Norm | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| Gate Projection | 1 kernel | Fused | 1 kernel | **Fused** | -| Up Projection | 1 kernel | Fused | 1 kernel | **Fused** | -| SiLU Activation | 1 kernel | Fused | 1 kernel | **Fused** | -| Down Projection | 1 kernel | 1 kernel | 1 kernel | **Fused** | -| Final Residual | 1 kernel | 1 kernel | 1 kernel | **Fused** | +**Analysis Questions:** +1. What operations are fused in this single kernel? +2. How does this minimize memory traffic vs Version 3? +3. How is register pressure managed? + +### Part B: Kernel Launch Comparison + +| Operation | V1 | V2 | V3 | V4 | +|-----------|----|----|----|----| +| Input Layer Norm | 1 | 1 | 1 | **Fused** | +| QKV Projections | 3 | 3 | 3 | **Fused** | +| Attention Compute | Multi | Fused | 1 | **Fused** | +| Output Projection | 1 | 1 | 1 | **Fused** | +| FFN (Gate/Up/Down) | 3 | Fused | 3 | **Fused** | +| Residual Adds | 2 | 2 | 2 | **Fused** | | **Total Kernels** | **~12** | **~8** | **~4** | **1** | -**Performance Implications:** - -1. **Launch Overhead**: Calculate the kernel launch overhead savings -2. **Memory Bandwidth**: Estimate memory bandwidth reduction -3. 
**Cache Efficiency**: Analyze L1/L2 cache utilization improvements - -### Part B: Advanced Memory Management Analysis (35 minutes) - -#### Step 4: Memory Hierarchy Optimization - -Analyze how the ultra-fused kernel optimizes memory usage: - -```python -def analyze_memory_hierarchy(): - """Analyze memory usage patterns in ultra-fused kernel.""" - - # Model configuration - batch_size, seq_len, d_model = 4, 512, 2048 - n_heads = 32 - head_dim = d_model // n_heads - d_ff = int(2.67 * d_model) - - print("Ultra-Fused Memory Hierarchy Analysis") - print("=" * 45) - - # Register usage analysis - registers_per_token = ( - d_model + # Input token - d_model + # Residual 1 - d_model + # Normed input - n_heads * head_dim + # Q projections - n_heads * head_dim + # K projections - n_heads * head_dim + # V projections - d_model + # Attention output - d_model + # Residual 2 - d_ff + # FFN intermediate - d_model # Final output - ) - - print(f"Estimated register usage per token: {registers_per_token}") - print(f"Register pressure: {registers_per_token * 4 / 1024:.1f} KB per token") - - # Global memory access patterns - input_reads = batch_size * seq_len * d_model * 4 # Read input once - weight_reads = ( - # Attention weights (read once per token) - 4 * d_model * d_model * 4 + # Q, K, V, O weights - # FFN weights (read once per token) - 3 * d_model * d_ff * 4 + # Gate, Up, Down weights - # Norm weights (read once per token) - 2 * d_model * 4 # Attention + FFN norms - ) * batch_size * seq_len - - output_writes = batch_size * seq_len * d_model * 4 # Write output once - - total_memory_traffic = input_reads + weight_reads + output_writes - - print(f"\nMemory Traffic Analysis:") - print(f" Input reads: {input_reads / 1e6:.2f} MB") - print(f" Weight reads: {weight_reads / 1e6:.2f} MB") - print(f" Output writes: {output_writes / 1e6:.2f} MB") - print(f" Total: {total_memory_traffic / 1e6:.2f} MB") +### Part C: Roofline Analysis - # Compare with previous versions - version3_memory = ( - input_reads * 4 + # Read input 4 times (each kernel) - weight_reads * 1.5 + # Some weight reuse - output_writes * 4 # Multiple intermediate writes - ) +For batch_size=4, seq_len=512, d_model=2048: +- Calculate total FLOPs (attention + FFN + norms) +- Calculate total memory traffic (input + weights + output) +- Compute arithmetic intensity (FLOPs/byte) +- Determine if compute-bound or memory-bound - memory_reduction = (version3_memory - total_memory_traffic) / version3_memory - print(f"\nMemory traffic reduction vs Version 3: {memory_reduction * 100:.1f}%") +### Results Template - return { - 'register_usage': registers_per_token, - 'total_memory_mb': total_memory_traffic / 1e6, - 'memory_reduction': memory_reduction - } +| Metric | Value | +|--------|-------| +| Register usage per token | | +| Memory traffic reduction | % | +| Arithmetic intensity | FLOPs/byte | +| Performance bottleneck | (compute/memory) | +| Kernel count reduction | x | -# Run memory analysis -memory_analysis = analyze_memory_hierarchy() -``` - -#### Step 5: Cache Optimization Strategies - -Examine cache optimization techniques: - -```python -def analyze_cache_optimization(): - """Analyze cache optimization in ultra-fused kernels.""" - - print("\nCache Optimization Analysis") - print("=" * 35) - - # L1 cache utilization - l1_cache_size = 128 * 1024 # 128KB typical L1 cache - l2_cache_size = 8 * 1024 * 1024 # 8MB typical L2 cache - - # Data reuse analysis - d_model = 2048 - seq_len = 512 - - # Input token reuse - input_reuse_factor = 4 # Used in norm, Q, K, V 
projections - print(f"Input data reuse factor: {input_reuse_factor}x") - - # Weight reuse patterns - attention_weight_reuse = seq_len # Each weight used for all tokens - ffn_weight_reuse = seq_len # FFN weights reused across sequence - - print(f"Attention weight reuse: {attention_weight_reuse}x") - print(f"FFN weight reuse: {ffn_weight_reuse}x") - - # Cache hit rate estimation - working_set_size = d_model * 4 * 4 # Input + weights for one token - l1_hit_rate = min(1.0, l1_cache_size / working_set_size) - - print(f"Estimated L1 cache hit rate: {l1_hit_rate * 100:.1f}%") - - # Temporal locality analysis - temporal_locality_score = ( - input_reuse_factor + - attention_weight_reuse / seq_len + - ffn_weight_reuse / seq_len - ) / 3 - - print(f"Temporal locality score: {temporal_locality_score:.2f}") - - return { - 'l1_hit_rate': l1_hit_rate, - 'temporal_locality': temporal_locality_score, - 'working_set_mb': working_set_size / 1e6 - } - -# Run cache analysis -cache_analysis = analyze_cache_optimization() -``` - -#### Step 6: Register Pressure Management - -Analyze register usage optimization: - -```python -def analyze_register_pressure(): - """Analyze register pressure and management strategies.""" - - print("\nRegister Pressure Analysis") - print("=" * 30) - - # GPU specifications (example for MI250X) - registers_per_cu = 65536 # 64K registers per CU - max_threads_per_cu = 2048 - registers_per_thread_max = registers_per_cu // max_threads_per_cu - - print(f"Max registers per thread: {registers_per_thread_max}") - - # Estimate register usage in ultra-fused kernel - d_model = 2048 - n_heads = 32 - head_dim = d_model // n_heads - - registers_needed = ( - d_model // 4 + # Input token (float4 packing) - d_model // 4 + # Residual storage - n_heads + # Attention accumulators - head_dim + # Head computation temp - 64 + # Loop counters, indices, etc. - 32 # Compiler temporaries - ) - - print(f"Estimated registers needed: {registers_needed}") - print(f"Register utilization: {registers_needed / registers_per_thread_max * 100:.1f}%") - - # Occupancy impact - max_threads_with_registers = registers_per_cu // registers_needed - occupancy = min(max_threads_with_registers / max_threads_per_cu, 1.0) - - print(f"Theoretical occupancy: {occupancy * 100:.1f}%") - - # Register optimization strategies - print(f"\nOptimization Strategies:") - print(f"1. Float4 vectorization reduces registers by 4x") - print(f"2. Loop unrolling vs register pressure trade-off") - print(f"3. Shared memory for intermediate results") - print(f"4. 
Careful compiler hint placement") - - return { - 'registers_needed': registers_needed, - 'occupancy': occupancy, - 'utilization_percent': registers_needed / registers_per_thread_max * 100 - } - -# Run register analysis -register_analysis = analyze_register_pressure() -``` - -### Part C: Performance Engineering Deep Dive (25 minutes) - -#### Step 7: Roofline Model Analysis - -Apply roofline analysis to ultra-fused kernels: - -```python -def roofline_analysis(): - """Perform roofline model analysis for ultra-fused kernel.""" - - print("\nRoofline Model Analysis") - print("=" * 25) - - # Problem size - batch_size, seq_len, d_model = 4, 512, 2048 - n_heads = 32 - d_ff = int(2.67 * d_model) - - # Calculate FLOPs for entire transformer block - # Attention FLOPs - qkv_flops = 3 * batch_size * seq_len * d_model * d_model * 2 # Q, K, V projections - attn_flops = batch_size * n_heads * seq_len * seq_len * d_model // n_heads * 2 # Attention matrix - o_proj_flops = batch_size * seq_len * d_model * d_model * 2 # Output projection - - attention_total_flops = qkv_flops + attn_flops + o_proj_flops - - # FFN FLOPs - gate_up_flops = 2 * batch_size * seq_len * d_model * d_ff * 2 # Gate + Up projections - silu_flops = batch_size * seq_len * d_ff * 4 # SiLU activation (~4 ops) - down_flops = batch_size * seq_len * d_ff * d_model * 2 # Down projection - - ffn_total_flops = gate_up_flops + silu_flops + down_flops - - # Layer norm FLOPs (2 layer norms) - norm_flops = 2 * batch_size * seq_len * d_model * 8 # Variance + normalization - - total_flops = attention_total_flops + ffn_total_flops + norm_flops - - # Memory traffic (ultra-optimized) - input_bytes = batch_size * seq_len * d_model * 4 - weight_bytes = (4 * d_model * d_model + 3 * d_model * d_ff + 2 * d_model) * 4 - output_bytes = batch_size * seq_len * d_model * 4 - - total_bytes = input_bytes + weight_bytes + output_bytes - - # Arithmetic intensity - arithmetic_intensity = total_flops / total_bytes - - print(f"Problem size: {batch_size}x{seq_len}x{d_model}") - print(f"Total FLOPs: {total_flops / 1e9:.2f} GFLOPs") - print(f"Total memory: {total_bytes / 1e6:.2f} MB") - print(f"Arithmetic intensity: {arithmetic_intensity:.2f} FLOPs/byte") - - # GPU specifications (MI250X example) - peak_flops = 47.9e12 # 47.9 TFLOPS FP32 - peak_bandwidth = 1638e9 # 1.638 TB/s - - # Roofline analysis - compute_bound_threshold = peak_flops / peak_bandwidth - - print(f"\nGPU Specifications:") - print(f"Peak compute: {peak_flops / 1e12:.1f} TFLOPS") - print(f"Peak bandwidth: {peak_bandwidth / 1e9:.0f} GB/s") - print(f"Compute-bound threshold: {compute_bound_threshold:.2f} FLOPs/byte") - - if arithmetic_intensity > compute_bound_threshold: - print(f"PASS Kernel is compute-bound (good for GPU utilization)") - bottleneck = "compute" - theoretical_performance = peak_flops - else: - print(f"WARNING: Kernel is memory-bound (optimize memory access)") - bottleneck = "memory" - theoretical_performance = arithmetic_intensity * peak_bandwidth - - # Performance potential - performance_potential = theoretical_performance / 1e12 - - print(f"Theoretical peak performance: {performance_potential:.1f} TFLOPS") - - return { - 'arithmetic_intensity': arithmetic_intensity, - 'bottleneck': bottleneck, - 'performance_potential_tflops': performance_potential, - 'compute_bound': arithmetic_intensity > compute_bound_threshold - } - -# Run roofline analysis -roofline_results = roofline_analysis() -``` - -#### Step 8: Performance Prediction Model - -Create a performance prediction model: - -```python -def 
performance_prediction_model(): - """Create performance prediction model for different configurations.""" - - print("\nPerformance Prediction Model") - print("=" * 32) - - # Base performance characteristics - base_config = { - 'batch_size': 4, - 'seq_len': 512, - 'd_model': 2048, - 'measured_time_ms': 15.0 # Example measured time - } - - def predict_performance(batch_size, seq_len, d_model): - """Predict performance for given configuration.""" - - # Scaling factors based on algorithmic complexity - batch_scale = batch_size / base_config['batch_size'] - seq_scale = (seq_len / base_config['seq_len']) ** 1.8 # Slightly sub-quadratic due to optimizations - model_scale = (d_model / base_config['d_model']) ** 2.5 # Between O(n^2) and O(n^3) - - # Memory bandwidth limiting factor - memory_factor = max(1.0, (batch_size * seq_len * d_model) / (4 * 512 * 2048) * 0.8) - - predicted_time = ( - base_config['measured_time_ms'] * - batch_scale * seq_scale * model_scale * memory_factor - ) - - return predicted_time - - # Test predictions - test_configs = [ - (1, 128, 1024), - (2, 256, 1536), - (4, 512, 2048), - (8, 512, 2048), - (4, 1024, 2048), - (4, 512, 4096) - ] - - print("Performance Predictions:") - print("| Batch | Seq Len | Model Dim | Predicted Time (ms) | Throughput (tokens/s) |") - print("|-------|---------|-----------|--------------------|-----------------------|") - - for batch_size, seq_len, d_model in test_configs: - predicted_time = predict_performance(batch_size, seq_len, d_model) - throughput = batch_size * seq_len / (predicted_time / 1000) - - print(f"| {batch_size:5d} | {seq_len:7d} | {d_model:9d} | {predicted_time:18.2f} | {throughput:21.0f} |") - - return test_configs - -# Run performance predictions -performance_predictions = performance_prediction_model() -``` - -### Exercise Results - -#### Ultra-Fusion Analysis Summary - -Fill in your analysis results: - -**Memory Efficiency:** - -- Register usage per token: _____ -- Memory traffic reduction: _____% -- L1 cache hit rate: _____% - -**Performance Characteristics:** - -- Arithmetic intensity: _____ FLOPs/byte -- Performance bottleneck: _____ (compute/memory) -- Theoretical peak: _____ TFLOPS - -**Optimization Impact:** - -- Kernel count reduction: _____x -- Memory bandwidth savings: _____% -- Register utilization: _____% - -#### Key Insights +### Key Insights 1. **Most Critical Optimization**: _____ -2. **Biggest Performance Bottleneck**: _____ -3. **Next Optimization Opportunity**: _____ -4. **Scalability Limitations**: _____ +2. **Biggest Bottleneck**: _____ +3. **Scalability Limitation**: _____ ### Discussion Questions -1. **Ultra-Fusion Trade-offs**: What are the main trade-offs of ultra-fusion (complexity, maintainability, portability)? - -2. **Hardware Dependencies**: How do ultra-fused kernels depend on specific GPU architectures? - -3. **Optimization Limits**: What are the theoretical limits of kernel fusion optimization? - -4. **Development Complexity**: How does ultra-fusion impact development time and debugging complexity? - -5. **Future Directions**: What future GPU architecture features would enable even better ultra-fusion? - -### Advanced Challenges - -#### Challenge 1: Register Optimization -Redesign a portion of the ultra-fused kernel to reduce register pressure while maintaining performance. - -#### Challenge 2: Memory Pattern Analysis -Implement a tool to visualize memory access patterns in the ultra-fused kernel. 
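+For Part C, the roofline quantities can be estimated with a short script. The sketch below uses standard dense-matmul FLOP counts and fp32 sizes (layer norms are omitted as negligible); the peak compute and bandwidth figures are placeholders to replace with the specifications of the GPU you are profiling.
+
+```python
+# Sketch for Part C: rough roofline estimate for one transformer block.
+# peak_flops and peak_bw are placeholders -- substitute your GPU's specifications.
+batch, seq, d_model, n_heads = 4, 512, 2048, 32
+d_ff = int(2.67 * d_model)
+tokens = batch * seq
+head_dim = d_model // n_heads
+
+proj_flops = 2 * tokens * d_model * d_model * 4              # Q, K, V, O projections
+attn_flops = 2 * 2 * batch * n_heads * seq * seq * head_dim  # QK^T and attention*V
+ffn_flops = 2 * tokens * d_model * d_ff * 3                  # gate, up, down projections
+flops = proj_flops + attn_flops + ffn_flops
+
+bytes_moved = 4 * (2 * tokens * d_model        # fp32 input + output activations
+                   + 4 * d_model * d_model     # attention weights
+                   + 3 * d_model * d_ff)       # FFN weights
+
+intensity = flops / bytes_moved
+peak_flops, peak_bw = 47.9e12, 1.6e12          # placeholder FP32 peak and memory bandwidth
+ridge = peak_flops / peak_bw
+bound = "compute" if intensity > ridge else "memory"
+print(f"Arithmetic intensity: {intensity:.1f} FLOPs/byte ({bound}-bound, ridge = {ridge:.1f})")
+```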
- -#### Challenge 3: Performance Modeling -Create a detailed performance model that predicts ultra-fused kernel performance across different GPU architectures. - -#### Challenge 4: Debugging Framework -Design a debugging framework for ultra-fused kernels that can isolate performance issues. - -### Next Steps - -This exercise completes your understanding of ultra-fusion techniques. In Exercise 2, you'll: - -- Compare all four versions comprehensively -- Analyze performance scaling characteristics -- Create optimization decision frameworks -- Design production deployment strategies - -### Additional Resources +1. What are the trade-offs of ultra-fusion (complexity, maintainability, portability)? +2. How do ultra-fused kernels depend on specific GPU architectures? +3. What are the theoretical limits of kernel fusion? -- [Advanced GPU Programming Patterns](https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing) -- [Memory Optimization Techniques](https://rocmdocs.amd.com/en/latest/Programming_Guides/Performance_optimization.html) -- [Roofline Model Deep Dive](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/) -- [Register Pressure Analysis](https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/) +### Resources +- [AMD Performance Optimization Guide](https://rocmdocs.amd.com/en/latest/Programming_Guides/Performance_optimization.html) +- [Roofline Model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/) diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_counters.sh b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_counters.sh new file mode 100644 index 00000000..35e914d7 --- /dev/null +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_counters.sh @@ -0,0 +1,78 @@ +#!/bin/bash +# Script to profile TinyTransformer V4 with rocprofv3 kernel trace +# This captures kernel execution metrics for performance analysis +# +# Supports both ROCm 6.x (CSV output) and ROCm 7.x (SQLite database output) + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./counters/counter_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 kernel trace collection for TinyTransformer V4..." +echo "Output directory: $OUTPUT_DIR" + +# Run with rocprofv3 to collect kernel trace +rocprofv3 \ + --kernel-trace \ + --output-directory "$OUTPUT_DIR" \ + -- python tiny_llama_v4.py \ + --batch-size 8 \ + --seq-len 128 \ + --num-steps 10 + +echo "" +echo "Profiling complete! 
Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR"/*/ 2>/dev/null || ls -lh "$OUTPUT_DIR" +echo "" + +# Analyze results based on ROCm version +echo "To analyze results:" +DB_FILE=$(find "$OUTPUT_DIR" -name "*_results.db" 2>/dev/null | head -1) +if [ -n "$DB_FILE" ]; then + echo " Database file: $DB_FILE" + echo "" + echo " Export to CSV:" + echo " rocpd2csv -i $DB_FILE -o kernel_stats.csv" + echo "" + echo " Get kernel summary:" + echo " rocpd summary -i $DB_FILE --region-categories KERNEL" +else + echo " Check $OUTPUT_DIR for output files" +fi diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_hotspots.sh b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_hotspots.sh new file mode 100755 index 00000000..6f32acb5 --- /dev/null +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_hotspots.sh @@ -0,0 +1,55 @@ +#!/bin/bash +# +# Get hotspots analysis using rocprofv3 +# Compatible with ROCm 6.x and 7.x +# + +set -e + +echo "==========================================" +echo "rocprofv3 Hotspots Analysis - TinyTransformer V4" +echo "==========================================" +echo "" + +OUTPUT_DIR="./hotspots/hotspot_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" +echo "Running: rocprofv3 --stats -- python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprofv3 --stats -- python ../../tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] Hotspot analysis completed" +else + echo "[FAILED] Hotspot analysis failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls +echo "" + +# Check for stats/CSV files +if ls *.csv 1> /dev/null 2>&1; then + echo "Statistics files found:" + for f in *.csv; do + echo "" + echo "File: $f" + echo "Top 10 entries:" + head -11 "$f" + done +else + echo "Looking for statistics in subdirectories:" + find . -name "*.csv" -exec echo "Found: {}" \; -exec head -11 {} \; +fi +echo "" + +echo "Hotspot analysis identifies GPU kernels with highest time consumption." +echo "" diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_rocprof_compute.sh b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_rocprof_compute.sh new file mode 100755 index 00000000..2d6e2433 --- /dev/null +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_rocprof_compute.sh @@ -0,0 +1,48 @@ +#!/bin/bash +# +# Get detailed GPU metrics using rocprof-compute +# Compatible with ROCm 6.x and 7.x +# +# Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) +# for full hardware counter support. Consumer GPUs may have limited counter availability. + +set -e + +echo "==========================================" +echo "rocprof-compute Profiling - TinyTransformer V4" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_compute/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-compute to collect detailed GPU metrics +WORKLOAD_NAME="tiny_llama_v4_$(date +%Y%m%d_%H%M%S)" +echo "Running: rocprof-compute profile --name $WORKLOAD_NAME -d $OUTPUT_DIR -- python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +rocprof-compute profile --name "$WORKLOAD_NAME" -d "$OUTPUT_DIR" -- python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? 
+ +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-compute profiling completed" +else + echo "[FAILED] rocprof-compute profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find "$OUTPUT_DIR" -type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " rocprof-compute analyze -p $OUTPUT_DIR/workloads/$WORKLOAD_NAME/rocprof --dispatch -n tiny_llama_dispatch" +echo "" +echo "For available analysis options:" +echo " rocprof-compute analyze --help" +echo "" diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_rocprof_sys.sh b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_rocprof_sys.sh new file mode 100755 index 00000000..bace77df --- /dev/null +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_rocprof_sys.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# +# Get system-level profiling using rocprof-sys +# Compatible with ROCm 6.x and 7.x +# +# NOTE: rocprof-sys may produce memory map dumps in some configurations. +# Issue reference: TBD +# + +set -e + +echo "==========================================" +echo "rocprof-sys Profiling - TinyTransformer V4" +echo "==========================================" +echo "" + +OUTPUT_DIR="./rocprof_sys/profile_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Run with rocprof-sys to collect system-level traces +echo "Running: rocprof-sys-run --profile --trace -- python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10" +echo "" + +cd "$OUTPUT_DIR" +rocprof-sys-run --profile --trace -- python ../../tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10 +ROCPROF_EXIT=$? + +echo "" +if [ $ROCPROF_EXIT -eq 0 ]; then + echo "[SUCCESS] rocprof-sys profiling completed" +else + echo "[FAILED] rocprof-sys profiling failed with exit code $ROCPROF_EXIT" + exit 1 +fi +echo "" + +echo "Generated files:" +find . -type f -ls | head -20 +echo "" + +echo "To analyze results:" +echo " Open the .proto file in Perfetto UI: https://ui.perfetto.dev/" +echo "" diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_trace.sh b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_trace.sh new file mode 100644 index 00000000..e8607fa5 --- /dev/null +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/get_trace.sh @@ -0,0 +1,86 @@ +#!/bin/bash +# Script to profile TinyTransformer V4 with rocprofv3 runtime trace +# This captures GPU API calls, kernel launches, and memory operations +# +# Compatible with ROCm 6.x and 7.x + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. 
-f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="./traces/trace_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 runtime trace profiling for TinyTransformer V4..." +echo "Output directory: $OUTPUT_DIR" + +# Build rocprofv3 command with appropriate flags for ROCm version +# ROCm 6.4+ and 7.x require explicit --output-format pftrace to generate Perfetto traces +if [ "$ROCM_MAJOR" = "7" ] || [ "$ROCM_MAJOR" = "6" ]; then + echo "Using ROCm 6.x/7.x: --output-format pftrace (generates Perfetto trace)" + OUTPUT_FORMAT="--output-format pftrace" +else + echo "Using ROCm 5.x or older: default format" + OUTPUT_FORMAT="" +fi + +echo "" +echo "Collecting full runtime trace (HIP/HSA API calls, kernels, memory operations)" +echo "" + +# Run with rocprofv3 to collect full runtime trace +cd "$OUTPUT_DIR" +rocprofv3 \ + --runtime-trace \ + $OUTPUT_FORMAT \ + -- python ../../tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10 + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh ./*/ 2>/dev/null || ls -lh . +echo "" + +# Find and report pftrace files +PFTRACE=$(find . -name "*.pftrace" -size +1k 2>/dev/null | head -1) +if [ -n "$PFTRACE" ]; then + echo "Perfetto trace file: $PFTRACE" + echo "Size: $(ls -lh "$PFTRACE" | awk '{print $5}')" + echo "" + echo "To view the trace:" + echo " 1. Visit: https://ui.perfetto.dev/" + echo " 2. Open: $PFTRACE" +fi diff --git a/MLExamples/TinyTransformer/version4_pytorch_sdpa/test_rocpd.sh b/MLExamples/TinyTransformer/version4_pytorch_sdpa/test_rocpd.sh new file mode 100755 index 00000000..2d864165 --- /dev/null +++ b/MLExamples/TinyTransformer/version4_pytorch_sdpa/test_rocpd.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# +# Test rocpd (ROCm Profiling Daemon) for continuous profiling +# + +set -e + +echo "==========================================" +echo "rocpd Test - Version 4" +echo "==========================================" +echo "" + +# Check if rocpd is available +if ! command -v rocpd &> /dev/null; then + echo "[ERROR] rocpd not found in PATH" + echo "rocpd may not be installed or available in this ROCm version" + exit 1 +fi + +echo "rocpd location: $(which rocpd)" +echo "" + +OUTPUT_DIR="./rocpd/rocpd_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Output directory: $OUTPUT_DIR" +echo "" + +# Start rocpd in background +echo "Starting rocpd daemon..." +rocpd --output-dir "$OUTPUT_DIR" & +ROCPD_PID=$! +echo "rocpd running with PID: $ROCPD_PID" +echo "" + +# Give rocpd time to initialize +sleep 2 + +# Run workload +echo "Running workload: python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10" +python tiny_llama_v4.py --batch-size 8 --seq-len 128 --num-steps 10 +WORKLOAD_EXIT=$? +echo "" + +# Stop rocpd +echo "Stopping rocpd daemon..." 
+kill $ROCPD_PID 2>/dev/null || true +wait $ROCPD_PID 2>/dev/null || true +echo "" + +if [ $WORKLOAD_EXIT -eq 0 ]; then + echo "[SUCCESS] Workload completed" +else + echo "[FAILED] Workload failed with exit code $WORKLOAD_EXIT" +fi +echo "" + +echo "Generated files in $OUTPUT_DIR:" +ls -lh "$OUTPUT_DIR" +echo "" + +echo "rocpd output is a SQLite3 database file" +echo "" +echo "To view the database:" +echo " - Use VS Code SQLite Viewer extension" +echo " - rocprof-compute and rocprof-systems can consume it directly" +echo " - No official CLI tool is provided for viewing" +echo "" +echo "rocpd provides continuous profiling with minimal overhead" +echo "" diff --git a/MLExamples/inference_benchmark/README.md b/MLExamples/inference_benchmark/README.md deleted file mode 100644 index fa52322e..00000000 --- a/MLExamples/inference_benchmark/README.md +++ /dev/null @@ -1,77 +0,0 @@ -# pytorch-micro-benchmarking -We supply a small microbenchmarking script for PyTorch training on ROCm. - -To execute: -`python micro_benchmarking_pytorch.py --network [--batch-size ] [--iterations ] [--fp16 <0 or 1> ] [--distributed_dataparallel] [--device_ids ] ` - -Possible network names are: `alexnet`, `densenet121`, `inception_v3`, `resnet50`, `resnet101`, `SqueezeNet`, `vgg16` etc. - -Default are 10 training iterations, `fp16` off (i.e., 0), and a batch size of 64. - -For mGPU runs, use one of the following methods. -- `torchrun`: It will spawn multiple sub-processes for each of the GPUs and adjust `world_size` and `rank` accordingly. `torchrun` also defaults to using distributed dataparallel. -- `--distributed_dataparallel`: Uses torch.nn.parallel.DistributedDataParallel to run multiple processes/node. However, the script only launches one process per GPU, multiple processes need to be launched manually. See example below. - -_NOTE_: `--distributed_dataparallel` option will be deprecated in the future as this path can be exercised now with `torchrun`. -_NOTE_: If comparing `--distributed_dataprallel` performance with `torchrun` one, you need to multiply the `--batch-size` with number of nodes in the `torchrun` command. `torchrun` will split the batch size into mini batches that run on each of the nodes. `--distributed_dataparallel` doesn't do that automatically, it run with whatever the user provides. - -Examples: -- for a 1-GPU resnet50 run: -``` -python3 micro_benchmarking_pytorch.py --network resnet50 -``` - -- for a 2-GPU run on a single node using `torchrun`: -``` -torchrun --nproc-per-node 2 micro_benchmarking_pytorch.py --network resnet50 --batch-size 128 - -``` - -- for a 2-GPU run on a single node using `--distributed_dataparallel`: -``` -python3 micro_benchmarking_pytorch.py --device_ids=0 --network resnet50 --distributed_dataparallel --rank 0 --world-size 2 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 --batch-size 64 & -python3 micro_benchmarking_pytorch.py --device_ids=1 --network resnet50 --distributed_dataparallel --rank 1 --world-size 2 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 --batch-size 64 & -``` - - -To run FlopsProfiler (with deepspeed.profiling.flops_profiler imported): -`python micro_benchmarking_pytorch.py --network resnet50 --amp-opt-level=2 --batch-size=256 --iterations=20 --flops-prof-step 10` - -## Performance tuning -If performance on a specific card and/or model is found to be lacking, typically some gains can be made by tuning MIOpen. For this, `export MIOPEN_FIND_ENFORCE=3` prior to running the model. 
This will take some time if untuned configurations are encountered and write to a local performance database. More information on this can be found in the [MIOpen documentation](https://rocm.github.io/MIOpen/doc/html/perfdatabase.html). - -## PyTorch 2.0 -Added the `--compile` option opens up PyTorch 2.0 capabilities, which comes with several options. Here are some notes from upstream: -``` - Optimizes given model/function using TorchDynamo and specified backend. - - Args: - model (Callable): Module/function to optimize - fullgraph (bool): Whether it is ok to break model into several subgraphs - dynamic (bool): Use dynamic shape tracing - backend (str or Callable): backend to be used - mode (str): Can be either "default", "reduce-overhead" or "max-autotune" - options (dict): A dictionary of options to pass to the backend. - disable (bool): Turn torch.compile() into a no-op for testing - - Example:: - - @torch.compile(options={"matmul-padding": True}, fullgraph=True) - def foo(x): - return torch.sin(x) + torch.cos(x) -``` - -With the required `--compile` option, these additional options are now available from the command line with the `--compileContext` flag. Here are a few examples: - -```bash -python micro_benchmarking_pytorch.py --network resnet50 --compile # default run -``` - -```bash -python micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'mode': 'max-autotune', 'fullgraph': 'True'}" -``` - -```bash -python micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'options': {'static-memory': 'True', 'matmul-padding': 'True'}}" -``` -Note: you cannot pass the `mode` and `options` options together. diff --git a/MLExamples/inference_benchmark/ATTRIBUTION.md b/MLExamples/pytorch_microbench/ATTRIBUTION.md similarity index 100% rename from MLExamples/inference_benchmark/ATTRIBUTION.md rename to MLExamples/pytorch_microbench/ATTRIBUTION.md diff --git a/MLExamples/inference_benchmark/INFERENCE_BENCHMARK_NOTES.md b/MLExamples/pytorch_microbench/INFERENCE_BENCHMARK_NOTES.md similarity index 100% rename from MLExamples/inference_benchmark/INFERENCE_BENCHMARK_NOTES.md rename to MLExamples/pytorch_microbench/INFERENCE_BENCHMARK_NOTES.md diff --git a/MLExamples/inference_benchmark/INFERENCE_BENCHMARK_WORKSHOP_WALKTHROUGH.md b/MLExamples/pytorch_microbench/INFERENCE_BENCHMARK_WORKSHOP_WALKTHROUGH.md similarity index 100% rename from MLExamples/inference_benchmark/INFERENCE_BENCHMARK_WORKSHOP_WALKTHROUGH.md rename to MLExamples/pytorch_microbench/INFERENCE_BENCHMARK_WORKSHOP_WALKTHROUGH.md diff --git a/MLExamples/pytorch_microbench/PROFILING_SCRIPTS.md b/MLExamples/pytorch_microbench/PROFILING_SCRIPTS.md new file mode 100644 index 00000000..5b539b2b --- /dev/null +++ b/MLExamples/pytorch_microbench/PROFILING_SCRIPTS.md @@ -0,0 +1,273 @@ +# Profiling Scripts for inference_benchmark + +This directory contains profiling scripts for analyzing the performance of PyTorch inference benchmarks using various ROCm profiling tools. + +**Compatible with ROCm 6.x and 7.x** - Scripts automatically detect ROCm version and handle different output formats. + +## Overview + +All scripts are configured to profile **ResNet50** with: +- Batch size: 64 +- Iterations: 10 + +The scripts use the standard command: +```bash +python micro_benchmarking_pytorch.py --network resnet50 --batch-size 64 --iterations 10 +``` + +## Available Profiling Scripts + +### 1. 
get_counters.sh - rocprofv3 Kernel Trace with Hardware Counters + +**Purpose:** Captures detailed GPU hardware metrics and kernel execution statistics + +**Features:** +- Automatically detects ROCm version (6.x or 7.x) +- Collects hardware counter data for all GPU kernels +- Automatic analysis with appropriate tool: + - ROCm 6.x: `analyze_kernel_trace.py` (CSV format) + - ROCm 7.x: `analyze_rocpd_db.py` (SQLite database) +- Shows kernel execution statistics and performance hotspots +- Identifies top time-consuming kernels + +**Output:** +- `profiling_results/counters_/` directory +- ROCm 6.x: `kernel_trace.csv` with detailed kernel metrics +- ROCm 7.x: `*_results.db` SQLite database with comprehensive profiling data +- Automated analysis summary showing: + - Kernel execution counts + - Total/average/min/max durations + - Percentage of total GPU time + +**Usage:** +```bash +./get_counters.sh +``` + +**When to use:** +- Identify performance bottlenecks at the kernel level +- Understand which GPU operations consume the most time +- Analyze kernel execution patterns and frequencies + +--- + +### 2. get_trace.sh - rocprofv3 Runtime Trace + +**Purpose:** Captures GPU API calls, kernel launches, and memory operations + +**Features:** +- Records HIP/HSA API calls +- Traces kernel launches and execution +- Captures memory operations (allocations, transfers) +- Generates Perfetto trace format (.pftrace) for visualization + +**Output:** +- `profiling_results/trace_/` directory +- `.pftrace` file for interactive timeline visualization + +**Visualization:** +Open the `.pftrace` file at [https://ui.perfetto.dev/](https://ui.perfetto.dev/) + +**Usage:** +```bash +./get_trace.sh +``` + +**When to use:** +- Visualize timeline of GPU operations +- Analyze CPU-GPU synchronization +- Identify memory transfer bottlenecks +- Understand overall execution flow + +--- + +### 3. get_rocprof_sys.sh - System-Level Profiling + +**Purpose:** System-level profiling with call stack sampling + +**Features:** +- Call stack sampling for CPU and GPU code +- System-level performance analysis +- Captures both application and runtime behavior + +**Output:** +- `profiling_results/rocprof_sys_/` directory +- System-level profiling data + +**Known Issues:** +⚠️ **Note:** rocprof-sys may produce memory map dumps in some configurations. This is a known issue tracked in GitHub issue #1406. If profiling fails or produces excessive output, consider using `get_trace.sh` (rocprofv3) or `get_rocprof_compute.sh` instead. + +**Usage:** +```bash +./get_rocprof_sys.sh +``` + +**Analysis:** +```bash +rocprof-sys-avail --help +rocprof-sys-analyze --help +``` + +**When to use:** +- System-level performance analysis +- Call stack profiling +- When kernel-level profiling is insufficient + +--- + +### 4. 
get_rocprof_compute.sh - Detailed GPU Metrics + +**Purpose:** Comprehensive compute performance analysis with detailed hardware metrics + +**Features:** +- Detailed GPU hardware counter collection +- Compute performance analysis +- Unique workload names with timestamps +- Comprehensive metric coverage + +**Output:** +- `profiling_results/rocprof_compute_/` directory +- Workload-specific performance data + +**Usage:** +```bash +./get_rocprof_compute.sh +``` + +**Analysis:** +```bash +rocprof-compute analyze --help +rocprof-compute analyze --workload-dir profiling_results/rocprof_compute_ +``` + +**When to use:** +- Detailed hardware performance analysis +- Compute utilization metrics +- Memory bandwidth and cache analysis +- Advanced performance tuning + +--- + +## Workflow Recommendations + +### Quick Performance Check +1. Start with `get_counters.sh` to identify top kernels +2. Review the automated analysis for hotspots + +### Detailed Analysis +1. Run `get_trace.sh` to visualize execution timeline +2. Open `.pftrace` in Perfetto UI to analyze CPU-GPU interaction +3. Run `get_rocprof_compute.sh` for detailed hardware metrics + +### Advanced Tuning +1. Use `get_rocprof_compute.sh` for comprehensive metrics +2. Analyze specific hardware counters +3. Iterate on optimizations and re-profile + +--- + +## Output Directory Structure + +All scripts create timestamped output directories: +``` +profiling_results/ +├── counters_YYYYMMDD_HHMMSS/ +├── trace_YYYYMMDD_HHMMSS/ +├── rocprof_sys_YYYYMMDD_HHMMSS/ +└── rocprof_compute_YYYYMMDD_HHMMSS/ +``` + +--- + +## Customizing Profiling Runs + +To profile different networks or configurations, modify the scripts to use different arguments: + +```bash +# Example: Profile VGG16 with larger batch size +python micro_benchmarking_pytorch.py --network vgg16 --batch-size 128 --iterations 10 + +# Example: Profile with FP16 +python micro_benchmarking_pytorch.py --network resnet50 --batch-size 64 --iterations 10 --fp16 1 + +# Example: Profile with PyTorch 2.0 compile +python micro_benchmarking_pytorch.py --network resnet50 --batch-size 64 --iterations 10 --compile +``` + +Available networks include: `alexnet`, `densenet121`, `inception_v3`, `resnet50`, `resnet101`, `SqueezeNet`, `vgg16`, and more. + +--- + +## Requirements + +- **ROCm 6.x or 7.x** (tested with 6.4.4 and 7.0) +- AMD GPU (tested on RX 7900 XTX / gfx1100 and MI300) +- Profiling tools installed: + - `rocprofv3` + - `rocprof-compute` + - `rocprof-sys` +- Python 3 with PyTorch (ROCm build) +- SQLite3 (for ROCm 7.x database analysis) + +--- + +## ROCm Version Differences + +### ROCm 6.x Output Format +- **CSV files**: `kernel_trace.csv`, `agent_info.csv` +- **Analysis tool**: `analyze_kernel_trace.py` +- **Performance**: May use naive convolution kernels (slower) + +### ROCm 7.x Output Format +- **SQLite database**: `*_results.db` (single database file) +- **Analysis tool**: `analyze_rocpd_db.py` +- **Performance**: Uses optimized MLIR-generated kernels (faster) +- **Tables**: UUID-suffixed table names (e.g., `rocpd_kernel_dispatch_`) + +### Example Performance Comparison (ResNet50) +``` +ROCm 6.x: ~90-140 seconds GPU time (naive kernels dominate 98%+) +ROCm 7.x: ~1.2 seconds GPU time (optimized MLIR kernels) +``` + +The `get_counters.sh` script automatically detects the ROCm version and uses the appropriate analysis tool. 
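+
+Because the ROCm 7.x tables carry a UUID suffix, ad-hoc queries should discover the table names at runtime rather than hard-coding them. The sketch below is not the bundled `analyze_rocpd_db.py`; it is a minimal example that only lists the `rocpd_*` tables and their columns so you can see what a given rocprofiler-sdk release actually recorded (the database path is illustrative):
+
+```python
+# Minimal sketch: inspect a rocpd SQLite database produced by rocprofv3 on
+# ROCm 7.x. Table names carry a UUID suffix, so discover them via
+# sqlite_master instead of hard-coding them.
+import sqlite3
+import sys
+
+db_path = sys.argv[1]  # e.g. profiling_results/counters_<timestamp>/<pid>_results.db
+conn = sqlite3.connect(db_path)
+
+tables = [row[0] for row in conn.execute(
+    "SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'rocpd_%'")]
+for table in tables:
+    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
+    print(f"{table}: {count} rows")
+
+# Column names differ between rocprofiler-sdk releases, so print them before
+# writing any aggregation query against the kernel dispatch table(s).
+for table in tables:
+    if table.startswith("rocpd_kernel_dispatch"):
+        cols = [c[1] for c in conn.execute(f'PRAGMA table_info("{table}")')]
+        print(f"{table} columns: {', '.join(cols)}")
+
+conn.close()
+```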
+ +--- + +## Troubleshooting + +### Locale Errors (rocprof-compute) +If you see: `ERROR Please ensure that the 'en_US.UTF-8' locale is available` + +**Solution:** Rebuild the devcontainer (Dockerfiles already updated) or set locale manually: +```bash +export LANG=en_US.UTF-8 +export LANGUAGE=en_US:en +export LC_ALL=en_US.UTF-8 +``` + +### Memory Map Dumps (rocprof-sys) +If `get_rocprof_sys.sh` produces excessive memory map output instead of clean profiles, this is a known issue. Use alternative profilers: `get_trace.sh` or `get_rocprof_compute.sh`. + +### Permission Errors +Ensure scripts are executable: +```bash +chmod +x get_*.sh +``` + +--- + +## Additional Resources + +- [ROCm Profiling Documentation](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) +- [Perfetto UI](https://ui.perfetto.dev/) +- [MIOpen Performance Database](https://rocm.github.io/MIOpen/doc/html/perfdatabase.html) + +--- + +## Related Files + +- `README.md` - Main documentation for inference_benchmark +- `analyze_kernel_trace.py` - ROCm 6.x CSV analysis script (auto-created by `get_counters.sh`) +- `analyze_rocpd_db.py` - ROCm 7.x SQLite database analysis script +- `micro_benchmarking_pytorch.py` - Main benchmark script diff --git a/MLExamples/pytorch_microbench/README.md b/MLExamples/pytorch_microbench/README.md new file mode 100644 index 00000000..ae236674 --- /dev/null +++ b/MLExamples/pytorch_microbench/README.md @@ -0,0 +1,256 @@ +# ML Example: PyTorch Micro-Benchmarking with ROCm Profiling + +README.md from `HPCTrainingExamples/MLExamples/pytorch_microbench` from the Training Examples repository. + +In this example we provide a PyTorch micro-benchmarking tool for measuring GPU throughput on AMD GPUs. The benchmark runs forward and backward passes on various CNN architectures, measuring images processed per second. This workload is useful for establishing baseline GPU performance and for learning ROCm profiling tools. Several profiling scripts are provided to capture different aspects of GPU performance, from high-level API traces to detailed hardware metrics. + +## Features of the profiling scripts + +The pytorch_microbench example contains several profiling scripts that capture different aspects of GPU performance: + +- **get_trace.sh**: Runtime trace collection using rocprofv3. Captures HIP/HSA API calls, kernel execution timeline, memory operations (H2D, D2H, D2D transfers), and synchronization events. Output is a Perfetto trace file for timeline visualization. +- **get_counters.sh**: Kernel trace collection using rocprofv3. Captures kernel execution statistics including timing and call counts. Useful for identifying hotspot kernels and their execution patterns. +- **get_rocprof_compute.sh**: Detailed GPU hardware metrics using rocprof-compute. Provides comprehensive performance analysis including compute utilization, memory bandwidth, and hardware counter data. +- **get_rocprof_sys.sh**: System-level profiling using rocprof-sys. Captures call stack sampling and system-level performance data for end-to-end analysis. + +## Overview of the benchmark + +The benchmark is controlled with the following arguments: + +- `--network `: neural network architecture to benchmark (alexnet, densenet121, inception_v3, resnet50, resnet101, SqueezeNet, vgg16, etc.) 
+- `--batch-size `: batch size for forward/backward passes (default: 64) +- `--iterations `: number of iterations to run (default: 10) +- `--fp16 <0|1>`: enable FP16 precision (default: 0, disabled) +- `--compile`: enable PyTorch 2.0 torch.compile optimizations +- `--compileContext `: compilation options as Python dict string +- `--distributed_dataparallel`: use DistributedDataParallel for multi-GPU +- `--device_ids `: comma-separated GPU indices for distributed runs + +## Running the micro-benchmark + +Load the required modules: + +``` +module load pytorch rocm +``` + +Run a basic micro-benchmark with ResNet50: + +``` +echo "Running ResNet50 micro-benchmark" +python micro_benchmarking_pytorch.py --network resnet50 --batch-size 64 --iterations 10 +``` + +Example output (Radeon RX 7900 XTX, ROCm 6.4): + +``` +INFO: running forward and backward for warmup. +INFO: running the benchmark.. +OK: finished running benchmark.. +--------------------SUMMARY-------------------------- +Microbenchmark for network : resnet50 +Num devices: 1 +Dtype: FP32 +Mini batch size [img] : 64 +Time per mini-batch : 0.177 +Throughput [img/sec] : 360.74 +``` + +Note the throughput reported in images/second. This measures the combined forward and backward pass performance. + +For multi-GPU runs using torchrun (recommended): + +``` +echo "Running 2-GPU micro-benchmark with torchrun" +torchrun --nproc-per-node 2 micro_benchmarking_pytorch.py --network resnet50 --batch-size 128 +``` + +For PyTorch 2.0 compilation: + +``` +echo "Running with torch.compile max-autotune" +python micro_benchmarking_pytorch.py --network resnet50 --compile --compileContext "{'mode': 'max-autotune'}" +``` + +## Runtime Trace Profiling with get_trace.sh + +This script captures GPU API calls, kernel launches, and memory operations for timeline analysis. + +Run the profiling script: + +``` +echo "Collecting runtime trace with rocprofv3" +./get_trace.sh +``` + +The script will output results to `profiling_results/trace_/`. To analyze the results: + +``` +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .pftrace file" +``` + +Example output (ROCm 6.4): + +``` +Detected ROCm version: 6.4.4-129 +Starting rocprofv3 runtime trace profiling for pytorch_microbench... +Output directory: profiling_results/trace_20260114_151142 +Using ROCm 6.x/7.x: --output-format pftrace (generates Perfetto trace) + +Collecting full runtime trace (HIP/HSA API calls, kernels, memory operations) + +INFO: running forward and backward for warmup. +INFO: running the benchmark.. +OK: finished running benchmark.. +... +Profiling complete! Results saved to: profiling_results/trace_20260114_151142 + +Generated files: +total 25M +-rw-r--r-- 1 root root 25M Jan 14 15:11 5712_results.pftrace + +Perfetto trace file found: profiling_results/trace_20260114_151142/.../5712_results.pftrace +Size: 25M + +To view the trace: + 1. Visit: https://ui.perfetto.dev/ + 2. Open: profiling_results/trace_20260114_151142/.../5712_results.pftrace +``` + +If a `.db` file is generated instead (ROCm 7.x without --output-format): + +``` +echo "Converting database to Perfetto format" +rocpd2pftrace -i -o trace.pftrace +``` + +## Kernel Trace Profiling with get_counters.sh + +This script collects kernel execution statistics including timing and call counts. + +Run the profiling script: + +``` +echo "Collecting kernel trace with rocprofv3" +./get_counters.sh +``` + +The script will output results to `profiling_results/counters_/`. 
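+
+On ROCm 6.x the main artifact is `kernel_trace.csv`, and the per-kernel summary shown later in this section can be reproduced with a few lines of pandas. The sketch below is a minimal example, not a script shipped with the benchmark; it assumes the CSV exposes `Kernel_Name`, `Start_Timestamp`, and `End_Timestamp` columns with nanosecond timestamps, so check the header of your file first:
+
+```python
+# Minimal sketch: aggregate a rocprofv3 kernel_trace.csv (ROCm 6.x) into a
+# per-kernel summary. Column names are assumptions; adjust if your rocprofv3
+# release uses different ones.
+import pandas as pd
+
+df = pd.read_csv("kernel_trace.csv")
+df["Duration_ms"] = (df["End_Timestamp"] - df["Start_Timestamp"]) / 1e6
+
+summary = (df.groupby("Kernel_Name")["Duration_ms"]
+             .agg(Count="count", Total_ms="sum", Avg_ms="mean")
+             .sort_values("Total_ms", ascending=False))
+summary["Pct_Time"] = 100 * summary["Total_ms"] / summary["Total_ms"].sum()
+print(summary.head(10).to_string(float_format="%.3f"))
+```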
+ +Example output (ROCm 6.4): + +``` +Detected ROCm version: 6.4.4-129 +Starting rocprofv3 kernel trace collection for pytorch_microbench... +Output directory: profiling_results/counters_20260114_151213 +... +Profiling complete! Results saved to: profiling_results/counters_20260114_151213 + +Generated files: +total 8.6M +-rw-r--r-- 1 root root 1.6K Jan 14 15:12 5864_agent_info.csv +-rw-r--r-- 1 root root 8.5M Jan 14 15:12 5864_kernel_trace.csv + +To analyze results: + Check profiling_results/counters_20260114_151213 for output files +``` + +ROCm 6.x outputs CSV files directly, while ROCm 7.x outputs SQLite databases. For ROCm 7.x database files, use rocpd tools: + +``` +echo "Exporting kernel statistics to CSV" +rocpd2csv -i -o kernel_stats.csv +``` + +``` +echo "Getting kernel summary" +rocpd summary -i --region-categories KERNEL +``` + +Example kernel analysis (ResNet50, 10 iterations): + +``` +Total kernels: 21175 +Unique kernels: 68 +Total GPU time: 2080.62 ms + +Kernel Name Count Total(ms) Avg(us) %Time +-------------------------------------------------------------------------------------------------------- +miopenSp3AsmConv_v30_3_1_gfx11_fp32_f2x3_stride1 732 760.707 1039.217 36.6% +MIOpenBatchNormBwdSpatial 636 168.497 264.932 8.1% +void at::native::vectorized_elementwise_kernel<4, at::nati... 384 120.959 314.997 5.8% +void at::native::vectorized_elementwise_kernel<4, at::nati... 588 96.744 164.530 4.6% +Cijk_Alik_Bljk_SB_MT64x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0E... 2304 88.475 38.401 4.3% +MIOpenBatchNormFwdTrainSpatial 480 73.505 153.136 3.5% +Cijk_Alik_Bljk_SB_MT16x16x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0... 768 70.635 91.973 3.4% +miopenSp3AsmConv_v30_3_1_gfx11_fp32_f3x2_stride1 108 48.377 447.933 2.3% +... +``` + +The top kernels show MIOpen convolutions (`miopenSp3AsmConv`) and batch normalization (`MIOpenBatchNorm`) dominate execution time, which is expected for ResNet50. + +Documentation for rocpd tools: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html + +## GPU Hardware Metrics with get_rocprof_compute.sh + +This script collects detailed GPU performance metrics for hardware utilization analysis. + +Run the profiling script: + +``` +echo "Collecting GPU hardware metrics with rocprof-compute" +./get_rocprof_compute.sh +``` + +The script will output results to `profiling_results/rocprof_compute_/`. To analyze the results: + +``` +echo "Generating performance analysis report" +rocprof-compute analyze -p /workloads//rocprof --dispatch -n microbench_dispatch +``` + +For available analysis options: + +``` +rocprof-compute analyze --help +``` + +Note: rocprof-compute requires data center GPUs (MI100, MI200, MI300 series) for full hardware counter support. Consumer GPUs may have limited counter availability. + +## System-Level Profiling with get_rocprof_sys.sh + +This script captures system-level performance with call stack sampling. + +Run the profiling script: + +``` +echo "Collecting system-level profile with rocprof-sys" +./get_rocprof_sys.sh +``` + +The script will output results to `profiling_results/rocprof_sys_/`. To analyze the results: + +``` +echo "Opening trace in Perfetto UI" +echo "Visit https://ui.perfetto.dev/ and open the .proto file" +``` + +Note: rocprof-sys may produce memory map dumps in some configurations. If profiling fails or produces excessive output, consider using rocprofv3 (get_trace.sh) instead. 
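+
+Both get_trace.sh and get_rocprof_sys.sh produce traces intended for the Perfetto UI. For scripted analysis, the Perfetto Python bindings can run SQL over a `.pftrace` directly; the sketch below is a minimal, unofficial example (it assumes `pip install perfetto`, reuses the trace file name from the get_trace.sh run shown earlier, and the `slice` table mixes API-call and GPU slices):
+
+```python
+# Minimal sketch: query a Perfetto .pftrace from Python instead of the web UI.
+# Requires `pip install perfetto`; the TraceProcessor API may differ slightly
+# between perfetto releases.
+from perfetto.trace_processor import TraceProcessor
+
+tp = TraceProcessor(trace="5712_results.pftrace")  # path from the get_trace.sh output
+query = """
+    SELECT name, COUNT(*) AS calls, SUM(dur) / 1e6 AS total_ms
+    FROM slice
+    GROUP BY name
+    ORDER BY total_ms DESC
+    LIMIT 10
+"""
+for row in tp.query(query):
+    print(f"{row.total_ms:10.3f} ms  {row.calls:6d}  {row.name}")
+tp.close()
+```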
+ +## Performance Tuning + +For optimal performance on specific hardware, tune MIOpen by setting the environment variable before running: + +``` +export MIOPEN_FIND_ENFORCE=3 +python micro_benchmarking_pytorch.py --network resnet50 +``` + +This writes to a local performance database. See [MIOpen documentation](https://rocm.github.io/MIOpen/doc/html/perfdatabase.html) for details. + +## Additional Resources + +- rocprofv3 documentation: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocprofv3.html +- rocpd output format: https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/develop/how-to/using-rocpd-output-format.html +- Perfetto UI: https://ui.perfetto.dev/ diff --git a/MLExamples/inference_benchmark/TorchTensorOpsBench/README.md b/MLExamples/pytorch_microbench/TorchTensorOpsBench/README.md similarity index 100% rename from MLExamples/inference_benchmark/TorchTensorOpsBench/README.md rename to MLExamples/pytorch_microbench/TorchTensorOpsBench/README.md diff --git a/MLExamples/inference_benchmark/TorchTensorOpsBench/run.sh b/MLExamples/pytorch_microbench/TorchTensorOpsBench/run.sh similarity index 100% rename from MLExamples/inference_benchmark/TorchTensorOpsBench/run.sh rename to MLExamples/pytorch_microbench/TorchTensorOpsBench/run.sh diff --git a/MLExamples/inference_benchmark/TorchTensorOpsBench/torch_tensor_ops_bench.py b/MLExamples/pytorch_microbench/TorchTensorOpsBench/torch_tensor_ops_bench.py similarity index 100% rename from MLExamples/inference_benchmark/TorchTensorOpsBench/torch_tensor_ops_bench.py rename to MLExamples/pytorch_microbench/TorchTensorOpsBench/torch_tensor_ops_bench.py diff --git a/MLExamples/inference_benchmark/fp16util.py b/MLExamples/pytorch_microbench/fp16util.py similarity index 100% rename from MLExamples/inference_benchmark/fp16util.py rename to MLExamples/pytorch_microbench/fp16util.py diff --git a/MLExamples/pytorch_microbench/get_counters.sh b/MLExamples/pytorch_microbench/get_counters.sh new file mode 100755 index 00000000..dda018a0 --- /dev/null +++ b/MLExamples/pytorch_microbench/get_counters.sh @@ -0,0 +1,79 @@ +#!/bin/bash +# Script to profile pytorch_microbench with rocprofv3 kernel trace +# This captures kernel execution metrics for performance analysis +# +# Supports both ROCm 6.x (CSV output) and ROCm 7.x (SQLite database output) + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="profiling_results/counters_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 kernel trace collection for pytorch_microbench..." 
+echo "Output directory: $OUTPUT_DIR" + +# Run with rocprofv3 to collect kernel trace +# Using resnet50 as the default network with standard batch size +rocprofv3 \ + --kernel-trace \ + --output-directory "$OUTPUT_DIR" \ + -- python micro_benchmarking_pytorch.py \ + --network resnet50 \ + --batch-size 64 \ + --iterations 10 + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR"/*/ 2>/dev/null || ls -lh "$OUTPUT_DIR" +echo "" + +# Analyze results based on ROCm version +echo "To analyze results:" +DB_FILE=$(find "$OUTPUT_DIR" -name "*_results.db" 2>/dev/null | head -1) +if [ -n "$DB_FILE" ]; then + echo " Database file: $DB_FILE" + echo "" + echo " Export to CSV:" + echo " rocpd2csv -i $DB_FILE -o kernel_stats.csv" + echo "" + echo " Get kernel summary:" + echo " rocpd summary -i $DB_FILE --region-categories KERNEL" +else + echo " Check $OUTPUT_DIR for output files" +fi diff --git a/MLExamples/pytorch_microbench/get_rocprof_compute.sh b/MLExamples/pytorch_microbench/get_rocprof_compute.sh new file mode 100755 index 00000000..69cfa800 --- /dev/null +++ b/MLExamples/pytorch_microbench/get_rocprof_compute.sh @@ -0,0 +1,40 @@ +#!/bin/bash +# Script to profile pytorch_microbench with rocprof-compute +# This captures detailed GPU hardware metrics and compute performance analysis +# +# Compatible with ROCm 6.x and 7.x + +set -e + +# Create output directory with timestamp +OUTPUT_DIR="profiling_results/rocprof_compute_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +# Generate unique workload name with timestamp +WORKLOAD_NAME="pytorch_microbench_resnet50_$(date +%Y%m%d_%H%M%S)" + +echo "Starting rocprof-compute profiling for pytorch_microbench..." +echo "Workload name: $WORKLOAD_NAME" +echo "Output directory: $OUTPUT_DIR" + +# Run with rocprof-compute to collect detailed GPU metrics +# Using resnet50 as the default network with standard batch size +rocprof-compute profile \ + --name "$WORKLOAD_NAME" \ + -d "$OUTPUT_DIR" \ + -- python micro_benchmarking_pytorch.py \ + --network resnet50 \ + --batch-size 64 \ + --iterations 10 + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR" +echo "" +echo "To analyze results:" +echo " rocprof-compute analyze -p $OUTPUT_DIR/workloads/${WORKLOAD_NAME}/rocprof --dispatch -n inference_dispatch" +echo "" +echo "For help on analysis options:" +echo " rocprof-compute analyze --help" diff --git a/MLExamples/pytorch_microbench/get_rocprof_sys.sh b/MLExamples/pytorch_microbench/get_rocprof_sys.sh new file mode 100755 index 00000000..da816327 --- /dev/null +++ b/MLExamples/pytorch_microbench/get_rocprof_sys.sh @@ -0,0 +1,41 @@ +#!/bin/bash +# Script to profile pytorch_microbench with rocprof-sys +# This captures system-level performance with call stack sampling +# +# Compatible with ROCm 6.x and 7.x +# +# NOTE: rocprof-sys may produce memory map dumps in some configurations. +# Issue reference: TBD + +set -e + +# Create output directory with timestamp +OUTPUT_DIR="profiling_results/rocprof_sys_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprof-sys profiling for pytorch_microbench..." 
+echo "Output directory: $OUTPUT_DIR" +echo "" + +cd "$OUTPUT_DIR" + +# Run with rocprof-sys to collect system-level profile +# Using resnet50 as the default network with standard batch size +rocprof-sys-run \ + --profile \ + --trace \ + -- python ../../micro_benchmarking_pytorch.py \ + --network resnet50 \ + --batch-size 64 \ + --iterations 10 + +cd ../.. + +echo "" +echo "Profiling complete! Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR" +echo "" +echo "To analyze results:" +echo " Open the .proto file in Perfetto UI: https://ui.perfetto.dev/" diff --git a/MLExamples/pytorch_microbench/get_trace.sh b/MLExamples/pytorch_microbench/get_trace.sh new file mode 100755 index 00000000..7aeda243 --- /dev/null +++ b/MLExamples/pytorch_microbench/get_trace.sh @@ -0,0 +1,109 @@ +#!/bin/bash +# Script to profile pytorch_microbench with rocprofv3 runtime trace +# This captures GPU API calls, kernel launches, and memory operations +# +# Compatible with ROCm 6.x and 7.x + +set -e + +# Detect ROCm version +ROCM_VERSION="" +ROCM_MAJOR="" + +# Method 1: Check rocminfo +if command -v rocminfo &> /dev/null; then + ROCM_VERSION=$(rocminfo | grep -i "ROCm Version" | head -1 | awk '{print $3}') +fi + +# Method 2: Check ROCM_PATH +if [ -z "$ROCM_VERSION" ] && [ -n "$ROCM_PATH" ]; then + if [ -f "$ROCM_PATH/.info/version" ]; then + ROCM_VERSION=$(cat "$ROCM_PATH/.info/version") + fi +fi + +# Method 3: Check hipcc version (more reliable for module-loaded ROCm) +if [ -z "$ROCM_VERSION" ] && command -v hipcc &> /dev/null; then + HIP_VERSION=$(hipcc --version 2>/dev/null | grep -i "HIP version" | head -1 | awk '{print $3}') + if [ -n "$HIP_VERSION" ]; then + ROCM_VERSION="$HIP_VERSION" + fi +fi + +# Extract major version +if [ -n "$ROCM_VERSION" ]; then + ROCM_MAJOR=$(echo "$ROCM_VERSION" | cut -d. -f1) + echo "Detected ROCm version: $ROCM_VERSION" +else + echo "Warning: Could not detect ROCm version, assuming ROCm 7.x" + ROCM_MAJOR="7" +fi + +# Create output directory with timestamp +OUTPUT_DIR="profiling_results/trace_$(date +%Y%m%d_%H%M%S)" +mkdir -p "$OUTPUT_DIR" + +echo "Starting rocprofv3 runtime trace profiling for pytorch_microbench..." +echo "Output directory: $OUTPUT_DIR" + +# Build rocprofv3 command with appropriate flags for ROCm version +# ROCm 6.4+ and 7.x require explicit --output-format pftrace to generate Perfetto traces +# Earlier ROCm 6.x versions (6.0-6.3) generated pftrace by default +if [ "$ROCM_MAJOR" = "7" ] || [ "$ROCM_MAJOR" = "6" ]; then + echo "Using ROCm 6.x/7.x: --output-format pftrace (generates Perfetto trace)" + OUTPUT_FORMAT="--output-format pftrace" +else + echo "Using ROCm 5.x or older: default format" + OUTPUT_FORMAT="" +fi + +echo "" +echo "Collecting full runtime trace (HIP/HSA API calls, kernels, memory operations)" +echo "" + +# Run with rocprofv3 to collect full runtime trace +# Using resnet50 as the default network with standard batch size +# NOTE: Using --runtime-trace to capture complete timeline: +# - HIP/HSA API calls +# - Kernel execution on GPU +# - Memory operations (H2D, D2H, D2D transfers) +# - Synchronization events +# This provides the comprehensive view needed for timeline analysis in Perfetto +rocprofv3 \ + --runtime-trace \ + $OUTPUT_FORMAT \ + --output-directory "$OUTPUT_DIR" \ + -- python micro_benchmarking_pytorch.py \ + --network resnet50 \ + --batch-size 64 \ + --iterations 10 + +echo "" +echo "Profiling complete! 
Results saved to: $OUTPUT_DIR" +echo "" +echo "Generated files:" +ls -lh "$OUTPUT_DIR"/*/ 2>/dev/null || ls -lh "$OUTPUT_DIR" +echo "" + +# Find and highlight the pftrace file +PFTRACE_FILE=$(find "$OUTPUT_DIR" -name "*.pftrace" | head -1) +DB_FILE=$(find "$OUTPUT_DIR" -name "*.db" | head -1) + +if [ -n "$PFTRACE_FILE" ]; then + echo "Perfetto trace file found: $PFTRACE_FILE" + echo "Size: $(du -h "$PFTRACE_FILE" | cut -f1)" + echo "" + echo "To view the trace:" + echo " 1. Visit: https://ui.perfetto.dev/" + echo " 2. Open: $PFTRACE_FILE" +elif [ -n "$DB_FILE" ]; then + echo "SQLite database found (ROCm 7.x without --output-format): $DB_FILE" + echo "To convert to Perfetto format:" + echo " rocpd2pftrace -i $DB_FILE -o trace.pftrace" + echo "" + echo "Next time, use --output-format pftrace to generate Perfetto traces directly" +else + echo "WARNING: No .pftrace or .db file found" + echo "Check the output directory for profiling results" +fi +echo "" diff --git a/MLExamples/inference_benchmark/micro_benchmarking_pytorch.py b/MLExamples/pytorch_microbench/micro_benchmarking_pytorch.py similarity index 100% rename from MLExamples/inference_benchmark/micro_benchmarking_pytorch.py rename to MLExamples/pytorch_microbench/micro_benchmarking_pytorch.py diff --git a/MLExamples/inference_benchmark/shufflenet.py b/MLExamples/pytorch_microbench/shufflenet.py similarity index 100% rename from MLExamples/inference_benchmark/shufflenet.py rename to MLExamples/pytorch_microbench/shufflenet.py diff --git a/MLExamples/inference_benchmark/shufflenet_v2.py b/MLExamples/pytorch_microbench/shufflenet_v2.py similarity index 100% rename from MLExamples/inference_benchmark/shufflenet_v2.py rename to MLExamples/pytorch_microbench/shufflenet_v2.py diff --git a/MLExamples/inference_benchmark/xception.py b/MLExamples/pytorch_microbench/xception.py similarity index 100% rename from MLExamples/inference_benchmark/xception.py rename to MLExamples/pytorch_microbench/xception.py