From c7374e2d7ba4589557990ce1f18bab522470fe76 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 7 Aug 2025 01:19:29 +0000
Subject: [PATCH 1/2] Initial plan

From 97d4c99500ed9c21923c57abc39dc9d00a232788 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 7 Aug 2025 01:34:34 +0000
Subject: [PATCH 2/2] Create comprehensive GitHub Copilot instructions

---
 .github/copilot-instructions.md | 255 ++++++++++++++++++++++++++++++++
 1 file changed, 255 insertions(+)
 create mode 100644 .github/copilot-instructions.md

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
new file mode 100644
index 00000000..dc39dcc4
--- /dev/null
+++ b/.github/copilot-instructions.md
@@ -0,0 +1,255 @@
+# IntelliPerf: AI-Powered GPU Performance Engineering Framework
+
+IntelliPerf is a Python-based framework that uses Large Language Models (LLMs) to automatically analyze and optimize GPU kernel performance. It supports HIP/ROCm, Triton, and PyTorch applications, targeting bottlenecks like bank conflicts, memory access patterns, and atomic contention.
+
+Always reference these instructions first and fallback to search or bash commands only when you encounter unexpected information that does not match the info here.
+
+## Working Effectively
+
+### Quick Start (Container Recommended)
+Use containers for full functionality including GPU-dependent features:
+```bash
+# Using Docker (recommended)
+./docker/build.sh
+./docker/run.sh
+
+# Using Apptainer
+./apptainer/build.sh
+./apptainer/run.sh
+```
+
+### Development Installation (Basic Python Functionality)
+For Python-only development without GPU dependencies:
+```bash
+# Install the main package (takes ~90 seconds)
+pip install -e .
+
+# Verify installation
+intelliperf --help
+```
+
+### Full Dependencies Installation (Network-Intensive)
+**WARNING**: This step frequently fails due to network timeouts. NEVER CANCEL builds - they may take 45+ minutes.
+```bash
+# Install external tools - NEVER CANCEL: Can take 45+ minutes. Set timeout to 60+ minutes.
+python3 scripts/install_tool.py --all
+
+# If network timeouts occur, this is expected - document as "may fail due to network limitations"
+```
+
+### Examples Build (Requires ROCm/HIP)
+```bash
+# Build examples - requires ROCm/HIP environment
+cd examples
+./scripts/build_examples.sh -c
+
+# Clean build if needed
+./scripts/build_examples.sh -c --clean
+
+# Verbose build for debugging
+./scripts/build_examples.sh -c --verbose
+```
+
+## Core Development Commands
+
+### Code Quality (Always Run Before Committing)
+```bash
+# Install linting tools
+pip install ruff==0.3.0
+
+# Check code style (fast, <1 second)
+ruff check .
+
+# Fix auto-fixable issues
+ruff check . --fix
+
+# Format code
+ruff format .
+```
+
+### Pre-commit Hooks (May Fail Due to Network Issues)
+```bash
+pip install pre-commit==3.6.0
+pre-commit install
+
+# Run all hooks - NEVER CANCEL: Takes 2-5 minutes. Set timeout to 10+ minutes.
+# NOTE: May fail due to network timeouts - this is expected in some environments
+pre-commit run --all-files
+```
+
+### Testing
+```bash
+# Note: Most tests require GPU hardware and ROCm environment
+# Basic test check (will fail without GPU libraries but shows test structure)
+python -m pytest tests/ -v
+
+# Shell-based integration tests (require built examples)
+./tests/test_matrix_transpose.sh
+```
+
+## IntelliPerf Usage Patterns
+
+### Diagnose Only (Works Without GPU Optimization)
+```bash
+# Diagnose HIP application
+intelliperf --formula=diagnoseOnly -- ./examples/build/access_pattern/uncoalesced
+
+# Diagnose PyTorch application
+intelliperf --formula=diagnoseOnly -- python ./examples/torch/add.py
+
+# Diagnose Triton application
+TRITON_DISABLE_LINE_INFO=0 intelliperf --formula=diagnoseOnly -- python ./examples/triton/reduce.py
+```
+
+### Full Optimization (Requires LLM API Key and GPU)
+```bash
+# Set required environment variable
+export LLM_GATEWAY_KEY="your_api_key_here"
+
+# Memory access optimization
+intelliperf --project_directory=./examples \
+  --build_command="./scripts/build_examples.sh -c" \
+  --formula=memoryAccess -- ./build/access_pattern/uncoalesced
+
+# Bank conflict optimization
+intelliperf --project_directory=./examples \
+  --build_command="./scripts/build_examples.sh -c" \
+  --formula=bankConflict -- ./build/bank_conflict/matrix_transpose 1024 1024
+
+# Atomic contention optimization
+intelliperf --project_directory=./examples \
+  --build_command="./scripts/build_examples.sh -c" \
+  --instrument_command="./scripts/build_examples.sh -i -c" \
+  --formula=atomicContention -- ./build/contention/reduction
+```
+
+## Manual Validation Requirements
+
+**CRITICAL**: After making any changes to IntelliPerf, ALWAYS run through these complete validation scenarios:
+
+### 1. Memory Access Pattern Validation
+```bash
+# Test uncoalesced memory access detection and optimization
+intelliperf --formula=memoryAccess --project_directory=./examples \
+  --build_command="./scripts/build_examples.sh -c" \
+  -- ./build/access_pattern/uncoalesced
+
+# Verify: Should show memory coalescing improvements and performance gains
+```
+
+### 2. Bank Conflict Validation
+```bash
+# Test shared memory bank conflict detection and optimization
+intelliperf --formula=bankConflict --project_directory=./examples \
+  --build_command="./scripts/build_examples.sh -c" \
+  -- ./build/bank_conflict/matrix_transpose 1024 1024
+
+# Verify: Should show bank conflict reduction and speedup
+```
+
+### 3. Atomic Contention Validation
+```bash
+# Test atomic operation contention detection and optimization
+intelliperf --formula=atomicContention --project_directory=./examples \
+  --build_command="./scripts/build_examples.sh -c" \
+  --instrument_command="./scripts/build_examples.sh -i -c" \
+  -- ./build/contention/reduction
+
+# Verify: Should show atomic contention reduction and performance improvement
+```
+
+### 4. Multi-Backend Diagnose Validation
+```bash
+# Test HIP application analysis
+intelliperf --formula=diagnoseOnly -- ./examples/build/access_pattern/uncoalesced
+
+# Test PyTorch application analysis
+intelliperf --formula=diagnoseOnly -- python ./examples/torch/add.py
+
+# Test Triton application analysis
+TRITON_DISABLE_LINE_INFO=0 intelliperf --formula=diagnoseOnly -- python ./examples/triton/reduce.py
+
+# Verify: All should generate valid performance analysis JSON output
+```
+
+## Critical Timing and Timeout Information
+
+### Build Commands - NEVER CANCEL
+- **Python package install**: 90 seconds normal, set timeout to 3+ minutes
+- **External tools install**: 45+ minutes normal, set timeout to 60+ minutes
+- **Examples build**: 5-10 minutes normal, set timeout to 15+ minutes
+- **Pre-commit setup**: 2-5 minutes normal, set timeout to 10+ minutes
+- **IntelliPerf optimization runs**: 10-30 minutes normal, set timeout to 45+ minutes
+
+### Network Issues (Expected)
+- External dependency installation frequently fails due to network timeouts
+- Pre-commit hooks may fail to install due to PyPI timeouts
+- Document these as "may fail due to network limitations" rather than fixing
+- Use containers for reliable development environment
+
+## Repository Structure
+
+### Key Directories
+```
+src/intelliperf/             # Main Python package
+src/accordo/                 # Validation and correctness checking
+examples/                    # Test applications in HIP, Triton, PyTorch
+  scripts/build_examples.sh  # Example build system
+external/                    # External dependencies (rocprofiler-compute, omniprobe, nexus)
+tests/                       # Integration tests (require GPU hardware)
+.github/workflows/           # CI that runs on AMD GPU droplets
+```
+
+### Configuration Files
+- `pyproject.toml` - Python dependencies and tool configuration
+- `.pre-commit-config.yaml` - Code quality hooks
+- `.github/workflows/ci.yml` - Full GPU-based testing pipeline
+- `docker/` and `apptainer/` - Container definitions
+
+## Environment Requirements
+
+### Minimal (Python Development)
+- Python 3.8+
+- pip
+
+### Full Functionality
+- ROCm/HIP environment
+- AMD GPU hardware (tested on MI300X)
+- Network access for dependency installation
+- LLM API key for optimization features
+
+## Common Issues and Solutions
+
+### "ROCm not found" Error
+- Expected in non-GPU environments
+- Use containers for full GPU functionality
+- Python-only features still work (CLI, some validation)
+
+### Network Timeout Errors
+- Very common with `python3 scripts/install_tool.py --all`
+- Expected with pre-commit installation
+- Document as limitation rather than trying to fix
+- Use containers which have dependencies pre-installed
+
+### Test Failures Without GPU
+- Expected - most tests require GPU hardware
+- CI runs on actual AMD GPU droplets
+- Focus on code quality checks for local development
+
+### Performance Validation
+- Always test at least one complete optimization scenario after changes
+- Verify JSON output contains expected performance metrics
+- Check that both correctness and performance validation pass
+
+## CI Integration
+
+The CI system (.github/workflows/ci.yml) runs comprehensive tests on AMD GPU hardware:
+- Spins up GPU droplets with MI300X hardware
+- Installs full dependency chain
+- Tests all optimization formulas
+- Validates correctness and performance improvements
+- NEVER CANCEL: CI can take 45+ minutes including droplet provisioning
+
+Always ensure your changes pass both local code quality checks and will work in the GPU CI environment.
+
+Always ensure your changes pass both local code quality checks and will work in the GPU CI environment.
\ No newline at end of file
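
The timeout budgets listed under "Critical Timing and Timeout Information" could be applied with a small wrapper. This is a minimal sketch, not part of the patch: it assumes GNU coreutils `timeout` is available, and `run_with_budget` is a hypothetical helper name that does not exist in IntelliPerf. The commands and budgets in the commented examples are the ones named in the instructions file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of IntelliPerf): run a command under a time
# budget instead of cancelling it by hand. Assumes GNU coreutils `timeout`.
run_with_budget() {
  local budget="$1"
  shift
  local status=0
  # `timeout` exits with 124 when the budget is exceeded; otherwise it
  # passes through the exit status of the wrapped command.
  timeout "$budget" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "timed out after $budget: $*" >&2
  fi
  return "$status"
}

# Example invocations using the budgets from the instructions file.
# They are left commented out so sourcing this sketch has no side effects.
# run_with_budget 3m  pip install -e .
# run_with_budget 60m python3 scripts/install_tool.py --all
# run_with_budget 15m ./scripts/build_examples.sh -c
# run_with_budget 10m pre-commit run --all-files
```

A status of 124 distinguishes "budget exceeded" from an ordinary failure such as a network error, which the instructions say to document rather than fix.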