A curated list of awesome open source projects, tools, and resources for AMD GPUs via ROCm — covering AI/ML, inference, creative workloads, HPC, and developer tooling.
Maintained by @mikeroysoft · PRs welcome · See CONTRIBUTING.md
## Contents

- 🏁 Getting Started
- 🤖 AI & ML Frameworks
- 🚀 Inference & Serving
- 🎨 Creative & Image Generation
- ⚡ Performance & Math Libraries
- 🔧 Developer Tools & Profiling
- 📦 Containers & Orchestration
- 🏛️ HPC & Scientific Computing
- 🔄 CUDA Migration
- 🌐 Community & Learning
## 🏁 Getting Started

New to AMD GPUs and ROCm? Start here.
- ROCm Documentation — Official docs. Start with "What is ROCm?" and the install guide.
- ROCm Installation Guide (Linux) — Supported distros, package manager install, and post-install verification.
- ROCm GPU & OS Compatibility Matrix — Check if your GPU and OS are supported before you do anything else.
- ROCm Docker Hub — Official Docker images. The fastest way to get a working PyTorch + ROCm environment without touching your system Python.
- GPUOpen — AMD's developer portal: ISA docs, SDKs, libraries, and tools from AMD and partners.
Quick smoke test after install:

```shell
rocminfo   # list detected GPUs
rocm-smi   # GPU stats (like nvidia-smi)
python3 -c "import torch; print(torch.cuda.is_available())"   # should be True
```

## 🤖 AI & ML Frameworks

Core frameworks with first-class or near-first-class ROCm support.
- PyTorch (ROCm) ⭐ 90k+ — Select "ROCm" on the install matrix. The de facto standard. AMD actively contributes upstream.
- TensorFlow-ROCm ⭐ — AMD's maintained TensorFlow fork with ROCm support.
- JAX on ROCm — Google JAX ported to ROCm. Good for research workloads and differentiable programming.
- DeepSpeed ⭐ 36k+ — Microsoft's distributed training library. ROCm support is built-in and actively tested.
- PEFT (Hugging Face) ⭐ 17k+ — Parameter-Efficient Fine-Tuning (LoRA, QLoRA, etc.). Works on ROCm via PyTorch.
- Transformers (Hugging Face) ⭐ 140k+ — The model hub CLI and training library. ROCm works via PyTorch backend.
- Unsloth ⭐ 25k+ — Fast fine-tuning (2–5x speedup, 70% less VRAM). ROCm support added in 2024.
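The most useful portability fact about the frameworks above: ROCm builds of PyTorch expose AMD GPUs through the familiar `torch.cuda` namespace, so CUDA-style device-selection code typically runs unchanged. A minimal sketch (import is guarded so the snippet also runs on machines without torch installed):

```python
def pick_device():
    """Return "cuda" when a GPU is visible to PyTorch, else "cpu".

    On ROCm builds of PyTorch, AMD GPUs also report through
    torch.cuda.* -- there is no separate "rocm" device string.
    """
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # torch not installed; fall back for this sketch
        return "cpu"

print(pick_device())
```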
## 🚀 Inference & Serving

The hottest space in the ROCm ecosystem right now.
- vLLM ⭐ 50k+ — High-throughput LLM inference server. ROCm is a first-class target; AMD contributes upstream. See the AMD install guide.
- Ollama ⭐ 130k+ — Run LLMs locally. ROCm support is built in — just install and run on a supported AMD GPU.
- llama.cpp ⭐ 80k+ — Lightweight inference with a HIP/ROCm backend (`-DGGML_HIPBLAS=ON`). Broad model support.
- SGLang ⭐ 15k+ — Structured generation and high-performance serving runtime. ROCm support is active and growing.
- LMDeploy ⭐ 5k+ — Efficient inference and serving toolkit from Shanghai AI Lab. ROCm-compatible via PyTorch.
- MLC-LLM ⭐ 20k+ — Compile and deploy LLMs natively. Supports ROCm via Apache TVM backend.
- TGI (Text Generation Inference) ⭐ 9k+ — Hugging Face's production inference server. AMD/ROCm Docker image available.
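Most of the servers above speak simple HTTP APIs, so integration rarely requires GPU-specific code. As one illustration, Ollama listens on port 11434 by default and accepts generation requests at `/api/generate`; a stdlib-only sketch of building such a request (model name is just an example, and the actual POST is left commented out since it assumes a running server):

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model, prompt):
    # Payload shape for Ollama's /api/generate endpoint;
    # stream=False asks for one JSON object instead of a token stream.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

body = build_generate_request("llama3.2", "Why is the sky blue?")
# To actually send it (requires `ollama serve` running locally):
#   import urllib.request
#   req = urllib.request.Request(OLLAMA_URL, data=body,
#                                headers={"Content-Type": "application/json"})
#   print(json.load(urllib.request.urlopen(req))["response"])
print(json.loads(body)["model"])
```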
## 🎨 Creative & Image Generation

Consumer Radeon and Instinct GPUs both shine here. Often overlooked in other ROCm lists.
- AUTOMATIC1111 Stable Diffusion WebUI ⭐ 145k+ — The classic. ROCm works on Linux; use `--precision full --no-half` if you hit precision issues on RDNA2.
- ComfyUI ⭐ 70k+ — Node-based SD UI. Generally better ROCm compatibility than A1111. Actively tested on RX 7000 series.
- InvokeAI ⭐ 23k+ — Professional-grade image generation. ROCm-compatible via PyTorch.
- Fooocus ⭐ 42k+ — Simplified SDXL image-generation UI. Works on ROCm with PyTorch.
- Flux (Black Forest Labs) ⭐ 20k+ — State-of-the-art image generation model. Runs on ROCm via standard PyTorch.
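If PyTorch refuses to see a consumer RDNA2 card for any of the UIs above, a widely used (but unofficial) community workaround is to spoof the GFX target so the ROCm runtime treats the card as a supported `gfx1030` device. Verify the correct value for your specific GPU before relying on it:

```shell
# Unofficial community workaround: present an RDNA2 card to ROCm as gfx1030.
# 10.3.0 maps to gfx1030 (e.g. RX 6800/6900); other cards need other values.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```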
## ⚡ Performance & Math Libraries

The building blocks — what makes everything above fast.
- rocBLAS — BLAS on AMD GPUs. The foundation for most linear algebra in ML workloads.
- hipBLASLt — Lightweight, flexible GEMM library. Used heavily in transformer attention layers.
- MIOpen — AMD's deep learning primitives library (convolutions, batch norm, RNN). The cuDNN equivalent.
- rocFFT — Fast Fourier Transform on ROCm. Used in scientific computing and signal processing.
- Flash Attention (ROCm) — AMD's port of Tri Dao's Flash Attention. Critical for long-context LLM training and inference.
- Triton (ROCm backend) ⭐ 14k+ — OpenAI's GPU programming language. ROCm backend is upstream and increasingly production-quality.
- hipSPARSELt — Sparse GEMM library optimized for MI300X structured sparsity.
- rocWMMA — Wave Matrix Multiply Accumulate — low-level access to AMD matrix cores.
- composable_kernel — High-performance fused kernels for ML ops. MIOpen is built on top of this.
- RCCL — ROCm Communication Collectives Library (all-reduce, broadcast, etc.). The NCCL equivalent for multi-GPU/multi-node training.
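For intuition about what the attention-focused libraries above (hipBLASLt, Flash Attention, composable_kernel) ultimately compute: softmax(QKᵀ/√d)·V. A deliberately naive pure-Python reference for the math only; the GPU libraries fuse and tile these steps so the full score matrix is never materialized in memory:

```python
import math

def matmul(a, b):
    # Textbook row-by-column product; O(n^3) and CPU-only on purpose.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V -- the operation Flash Attention fuses.
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    probs = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(probs, v)

out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
print(out)
```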
## 🔧 Developer Tools & Profiling

- ROCProfiler — GPU performance profiling and hardware counter access. The foundation for most profiling tools in the ecosystem.
- Omniperf — System-level GPU performance profiling. Designed for MI-series Instinct GPUs; great roofline analysis.
- Omnitrace — Full-system tracing (CPU + GPU + MPI). Replaces rocTracer for most workflows.
- HIPIFY — Translates CUDA source to HIP. Best first step when porting a CUDA project to ROCm.
- HIP — The C++ GPU programming API that runs on both AMD (via ROCm) and NVIDIA (via CUDA). Write once, run on both.
- hipcc — ROCm's compiler driver. Based on LLVM/Clang with AMD GPU codegen.
- rocm-cmake — CMake modules and utilities for ROCm projects. Use this to properly set up HIP builds.
- rocDecode — Hardware video decode library for AMD GPUs. Useful for CV and video ML pipelines.
## 📦 Containers & Orchestration

- ROCm Docker Images — Official images: bare ROCm, PyTorch, TensorFlow, JAX. Start here for reproducible environments.
- AMD GPU Operator (Kubernetes) — Deploy and manage AMD GPUs in k8s clusters. Based on the NVIDIA GPU Operator pattern.
- AMD Device Plugin for Kubernetes — Exposes AMD GPUs as schedulable resources in k8s.
- ROCm on WSL2 — Windows Subsystem for Linux 2 support. RDNA2/3 only; check compatibility matrix first.
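With the device plugin (or GPU operator) above installed, pods request AMD GPUs through the `amd.com/gpu` extended resource, exactly like any other Kubernetes resource limit. A minimal sketch of the scheduling side (the image is a placeholder; pin a specific tag in practice):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test
spec:
  containers:
    - name: rocm
      image: rocm/pytorch:latest   # placeholder; pin an exact tag in practice
      resources:
        limits:
          amd.com/gpu: 1           # resource name exposed by the AMD device plugin
```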
## 🏛️ HPC & Scientific Computing

- OpenMPI + ROCm — UCX and OpenMPI both support ROCm-aware MPI (GPU-to-GPU transfers without CPU round-trips).
- GROMACS (AMD build) — Molecular dynamics. ROCm-accelerated kernels for MI-series GPUs.
- LAMMPS ⭐ 2.5k+ — Molecular dynamics simulator with HIP/ROCm support via the KOKKOS package.
- CP2K ⭐ 2.5k+ — Quantum chemistry and MD. ROCm GPU offload available.
- OpenCL on ROCm — ROCm includes a full OpenCL 2.0 implementation. Legacy HPC codes can often run without porting.
- rocRAND — GPU random number generation. Used in Monte Carlo and stochastic simulation workloads.
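As a feel for the workload class rocRAND targets: Monte Carlo estimates consume enormous streams of random numbers, which is why generation gets pushed onto the GPU at all. A seeded, CPU-only toy version of the classic π estimate (rocRAND's job is to produce the `random()` stream at GPU speed):

```python
import random

def estimate_pi(samples, seed=0):
    # Draw points in the unit square; the fraction landing inside the
    # quarter circle approaches pi/4 as the sample count grows.
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```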
## 🔄 CUDA Migration

Porting from CUDA? These tools and guides are your starting point.
- HIPIFY — Automated CUDA → HIP source translation. Handles most common patterns; check the unsupported features list.
- HIP Programming Guide — The definitive guide to writing HIP code and understanding CUDA equivalencies.
- CUDA to ROCm Porting Guide — AMD's official porting walkthrough with common patterns and gotchas.
- PyTorch CUDA to ROCm Migration — Notes on CUDA/HIP compatibility in PyTorch — most CUDA PyTorch code runs on ROCm unchanged.
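The core of the porting story above is that HIP mirrors the CUDA runtime API almost symbol-for-symbol, so much of HIPIFY's work is mechanical renaming. A toy illustration of that idea only; the real hipify-perl/hipify-clang tools handle headers, library calls, and many edge cases this sketch ignores:

```python
import re

# A few real CUDA -> HIP runtime renames; the actual tools cover hundreds.
RENAMES = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source):
    # Whole-word replacement only -- deliberately simplistic.
    pattern = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")
    return pattern.sub(lambda m: RENAMES[m.group(1)], source)

print(toy_hipify("cudaMalloc(&ptr, n); cudaDeviceSynchronize();"))
# -> hipMalloc(&ptr, n); hipDeviceSynchronize();
```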
## 🌐 Community & Learning

- ROCm GitHub Organization — The canonical source for all ROCm components.
- AMD ROCm Blogs — Technical deep-dives from the AMD engineering team.
- AMD Instinct YouTube Channel — Webinars, tutorials, and product talks.
- r/ROCm — Community troubleshooting and news. Good signal-to-noise ratio.
- AMD Developer Forums — Official support forums for Instinct and ROCm.
- ROCm Discord — Real-time community help and dev discussion.
## Contributing

See CONTRIBUTING.md. In short: open source, demonstrably ROCm-compatible, and worth including based on quality or popularity. File an issue or open a PR.
## License

To the extent possible under law, Michael Roy has waived all copyright and related or neighboring rights to this work.
