Description
Motivation
ATOM is a foundational component of AMD's AI inference strategy. It serves as an out-of-tree plugin platform of vLLM for high-performance inference on AMD GPUs. It is built by integrating optimizations from ROCm's high-performance operator library aiter and high-performance communication library mori into the model execution path.
The key motivation for enabling ATOM as an out-of-tree plugin platform of vLLM is to speed up the velocity and efficiency of ATOM's iteration. The ATOM plugin platform can reuse almost all vLLM features and integrate highly optimized model implementations into vLLM with the latest kernels and fusions. This lets ATOM focus on model-level and kernel-level optimizations.
More importantly, our intent is still to prioritize in-tree native operator integration in vLLM, along with sustained upstream contributions. The ATOM platform works as an incubator for these optimizations: once an optimization matures, it will be upstreamed promptly.
Background
ATOM can deliver performance gains with ROCm components through the following:
- Cross-layer / cross-module fusion opportunities (spanning layer boundaries)
- Specific KV cache layouts required by kernels
- New optimized ops and kernel implementations tailored for AMD GPU characteristics
- Communication and runtime optimizations
While ATOM primarily focuses on model-level and kernel-level optimizations, it is designed to integrate with and leverage high-level framework features. Meanwhile, vLLM is the most popular serving framework, with a rich feature set and broad hardware support. Additionally, vLLM has a mature plugin extension mechanism, and many accelerators already use it to work as out-of-tree (OOT) plugin platforms. Given the above points, ATOM plans to work as an OOT plugin platform of vLLM.
By building on this well-designed extension point, ATOM can deliver hardware-aware model-level optimizations while fully respecting vLLM’s separation of concerns and commitment to maintainability. Given vLLM’s position as the de facto standard for high-performance LLM serving, this approach ensures that AMD GPU users can benefit from the latest model-level optimizations without fragmenting the ecosystem or compromising the consistency that vLLM users rely on.
Design Overview
When ATOM is installed, two entry points are installed, strictly following the vLLM convention for registering platforms and models. When the vLLM server is launched, it scans the entry points of all installed Python packages and calls the functions atom.plugin.vllm:register_platform and atom.plugin.vllm:register_model. The former registers the ATOM platform; the latter overrides the models maintained by vLLM. Both registry mechanisms are officially provided by vLLM.
Attention
The AttentionBackend implemented in ATOM follows the vLLM attention convention. It is provided to vLLM through the ATOMPlatform's standard method get_attn_backend_cls.
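A minimal sketch of that hook is shown below. The class name, method signature, and dotted path are illustrative assumptions; vLLM's real Platform interface passes extra parameters (head size, dtype, block size, and so on) that are omitted here.

```python
# Hypothetical sketch: how an OOT platform points vLLM at its attention
# backend. vLLM expects a fully qualified class path string and imports
# the backend class lazily from it.
class ATOMPlatformSketch:
    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        # Assumed module path; ATOM's actual layout may differ.
        return "atom.attention.backend.ATOMAttentionBackend"
```

Returning a string rather than a class keeps the platform module import-light: the backend (and its GPU kernel dependencies) is only imported when vLLM actually selects it.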
vLLM execution flow with ATOM
Current Status
The accuracy-check results for model Qwen235B-FP8 are shown below:
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 3 | exact_match | ↑ | 0.9037 | ± | 0.0081 |
|  |  | strict-match | 3 | exact_match | ↑ | 0.8832 | ± | 0.0088 |
The performance results for model Qwen235B-FP8:

[Performance figure for Qwen235B-FP8]