Description
Motivation
ATOM is a foundational component of AMD's AI inference strategy. It serves as an out-of-tree plugin platform of vLLM for high-performance inference on AMD GPUs. It is built by integrating optimizations from ROCm's high-performance operator library aiter and high-performance communication library mori into the model execution path.
The key motivation for enabling ATOM as an out-of-tree plugin platform of vLLM is to speed up the velocity and efficiency of ATOM's iteration. The ATOM plugin platform can reuse almost all vLLM features and integrate highly optimized model implementations into vLLM with the latest kernels and fusions. This lets ATOM focus on model-level and kernel-level optimizations.
More importantly, our intent is still to prioritize in-tree native operator integration in vLLM, along with sustained upstream contributions. The ATOM platform works as an incubator for these optimizations: once an optimization matures, it will be upstreamed promptly.
Background
ATOM can deliver performance gains with ROCm components through the following:
- Cross-layer / cross-module fusion opportunities (spanning layer boundaries)
- Specific KV cache layouts required by kernels
- New optimized ops and kernel implementations tailored for AMD GPU characteristics
- Communication and runtime optimizations
While ATOM primarily focuses on model-level and kernel-level optimizations, it is designed to integrate with and leverage high-level framework features. Meanwhile, vLLM is the most popular serving framework, with a rich feature set and broad hardware support. Additionally, vLLM has a mature plugin extension mechanism, and many accelerators already use it to work as out-of-tree (OOT) plugin platforms. Given the above points, ATOM plans to work as an OOT plugin platform of vLLM.
By building on this well-designed extension point, ATOM can deliver hardware-aware model-level optimizations while fully respecting vLLM’s separation of concerns and commitment to maintainability. Given vLLM’s position as the de facto standard for high-performance LLM serving, this approach ensures that AMD GPU users can benefit from the latest model-level optimizations without fragmenting the ecosystem or compromising the consistency that vLLM users rely on.
Design Overview
When ATOM is installed, two entry points are installed, strictly following the vLLM convention for registering platforms and models. When the vLLM server is launched, it scans the entry points of all installed Python packages and calls the functions atom.plugin.vllm:register_platform and atom.plugin.vllm:register_model. The former registers the ATOM platform; the latter overrides the models maintained by vLLM. Both registry mechanisms are officially provided by vLLM.
Attention
The AttentionBackend implemented in ATOM follows the vLLM attention convention. It is provided to vLLM through the ATOMPlatform's standard method get_attn_backend_cls.
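A minimal sketch of that hook is shown below. The class name, method signature, and dotted path are illustrative assumptions; vLLM's real Platform interface passes extra parameters (head size, dtype, block size, and so on) that are omitted here.

```python
# Hypothetical sketch: how an OOT platform points vLLM at its attention
# backend. vLLM expects a fully qualified class path string and imports
# the backend class lazily from it.
class ATOMPlatformSketch:
    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        # Assumed module path; ATOM's actual layout may differ.
        return "atom.attention.backend.ATOMAttentionBackend"
```

Returning a string rather than a class keeps the platform module import-light: the backend (and its GPU kernel dependencies) is only imported when vLLM actually selects it.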
vLLM execution flow with ATOM
Current Status
The accuracy-check results for model Qwen235B-FP8 are shown below:
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 3 | exact_match | ↑ | 0.9037 | ± | 0.0081 |
|  |  | strict-match | 3 | exact_match | ↑ | 0.8832 | ± | 0.0088 |
The performance results for model Qwen235B-FP8:

[Performance figure for Qwen235B-FP8]