
[RFC] Enable ATOM as vLLM out-of-tree Platform #201

@zejunchen-zejun

Description


Motivation

ATOM is a foundational component of AMD’s AI inference strategy. It can serve as an out-of-tree plugin platform for vLLM, enabling high-performance inference on AMD GPUs. It is built by integrating optimizations from ROCm’s high-performance operator library aiter and high-performance communication library mori into the model execution path.

The key motivation for enabling ATOM as an out-of-tree plugin platform for vLLM is to increase the velocity and efficiency of ATOM iteration. The ATOM plugin platform can reuse almost all vLLM features and integrate highly optimized model implementations into vLLM with the latest kernels and fusions. This lets ATOM focus on model-level and kernel-level optimizations.

More importantly, our intent is still to prioritize in-tree native operator integration in vLLM, along with sustained upstream contributions. The ATOM platform works as an incubator for optimizations: once an optimization matures, it will be upstreamed immediately.

Background

ATOM can deliver performance gains through the following points, in combination with ROCm components:

  • Cross-layer / cross-module fusion opportunities (spanning layer boundaries)
  • Specific KV Cache layout required by kernel
  • New optimized ops and kernel implementations tailored for AMD GPU characteristics
  • Communication and runtime optimizations

While ATOM primarily focuses on model-level and kernel-level optimizations, it is designed to integrate with and leverage high-level framework features. Meanwhile, vLLM is the most popular serving framework, where many features are developed and many hardware devices are supported. Additionally, vLLM has a mature plugin extension mechanism, and many accelerators already use this mechanism to work as out-of-tree (OOT) plugin platforms. Given these points, ATOM plans to work as an OOT plugin platform of vLLM.
By building on this well-designed extension point, ATOM can deliver hardware-aware model-level optimizations while fully respecting vLLM’s separation of concerns and commitment to maintainability. Given vLLM’s position as the de facto standard for high-performance LLM serving, this approach ensures that AMD GPU users can benefit from the latest model-level optimizations without fragmenting the ecosystem or compromising the consistency that vLLM users rely on.

Design Overview

When ATOM is installed, two entry points are installed as well, strictly following the vLLM convention for registering platforms and models. When the vLLM server is launched, it scans the entry points of all installed Python packages and calls the functions atom.plugin.vllm:register_platform and atom.plugin.vllm:register_model. The former registers the ATOM platform, and the latter overrides the models maintained by vLLM. Both registry mechanisms are officially provided by vLLM.

Attention

The AttentionBackend implemented in ATOM is designed to follow the vLLM attention convention. It is provided to vLLM through the ATOMPlatform standard method get_attn_backend_cls.
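As a rough sketch of this hook (the class body and backend import path below are hypothetical, and the real vLLM Platform method takes additional arguments such as head size, dtype, and block size, which are elided here):

```python
class ATOMPlatform:
    """Hypothetical minimal sketch of a vLLM-style platform class.

    A real implementation subclasses vLLM's Platform base class; the
    signature of get_attn_backend_cls is simplified for illustration.
    """

    @classmethod
    def get_attn_backend_cls(cls, selected_backend=None, **kwargs) -> str:
        # vLLM expects the fully qualified import path of the attention
        # backend class, which it imports lazily when building the model.
        # The path below is a made-up example, not ATOM's actual module.
        return "atom.attention.backends.AtomAttentionBackend"
```

Returning an import path as a string (rather than the class itself) keeps platform registration lightweight: the backend module is only imported on the device that actually selects it.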

vLLM execution flow with ATOM

(Figure: vLLM execution flow with ATOM)

Current Status

The accuracy check results for the model Qwen235B-FP8 are shown below:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 3      | exact_match | 0.9037 | ± 0.0081 |
|       |         | strict-match     | 3      | exact_match | 0.8832 | ± 0.0088 |

The performance results for the model Qwen235B-FP8:
(Figure: performance results for Qwen235B-FP8)

PRs

#126
