/gemini review
Code Review
The pull request successfully introduces a GPU abstraction layer (gpu_vmm.hpp) to support both NVIDIA (CUDA) and AMD (HIP/ROCm) backends. This is a well-structured refactoring that generalizes memory management operations. The use of CppExtension for HIP builds to avoid unnecessary hipify steps is a clever choice for code that is already cross-platform. However, there are a few issues that need attention: the benchmark code was not fully abstracted and will fail to compile on AMD platforms, and there are potential API mismatches in setup.py regarding include_paths and library_paths arguments.
@ivanium @cui36 @shenrunzhang I reviewed the code myself and finished the testing plan. Feel free to take a look and let me know if there are any issues.
Impressive work! Will take a look shortly |
Haven't finished reading the code, but one quick question: what performance do we get on AMD GPUs with the VMM APIs there? Can we also attach some benchmark results (if we already have them)?
Just made the AMD version work. @shenrunzhang will do the benchmark.
Summary
Add AMD GPU (ROCm/HIP) support to kvcached by abstracting over CUDA and HIP at compile time, allowing the same codebase to build and run on both NVIDIA and AMD GPUs.
New GPU abstraction layer (`csrc/inc/gpu_vmm.hpp`)
- New `kvcached::gpu_vmm` namespace that wraps the CUDA Driver API (`cuMem*`) and the HIP VMM API (`hipMem*`) behind a unified interface
- Backend selected at compile time via `-DKVCACHED_USE_CUDA` or `-DKVCACHED_USE_HIP`
- Provides unified types (`allocation_handle_t`, `allocation_prop_t`, `access_desc_t`), error handling, and all VMM operations (reserve, create, map, unmap, set_access, release)

C++ header cleanup
- Replaced `#include <torch/extension.h>` with targeted `c10`/`ATen` headers (`c10::ScalarType`, `c10::Device`, `at::Tensor`) to reduce build times
- Removed `#include <cuda_runtime.h>` / `#include <cuda.h>` from all files except the abstraction layer
- Renamed `cuda_utils.hpp` → `gpu_utils.hpp`; error-checking macros are now routed through `gpu_vmm.hpp`
- Renamed `init_cuda_()` → `init_gpu_()`

Build system (`setup.py`)
- Detects the backend via `torch.version.hip` / `torch.version.cuda`
- HIP builds use `CppExtension` (avoids PyTorch's hipify step, which is unnecessary since the code handles HIP natively)
- CUDA builds use `CUDAExtension` as before
- Links the matching driver library (`-lamdhip64` for HIP, `-lcuda` for CUDA)

Python integration
- Device string handling accepts both `"cuda"` and `"hip"` prefixes
- Updated `page_allocator.py` accordingly

Benchmark (`benchmarks/bench_vmm/`)
- Ported to the `gpu_vmm.hpp` abstraction; now builds for both CUDA and HIP
- Removed the duplicated `cuda_utils.hpp`; uses the main project's `gpu_utils.hpp`
- Build with `make` for CUDA, `make KVCACHED_BACKEND=hip` for ROCm

Test Plan
- Build with the CUDA backend (`-DKVCACHED_USE_CUDA`) on an NVIDIA GPU and run two vLLMs with kvcached
- Build with the HIP backend (`-DKVCACHED_USE_HIP`) on an AMD GPU (MI300X) and run two vLLMs with kvcached
- Build the benchmark for ROCm (`make KVCACHED_BACKEND=hip`)
- Build the benchmark for CUDA (`make`)
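The backend selection in `setup.py` described in the summary can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the helper `pick_backend` and its return format are hypothetical, while the extension classes, macros, and linked libraries are the ones named above.

```python
# Hypothetical sketch of the setup.py backend selection. `pick_backend` and
# its return dict are illustrative names; in the real setup.py the inputs
# would come from torch.version.hip / torch.version.cuda.

def pick_backend(hip_version, cuda_version):
    """Return build settings for the detected GPU backend."""
    if hip_version is not None:
        # ROCm: a plain CppExtension skips PyTorch's hipify step, which is
        # unnecessary because the code handles HIP natively via gpu_vmm.hpp.
        return {
            "extension": "CppExtension",
            "define_macros": [("KVCACHED_USE_HIP", None)],
            "libraries": ["amdhip64"],  # -lamdhip64
        }
    if cuda_version is not None:
        # NVIDIA: keep using CUDAExtension as before.
        return {
            "extension": "CUDAExtension",
            "define_macros": [("KVCACHED_USE_CUDA", None)],
            "libraries": ["cuda"],  # -lcuda (CUDA Driver API)
        }
    raise RuntimeError("PyTorch was built without CUDA or ROCm support")


if __name__ == "__main__":
    print(pick_backend("6.2.41133", None)["extension"])  # CppExtension
    print(pick_backend(None, "12.4")["extension"])       # CUDAExtension
```

Checking `torch.version.hip` first matters: ROCm builds of PyTorch also populate a CUDA-compatible surface, so HIP must take precedence when both checks could fire.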
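The device-string handling mentioned under "Python integration" might look roughly like this minimal sketch; `normalize_device` is a hypothetical helper for illustration, not code from the PR.

```python
# Hypothetical helper illustrating device-string handling that accepts both
# "cuda" and "hip" prefixes, as described under "Python integration".

def normalize_device(device: str) -> tuple[str, int]:
    """Split a device string such as "cuda:0" or "hip:1" into (backend, index)."""
    for prefix in ("cuda", "hip"):
        if device == prefix:
            return prefix, 0  # bare prefix defaults to device 0
        if device.startswith(prefix + ":"):
            return prefix, int(device.split(":", 1)[1])
    raise ValueError(f"unsupported device string: {device!r}")


if __name__ == "__main__":
    print(normalize_device("cuda:0"))  # ('cuda', 0)
    print(normalize_device("hip:1"))   # ('hip', 1)
```

Accepting both prefixes lets the Python layer stay backend-agnostic while the compiled extension picks the real implementation at build time.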