This repository contains the supporting code for our paper Understanding GPU Resource Interference One Level Deeper. It provides an open-source framework for measuring the interference experienced by LLM token generation when colocated with a suite of CUDA kernels designed to stress specific GPU subsystems. The repository builds on a fork of vLLM (vllm-project, as of May 2025).
The most relevant components added or modified relative to the upstream vLLM project are:
```
.
├── inter_sm                        # Experiments for inter-SM interference (paper Section 4.3)
├── intra_sm                        # Experiments for intra-SM interference (paper Sections 4.2 and 4.4)
├── vllm
│   ├── interference                # CUDA benchmark suite and helper code for spatial colocation
│   ├── v1
│   │   ├── worker
│   │   │   ├── gpu_model_runner.py # Coordinates interference kernels with vLLM during token generation
├── main.py                         # Entrypoint to launch the paper experiments
```
The experiments require the following dependencies:
- cmake >= 3.22
- C++17 or later
- CUDA >= 12.9 (recommended; earlier versions may fail when profiling LLMs with Nsight Compute, see the discussion in the corresponding NVIDIA forum thread)
1. Create and activate a virtual environment, then install the repository (library versions are pinned in requirements.txt):

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   VLLM_USE_PRECOMPILED=1 python3 -m pip install --editable .
   ```

2. Configure the GPU architecture. Update the compute capability in vllm/interference/CMakeLists.txt:

   ```cmake
   set(CMAKE_CUDA_ARCHITECTURES 90)
   ```

3. Build the CUDA benchmark suite:

   ```bash
   cd vllm/interference
   mkdir -p build && cd build
   cmake .. && make
   ```

This produces the shared library libinterference_kernels.so, which is required to run the experiments.
The experiments modify each model’s config.json and expect models to be present in the HuggingFace cache under ./.cache/huggingface.
We recommend creating an .env file containing your authentication token and an explicit cache directory. This file is automatically loaded by main.py.
Example .env:
```
HF_TOKEN=hf_...
HF_HOME=~/vllm_profile/.cache/huggingface
```

Download models prior to running the experiments:

```bash
python3 main.py --model gemma-1b
```

Each experiment folder (inter_sm/ and intra_sm/) contains its own README with detailed instructions.
Navigate to the desired experiment folder and follow the steps provided there.
To measure interference during token generation, we make several modifications to the vLLM execution pipeline.
We create two CUDA streams:
- The first stream is used for the vLLM engine.
- The second stream is used for the interference kernels.
Streams are created in vllm/interference/interference_kernels.cu and may be associated with CUDA Green Contexts, depending on the configuration:
extern "C" GlobalContextInfo* init(bool use_gcontext)
{
GlobalContextInfo* ginfo = (GlobalContextInfo*)malloc(sizeof(GlobalContextInfo));
if (use_gcontext) {
initialize_green_contexts(ginfo);
ginfo->has_gcontext = true;
}
else {
CUDACHECK(cudaStreamCreateWithFlags(&(ginfo->streams[0]), cudaStreamNonBlocking));
CUDACHECK(cudaStreamCreateWithFlags(&(ginfo->streams[1]), cudaStreamNonBlocking));
ginfo->has_gcontext = false;
}
return ginfo;
}CUDA Green Contexts are used in the inter-SM experiments to ensure that the two streams run on disjoint sets of SMs. Contrary to common belief, CUDA MPS does not guarantee mutual exclusion of SMs between clients. It only limits the maximum number of SMs available to a client.
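For reference, a minimal sketch of how this entry point could be driven from Python via ctypes; the wrapper classes in vllm/interference/inter_funcs.py are assumed to load the library in a similar way, and the library path below is only an example:

```python
import ctypes

# Example path: point this at the library produced by the build step.
lib = ctypes.CDLL("vllm/interference/build/libinterference_kernels.so")

# init(bool use_gcontext) returns an opaque pointer to a GlobalContextInfo.
lib.init.argtypes = [ctypes.c_bool]
lib.init.restype = ctypes.c_void_p

# False -> two plain non-blocking streams; True -> streams backed by Green Contexts.
ginfo = lib.init(False)
```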
Interference kernels are wrapped as Python classes in vllm/interference/inter_funcs.py.
Each experiment provides a config.json containing:
- Initialization arguments for the wrapper class (init_args)
- Launch parameters (run_args)
- Additional experiment metadata
Example (IPC experiment):

```
# config.json file for the IPC experiment
{
    "kernel_name": "ComputeKernel",
    "init_args": {
        "shared_lib": "/home/elpaul/vllm_profile/vllm/interference/build/libinterference_kernels.so",
        "num_floats": 128,
        "use_gcontexts": false
    },
    "run_args": {
        "num_tb": 82,
        "num_threads_per_block": 128,
        "num_itrs": 7000000,
        "kernel_name": "fma_fp32_ilp4"
    },
    "num_requests": 0,
    "inter_sm": false,
    "intra_sm": true
}
```

In the execute_model function, CUDA events and thread barriers synchronize the two streams. This ensures that one interference kernel is launched at the beginning of each decode iteration, allowing per-token interference effects to be measured.
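The exact coordination logic lives in vllm/v1/worker/gpu_model_runner.py; the following is only a simplified sketch of the pattern using PyTorch streams and events, where launch_interference_kernel and run_model_forward are hypothetical placeholders for the real calls:

```python
import threading
import torch

engine_stream = torch.cuda.Stream()        # stream used by the vLLM engine
interference_stream = torch.cuda.Stream()  # stream used by the interference kernels
barrier = threading.Barrier(2)             # one party per thread

def interference_worker(launch_interference_kernel, num_decode_iterations):
    # Hypothetical helper: enqueues one benchmark kernel on the interference stream.
    for _ in range(num_decode_iterations):
        barrier.wait()  # wait until the engine thread starts a decode iteration
        with torch.cuda.stream(interference_stream):
            launch_interference_kernel()

def decode_step(run_model_forward):
    # Hypothetical helper: runs one vLLM decode iteration on the engine stream.
    barrier.wait()  # release the interference thread for this iteration
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.cuda.stream(engine_stream):
        start.record(engine_stream)
        run_model_forward()
        end.record(engine_stream)
    end.synchronize()
    return start.elapsed_time(end)  # per-token latency (ms) under interference
```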
If you use our benchmarks or profiling methodology, or if you reproduce our experiments, please cite our paper:
```bibtex
@inproceedings{elvinger2025gpuinterf,
  author    = {Elvinger, Paul and Strati, Foteini and Jerger, Natalie Enright and Klimovic, Ana},
  title     = {Understanding GPU Resource Interference One Level Deeper},
  year      = {2026},
  isbn      = {9798400722769},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3772052.3772270},
  doi       = {10.1145/3772052.3772270},
  booktitle = {Proceedings of the 2025 ACM Symposium on Cloud Computing},
  pages     = {687–694},
  numpages  = {8},
  keywords  = {GPU interference, GPU utilization},
  series    = {SoCC '25}
}
```