Understanding GPU Resource Interference One Level Deeper

This repository contains the supporting code for our paper Understanding GPU Resource Interference One Level Deeper. It provides an open-source framework for measuring the interference experienced by LLM token generation when colocated with a suite of CUDA kernels designed to stress specific GPU subsystems. The repository builds on a fork of the vllm-project (as of May 2025).

Repository Structure

The most relevant components added or modified relative to the upstream vLLM project are:

.
├── inter_sm                              # Experiments for inter-SM interference (paper section 4.3)
├── intra_sm                              # Experiments for intra-SM interference (paper sections 4.2 and 4.4)
├── vllm
│   ├── interference                      # CUDA benchmark suite and helper code for spatial colocation
│   ├── v1
│   │   ├── worker
│   │   │   ├── gpu_model_runner.py       # Coordinates interference kernels with vLLM during token generation
├── main.py                               # Entrypoint to launch the paper experiments

Requirements and Installation

Prerequisites

The experiments require the following dependencies:

  • cmake >= 3.22
  • C++17 or later
  • CUDA >= 12.9 recommended; earlier versions may fail when profiling LLMs with Nsight Compute (see the discussion in this NVIDIA forum thread)

Installation

1. Create and activate a virtual environment, then install the repository (library versions are pinned in requirements.txt):

python3 -m venv venv
source venv/bin/activate
VLLM_USE_PRECOMPILED=1 python3 -m pip install --editable .

2. Configure the GPU architecture. Update the compute capability in vllm/interference/CMakeLists.txt:

set(CMAKE_CUDA_ARCHITECTURES 90)

3. Build the CUDA benchmark suite:

cd vllm/interference
mkdir -p build && cd build
cmake .. && make

This produces the shared library libinterference_kernels.so, which is required to run the experiments.
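
As a quick sanity check that the build produced a loadable library, you can, for example, dlopen it from a small standalone program. This helper is not part of the repository; the path simply points at the build output above.

// check_build.cpp -- hypothetical sanity check, not part of the repository.
// Compile with: g++ check_build.cpp -ldl -o check_build
#include <cstdio>
#include <dlfcn.h>

int main() {
    // Path to the shared library produced by the build step above.
    void* handle = dlopen("./libinterference_kernels.so", RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    std::printf("libinterference_kernels.so loaded\n");
    dlclose(handle);
    return 0;
}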

HuggingFace Configuration and Model Downloads

The experiments modify each model's config.json and expect models to be present in the HuggingFace cache under ./.cache/huggingface. We recommend creating a .env file containing your authentication token and an explicit cache directory; this file is automatically loaded by main.py.

Example .env:

HF_TOKEN=hf_...
HF_HOME=~/vllm_profile/.cache/huggingface

Download models prior to running experiments:

python3 main.py --model gemma-1b

Running Experiments

Each experiment folder (inter_sm/ and intra_sm/) contains its own README with detailed instructions. Navigate to the desired experiment folder and follow the steps provided there.

Modifications Made to vLLM

To measure interference during token generation, we make several modifications to the vLLM execution pipeline.

Separate CUDA Streams for vLLM and Interference Kernels

We create two CUDA streams: one for vLLM's token generation and one for the interference kernels.

Streams are created in vllm/interference/interference_kernels.cu and may be associated with CUDA green contexts, depending on configuration.

extern "C" GlobalContextInfo* init(bool use_gcontext)
{
    GlobalContextInfo* ginfo = (GlobalContextInfo*)malloc(sizeof(GlobalContextInfo));
    if (use_gcontext) {
        // Inter-SM experiments: back the two streams with green contexts so
        // they execute on disjoint sets of SMs.
        initialize_green_contexts(ginfo);
        ginfo->has_gcontext = true;
    }
    else {
        // Intra-SM experiments: two ordinary non-blocking streams that share
        // the whole device.
        CUDACHECK(cudaStreamCreateWithFlags(&(ginfo->streams[0]), cudaStreamNonBlocking));
        CUDACHECK(cudaStreamCreateWithFlags(&(ginfo->streams[1]), cudaStreamNonBlocking));
        ginfo->has_gcontext = false;
    }
    return ginfo;
}

CUDA Green Contexts are used in the inter-SM experiments to ensure that the two streams run on disjoint sets of SMs. Contrary to common belief, CUDA MPS does not guarantee mutual exclusion of SMs between clients. It only limits the maximum number of SMs available to a client.
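
For illustration only, the following is a minimal sketch of how two streams can be pinned to disjoint SM sets with the CUDA 12.4+ green-context driver API; the repository's initialize_green_contexts may be structured differently, and error checking is omitted.

#include <cuda.h>

// Sketch (assumption, not the repository's code): split the device's SMs into
// two disjoint groups and create one stream per group via green contexts.
void create_disjoint_streams(CUdevice dev, unsigned int sm_count_first,
                             CUstream* stream_a, CUstream* stream_b) {
    // Query the full SM resource of the device.
    CUdevResource all_sms;
    cuDeviceGetDevResource(dev, &all_sms, CU_DEV_RESOURCE_TYPE_SM);

    // Carve out one group of (at least) sm_count_first SMs; the driver may
    // round up to its SM granularity. The rest ends up in `remaining`.
    CUdevResource group, remaining;
    unsigned int nb_groups = 1;
    cuDevSmResourceSplitByCount(&group, &nb_groups, &all_sms, &remaining,
                                0, sm_count_first);

    // Wrap each SM set in a resource descriptor and a green context.
    CUdevResourceDesc desc_a, desc_b;
    cuDevResourceGenerateDesc(&desc_a, &group, 1);
    cuDevResourceGenerateDesc(&desc_b, &remaining, 1);

    CUgreenCtx gctx_a, gctx_b;
    cuGreenCtxCreate(&gctx_a, desc_a, dev, CU_GREEN_CTX_DEFAULT_STREAM);
    cuGreenCtxCreate(&gctx_b, desc_b, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Streams created from different green contexts execute on disjoint SMs.
    cuGreenCtxStreamCreate(stream_a, gctx_a, CU_STREAM_NON_BLOCKING, 0);
    cuGreenCtxStreamCreate(stream_b, gctx_b, CU_STREAM_NON_BLOCKING, 0);
}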

Configuring Interference Kernels

Interference kernels are wrapped as Python classes in vllm/interference/inter_funcs.py. Each experiment provides a config.json containing:

  • Initialization arguments for the wrapper class (init_args)
  • Launch parameters (run_args)
  • Additional experiment metadata

Example (IPC experiment):

# config.json file for the IPC experiment
{
  "kernel_name": "ComputeKernel",
  "init_args": {
    "shared_lib": "/home/elpaul/vllm_profile/vllm/interference/build/libinterference_kernels.so",
    "num_floats": 128,
    "use_gcontexts": false
  },
  "run_args": {
    "num_tb": 82,
    "num_threads_per_block": 128,
    "num_itrs": 7000000,
    "kernel_name": "fma_fp32_ilp4"
  },
  "num_requests": 0,
  "inter_sm": false,
  "intra_sm": true
}
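
For intuition, the run_args above map onto a plain CUDA launch: num_tb and num_threads_per_block become the grid and block dimensions, and num_itrs controls how long the kernel occupies the SMs. The kernel body below is only a hypothetical sketch of what an FP32 FMA kernel with 4-way instruction-level parallelism might look like, not the repository's implementation.

// Hypothetical sketch of a compute-bound interference kernel; the actual
// fma_fp32_ilp4 in the benchmark suite may differ.
__global__ void fma_fp32_ilp4(float* out, int num_itrs) {
    // Four independent accumulators give 4-way ILP on the FP32 FMA pipes
    // without generating any memory traffic inside the loop.
    float a0 = 1.0f, a1 = 2.0f, a2 = 3.0f, a3 = 4.0f;
    const float b = 0.999f, c = 1e-6f;
    for (int i = 0; i < num_itrs; ++i) {
        a0 = fmaf(a0, b, c);
        a1 = fmaf(a1, b, c);
        a2 = fmaf(a2, b, c);
        a3 = fmaf(a3, b, c);
    }
    // Store the result so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a0 + a1 + a2 + a3;
}

// Launched with the run_args shown above (stream and buffer are placeholders):
// fma_fp32_ilp4<<<num_tb, num_threads_per_block, 0, interference_stream>>>(out, num_itrs);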

Synchronization with CUDA Events and Barriers

In the execute_model function, CUDA events and thread barriers synchronize the two streams. This ensures that one interference kernel is launched at the beginning of each decode iteration, allowing measurement of per-token interference effects.
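
The Python-side thread barrier is not shown here, but the stream-level pattern looks roughly like the following sketch (placeholder names such as vllm_stream, interference_stream, and run_decode_step are assumptions, not the repository's identifiers):

// Sketch of the per-iteration synchronization pattern (illustrative only).
cudaEvent_t decode_start;
cudaEventCreateWithFlags(&decode_start, cudaEventDisableTiming);

for (int step = 0; step < num_decode_steps; ++step) {
    // Mark the beginning of this decode iteration on the vLLM stream.
    cudaEventRecord(decode_start, vllm_stream);

    // The interference stream waits for that point and then launches exactly
    // one interference kernel that overlaps with this token's generation.
    cudaStreamWaitEvent(interference_stream, decode_start, 0);
    interference_kernel<<<grid, block, 0, interference_stream>>>(/* args */);

    // vLLM issues the decode step on its own stream.
    run_decode_step(vllm_stream);  // hypothetical placeholder
}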

Citation

If you use our benchmarks, profiling methodology, or reproduce our experiments, please cite our paper.

@inproceedings{elvinger2025gpuinterf,
  author = {Elvinger, Paul and Strati, Foteini and Jerger, Natalie Enright and Klimovic, Ana},
  title = {Understanding GPU Resource Interference One Level Deeper},
  year = {2026},
  isbn = {9798400722769},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3772052.3772270},
  doi = {10.1145/3772052.3772270},
  booktitle = {Proceedings of the 2025 ACM Symposium on Cloud Computing},
  pages = {687–694},
  numpages = {8},
  keywords = {GPU interference, GPU utilization},
  series = {SoCC '25}
}
