Understanding GPU Resource Interference One Level Deeper

This repository contains the supporting code for our paper Understanding GPU Resource Interference One Level Deeper. It provides an open-source framework for measuring the interference experienced by LLM token generation when colocated with a suite of CUDA kernels designed to stress specific GPU subsystems. The repository builds on a fork of the vllm-project (as of May 2025).

Repository Structure

The most relevant components added or modified relative to the upstream vLLM project are:

.
├── inter_sm                              # Experiments for inter-SM interference (paper section 4.3)
├── intra_sm                              # Experiments for intra-SM interference (paper sections 4.2 and 4.4)
├── vllm
│   ├── interference                      # CUDA benchmark suite and helper code for spatial colocation
│   ├── v1
│   │   ├── worker
│   │   │   ├── gpu_model_runner.py       # Coordinates interference kernels with vLLM during token generation
├── main.py                               # Entrypoint to launch the paper experiments

Requirements and Installation

Prerequisites

The experiments require the following dependencies:

  • cmake >= 3.22
  • C++17 or later
  • CUDA >= 12.9 recommended; earlier versions may fail when profiling LLMs with Nsight Compute (see the discussion in this NVIDIA forum thread)

Installation

1. Create and activate a virtual environment, then install the repository (library versions are pinned in requirements.txt):

python3 -m venv venv
source venv/bin/activate
VLLM_USE_PRECOMPILED=1 python3 -m pip install --editable .

2. Configure the GPU architecture. Update the compute capability in vllm/interference/CMakeLists.txt:

set(CMAKE_CUDA_ARCHITECTURES 90)

3. Build the CUDA benchmark suite:

cd vllm/interference
mkdir -p build && cd build
cmake .. && make

This produces the shared library libinterference_kernels.so, which is required to run the experiments.
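
As a quick sanity check that the build produced a loadable library, you can, for example, dlopen it from a small standalone program. This helper is not part of the repository; the path simply points at the build output above.

// check_build.cpp -- hypothetical sanity check, not part of the repository.
// Compile with: g++ check_build.cpp -ldl -o check_build
#include <cstdio>
#include <dlfcn.h>

int main() {
    // Path to the shared library produced by the build step above.
    void* handle = dlopen("./libinterference_kernels.so", RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    std::printf("libinterference_kernels.so loaded\n");
    dlclose(handle);
    return 0;
}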

HuggingFace Configuration and Model Downloads

The experiments modify each model's config.json and expect models to be present in the HuggingFace cache under ./.cache/huggingface. We recommend creating a .env file containing your authentication token and an explicit cache directory; this file is automatically loaded by main.py.

Example .env:

HF_TOKEN=hf_...
HF_HOME=~/vllm_profile/.cache/huggingface

Download models prior to running experiments:

python3 main.py --model gemma-1b

Running Experiments

Each experiment folder (inter_sm/ and intra_sm/) contains its own README with detailed instructions. Navigate to the desired experiment folder and follow the steps provided there.

Modifications Made to vLLM

To measure interference during token generation, we make several modifications to the vLLM execution pipeline.

Separate CUDA Streams for vLLM and Interference Kernels

We create two CUDA streams: one for vLLM's token generation and one for the interference kernels.

Streams are created in vllm/interference/interference_kernels.cu and may be associated with CUDA green contexts, depending on configuration.

extern "C" GlobalContextInfo* init(bool use_gcontext)
{
    GlobalContextInfo* ginfo = (GlobalContextInfo*)malloc(sizeof(GlobalContextInfo));
    if (use_gcontext) {
        // Inter-SM experiments: back the two streams with green contexts so
        // they execute on disjoint sets of SMs.
        initialize_green_contexts(ginfo);
        ginfo->has_gcontext = true;
    }
    else {
        // Intra-SM experiments: two ordinary non-blocking streams that share
        // the whole device.
        CUDACHECK(cudaStreamCreateWithFlags(&(ginfo->streams[0]), cudaStreamNonBlocking));
        CUDACHECK(cudaStreamCreateWithFlags(&(ginfo->streams[1]), cudaStreamNonBlocking));
        ginfo->has_gcontext = false;
    }
    return ginfo;
}

CUDA Green Contexts are used in the inter-SM experiments to ensure that the two streams run on disjoint sets of SMs. Contrary to common belief, CUDA MPS does not guarantee mutual exclusion of SMs between clients. It only limits the maximum number of SMs available to a client.
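
For illustration only, the following is a minimal sketch of how two streams can be pinned to disjoint SM sets with the CUDA 12.4+ green-context driver API; the repository's initialize_green_contexts may be structured differently, and error checking is omitted.

#include <cuda.h>

// Sketch (assumption, not the repository's code): split the device's SMs into
// two disjoint groups and create one stream per group via green contexts.
void create_disjoint_streams(CUdevice dev, unsigned int sm_count_first,
                             CUstream* stream_a, CUstream* stream_b) {
    // Query the full SM resource of the device.
    CUdevResource all_sms;
    cuDeviceGetDevResource(dev, &all_sms, CU_DEV_RESOURCE_TYPE_SM);

    // Carve out one group of (at least) sm_count_first SMs; the driver may
    // round up to its SM granularity. The rest ends up in `remaining`.
    CUdevResource group, remaining;
    unsigned int nb_groups = 1;
    cuDevSmResourceSplitByCount(&group, &nb_groups, &all_sms, &remaining,
                                0, sm_count_first);

    // Wrap each SM set in a resource descriptor and a green context.
    CUdevResourceDesc desc_a, desc_b;
    cuDevResourceGenerateDesc(&desc_a, &group, 1);
    cuDevResourceGenerateDesc(&desc_b, &remaining, 1);

    CUgreenCtx gctx_a, gctx_b;
    cuGreenCtxCreate(&gctx_a, desc_a, dev, CU_GREEN_CTX_DEFAULT_STREAM);
    cuGreenCtxCreate(&gctx_b, desc_b, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Streams created from different green contexts execute on disjoint SMs.
    cuGreenCtxStreamCreate(stream_a, gctx_a, CU_STREAM_NON_BLOCKING, 0);
    cuGreenCtxStreamCreate(stream_b, gctx_b, CU_STREAM_NON_BLOCKING, 0);
}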

Configuring Interference Kernels

Interference kernels are wrapped as Python classes in vllm/interference/inter_funcs.py. Each experiment provides a config.json containing:

  • Initialization arguments for the wrapper class (init_args)
  • Launch parameters (run_args)
  • Additional experiment metadata

Example (IPC experiment):

# config.json file for the IPC experiment
{
  "kernel_name": "ComputeKernel",
  "init_args": {
    "shared_lib": "/home/elpaul/vllm_profile/vllm/interference/build/libinterference_kernels.so",
    "num_floats": 128,
    "use_gcontexts": false
  },
  "run_args": {
    "num_tb": 82,
    "num_threads_per_block": 128,
    "num_itrs": 7000000,
    "kernel_name": "fma_fp32_ilp4"
  },
  "num_requests": 0,
  "inter_sm": false,
  "intra_sm": true
}
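
For intuition, the run_args above map onto a plain CUDA launch: num_tb and num_threads_per_block become the grid and block dimensions, and num_itrs controls how long the kernel occupies the SMs. The kernel body below is only a hypothetical sketch of what an FP32 FMA kernel with 4-way instruction-level parallelism might look like, not the repository's implementation.

// Hypothetical sketch of a compute-bound interference kernel; the actual
// fma_fp32_ilp4 in the benchmark suite may differ.
__global__ void fma_fp32_ilp4(float* out, int num_itrs) {
    // Four independent accumulators give 4-way ILP on the FP32 FMA pipes
    // without generating any memory traffic inside the loop.
    float a0 = 1.0f, a1 = 2.0f, a2 = 3.0f, a3 = 4.0f;
    const float b = 0.999f, c = 1e-6f;
    for (int i = 0; i < num_itrs; ++i) {
        a0 = fmaf(a0, b, c);
        a1 = fmaf(a1, b, c);
        a2 = fmaf(a2, b, c);
        a3 = fmaf(a3, b, c);
    }
    // Store the result so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a0 + a1 + a2 + a3;
}

// Launched with the run_args shown above (stream and buffer are placeholders):
// fma_fp32_ilp4<<<num_tb, num_threads_per_block, 0, interference_stream>>>(out, num_itrs);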

Synchronization with CUDA Events and Barriers

In the execute_model function, CUDA events and thread barriers synchronize the two streams. This ensures that one interference kernel is launched at the beginning of each decode iteration, allowing measurement of per-token interference effects.
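
The Python-side thread barrier is not shown here, but the stream-level pattern looks roughly like the following sketch (placeholder names such as vllm_stream, interference_stream, and run_decode_step are assumptions, not the repository's identifiers):

// Sketch of the per-iteration synchronization pattern (illustrative only).
cudaEvent_t decode_start;
cudaEventCreateWithFlags(&decode_start, cudaEventDisableTiming);

for (int step = 0; step < num_decode_steps; ++step) {
    // Mark the beginning of this decode iteration on the vLLM stream.
    cudaEventRecord(decode_start, vllm_stream);

    // The interference stream waits for that point and then launches exactly
    // one interference kernel that overlaps with this token's generation.
    cudaStreamWaitEvent(interference_stream, decode_start, 0);
    interference_kernel<<<grid, block, 0, interference_stream>>>(/* args */);

    // vLLM issues the decode step on its own stream.
    run_decode_step(vllm_stream);  // hypothetical placeholder
}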

Citation

If you use our benchmarks, profiling methodology, or reproduce our experiments, please cite our paper.

@inproceedings{elvinger2025gpuinterf,
  author = {Elvinger, Paul and Strati, Foteini and Jerger, Natalie Enright and Klimovic, Ana},
  title = {Understanding GPU Resource Interference One Level Deeper},
  year = {2026},
  isbn = {9798400722769},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3772052.3772270},
  doi = {10.1145/3772052.3772270},
  booktitle = {Proceedings of the 2025 ACM Symposium on Cloud Computing},
  pages = {687–694},
  numpages = {8},
  keywords = {GPU interference, GPU utilization},
  series = {SoCC '25}
}
