
feat: add GPU-enabled self-hosted runners for vLLM recording#5297

Draft
cdoern wants to merge 7 commits into llamastack:main from cdoern:feat/gpu-runners-vllm-recording

Conversation


@cdoern cdoern commented Mar 25, 2026

Summary

Add support for recording vLLM integration tests on GPU-enabled EC2 instances with the 20B-parameter gpt-oss:20b model. This enables testing larger models that don't fit on standard CPU runners while keeping costs in check.

Key Features:

  • OIDC authentication - No long-lived AWS credentials stored in GitHub
  • Multi-region/AZ fallback - High availability across 9 availability zones (us-east-2 and us-east-1)
  • Security hardened - Test jobs run with zero permissions to prevent credential theft
  • Always-cleanup - EC2 instances terminated even on failure or cancellation
  • Cost efficient - ~$0.43 per run on g6.2xlarge
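
As a rough sketch of how the OIDC piece fits together, a launch job would request an ID token and exchange it for short-lived AWS credentials via `aws-actions/configure-aws-credentials`, using the `AWS_ROLE_ARN` secret named later in this PR. The action version and job name here are illustrative, not the PR's exact workflow:

```yaml
permissions:
  id-token: write   # lets the job request a GitHub OIDC token
  contents: read

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-2
```

The test job itself would declare `permissions: {}`, matching the "zero permissions" hardening above, so a compromised test cannot reach the cloud credentials.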

Components:

  • .github/workflows/record-vllm-gpu-tests.yml - Main workflow with manual trigger
  • .github/actions/launch-gpu-runner/ - EC2 instance launcher (wraps machulav/ec2-github-runner)
  • .github/actions/setup-vllm-gpu/ - vLLM GPU installation and server setup with AWQ quantization
  • Test configurations for vllm-gpu-gpt-oss setup

AWS Setup Required

This PR includes all code but requires AWS infrastructure setup before it can be used:

  1. OIDC Provider + IAM Role - See AWS_SETUP_GUIDE.md Step 1
  2. VPC and Networking - Subnets, security groups in 2 regions
  3. GPU-Enabled AMI - CUDA 12.4, Docker, NVIDIA Container Toolkit
  4. GitHub Secrets/Variables - Configure AWS_ROLE_ARN, subnet IDs, AMI IDs, etc.
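
For Step 1, the IAM role's trust policy typically takes the following shape; the account ID and repo slug below are placeholders, and the exact conditions should follow AWS_SETUP_GUIDE.md:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:<ORG>/<REPO>:*"
        }
      }
    }
  ]
}
```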

See IMPLEMENTATION_STATUS.md for detailed checklist.

Test Plan

Once AWS infrastructure is set up:

  • Trigger workflow via Actions tab > vLLM GPU Recording > Run workflow
  • Verify EC2 instance launches in us-east-2
  • Verify GPU detected and vLLM server starts with gpt-oss:20b
  • Verify tests run successfully in record mode
  • Verify recordings uploaded as artifacts
  • Verify EC2 instance terminated automatically
  • Run 5-10 times to validate reliability
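
The "vLLM server starts" check above can be scripted as a simple polling loop. vLLM's OpenAI-compatible server exposes a `/health` endpoint; the port and timeout values here are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Poll a URL until it returns success or a timeout elapses.
wait_for_url() {
  local url=$1 timeout=${2:-600} interval=${3:-10} elapsed=0
  while (( elapsed < timeout )); do
    if curl -sf "$url" > /dev/null; then
      echo "server ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# usage (placeholder port/path): wait_for_url "http://localhost:8000/health" 600 10
```

A generous timeout matters here: loading a 20B model onto the GPU can take several minutes before the endpoint answers.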

Documentation

  • User Guide: docs/gpu-runners.md - How to trigger workflows, troubleshooting, cost estimates
  • AWS Setup: AWS_SETUP_GUIDE.md - Step-by-step infrastructure setup with OIDC
  • Implementation Plan: IMPLEMENTATION_PLAN.md - Roadmap and future optimizations
  • Status Tracker: IMPLEMENTATION_STATUS.md - Current status and next steps

References

cdoern and others added 4 commits March 25, 2026 14:09
Add GitHub Actions workflow and custom actions to support recording vLLM
integration tests on GPU-enabled EC2 instances with gpt-oss:20b model.

Key features:
- OIDC authentication for AWS (no long-lived credentials)
- Multi-region/AZ fallback for high availability (9 AZs across us-east-2 and us-east-1)
- Security hardened with zero permissions on test jobs
- Always-cleanup guarantee to prevent orphaned instances
- Support for multiple models and instance types via workflow_dispatch

Components:
- record-vllm-gpu-tests.yml: Main workflow with 3-job pattern (launch → test → cleanup)
- launch-gpu-runner: Wrapper action for machulav/ec2-github-runner
- setup-vllm-gpu: Installs vLLM with CUDA support and starts server with AWQ quantization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add test suite and CI matrix configuration for GPU-based vLLM testing with
gpt-oss:20b model. This enables recording integration tests on GPU runners.

Changes:
- Add vllm-gpu-gpt-oss setup in suites.py with gpt-oss:20b model
- Add gpu-vllm matrix in ci_matrix.json for base, responses, and reasoning suites

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add comprehensive documentation for using GPU-enabled self-hosted runners.

- gpu-runners.md: User guide covering quick start, architecture, troubleshooting,
  cost estimates, and performance tuning
- AWS_SETUP_GUIDE.md: Step-by-step instructions for setting up AWS infrastructure
  with OIDC authentication, VPC/networking, GPU AMI creation, and GitHub configuration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add comprehensive implementation plan and status tracking documents for GPU runners.

- IMPLEMENTATION_PLAN.md: 4-phase roadmap with tasks, time estimates, dependencies,
  and success metrics
- IMPLEMENTATION_STATUS.md: Current status tracker with completed tasks, AWS setup
  requirements, and next actions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 25, 2026

cdoern commented Mar 25, 2026

(I will remove the annoying MD files once this is ready.)

- gpt-oss:20b
- gpt-oss:latest
- Qwen/Qwen3-0.6B
instance_type:

@iamemilio iamemilio Mar 25, 2026


It might be a little easier to keep a mapping of instance types and fallbacks for each supported model. That way users don't need to know or configure this when setting jobs up, and we can also take into account the different resource requirements of the models. For example, Qwen3 0.6B doesn't need an L4; it can probably run on a g4dn.xlarge and do reasonably well. To make your life easier for v1 of this PR, it's probably easier to support just one model for now.
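
The suggested mapping could be a small lookup keyed by model, returning an ordered list of instance types to try. The pairings below are illustrative guesses, not benchmarked choices:

```shell
#!/usr/bin/env bash
# Hypothetical model -> ordered instance-type fallback list.
instance_types_for_model() {
  case "$1" in
    "gpt-oss:20b")     echo "g6.2xlarge g5.2xlarge" ;;  # ~20B params: L4/A10G-class GPU
    "Qwen/Qwen3-0.6B") echo "g4dn.xlarge" ;;            # small model: a T4 suffices
    *) echo "no instance mapping for model: $1" >&2; return 1 ;;
  esac
}

# usage: for itype in $(instance_types_for_model "gpt-oss:20b"); do ... ; done
```

With this in place the workflow input can be just the model name, and the launcher iterates the returned types until one has capacity.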

- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Select region and AZ with fallback

@iamemilio iamemilio Mar 25, 2026


You may want to move this into a script so that it's testable. I don't think it's doing what we think it is: I'm pretty sure this only ever uses us-east-2, because the index into CONFIGS is hardcoded to 0:

IFS='|' read -r REGION AZ SUBNET AMI SG <<< "${CONFIGS[0]}"
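
A fallback loop along these lines would actually try each config in order; the CONFIGS entries and the `try_launch` helper below are placeholders standing in for the real EC2 launch attempt:

```shell
#!/usr/bin/env bash
# Iterate over every region/AZ config instead of always taking CONFIGS[0].
# Subnet/AMI/SG values are placeholders for illustration.
CONFIGS=(
  "us-east-2|us-east-2a|subnet-aaa|ami-111|sg-111"
  "us-east-2|us-east-2b|subnet-bbb|ami-111|sg-222"
  "us-east-1|us-east-1a|subnet-ccc|ami-333|sg-333"
)

select_config() {
  local cfg
  for cfg in "${CONFIGS[@]}"; do
    IFS='|' read -r REGION AZ SUBNET AMI SG <<< "$cfg"
    # try_launch stands in for the actual EC2 launch attempt in this AZ.
    if try_launch "$REGION" "$AZ" "$SUBNET" "$AMI" "$SG"; then
      echo "$REGION $AZ"
      return 0
    fi
  done
  echo "no capacity in any configured AZ" >&2
  return 1
}
```

Pulling this into a standalone script also makes it unit-testable by stubbing out the launch call.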

Comment on lines +62 to +72
uses: machulav/ec2-github-runner@v2.3.6
with:
mode: start
github-token: ${{ inputs.github-token }}
ec2-image-id: ${{ inputs.ec2-ami-id }}
ec2-instance-type: ${{ inputs.instance-type }}
subnet-id: ${{ inputs.subnet-id }}
security-group-id: ${{ inputs.security-group-id }}
aws-resource-tags: ${{ inputs.ec2-instance-tags }}
runner-home-dir: ${{ inputs.runner-home-dir }}
iam-role-name: ${{ inputs.iam-role-name }}

Ah, this version of the GH action actually allows for setting different availability zones!

See here: https://github.com/opendatahub-io/data-processing/blob/main/.github/workflows/execute-all-notebooks.yml#L49

Let's use this approach so that we can try different availability zones in case the first one lacks availability. Then let me know which AWS regions you need the AMI ID to exist in.

Address feedback from code review:

1. Simplify to single model (gpt-oss:20b only)
   - Remove model dropdown - hardcode to gpt-oss:20b
   - Remove instance type selection - hardcode to g6.2xlarge
   - Make suite selection a dropdown for better UX

2. Fix multi-AZ fallback implementation
   - Use machulav/ec2-github-runner's built-in availability-zones-config
   - Remove broken custom region selection logic (was always using CONFIGS[0])
   - Simplify to us-east-2 only with 3 AZ fallback (2a, 2b, 2c)

3. Pin actions to commit hashes for security
   - Pin aws-actions/configure-aws-credentials@v4.0.2 to commit hash
   - Pin machulav/ec2-github-runner@v2.3.6 to commit hash

4. Remove emojis from workflow output
   - Clean up summary messages

Based on feedback from @iamemilio and @courtneypacheco

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern cdoern force-pushed the feat/gpu-runners-vllm-recording branch from ce3f415 to ab8da29 on March 26, 2026 at 18:27
Fix markdownlint issues and auto-generate workflow documentation:
- Add language tags to code blocks
- Fix markdown formatting (spacing, headings)
- Auto-generate .github/workflows/README.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern cdoern force-pushed the feat/gpu-runners-vllm-recording branch from ab8da29 to 4837466 on March 26, 2026 at 18:41
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
