
feat: add GPU-enabled self-hosted runners for vLLM recording#5297

Draft
cdoern wants to merge 7 commits into llamastack:main from cdoern:feat/gpu-runners-vllm-recording

Conversation


@cdoern cdoern commented Mar 25, 2026

Summary

Add support for recording vLLM integration tests on GPU-enabled EC2 instances with the 20B-parameter gpt-oss:20b model. This enables testing larger models that don't fit on standard CPU runners while keeping costs in check.

Key Features:

  • OIDC authentication - No long-lived AWS credentials stored in GitHub
  • Multi-region/AZ fallback - High availability across 9 availability zones (us-east-2 and us-east-1)
  • Security hardened - Test jobs run with zero permissions to prevent credential theft
  • Always-cleanup - EC2 instances terminated even on failure or cancellation
  • Cost efficient - ~$0.43 per run on g6.2xlarge
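
As a rough sketch of how the OIDC piece fits together, a launch job would request an ID token and exchange it for short-lived AWS credentials via `aws-actions/configure-aws-credentials`, using the `AWS_ROLE_ARN` secret named later in this PR. The action version and job name here are illustrative, not the PR's exact workflow:

```yaml
permissions:
  id-token: write   # lets the job request a GitHub OIDC token
  contents: read

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-2
```

The test job itself would declare `permissions: {}`, matching the "zero permissions" hardening above, so a compromised test cannot reach the cloud credentials.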

Components:

  • .github/workflows/record-vllm-gpu-tests.yml - Main workflow with manual trigger
  • .github/actions/launch-gpu-runner/ - EC2 instance launcher (wraps machulav/ec2-github-runner)
  • .github/actions/setup-vllm-gpu/ - vLLM GPU installation and server setup with AWQ quantization
  • Test configurations for vllm-gpu-gpt-oss setup

AWS Setup Required

This PR includes all code but requires AWS infrastructure setup before it can be used:

  1. OIDC Provider + IAM Role - See AWS_SETUP_GUIDE.md Step 1
  2. VPC and Networking - Subnets, security groups in 2 regions
  3. GPU-Enabled AMI - CUDA 12.4, Docker, NVIDIA Container Toolkit
  4. GitHub Secrets/Variables - Configure AWS_ROLE_ARN, subnet IDs, AMI IDs, etc.
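
For Step 1, the IAM role's trust policy typically takes the following shape; the account ID and repo slug below are placeholders, and the exact conditions should follow AWS_SETUP_GUIDE.md:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:<ORG>/<REPO>:*"
        }
      }
    }
  ]
}
```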

See IMPLEMENTATION_STATUS.md for detailed checklist.

Test Plan

Once AWS infrastructure is set up:

  • Trigger workflow via Actions tab > vLLM GPU Recording > Run workflow
  • Verify EC2 instance launches in us-east-2
  • Verify GPU detected and vLLM server starts with gpt-oss:20b
  • Verify tests run successfully in record mode
  • Verify recordings uploaded as artifacts
  • Verify EC2 instance terminated automatically
  • Run 5-10 times to validate reliability
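
The "vLLM server starts" check above can be scripted as a simple polling loop. vLLM's OpenAI-compatible server exposes a `/health` endpoint; the port and timeout values here are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Poll a URL until it returns success or a timeout elapses.
wait_for_url() {
  local url=$1 timeout=${2:-600} interval=${3:-10} elapsed=0
  while (( elapsed < timeout )); do
    if curl -sf "$url" > /dev/null; then
      echo "server ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# usage (placeholder port/path): wait_for_url "http://localhost:8000/health" 600 10
```

A generous timeout matters here: loading a 20B model onto the GPU can take several minutes before the endpoint answers.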

Documentation

  • User Guide: docs/gpu-runners.md - How to trigger workflows, troubleshooting, cost estimates
  • AWS Setup: AWS_SETUP_GUIDE.md - Step-by-step infrastructure setup with OIDC
  • Implementation Plan: IMPLEMENTATION_PLAN.md - Roadmap and future optimizations
  • Status Tracker: IMPLEMENTATION_STATUS.md - Current status and next steps

References

cdoern and others added 4 commits March 25, 2026 14:09
Add GitHub Actions workflow and custom actions to support recording vLLM
integration tests on GPU-enabled EC2 instances with gpt-oss:20b model.

Key features:
- OIDC authentication for AWS (no long-lived credentials)
- Multi-region/AZ fallback for high availability (9 AZs across us-east-2 and us-east-1)
- Security hardened with zero permissions on test jobs
- Always-cleanup guarantee to prevent orphaned instances
- Support for multiple models and instance types via workflow_dispatch

Components:
- record-vllm-gpu-tests.yml: Main workflow with 3-job pattern (launch → test → cleanup)
- launch-gpu-runner: Wrapper action for machulav/ec2-github-runner
- setup-vllm-gpu: Installs vLLM with CUDA support and starts server with AWQ quantization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add test suite and CI matrix configuration for GPU-based vLLM testing with
gpt-oss:20b model. This enables recording integration tests on GPU runners.

Changes:
- Add vllm-gpu-gpt-oss setup in suites.py with gpt-oss:20b model
- Add gpu-vllm matrix in ci_matrix.json for base, responses, and reasoning suites

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add comprehensive documentation for using GPU-enabled self-hosted runners.

- gpu-runners.md: User guide covering quick start, architecture, troubleshooting,
  cost estimates, and performance tuning
- AWS_SETUP_GUIDE.md: Step-by-step instructions for setting up AWS infrastructure
  with OIDC authentication, VPC/networking, GPU AMI creation, and GitHub configuration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add comprehensive implementation plan and status tracking documents for GPU runners.

- IMPLEMENTATION_PLAN.md: 4-phase roadmap with tasks, time estimates, dependencies,
  and success metrics
- IMPLEMENTATION_STATUS.md: Current status tracker with completed tasks, AWS setup
  requirements, and next actions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 25, 2026

cdoern commented Mar 25, 2026

(I will remove the annoying MD files once this is ready.)

- gpt-oss:20b
- gpt-oss:latest
- Qwen/Qwen3-0.6B
instance_type:

@iamemilio iamemilio Mar 25, 2026


It might be a little easier to keep a mapping of instance types and fallbacks for each supported model. That way users don't need to know or configure this when setting jobs up, and we can also take into account the different resource requirements of the models. For example, Qwen3 0.6B doesn't need an L4; it can probably run on a g4dn.xlarge and do reasonably well. To make your life easier for v1 of this PR, it's probably easier to support just one model for now.
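
The suggested mapping could be a small lookup keyed by model, returning an ordered list of instance types to try. The pairings below are illustrative guesses, not benchmarked choices:

```shell
#!/usr/bin/env bash
# Hypothetical model -> ordered instance-type fallback list.
instance_types_for_model() {
  case "$1" in
    "gpt-oss:20b")     echo "g6.2xlarge g5.2xlarge" ;;  # ~20B params: L4/A10G-class GPU
    "Qwen/Qwen3-0.6B") echo "g4dn.xlarge" ;;            # small model: a T4 suffices
    *) echo "no instance mapping for model: $1" >&2; return 1 ;;
  esac
}

# usage: for itype in $(instance_types_for_model "gpt-oss:20b"); do ... ; done
```

With this in place the workflow input can be just the model name, and the launcher iterates the returned types until one has capacity.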

- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Select region and AZ with fallback

@iamemilio iamemilio Mar 25, 2026


You may want to move this into a script so that it's testable. I don't think it's doing what we think it is: I'm pretty sure this only ever uses us-east-2, because the index into CONFIGS is hardcoded to 0:

IFS='|' read -r REGION AZ SUBNET AMI SG <<< "${CONFIGS[0]}"
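
A fallback loop along these lines would actually try each config in order; the CONFIGS entries and the `try_launch` helper below are placeholders standing in for the real EC2 launch attempt:

```shell
#!/usr/bin/env bash
# Iterate over every region/AZ config instead of always taking CONFIGS[0].
# Subnet/AMI/SG values are placeholders for illustration.
CONFIGS=(
  "us-east-2|us-east-2a|subnet-aaa|ami-111|sg-111"
  "us-east-2|us-east-2b|subnet-bbb|ami-111|sg-222"
  "us-east-1|us-east-1a|subnet-ccc|ami-333|sg-333"
)

select_config() {
  local cfg
  for cfg in "${CONFIGS[@]}"; do
    IFS='|' read -r REGION AZ SUBNET AMI SG <<< "$cfg"
    # try_launch stands in for the actual EC2 launch attempt in this AZ.
    if try_launch "$REGION" "$AZ" "$SUBNET" "$AMI" "$SG"; then
      echo "$REGION $AZ"
      return 0
    fi
  done
  echo "no capacity in any configured AZ" >&2
  return 1
}
```

Pulling this into a standalone script also makes it unit-testable by stubbing out the launch call.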

Comment on lines +62 to +72
uses: machulav/ec2-github-runner@v2.3.6
with:
mode: start
github-token: ${{ inputs.github-token }}
ec2-image-id: ${{ inputs.ec2-ami-id }}
ec2-instance-type: ${{ inputs.instance-type }}
subnet-id: ${{ inputs.subnet-id }}
security-group-id: ${{ inputs.security-group-id }}
aws-resource-tags: ${{ inputs.ec2-instance-tags }}
runner-home-dir: ${{ inputs.runner-home-dir }}
iam-role-name: ${{ inputs.iam-role-name }}

Ah, this version of the GH action actually allows for setting different availability zones!

See here: https://github.com/opendatahub-io/data-processing/blob/main/.github/workflows/execute-all-notebooks.yml#L49

Let's use this approach so that we can try different availability zones in case the first one lacks availability. Then let me know which AWS regions you need the AMI ID to exist in.

Address feedback from code review:

1. Simplify to single model (gpt-oss:20b only)
   - Remove model dropdown - hardcode to gpt-oss:20b
   - Remove instance type selection - hardcode to g6.2xlarge
   - Make suite selection a dropdown for better UX

2. Fix multi-AZ fallback implementation
   - Use machulav/ec2-github-runner's built-in availability-zones-config
   - Remove broken custom region selection logic (was always using CONFIGS[0])
   - Simplify to us-east-2 only with 3 AZ fallback (2a, 2b, 2c)

3. Pin actions to commit hashes for security
   - Pin aws-actions/configure-aws-credentials@v4.0.2 to commit hash
   - Pin machulav/ec2-github-runner@v2.3.6 to commit hash

4. Remove emojis from workflow output
   - Clean up summary messages

Based on feedback from @iamemilio and @courtneypacheco

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern cdoern force-pushed the feat/gpu-runners-vllm-recording branch from ce3f415 to ab8da29 on March 26, 2026 at 18:27
Fix markdownlint issues and auto-generate workflow documentation:
- Add language tags to code blocks
- Fix markdown formatting (spacing, headings)
- Auto-generate .github/workflows/README.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern cdoern force-pushed the feat/gpu-runners-vllm-recording branch from ab8da29 to 4837466 on March 26, 2026 at 18:41
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
