treadiehq/gpu-kill

GPU Kill

A CLI tool for managing GPUs across NVIDIA, AMD, Intel, and Apple Silicon systems. Monitor, control, and secure your GPU infrastructure with ease.

Community & Support

Join our Discord community for discussions, support, and updates:

Discord

Features

  • Monitor GPUs: Real-time usage, memory, temperature, and processes
  • Kill Processes: Gracefully terminate stuck GPU processes
  • Security: Detect crypto miners and suspicious activity
  • Guard Mode: Policy enforcement to prevent resource abuse
  • Remote: Manage GPUs across multiple servers
  • Multi-Vendor: Works with NVIDIA, AMD, Intel, and Apple Silicon
  • AI Integration: MCP server for AI assistant integration

Requirements

Build Performance

For faster development builds:

# Fast release build (recommended for development)
cargo build --profile release-fast

# Standard release build (optimized for production)
cargo build --release

# Maximum optimization (slowest, best performance)
cargo build --profile release-max

Build times on typical hardware:

  • Debug build: ~3 seconds
  • Release-fast: ~28 seconds
  • Release: ~28 seconds (improved from 76 seconds)
  • Release-max: ~60+ seconds (maximum optimization)

System Dependencies

Linux (Ubuntu/Debian):

sudo apt install build-essential libssl-dev pkg-config

Linux (Fedora/RHEL/CentOS):

sudo dnf install gcc gcc-c++ pkg-config openssl-devel
# or for older systems:
# sudo yum install gcc gcc-c++ pkg-config openssl-devel

macOS:

# Install Xcode command line tools
xcode-select --install
# OpenSSL is included with macOS

Windows:

  • Install Visual Studio Build Tools
  • OpenSSL is handled automatically by vcpkg

GPU Drivers

  • NVIDIA: NVIDIA drivers installed
  • AMD: ROCm drivers installed
  • Intel: intel-gpu-tools package installed
  • Apple Silicon: macOS with Apple Silicon (M1/M2/M3/M4)

Build Requirements

  • OS: Linux, macOS, or Windows
  • Rust: 1.70+ (for building from source)

Quick Start

Install & Run

# Build from source (first build may take 2-3 minutes)
git clone https://github.com/treadiehq/gpu-kill.git
cd gpu-kill
cargo build --release

# Or install via Cargo
cargo install gpukill

# Or one-liner installers (recommended)
# macOS/Linux
curl -fsSL https://gpukill.com/install | sh
# Windows (PowerShell)
irm https://gpukill.com/install-windows | iex

# List your GPUs
gpukill --list

# Watch GPU usage in real-time
gpukill --list --watch

Dead-simple cheatsheet

# Live watch (alias)
gpukill watch            # = gpukill --list --watch

# Kill job by PID (positional alias)
gpukill 12345            # = gpukill --kill --pid 12345

# Free a specific GPU index (kill all jobs on GPU 0)
gpukill --kill --gpu 0   # add --batch to actually kill; preview without it

# Force reset a GPU (shorthand)
gpukill --reset 0        # = gpukill --reset --gpu 0

# Safe mode: dry-run first (no changes)
gpukill 12345 --safe     # alias: --dry-run
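
The dry-run-first workflow above can be wrapped in a small helper. This is a minimal sketch, not part of GPU Kill: the `confirm_kill` function and the `GPUKILL` override variable are hypothetical names introduced here, and only the `gpukill <PID>` and `--safe` forms from the cheatsheet are assumed to exist.

```shell
#!/bin/sh
# Sketch: preview a kill with --safe, then ask before doing it for real.
# GPUKILL is a hypothetical override so the binary can be swapped out.
GPUKILL="${GPUKILL:-gpukill}"

# confirm_kill PID: dry-run first, then kill only after an explicit "y".
confirm_kill() {
    pid="$1"
    # Reject anything that is not a plain non-negative integer.
    case "$pid" in
        ''|*[!0-9]*) echo "usage: confirm_kill PID" >&2; return 2 ;;
    esac

    # Dry run: the cheatsheet describes --safe as making no changes.
    "$GPUKILL" "$pid" --safe || return 1

    printf 'Kill PID %s for real? [y/N] ' "$pid"
    read -r answer
    case "$answer" in
        y|Y) "$GPUKILL" "$pid" ;;
        *)   echo "Aborted."; return 1 ;;
    esac
}
```

After sourcing the file, `confirm_kill 12345` shows the preview and waits for confirmation; any answer other than `y` leaves the process untouched.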

Dashboard (Local Development)

The GPU Kill dashboard provides a modern web interface for GPU cluster monitoring. The dashboard is included in the repository for local development but is not required for core GPU Kill functionality.

Quick Start

# 1. Start the backend API server
gpukill --server --server-port 8080

# 2. In a new terminal, start the dashboard UI
cd dashboard
npm install  # First time only
npm run dev

# 3. Access the dashboard
open http://localhost:3000

Requirements:

  • Node.js 18+ and npm
  • GPU Kill backend server running (provides the API)

Note: You need both the backend server (port 8080) and frontend UI (port 3000) running for the dashboard to work.
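
Since both services must be up, a quick pre-flight check can save some confusion. The sketch below is illustrative only: `probe` and `check_dashboard` are hypothetical helper names, and it assumes `curl` is installed and that both services answer plain HTTP on the ports given above. Any HTTP response, whatever its status code, is treated as "up".

```shell
#!/bin/sh
# probe URL: succeed if anything answers HTTP at URL (any status code).
probe() {
    curl -s -o /dev/null --max-time 2 "$1"
}

# check_dashboard: report on both services; return 1 if either is down.
check_dashboard() {
    status=0
    for entry in "backend=http://localhost:8080" "frontend=http://localhost:3000"; do
        name=${entry%%=*}   # text before the first "="
        url=${entry#*=}     # text after the first "="
        if probe "$url"; then
            echo "$name: up ($url)"
        else
            echo "$name: DOWN ($url)"
            status=1
        fi
    done
    return "$status"
}
```

Run `check_dashboard` before opening the browser; a `DOWN` line shows which of the two terminals needs attention.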

Dashboard Features

  • Real-time monitoring of all GPUs across your cluster
  • Security detection with threat analysis and risk scoring
  • Policy management for resource control and enforcement
  • Cluster overview with Magic Moment contention insights
  • Interactive controls for process management and GPU operations

Production Deployment

For production GPU monitoring solutions, check the Kill Suite website.

MCP Server

GPU Kill includes an MCP server that enables AI assistants to interact with GPU management functionality:

  • Resources: Read GPU status, processes, audit data, policies, and security scans
  • Tools: Kill processes, reset GPUs, scan for threats, create policies

# Start the MCP server
cargo run --release -p gpukill-mcp

# Server runs on http://localhost:3001/mcp

Usage

Ask your AI assistant to use the tools with prompts such as:

What GPUs do I have and what's their current usage?
Kill the Python process that's stuck on GPU 0
Kill all training processes that are using too much GPU memory
Show me GPU usage and kill any stuck processes
Scan for crypto miners and suspicious activity
Create a policy to limit user memory usage to 8GB
Reset GPU 1 because it's not responding
What processes are currently using my GPUs?

See mcp/README.md for detailed MCP server documentation.

Security & Policies

Detect Threats

# Scan for crypto miners and suspicious activity
gpukill --audit --rogue

# Configure detection rules
gpukill --audit --rogue-config

Policy Enforcement

# Enable Guard Mode
gpukill --guard --guard-enable

# Test policies safely
gpukill --guard --guard-test-policies

For detailed security and policy documentation, see DETAILED.md.

Remote Management

Manage GPUs across multiple servers via SSH:

# List GPUs on remote server
gpukill --remote staging-server --list

# Kill process on remote server
gpukill --remote prod-gpu-01 --kill --pid 1234

# Reset GPU on remote server
gpukill --remote gpu-cluster --reset --gpu 0
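
To check several servers in one go, the `--remote ... --list` command above can be looped over a host list. This is a rough sketch under stated assumptions: `list_fleet` and the `GPUKILL` override are hypothetical names, SSH access to each host is assumed to be configured already, and `gpukill` is assumed to exit non-zero when a host cannot be queried.

```shell
#!/bin/sh
# GPUKILL is a hypothetical override so the binary can be swapped out.
GPUKILL="${GPUKILL:-gpukill}"

# list_fleet HOST...: run `gpukill --remote HOST --list` on each host in turn,
# keep going on failure, and return non-zero if any host failed.
list_fleet() {
    failed=0
    for host in "$@"; do
        echo "=== $host ==="
        if ! "$GPUKILL" --remote "$host" --list; then
            echo "warning: could not query $host" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example (host names taken from the commands above):
#   list_fleet staging-server prod-gpu-01 gpu-cluster
```

Continuing past a failed host is a deliberate choice here, so that one unreachable server does not hide the state of the others.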

Troubleshooting

Build Issues

OpenSSL not found:

# Ubuntu/Debian
sudo apt install build-essential libssl-dev pkg-config

# Fedora/RHEL/CentOS
sudo dnf install gcc gcc-c++ pkg-config openssl-devel

Other common build issues:

  • Ensure you have the latest Rust toolchain: rustup update
  • Clean and rebuild: cargo clean && cargo build --release
  • Check that system dependencies are installed (see the Requirements section)

Need Help?

gpukill --help                    # Show all options
gpukill --version                 # Show version

CI/CD and Testing

GPU Kill uses a CI/CD pipeline with automatic GPU testing:

  • ✅ Conditional GPU testing - Runs automatically when GPU hardware is available
  • ✅ Multi-vendor GPU testing on real hardware (NVIDIA, AMD, Intel, Apple Silicon)
  • ✅ Hot Aisle integration - Optional on-demand GPU instance provisioning for comprehensive testing
  • ✅ Cross-platform compatibility testing
  • ✅ Performance benchmarking and profiling
  • ✅ Security auditing and compliance checks
  • ✅ Stress testing for reliability validation

How GPU Testing Works

  • On GitHub hosted runners: GPU tests skip gracefully (no GPU hardware)
  • On self-hosted runners: GPU tests run automatically when GPU hardware is detected
  • On cloud instances: GPU tests run automatically when GPU hardware is available
  • On developer machines: GPU tests run automatically when GPU hardware is detected
  • Via Hot Aisle: On-demand GPU instance provisioning for comprehensive testing
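
The "run when a GPU is present, skip otherwise" behaviour can be approximated in a local script. The sketch below is not the project's detection logic, which is not shown here. It is a heuristic that assumes the presence of a vendor tool (`nvidia-smi`, `rocm-smi`, `intel_gpu_top`) or an arm64 Mac implies usable hardware. The function names `detect_gpu_vendor` and `run_gpu_tests` are hypothetical.

```shell
#!/bin/sh
# detect_gpu_vendor: print a best-guess vendor name, or return 1 if none found.
# Heuristic only: a vendor tool on PATH is taken as a sign of that vendor's GPU.
detect_gpu_vendor() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo nvidia
    elif command -v rocm-smi >/dev/null 2>&1; then
        echo amd
    elif command -v intel_gpu_top >/dev/null 2>&1; then
        echo intel
    elif [ "$(uname -s)" = Darwin ] && [ "$(uname -m)" = arm64 ]; then
        echo apple
    else
        return 1
    fi
}

# run_gpu_tests: run the hardware test suite only when a GPU seems present.
run_gpu_tests() {
    if vendor=$(detect_gpu_vendor); then
        echo "GPU detected ($vendor); running hardware tests"
        cargo test --test gpu_hardware_tests
    else
        echo "No GPU detected; skipping hardware tests"
    fi
}
```

Calling `run_gpu_tests` on a machine without any of these signals prints the skip message and exits successfully, which mirrors the graceful skip on GitHub hosted runners.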

Quick Setup

Option 1: Test Locally (Already Working)

cargo test --test gpu_hardware_tests  # Runs on your GPU hardware

Option 2: Set Up Cloud GPU (5 minutes)

# On any cloud GPU instance:
curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash

Option 3: Self-Hosted Runner

See CI_CD.md for detailed information about our testing infrastructure and how to set up self-hosted runners with GPU hardware.

Option 4: Hot Aisle Integration (Optional)

# Build with Hot Aisle feature
cargo build --release --features hotaisle

# Integration tests run automatically (no API key required)
# For actual GPU testing:
# 1. Set up HOTAISLE_API_KEY in GitHub Secrets
# 2. Manually trigger "Hot Aisle GPU Testing" workflow
# 3. Tests run on real GPU hardware with automatic cleanup

Option 5: Cloud GPU Setup

See docs/CLOUD_GPU_SETUP.md for AWS, GCP, and Azure GPU instance setup.

License

This project is licensed under the FSL-1.1-MIT License. See the LICENSE file for details.