PEARC25 Conference Demos

This repository contains demonstration materials presented at the PEARC25 (Practice and Experience in Advanced Research Computing) conference, held July 20-24, 2025, in Columbus, Ohio.

Overview

These demos showcase advanced GPU computing capabilities using AMD Instinct MI300X GPUs, including:

  • GPU Partitioning: Demonstrating AMD GPU partitioning modes (CPX/SPX) for optimized resource allocation
  • Kubernetes AI Inference: Deploying and scaling vLLM inference services on Kubernetes with GPU acceleration
  • Model Context Protocol (MCP): Integration examples using SGLang and MCP for enhanced AI tooling

Configuration Files

  • ansible/device-config.yaml: Kubernetes GPU Operator configuration
  • ansible/microk8s-full-install.yml: MicroK8s installation playbook
  • ansible/microk8s-uninstall.yml: MicroK8s removal playbook
  • k8s/device-config.yaml: Kubernetes device plugin configuration
  • k8s/metallb-config.yaml: MetalLB load balancer configuration
  • k8s/vllm-*.yaml: vLLM deployment, service, and storage configurations
  • sglang/docker-compose.yml: SGLang container orchestration
  • env.example: Example .env file

Prerequisites

Hardware Requirements

  • AMD Instinct GPUs
  • System with ROCm-compatible hardware

Software Requirements

1. GPU Partitioning Demo

This section demonstrates the configuration and usage of partitions with AMD Instinct MI300X GPUs. Instinct GPUs support different partitioning modes:

  • SPX (Single Partition X-celerator): Treats the entire GPU as a single device
  • CPX (Core Partitioned X-celerator): Each XCD appears as a separate logical GPU (8 GPUs per MI300X)

Usage

Validate your environment and check current partitioning:

amd-smi version
amd-smi monitor

In the default SPX mode, amd-smi monitor will display 8 available GPUs (one per physical MI300X).

Change to CPX partitioning mode:

sudo amd-smi set -C cpx
amd-smi monitor

amd-smi monitor will now display 64 available GPUs (each of the 8 MI300X GPUs is presented as 8 logical devices).

2. Kubernetes vLLM Inference Demo

This section demonstrates deploying and scaling vLLM (a high-performance LLM inference server) on Kubernetes with AMD GPU acceleration. The previous partitioning configuration is leveraged to support up to 64 GPU-enabled pods. This demo is based on the comprehensive three-part blog series on AI Inference Orchestration with Kubernetes on Instinct MI300X, but has been modified to use a pre-downloaded model for faster deployment.

Prerequisites

Before starting this demo, you must:

  • Download the required model: Pull the amd/Llama-3.2-1B-Instruct-FP8-KV model into /data/hf_home using the Hugging Face CLI:

    # Create the directory if it doesn't exist
    sudo mkdir -p /data/hf_home
    sudo chown $USER:$USER /data/hf_home
    
    # Set Hugging Face cache directory
    export HF_HOME=/data/hf_home
    
    # Download the model (this may take some time)
    huggingface-cli download amd/Llama-3.2-1B-Instruct-FP8-KV

    The model will be stored in /data/hf_home/hub/ and is referenced in the vllm-deployment.yaml configuration. (A quick way to verify the download is sketched after this list.)

  • Configure MetalLB IP range: Review Part 2 of the blog series to determine an appropriate IP address range for your network environment. Update the metallb-config.yaml file with IP addresses that are:

    • In the same subnet as your Kubernetes nodes
    • Not used by DHCP or other network services
    • Available for MetalLB to assign to LoadBalancer services

    Example IP range configuration in metallb-config.yaml:

    # Update this range based on your network environment
    spec:
      addresses:
      - 192.168.1.200-192.168.1.210  # Adjust for your network
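
Before moving on, you can sanity-check that the model download completed. A minimal verification, assuming the HF_HOME setting shown above (huggingface-cli scan-cache lists whatever is in the local Hugging Face cache):

# Set the cache location used during the download
export HF_HOME=/data/hf_home

# List cached models; amd/Llama-3.2-1B-Instruct-FP8-KV should appear
huggingface-cli scan-cache

# Or simply confirm the snapshot directory exists
ls /data/hf_home/hub/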

Deployment

Setup

First, edit ansible/microk8s-full-install.yml and change the node_name: line to the hostname of the server that will host the MicroK8s cluster. Then, run the Ansible playbook to install a local, single-node MicroK8s cluster, the AMD GPU Operator, and their prerequisites.

cd ansible
ansible-playbook microk8s-full-install.yml -i localhost
cd ..
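
If you prefer not to edit the playbook, Ansible also lets you override variables from the command line. A minimal alternative, assuming node_name is defined as an ordinary playbook variable (extra-vars take precedence over playbook-defined variables):

cd ansible
# -e / --extra-vars overrides the node_name value defined in the playbook
ansible-playbook microk8s-full-install.yml -i localhost -e "node_name=$(hostname)"
cd ..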

Deployment Steps

Verify GPU availability:

kubectl get nodes -L feature.node.kubernetes.io/amd-gpu
kubectl get nodes -o custom-columns=NAME:.metadata.name,"Total GPUs:.status.capacity.amd\.com/gpu","Allocatable GPUs:.status.allocatable.amd\.com/gpu"

Create persistent storage for vLLM:

cd k8s
kubectl apply -f vllm-pvc.yaml
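
Optionally, confirm the claim was created before deploying (depending on the storage class's volume binding mode, the PVC may stay Pending until the first pod uses it):

# The PVC should report Bound, or Pending until a consuming pod is scheduled
kubectl get pvc -n default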

Deploy a single vLLM inference service:

kubectl apply -f vllm-deployment.yaml
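
Wait for the deployment to come up before checking GPU allocation; the deployment name below is the same one used in the scaling step later in this demo:

# Block until the vLLM pod is ready, then list pods
kubectl rollout status deployment/llama-3-2-1b -n default
kubectl get pods -n default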

Monitor GPU allocation:

kubectl describe nodes | tr -d '\000' |
  sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' |
  grep -e Name -e amd.com |
  perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' |
  sed 's/amd.com\/gpu:\?//g' |
  sed '1s/^/Node Available(GPUs) Used(GPUs)/' |
  sed 's/$/ 0 0 0/' |
  awk '{print $1, $2, $3}' | column -t

The output should indicate that a GPU has been allocated to the vLLM service.
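
A rougher but quicker check is to grep the node description directly; it prints the capacity, allocatable, and allocated amd.com/gpu lines for each node:

kubectl describe nodes | grep -e 'amd.com/gpu'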

Expose the vLLM service:

kubectl apply -f vllm-service.yaml
kubectl get svc -n default

Note the external IP of the vLLM service (it will show as <pending> until MetalLB is installed in the next step). This IP will be used to send a request to the vLLM API.

Next, install and configure MetalLB load balancer:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/main/config/manifests/metallb-native.yaml
kubectl get pods -n metallb-system
kubectl apply -f metallb-config.yaml
kubectl get svc
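
MetalLB's admission webhook must be running before metallb-config.yaml can be applied, so if the apply above is rejected, wait for the pods and retry. A typical wait command, assuming the app=metallb label used by the upstream manifests:

# Block until the MetalLB controller and speaker pods are ready, then re-apply
kubectl wait --namespace metallb-system --for=condition=ready pod --selector=app=metallb --timeout=120s
kubectl apply -f metallb-config.yaml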

Testing the API

Once deployed, test the inference API:

curl http://<EXTERNAL-IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
        "prompt": "Explain quantum entanglement in simple terms",
        "max_tokens": 1024,
        "temperature": 0.5
      }' | jq .

Scaling

Scale the deployment to multiple replicas:

kubectl scale -n default deployment llama-3-2-1b --replicas=12
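
You can watch the new replicas come up before sending more traffic:

# All 12 replicas should eventually become ready
kubectl rollout status deployment/llama-3-2-1b -n default
kubectl get pods -n default -o wide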

Send another query to the load balancer, which now has a pool of available vLLM servers.

curl http://<EXTERNAL-IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
        "prompt": "Explain quantum entanglement in simple terms",
        "max_tokens": 1024,
        "temperature": 0.5
      }' | jq .

Clean Up

Uninstall MicroK8s

cd ..
cd ansible
ansible-playbook microk8s-uninstall.yml -i localhost

Change back to SPX partitioning mode

cd ..
sudo amd-smi set -C spx
amd-smi monitor

3. Model Context Protocol (MCP) Demo

The MCP demo showcases integration with SGLang and the AMD SMI MCP server for enhanced AI tooling capabilities. Do not proceed with this step until you have completed the "Clean Up" steps from the previous section (uninstall MicroK8s and switch the GPUs back to SPX partitioning).

Prerequisites

The setup requirements depend on which type of MCP client you plan to use:

For "Bring Your Own Model" MCP clients (e.g., Roo Code, Continue, etc.): You'll need to download the moonshotai/Kimi-K2-Instruct model and run SGLang locally:

# Set Hugging Face cache directory (if not already set)
export HF_HOME=/data/hf_home

# Download the model for SGLang demo
huggingface-cli download moonshotai/Kimi-K2-Instruct

Note: You can also use a different or smaller model if preferred, but you'll need to download it yourself and update the docker-compose.yml file accordingly to reference your chosen model.

For hosted LLM services (e.g., GitHub Copilot, Claude, etc.): No model download or SGLang setup is required. You can skip directly to the MCP server testing section.

SGLang Setup (Only for "Bring Your Own Model" clients)

If you're using an MCP client that requires a local model (such as Roo Code or Continue), configure the required environment variables by creating a new .env file from the provided example:

cd sglang
cp env.example .env

Edit .env and add your Hugging Face API token on the HF_TOKEN line. Next, start SGLang using Docker Compose:

docker compose up -d

Run docker compose logs -f to monitor the SGLang logs. Once the logs indicate that the SGLang server has started, configure your MCP client to use the deployed model. Refer to your specific MCP client's documentation for configuration details.
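
Before wiring up an MCP client, you can confirm that the server is answering. A minimal check, assuming the compose file publishes SGLang's default port (30000) on the local host:

# List the models served by the OpenAI-compatible endpoint
curl http://localhost:30000/v1/models | jq .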

MCP AMD SMI Integration

This demonstrates the mcp-amdsmi server integration.

Setup: Follow the instructions in the mcp-amdsmi repository README to download and configure the MCP server.

Important: When using hosted LLM services (GitHub Copilot, Claude, etc.), you'll typically need to start the MCP server in HTTP transport mode rather than the default stdio mode. Refer to the mcp-amdsmi documentation for specific configuration instructions.

MCP Testing Options

Option 1: Use an MCP Client

You can test the MCP integration using various MCP clients. Visit https://modelcontextprotocol.io/clients to explore available clients and choose one that suits your needs.

Option 2: Use the MCP Inspector CLI Tool

If you prefer not to use a full MCP client, you can test the functionality using the MCP Inspector CLI tool.

# Install the MCP Inspector globally
npm i @modelcontextprotocol/inspector -g

# Test various AMD SMI MCP server capabilities
# Discover available GPUs in the system
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name get_gpu_discovery

# Get current GPU status and utilization
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name get_gpu_status

# Monitor GPU performance metrics
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name get_gpu_performance

# Analyze GPU memory usage patterns
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name analyze_gpu_memory

# Monitor power consumption and thermal data
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name monitor_power_thermal 

# Check overall GPU health status
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name check_gpu_health

These commands demonstrate the MCP server's ability to provide comprehensive GPU monitoring and management capabilities through the standardized Model Context Protocol interface. Each command will return information about the AMD GPU's current state and performance characteristics.

Troubleshooting

Common Issues

  1. GPU not detected: Ensure ROCm drivers are properly installed and GPUs are visible via amd-smi
  2. Partitioning fails: Verify you have appropriate permissions for amd-smi set commands
  3. Kubernetes pods stuck: Check GPU operator deployment and device plugin status
  4. Service not accessible: Verify MetalLB configuration and network policies
  5. Model not found: Ensure the required models are properly downloaded to /data/hf_home/hub/:
    • amd/Llama-3.2-1B-Instruct-FP8-KV for the Kubernetes vLLM demo
    • moonshotai/Kimi-K2-Instruct for the MCP/SGLang demo (or your chosen alternative model)
  6. MetalLB IP conflicts: Verify the IP range in metallb-config.yaml doesn't conflict with existing network assignments
  7. MCP server connection issues: Ensure the mcp-amdsmi server is properly installed and configured according to the repository instructions
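
For items 3 and 4 above, a quick way to see whether the relevant components are running (namespace names may vary with how the GPU Operator and MetalLB were installed):

# GPU Operator / device plugin and MetalLB pods should all be Running
kubectl get pods -A | grep -i -e gpu -e metallb

# Check which nodes currently advertise amd.com/gpu resources
kubectl get nodes -o custom-columns=NAME:.metadata.name,"GPUs:.status.allocatable.amd\.com/gpu"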

Useful Commands

  • Display GPU information: amd-smi monitor
  • Kubernetes events: kubectl get events --sort-by='.lastTimestamp'
  • Pod logs: kubectl logs <pod-name>

References
