This repository contains demonstration materials presented at the PEARC25 (Practice and Experience in Advanced Research Computing) conference, held July 20-24, 2025, in Columbus, Ohio.
These demos showcase advanced GPU computing capabilities using AMD Instinct MI300X GPUs, including:
- GPU Partitioning: Demonstrating AMD GPU partitioning modes (CPX/SPX) for optimized resource allocation
- Kubernetes AI Inference: Deploying and scaling vLLM inference services on Kubernetes with GPU acceleration
- Model Context Protocol (MCP): Integration examples using SGLang and MCP for enhanced AI tooling
- `ansible/device-config.yaml`: Kubernetes GPU Operator configuration
- `ansible/microk8s-full-install.yml`: MicroK8s installation playbook
- `ansible/microk8s-uninstall.yml`: MicroK8s removal playbook
- `k8s/device-config.yaml`: Kubernetes device plugin configuration
- `k8s/metallb-config.yaml`: MetalLB load balancer configuration
- `k8s/vllm-*.yaml`: vLLM deployment, service, and storage configurations
- `sglang/docker-compose.yml`: SGLang container orchestration
- `env.example`: Example `.env` file
- AMD Instinct GPUs
- System with ROCm-compatible hardware
- Instinct drivers and ROCm toolkit
- Docker and Docker Compose
- Ansible (for automated Kubernetes setup)
- Hugging Face CLI (`pip install huggingface_hub`)
- `npm` to use the MCP Inspector tool (`@modelcontextprotocol/inspector`)
- `jq` for JSON parsing (used in API testing examples)
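Before starting, it can help to confirm the prerequisite tooling is actually on your PATH. The check below is an optional, minimal sketch (this repository does not pin specific versions, and the tool names assume default installation paths):

```bash
# Optional sanity check: report any missing prerequisite tools
for tool in amd-smi docker ansible-playbook huggingface-cli npm jq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "MISSING: $tool"
  fi
done
```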
This section demonstrates the configuration and usage of partitioning with AMD Instinct MI300X GPUs. Instinct GPUs support different partitioning modes:
- SPX (Single Partition X-celerator): Treats the entire GPU as a single device
- CPX (Core Partitioned X-celerator): Each XCD appears as a separate logical GPU (8 GPUs per MI300X)
Validate your environment and check current partitioning:
```bash
amd-smi version
amd-smi monitor
```

`amd-smi monitor` will display 8 available GPUs.
Change to CPX partitioning mode:
```bash
sudo amd-smi set -C cpx
amd-smi monitor
```

`amd-smi` will now display 64 available GPUs.
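If you prefer to confirm the switch programmatically rather than by scanning the monitor output, counting the devices reported by `amd-smi` is a quick check. This is only a sketch: it assumes each logical GPU appears in `amd-smi list` output on a line beginning with `GPU`, which may vary between ROCm releases.

```bash
# Count logical GPUs after the partition change.
# Expect 64 in CPX mode on an 8x MI300X system, 8 in SPX mode.
# Assumes one "GPU: N" line per device in `amd-smi list` output.
amd-smi list | grep -c '^GPU'
```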
This section demonstrates deploying and scaling vLLM (a high-performance LLM inference server) on Kubernetes with AMD GPU acceleration. The previous partitioning configuration is leveraged to support up to 64 GPU-enabled pods. This demo is based on the comprehensive three-part blog series on AI Inference Orchestration with Kubernetes on Instinct MI300X, but has been modified to use a pre-downloaded model for faster deployment.
Prerequisites
Before starting this demo, you must:
- Download the required model: Download the `amd/Llama-3.2-1B-Instruct-FP8-KV` model and save it to `/data/hf_home` using the Hugging Face CLI:

  ```bash
  # Create the directory if it doesn't exist
  sudo mkdir -p /data/hf_home
  sudo chown $USER:$USER /data/hf_home

  # Set Hugging Face cache directory
  export HF_HOME=/data/hf_home

  # Download the model (this may take some time)
  huggingface-cli download amd/Llama-3.2-1B-Instruct-FP8-KV
  ```

  The model will be stored in `/data/hf_home/hub/` and is referenced in the `vllm-deployment.yaml` configuration.

- Configure MetalLB IP range: Review Part 2 of the blog series to determine an appropriate IP address range for your network environment. Update the `metallb-config.yaml` file with IP addresses that are:

  - In the same subnet as your Kubernetes nodes
  - Not used by DHCP or other network services
  - Available for MetalLB to assign to LoadBalancer services

  Example IP range configuration in `metallb-config.yaml` (a fuller sketch follows this list):

  ```yaml
  # Update this range based on your network environment
  spec:
    addresses:
      - 192.168.1.200-192.168.1.210  # Adjust for your network
  ```
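For orientation, here is a hedged sketch of what a complete MetalLB Layer 2 configuration typically looks like: an `IPAddressPool` paired with an `L2Advertisement`. The resource names (`demo-pool`, `demo-l2`) are placeholders and not necessarily what the repository's `metallb-config.yaml` uses, and the manifest is shown inline only for illustration; in this demo the equivalent settings live in `metallb-config.yaml` and are applied during the deployment steps, after MetalLB itself has been installed.

```bash
# Illustrative only -- resource names are hypothetical; adjust addresses for your network.
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: demo-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.200-192.168.1.210
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: demo-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - demo-pool
EOF
```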
Setup
First, edit `ansible/microk8s-full-install.yml` and change the `node_name:` line to the hostname of the server that will host the MicroK8s cluster. Then, run the Ansible playbook to install a local, single-node MicroK8s cluster, the AMD GPU Operator, and prerequisites.
```bash
cd ansible
ansible-playbook microk8s-full-install.yml -i localhost
cd ..
```

Deployment Steps
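Before working through the individual steps below, it can help to confirm the cluster itself is healthy. This is a minimal check, assuming MicroK8s was installed via snap and the playbook has pointed `kubectl` at the MicroK8s cluster:

```bash
# Wait until MicroK8s reports all core services are ready
microk8s status --wait-ready

# Confirm the node is Ready and the AMD GPU Operator pods are running
kubectl get nodes
kubectl get pods -A
```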
Verify GPU availability:
```bash
kubectl get nodes -L feature.node.kubernetes.io/amd-gpu
kubectl get nodes -o custom-columns=NAME:.metadata.name,"Total GPUs:.status.capacity.amd\.com/gpu","Allocatable GPUs:.status.allocatable.amd\.com/gpu"
```

Create persistent storage for vLLM:
```bash
cd k8s
kubectl apply -f vllm-pvc.yaml
```

Deploy a single vLLM inference service:
```bash
kubectl apply -f vllm-deployment.yaml
```
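Before checking GPU allocation, you may want to wait for the pod to come up. A hedged example, assuming the Deployment created by `vllm-deployment.yaml` is named `llama-3-2-1b` (the name used in the scaling step later in this demo):

```bash
# Block until the vLLM Deployment reports its pods as available
kubectl rollout status deployment/llama-3-2-1b -n default

# Inspect the pods directly if the rollout stalls
kubectl get pods -n default
```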
Monitor GPU allocation:

```bash
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e amd.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/amd.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```

The output should indicate that a GPU has been allocated to the vLLM service.
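If you only need a rough per-node view rather than the formatted table above, a shorter check of the same information is:

```bash
# Quick view of the amd.com/gpu capacity and allocation lines per node
kubectl describe nodes | grep -E 'Name:|amd\.com/gpu'
```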
Expose the vLLM service:
```bash
kubectl apply -f vllm-service.yaml
kubectl get svc -n default
```

Note the external IP of the vLLM service. This will be used to send a request to the vLLM API.
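To avoid copying the address by hand, you can capture it into a shell variable once MetalLB has assigned one. The service name below is a placeholder; use the name defined in `vllm-service.yaml`:

```bash
# Store the LoadBalancer IP for the later curl calls
VLLM_IP=$(kubectl get svc <service-name> -n default \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "vLLM endpoint: http://${VLLM_IP}:8000"
```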
Next, install and configure MetalLB load balancer:
```bash
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/main/config/manifests/metallb-native.yaml
kubectl get pods -n metallb-system
```
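MetalLB's pods can take a little while to become Ready, so it is common practice to wait for them before applying the address pool (the label selector here is assumed from the upstream `metallb-native` manifests):

```bash
# Wait for the MetalLB controller and speaker pods to be Ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=120s
```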
```bash
kubectl apply -f metallb-config.yaml
kubectl get svc
```

Testing the API
Once deployed, test the inference API:
```bash
curl http://<EXTERNAL-IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
    "prompt": "Explain quantum entanglement in simple terms",
    "max_tokens": 1024,
    "temperature": 0.5
  }' | jq .
```
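Besides completions, vLLM's OpenAI-compatible server also exposes a model listing endpoint, which is a quick way to confirm which model the service is actually serving:

```bash
# List the models served by the vLLM endpoint
curl http://<EXTERNAL-IP>:8000/v1/models | jq .
```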
Scaling

Scale the deployment to multiple replicas:
```bash
kubectl scale -n default deployment llama-3-2-1b --replicas=12
```

Send another query to the load balancer, which now has a pool of available vLLM servers.
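Scaling to 12 replicas can take a moment while pods are scheduled onto the GPU partitions; you can watch the rollout before re-sending the query:

```bash
# Wait for all 12 replicas to become available, then list them
kubectl rollout status deployment/llama-3-2-1b -n default
kubectl get pods -n default -o wide
```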
```bash
curl http://<EXTERNAL-IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
    "prompt": "Explain quantum entanglement in simple terms",
    "max_tokens": 1024,
    "temperature": 0.5
  }' | jq .
```

Clean Up
Uninstall MicroK8s
```bash
cd ..
cd ansible
ansible-playbook microk8s-uninstall.yml -i localhost
```

Change back to SPX partitioning mode
```bash
cd ..
sudo amd-smi set -C spx
amd-smi monitor
```

The MCP demo showcases integration with SGLang and the AMD SMI MCP server for enhanced AI tooling capabilities. Do not proceed with this step until you have completed the "Clean Up" step from the previous section (uninstall MicroK8s and set the GPU back to SPX partitioning).
The setup requirements depend on which type of MCP client you plan to use:
For "Bring Your Own Model" MCP clients (e.g., Roo Code, Continue, etc.):
You'll need to download the moonshotai/Kimi-K2-Instruct model and run SGLang locally:
```bash
# Set Hugging Face cache directory (if not already set)
export HF_HOME=/data/hf_home

# Download the model for SGLang demo
huggingface-cli download moonshotai/Kimi-K2-Instruct
```

Note: You can also use a different or smaller model if preferred, but you'll need to download it yourself and update the `docker-compose.yml` file accordingly to reference your chosen model.
For hosted LLM services (e.g., GitHub Copilot, Claude, etc.): No model download or SGLang setup is required. You can skip directly to the MCP server testing section.
If you're using an MCP client that requires a local model (like Roo Code or Continue), configure the suggested environment variables. Create a new .env file based on the provided example:
```bash
cd sglang
cp env.example .env
```

Edit `.env` and add your Hugging Face API token on the `HF_TOKEN` line. Next, start SGLang using Docker Compose:
```bash
docker compose up -d
```

Run `docker compose logs -f` to monitor the SGLang logs. Once the logs indicate that the SGLang server has started, configure your MCP client to use the deployed model. Refer to your specific MCP client's documentation for configuration details.
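Before wiring up an MCP client, you can optionally sanity-check that the SGLang endpoint is responding. This sketch assumes the compose file publishes SGLang's default port 30000 on localhost; adjust the port to whatever `sglang/docker-compose.yml` actually maps:

```bash
# Basic liveness check against the SGLang server (port is an assumption)
curl http://localhost:30000/health

# List the model(s) the server exposes via its OpenAI-compatible API
curl http://localhost:30000/v1/models | jq .
```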
This demonstrates the mcp-amdsmi server integration.
Setup: Follow the instructions in the mcp-amdsmi repository README to download and configure the MCP server.
Important: When using hosted LLM services (GitHub Copilot, Claude, etc.), you'll typically need to start the MCP server in HTTP transport mode rather than the default stdio mode. Refer to the mcp-amdsmi documentation for specific configuration instructions.
Option 1: Use an MCP Client
You can test the MCP integration using various MCP clients. Visit https://modelcontextprotocol.io/clients to explore available clients and choose one that suits your needs.
Option 2: Use the MCP Inspector CLI Tool
If you prefer not to use a full MCP client, you can test the functionality using the MCP Inspector CLI tool.
```bash
# Install the MCP Inspector globally
npm i @modelcontextprotocol/inspector -g

# Test various AMD SMI MCP server capabilities

# Discover available GPUs in the system
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name get_gpu_discovery

# Get current GPU status and utilization
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name get_gpu_status

# Monitor GPU performance metrics
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name get_gpu_performance

# Analyze GPU memory usage patterns
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name analyze_gpu_memory

# Monitor power consumption and thermal data
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name monitor_power_thermal

# Check overall GPU health status
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/call --tool-name check_gpu_health
```

These commands demonstrate the MCP server's ability to provide comprehensive GPU monitoring and management capabilities through the standardized Model Context Protocol interface. Each command will return information about the AMD GPU's current state and performance characteristics.
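The Inspector's CLI mode can also enumerate the tools a server exposes, which is handy before calling a specific one (the `tools/list` method name is assumed from the Model Context Protocol specification):

```bash
# Enumerate all tools exposed by the AMD SMI MCP server
npx @modelcontextprotocol/inspector --cli mcp-amdsmi --method tools/list
```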
- GPU not detected: Ensure ROCm drivers are properly installed and GPUs are visible via `amd-smi`
- Partitioning fails: Verify you have appropriate permissions for `amd-smi set` commands
- Kubernetes pods stuck: Check GPU operator deployment and device plugin status
- Service not accessible: Verify MetalLB configuration and network policies
- Model not found: Ensure the required models are properly downloaded to `/data/hf_home/hub/`:
  - `amd/Llama-3.2-1B-Instruct-FP8-KV` for the Kubernetes vLLM demo
  - `moonshotai/Kimi-K2-Instruct` for the MCP/SGLang demo (or your chosen alternative model)
- MetalLB IP conflicts: Verify the IP range in `metallb-config.yaml` doesn't conflict with existing network assignments
- MCP server connection issues: Ensure the `mcp-amdsmi` server is properly installed and configured according to the repository instructions
- Display GPU information: `amd-smi monitor`
- Kubernetes events: `kubectl get events --sort-by='.lastTimestamp'`
- Pod logs: `kubectl logs <pod-name>`
- AI Inference Orchestration with Kubernetes on Instinct MI300X - Part 1
- AI Inference Orchestration with Kubernetes on Instinct MI300X - Part 2
- AI Inference Orchestration with Kubernetes on Instinct MI300X - Part 3
- mcp-amdsmi Repository
- Model Context Protocol Clients
- AMD Instinct MI300X GPU Partitioning Documentation
- vLLM Documentation
- SGLang Project
- Model Context Protocol
- PEARC25 Conference