Add blog post: Kubernetes GPU Sharing with NVIDIA MIG + DRA on Amazon EKS #17225
base: master
Conversation
New blog post covering Dynamic Resource Allocation (DRA) with NVIDIA Multi-Instance GPU (MIG) on Amazon EKS using Pulumi. Includes:

- How DRA works and replaces the device plugin model
- MIG partitioning concepts and profile configurations
- Complete Pulumi TypeScript infrastructure code
- Real workload demonstration with Fashion-MNIST
- GPU Operator and DRA driver configuration
- Monitoring with Grafana and DCGM Exporter
- CrossGuard policy examples for GPU governance
- Lessons learned: MIG strategy, AL2023 AMI quirks, Spot capacity

Also updates github-card shortcode with path parameter support.
Documentation Review

This is a comprehensive technical blog post covering Kubernetes Dynamic Resource Allocation (DRA) with NVIDIA MIG on Amazon EKS. The content is well-structured, technically accurate, and demonstrates real working examples. Below are my findings:

Issues Found

Line 322: Shortcode usage with path parameter

`{{< github-card repo="pulumi/examples" path="tree/master/aws-ts-eks-gpu-dra" >}}`

Issue: The path parameter uses "tree/master" but the repository might be using "main" as the default branch. Verify that the path is correct.

Line 1198: Duplicate shortcode with same path concern

`{{< github-card repo="pulumi/examples" path="tree/master/aws-ts-eks-gpu-dra/mig-policy" >}}`

Issue: Same concern - verify "tree/master" vs "tree/main" for the path parameter.

Line 538: Command references Pulumi ESC without context

`pulumi env run pulumi-idp/auth -- kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml`

Issue: This command references Pulumi ESC without explaining the authentication context it provides.
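For readers hitting that command cold: `pulumi env run <environment> -- <command>` runs the command with the values defined in a Pulumi ESC environment injected, for example as environment variables or temporary files, which is how the wrapped `kubectl` call gets its cluster credentials. A minimal sketch of what an environment like `pulumi-idp/auth` could contain (purely illustrative; the real environment is not shown in the post):

```yaml
# Hypothetical ESC environment definition (not from the post).
values:
  files:
    # Writes the kubeconfig to a temp file and points KUBECONFIG at it,
    # so the wrapped kubectl command can authenticate against the cluster.
    KUBECONFIG: |
      apiVersion: v1
      kind: Config
      # ...clusters, users, and contexts...
```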
Minor: Sentence structure on line 316

"This trips people up. For MIG on AWS, you need p4d (A100) or p5 (H100) instances."

Suggestion: While technically correct, the casual tone "This trips people up" is slightly informal for a technical blog post. Consider: "This is a common source of confusion. For MIG on AWS, you need p4d (A100) or p5 (H100) instances."

Positive Findings

✅ Style compliance: Headings follow the style guide (H1 in Title Case, H2+ in Sentence case)

Shortcode Enhancement Review

The changes to the github-card shortcode add path parameter support.
Recommendations
Summary

This is high-quality technical content that effectively demonstrates a complex topic with working code examples. The issues identified are minor and primarily concern verifying external paths and clarifying authentication context. The writing is clear, the examples are thorough, and the practical lessons at the end add significant value. Mention me (@claude) if you'd like me to review any updates or have questions about these suggestions.
Your site preview for commit 7c157cf is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17225-7c157cf6.s3-website.us-west-2.amazonaws.com
title: "Kubernetes GPU Sharing: NVIDIA MIG + DRA on Amazon EKS"
date: 2026-01-24
draft: false
meta_desc: "Use Kubernetes DRA with Pulumi and Amazon EKS to efficiently run inference workloads with NVIDIA MIG GPU partitioning"
Suggested change:
- meta_desc: "Use Kubernetes DRA with Pulumi and Amazon EKS to efficiently run inference workloads with NVIDIA MIG GPU partitioning"
+ meta_desc: "Learn how to share GPUs on Amazon EKS using Kubernetes 1.34 DRA and NVIDIA MIG. Complete Pulumi TypeScript code for hardware-isolated inference and training workloads."
### How MIG partitions the GPU

MIG works by slicing the GPU into two types of resources that get combined to create isolated instances:
Suggested change:
- MIG works by slicing the GPU into two types of resources that get combined to create isolated instances:
+ MIG works by partitioning the GPU into two types of resources that get combined to create isolated instances:
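To make the two-resource idea concrete: a MIG profile name such as `1g.5gb` pairs one compute slice with a 5 GB memory slice. The GPU Operator applies a chosen layout through a mig-parted configuration; here is a rough sketch of that format (illustrative only, the profile names and counts are assumptions rather than the post's actual config):

```yaml
# Illustrative mig-parted-style config (not taken from the post).
# Each named layout maps GPUs to MIG profiles; a profile name encodes
# compute slices + memory slice size (e.g. 1g.5gb = 1 slice, 5 GB).
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7   # seven isolated instances per A100 40GB GPU
```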
cnunciato left a comment
This is a great start, but it's covering way too much surface area for a single blog post. Let's hack on it together and see what we can do.
When you deploy a pod that references a ResourceClaim, here's what happens:

1. **Scheduling** - The scheduler reads the claim's requirements and evaluates them against all ResourceSlices in the cluster. It finds nodes where matching devices are available. This is different from device plugins, where the scheduler just checks if `nvidia.com/gpu >= 1`.
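To ground step 1, here is a minimal sketch of the objects involved: a ResourceClaim requesting a device through a DeviceClass, and a pod that references the claim. The resource names, the image, and the `mig.nvidia.com` DeviceClass are assumptions for illustration, not code from the post, and the field layout follows the `resource.k8s.io/v1` API that went GA in Kubernetes 1.34 (older beta APIs differ):

```yaml
# Hypothetical ResourceClaim: ask for exactly one device from a DeviceClass.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: small-gpu-slice
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: mig.nvidia.com   # assumed DeviceClass name
---
# Pod that consumes the claim; the scheduler matches the claim's request
# against the ResourceSlices published by the DRA driver on each node.
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  resourceClaims:
    - name: gpu-slice
      resourceClaimName: small-gpu-slice
  containers:
    - name: app
      image: my-inference:latest            # placeholder image
      resources:
        claims:
          - name: gpu-slice
```

The claim is what the scheduler evaluates against ResourceSlices in step 1 above.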
I'll be honest: At this point, my eyes are glazing over — there's just way too much info here for a single post, and I'm only like 10% into it. 😅
I'm sure everything here is valuable and factually correct, but we don't have to educate readers on what all this stuff is, how it works, etc.; there are docs for that. We can respect their time and keep it focused by stating the problem, summarizing the solution, linking off to relevant docs, and getting to the goods, pointing out whatever's relevant along the way.
The principle behind this is known as "just in time" vs. "just in case".
I know this because as a long-winded writer myself, I've had it hammered into my own head over the years. 😄
Summary

Test plan
- /blog/pulumi-eks-dynamic-resource-allocation/
- `make lint` passes