Add blog post: Kubernetes GPU Sharing with NVIDIA MIG + DRA on Amazon EKS #17225

dirien · 2026-01-25T20:24:59Z

Summary

New blog post covering Dynamic Resource Allocation (DRA) with NVIDIA Multi-Instance GPU (MIG) on Amazon EKS using Pulumi
Explains how DRA replaces the device plugin model for GPU scheduling
Includes complete Pulumi TypeScript infrastructure code
Demonstrates real workloads with Fashion-MNIST (training + inference)
Covers GPU Operator, DRA driver configuration, and monitoring with Grafana
Documents lessons learned: MIG strategy mismatches, AL2023 AMI quirks, Spot capacity issues
Updates github-card shortcode with path parameter support

Test plan

Verify blog post renders correctly at /blog/pulumi-eks-dynamic-resource-allocation/
Check all images display properly
Verify code blocks have correct syntax highlighting
Test github-card shortcode renders with path parameter
Run make lint - passes

… EKS

New blog post covering Dynamic Resource Allocation (DRA) with NVIDIA Multi-Instance GPU (MIG) on Amazon EKS using Pulumi. Includes:

- How DRA works and replaces the device plugin model
- MIG partitioning concepts and profile configurations
- Complete Pulumi TypeScript infrastructure code
- Real workload demonstration with Fashion-MNIST
- GPU Operator and DRA driver configuration
- Monitoring with Grafana and DCGM Exporter
- CrossGuard policy examples for GPU governance
- Lessons learned: MIG strategy, AL2023 AMI quirks, Spot capacity

Also updates github-card shortcode with path parameter support.

claude · 2026-01-25T20:26:40Z

Documentation Review

This is a comprehensive technical blog post covering Kubernetes Dynamic Resource Allocation (DRA) with NVIDIA MIG on Amazon EKS. The content is well-structured, technically accurate, and demonstrates real working examples. Below are my findings:

Issues Found

Line 322: Shortcode usage with path parameter

{{< github-card repo="pulumi/examples" path="tree/master/aws-ts-eks-gpu-dra" >}}

Issue: The path parameter uses "tree/master" but the repository might be using "main" as the default branch. Verify that the path tree/master/aws-ts-eks-gpu-dra exists in the pulumi/examples repository.

Line 1198: Duplicate shortcode with same path concern

{{< github-card repo="pulumi/examples" path="tree/master/aws-ts-eks-gpu-dra/mig-policy" >}}

Issue: Same concern - verify "tree/master" vs "tree/main" for the path parameter.

Line 538: Command references Pulumi ESC without context

pulumi env run pulumi-idp/auth -- kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml

Issue: This command references pulumi env run pulumi-idp/auth which appears to be using Pulumi ESC (Environments, Secrets, and Configuration). However, this wasn't mentioned in the prerequisites or setup sections. Readers who follow along may not have this configured and won't be able to run the command. Consider either:

Adding Pulumi ESC to the prerequisites section
Showing the command with and without ESC: kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml
Adding a note explaining this is optional if using ESC for authentication

Minor: Sentence structure on line 316

This trips people up. For MIG on AWS, you need p4d (A100) or p5 (H100) instances.

Suggestion: While technically correct, the casual tone "This trips people up" is slightly informal for a technical blog post. Consider: "This is a common source of confusion. For MIG on AWS, you need p4d (A100) or p5 (H100) instances."

Positive Findings

✅ Style compliance: Headings follow the style guide (H1 in Title Case, H2+ in Sentence case)
✅ Content structure: Excellent logical flow from concepts to implementation to lessons learned
✅ Code examples: Comprehensive TypeScript examples with proper context and error handling
✅ Technical accuracy: DRA concepts, MIG profiles, and AWS instance types are correctly explained
✅ Images: All images have descriptive alt text
✅ Real-world value: The "What we learned the hard way" section provides valuable practical insights
✅ Cross-references: Appropriate links to related documentation and external resources
✅ SEO: Title and meta description are clear and search-friendly
✅ Front matter: Complete and properly formatted
✅ Spelling and grammar: No errors detected
✅ Line formatting: Code blocks are properly formatted with correct syntax highlighting

Shortcode Enhancement Review

The changes to layouts/shortcodes/github-card.html are well-implemented:

Adds optional path parameter support
Maintains backward compatibility
Updates both link URL and display text appropriately
Good code quality with clear variable naming
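
As a rough illustration of the behavior listed above, the following TypeScript sketch models how an optional path parameter might extend the card's link and display text while leaving existing calls unchanged. The actual shortcode is a Hugo template; the `githubCard` function, the `GithubCard` shape, and the label format are hypothetical and not taken from the PR.

```typescript
// Hypothetical model of the github-card shortcode's output; not the actual Hugo template.
interface GithubCard {
    url: string;   // link target
    label: string; // text shown on the card (format assumed for illustration)
}

function githubCard(repo: string, path?: string): GithubCard {
    const base = `https://github.com/${repo}`;
    // Without a path, behave exactly as before (backward compatible).
    if (!path) {
        return { url: base, label: repo };
    }
    // Strip leading and trailing slashes so the parts join cleanly.
    const cleanPath = path.replace(/^\/+|\/+$/g, "");
    return { url: `${base}/${cleanPath}`, label: `${repo}/${cleanPath}` };
}

console.log(githubCard("pulumi/examples"));
console.log(githubCard("pulumi/examples", "tree/master/aws-ts-eks-gpu-dra"));
```

Because the branch name is part of the path value in this model, a "master" vs "main" mismatch would yield a broken link rather than an error, which is why the paths are worth checking by hand.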

Recommendations

Verify GitHub paths: Confirm whether the pulumi/examples repository paths should use "master" or "main"
Add ESC context: Either document the ESC prerequisite or provide alternative command syntax
Consider adding: A brief mention in the prerequisites or early setup about using Pulumi ESC for authentication (if that's the intended workflow)

Summary

This is high-quality technical content that effectively demonstrates a complex topic with working code examples. The issues identified are minor and primarily concern verifying external paths and clarifying authentication context. The writing is clear, the examples are thorough, and the practical lessons at the end add significant value.

Mention me (@claude) if you'd like me to review any updates or have questions about these suggestions.

pulumi-bot · 2026-01-25T20:39:16Z

Your site preview for commit 7c157cf is ready! 🎉

http://www-testing-pulumi-docs-origin-pr-17225-7c157cf6.s3-website.us-west-2.amazonaws.com.

asafashirov · 2026-01-26T04:19:08Z

content/blog/pulumi-eks-dynamic-resource-allocation/index.md

+title: "Kubernetes GPU Sharing: NVIDIA MIG + DRA on Amazon EKS"
+date: 2026-01-24
+draft: false
+meta_desc: "Use Kubernetes DRA with Pulumi and Amazon EKS to efficiently run inference workloads with NVIDIA MIG GPU partitioning"


Suggested change

-meta_desc: "Use Kubernetes DRA with Pulumi and Amazon EKS to efficiently run inference workloads with NVIDIA MIG GPU partitioning"
+meta_desc: "Learn how to share GPUs on Amazon EKS using Kubernetes 1.34 DRA and NVIDIA MIG. Complete Pulumi TypeScript code for hardware-isolated inference and training workloads."

asafashirov · 2026-01-26T04:26:35Z

content/blog/pulumi-eks-dynamic-resource-allocation/index.md

+
+### How MIG partitions the GPU
+
+MIG works by slicing the GPU into two types of resources that get combined to create isolated instances:


Suggested change

-MIG works by slicing the GPU into two types of resources that get combined to create isolated instances:
+MIG works by partitioning the GPU into two types of resources that get combined to create isolated instances:

cnunciato

This is a great start, but it's covering way too much surface area for a single blog post. Let's hack on it together and see what we can do.

cnunciato · 2026-01-31T21:00:24Z

content/blog/pulumi-eks-dynamic-resource-allocation/index.md

+
+When you deploy a pod that references a ResourceClaim, here's what happens:
+
+1. **Scheduling** - The scheduler reads the claim's requirements and evaluates them against all ResourceSlices in the cluster. It finds nodes where matching devices are available. This is different from device plugins, where the scheduler just checks if `nvidia.com/gpu >= 1`.
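
The contrast drawn in the scheduling step can be sketched in TypeScript. This is a toy model, not the Kubernetes scheduler or its API: the `Device`, `ResourceSlice`, and `ClaimRequest` shapes, the `profile` attribute name, and both functions are assumptions made purely for illustration.

```typescript
// Toy model only: these types are simplified stand-ins, not Kubernetes API objects.
interface Device {
    name: string;
    attributes: Record<string, string>; // e.g. { profile: "1g.5gb" } (hypothetical attribute name)
    allocated: boolean;
}

interface ResourceSlice {
    node: string;
    devices: Device[];
}

interface ClaimRequest {
    // Attribute values a device must carry to satisfy the claim.
    requiredAttributes: Record<string, string>;
}

// Device-plugin style: the scheduler only compares an opaque counter.
function nodeFitsByCount(freeGpus: number, requested: number): boolean {
    return freeGpus >= requested;
}

// DRA style: evaluate the claim against every advertised device.
function nodesMatchingClaim(slices: ResourceSlice[], claim: ClaimRequest): string[] {
    const matches = (d: Device): boolean =>
        !d.allocated &&
        Object.keys(claim.requiredAttributes).every(
            (k) => d.attributes[k] === claim.requiredAttributes[k],
        );
    const nodes = slices.filter((s) => s.devices.some(matches)).map((s) => s.node);
    // Drop duplicates in case a node publishes more than one slice.
    return nodes.filter((n, i) => nodes.indexOf(n) === i);
}

const slices: ResourceSlice[] = [
    { node: "node-a", devices: [{ name: "mig-0", attributes: { profile: "1g.5gb" }, allocated: false }] },
    { node: "node-b", devices: [{ name: "mig-0", attributes: { profile: "3g.20gb" }, allocated: false }] },
];

// The count check passes for any free device, whatever its shape.
console.log(nodeFitsByCount(1, 1));
// The claim check narrows the result to nodes with a matching device: only node-b here.
console.log(nodesMatchingClaim(slices, { requiredAttributes: { profile: "3g.20gb" } }));
```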


I'll be honest: At this point, my eyes are glazing over — there's just way too much info here for a single post, and I'm only like 10% into it. 😅

I'm sure everything here is valuable and factually correct, but we don't have to educate readers on what all this stuff is, how it works, etc.; there are docs for that. We can respect their time and keep it focused by stating the problem, summarizing the solution, linking off to relevant docs, and getting to the goods, pointing out whatever's relevant along the way.

The principle behind this is known as "just in time" vs. "just in case":

I know this because as a long-winded writer myself, I've had it hammered into my own head over the years. 😄

dirien had a problem deploying to testing January 25, 2026 20:25 — with GitHub Actions Error

Commit 7c157cf: Fix kubectl command to not require Pulumi ESC

dirien temporarily deployed to testing January 25, 2026 20:29 — with GitHub Actions Inactive

dirien requested review from adamgordonbell and asafashirov January 25, 2026 20:41

cnunciato self-requested a review January 26, 2026 01:55

asafashirov reviewed Jan 26, 2026

asafashirov approved these changes Jan 26, 2026

cnunciato requested changes Jan 31, 2026


5 participants
