@dirien dirien commented Jan 25, 2026

Summary

  • New blog post covering Dynamic Resource Allocation (DRA) with NVIDIA Multi-Instance GPU (MIG) on Amazon EKS using Pulumi
  • Explains how DRA replaces the device plugin model for GPU scheduling
  • Includes complete Pulumi TypeScript infrastructure code
  • Demonstrates real workloads with Fashion-MNIST (training + inference)
  • Covers GPU Operator, DRA driver configuration, and monitoring with Grafana
  • Documents lessons learned: MIG strategy mismatches, AL2023 AMI quirks, Spot capacity issues
  • Updates github-card shortcode with path parameter support

Test plan

  • Verify blog post renders correctly at /blog/pulumi-eks-dynamic-resource-allocation/
  • Check all images display properly
  • Verify code blocks have correct syntax highlighting
  • Test github-card shortcode renders with path parameter
  • Run make lint - passes

… EKS

New blog post covering Dynamic Resource Allocation (DRA) with NVIDIA
Multi-Instance GPU (MIG) on Amazon EKS using Pulumi. Includes:

- How DRA works and replaces the device plugin model
- MIG partitioning concepts and profile configurations
- Complete Pulumi TypeScript infrastructure code
- Real workload demonstration with Fashion-MNIST
- GPU Operator and DRA driver configuration
- Monitoring with Grafana and DCGM Exporter
- CrossGuard policy examples for GPU governance
- Lessons learned: MIG strategy, AL2023 AMI quirks, Spot capacity

Also updates github-card shortcode with path parameter support.

claude bot commented Jan 25, 2026

Documentation Review

This is a comprehensive technical blog post covering Kubernetes Dynamic Resource Allocation (DRA) with NVIDIA MIG on Amazon EKS. The content is well-structured, technically accurate, and demonstrates real working examples. Below are my findings:

Issues Found

Line 322: Shortcode usage with path parameter

{{< github-card repo="pulumi/examples" path="tree/master/aws-ts-eks-gpu-dra" >}}

Issue: The path parameter uses "tree/master" but the repository might be using "main" as the default branch. Verify that the path tree/master/aws-ts-eks-gpu-dra exists in the pulumi/examples repository.

Line 1198: Duplicate shortcode with same path concern

{{< github-card repo="pulumi/examples" path="tree/master/aws-ts-eks-gpu-dra/mig-policy" >}}

Issue: Same concern - verify "tree/master" vs "tree/main" for the path parameter.

Line 538: Command references Pulumi ESC without context

pulumi env run pulumi-idp/auth -- kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml

Issue: This command references pulumi env run pulumi-idp/auth which appears to be using Pulumi ESC (Environments, Secrets, and Configuration). However, this wasn't mentioned in the prerequisites or setup sections. Readers who follow along may not have this configured and won't be able to run the command. Consider either:

  1. Adding Pulumi ESC to the prerequisites section
  2. Showing the command with and without ESC: kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml
  3. Adding a note explaining this is optional if using ESC for authentication

Minor: Sentence structure on line 316

This trips people up. For MIG on AWS, you need p4d (A100) or p5 (H100) instances.

Suggestion: While technically correct, the casual tone "This trips people up" is slightly informal for a technical blog post. Consider: "This is a common source of confusion. For MIG on AWS, you need p4d (A100) or p5 (H100) instances."

Positive Findings

Style compliance: Headings follow the style guide (H1 in Title Case, H2+ in Sentence case)
Content structure: Excellent logical flow from concepts to implementation to lessons learned
Code examples: Comprehensive TypeScript examples with proper context and error handling
Technical accuracy: DRA concepts, MIG profiles, and AWS instance types are correctly explained
Images: All images have descriptive alt text
Real-world value: The "What we learned the hard way" section provides valuable practical insights
Cross-references: Appropriate links to related documentation and external resources
SEO: Title and meta description are clear and search-friendly
Front matter: Complete and properly formatted
Spelling and grammar: No errors detected
Line formatting: Code blocks are properly formatted with correct syntax highlighting

Shortcode Enhancement Review

The changes to layouts/shortcodes/github-card.html are well-implemented:

  • Adds optional path parameter support
  • Maintains backward compatibility
  • Updates both link URL and display text appropriately
  • Good code quality with clear variable naming
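
As a rough illustration of the behaviour listed above, the optional `path` logic can be sketched in TypeScript. The actual shortcode is a Hugo template; the function name `githubCard`, the returned fields, and the way the display text is built are assumptions made for this sketch, not the code in the PR:

```typescript
// Hypothetical sketch of the optional `path` behaviour; the real shortcode
// is a Hugo template, and these names are illustrative only.
interface GithubCard {
  url: string;   // link target
  label: string; // display text
}

function githubCard(repo: string, path?: string): GithubCard {
  const base = `https://github.com/${repo}`;
  // Strip leading and trailing slashes so "tree/master/x" and "/tree/master/x/" behave the same.
  const cleaned = (path ?? "").replace(/^\/+|\/+$/g, "");
  if (cleaned === "") {
    // No path supplied: output is the same as before the change (backward compatible).
    return { url: base, label: repo };
  }
  // Path supplied: extend both the link URL and the display text.
  return { url: `${base}/${cleaned}`, label: `${repo}/${cleaned}` };
}
```

Called with only `repo`, the sketch returns the plain repository link, which is what keeps existing uses of the shortcode unchanged.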

Recommendations

  1. Verify GitHub paths: Confirm whether the default branch of the pulumi/examples repository is "master" or "main", and that both shortcode paths resolve
  2. Add ESC context: Either document the ESC prerequisite or provide alternative command syntax
  3. Consider adding: A brief mention in the prerequisites or early setup about using Pulumi ESC for authentication (if that's the intended workflow)

Summary

This is high-quality technical content that effectively demonstrates a complex topic with working code examples. The issues identified are minor and primarily concern verifying external paths and clarifying authentication context. The writing is clear, the examples are thorough, and the practical lessons at the end add significant value.

Mention me (@claude) if you'd like me to review any updates or have questions about these suggestions.

@cnunciato cnunciato self-requested a review January 26, 2026 01:55
title: "Kubernetes GPU Sharing: NVIDIA MIG + DRA on Amazon EKS"
date: 2026-01-24
draft: false
meta_desc: "Use Kubernetes DRA with Pulumi and Amazon EKS to efficiently run inference workloads with NVIDIA MIG GPU partitioning"

Suggested change
- meta_desc: "Use Kubernetes DRA with Pulumi and Amazon EKS to efficiently run inference workloads with NVIDIA MIG GPU partitioning"
+ meta_desc: "Learn how to share GPUs on Amazon EKS using Kubernetes 1.34 DRA and NVIDIA MIG. Complete Pulumi TypeScript code for hardware-isolated inference and training workloads."


### How MIG partitions the GPU

MIG works by slicing the GPU into two types of resources that get combined to create isolated instances:

Suggested change
- MIG works by slicing the GPU into two types of resources that get combined to create isolated instances:
+ MIG works by partitioning the GPU into two types of resources that get combined to create isolated instances:

@cnunciato cnunciato left a comment

This is a great start, but it's covering way too much surface area for a single blog post. Let's hack on it together and see what we can do.


When you deploy a pod that references a ResourceClaim, here's what happens:

1. **Scheduling** - The scheduler reads the claim's requirements and evaluates them against all ResourceSlices in the cluster. It finds nodes where matching devices are available. This is different from device plugins, where the scheduler just checks if `nvidia.com/gpu >= 1`.
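
The contrast in the scheduling step can be sketched in TypeScript. The types below are simplified, hypothetical stand-ins rather than the real Kubernetes API objects, the `profile` attribute name is an assumption, and the sketch ignores whether a device is already allocated:

```typescript
// Simplified, hypothetical shapes; not the real Kubernetes API types.
interface Device {
  name: string;
  attributes: Record<string, string | number>;
}

interface ResourceSlice {
  node: string;
  devices: Device[];
}

// A claim's requirements, reduced to a predicate over one device.
type Selector = (device: Device) => boolean;

// Device plugin model: the scheduler only compares a counter per node.
function fitsByCount(allocatable: Record<string, number>, resource: string, requested: number): boolean {
  return (allocatable[resource] ?? 0) >= requested;
}

// DRA model: return the nodes that advertise at least one device satisfying the selector.
function nodesMatchingClaim(slices: ResourceSlice[], selector: Selector): string[] {
  const nodes: string[] = [];
  for (const slice of slices) {
    if (slice.devices.some(selector) && nodes.indexOf(slice.node) === -1) {
      nodes.push(slice.node);
    }
  }
  return nodes;
}

// Example data: two nodes advertising different MIG profiles (values are illustrative).
const exampleSlices: ResourceSlice[] = [
  { node: "node-a", devices: [{ name: "mig-0", attributes: { profile: "1g.5gb" } }] },
  { node: "node-b", devices: [{ name: "mig-1", attributes: { profile: "3g.20gb" } }] },
];

const candidates = nodesMatchingClaim(exampleSlices, (d) => d.attributes["profile"] === "3g.20gb");
```

The counter check has no way to express "a 3g.20gb slice", while the selector-based match can filter on any attribute a device advertises.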

I'll be honest: At this point, my eyes are glazing over — there's just way too much info here for a single post, and I'm only like 10% into it. 😅

I'm sure everything here is valuable and factually correct, but we don't have to educate readers on what all this stuff is, how it works, etc.; there are docs for that. We can respect their time and keep it focused by stating the problem, summarizing the solution, linking off to relevant docs, and getting to the goods, pointing out whatever's relevant along the way.

The principle behind this is known as "just in time" vs. "just in case":

[Image illustrating "just in time" vs. "just in case"]

I know this because as a long-winded writer myself, I've had it hammered into my own head over the years. 😄
