Add optional Training Plan support for HyperPod instance groups #1004

Open
newabdosheham wants to merge 9 commits into awslabs:main from newabdosheham:feature/hyperpod-training-plan
@newabdosheham

Why new PR: “Old PR #930 auto-closed due to upstream history rewrite; this is rebased onto new main.”

What changed: “Adds optional Training Plan support for HyperPod instance groups + HyperPod Slurm observability improvements.”

perifaws and others added 9 commits February 27, 2026 03:06
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
* Change readme to refer to recent test cases & assets

* Remove borked ascii logo
Description of changes:

- Updating the CloudFormation template for the HyperPod Slurm observability stack. It now automatically registers Prometheus as the data source for Grafana and installs a pre-configured dashboard in Grafana.
- Updating the lifecycle script to install metric exporters and an OTEL collector on each node, a more scalable architecture similar to HyperPod EKS Observability.
- Added an "ObservabilityConfig" class in config.py. It holds the Prometheus Remote Write URL and an advanced flag.

For more details - https://catalog.workshops.aws/sagemaker-hyperpod/en-US/09-observability

---------

Co-authored-by: Madhubalasri-B <madbal@amazon.com>
@newabdosheham
Author

This PR supersedes #930 (auto-closed due to upstream history rewrite).

Addressing review feedback from #930:

  • Refactored instance_groups to a list of objects with embedded name
  • Moved Training Plan support into the instance group object via optional training_plan_arn (instead of global vars / selecting a group by name)
  • Updated the Terraform example tfvars accordingly
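
For readers following along, the schema described above could look roughly like the sketch below. It is not taken from this PR's diff: only `name` and `training_plan_arn` come from the description, while `instance_type`, `instance_count`, and the `groups_with_training_plan` local are assumed names used for illustration.

```hcl
# Hypothetical sketch; attribute names other than name and
# training_plan_arn are assumptions, not taken from this PR's diff.
variable "instance_groups" {
  description = "HyperPod instance groups, one object per group."
  type = list(object({
    name              = string           # embedded group name
    instance_type     = string           # assumed attribute
    instance_count    = number           # assumed attribute
    training_plan_arn = optional(string) # null when omitted
  }))
}

locals {
  # Names of the groups that opted into a Training Plan.
  groups_with_training_plan = [
    for g in var.instance_groups : g.name
    if g.training_plan_arn != null
  ]
}
```

With a type like this, a group that leaves out `training_plan_arn` resolves to `null`, so tfvars entries without a Training Plan should still validate. Note that `optional()` in object type constraints needs Terraform 1.3 or later.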

Happy to adjust if you’d prefer a different naming/schema.

@newabdosheham
Author

Hi @KeitaW @bluecrayon52 — thanks again for the guidance on #930.

This PR (#1004) supersedes #930 (auto-closed due to upstream history rewrite) and is rebased onto the new main.

It addresses the review feedback from #930 by:

  • Standardizing instance_groups to a list of objects
  • Embedding name in each instance group object
  • Supporting Training Plans via optional per-group training_plan_arn (no global training-plan vars)
  • Updating terraform.tfvars.example accordingly

Would you mind taking a look when you have a moment? Thank you!

cc @perifaws @awsankur @shimomut

4 participants