Releases · awslabs/awsome-distributed-training · GitHub

14 Feb 01:15

KeitaW

v1.2.0 Latest

What's Changed

Update eksctl cluster versions and ML CBR usage by @bryantbiggs in #573
Deleting SMP/SMDDP test-cases by @shimomut in #617
adding picotron by @KeitaW in #584
Update readme, deprecate test cases, and move Pytorch test cases under pytorch subdirectory by @KeitaW in #620
Add EKS node autorepair example cluster manifest by @iankouls-aws in #619
added AmazonEKS_CNI_Policy to SM Exec Role by @bluecrayon52 in #624
Reduce efa exporter container images by @mhuguesaws in #611
Change EFA, NCCL version in pipeline by @mhuguesaws in #626
added DOCKER_NETWORK and env_var persistence for SageMaker Code Editor use at AWS Events by @bluecrayon52 in #623
updated fsx_ubuntu.sh script with wait loop by @bluecrayon52 in #633
Change PyTorch version for FSDP case and remove conda by @mhuguesaws in #629
Change prometheus version for SMHP by @mhuguesaws in #628
Openzfs smhp by @amanshanbhag in #622
Fix cloudwatch access from Grafana by @mhuguesaws in #627
Fixing recently raised Studio Issues by @amanshanbhag in #640
Terraform Modules for HyperPod EKS by @bluecrayon52 in #586
Slurm cluster creation issues by @amanshanbhag in #641
Update 0.distributed-training.Dockerfile by @KeitaW in #645
Improvements/fsdp restructure by @mhuguesaws in #630
Add automated Grafana dashboard deployment by @mhuguesaws in #607
Fix FSDP to use venv first by @mhuguesaws in #650
nvshmem by @pbelevich in #599
Update install_enroot_pyxis.sh by @KeitaW in #661
feat: Add Hyperpod Optimum-neuron LoRA example by @Captainia in #631
Adding custom dcgm metrics for EKS by @nadknish in #666
re-adding deepspeed by @KeitaW in #659
Lcc studio jl by @amanshanbhag in #669
Update 0.distributed-training.Dockerfile by @nicolaven in #671
utility to dump details of all nodes in a cluster, into a csv file by @amitosaurus in #652
Update setup_mariadb_accounting.sh with apg installation by @amanshanbhag in #672
U 2204 patch -- update from #672 by @amanshanbhag in #673
Upgrade pinned version of Ansible by @amanshanbhag in #681
Nghtm patch 2 by @nghtm in #683
Fix minor spelling mistake in start_slurm.sh by @sammyhori in #686
Fix nvidia container toolkit to 1.17.6 by @mhuguesaws in #689
Update 2.SageMakerVPC.yaml by @nghtm in #691
Skip fsx_ubuntu.sh execution when no FSx parameters are provided in the provisioning parameters by @vaikor-amazon in #692
Change nccl-tests to have cuda version by @mhuguesaws in #694
Adding a template for HyperPod EventBridge email notifications by @shimomut in #687
Improvements/nccl cuda verison bump by @mhuguesaws in #695
ec2 get metadata replacement by @gmgtamz in #515
Replacing ********* with localhost in OZFS mount script by @amanshanbhag in #696
Adding ssh keys to additional (OZFS at /home) file system by @amanshanbhag in #700
[feat]: Add describe alarm permissions in the execution role for Rolling Update Autorollback. by @divincode in #698
fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readm… by @mvinci12 in #701
Fixing Race Conditions reported in #674 by @amanshanbhag in #703
feat: Add LoRA fine-tuning optimum-neuron example for slurm by @Captainia in #643
Fsdp regression tests by @amanshanbhag in #714
Fix FSDP venv creation by @mhuguesaws in #720
Updating venv test case for FSDP to point to correct train.py by @amanshanbhag in #725
Bump requests from 2.32.0 to 2.32.4 in /3.test_cases/pytorch/bionemo by @dependabot[bot] in #727
new commit for fixing fsdp dataset, using allenai/c4 with HF token by @mvinci12 in #729
Adding test configs to matrix by @amanshanbhag in #731
Change FSDP steps and checkpoint steps by @mhuguesaws in #730
Incorrect indent in container reg test by @amanshanbhag in #732
Change FSDP steps to reduce time by @mhuguesaws in #734
Adding SMHP test cluster to matrix (venv) by @amanshanbhag in #740
Fixing path to match readme instructions by @amanshanbhag in #742
Feat/picotron resume from checkpoint by @KeitaW in #656
Fix FSDP venv run by @mhuguesaws in #733
slurm and eks readme edits by @mvinci12 in #735
Change FSDP PyTorch to 2.7.1 by @mhuguesaws in #739
Change FSDP to truncate dataset by @mhuguesaws in #743
fix typo in NCCL tests README by @KeitaW in #746
Enable 1click for SageMaker HyperPod by @mhuguesaws in #670
Fix FSDP requirements.txt to effectively use cuda 128 by @mhuguesaws in #748
Terraform Modules Updates by @bluecrayon52 in #744
HyperPod EKS Helper Script Fixes by @bluecrayon52 in #709
Observability change target scrapping rate to 1 minute by @mhuguesaws in #750
Fix FSDP destroy process group by @mhuguesaws in #749
docker library version on eks by @mvinci12 in #753
Add GPU Health, Slurm exporter to 1click observability by @mhuguesaws in #751
Add DCGM exporter dashboard with hostnames by @mhuguesaws in #752
adding llamav3 support on slurm and EKS by @allela-roy in #737
updating FSDP slurm documentation by @allela-roy in #745
Updating Parallelcluster deployment guide by @KeitaW in #721
Update README.md by @nghtm in https://github.com/aws-samples/awsome-distribu...

Contributors

RobertNorthard, pbelevich, and 44 other contributors

31 Mar 08:03

KeitaW

Release before the mass migration work

This release points to the old directory structure and test cases.

This release adds a new opt-in OpenZFS filesystem as the home directory on SageMaker HyperPod Slurm clusters. It addresses the Lots of Small Files (LoSF) issue that is frequently encountered when creating Conda environments in default home directories where Lustre is in use.

08 Feb 01:13

KeitaW

Release before re-organize

Full Changelog: https://github.com/aws-samples/awsome-distributed-training/commits/v1.1
