Releases: awslabs/awsome-distributed-training
v1.2.0
What's Changed
- Update eksctl cluster versions and ML CBR usage by @bryantbiggs in #573
- Deleting SMP/SMDDP test-cases by @shimomut in #617
- adding picotron by @KeitaW in #584
- Update readme, deprecate test cases, and move Pytorch test cases under pytorch subdirectory by @KeitaW in #620
- Add EKS node autorepair example cluster manifest by @iankouls-aws in #619
- added AmazonEKS_CNI_Policy to SM Exec Role by @bluecrayon52 in #624
- Reduce efa exporter container images by @mhuguesaws in #611
- Change EFA, NCCL version in pipeline by @mhuguesaws in #626
- added DOCKER_NETWORK and env_var persistence for SageMaker Code Editor use at AWS Events by @bluecrayon52 in #623
- updated fsx_ubuntu.sh script with wait loop by @bluecrayon52 in #633
- Change PyTorch version for FSDP case and remove conda by @mhuguesaws in #629
- Change prometheus version for SMHP by @mhuguesaws in #628
- Openzfs smhp by @amanshanbhag in #622
- Fix cloudwatch access from Grafana by @mhuguesaws in #627
- Fixing recently raised Studio Issues by @amanshanbhag in #640
- Terraform Modules for HyperPod EKS by @bluecrayon52 in #586
- Slurm cluster creation issues by @amanshanbhag in #641
- Update 0.distributed-training.Dockerfile by @KeitaW in #645
- Improvements/fsdp restructure by @mhuguesaws in #630
- Add automated Grafana dashboard deployment by @mhuguesaws in #607
- Fix FSDP to use venv first by @mhuguesaws in #650
- nvshmem by @pbelevich in #599
- Update install_enroot_pyxis.sh by @KeitaW in #661
- feat: Add Hyperpod Optimum-neuron LoRA example by @Captainia in #631
- Adding custom dcgm metrics for EKS by @nadknish in #666
- re-adding deepspeed by @KeitaW in #659
- Lcc studio jl by @amanshanbhag in #669
- Update 0.distributed-training.Dockerfile by @nicolaven in #671
- utility to dump details of all nodes in a cluster, into a csv file by @amitosaurus in #652
- Update setup_mariadb_accounting.sh with apg installation by @amanshanbhag in #672
- U 2204 patch -- update from #672 by @amanshanbhag in #673
- Upgrade pinned version of Ansible by @amanshanbhag in #681
- Nghtm patch 2 by @nghtm in #683
- Fix minor spelling mistake in start_slurm.sh by @sammyhori in #686
- Fix nvidia container toolkit to 1.17.6 by @mhuguesaws in #689
- Update 2.SageMakerVPC.yaml by @nghtm in #691
- Skip fsx_ubuntu.sh execution when no FSx parameters are provided in the provisioning parameters by @vaikor-amazon in #692
- Change nccl-tests to have cuda version by @mhuguesaws in #694
- Adding a template for HyperPod EventBridge email notifications by @shimomut in #687
- Improvements/nccl cuda verison bump by @mhuguesaws in #695
- ec2 get metadata replacement by @gmgtamz in #515
- Replacing ********* with localhost in OZFS mount script by @amanshanbhag in #696
- Adding ssh keys to additional (OZFS at /home) file system by @amanshanbhag in #700
- [feat]: Add describe alarm permissions in the execution role for Rolling Update Autorollback. by @divincode in #698
- fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readm… by @mvinci12 in #701
- Fixing Race Conditions reported in #674 by @amanshanbhag in #703
- feat: Add LoRA fine-tuning optimum-neuron example for slurm by @Captainia in #643
- Fsdp regression tests by @amanshanbhag in #714
- Fix FSDP venv creation by @mhuguesaws in #720
- Updating venv test case for FSDP to point to correct train.py by @amanshanbhag in #725
- Bump requests from 2.32.0 to 2.32.4 in /3.test_cases/pytorch/bionemo by @dependabot[bot] in #727
- new commit for fixing fsdp dataset, using allenai/c4 with HF token by @mvinci12 in #729
- Adding test configs to matrix by @amanshanbhag in #731
- Change FSDP steps and checkpoint steps by @mhuguesaws in #730
- Incorrect indent in container reg test by @amanshanbhag in #732
- Change FSDP steps to reduce time by @mhuguesaws in #734
- Adding SMHP test cluster to matrix (venv) by @amanshanbhag in #740
- Fixing path to match readme instructions by @amanshanbhag in #742
- Feat/picotron resume from checkpoint by @KeitaW in #656
- Fix FSDP venv run by @mhuguesaws in #733
- slurm and eks readme edits by @mvinci12 in #735
- Change FSDP PyTorch to 2.7.1 by @mhuguesaws in #739
- Change FSDP to truncate dataset by @mhuguesaws in #743
- fix typo in NCCL tests README by @KeitaW in #746
- Enable 1click for SageMaker HyperPod by @mhuguesaws in #670
- Fix FSDP requirements.txt to effectively use cuda 128 by @mhuguesaws in #748
- Terraform Modules Updates by @bluecrayon52 in #744
- HyperPod EKS Helper Script Fixes by @bluecrayon52 in #709
- Observability change target scrapping rate to 1 minute by @mhuguesaws in #750
- Fix FSDP destroy process group by @mhuguesaws in #749
- docker library version on eks by @mvinci12 in #753
- Add GPU Health, Slurm exporter to 1click observability by @mhuguesaws in #751
- Add DCGM exporter dashboard with hostnames by @mhuguesaws in #752
- adding llamav3 support on slurm and EKS by @allela-roy in #737
- updating FSDP slurm documentation by @allela-roy in #745
- Updating Parallelcluster deployment guide by @KeitaW in #721
- Update README.md by @nghtm in https://github.com/aws-samples/awsome-distribu...
Release before the mass migration work
This release points to the old directory structure and test cases.
This release adds a new opt-in OpenZFS file system as the home directory on SageMaker HyperPod Slurm clusters. It addresses the Lots of Small Files (LoSF) issue, which is frequently encountered when creating Conda environments in default home directories where Lustre exists.