* fix cpu init during export
* export env fix
* delete_extra_state for TE related during checkpoint loading for export
* paths fixes
* add override_provider option for checkpoint loading
* add unit test for override_provider option
* remove debug lines
* lint
* unit test fix

---------
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
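The "delete_extra_state for TE related" entry refers to dropping TransformerEngine `_extra_state` entries from a checkpoint before loading it for export. A minimal sketch of that filtering, assuming the `._extra_state` key suffix convention; the helper name and the example keys are illustrative, not the repository's actual API:

```python
# Hypothetical sketch: strip TransformerEngine "extra state" entries from a
# checkpoint state dict so it can load into a model built without TE layers.
# The "._extra_state" suffix is the TE convention; the function name is ours.

def delete_extra_state(state_dict: dict) -> dict:
    """Return a copy of state_dict without TE extra-state entries."""
    return {k: v for k, v in state_dict.items() if not k.endswith("._extra_state")}

ckpt = {
    "decoder.layers.0.self_attention.weight": [0.1, 0.2],
    "decoder.layers.0.self_attention._extra_state": b"te-serialized-blob",
}
clean = delete_extra_state(ckpt)
# The real weight survives; the TE bookkeeping entry is filtered out.
```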
* chore: Add issue template for model requests
* copying over remaining templates

---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Skip if `docs-only` label is attached
* test
* test
* test
* update

---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup process group at end of performance script
* Update scripts/performance/run_script.py
* destroy pg for other scripts
* update

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
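The cleanup described here is the usual "destroy the process group in a `finally` block" pattern. In the real scripts this would call `torch.distributed.destroy_process_group()` guarded by `torch.distributed.is_initialized()`; the tiny stand-in below keeps the sketch runnable without torch, and all names in it are illustrative:

```python
# Sketch of tearing down the process group at script exit. _FakeDist stands in
# for torch.distributed so the pattern is self-contained; in the actual
# performance scripts the calls would go through torch.distributed.

class _FakeDist:
    def __init__(self):
        self._initialized = False

    def init_process_group(self):
        self._initialized = True

    def is_initialized(self):
        return self._initialized

    def destroy_process_group(self):
        self._initialized = False

dist = _FakeDist()  # stand-in for torch.distributed

def run_perf_script():
    dist.init_process_group()
    try:
        pass  # ... benchmark workload would run here ...
    finally:
        # Always tear the group down, even if the workload raises,
        # so ranks do not hang waiting on each other at exit.
        if dist.is_initialized():
            dist.destroy_process_group()

run_perf_script()
```

The `finally` placement is the point of the change: without it, an exception in the workload leaves the group alive and distributed jobs can hang instead of exiting cleanly.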
* ci(fix): pre-flight
* test
* test
* final

---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* initial gemma commit
* gemma provider
* patch tests
* add gemma bridge + tests
* fix conftest
* reenable msc
* fix gemma test fallback
* try simpler tokenizer
* upload assets
* use pre-downloaded config for model provider test
* lint
* address feedback
* rebase
* rebase
* use mcore activations
* update test
* fix mock
* fix conversion script reference
* subclass
* update tests

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] packed sequences
* [docs] packed sequences
* address feedback

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* gemma2 provider and bridge
* gemma2 model provider + bridge

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] placeholder page for performance summary
* add sections for releases
* improve description

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility
* fix megatron fsdp test assertion

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
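The `latest_checkpointed_iteration` entry restores the Megatron-style tracker file that downstream tooling reads to find which iteration's checkpoint to load. A minimal sketch of that convention, assuming the `latest_checkpointed_iteration.txt` file name; the helper functions are illustrative:

```python
# Sketch of the Megatron-LM "latest_checkpointed_iteration.txt" tracker file.
# Tools that consume checkpoints read this single file to locate the newest
# checkpoint, so writing it on save keeps older tooling compatible.

import os
import tempfile

TRACKER = "latest_checkpointed_iteration.txt"

def save_tracker(ckpt_dir: str, iteration: int) -> None:
    """Record the iteration of the checkpoint just written."""
    with open(os.path.join(ckpt_dir, TRACKER), "w") as f:
        f.write(str(iteration))

def read_tracker(ckpt_dir: str) -> int:
    """Return the iteration of the latest checkpoint."""
    with open(os.path.join(ckpt_dir, TRACKER)) as f:
        return int(f.read().strip())

with tempfile.TemporaryDirectory() as d:
    save_tracker(d, 2000)
    assert read_tracker(d) == 2000
```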
* exit profiler context
* disable vocab size logging in flops calculation

---------
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* Clear disk space before install check
* Revert "Clear disk space before install check" (reverts commit 2c085f5)
* Run bare metal install on self-hosted runners

---------
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well
* temporary remove llama4 as it's not fully tested or verified.
* Revert "temporary remove llama4 as it's not fully tested or verified." (reverts commit 5217084)
* temp save
* temp save
* Revert "temp save" (reverts commit 0c57e2b)
* Revert "temp save" (reverts commit 0748d52)
* update qwen's recipes
* update llama recipes
* remove some old recipe files
* update recipe files to match old recipes
* update recipe file
* update qwen recipes
* update llama recipes
* Update src/megatron/bridge/recipes/qwen/qwen3.py (×3, co-authored by Ananth Subramaniam)
* Update src/megatron/bridge/recipes/llama/llama2.py (×2, co-authored by Ananth Subramaniam)
* recipe naming update
* update test
* lint
* add TypedDict for args
* lint
* update docstring
* unit test fix and license fix
* sync eval_interval and save_interval
* add comments
* set TRANSFORMERS_OFFLINE=1 in action.yml
* fix llama3 8b hf model path
* replay lr decay iters update on updated recipes
* Update action.yml
* add comments
* Add guard / mock for the places that need to download hf config in unit tests
* lint
* add qwen functional test
* update recipe tests
* lint

---------
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
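The "set TRANSFORMERS_OFFLINE=1" and "Add guard / mock" entries describe keeping unit tests from downloading Hugging Face configs. `TRANSFORMERS_OFFLINE` is the real Transformers environment switch; everything else in this sketch, including `load_model_config`, is a hypothetical stand-in for whatever the tests actually call:

```python
# Sketch of guarding tests against Hugging Face Hub downloads. The env var
# TRANSFORMERS_OFFLINE=1 is the real Transformers offline switch; the loader
# function and its return values are illustrative only.

import os
from unittest import mock

def load_model_config(name: str) -> dict:
    if os.environ.get("TRANSFORMERS_OFFLINE") == "1":
        # Offline: resolve from a local cache / mock instead of the Hub.
        return {"model": name, "source": "local-cache"}
    return {"model": name, "source": "hub-download"}

# In a test, patch the environment so no network access is attempted.
with mock.patch.dict(os.environ, {"TRANSFORMERS_OFFLINE": "1"}):
    cfg = load_model_config("llama3-8b")
assert cfg["source"] == "local-cache"
```

Using `mock.patch.dict` scopes the change to the test body, so other tests still see the original environment.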
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated the `DITForwardStep` class to use the `__call__` method for forward steps.
- Modified the dataset configuration in `pretrain_config` to use `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
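Moving the forward step into `__call__` lets a trainer invoke the step object directly, like a function, while still carrying configuration as instance state. A minimal sketch of that pattern; the class name matches the entry above, but its constructor, batch shape, and loss computation here are invented for illustration:

```python
# Sketch of a forward step exposed via __call__, as the DITForwardStep change
# describes. The loss_scale field and the toy "loss" are illustrative only.

class DITForwardStep:
    def __init__(self, loss_scale: float = 1.0):
        self.loss_scale = loss_scale

    def __call__(self, batch):
        # A trainer can now call the step object directly: step(batch).
        return sum(batch) * self.loss_scale

step = DITForwardStep(loss_scale=0.5)
loss = step([1.0, 2.0, 3.0])  # 6.0 * 0.5 = 3.0
```

The advantage over a named `forward_step` method is that the object satisfies any callback slot expecting a plain callable, with no adapter needed.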
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added a `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2, and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.