
Huvu/mcore wan official #2

Open
huvunvidia wants to merge 41 commits into abhinavg4:main from huvunvidia:huvu/mcore_wan_official

Conversation

@huvunvidia

No description provided.

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* export env fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* paths fixes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove debug lines

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* unit test fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
* chore: Add issue template for model requests

Signed-off-by: oliver könig <okoenig@nvidia.com>

* copying over remaining templates

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
* ci(fix): pre-flight

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* gemma provider

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* patch tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix conftest

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* reenable msc

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* upload assets

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address feedback -s

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use mcore activations

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix mock

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* subclass

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address feedback

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* add sections for releases

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* improve description

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* exit profiler context

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
* Clear disk space before install check

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* temp save

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update llama recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove some old recipe files

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe file

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update qwen recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update llama recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* recipe naming update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add TypedDict for args

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update action.yml

Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* add comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add guard / mock for the places that need to download hf config in unit test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* add qwen functional test

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update recipe tests

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* lint

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
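
The `__call__` change noted above can be illustrated with a minimal sketch. It assumes a forward step that pulls one batch from an iterator, runs the model, and returns the output with a loss. `ForwardStepSketch`, its argument names, and the batch keys are hypothetical and are not taken from the PR's `DITForwardStep`.

```python
class ForwardStepSketch:
    """Hypothetical stand-in for a forward-step object (not the PR's class)."""

    def __init__(self, loss_fn):
        # loss_fn(output, target) -> loss; supplied by the caller.
        self.loss_fn = loss_fn
        self.calls = 0

    def __call__(self, data_iterator, model):
        # Defining __call__ lets a training loop invoke the instance
        # exactly like a plain forward-step function.
        batch = next(data_iterator)
        self.calls += 1
        output = model(batch["inputs"])
        return output, self.loss_fn(output, batch["targets"])


# Toy usage with scalar "tensors" so the sketch runs without any framework.
step = ForwardStepSketch(loss_fn=lambda out, tgt: (out - tgt) ** 2)
batches = iter([{"inputs": 2.0, "targets": 5.0}])
output, loss = step(batches, model=lambda x: 2 * x)
```

A callable object can carry state between steps (here, a call counter), which a bare function cannot do without globals or closures.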
abhinavg4 and others added 11 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
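
A dedicated dataset config object, as described in the list above, might look roughly like the following sketch. The class name `DataModuleConfigSketch` and every field in it are assumptions made for illustration; the actual fields of `DiffusionDataModuleConfig` are not shown in this conversation.

```python
from dataclasses import dataclass, replace


@dataclass
class DataModuleConfigSketch:
    """Hypothetical dataset config; field names are illustrative only."""

    path: str
    seq_length: int = 2048
    micro_batch_size: int = 1
    num_workers: int = 2

    def __post_init__(self):
        # Validate early so a bad value fails at config time,
        # not in the middle of a training run.
        if self.seq_length <= 0:
            raise ValueError("seq_length must be positive")
        if self.micro_batch_size <= 0:
            raise ValueError("micro_batch_size must be positive")


# Build a config, then derive an overridden copy without mutating the original.
cfg = DataModuleConfigSketch(path="/data/example", seq_length=1024)
overridden = replace(cfg, micro_batch_size=4)
```

Grouping dataset settings in one typed object keeps overrides explicit and makes them easy to validate in one place.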

6 participants