[Feature] Use Megatron-core dist_checkpointing to load checkpoints with different parallel strategies #169

**Is your feature request related to a problem? Please describe.**
Currently, ChatLearn uses an offline tool to convert a checkpoint whenever it detects that the checkpoint's parallel strategy differs from the current one: https://github.com/alibaba/ChatLearn/blob/main/chatlearn/utils/megatron_utils.py#L164

Megatron-core dist_checkpointing already handles this conversion online, at load time: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html

**Describe the solution you'd like**
Use Megatron-core dist_checkpointing to save and load checkpoints, so that no separate conversion step is needed when the parallel strategy changes.
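
To illustrate why a sharded checkpoint format removes the need for a conversion tool, here is a toy sketch in plain Python. It is not the Megatron-core API and does not reflect ChatLearn's code: `shard_rows`, `save_sharded`, and `load_sharded` are hypothetical helpers, and the even row-wise split is an assumption made for the example. The idea is that each shard is saved together with its global offset, so a loader running with a different tensor-parallel size can pick out exactly the rows it owns.

```python
# Toy model of resharding on load. NOT the Megatron-core API; all names
# below are hypothetical and exist only for this illustration.

def shard_rows(full, tp_size):
    """Split a 2-D weight (a list of rows) evenly across tp_size ranks."""
    rows_per_rank, rem = divmod(len(full), tp_size)
    if rem:
        raise ValueError("row count must be divisible by tp_size")
    return [full[r * rows_per_rank:(r + 1) * rows_per_rank]
            for r in range(tp_size)]


def save_sharded(local_shards):
    """Store each rank's shard together with its global row offset."""
    checkpoint, offset = [], 0
    for shard in local_shards:
        checkpoint.append({"global_offset": offset, "rows": shard})
        offset += len(shard)
    return checkpoint


def load_sharded(checkpoint, rank, tp_size):
    """Collect the rows that `rank` owns under a (possibly new) tp_size."""
    total_rows = sum(len(entry["rows"]) for entry in checkpoint)
    rows_per_rank, rem = divmod(total_rows, tp_size)
    if rem:
        raise ValueError("row count must be divisible by tp_size")
    start, stop = rank * rows_per_rank, (rank + 1) * rows_per_rank
    wanted = []
    for entry in sorted(checkpoint, key=lambda e: e["global_offset"]):
        for i, row in enumerate(entry["rows"]):
            # Keep a row only if its global index falls in this rank's range.
            if start <= entry["global_offset"] + i < stop:
                wanted.append(row)
    return wanted


# Save with tensor-parallel size 2, then load with size 4.
weight = [[float(r * 10 + c) for c in range(2)] for r in range(8)]
ckpt = save_sharded(shard_rows(weight, tp_size=2))
reloaded = [load_sharded(ckpt, rank, tp_size=4) for rank in range(4)]
```

In the real library the per-shard offsets are carried by sharded tensor metadata rather than a Python list, but the principle is the same: the checkpoint describes global positions, so the loader's parallel layout does not have to match the saver's.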


