Is your feature request related to a problem? Please describe.
Currently, ChatLearn uses a tool to convert checkpoints when it detects that the parallel strategies differ: https://github.com/alibaba/ChatLearn/blob/main/chatlearn/utils/megatron_utils.py#L164
Megatron-core's dist_checkpointing already handles this conversion online: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html
Describe the solution you'd like
Use Megatron-core dist_checkpointing to save and load checkpoints.
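
To illustrate why a checkpoint format that is independent of the parallel layout removes the need for a separate conversion step, here is a minimal pure-Python sketch. It does not use Megatron-core or ChatLearn code; `shard_bounds`, `save_shard`, `load_shard`, and the in-memory `ckpt` dict are hypothetical names made up for this example, and it assumes a 1-D tensor split evenly across ranks. The idea shown is that each rank saves its shard keyed by its global offset, so a job with a different tensor-parallel size can read back exactly the slice it needs.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory "checkpoint": (tensor name, global offset) -> shard values.
Checkpoint = Dict[Tuple[str, int], List[float]]


def shard_bounds(global_len: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a 1-D tensor owned by `rank`."""
    assert global_len % world_size == 0, "sketch assumes an even split"
    per_rank = global_len // world_size
    return rank * per_rank, (rank + 1) * per_rank


def save_shard(ckpt: Checkpoint, name: str, full: List[float],
               world_size: int, rank: int) -> None:
    """Store this rank's shard, keyed by its offset in the global tensor."""
    # A real rank would only hold its own shard; the full tensor is passed
    # here just to keep the sketch short.
    start, end = shard_bounds(len(full), world_size, rank)
    ckpt[(name, start)] = full[start:end]


def load_shard(ckpt: Checkpoint, name: str, global_len: int,
               world_size: int, rank: int) -> List[float]:
    """Rebuild the slice needed by `rank` under a possibly different world size."""
    start, end = shard_bounds(global_len, world_size, rank)
    out: List[float] = []
    # Walk saved shards in offset order and copy the overlapping part of each.
    for (saved_name, offset), values in sorted(ckpt.items()):
        if saved_name != name:
            continue
        lo = max(start, offset)
        hi = min(end, offset + len(values))
        if lo < hi:
            out.extend(values[lo - offset:hi - offset])
    return out


# Save with a tensor-parallel size of 4.
weight = [float(i) for i in range(8)]
ckpt: Checkpoint = {}
for r in range(4):
    save_shard(ckpt, "w", weight, world_size=4, rank=r)

# Load with a tensor-parallel size of 2, with no offline conversion step.
rank0 = load_shard(ckpt, "w", global_len=8, world_size=2, rank=0)
rank1 = load_shard(ckpt, "w", global_len=8, world_size=2, rank=1)
```

In this sketch the loader only needs the global shape and its own rank layout, which is the property that would let ChatLearn drop the conversion tool if checkpoints were saved and loaded through dist_checkpointing.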