Thanks for the great work!
I use Megatron as the backend in verl, and I need to convert the saved distcp checkpoint to HF format.
I referred to the code you provided in the following link (dist2hf.py in my case):
volcengine/verl#3057 (comment)
But I got a segmentation fault:
python dist2hf.py --model_path model/QwQ-32B --dist_weight_path checkpoints/SFT-GRPO-lr-2e-6-tp-8-pp-1-batch-256/global_step_150/actor --save_path checkpoints/QwQ_SFT_GRPO_step150
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 32763876352
Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 625371) ====
0 0x0000000000042520 __sigaction() ???:0
=================================
Segmentation fault (core dumped)
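For context, here is my rough understanding of the conversion flow in dist2hf.py (a sketch only; build_megatron_model and convert_to_hf_state_dict are hypothetical placeholders for what the linked script actually does):

import torch
from megatron.core import dist_checkpointing

# 1. Build the Megatron model and get its sharded state dict
model = build_megatron_model(model_path)        # hypothetical helper
sharded_sd = model.sharded_state_dict()
# 2. Load the distcp checkpoint -- this is where the segfault occurs
state_dict = dist_checkpointing.load(sharded_sd, dist_weight_path)
# 3. Remap to HF naming and save
hf_state_dict = convert_to_hf_state_dict(state_dict)  # hypothetical helper
torch.save(hf_state_dict, save_path)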
In fact, I hit the same issue when using verl.model_merger, and I traced the problem to the following code:
## verl/utils/megatron/dist_checkpointing.py
from megatron.core import dist_checkpointing, mpu
from megatron.core.dist_checkpointing.serialization import (
    get_default_load_sharded_strategy,
)
from megatron.core.dist_checkpointing.strategies.fully_parallel import (
    FullyParallelLoadStrategyWrapper,
)

def load_dist_checkpointing(sharded_state_dict, ckpt_dir):
    load_strategy = get_default_load_sharded_strategy(ckpt_dir)
    load_strategy = FullyParallelLoadStrategyWrapper(
        load_strategy, mpu.get_data_parallel_group(with_context_parallel=True)
    )
    # Segmentation fault happens when executing this load
    state_dict = dist_checkpointing.load(sharded_state_dict, ckpt_dir, sharded_strategy=load_strategy)
    return state_dict
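To narrow this down, one thing I can try is loading with only the default sharded strategy, bypassing FullyParallelLoadStrategyWrapper entirely (a minimal sketch, assuming the same sharded_state_dict is available):

def load_with_default_strategy(sharded_state_dict, ckpt_dir):
    # Use the strategy inferred from the checkpoint metadata alone,
    # without the fully-parallel wrapper, to see if the crash persists.
    load_strategy = get_default_load_sharded_strategy(ckpt_dir)
    return dist_checkpointing.load(sharded_state_dict, ckpt_dir, sharded_strategy=load_strategy)

If this version loads cleanly, the problem is likely in the fully-parallel load path (or the data-parallel group passed to it) rather than in the checkpoint itself.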
Could you please provide any suggestions to fix this bug?
I tried megatron-core versions 0.12.1, 0.12.2, and 0.13.1; the error occurs every time.