Description
Hello, and thanks for taking the time to read this! I seem to be hitting a model-loading failure. When training my own base model, I used the following settings:
```yaml
### model
model_name_or_path: /train34/cog8/permanent/mywang71/Qwen2.5-1.5B

### method
stage: ddm
shift: true
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: slimpajama
template: qwen
packing: true
cutoff_len: 2048
streaming: false
tokenized_path: output/qwen2-slimpajama-tokenized/
overwrite_cache: true
preprocessing_num_workers: 16
```
I then launched training with `FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2_full_ddm.yaml`, but hit the following error:
```
06/03/2025 10:13:37 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/03/2025 10:13:37 - INFO - llamafactory.model.adapter - Fine-tuning method: Full
Traceback (most recent call last):
  File "/train34/cog8/permanent/mywang71/diffusion/DiffuLLaMA-main/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/train34/cog8/permanent/mywang71/diffusion/DiffuLLaMA-main/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/train34/cog8/permanent/mywang71/diffusion/DiffuLLaMA-main/LLaMA-Factory/src/llamafactory/train/tuner.py", line 53, in run_exp
    run_ddm(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/train34/cog8/permanent/mywang71/diffusion/DiffuLLaMA-main/LLaMA-Factory/src/llamafactory/train/ddm/workflow.py", line 40, in run_ddm
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/train34/cog8/permanent/mywang71/diffusion/DiffuLLaMA-main/LLaMA-Factory/src/llamafactory/model/loader.py", line 279, in load_model
    trainable_params, all_param, 100 * trainable_params / all_param
                                 ~~~~~~~~~~~~~~~~~~~~~~^
ZeroDivisionError: division by zero
[2025-06-03 10:13:40,092] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 25298) of binary: /home4/intern/mywang71/miniconda3/envs/DiffuLLAMA/bin/python3.1
```
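For reference, the failing expression divides by `all_param`, so the crash means the parameter count over the loaded model came out as zero. A minimal sketch of a guarded formatter (a hypothetical helper named `count_parameters_summary`, not the repo's actual function) that would surface the symptom instead of crashing:

```python
def count_parameters_summary(trainable_params: int, all_param: int) -> str:
    """Format the summary logged after load_model, guarding all_param == 0
    (which happens when the wrapped model ends up with no submodules)."""
    ratio = 100 * trainable_params / all_param if all_param else 0.0
    return (
        f"trainable params: {trainable_params:,} || "
        f"all params: {all_param:,} || trainable%: {ratio:.4f}"
    )
```

A guard like this only hides the symptom, of course; the interesting question is why no parameters were counted in the first place.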
I noticed that the `DiscreteDiffusionModel` class in LLaMA-Factory/src/llamafactory/train/ddm/model.py has no branch for qwen2; only gpt2 and llama are handled. Since neither branch matches, the trailing `del self.model` presumably leaves the wrapper with no parameters at all, which would explain `all_param == 0`. Is this the root cause? (I tried adding a qwen2 branch myself and the error no longer occurs, but I'm not sure whether that fixes the problem at its source.)
```python
if getattr(self.config, "model_type", None) == "gpt2":
    self.embed_tokens = self.model.transformer.wte
    self.denoise_model = self.model.transformer  # use inputs_embeds instead of input_ids in forward function
    for gpt2block in self.model.transformer.h:
        gpt2block.attn.bias.fill_(True)  # remove causal mask
    self.lm_head = self.model.lm_head
    del self.denoise_model.wte
elif getattr(self.config, "model_type", None) == "llama":
    self.embed_tokens = self.model.model.embed_tokens
    self.denoise_model = self.model.model
    self.lm_head = self.model.lm_head
    del self.denoise_model.embed_tokens
del self.model
```
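A qwen2 branch could plausibly mirror the llama one, since `Qwen2ForCausalLM` in transformers exposes the same module layout as `LlamaForCausalLM` (`model.model.embed_tokens`, `model.lm_head`). A minimal, self-contained sketch of that wiring, using toy stand-in objects rather than the real model classes (`wire_denoise_model` is a hypothetical name, not part of the repo):

```python
from types import SimpleNamespace

def wire_denoise_model(model, config):
    """Wire up embed_tokens / denoise_model / lm_head, treating qwen2 like llama.

    Assumes Qwen2ForCausalLM shares LlamaForCausalLM's module layout:
    model.model.embed_tokens and model.lm_head.
    """
    model_type = getattr(config, "model_type", None)
    if model_type in ("llama", "qwen2"):
        embed_tokens = model.model.embed_tokens
        denoise_model = model.model
        lm_head = model.lm_head
        del denoise_model.embed_tokens  # denoise model consumes inputs_embeds
        return embed_tokens, denoise_model, lm_head
    raise ValueError(f"unsupported model_type: {model_type}")

# Toy stand-in mimicking Qwen2ForCausalLM's attribute layout:
inner = SimpleNamespace(embed_tokens="embed_tokens", layers=[])
toy_model = SimpleNamespace(model=inner, lm_head="lm_head")
config = SimpleNamespace(model_type="qwen2")
embed, denoise, head = wire_denoise_model(toy_model, config)
```

Note this only covers the attention-architecture wiring; whether removing the causal mask (as the gpt2 branch does via `attn.bias.fill_(True)`) needs a qwen2 equivalent is a separate question for the maintainers.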