Thank you for the excellent work! I encountered an issue during SFT where the GPUs ran out of memory.
I ran the code on 8 H200 GPUs, each with 140 GB of memory:
```shell
TOKENIZERS_PARALLELISM=false torchrun --standalone --nproc_per_node=8 train.py \
    --deepspeed ./scripts/deepspeed_zero2.json \
    --output_dir checkpoints/$run_name \
    --overwrite_output_dir True \
    --run_name $run_name \
    --save_on_each_node True \
    --do_train True \
    --eval_strategy no \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --learning_rate $learning_rate \
    --warmup_ratio 0.03 \
    --optim adamw_torch \
    --lr_scheduler_type cosine \
    --num_train_epochs 1 \
    --logging_steps 1000 \
    --save_steps 1000 \
    --bf16 True \
    --tf32 True \
    --gradient_checkpointing True \
    --pretrained_model_name_or_path /data/checkpoints/LiveCC-7B-Instruct \
    --annotation_paths datasets/live_whisperx_526k_with_seeks_filtered.jsonl \
    --dataloader_num_workers 16 \
    --freeze_modules visual \
    --use_liger_kernel True \
    --report_to tensorboard
```
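For reference, here is my own rough back-of-envelope estimate of the per-GPU model-state footprint under ZeRO-2 with bf16 and AdamW (it is only a sketch: activations, the vision tower, fragmentation, and the CUDA context are not modeled, and the 7B parameter count is a round approximation):

```python
# Rough per-GPU memory estimate for model states under DeepSpeed ZeRO-2.
# Assumptions: bf16 weights replicated on every rank; bf16 gradients and
# fp32 AdamW states (master weights + m + v) partitioned across ranks.

def zero2_per_gpu_gb(n_params_b: float, n_gpus: int) -> float:
    """Approximate per-GPU memory in GB for model states only."""
    n = n_params_b * 1e9
    weights = 2 * n           # bf16 weights, 2 bytes/param, replicated
    grads = 2 * n / n_gpus    # bf16 gradients, partitioned by ZeRO-2
    optim = 12 * n / n_gpus   # fp32 master weights (4) + Adam m, v (8), partitioned
    return (weights + grads + optim) / 1e9

print(f"~{zero2_per_gpu_gb(7, 8):.1f} GB per GPU for model states alone")
```

By this estimate the model states are only ~26 GB per GPU, far below 140 GB, so I suspect the activations (e.g. long video token sequences) are what push memory over the limit.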
Could you please share what hardware configuration you used for SFT, and how much GPU memory is typically required?
Looking forward to your reply!