Thank you for the excellent work! I encountered an issue during SFT where the GPUs ran out of memory.
I ran the code on 8 H200 GPUs, each with 140 GB of memory:
```shell
TOKENIZERS_PARALLELISM=false torchrun --standalone --nproc_per_node=8 train.py \
    --deepspeed ./scripts/deepspeed_zero2.json \
    --output_dir checkpoints/$run_name \
    --overwrite_output_dir True \
    --run_name $run_name \
    --save_on_each_node True \
    --do_train True \
    --eval_strategy no \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --learning_rate $learning_rate \
    --warmup_ratio 0.03 \
    --optim adamw_torch \
    --lr_scheduler_type cosine \
    --num_train_epochs 1 \
    --logging_steps 1000 \
    --save_steps 1000 \
    --bf16 True \
    --tf32 True \
    --gradient_checkpointing True \
    --pretrained_model_name_or_path /data/checkpoints/LiveCC-7B-Instruct \
    --annotation_paths datasets/live_whisperx_526k_with_seeks_filtered.jsonl \
    --dataloader_num_workers 16 \
    --freeze_modules visual \
    --use_liger_kernel True \
    --report_to tensorboard
```
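For reference, here is my own rough back-of-envelope estimate of the per-GPU model-state footprint under ZeRO-2 with bf16 and AdamW (it is only a sketch: activations, the vision tower, fragmentation, and the CUDA context are not modeled, and the 7B parameter count is a round approximation):

```python
# Rough per-GPU memory estimate for model states under DeepSpeed ZeRO-2.
# Assumptions: bf16 weights replicated on every rank; bf16 gradients and
# fp32 AdamW states (master weights + m + v) partitioned across ranks.

def zero2_per_gpu_gb(n_params_b: float, n_gpus: int) -> float:
    """Approximate per-GPU memory in GB for model states only."""
    n = n_params_b * 1e9
    weights = 2 * n           # bf16 weights, 2 bytes/param, replicated
    grads = 2 * n / n_gpus    # bf16 gradients, partitioned by ZeRO-2
    optim = 12 * n / n_gpus   # fp32 master weights (4) + Adam m, v (8), partitioned
    return (weights + grads + optim) / 1e9

print(f"~{zero2_per_gpu_gb(7, 8):.1f} GB per GPU for model states alone")
```

By this estimate the model states are only ~26 GB per GPU, far below 140 GB, so I suspect the activations (e.g. long video token sequences) are what push memory over the limit.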
Could you please share what hardware configuration you used for SFT, and how much GPU memory is typically required?
Looking forward to your reply!