Description
My understanding is that the code handles multi-GPU training via AutoModelForCausalLM.from_pretrained(..., device_map="auto"), which I believe performs naive model parallelism: layers are split across GPUs and a batch flows through them sequentially, so roughly one GPU is computing at a time rather than all GPUs computing in parallel. For larger models and sweeps we may get significant compute savings by switching to FSDP. (We should benchmark this hypothesis with full-parameter fine-tuning before refactoring all the training attacks.)
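As a back-of-envelope check of the hypothesis (not a benchmark; the GPU count and communication-overhead fraction below are illustrative assumptions):

```python
def naive_mp_utilization(n_gpus: int) -> float:
    """With device_map='auto'-style naive model parallelism, layers are
    split across GPUs and a batch flows through them sequentially, so
    roughly one GPU computes at any moment."""
    return 1.0 / n_gpus


def fsdp_utilization(comm_overhead: float = 0.2) -> float:
    """With FSDP, every GPU computes on its own shard of the batch;
    comm_overhead is an assumed fraction lost to all-gather / reduce-scatter."""
    return 1.0 - comm_overhead


def estimated_speedup(n_gpus: int, comm_overhead: float = 0.2) -> float:
    """Rough upper bound on FSDP speedup over naive model parallelism."""
    return fsdp_utilization(comm_overhead) / naive_mp_utilization(n_gpus)


if __name__ == "__main__":
    # e.g. 4 GPUs with an assumed 20% communication overhead:
    print(f"{estimated_speedup(4):.1f}x")  # -> 3.2x
```

This ignores pipeline micro-batching and activation memory effects, so the real gap should come from the benchmark, but it motivates trying FSDP at all.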
We can use accelerate for FSDP, but this will require some refactoring.
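For reference, a minimal accelerate FSDP config (as produced by `accelerate config`) might look something like this; the field names follow accelerate's config schema, but the specific values are illustrative assumptions, not a tested setup:

```yaml
# Illustrative accelerate FSDP config -- values are assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_processes: 4            # one process per GPU (assumption)
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: SHARDED_STATE_DICT
```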
E.g., instead of run_in_isolation for training, we could launch an accelerate subprocess. Or we could separate training and evaluation into separate runs, where training uses accelerate (evaluation often uses vLLM, so I think it doesn't need to run under accelerate).
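A sketch of the subprocess option (the config path, script path, and flags here are hypothetical placeholders, not existing files in the repo):

```python
import subprocess
import sys


def build_launch_cmd(config_file: str, training_script: str,
                     *script_args: str) -> list[str]:
    """Build an `accelerate launch` argv list. Flags before the script path
    are consumed by accelerate; everything after it is forwarded to the
    training script."""
    return [
        sys.executable, "-m", "accelerate.commands.launch",
        "--config_file", config_file,
        training_script,
        *script_args,
    ]


if __name__ == "__main__":
    # Hypothetical paths/flags -- adjust to the repo's actual layout.
    cmd = build_launch_cmd("fsdp_config.yaml", "train.py", "--epochs", "3")
    subprocess.run(cmd, check=True)
```

Running training in a subprocess keeps the isolation property of run_in_isolation (fresh CUDA context, clean exit on OOM) while letting accelerate own process-group setup for FSDP.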