Description
My understanding is that the code handles multi-GPU training via AutoModelForCausalLM.from_pretrained(..., device_map="auto"), which I believe performs naive model parallelism: layers are split across GPUs and a batch flows through them sequentially, so roughly one GPU is computing at a time rather than all GPUs computing in parallel. For larger models and sweeps we may get significant compute savings by switching to FSDP. (We should benchmark this hypothesis with full-parameter fine-tuning before refactoring all the training attacks.)
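As a back-of-envelope check of the hypothesis (not a benchmark; the GPU count and communication-overhead fraction below are illustrative assumptions):

```python
def naive_mp_utilization(n_gpus: int) -> float:
    """With device_map='auto'-style naive model parallelism, layers are
    split across GPUs and a batch flows through them sequentially, so
    roughly one GPU computes at any moment."""
    return 1.0 / n_gpus


def fsdp_utilization(comm_overhead: float = 0.2) -> float:
    """With FSDP, every GPU computes on its own shard of the batch;
    comm_overhead is an assumed fraction lost to all-gather / reduce-scatter."""
    return 1.0 - comm_overhead


def estimated_speedup(n_gpus: int, comm_overhead: float = 0.2) -> float:
    """Rough upper bound on FSDP speedup over naive model parallelism."""
    return fsdp_utilization(comm_overhead) / naive_mp_utilization(n_gpus)


if __name__ == "__main__":
    # e.g. 4 GPUs with an assumed 20% communication overhead:
    print(f"{estimated_speedup(4):.1f}x")  # -> 3.2x
```

This ignores pipeline micro-batching and activation memory effects, so the real gap should come from the benchmark, but it motivates trying FSDP at all.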
We can use accelerate for FSDP, but this will require some refactoring.
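For reference, a minimal accelerate FSDP config (as produced by `accelerate config`) might look something like this; the field names follow accelerate's config schema, but the specific values are illustrative assumptions, not a tested setup:

```yaml
# Illustrative accelerate FSDP config -- values are assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_processes: 4            # one process per GPU (assumption)
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: SHARDED_STATE_DICT
```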
E.g., instead of run_in_isolation for training, we could launch an accelerate subprocess. Or we could separate training and evaluation into separate runs, where training uses accelerate (evaluation often uses vLLM, so I think it doesn't need to run under accelerate).
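A sketch of the subprocess option (the config path, script path, and flags here are hypothetical placeholders, not existing files in the repo):

```python
import subprocess
import sys


def build_launch_cmd(config_file: str, training_script: str,
                     *script_args: str) -> list[str]:
    """Build an `accelerate launch` argv list. Flags before the script path
    are consumed by accelerate; everything after it is forwarded to the
    training script."""
    return [
        sys.executable, "-m", "accelerate.commands.launch",
        "--config_file", config_file,
        training_script,
        *script_args,
    ]


if __name__ == "__main__":
    # Hypothetical paths/flags -- adjust to the repo's actual layout.
    cmd = build_launch_cmd("fsdp_config.yaml", "train.py", "--epochs", "3")
    subprocess.run(cmd, check=True)
```

Running training in a subprocess keeps the isolation property of run_in_isolation (fresh CUDA context, clean exit on OOM) while letting accelerate own process-group setup for FSDP.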