diff --git a/docs/finetuning.md b/docs/finetuning.md
index 0ce9ab1..41af477 100644
--- a/docs/finetuning.md
+++ b/docs/finetuning.md
@@ -34,6 +34,20 @@
 python finetuning/finetune.py
 ```
 to run the sample fine-tuning loop.
+This loop should run on an A100 with 80 GB of memory.
+If you need to reduce memory usage, you could try the following:
+
+- split the model and optimiser parameters across multiple GPUs with
+  [FSDP](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html);
+- use a more memory-efficient optimiser, such as
+  [Adafactor](https://docs.pytorch.org/docs/stable/generated/torch.optim.Adafactor.html);
+- split the model activations across multiple GPUs with model parallelism
+  (you will need to implement this yourself or use an existing framework);
+- offload model or optimiser parameters to CPU memory; or
+- run everything in pure `bfloat16` (this may make training less stable).
+
+You could also try tuning the activation checkpointing strategy to see
+whether there is anything more to gain there.
 
 For example, on Azure, launch a VM with size `Standard_NC24ads_A100_v4`,
 image Ubuntu 24.04 LTS (x64), and 256 GB of disk space.
@@ -51,6 +65,8 @@
 and reboot.
 You should now be able to clone the repo and build and run the image
 using the instructions above.
+
+
 ## Computing Gradients
 
 To compute gradients, you will need an A100 with 80 GB of memory.
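
For the optimiser and precision suggestions in the patch, here is a minimal sketch of what applying them together might look like, assuming a plain PyTorch training step; the `nn.Sequential` model is a hypothetical stand-in for whatever `finetuning/finetune.py` actually builds:

```python
import torch
from torch import nn
from torch.optim import Adafactor

# Hypothetical stand-in for the model built in finetuning/finetune.py.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Pure bfloat16: parameters, gradients, and optimiser state are all held in
# bfloat16, halving their size relative to float32, but there is no float32
# master copy of the weights, which is why training may be less stable.
model = model.to(device="cuda", dtype=torch.bfloat16)

# Adafactor keeps factored second-moment statistics instead of one full
# tensor per parameter, so its state is much smaller than AdamW's.
optimizer = Adafactor(model.parameters(), lr=1e-5)

# One illustrative training step on random data in the model's dtype.
x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

For the FSDP option, the same model would instead be wrapped with `torch.distributed.fsdp.FullyShardedDataParallel` under a distributed launch, which shards parameters, gradients, and optimiser state across ranks; the tutorial linked in the patch walks through that setup.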
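
The activation checkpointing suggestion trades compute for memory: a checkpointed block keeps only its inputs and recomputes its intermediate activations during the backward pass. A minimal sketch with `torch.utils.checkpoint`, again using a hypothetical stand-in for one block of the real model:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for one block of the model; checkpointing it means
# its intermediate activations are recomputed in backward, not stored.
block = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).cuda()

x = torch.randn(8, 1024, device="cuda", requires_grad=True)
# use_reentrant=False is the variant recommended by recent PyTorch docs.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```

Which blocks to checkpoint is the tuning knob: checkpointing more of them lowers peak activation memory at the cost of extra forward recomputation.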