Hi,
I am doing TinyLlama pretraining on a Chinese dataset using your code; it has been very helpful for me, thanks.
Benchmark results
| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
|---|---|---|---|---|---|---|
| tinyllama | 2*RTX4090 | DeepSpeed Zero-2 | 3 | 4 | 21G | 1.8k |
| tinyllama | 2*RTX4090 | DDP | 3 | 4 | 21G | 2.7k |
| tinyllama | 2*RTX4090 | DDP | 3 | 1 | 21G | 1.5k |
| tinyllama | 1*RTX4090 | N/A | 3 | 4 | 21G | 1.8k |
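For context, this is roughly the configuration behind the table. It is a minimal sketch using standard `transformers` `TrainingArguments` fields; the output path and the ZeRO-2 dict are placeholders rather than my exact files.

```python
# Minimal sketch of the run configuration (transformers 4.37.2 field names;
# output_dir and the ZeRO-2 dict are placeholders, not my exact files).
from transformers import TrainingArguments

# Passed via the `deepspeed` argument only for the ZeRO-2 row in the table.
ds_zero2_config = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./tinyllama-zh-pretrain",  # placeholder path
    per_device_train_batch_size=3,         # values from the table above
    gradient_accumulation_steps=4,
    bf16=True,
    deepspeed=ds_zero2_config,             # drop this line for the DDP / single-GPU rows
)
```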
Some issues:
- The token throughput is much lower than with 8*RTX3090, and DeepSpeed Zero-2 performs worse than DDP, even no better than a single RTX4090.
- I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2 (see the sketch after this list).
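To make the second point concrete, here is a sketch of the batch-size settings involved. `auto_find_batch_size` is the standard `TrainingArguments` flag; the retry-with-smaller-batch behaviour described in the comment is my understanding of it, which would explain 4 falling back to 2.

```python
# Sketch of the batch-size issue (placeholder output_dir; values from above).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./tinyllama-zh-pretrain",  # placeholder
    per_device_train_batch_size=4,         # what I would like to use
    auto_find_batch_size=True,             # on OOM this retries with a smaller batch,
                                           # which is presumably why 4 ends up as 2
    gradient_accumulation_steps=4,
)
```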
Environments
deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2
Any ideas on how I can improve the throughput?