
Some benchmark results and issues on 2*RTX4090 #1

@nanamiwang


Hi,
I am pretraining TinyLlama on a Chinese dataset using your code. It has been very helpful for me, thanks.

Benchmark results

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
|---|---|---|---|---|---|---|
| tinyllama | 2*RTX4090 | DeepSpeed ZeRO-2 | 3 | 4 | 21G | 1.8k |
| tinyllama | 2*RTX4090 | DDP | 3 | 4 | 21G | 2.7k |
| tinyllama | 2*RTX4090 | DDP | 3 | 1 | 21G | 1.5k |
| tinyllama | 1*RTX4090 | N/A | 3 | 4 | 21G | 1.8k |
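For reference, a minimal sketch of the trainer arguments behind these numbers; the output dir and DeepSpeed config path are placeholders, not the exact ones used:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./tinyllama-zh",    # placeholder path
    per_device_train_batch_size=3,  # the largest size that fits on 24G here
    gradient_accumulation_steps=4,
    bf16=True,
    deepspeed="ds_zero2.json",      # placeholder path; omitted for the DDP runs
)
```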

Some issues:

  • The token throughput is much lower than on 8*RTX3090, and DeepSpeed ZeRO-2 performed worse than DDP, no better than a single RTX4090.
  • I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2 (see the sketch after this list).
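On the second point, my understanding (an assumption, not something verified here) is that auto_find_batch_size halves the batch size on OOM rather than decrementing it, so a failing 4 drops straight to 2 and never tries 3. A minimal sketch of the workaround, with a placeholder output dir:

```python
from transformers import TrainingArguments

# Batch size 4 OOMs on 24G, and the automatic search then halves it to 2.
# Setting 3 explicitly and disabling the search keeps the larger batch.
args = TrainingArguments(
    output_dir="./tinyllama-zh",    # placeholder path
    per_device_train_batch_size=3,  # largest size that fits here
    auto_find_batch_size=False,     # don't let the OOM search override it
)
```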

Environment

deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2

Any ideas on how I can improve the throughput?
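For context, this is the shape of the ZeRO-2 config being benchmarked, written as the dict form that TrainingArguments(deepspeed=...) also accepts; the bucket sizes are illustrative defaults, not tuned values:

```python
# Sketch of a ZeRO-2 config; values are illustrative, not the exact ones used.
ds_config = {
    "train_micro_batch_size_per_gpu": 3,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,        # overlap gradient reduction with backward
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
}
```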
