
Question about the training time #14

@Hugo-cell111

Description


Hi! I noticed this information in the paper:

We train the 160M model on 1B-token datasets using a single NVIDIA A100
40GB GPU. For experiments with the 160M, 470M, and 1B models on 10B and 50B-token datasets,
we utilize 8 NVIDIA A100 40GB GPUs. All data scoring steps, including proxy data annotation,
scorer training, and data scoring, are performed on a single NVIDIA A100 80GB GPU.

In the "Annotate proxy data" step, the PMP-Solver is trained. How long does this step take on an H800 or A100 80GB GPU? Can I speed it up using multi-GPU parallelism with the DeepSpeed framework? Thanks!
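For reference, if the repo's training script accepts a `--deepspeed` flag, data-parallel training across GPUs is typically enabled by launching with `deepspeed --num_gpus=8 <train_script>.py --deepspeed ds_config.json` (the script name here is a placeholder, not from the repo) together with a config along these lines. This is only a sketch of a common DeepSpeed setup, not the authors' actual configuration; the batch sizes and ZeRO stage are assumptions to adjust for your hardware:

```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}
```

Note that `train_batch_size` must equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × number of GPUs, so the per-GPU settings need updating if you change the GPU count.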
