
Question about the training time #14

@Hugo-cell111

Description


Hi! I noticed this information in the paper:

We train the 160M model on 1B-token datasets using a single NVIDIA A100
40GB GPU. For experiments with the 160M, 470M, and 1B models on 10B and 50B-token datasets,
we utilize 8 NVIDIA A100 40GB GPUs. All data scoring steps, including proxy data annotation,
scorer training, and data scoring, are performed on a single NVIDIA A100 80GB GPU.

In the "Annotate proxy data" step, the PMP-Solver is trained. How long does this step take on an H800 or A100 80GB GPU? Can I speed it up using multi-GPU parallelism with the DeepSpeed framework? Thanks!
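For reference, if the repo's training script accepts a `--deepspeed` flag, data-parallel training across GPUs is typically enabled by launching with `deepspeed --num_gpus=8 <train_script>.py --deepspeed ds_config.json` (the script name here is a placeholder, not from the repo) together with a config along these lines. This is only a sketch of a common DeepSpeed setup, not the authors' actual configuration; the batch sizes and ZeRO stage are assumptions to adjust for your hardware:

```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}
```

Note that `train_batch_size` must equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × number of GPUs, so the per-GPU settings need updating if you change the GPU count.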
