## Running NVIDIA Large Language Model Llama 3.1 8B PyTorch MLPerf Benchmark

This file contains the instructions for running the NVIDIA Large Language Model Llama 3.1 8B PyTorch MLPerf Benchmark on NVIDIA hardware.

## 1. Hardware Requirements

- At least 100GB disk space is required.
- NVIDIA GPU with at least 80GB memory is strongly recommended.
- GPUs are not required for dataset preparation.
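
A quick, purely illustrative sanity check of these requirements (the commands and paths are assumptions, adjust for your site):

```bash
# Illustrative only: confirm free disk space and per-GPU memory.
df -h .                                                 # expect >= 100GB free where the dataset will live
nvidia-smi --query-gpu=name,memory.total --format=csv   # expect >= 80GB per GPU
```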

## 2. Container setup

Replace `<docker/registry>` with your container registry and build:

```bash
docker build -t <docker/registry>/mlperf-nvidia:llama31_8b-pyt .
# optionally: docker push <docker/registry>/mlperf-nvidia:llama31_8b-pyt
export CONT=<docker/registry>/mlperf-nvidia:llama31_8b-pyt
```

Make sure that the container is accessible on your Slurm system.
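
How you make the image reachable from Slurm is site-specific. On clusters that use enroot/pyxis (common for these benchmarks), a minimal sketch, assuming enroot is installed and the output path is writable:

```bash
# Assumption: enroot is available; '#' separates the registry host from the image path.
enroot import -o mlperf-nvidia+llama31_8b-pyt.sqsh \
    "docker://<registry-host>#<path>/mlperf-nvidia:llama31_8b-pyt"
```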
## 3. Dataset and model preparation

Set the target directory for the dataset:

```bash
export DATADIR=<path/to/dataset>
```

To download the dataset and align the directories with the layout the benchmark expects, run:

```bash
bash data_scripts/download_8b.sh
```

At the end, the directory structure should look like:

```bash
$ tree 8b/
8b/
|-- LICENSE.txt
|-- NOTICE.txt
|-- c4-train.en_6_text_document.bin
|-- c4-train.en_6_text_document.idx
|-- c4-train.en_7_text_document.bin
|-- c4-train.en_7_text_document.idx
|-- c4-validation-91205-samples.en_text_document.bin
|-- c4-validation-91205-samples.en_text_document.idx
|-- llama-3-1-8b-preprocessed-c4-dataset.md5
`-- tokenizer
|-- LICENSE
|-- README.md
|-- USE_POLICY.md
|-- llama-3-1-8b-tokenizer.md5
|-- special_tokens_map.json
|-- tokenizer.json
|-- tokenizer.model
|-- tokenizer.model.v1
`-- tokenizer_config.json

2 directories, 14 files
```
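
The `.md5` manifests in the listing can be used to verify the download. A minimal sketch, assuming they follow the standard `md5sum` "checksum  filename" format:

```bash
# Assumption: the manifests are in standard md5sum format.
cd ${DATADIR}/8b
md5sum -c llama-3-1-8b-preprocessed-c4-dataset.md5
(cd tokenizer && md5sum -c llama-3-1-8b-tokenizer.md5)
```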

### 3.3 Model and checkpoint preparation

#### 3.3.2 List of Layers

The model largely follows the architecture described in [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783).

#### 3.3.3 Model checkpoint

Llama 3.1 8B is trained from scratch and does not use a checkpoint.

## 4. Launch training

Navigate to the directory where `run.sub` is stored.

The launch command structure:
```bash
export DATADIR="<desired/dataset/path>"
export LOAD_CHECKPOINTS_PATH="</path/to/your/downloaded/checkpoint>"
export LOAD_CHECKPOINT="/load_checkpoints/405b"
export LOGDIR="</path/to/output/dir>" # set the place where the output logs will be saved
export CONT=$CONT
source config_GB200_128x4x144xtp4pp8cp2_cg.sh # select config and source it
export LOGDIR=</path/to/output/dir> # set the place where the output logs will be saved
export DATADIR=<as/set/above>
export CONT=<as/set/above>
source config_GB200_2x4x2xtp1pp1cp1_8b.sh # select config and source it
sbatch -N ${DGXNNODES} --time=${WALLTIME} run.sub # you may be required to set --account and --partition here
```
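
For example, on a cluster that enforces accounting (the flag values are placeholders for your site):

```bash
sbatch --account=<account> --partition=<partition> -N ${DGXNNODES} --time=${WALLTIME} run.sub
```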


All configuration files follow the format `config_<SYSTEM_NAME>_<NODES>x<GPUS/NODE>x<BATCH/GPU>xtpXppYcpZ.sh` (plus an optional model suffix such as `_8b`), where X is the tensor-parallel size, Y is the pipeline-parallel size, and Z is the context-parallel size.
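
For example, the config sourced above decodes as follows; `DGXNNODES` and `WALLTIME` are the same variables the `sbatch` line consumes:

```bash
# config_GB200_2x4x2xtp1pp1cp1_8b.sh
#   GB200     -> system name
#   2x4x2     -> 2 nodes x 4 GPUs/node x batch size 2 per GPU
#   tp1pp1cp1 -> tensor parallel = 1, pipeline parallel = 1, context parallel = 1
source config_GB200_2x4x2xtp1pp1cp1_8b.sh
echo "nodes=${DGXNNODES}, walltime=${WALLTIME}"
```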

## 5. Quality
### Quality metric
Log Perplexity

### Quality target
3.3

### Evaluation frequency
Evaluate after every 12,288 sequences (=100M tokens)

### Evaluation thoroughness
Evaluation on the validation subset that consists of 1,024 sequences (=8.4M tokens).
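
Both token counts are consistent with an 8,192-token sequence length (an inference from the figures above, not stated in this file):

```bash
echo $(( 12288 * 8192 ))   # 100663296 ≈ 100M tokens per evaluation interval
echo $(( 1024 * 8192 ))    # 8388608   ≈ 8.4M tokens in the validation subset
```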


## 6. Additional notes