From 00f8b90679bf67304b2e62d65afe4691f9ae2437 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Marcinkiewicz?= <43240942+mmarcinkiewicz@users.noreply.github.com>
Date: Mon, 12 Jan 2026 23:55:47 +0100
Subject: [PATCH] Fix llama3.1 8b readme

---
 .../llama31_8b/implementations/nemo/README.md | 80 +++++++------------
 1 file changed, 28 insertions(+), 52 deletions(-)

diff --git a/NVIDIA/benchmarks/llama31_8b/implementations/nemo/README.md b/NVIDIA/benchmarks/llama31_8b/implementations/nemo/README.md
index 72e97b80..bc33f6f6 100644
--- a/NVIDIA/benchmarks/llama31_8b/implementations/nemo/README.md
+++ b/NVIDIA/benchmarks/llama31_8b/implementations/nemo/README.md
@@ -1,10 +1,10 @@
-## Running NVIDIA Large Language Model Llama 3.1 405B PyTorch MLPerf Benchmark
+## Running NVIDIA Large Language Model Llama 3.1 8B PyTorch MLPerf Benchmark
 
-This file contains the instructions for running the NVIDIA Large Language Model Llama 3.1 405B PyTorch MLPerf Benchmark on NVIDIA hardware.
+This file contains the instructions for running the NVIDIA Large Language Model Llama 3.1 8B PyTorch MLPerf Benchmark on NVIDIA hardware.
 
 ## 1. Hardware Requirements
 
-- At least 2.5TB disk space is required.
+- At least 100GB disk space is required.
 - NVIDIA GPU with at least 80GB memory is strongly recommended.
 - GPUs are not required for dataset preparation.
 
@@ -20,9 +20,9 @@ This file contains the instructions for running the NVIDIA Large Language Model
 Replace `` with your container registry and build:
 
 ```bash
-docker build -t /mlperf-nvidia:llama31_405b-pyt .
-# optionally: docker push /mlperf-nvidia:llama31_405b-pyt
-export CONT=/mlperf-nvidia:llama31_405b-pyt
+docker build -t /mlperf-nvidia:llama31_8b-pyt .
+# optionally: docker push /mlperf-nvidia:llama31_8b-pyt
+export CONT=/mlperf-nvidia:llama31_8b-pyt
 ```
 make sure that container is accessible on your Slurm system.
 
@@ -37,28 +37,31 @@ export DATADIR=
 To download the dataset and align the directories with the layout the benchmark expects, run:
 
 ```bash
-bash data_scripts/download.sh
+bash data_scripts/download_8b.sh
 ```
 
-The final content under `${DATADIR}/405b` should be:
+At the end, the directory structure should look like:
 
-```
-$tree 405b
-405b
+```bash
+$tree 8b/
+8b/
+|-- LICENSE.txt
+|-- NOTICE.txt
 |-- c4-train.en_6_text_document.bin
 |-- c4-train.en_6_text_document.idx
-|-- c4-train.en_7_text_document.bin
-|-- c4-train.en_7_text_document.idx
 |-- c4-validation-91205-samples.en_text_document.bin
 |-- c4-validation-91205-samples.en_text_document.idx
+|-- llama-3-1-8b-preprocessed-c4-dataset.md5
 `-- tokenizer
+    |-- LICENSE
+    |-- README.md
+    |-- USE_POLICY.md
+    |-- llama-3-1-8b-tokenizer.md5
     |-- special_tokens_map.json
     |-- tokenizer.json
-    |-- tokenizer.model
-    |-- tokenizer.model.v1
     `-- tokenizer_config.json
 
-2 directories, 11 files
+2 directories, 14 files
 ```
 
 ### 3.3 Model and checkpoint preparation
@@ -69,34 +72,11 @@ $tree 405b
 
 #### 3.3.2 List of Layers
 
-The model largely follows the [Llama 3.1 405B paper](https://arxiv.org/abs/2407.21783). The only difference is that we replace the paper's TikTokenizer with the Mixtral 8x22b tokenizer in this benchmark. Please refer to the [Model details section](https://github.com/mlcommons/training/tree/master/large_language_model_pretraining/nemo#model-details) from the reference for more details.
+The model largely follows the paper titled [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783).
 
 #### 3.3.3 Model checkpoint
 
-In the benchmarking region, we resume training from Meta's official HuggingFace checkpoint. Please refer to the [instructions](https://github.com/mlcommons/training/tree/master/large_language_model_pretraining/nemo#checkpoint-download) from the reference to download the BF16 model checkpoint.
-
-**NOTE**: Before you proceed, make sure that your current working directory is able to hold >1.5TB of data.
-
-Assuming that you are running the download command under a given directory, with its location stored under `LOAD_CHECKPOINTS_PATH` environment variable. After the checkpoint is downloaded, you should be able to find a `405b` folder which holds a `context` and `weights` subfolder under the current directory:
-```
-
-└── 405b
-    ├── context
-    │   ├── nemo_tokenizer
-    │   │   ├── special_tokens_map.json
-    │   │   ├── tokenizer_config.json
-    │   │   └── tokenizer.json
-    │   ├── io.json
-    │   └── model.yaml
-    └── weights
-        ├── __0_0.distcp
-        ├── __0_1.distcp
-        ├── .metadata
-        ├── common.pt
-        └── metadata.json
-```
-
-By default, when we run the container, we will mount `LOAD_CHECKPOINTS_PATH` to `/load_checkpoints` in the container. Thus, you should set `export LOAD_CHECKPOINT="/load_checkpoints/405b"` to ensure that the `405b` folder is accessed in the container.
+Llama 3.1 8B is trained from scratch and does not use a checkpoint.
 
 ## 4. Launch training
 
@@ -106,17 +86,13 @@
 Navigate to the directory where `run.sub` is stored.
 
 The launch command structure:
 ```bash
-export DATADIR=""
-export LOAD_CHECKPOINTS_PATH=""
-export LOAD_CHECKPOINT="/load_checkpoints/405b"
-export LOGDIR="" # set the place where the output logs will be saved
-export CONT=$CONT
-source config_GB200_128x4x144xtp4pp8cp2_cg.sh # select config and source it
+export LOGDIR= # set the place where the output logs will be saved
+export DATADIR=
+export CONT=
+source config_GB200_2x4x2xtp1pp1cp1_8b.sh # select config and source it
 sbatch -N ${DGXNNODES} --time=${WALLTIME} run.sub # you may be required to set --account and --partition here
 ```
-Replace `` and `` with your paths set up in Section 3.
-
 All configuration files follow the format `config__xxxtpXppYcpZ.sh`, where X represents tensor parallel, Y represents pipeline parallel, and Z represents context parallel.
 
 # 5. Quality
 
@@ -125,13 +101,13 @@ All configuration files follow the format `config__x
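Reviewer note: the `tpXppYcpZ` token in the config-file names described in the README (X = tensor parallel, Y = pipeline parallel, Z = context parallel) can be decoded mechanically. The `parse_parallelism` helper below is a hypothetical sketch for illustration only, not a script shipped with the benchmark:

```bash
# Hypothetical helper: extract the parallelism settings encoded in a
# config file name such as config_GB200_2x4x2xtp1pp1cp1_8b.sh.
parse_parallelism() {
  local name="$1"
  local token tp pp cp
  # Pull out the tp<X>pp<Y>cp<Z> token from the file name.
  token=$(echo "$name" | grep -oE 'tp[0-9]+pp[0-9]+cp[0-9]+')
  # Split the token into its three parallelism degrees.
  tp=$(echo "$token" | sed -E 's/tp([0-9]+)pp[0-9]+cp[0-9]+/\1/')
  pp=$(echo "$token" | sed -E 's/tp[0-9]+pp([0-9]+)cp[0-9]+/\1/')
  cp=$(echo "$token" | sed -E 's/tp[0-9]+pp[0-9]+cp([0-9]+)/\1/')
  echo "tp=$tp pp=$pp cp=$cp"
}

parse_parallelism "config_GB200_2x4x2xtp1pp1cp1_8b.sh"
```

For the 8B config referenced by this patch the helper reports `tp=1 pp=1 cp=1`, i.e. no tensor, pipeline, or context parallelism.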