Conversation
Pull request overview
Adds DeepSeek V3 / V2-Lite SLURM training entrypoints (with Docker image pre-loading), introduces a C4 preparation helper, and updates Megatron configs + launch scripts to support new logging/comm/runtime behaviors.
Changes:
- Add `start_training_*` and `run_*` scripts for SLURM-based DeepSeek V3 and V2-Lite training.
- Update SLURM/local launcher scripts to derive `MASTER_ADDR` differently and adjust container mounts/cleanup behavior.
- Update MI355X DeepSeek Megatron YAML configs and tweak Megatron preparation/build steps.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| start_training_dsv3.sh | New SLURM allocation + Docker image preload + DeepSeek V3 training entrypoint. |
| start_training_dsv2_lite.sh | New SLURM allocation + Docker image preload + DeepSeek V2-Lite training entrypoint. |
| run_dsv3.sh | New DeepSeek V3 environment setup and invocation of run_slurm_pretrain.sh. |
| run_dsv2_lite.sh | New DeepSeek V2-Lite environment setup and invocation of run_slurm_pretrain.sh. |
| prepare_c4_data.sh | New script to merge/tokenize C4 shards into Megatron dataset format. |
| examples/run_slurm_pretrain.sh | Adjust master address derivation + hardcode socket interfaces for SLURM runs. |
| examples/run_pretrain.sh | Update AINIC-related env vars, library paths, and introduce LD_PRELOAD changes. |
| examples/run_local_pretrain.sh | Add optional host mount; change runtime preference order; add container stop behavior. |
| examples/megatron/prepare.py | Add --no-build-isolation to editable pip install for Emerging-Optimizers. |
| examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml | Enable W&B/TensorBoard flags, tweak batch/recompute settings, and adjust MoE grouped gemm setting. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml | Enable W&B/TensorBoard flags and toggle MoE/turbo-related settings. |
| examples/megatron/configs/MI355X/deepseek_v2-BF16-pretrain.yaml | Add a W&B disable flag (currently via an inconsistent key). |
Comments suppressed due to low confidence (3)

start_training_dsv3.sh:17
- Line 17 has a typo/extra whitespace in the user-facing message ("Allo cating..."). Please correct this to avoid confusing output/log searches.

```
echo "Allo cating ${NNODES} nodes and starting training..."
```

run_dsv3.sh:25
- There's a typo in the comment ("herer"). Please fix to keep the script self-explanatory.

```
export DOCKER_MOUNT_PATH=/shared # this is the mount path for the docker container, we put the data path herer
# export DATA_PATH=/shared/c4/data
```

examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml:62
- Typo in the inline comment ("disalbe"). Please correct to keep configs readable/searchable.

```
moe_use_legacy_grouped_gemm: false # need to disalbe legacy grouped gemm for dsv3 as it will hang the job
```
```
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE=john132/tas:primus-25.9-ainic-56
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
export MICRO_BATCH_SIZE=1
```

GLOBAL_BATCH_SIZE is computed from NNODES, but this script does not set a default or validation for NNODES. If run_dsv3.sh is executed directly, the arithmetic expansion will fail or produce an unexpected value. Consider setting NNODES=${NNODES:-1} (or erroring if unset) before using it.

```suggestion
export MICRO_BATCH_SIZE=1
export NNODES=${NNODES:-1}
```
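For reviewers trying the pattern locally, a minimal sketch of both options (the per-node factor of 8 GPUs is an assumption for illustration; the real scripts may compute GLOBAL_BATCH_SIZE differently):

```shell
#!/bin/bash
# Default MICRO_BATCH_SIZE so the arithmetic below is self-contained
MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE:-1}

# Option 1: default to a single node so direct invocation still works
NNODES=${NNODES:-1}

# Option 2 (alternative): fail fast instead of silently defaulting
# : "${NNODES:?NNODES must be set before computing GLOBAL_BATCH_SIZE}"

# Illustrative batch-size arithmetic; an empty NNODES would make this
# expansion fail or silently evaluate to 0 depending on the shell mode
GLOBAL_BATCH_SIZE=$((NNODES * 8 * MICRO_BATCH_SIZE))
echo "GLOBAL_BATCH_SIZE=${GLOBAL_BATCH_SIZE}"
```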
```
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
```

This script exports HF_TOKEN/WANDB_API_KEY as literal placeholder strings, which will overwrite any real credentials already set in the environment and can lead to accidental runs with invalid auth. Prefer the existing repo pattern of defaulting from the current env (e.g., HF_TOKEN=${HF_TOKEN:-...}) or validating they are set and failing fast.

```suggestion
if [ -z "${HF_TOKEN:-}" ]; then
  echo "Error: HF_TOKEN environment variable is not set." >&2
  exit 1
fi
if [ -z "${WANDB_API_KEY:-}" ]; then
  echo "Error: WANDB_API_KEY environment variable is not set." >&2
  exit 1
fi
export HF_TOKEN
export WANDB_API_KEY
```
```
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE=john132/tas:primus-25.9-ainic-56
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
```

GLOBAL_BATCH_SIZE is computed from NNODES, but this script does not set a default or validation for NNODES. If run_dsv2_lite.sh is executed directly, the arithmetic expansion will fail or produce an unexpected value. Consider setting NNODES=${NNODES:-1} (or erroring if unset) before using it.

```suggestion
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
NNODES=${NNODES:-1}
```
```
# Get IP address of master node from ens9np0 interface
MASTER_NODE=\${node_array[0]}
if [ \"\$SLURM_NODEID\" = \"0\" ]; then
    # We are on the master node, get IP directly
    MASTER_IP=\$(ip addr show ens9np0 | grep 'inet ' | awk '{print \$2}' | cut -d/ -f1)
else
    # Resolve master node IP via DNS (no SSH needed)
    MASTER_IP=\$(getent hosts \$MASTER_NODE | awk '{print \$1}')
fi
```

MASTER_ADDR is now derived from `ip addr show ens9np0` / `getent hosts`, but there is no validation that ens9np0 exists or that either command returns an IP. If MASTER_IP is empty, exporting MASTER_ADDR will break rendezvous. Consider making the interface configurable (or reusing the existing get_ip_interface helper) and failing fast when MASTER_IP cannot be determined, with a fallback to the hostname if appropriate.
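A hedged sketch of the configurable, fail-soft derivation (shown unescaped; inside the quoted sbatch block it would need the same `\$` escaping as above, and `IP_INTERFACE` is an assumed variable name, not something the script currently defines):

```shell
#!/bin/bash
# Make the interface configurable; ens9np0 is only the fallback default
IP_INTERFACE=${IP_INTERFACE:-ens9np0}
MASTER_NODE=${MASTER_NODE:-$(hostname)}

if [ "${SLURM_NODEID:-0}" = "0" ]; then
    # On the master node, read the IPv4 address directly from the interface
    MASTER_IP=$(ip -4 addr show "$IP_INTERFACE" 2>/dev/null \
        | awk '/inet /{print $2}' | cut -d/ -f1 | head -n1)
else
    # Elsewhere, resolve the master node via DNS
    MASTER_IP=$(getent hosts "$MASTER_NODE" | awk '{print $1}' | head -n1)
fi

# Fall back to the hostname (or exit) rather than exporting an empty address
if [ -z "$MASTER_IP" ]; then
    echo "Warning: could not determine IP via ${IP_INTERFACE}; using hostname" >&2
    MASTER_IP=$MASTER_NODE
fi
export MASTER_ADDR=$MASTER_IP
echo "MASTER_ADDR=${MASTER_ADDR}"
```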
```
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
```

This hard-codes GLOO_SOCKET_IFNAME/NCCL_SOCKET_IFNAME to ens9np0, which overrides the dynamic detection already implemented in examples/run_pretrain.sh and may break on systems that use a different interface name. Prefer only setting these if they are unset, or pass the intended interface via an env var (e.g., IP_INTERFACE) rather than forcing a specific device name.
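Setting the interfaces only when the caller has not already chosen one could look like this (reusing an `IP_INTERFACE` variable is an assumption, not part of the current scripts):

```shell
#!/bin/bash
# Respect pre-existing values; fall back to IP_INTERFACE, then ens9np0
export GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME:-${IP_INTERFACE:-ens9np0}}
export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-${IP_INTERFACE:-ens9np0}}
echo "GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
```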
```
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
```

This script exports HF_TOKEN/WANDB_API_KEY as literal placeholder strings, which will overwrite any real credentials already set in the environment and can lead to accidental runs with invalid auth. Prefer the existing repo pattern of defaulting from the current env (e.g., HF_TOKEN=${HF_TOKEN:-...}) or validating they are set and failing fast.

```suggestion
export HF_TOKEN="${HF_TOKEN:?HF_TOKEN environment variable must be set}"
export WANDB_API_KEY="${WANDB_API_KEY:?WANDB_API_KEY environment variable must be set}"
```
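The `${VAR:?message}` expansion used in the suggestion aborts with the message when the variable is unset or empty, so the script cannot proceed with placeholder credentials. A quick demonstration (the token value is a dummy):

```shell
#!/bin/bash
# When the variable is set, the expansion passes the value through unchanged
HF_TOKEN=hf_dummy_token
export HF_TOKEN="${HF_TOKEN:?HF_TOKEN environment variable must be set}"

# In a subshell, an unset variable triggers the error and a non-zero exit
if ( unset HF_TOKEN; : "${HF_TOKEN:?HF_TOKEN environment variable must be set}" ) 2>/dev/null; then
    RESULT="did not fail"
else
    RESULT="failed fast as expected"
fi
echo "$RESULT"
```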
```
echo "Node-${NODE_RANK}: Launching training container."
fi

docker stop $(docker ps -aq) || true
```

docker stop $(docker ps -aq) will stop all Docker containers on the node, including unrelated workloads, and bypasses the existing CLEAN_DOCKER_CONTAINER logic (and the docker_podman_proxy abstraction). This is highly disruptive in shared environments; please remove it or restrict cleanup to the specific training container(s) started by this script.

Suggested change (remove the line):

```suggestion
```
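A sketch of cleanup scoped to a single named container (`CONTAINER_NAME` is a hypothetical variable; in this repo the calls could equally go through the docker_podman_proxy abstraction):

```shell
#!/bin/bash
CONTAINER_NAME=${CONTAINER_NAME:-primus-training}

# Stop only the container this script is responsible for, if it is running;
# never touch unrelated containers on a shared node
if docker ps -q --filter "name=^${CONTAINER_NAME}$" 2>/dev/null | grep -q .; then
    docker stop "$CONTAINER_NAME" || true
fi
echo "cleanup scoped to ${CONTAINER_NAME}"
```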
```
#export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/install/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0
```

This sets LD_PRELOAD without preserving any existing LD_PRELOAD from the environment. Overwriting it can break other required preloads (profilers, sanitizers, etc.). Consider appending/prepending while keeping the current value (or only setting it if unset).

```suggestion
export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0${LD_PRELOAD:+:$LD_PRELOAD}
```
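The `${LD_PRELOAD:+:$LD_PRELOAD}` suffix expands to `:<old value>` only when `LD_PRELOAD` is already non-empty, so the existing value is preserved and no stray colon is added (library names below are placeholders):

```shell
#!/bin/bash
# With no prior value the suffix expands to nothing
unset LD_PRELOAD
FIRST="libnew.so${LD_PRELOAD:+:$LD_PRELOAD}"

# With a prior value it is preserved after a colon
LD_PRELOAD=libexisting.so
SECOND="libnew.so${LD_PRELOAD:+:$LD_PRELOAD}"

echo "$FIRST"   # libnew.so
echo "$SECOND"  # libnew.so:libexisting.so
```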
```
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
```

The header/usage comment says the default is to download/prepare 1 shard for testing, but the script currently defaults NUM_SHARDS to 200 and also states that download is skipped in favor of pre-downloaded shards. Please update the documentation and/or defaults so the described behavior matches what the script actually does (and so the default runtime/data volume is reasonable).
```
# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
```

The default PRIMUS_PATH is set to an absolute, user-specific location (/shared/john/Primus). This makes the script non-portable out of the box. Consider defaulting PRIMUS_PATH to the repo root relative to the script location (or requiring it to be provided explicitly).

```suggestion
# Determine repository root relative to this script if PRIMUS_PATH is not provided
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/.." && pwd)"
PRIMUS_PATH=${PRIMUS_PATH:-"${REPO_ROOT}"}
```
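The `BASH_SOURCE`-based derivation resolves the script's own directory regardless of the caller's working directory. A standalone check (with a `$0` fallback for contexts where `BASH_SOURCE` is empty):

```shell
#!/bin/bash
# Resolve the directory containing this script, independent of the caller's cwd
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]:-$0}")" && pwd)"
echo "script dir: $SCRIPT_DIR"
```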