Conversation
Pull request overview
Adds DeepSeek V3 / V2-Lite SLURM training entrypoints (with Docker image pre-loading), introduces a C4 preparation helper, and updates Megatron configs + launch scripts to support new logging/comm/runtime behaviors.
Changes:
- Add `start_training_*` and `run_*` scripts for SLURM-based DeepSeek V3 and V2-Lite training.
- Update SLURM/local launcher scripts to derive `MASTER_ADDR` differently and adjust container mounts/cleanup behavior.
- Update MI355X DeepSeek Megatron YAML configs and tweak Megatron preparation/build steps.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| start_training_dsv3.sh | New SLURM allocation + Docker image preload + DeepSeek V3 training entrypoint. |
| start_training_dsv2_lite.sh | New SLURM allocation + Docker image preload + DeepSeek V2-Lite training entrypoint. |
| run_dsv3.sh | New DeepSeek V3 environment setup and invocation of run_slurm_pretrain.sh. |
| run_dsv2_lite.sh | New DeepSeek V2-Lite environment setup and invocation of run_slurm_pretrain.sh. |
| prepare_c4_data.sh | New script to merge/tokenize C4 shards into Megatron dataset format. |
| examples/run_slurm_pretrain.sh | Adjust master address derivation + hardcode socket interfaces for SLURM runs. |
| examples/run_pretrain.sh | Update AINIC-related env vars, library paths, and introduce LD_PRELOAD changes. |
| examples/run_local_pretrain.sh | Add optional host mount; change runtime preference order; add container stop behavior. |
| examples/megatron/prepare.py | Add --no-build-isolation to editable pip install for Emerging-Optimizers. |
| examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml | Enable W&B/TensorBoard flags, tweak batch/recompute settings, and adjust MoE grouped gemm setting. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml | Enable W&B/TensorBoard flags and toggle MoE/turbo-related settings. |
| examples/megatron/configs/MI355X/deepseek_v2-BF16-pretrain.yaml | Add a W&B disable flag (currently via an inconsistent key). |
Comments suppressed due to low confidence (3)

start_training_dsv3.sh:17
- Line 17 has a typo/extra whitespace in the user-facing message ("Allo cating..."). Please correct this to avoid confusing output/log searches.

```
echo "Allo cating ${NNODES} nodes and starting training..."
```

run_dsv3.sh:25
- There's a typo in the comment ("herer"). Please fix to keep the script self-explanatory.

```
export DOCKER_MOUNT_PATH=/shared # this is the mount path for the docker container, we put the data path herer
# export DATA_PATH=/shared/c4/data
```

examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml:62
- Typo in the inline comment ("disalbe"). Please correct to keep configs readable/searchable.

```
moe_use_legacy_grouped_gemm: false # need to disalbe legacy grouped gemm for dsv3 as it will hang the job
```
```
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE=john132/tas:primus-25.9-ainic-56
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
export MICRO_BATCH_SIZE=1
```

GLOBAL_BATCH_SIZE is computed from NNODES, but this script does not set a default or validation for NNODES. If run_dsv3.sh is executed directly, the arithmetic expansion will fail or produce an unexpected value. Consider setting NNODES=${NNODES:-1} (or erroring if unset) before using it.

```suggestion
export MICRO_BATCH_SIZE=1
export NNODES=${NNODES:-1}
```
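For reviewers trying the pattern locally, a minimal sketch of both options (the per-node factor of 8 GPUs is an assumption for illustration; the real scripts may compute GLOBAL_BATCH_SIZE differently):

```shell
#!/bin/bash
# Default MICRO_BATCH_SIZE so the arithmetic below is self-contained
MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE:-1}

# Option 1: default to a single node so direct invocation still works
NNODES=${NNODES:-1}

# Option 2 (alternative): fail fast instead of silently defaulting
# : "${NNODES:?NNODES must be set before computing GLOBAL_BATCH_SIZE}"

# Illustrative batch-size arithmetic; an empty NNODES would make this
# expansion fail or silently evaluate to 0 depending on the shell mode
GLOBAL_BATCH_SIZE=$((NNODES * 8 * MICRO_BATCH_SIZE))
echo "GLOBAL_BATCH_SIZE=${GLOBAL_BATCH_SIZE}"
```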
```
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
```

This script exports HF_TOKEN/WANDB_API_KEY as literal placeholder strings, which will overwrite any real credentials already set in the environment and can lead to accidental runs with invalid auth. Prefer the existing repo pattern of defaulting from the current env (e.g., HF_TOKEN=${HF_TOKEN:-...}) or validating they are set and failing fast.

```suggestion
if [ -z "${HF_TOKEN:-}" ]; then
  echo "Error: HF_TOKEN environment variable is not set." >&2
  exit 1
fi
if [ -z "${WANDB_API_KEY:-}" ]; then
  echo "Error: WANDB_API_KEY environment variable is not set." >&2
  exit 1
fi
export HF_TOKEN
export WANDB_API_KEY
```
```
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE=john132/tas:primus-25.9-ainic-56
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
```

GLOBAL_BATCH_SIZE is computed from NNODES, but this script does not set a default or validation for NNODES. If run_dsv2_lite.sh is executed directly, the arithmetic expansion will fail or produce an unexpected value. Consider setting NNODES=${NNODES:-1} (or erroring if unset) before using it.

```suggestion
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
NNODES=${NNODES:-1}
```
```
# Get IP address of master node from ens9np0 interface
MASTER_NODE=\${node_array[0]}
if [ \"\$SLURM_NODEID\" = \"0\" ]; then
    # We are on the master node, get IP directly
    MASTER_IP=\$(ip addr show ens9np0 | grep 'inet ' | awk '{print \$2}' | cut -d/ -f1)
else
    # Resolve master node IP via DNS (no SSH needed)
    MASTER_IP=\$(getent hosts \$MASTER_NODE | awk '{print \$1}')
fi
```

MASTER_ADDR is now derived from `ip addr show ens9np0` / `getent hosts`, but there is no validation that ens9np0 exists or that either command returns an IP. If MASTER_IP is empty, exporting MASTER_ADDR will break rendezvous. Consider making the interface configurable (or reusing the existing get_ip_interface helper) and failing fast when MASTER_IP cannot be determined, with a fallback to the hostname if appropriate.
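A hedged sketch of the configurable, fail-soft derivation (shown unescaped; inside the quoted sbatch block it would need the same `\$` escaping as above, and `IP_INTERFACE` is an assumed variable name, not something the script currently defines):

```shell
#!/bin/bash
# Make the interface configurable; ens9np0 is only the fallback default
IP_INTERFACE=${IP_INTERFACE:-ens9np0}
MASTER_NODE=${MASTER_NODE:-$(hostname)}

if [ "${SLURM_NODEID:-0}" = "0" ]; then
    # On the master node, read the IPv4 address directly from the interface
    MASTER_IP=$(ip -4 addr show "$IP_INTERFACE" 2>/dev/null \
        | awk '/inet /{print $2}' | cut -d/ -f1 | head -n1)
else
    # Elsewhere, resolve the master node via DNS
    MASTER_IP=$(getent hosts "$MASTER_NODE" | awk '{print $1}' | head -n1)
fi

# Fall back to the hostname (or exit) rather than exporting an empty address
if [ -z "$MASTER_IP" ]; then
    echo "Warning: could not determine IP via ${IP_INTERFACE}; using hostname" >&2
    MASTER_IP=$MASTER_NODE
fi
export MASTER_ADDR=$MASTER_IP
echo "MASTER_ADDR=${MASTER_ADDR}"
```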
```
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
```

This hard-codes GLOO_SOCKET_IFNAME/NCCL_SOCKET_IFNAME to ens9np0, which overrides the dynamic detection already implemented in examples/run_pretrain.sh and may break on systems that use a different interface name. Prefer only setting these if they are unset, or pass the intended interface via an env var (e.g., IP_INTERFACE) rather than forcing a specific device name.
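Setting the interfaces only when the caller has not already chosen one could look like this (reusing an `IP_INTERFACE` variable is an assumption, not part of the current scripts):

```shell
#!/bin/bash
# Respect pre-existing values; fall back to IP_INTERFACE, then ens9np0
export GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME:-${IP_INTERFACE:-ens9np0}}
export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-${IP_INTERFACE:-ens9np0}}
echo "GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
```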
```
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
```

This script exports HF_TOKEN/WANDB_API_KEY as literal placeholder strings, which will overwrite any real credentials already set in the environment and can lead to accidental runs with invalid auth. Prefer the existing repo pattern of defaulting from the current env (e.g., HF_TOKEN=${HF_TOKEN:-...}) or validating they are set and failing fast.

```suggestion
export HF_TOKEN="${HF_TOKEN:?HF_TOKEN environment variable must be set}"
export WANDB_API_KEY="${WANDB_API_KEY:?WANDB_API_KEY environment variable must be set}"
```
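The `${VAR:?message}` expansion used in the suggestion aborts with the message when the variable is unset or empty, so the script cannot proceed with placeholder credentials. A quick demonstration (the token value is a dummy):

```shell
#!/bin/bash
# When the variable is set, the expansion passes the value through unchanged
HF_TOKEN=hf_dummy_token
export HF_TOKEN="${HF_TOKEN:?HF_TOKEN environment variable must be set}"

# In a subshell, an unset variable triggers the error and a non-zero exit
if ( unset HF_TOKEN; : "${HF_TOKEN:?HF_TOKEN environment variable must be set}" ) 2>/dev/null; then
    RESULT="did not fail"
else
    RESULT="failed fast as expected"
fi
echo "$RESULT"
```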
```
echo "Node-${NODE_RANK}: Launching training container."
fi

docker stop $(docker ps -aq) || true
```

docker stop $(docker ps -aq) will stop all Docker containers on the node, including unrelated workloads, and bypasses the existing CLEAN_DOCKER_CONTAINER logic (and the docker_podman_proxy abstraction). This is highly disruptive in shared environments; please remove it or restrict cleanup to the specific training container(s) started by this script.

Suggested change (remove the line):

```suggestion
```
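A sketch of cleanup scoped to a single named container (`CONTAINER_NAME` is a hypothetical variable; in this repo the calls could equally go through the docker_podman_proxy abstraction):

```shell
#!/bin/bash
CONTAINER_NAME=${CONTAINER_NAME:-primus-training}

# Stop only the container this script is responsible for, if it is running;
# never touch unrelated containers on a shared node
if docker ps -q --filter "name=^${CONTAINER_NAME}$" 2>/dev/null | grep -q .; then
    docker stop "$CONTAINER_NAME" || true
fi
echo "cleanup scoped to ${CONTAINER_NAME}"
```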
```
#export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/install/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0
```

This sets LD_PRELOAD without preserving any existing LD_PRELOAD from the environment. Overwriting it can break other required preloads (profilers, sanitizers, etc.). Consider appending/prepending while keeping the current value (or only setting it if unset).

```suggestion
export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0${LD_PRELOAD:+:$LD_PRELOAD}
```
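The `${LD_PRELOAD:+:$LD_PRELOAD}` suffix expands to `:<old value>` only when `LD_PRELOAD` is already non-empty, so the existing value is preserved and no stray colon is added (library names below are placeholders):

```shell
#!/bin/bash
# With no prior value the suffix expands to nothing
unset LD_PRELOAD
FIRST="libnew.so${LD_PRELOAD:+:$LD_PRELOAD}"

# With a prior value it is preserved after a colon
LD_PRELOAD=libexisting.so
SECOND="libnew.so${LD_PRELOAD:+:$LD_PRELOAD}"

echo "$FIRST"   # libnew.so
echo "$SECOND"  # libnew.so:libexisting.so
```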
```
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
```

The header/usage comment says the default is to download/prepare 1 shard for testing, but the script currently defaults NUM_SHARDS to 200 and also states that download is skipped in favor of pre-downloaded shards. Please update the documentation and/or defaults so the described behavior matches what the script actually does (and so the default runtime/data volume is reasonable).
```
# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
```

The default PRIMUS_PATH is set to an absolute, user-specific location (/shared/john/Primus). This makes the script non-portable out of the box. Consider defaulting PRIMUS_PATH to the repo root relative to the script location (or requiring it to be provided explicitly).

```suggestion
# Determine repository root relative to this script if PRIMUS_PATH is not provided
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/.." && pwd)"
PRIMUS_PATH=${PRIMUS_PATH:-"${REPO_ROOT}"}
```
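The `BASH_SOURCE`-based derivation resolves the script's own directory regardless of the caller's working directory. A standalone check (with a `$0` fallback for contexts where `BASH_SOURCE` is empty):

```shell
#!/bin/bash
# Resolve the directory containing this script, independent of the caller's cwd
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]:-$0}")" && pwd)"
echo "script dir: $SCRIPT_DIR"
```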