
[Draft] Dev/355 test #569

Draft
wenxie-amd wants to merge 2 commits into main from dev/355_test

Conversation

@wenxie-amd
Contributor

No description provided.

Copilot AI review requested due to automatic review settings February 27, 2026 09:01

Copilot AI left a comment


Pull request overview

Adds DeepSeek V3 / V2-Lite SLURM training entrypoints (with Docker image pre-loading), introduces a C4 preparation helper, and updates Megatron configs + launch scripts to support new logging/comm/runtime behaviors.

Changes:

  • Add start_training_* and run_* scripts for SLURM-based DeepSeek V3 and V2-Lite training.
  • Update SLURM/local launcher scripts to derive MASTER_ADDR differently and adjust container mounts/cleanup behavior.
  • Update MI355X DeepSeek Megatron YAML configs and tweak Megatron preparation/build steps.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.

Summary per file:

  • start_training_dsv3.sh: New SLURM allocation + Docker image preload + DeepSeek V3 training entrypoint.
  • start_training_dsv2_lite.sh: New SLURM allocation + Docker image preload + DeepSeek V2-Lite training entrypoint.
  • run_dsv3.sh: New DeepSeek V3 environment setup and invocation of run_slurm_pretrain.sh.
  • run_dsv2_lite.sh: New DeepSeek V2-Lite environment setup and invocation of run_slurm_pretrain.sh.
  • prepare_c4_data.sh: New script to merge/tokenize C4 shards into Megatron dataset format.
  • examples/run_slurm_pretrain.sh: Adjust master address derivation + hardcode socket interfaces for SLURM runs.
  • examples/run_pretrain.sh: Update AINIC-related env vars, library paths, and introduce LD_PRELOAD changes.
  • examples/run_local_pretrain.sh: Add optional host mount; change runtime preference order; add container stop behavior.
  • examples/megatron/prepare.py: Add --no-build-isolation to editable pip install for Emerging-Optimizers.
  • examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml: Enable W&B/TensorBoard flags, tweak batch/recompute settings, and adjust MoE grouped gemm setting.
  • examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml: Enable W&B/TensorBoard flags and toggle MoE/turbo-related settings.
  • examples/megatron/configs/MI355X/deepseek_v2-BF16-pretrain.yaml: Add a W&B disable flag (currently via an inconsistent key).
Comments suppressed due to low confidence (3)

start_training_dsv3.sh:17

  • Line 17 has a typo/extra whitespace in the user-facing message ("Allo cating..."). Please correct this to avoid confusing output/log searches.
echo "Allo  cating ${NNODES} nodes and starting training..."

run_dsv3.sh:25

  • There’s a typo in the comment ("herer"). Please fix to keep the script self-explanatory.
export DOCKER_MOUNT_PATH=/shared # this is the mount path for the docker container, we put the data path herer 
# export DATA_PATH=/shared/c4/data 

examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml:62

  • Typo in the inline comment ("disalbe"). Please correct to keep configs readable/searchable.
      moe_use_legacy_grouped_gemm: false # need to disalbe  legacy grouped gemm for dsv3 as it will hang the job

export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE=john132/tas:primus-25.9-ainic-56
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
export MICRO_BATCH_SIZE=1

Copilot AI Feb 27, 2026


GLOBAL_BATCH_SIZE is computed from NNODES, but this script does not set a default/validation for NNODES. If run_dsv3.sh is executed directly, the arithmetic expansion will fail or produce an unexpected value. Consider setting NNODES=${NNODES:-1} (or erroring if unset) before using it.

Suggested change:
- export MICRO_BATCH_SIZE=1
+ export MICRO_BATCH_SIZE=1
+ export NNODES=${NNODES:-1}
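The effect of the suggested guard can be shown in isolation. This is a minimal sketch, not the PR's actual derivation: the multiplier 8 (GPUs per node) and the use of MICRO_BATCH_SIZE in the formula are illustrative assumptions.

```shell
#!/bin/sh
# Default NNODES to 1 when unset so the arithmetic below cannot
# fail or silently evaluate an empty string as zero.
NNODES=${NNODES:-1}
MICRO_BATCH_SIZE=1

# Hypothetical global batch derivation: 8 GPUs per node.
GLOBAL_BATCH_SIZE=$(( NNODES * 8 * MICRO_BATCH_SIZE ))
echo "$GLOBAL_BATCH_SIZE"
```

Run without NNODES in the environment this prints 8; launched via SLURM with NNODES=4 exported, it prints 32 instead.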

Comment on lines +2 to +3
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key

Copilot AI Feb 27, 2026


This script exports HF_TOKEN/WANDB_API_KEY as literal placeholder strings, which will overwrite any real credentials already set in the environment and can lead to accidental runs with invalid auth. Prefer the existing repo pattern of defaulting from the current env (e.g., HF_TOKEN=${HF_TOKEN:-...}) or validating they are set and failing fast.

Suggested change:
- export HF_TOKEN="your_hf_token" # make it your own hf token
- export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
+ if [ -z "${HF_TOKEN:-}" ]; then
+   echo "Error: HF_TOKEN environment variable is not set." >&2
+   exit 1
+ fi
+ if [ -z "${WANDB_API_KEY:-}" ]; then
+   echo "Error: WANDB_API_KEY environment variable is not set." >&2
+   exit 1
+ fi
+ export HF_TOKEN
+ export WANDB_API_KEY

export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE=john132/tas:primus-25.9-ainic-56
export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9

Copilot AI Feb 27, 2026


GLOBAL_BATCH_SIZE is computed from NNODES, but this script does not set a default/validation for NNODES. If run_dsv2_lite.sh is executed directly, the arithmetic expansion will fail or produce an unexpected value. Consider setting NNODES=${NNODES:-1} (or erroring if unset) before using it.

Suggested change:
- export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
+ export NCCL_IB_HCA=ionic_0,ionic_2,ionic_3,ionic_4,ionic_5,ionic_7,ionic_8,ionic_9
+ NNODES=${NNODES:-1}

Comment on lines +48 to +56
# Get IP address of master node from ens9np0 interface
MASTER_NODE=\${node_array[0]}
if [ \"\$SLURM_NODEID\" = \"0\" ]; then
# We are on the master node, get IP directly
MASTER_IP=\$(ip addr show ens9np0 | grep 'inet ' | awk '{print \$2}' | cut -d/ -f1)
else
# Resolve master node IP via DNS (no SSH needed)
MASTER_IP=\$(getent hosts \$MASTER_NODE | awk '{print \$1}')
fi

Copilot AI Feb 27, 2026


MASTER_ADDR is now derived from ip addr show ens9np0 / getent hosts, but there is no validation that ens9np0 exists or that either command returns an IP. If MASTER_IP is empty, exporting MASTER_ADDR will break rendezvous. Consider making the interface configurable (or reusing the existing get_ip_interface helper) and failing fast when MASTER_IP cannot be determined, with a fallback to the hostname if appropriate.

Comment on lines +68 to +69
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0

Copilot AI Feb 27, 2026


This hard-codes GLOO_SOCKET_IFNAME/NCCL_SOCKET_IFNAME to ens9np0, which overrides the dynamic detection already implemented in examples/run_pretrain.sh and may break on systems that use a different interface name. Prefer only setting these if they are unset, or pass the intended interface via an env var (e.g., IP_INTERFACE) rather than forcing a specific device name.
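The "only set if unset" pattern the comment suggests can be written with the `:=` parameter expansion; a minimal sketch (the default interface name is carried over from the PR, but whether ens9np0 is the right fallback is an assumption):

```shell
#!/bin/sh
# Respect an interface already chosen by the caller or by the dynamic
# detection in run_pretrain.sh; only fall back to a default when unset.
: "${GLOO_SOCKET_IFNAME:=ens9np0}"
: "${NCCL_SOCKET_IFNAME:=ens9np0}"
export GLOO_SOCKET_IFNAME NCCL_SOCKET_IFNAME

echo "$NCCL_SOCKET_IFNAME"
```

Launched with NCCL_SOCKET_IFNAME=eth0 in the environment, the script prints eth0 and leaves the caller's choice intact; with nothing set, it prints ens9np0.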

Comment on lines +2 to +3
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key

Copilot AI Feb 27, 2026


This script exports HF_TOKEN/WANDB_API_KEY as literal placeholder strings, which will overwrite any real credentials already set in the environment and can lead to accidental runs with invalid auth. Prefer the existing repo pattern of defaulting from the current env (e.g., HF_TOKEN=${HF_TOKEN:-...}) or validating they are set and failing fast.

Suggested change:
- export HF_TOKEN="your_hf_token" # make it your own hf token
- export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
+ export HF_TOKEN="${HF_TOKEN:?HF_TOKEN environment variable must be set}" # make it your own hf token
+ export WANDB_API_KEY="${WANDB_API_KEY:?WANDB_API_KEY environment variable must be set}" # make it your own wandb api key

echo "Node-${NODE_RANK}: Launching training container."
fi

docker stop $(docker ps -aq) || true

Copilot AI Feb 27, 2026


docker stop $(docker ps -aq) will stop all Docker containers on the node, including unrelated workloads, and bypasses the existing CLEAN_DOCKER_CONTAINER logic (and the docker_podman_proxy abstraction). This is highly disruptive in shared environments; please remove it or restrict cleanup to the specific training container(s) started by this script.

Suggested change (remove the line):
- docker stop $(docker ps -aq) || true
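If some cleanup is still wanted, one option is to stop only containers matching a known name instead of everything on the node. This is a sketch under stated assumptions: CONTAINER_NAME and its default value are illustrative, not names used by the PR's launcher.

```shell
#!/bin/sh
# Stop only the training container started by this script, identified
# by name, rather than every container on a shared node.
CONTAINER_NAME=${CONTAINER_NAME:-primus_training}

# `docker ps -q --filter name=...` returns IDs only for matching
# running containers; when none match, cleanup is a no-op.
ids=$(docker ps -q --filter "name=${CONTAINER_NAME}" 2>/dev/null)
if [ -n "$ids" ]; then
    docker stop $ids
fi
```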


#export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/install/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0

Copilot AI Feb 27, 2026


This sets LD_PRELOAD without preserving any existing LD_PRELOAD from the environment. Overwriting it can break other required preloads (profilers, sanitizers, etc.). Consider appending/prepending while keeping the current value (or only setting it if unset).

Suggested change:
- export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0
+ export LD_PRELOAD=${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0${LD_PRELOAD:+:$LD_PRELOAD}
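The `${LD_PRELOAD:+:$LD_PRELOAD}` expansion in the suggestion appends the previous value only when one exists, avoiding a dangling colon when it does not. A minimal demonstration (the .so paths are placeholders; the variable is not exported here, so nothing is actually preloaded):

```shell
#!/bin/sh
# With no prior LD_PRELOAD, the ${VAR:+...} form contributes nothing,
# so the result has no trailing colon.
unset LD_PRELOAD
LD_PRELOAD="/opt/new.so${LD_PRELOAD:+:$LD_PRELOAD}"
echo "$LD_PRELOAD"    # /opt/new.so

# With a prior value, it is preserved after a separating colon.
LD_PRELOAD="/opt/old.so"
LD_PRELOAD="/opt/new.so${LD_PRELOAD:+:$LD_PRELOAD}"
echo "$LD_PRELOAD"    # /opt/new.so:/opt/old.so
```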

Comment on lines +16 to +24
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}

Copilot AI Feb 27, 2026


The header/usage comment says the default is to download/prepare 1 shard for testing, but the script currently defaults NUM_SHARDS to 200 and also states that download is skipped in favor of pre-downloaded shards. Please update the documentation and/or defaults so the described behavior matches what the script actually does (and so the default runtime/data volume is reasonable).

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}

Copilot AI Feb 27, 2026


The default PRIMUS_PATH is set to an absolute, user-specific location (/shared/john/Primus). This makes the script non-portable out of the box. Consider defaulting PRIMUS_PATH to the repo root relative to the script location (or requiring it to be provided explicitly).

Suggested change:
- PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
+ # Determine repository root relative to this script if PRIMUS_PATH is not provided
+ SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+ REPO_ROOT="$(cd -- "${SCRIPT_DIR}/.." && pwd)"
+ PRIMUS_PATH=${PRIMUS_PATH:-"${REPO_ROOT}"}
