376 changes: 188 additions & 188 deletions .github/workflows/ci.yaml

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -40,7 +40,7 @@ modules:

# parallel
tensor_model_parallel_size: ${PRIMUS_TP:1}
pipeline_model_parallel_size: ${PRIMUS_PP:1}
pipeline_model_parallel_size: ${PRIMUS_PP:8}
Copilot AI Mar 3, 2026
This config changes the default pipeline_model_parallel_size from the common default of 1 to 8, which is inconsistent with the MI355X DeepSeek-V3 FP8 config and most other configs. To avoid surprising users (and breaking single-node runs), consider keeping the default at 1 and relying on PRIMUS_PP to opt into PP>1.

Suggested change
pipeline_model_parallel_size: ${PRIMUS_PP:8}
pipeline_model_parallel_size: ${PRIMUS_PP:1}

expert_model_parallel_size: ${PRIMUS_EP:8}
overlap_grad_reduce: true
overlap_param_gather: true
@@ -71,6 +71,24 @@ modules:
ckpt_format: torch
eval_iters: 0

# Turbo
enable_primus_turbo: true
use_turbo_attention: false
use_turbo_grouped_mlp: false

# deepep
use_turbo_deepep: true
moe_shared_expert_overlap: false
moe_router_dtype: fp32

# best practice: 64 or 80 for EP=8; 32 for EP=16-64
turbo_deepep_num_cu: 64
turbo_deepep_use_comm_stream: false

# sync-free MoE supports stages 1-2; 0 disables sync-free MoE
# stage 2 is recommended for better performance
turbo_sync_free_moe_stage: 1

# Cross entropy flags
# cross_entropy_fusion_impl: "te"
# cross_entropy_loss_fusion: true
4 changes: 3 additions & 1 deletion examples/megatron/prepare.py
@@ -264,7 +264,9 @@ def build_megatron_helper(primus_path: Path, patch_args: Path, backend_path: str

emerging_optimizers_path = primus_path / "third_party/Emerging-Optimizers"
log_info(f"Building Emerging Optimizers in {emerging_optimizers_path}")
ret = subprocess.run(["pip", "install", "-e", str(emerging_optimizers_path)], check=True)
ret = subprocess.run(
["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True
)
if ret.returncode != 0:
log_error_and_exit("Building Emerging Optimizers failed.")
Comment on lines +267 to 271
Copilot AI Mar 3, 2026
subprocess.run(..., check=True) will raise an exception on failure, so the subsequent if ret.returncode != 0: branch is redundant/unreachable. Either remove the returncode check, or set check=False and keep the explicit returncode handling (and ideally invoke pip via sys.executable -m pip to ensure the active Python environment is used).
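The reviewer's second option can be sketched as a small helper; this is a hypothetical illustration, with `log_error_and_exit` standing in for the repo's helper of the same name (its body here is an assumed substitute), and pip invoked through `sys.executable` so the active environment's pip is used:

```python
import subprocess
import sys


def log_error_and_exit(msg):
    # Stand-in for the repo's logging helper (assumed behavior for this sketch).
    print(msg, file=sys.stderr)
    raise SystemExit(1)


def run_or_exit(cmd, err_msg):
    # check=False keeps the explicit returncode branch reachable,
    # instead of letting subprocess.run raise CalledProcessError first.
    ret = subprocess.run(cmd, check=False)
    if ret.returncode != 0:
        log_error_and_exit(err_msg)


# Usage for the editable install, via the active interpreter's pip:
# run_or_exit(
#     [sys.executable, "-m", "pip", "install", "--no-build-isolation", "-e", str(path)],
#     "Building Emerging Optimizers failed.",
# )
```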


15 changes: 11 additions & 4 deletions examples/run_local_pretrain.sh
@@ -113,7 +113,7 @@ echo "ENV_ARGS: ${ENV_ARGS[*]}"
HOSTNAME=$(hostname)
ARGS=("$@")

VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
Copilot AI Mar 3, 2026
This hard-codes an extra bind mount to /shared_aig/c4, which will break (or unintentionally create an empty host directory) on systems that don’t have that path. Consider making this conditional/configurable via an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_EXTRA_MOUNTS) rather than always mounting a cluster-specific path.

Suggested change
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
# Optional extra volume mounts: set PRIMUS_EXTRA_MOUNTS to a string like:
# '-v /shared_aig/c4:/shared_aig/c4 -v /other/path:/other/path:ro'
if [[ -n "${PRIMUS_EXTRA_MOUNTS:-}" ]]; then
# Intentional word splitting to allow multiple -v arguments.
VOLUME_ARGS+=(${PRIMUS_EXTRA_MOUNTS})
elif [[ -d "/shared_aig/c4" ]]; then
# Backwards-compatible default: only mount /shared_aig/c4 if it exists.
VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
fi

if [[ -f "$PATH_TO_BNXT_TAR_PACKAGE" ]]; then
VOLUME_ARGS+=(-v "$PATH_TO_BNXT_TAR_PACKAGE":"$PATH_TO_BNXT_TAR_PACKAGE")
fi
@@ -134,10 +134,10 @@ export CLEAN_DOCKER_CONTAINER=${CLEAN_DOCKER_CONTAINER:-0}

# ------------------ Optional Container Cleanup ------------------
docker_podman_proxy() {
if command -v podman &>/dev/null; then
podman "$@"
elif command -v docker &>/dev/null; then
if command -v docker &>/dev/null; then
docker "$@"
elif command -v podman &>/dev/null; then
podman "$@"
else
echo "Neither Docker nor Podman found!" >&2
return 1
@@ -164,6 +164,13 @@ else
echo "Node-${NODE_RANK}: Launching training container."
fi

if ! docker_podman_proxy image inspect "$DOCKER_IMAGE" &>/dev/null; then
echo "Node-${NODE_RANK}: Image not found locally, pulling $DOCKER_IMAGE..."
docker_podman_proxy pull "$DOCKER_IMAGE"
else
echo "Node-${NODE_RANK}: Image $DOCKER_IMAGE already exists, skipping pull."
fi

# ------------------ Launch Training Container ------------------
docker_podman_proxy run --rm \
--env MASTER_ADDR \
6 changes: 4 additions & 2 deletions examples/run_pretrain.sh
@@ -197,8 +197,10 @@ if [ "$USING_AINIC" == "1" ]; then
export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_MAX_P2P_CHANNELS=56
export NCCL_IB_TC=104
export NCCL_IB_FIFO_TC=192
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
export NET_OPTIONAL_RECV_COMPLETION=1
export NCCL_IB_USE_INLINE=1
export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
4 changes: 4 additions & 0 deletions examples/run_slurm_pretrain.sh
@@ -38,10 +38,14 @@ export LOG_DIR=${LOG_DIR:-"./output"}
LOG_FILE="${LOG_DIR}/log_slurm_pretrain.txt"
mkdir -p "$LOG_DIR"

# --nodelist="uswslocpm2m-106-[273,297,310,319,687,732,836,892]" \
srun -N "${NNODES}" \
--exclusive \
--export ALL \
--ntasks-per-node=1 \
--time="${SLURM_TIME:-07:00:00}" \
--nodelist="${SLURM_NODELIST:-}" \
--partition="${SLURM_PARTITION:-amd-aig}" \
Comment on lines +46 to +48
Copilot AI Mar 3, 2026
Passing --nodelist with an empty value (when SLURM_NODELIST is unset) causes srun to fail with an invalid nodelist. Consider only adding --nodelist when SLURM_NODELIST is non-empty (or default to omitting the flag entirely).

--cpus-per-task="${CPUS_PER_TASK:-128}" \
bash -c "
readarray -t node_array < <(scontrol show hostnames \"\$SLURM_JOB_NODELIST\")
144 changes: 144 additions & 0 deletions prepare_c4_data.sh
@@ -0,0 +1,144 @@
#!/bin/bash
###############################################################################
# Prepare C4 English dataset for Megatron training with DeepSeek V3
#
# This script:
# 1. Downloads C4-en data from HuggingFace (configurable amount)
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
# cd c4
# git lfs pull --include "en/*"
# 2. Converts to JSONL format
# 3. Tokenizes into Megatron .bin/.idx format using DeepSeekV3Tokenizer
#
# Usage:
# bash prepare_c4_data.sh [--num_shards N] [--data_dir /path/to/data]
#
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
Copilot AI Mar 3, 2026
The header comment says the default downloads 1 shard for testing, but NUM_SHARDS is set to default to 200. Please update either the comment or the default so they match (and avoid surprising users by downloading/processing 200 shards by default).

Suggested change
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
NUM_SHARDS=${NUM_SHARDS:-1} # Number of C4 shards to download (1-1024)

DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
Comment on lines +23 to +25
Copilot AI Feb 27, 2026
PRIMUS_PATH defaults to a user-specific absolute path (/shared/john/Primus), which makes this script non-portable and likely to fail for other users/environments. Consider requiring PRIMUS_PATH to be provided (and exit with a clear message if unset) or deriving it relative to the repo root.

Suggested change
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"${SCRIPT_DIR}/../Primus"}
if [[ ! -d "$PRIMUS_PATH" ]]; then
echo "Error: PRIMUS_PATH is not set to a valid directory: '$PRIMUS_PATH'" >&2
echo "Please set PRIMUS_PATH explicitly, for example:" >&2
echo " export PRIMUS_PATH=/path/to/Primus" >&2
exit 1
fi

TOKENIZER_TYPE="DeepSeekV3Tokenizer"
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)} # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"} # Set your HuggingFace token
Comment on lines +16 to +29
Copilot AI Feb 27, 2026
The header comment says the script "downloads" C4 shards and that the default is 1 shard, but the implementation explicitly skips downloading and defaults NUM_SHARDS to 200. Please align the documentation and defaults with the actual behavior (either implement download, or update comments and set NUM_SHARDS default accordingly).


# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
--num_shards) NUM_SHARDS="$2"; shift 2;;
--data_dir) DATA_DIR="$2"; shift 2;;
--workers) WORKERS="$2"; shift 2;;
*) echo "Unknown option: $1"; exit 1;;
esac
done

# ======================== Paths ========================
export RAW_DIR="${DATA_DIR}/en" # Pre-downloaded shards live here
export JSONL_DIR="${DATA_DIR}/jsonl"
export TOKENIZED_DIR="${DATA_DIR}/tokenized"
export TRAIN_OUTPUT_PREFIX="${TOKENIZED_DIR}/c4_en_train"
export NUM_SHARDS

mkdir -p "$RAW_DIR" "$JSONL_DIR" "$TOKENIZED_DIR"

echo "============================================"
echo "C4 English Data Preparation"
echo "============================================"
echo "NUM_SHARDS: ${NUM_SHARDS} (out of 1024 total)"
echo "DATA_DIR: ${DATA_DIR}"
echo "PRIMUS_PATH: ${PRIMUS_PATH}"
echo "TOKENIZER: ${TOKENIZER_TYPE} / ${TOKENIZER_MODEL}"
echo "WORKERS: ${WORKERS}"
echo "============================================"

# ======================== Step 1: Merge shards into JSONL ========================
echo ""
echo ">>> Step 1: Merging C4 English shards into JSONL (${NUM_SHARDS} shards)..."
echo " (Download skipped — using pre-downloaded shards in ${RAW_DIR})"

JSONL_FILE="${JSONL_DIR}/c4_en_train.jsonl"

if [ -f "${JSONL_FILE}" ]; then
echo "JSONL file already exists: ${JSONL_FILE}"
echo "Skipping merge. Delete it to re-merge."
else
# Verify shards exist
MISSING=0
for i in $(seq 0 $((NUM_SHARDS - 1))); do
SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
if [ ! -f "${RAW_DIR}/${SHARD_NAME}" ]; then
echo " WARNING: Missing shard ${SHARD_NAME}"
MISSING=$((MISSING + 1))
fi
done
if [ "$MISSING" -gt 0 ]; then
echo "ERROR: ${MISSING} shard(s) missing in ${RAW_DIR}. Cannot proceed."
exit 1
fi

echo "Decompressing and merging shards into JSONL ..."
for i in $(seq 0 $((NUM_SHARDS - 1))); do
SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done

Comment on lines +86 to +92
Copilot AI Feb 27, 2026
This merge step appends directly into the final JSONL_FILE with '>>'. If the script is interrupted or a shard is corrupt, you'll end up with a partial JSONL that then causes future runs to skip merging because the file exists. Write to a temporary file and atomically move it into place on success (and/or validate the output) to avoid leaving a bad cached artifact.

Suggested change
for i in $(seq 0 $((NUM_SHARDS - 1))); do
SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done
# Write to a temporary file first to avoid leaving a corrupted final JSONL
TMP_JSONL_FILE="$(mktemp "${JSONL_DIR}/c4_en_train.jsonl.tmp.XXXXXX")"
# Ensure the temporary file is cleaned up on failure or interruption
cleanup_tmp() {
if [ -n "${TMP_JSONL_FILE:-}" ] && [ -f "${TMP_JSONL_FILE}" ]; then
rm -f "${TMP_JSONL_FILE}"
fi
}
trap cleanup_tmp EXIT INT TERM
for i in $(seq 0 $((NUM_SHARDS - 1))); do
SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
zcat "${SHARD_PATH}" >> "${TMP_JSONL_FILE}"
done
# Basic validation: ensure the merged file is non-empty before finalizing
if [ ! -s "${TMP_JSONL_FILE}" ]; then
echo "ERROR: Merged JSONL is empty; aborting."
cleanup_tmp
exit 1
fi
# Move the completed temp file into place atomically
mv "${TMP_JSONL_FILE}" "${JSONL_FILE}"
# Prevent trap from deleting the now-final JSONL file
TMP_JSONL_FILE=""
trap - EXIT INT TERM

DOC_COUNT=$(wc -l < "${JSONL_FILE}")
echo "Done! Total documents: ${DOC_COUNT}"
echo "Saved to: ${JSONL_FILE}"
fi

echo ">>> Step 1 complete."

# ======================== Step 2: Tokenize ========================
echo ""
echo ">>> Step 2: Tokenizing with ${TOKENIZER_TYPE}..."

JSONL_FILE="${JSONL_DIR}/c4_en_train.jsonl"

if [ -f "${TRAIN_OUTPUT_PREFIX}_text_document.bin" ] && [ -f "${TRAIN_OUTPUT_PREFIX}_text_document.idx" ]; then
echo "Tokenized files already exist:"
echo " ${TRAIN_OUTPUT_PREFIX}_text_document.bin"
echo " ${TRAIN_OUTPUT_PREFIX}_text_document.idx"
echo "Skipping tokenization. Delete them to re-tokenize."
else
# Need to set up Python path for Megatron imports
export PYTHONPATH="${PRIMUS_PATH}/third_party/Megatron-LM:${PRIMUS_PATH}:${PYTHONPATH:-}"

python3 "${PRIMUS_PATH}/examples/megatron/preprocess_data.py" \
--input "${JSONL_FILE}" \
--tokenizer-type "${TOKENIZER_TYPE}" \
--tokenizer-model "${TOKENIZER_MODEL}" \
--output-prefix "${TRAIN_OUTPUT_PREFIX}" \
--workers "${WORKERS}" \
--append-eod \
--partitions 1

echo ">>> Step 2 complete."
fi

# ======================== Summary ========================
echo ""
echo "============================================"
echo "Data preparation complete!"
echo "============================================"
echo ""
echo "Tokenized data files:"
ls -lh "${TOKENIZED_DIR}/"
echo ""
echo "To use this data for training, set in run_dsv3.sh:"
echo ""
echo " 1. Change: --mock_data True → --mock_data False"
echo " 2. Add env: export PRIMUS_TOKENIZED_DATA_PATH=${TRAIN_OUTPUT_PREFIX}_text_document"
echo ""
echo "Or pass directly via environment variable before running:"
echo " export PRIMUS_TOKENIZED_DATA_PATH=${TRAIN_OUTPUT_PREFIX}_text_document"
echo ""
echo "============================================"
@@ -193,12 +193,26 @@ def inject(
local_rank = torch.cuda.current_device()
r_total, r_used, r_free = get_rocm_smi_mem_info(local_rank)
r_ratio = r_used / r_total

# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
Copilot AI Feb 27, 2026
usage_tensor is created as float32 even though r_used is a byte count on the order of 1e11 for large GPUs. float32 will lose integer precision at that scale, which can make max-rank selection and reported GB values slightly inaccurate. Prefer using an integer dtype (e.g., int64) for the gathered byte counts, then format as GB on CPU.

Suggested change
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)

world_size = torch.distributed.get_world_size()
gathered_usage = [torch.zeros_like(usage_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_usage, usage_tensor)

rocm_mem_usages = [t.item() for t in gathered_usage]
max_usage = max(rocm_mem_usages)
max_rank = rocm_mem_usages.index(max_usage)
Comment on lines +197 to +205
Copilot AI Feb 27, 2026
The ROCm SMI logging now depends on torch.tensor and torch.distributed.* inside the same try block as get_rocm_smi_mem_info. In unit tests this module monkeypatches 'torch' with a minimal fake that lacks tensor/distributed, so this will swallow the exception and drop the "rocm mem usage/free/total" segment entirely, breaking existing assertions. Consider constructing the local ROCm SMI string first, then optionally (in a separate guarded block) doing distributed collectives only when torch.distributed is available+initialized, so local stats remain logged even without distributed.
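One way to sketch the decoupling this comment asks for: build the local string unconditionally, and attempt the collective only behind an availability guard. The function name and plain-byte arguments are illustrative, not the module's actual API; the int64 dtype follows the precision comment earlier in this review.

```python
def rocm_mem_summary(r_used, r_free, r_total):
    # Build the local ROCm SMI segment first, with no distributed dependency,
    # so it survives environments with a minimal fake `torch`.
    summary = (
        f" | rocm mem usage/free/total/usage_ratio: "
        f"{r_used / 1024 ** 3:.2f}GB/"
        f"{r_free / 1024 ** 3:.2f}GB/"
        f"{r_total / 1024 ** 3:.2f}GB/"
        f"{r_used / r_total * 100:.2f}%"
    )
    # Only attempt the all_gather when a real, initialized torch.distributed exists.
    try:
        import torch
        dist = getattr(torch, "distributed", None)
        if dist is not None and dist.is_available() and dist.is_initialized():
            usage = torch.tensor([r_used], device="cuda", dtype=torch.int64)
            gathered = [torch.zeros_like(usage) for _ in range(dist.get_world_size())]
            dist.all_gather(gathered, usage)
            usages = [int(t.item()) for t in gathered]
            max_usage = max(usages)
            max_rank = usages.index(max_usage)
            summary += (
                f" | rank-{max_rank} rocm max mem usage/usage_ratio: "
                f"{max_usage / 1024 ** 3:.2f}GB/"
                f"{max_usage / r_total * 100:.2f}%"
            )
    except (ImportError, AttributeError):
        pass  # local stats remain logged even without distributed support
    return summary
```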


rocm_mem_str = (
f" | rocm mem usage/free/total/usage_ratio: "
f"{r_used / 1024 ** 3:.2f}GB/"
f"{r_free / 1024 ** 3:.2f}GB/"
f"{r_total / 1024 ** 3:.2f}GB/"
f"{r_ratio * 100:.2f}%"
f" | rank-{max_rank} rocm max mem usage/usage_ratio: "
f"{max_usage / 1024 ** 3:.2f}GB/"
f"{max_usage / r_total * 100:.2f}%"
Comment on lines +197 to +215
Copilot AI Mar 3, 2026
max_usage / r_total uses the local rank’s r_total when computing the max-usage ratio, which can be incorrect if ranks have different total HBM sizes (or if totals differ for any reason). To make this accurate, gather each rank’s r_total (or gather precomputed r_used/r_total ratios) and compute the max ratio corresponding to max_rank.
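The selection step this comment suggests can be kept as a small pure function over gathered (used, total) pairs, so each rank's ratio uses its own HBM size. The gather itself is assumed to happen elsewhere (e.g. an all_gather of a 2-element int64 tensor per rank) and is out of scope for this sketch:

```python
def max_usage_ratio(pairs):
    """Pick the rank with the highest used/total memory ratio.

    pairs: per-rank (used_bytes, total_bytes) tuples, e.g. gathered via
    an all_gather of a 2-element int64 tensor from every rank.
    Returns (rank, ratio).
    """
    ratios = [used / total for used, total in pairs]
    max_rank = max(range(len(ratios)), key=ratios.__getitem__)
    return max_rank, ratios[max_rank]
```

With heterogeneous totals, this picks rank 2 below even though rank 1 has more absolute bytes in use elsewhere, because the ratio is computed against each rank's own total.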

)
# Cache for reuse on non-sampled iterations
self._last_rocm_mem_str = rocm_mem_str
51 changes: 51 additions & 0 deletions start_training_dsv2_lite.sh
@@ -0,0 +1,51 @@
#!/bin/bash

export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
Comment on lines +3 to +4
Copilot AI Mar 3, 2026
This script exports HF_TOKEN and WANDB_API_KEY as literal strings, which both encourages storing secrets in a committed file and also overwrites any values already present in the caller’s environment. Prefer reading these from the environment (and erroring if missing) rather than exporting placeholder values here.

Suggested change
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
: "${HF_TOKEN:?Environment variable HF_TOKEN must be set}"
: "${WANDB_API_KEY:?Environment variable WANDB_API_KEY must be set}"
export HF_TOKEN
export WANDB_API_KEY

export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic"
#export SLURM_TREE_WIDTH=128

export NNODES=128
export SLURM_TIME=07:00:00
export SLURM_PARTITION=amd-aig

# export NCCL_DEBUG=INFO
export USING_AINIC=1
export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1
Copilot AI Feb 27, 2026
CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.

Suggested change
export CLEAN_DOCKER_CONTAINER=1
# Set to 1 to allow run_slurm_pretrain.sh to clean up all Docker/Podman containers on the host.
# Use 1 only on dedicated/non-shared nodes where this is safe.
export CLEAN_DOCKER_CONTAINER=0


export MBS=12
export GBS=$((96 * NNODES))
export PROFILE=False
export TURBO_GROUPED_MLP=False
export TURBO_DEEPEEP=True
export LEGACY_GG=True
export PRIMUS_DETERMINISTIC=0

# export EXP=examples/megatron/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
export EXP=examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml
export PRIMUS_TEAM=amd
export PRIMUS_USER=tas
export PRIMUS_EXP_NAME=dsv2_lite-pretrain-mbs_$MBS-gbs_$GBS-turbogg_$TURBO_GROUPED_MLP-turbodeepep_$TURBO_DEEPEEP-legacygg_$LEGACY_GG-profile_$PROFILE

mkdir -p output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME
bash ./examples/run_slurm_pretrain.sh \
--train_iters 10 \
--disable_wandb True \
--disable_tensorboard True \
--micro_batch_size $MBS \
--global_batch_size $GBS \
--seq_length 4096 \
--max_position_embeddings 4096 \
--use_turbo_grouped_mlp $TURBO_GROUPED_MLP \
--use_turbo_deepep $TURBO_DEEPEEP \
--moe_use_legacy_grouped_gemm $LEGACY_GG \
--cross_entropy_fusion_impl "te" \
--cross_entropy_loss_fusion True \
--profile $PROFILE \
--use_pytorch_profiler $PROFILE \
--profile_step_end 7 \
--profile_step_start 6 \
2>&1 | tee output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME/log.txt