30 changes: 30 additions & 0 deletions ray-train-megatron/Dockerfile
@@ -0,0 +1,30 @@
# Containerfile for Megatron-Bridge with transformer_engine
FROM anyscale/ray:2.53.0-py312-cu128

# Install core dependencies
RUN pip install --no-cache-dir \
    "transformers>=4.57.1" \
    datasets \
    accelerate \
    "omegaconf>=2.3.0" \
    "tensorboard>=2.19.0" \
    typing-extensions \
    rich \
    "wandb>=0.19.10" \
    "pyyaml>=6.0.2" \
    "tqdm>=4.67.1" \
    "hydra-core>1.3,<=1.3.2" \
    timm \
    megatron-energon

# Install NVIDIA packages - transformer_engine is the key dependency
RUN pip install --no-cache-dir nvidia-modelopt
RUN pip install --no-cache-dir nvidia-resiliency-ext
RUN pip install --no-cache-dir --no-build-isolation "transformer_engine[pytorch]"

WORKDIR /app

# Clone Megatron-Bridge and submodules
RUN git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git && \
    cd Megatron-Bridge && \
    git submodule update --init 3rdparty/Megatron-LM
87 changes: 87 additions & 0 deletions ray-train-megatron/README.md
@@ -0,0 +1,87 @@
# Fine-Tuning an LLM with Megatron-Bridge and Ray Train

This example demonstrates how to use **Ray Train** to run distributed, multi-GPU **Megatron-Bridge** training on Anyscale. It performs supervised fine-tuning (SFT) on the Qwen/Qwen2.5-0.5B model.
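
A typical way to wire this up is Ray Train's `TorchTrainer`; the sketch below only illustrates the pattern and is not the script itself. The actual Megatron-Bridge model/data/optimizer setup in `llm_sft_ray_train_megatron.py` is stubbed out, and the worker count and storage path simply mirror the example values used later in this README.

```python
# Simplified sketch only -- the real per-worker logic lives in
# llm_sft_ray_train_megatron.py; here it is replaced by a stub.
import ray.train
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config: dict):
    # Each Ray Train worker runs this function on its own GPU.
    rank = ray.train.get_context().get_world_rank()
    print(f"worker {rank}: would run {config['train_iters']} Megatron-Bridge iterations")


trainer = TorchTrainer(
    train_func,
    train_loop_config={"train_iters": 100},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # 8 GPUs total
    run_config=RunConfig(storage_path="/mnt/cluster_storage/megatron_experiment"),
)
trainer.fit()
```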


## Option 1: Run as an Anyscale Job

This is the simplest way to execute the training. The job will automatically build the environment, provision resources, and run the script.

### 1. Install Anyscale CLI
If you haven't already:
```bash
pip install -U anyscale
anyscale login
```

### 2. Submit the Job
Clone the repository and submit the job using the provided YAML configuration:

```bash
# Clone the repository
git clone https://github.com/anyscale/examples.git
cd examples/ray-train-megatron

# Submit the job
anyscale job submit -f job.yaml
```

**What this job does:**
1. **Builds** a Docker image with Megatron-Bridge and dependencies (using `Dockerfile`).
2. **Provisions** 8 GPUs (default: 2 nodes with 4xL4 GPUs each).
3. **Runs** the distributed training script `llm_sft_ray_train_megatron.py`.

---

## Option 2: Run in an Anyscale Workspace (Interactive)

Use a Workspace for interactive development, debugging, or modifying the code.

### 1. Build the Container Image

To ensure all dependencies are installed, you need to build a custom image.

Follow the [Build Farm guide](https://docs.anyscale.com/container-image/build-image#build-farm) and create a new container image named `megatron-bridge-ray-train` on Anyscale, using the `Dockerfile` in this directory as the build configuration.

### 2. Create a Workspace

1. Start a new Workspace.
2. Select the `megatron-bridge-ray-train` image you just built.
3. Configure the **Compute**:
   - **Head Node:** 1x CPU node (e.g., `m5.xlarge`).
   - **Worker Nodes:** Select the `Auto-select nodes` option. Anyscale will automatically pick 4x L4 GPU nodes available in your cloud; make sure your cloud has enough GPU capacity (8 GPUs total for this example).

### 3. Run the Training

Once your Workspace is running, open a terminal (VS Code or Jupyter) and execute the following:

```bash
# 1. Clone the repository
git clone https://github.com/anyscale/examples.git
cd examples/ray-train-megatron

# 2. Set environment variables
export RAY_TRAIN_V2_ENABLED=1
export MEGATRON_BRIDGE_ROOT=/app/Megatron-Bridge
export PYTHONPATH=$PYTHONPATH:/app/Megatron-Bridge/src:/app/Megatron-Bridge/3rdparty/Megatron-LM
export HF_HOME=/mnt/cluster_storage/huggingface
export PYTHONUNBUFFERED=1

# 3. Run the training script
python llm_sft_ray_train_megatron.py \
    --hf_model_path Qwen/Qwen2.5-0.5B \
    --num_workers 8 \
    --tensor_parallel_size 2 \
    --pipeline_parallel_size 2 \
    --train_iters 100 \
    --global_batch_size 8 \
    --micro_batch_size 1 \
    --seq_length 512 \
    --storage_path /mnt/cluster_storage/megatron_experiment
```
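
Before kicking off a longer run, you can optionally confirm that the key packages resolve on the cluster. The module names below are assumptions based on the image built from the `Dockerfile` in this directory (PyTorch, Transformer Engine, and the vendored Megatron-LM); adjust them if your layout differs.

```python
# Optional sanity check; the module names are assumptions based on the image
# built from the Dockerfile in this directory -- adjust if your layout differs.
import importlib

for name in ("torch", "transformer_engine", "megatron.core"):
    try:
        module = importlib.import_module(name)
        print(f"OK      {name} ({getattr(module, '__version__', 'version unknown')})")
    except ImportError as exc:
        print(f"MISSING {name}: {exc}")
```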

> **Note:** The configuration must satisfy `TP * PP * DP = Total GPUs`. For example, when using 8 GPUs (`--num_workers 8`), setting `TP=2` (`--tensor_parallel_size 2`) and `PP=2` (`--pipeline_parallel_size 2`) implies `DP = 8 / (2 * 2) = 2`. If you are using fewer than 8 GPUs, you must adjust these parameters accordingly.
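
The arithmetic in the note can be checked with a few lines of Python. This helper is purely illustrative (it is not part of the training script); the argument names mirror the CLI flags above.

```python
# Illustrative helper (not part of the repo): derive the data-parallel size
# from the flags above and fail early if the configuration is inconsistent.
def data_parallel_size(num_workers: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    model_parallel = tensor_parallel * pipeline_parallel
    if num_workers % model_parallel != 0:
        raise ValueError(f"{num_workers} GPUs cannot be split evenly into TP*PP={model_parallel} groups")
    return num_workers // model_parallel


print(data_parallel_size(num_workers=8, tensor_parallel=2, pipeline_parallel=2))  # DP = 2
```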

### 4. Locate the Checkpoints

After training completes, the checkpoints are written to `/mnt/cluster_storage/megatron_experiment/megatron_outputs/checkpoints`.
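
To see what the run produced, a small helper like the one below lists the contents of the checkpoint root. The exact subdirectory naming is determined by Megatron-Bridge's checkpointing, so treat this as a generic listing rather than a description of the layout.

```python
# Lists whatever the run wrote under the checkpoint root; subdirectory naming
# is determined by Megatron-Bridge's checkpointing.
from pathlib import Path

ckpt_root = Path("/mnt/cluster_storage/megatron_experiment/megatron_outputs/checkpoints")
for entry in sorted(ckpt_root.iterdir()):
    print(entry)
```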

45 changes: 45 additions & 0 deletions ray-train-megatron/job.yaml
@@ -0,0 +1,45 @@
# Anyscale Job configuration for Megatron-Bridge training
# 8 GPUs: 2 worker nodes with 4x L4 GPUs each (g6.12xlarge)

name: ray-train-megatron-bridge-8gpu-job

# Build a custom image using the local Dockerfile
containerfile: ./Dockerfile

# Alternatively, use a pre-built image (ask an Anyscale engineer for access)
#image_uri: anyscale/image/megatron-bridge-ray-train:1

# When empty, Anyscale will auto-select the instance types. You can also specify
# minimum and maximum resources.
compute_config:
# compute_config:
#   head_node:
#     instance_type: m5.xlarge
#   worker_nodes:
#     - instance_type: g6.12xlarge  # 4x L4 GPUs per node
#       min_nodes: 2
#       max_nodes: 2

working_dir: .

# Override workspace dependencies - ensure requirements.txt exists in the working directory
requirements: requirements.txt

env_vars:
  RAY_TRAIN_V2_ENABLED: "1"
  MEGATRON_BRIDGE_ROOT: "/app/Megatron-Bridge"
  PYTHONPATH: "/app/Megatron-Bridge/src:/app/Megatron-Bridge/3rdparty/Megatron-LM"
  NCCL_DEBUG: "WARN"
  PYTHONUNBUFFERED: "1"

entrypoint: >-
  python llm_sft_ray_train_megatron.py
  --hf_model_path Qwen/Qwen2.5-0.5B
  --num_workers 8
  --tensor_parallel_size 2
  --pipeline_parallel_size 2
  --train_iters 100
  --global_batch_size 8
  --micro_batch_size 1
  --seq_length 512
  --storage_path /mnt/cluster_storage/megatron_experiment