30 changes: 30 additions & 0 deletions ray-train-megatron/Dockerfile
@@ -0,0 +1,30 @@
# Containerfile for Megatron-Bridge with transformer_engine
FROM anyscale/ray:2.53.0-py312-cu128

# Install core dependencies
RUN pip install --no-cache-dir \
    "transformers>=4.57.1" \
    datasets \
    accelerate \
    "omegaconf>=2.3.0" \
    "tensorboard>=2.19.0" \
    typing-extensions \
    rich \
    "wandb>=0.19.10" \
    "pyyaml>=6.0.2" \
    "tqdm>=4.67.1" \
    "hydra-core>1.3,<=1.3.2" \
    timm \
    megatron-energon

# Install NVIDIA packages - transformer_engine is the key dependency
RUN pip install --no-cache-dir nvidia-modelopt
RUN pip install --no-cache-dir nvidia-resiliency-ext
RUN pip install --no-cache-dir --no-build-isolation "transformer_engine[pytorch]"

WORKDIR /app

# Clone Megatron-Bridge and submodules
RUN git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git && \
    cd Megatron-Bridge && \
    git submodule update --init 3rdparty/Megatron-LM
87 changes: 87 additions & 0 deletions ray-train-megatron/README.md
@@ -0,0 +1,87 @@
# Fine-Tuning an LLM with Megatron-Bridge and Ray Train

This example demonstrates how to use **Ray Train** to run distributed, multi-GPU **Megatron-Bridge** training on Anyscale. It performs supervised fine-tuning (SFT) on the Qwen/Qwen2.5-0.5B model.
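
A typical way to wire this up is Ray Train's `TorchTrainer`; the sketch below only illustrates the pattern and is not the script itself. The actual Megatron-Bridge model/data/optimizer setup in `llm_sft_ray_train_megatron.py` is stubbed out, and the worker count and storage path simply mirror the example values used later in this README.

```python
# Simplified sketch only -- the real per-worker logic lives in
# llm_sft_ray_train_megatron.py; here it is replaced by a stub.
import ray.train
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config: dict):
    # Each Ray Train worker runs this function on its own GPU.
    rank = ray.train.get_context().get_world_rank()
    print(f"worker {rank}: would run {config['train_iters']} Megatron-Bridge iterations")


trainer = TorchTrainer(
    train_func,
    train_loop_config={"train_iters": 100},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # 8 GPUs total
    run_config=RunConfig(storage_path="/mnt/cluster_storage/megatron_experiment"),
)
trainer.fit()
```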


## Option 1: Run as an Anyscale Job

This is the simplest way to execute the training. The job will automatically build the environment, provision resources, and run the script.

### 1. Install Anyscale CLI
If you haven't already:
```bash
pip install -U anyscale
anyscale login
```

### 2. Submit the Job
Clone the repository and submit the job using the provided YAML configuration:

```bash
# Clone the repository
git clone https://github.com/anyscale/examples.git
cd examples/ray-train-megatron

# Submit the job
anyscale job submit -f job.yaml
```

**What this job does:**
1. **Builds** a Docker image with Megatron-Bridge and dependencies (using `Dockerfile`).
2. **Provisions** 8 GPUs (default: 2 nodes with 4xL4 GPUs each).
3. **Runs** the distributed training script `llm_sft_ray_train_megatron.py`.

---

## Option 2: Run in an Anyscale Workspace (Interactive)

Use a Workspace for interactive development, debugging, or modifying the code.

### 1. Build the Container Image

To ensure all dependencies are installed, you need to build a custom image.

Follow the [Build Farm guide](https://docs.anyscale.com/container-image/build-image#build-farm) and create a new container image named `megatron-bridge-ray-train` on Anyscale, using the `Dockerfile` in this directory as the build configuration.

### 2. Create a Workspace

1. Start a new Workspace.
2. Select the `megatron-bridge-ray-train` image you just built.
3. Configure the **Compute**:
   - **Head Node:** 1x CPU node (e.g., `m5.xlarge`).
   - **Worker Nodes:** Select the `Auto-select nodes` option. Anyscale will automatically pick 4x L4 GPU nodes available in your cloud; make sure your cloud has enough GPU capacity (8 GPUs total for this example).

### 3. Run the Training

Once your Workspace is running, open a terminal (VS Code or Jupyter) and execute the following:

```bash
# 1. Clone the repository
git clone https://github.com/anyscale/examples.git
cd examples/ray-train-megatron

# 2. Set environment variables
export RAY_TRAIN_V2_ENABLED=1
export MEGATRON_BRIDGE_ROOT=/app/Megatron-Bridge
export PYTHONPATH=$PYTHONPATH:/app/Megatron-Bridge/src:/app/Megatron-Bridge/3rdparty/Megatron-LM
export HF_HOME=/mnt/cluster_storage/huggingface
export PYTHONUNBUFFERED=1

# 3. Run the training script
python llm_sft_ray_train_megatron.py \
    --hf_model_path Qwen/Qwen2.5-0.5B \
    --num_workers 8 \
    --tensor_parallel_size 2 \
    --pipeline_parallel_size 2 \
    --train_iters 100 \
    --global_batch_size 8 \
    --micro_batch_size 1 \
    --seq_length 512 \
    --storage_path /mnt/cluster_storage/megatron_experiment
```
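
Before kicking off a longer run, you can optionally confirm that the key packages resolve on the cluster. The module names below are assumptions based on the image built from the `Dockerfile` in this directory (PyTorch, Transformer Engine, and the vendored Megatron-LM); adjust them if your layout differs.

```python
# Optional sanity check; the module names are assumptions based on the image
# built from the Dockerfile in this directory -- adjust if your layout differs.
import importlib

for name in ("torch", "transformer_engine", "megatron.core"):
    try:
        module = importlib.import_module(name)
        print(f"OK      {name} ({getattr(module, '__version__', 'version unknown')})")
    except ImportError as exc:
        print(f"MISSING {name}: {exc}")
```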

> **Note:** The configuration must satisfy `TP * PP * DP = Total GPUs`. For example, when using 8 GPUs (`--num_workers 8`), setting `TP=2` (`--tensor_parallel_size 2`) and `PP=2` (`--pipeline_parallel_size 2`) implies `DP = 8 / (2 * 2) = 2`. If you are using fewer than 8 GPUs, you must adjust these parameters accordingly.
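
The arithmetic in the note can be checked with a few lines of Python. This helper is purely illustrative (it is not part of the training script); the argument names mirror the CLI flags above.

```python
# Illustrative helper (not part of the repo): derive the data-parallel size
# from the flags above and fail early if the configuration is inconsistent.
def data_parallel_size(num_workers: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    model_parallel = tensor_parallel * pipeline_parallel
    if num_workers % model_parallel != 0:
        raise ValueError(f"{num_workers} GPUs cannot be split evenly into TP*PP={model_parallel} groups")
    return num_workers // model_parallel


print(data_parallel_size(num_workers=8, tensor_parallel=2, pipeline_parallel=2))  # DP = 2
```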

### 4. Locate the Checkpoints

After training completes, the checkpoints are written to `/mnt/cluster_storage/megatron_experiment/megatron_outputs/checkpoints`.
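
To see what the run produced, a small helper like the one below lists the contents of the checkpoint root. The exact subdirectory naming is determined by Megatron-Bridge's checkpointing, so treat this as a generic listing rather than a description of the layout.

```python
# Lists whatever the run wrote under the checkpoint root; subdirectory naming
# is determined by Megatron-Bridge's checkpointing.
from pathlib import Path

ckpt_root = Path("/mnt/cluster_storage/megatron_experiment/megatron_outputs/checkpoints")
for entry in sorted(ckpt_root.iterdir()):
    print(entry)
```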

45 changes: 45 additions & 0 deletions ray-train-megatron/job.yaml
@@ -0,0 +1,45 @@
# Anyscale Job configuration for Megatron-Bridge training
# 8 GPUs: 2 worker nodes with 4x L4 GPUs each (g6.12xlarge)

name: ray-train-megatron-bridge-8gpu-job

# Build a custom image using the local Dockerfile
containerfile: ./Dockerfile

# Alternatively, use a pre-built image (ask an Anyscale engineer for access)
#image_uri: anyscale/image/megatron-bridge-ray-train:1

# When empty, Anyscale will auto-select the instance types. You can also specify
# minimum and maximum resources.
compute_config:
# compute_config:
#   head_node:
#     instance_type: m5.xlarge
#   worker_nodes:
#     - instance_type: g6.12xlarge  # 4x L4 GPUs per node
#       min_nodes: 2
#       max_nodes: 2

working_dir: .

# Override workspace dependencies - ensure requirements.txt exists in the working directory
requirements: requirements.txt

env_vars:
  RAY_TRAIN_V2_ENABLED: "1"
  MEGATRON_BRIDGE_ROOT: "/app/Megatron-Bridge"
  PYTHONPATH: "/app/Megatron-Bridge/src:/app/Megatron-Bridge/3rdparty/Megatron-LM"
  NCCL_DEBUG: "WARN"
  PYTHONUNBUFFERED: "1"

entrypoint: >-
  python llm_sft_ray_train_megatron.py
  --hf_model_path Qwen/Qwen2.5-0.5B
  --num_workers 8
  --tensor_parallel_size 2
  --pipeline_parallel_size 2
  --train_iters 100
  --global_batch_size 8
  --micro_batch_size 1
  --seq_length 512
  --storage_path /mnt/cluster_storage/megatron_experiment