LLM360 · Jianshu-She · Jun 20, 2025
diff --git a/.gemini/config.yaml b/.gemini/config.yaml
@@ -0,0 +1,10 @@
+have_fun: false
+code_review:
+  disable: false
+  comment_severity_threshold: HIGH
+  max_review_comments: -1
+  pull_request_opened:
+    help: false
+    summary: false
+    code_review: true
+ignore_patterns: []
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -0,0 +1,18 @@
+/docs @eric-haibin-lin @zhaochenyang20 @hongpeng-guo
+/docs/amd_tutorial @yushengsu-thu
+/docs/slang_multiturn @zhaochenyang20 @SwordFaith
+
+/recipe/dapo @tongyx361 @PeterSH6
+/recipe/spin @zhaochenyang20
+/recipe/sppo @zhaochenyang20
+
+/third_party/sglang @zhaochenyang20 @SwordFaith
+/third_party/vllm @PeterSH6 @wuxibin89
+/verl/single_controller @zw0610 @wuxibin89 @hongpeng-guo
+/verl/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6
+/verl/workers/rollout/vllm_rollout @wuxibin89 @PeterSH6 @chenhaiq
+/verl/workers/rollout/sglang_rollout @zhaochenyang20 @SwordFaith @chenhaiq
+
+/tests/single_controller @zw0610 @wuxibin89
+/tests/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6
+/tests/workers/rollout/vllm_rollout @wuxibin89 @PeterSH6 @chenhaiq
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,46 +1,40 @@
-### Checklist Before Starting
-
-- [ ] Search for similar PR(s).
-
 ### What does this PR do?
 
-> Add one-line overview of what this PR aims to achieve or accomplish. 
+> Add a **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.
 
-### High-Level Design
-
-> Demonstrate the high-level design if this PR is complex.
-
-### Specific Changes
+### Checklist Before Starting
 
-> List the specific changes.
+- [ ] Search for similar PRs. Paste at least one query link here: ...
+- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
+  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
+  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
+  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
+  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
+  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
 
-### API
+### Test
 
-> Demonstrate how the API changes if any.
+> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
 
-### Usage Example
+### API and Usage Example
 
-> Provide usage example(s) for easier usage.
+> Demonstrate how the API changes if any, and provide usage example(s) if possible.
 
 ```python
-# Add code snippet or script demonstrating how to use this 
+# Add code snippet or script demonstrating how to use this
 ```
 
-### Test
-
-> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc.
+### Design & Code Changes
 
-### Additional Info.
-
-- **Issue Number**: Fixes issue # or discussion # if any.
-- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
-- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]
+> Demonstrate the high-level design if this PR is complex, and list the specific changes.
 
 ### Checklist Before Submitting
 
-- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
-- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
-- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
-- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
-- [ ] New CI unit test(s) are added to cover the code path.
-- [ ] Rely on existing unit tests on CI that covers the code path.
+> [!IMPORTANT]
+> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
+
+- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
+- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
+- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
+- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
+- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
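
The title convention in the checklist above (`[{modules}] {type}: {description}`, with an optional leading `[BREAKING]`) could be validated along the following lines. This is a minimal sketch, not the checker that CI actually runs: the module and type lists are copied from the template, while the regex, the `check_pr_title` name, and the list-of-problems return format are assumptions made for illustration.

```python
import re

# Module and type names are copied from the PR template. The regex and
# function below are illustrative only, not the repository's real checker.
ALLOWED_MODULES = {
    "fsdp", "megatron", "sglang", "vllm", "rollout", "trainer", "ci",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool",
    "ckpt", "doc", "data",
}
ALLOWED_TYPES = {"feat", "fix", "refactor", "chore", "test"}

TITLE_RE = re.compile(
    r"^(?P<breaking>\[BREAKING\])?\s*"   # optional breaking-change marker
    r"\[(?P<modules>[^\]]+)\]\s+"        # comma-separated module list
    r"(?P<type>\w+):\s+"                 # change type followed by a colon
    r"(?P<description>\S.*)$"            # non-empty description
)


def check_pr_title(title: str) -> list[str]:
    """Return a list of problems; an empty list means the title is valid."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return ["title does not match '[{modules}] {type}: {description}'"]

    problems = []
    for module in (m.strip() for m in match.group("modules").split(",")):
        if module not in ALLOWED_MODULES:
            problems.append(f"unknown module: {module!r}")
    if match.group("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {match.group('type')!r}")
    return problems


if __name__ == "__main__":
    # The example title from the template passes with no problems reported.
    print(check_pr_title("[BREAKING][fsdp, megatron] feat: dynamic batching"))
```

Returning every problem at once, rather than failing on the first, lets a contributor fix the title in a single edit.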
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -0,0 +1,69 @@
+### Adding a New Workflow
+
+When adding a new workflow for continuous integration (CI), you have two runner options: a fixed runner or a machine from Vemlp.
+
+- **Fixed Runner**: To use a fixed runner, specify it in your workflow using the `runs-on` keyword, like `runs-on: [L20x8]`.
+- **Vemlp Runner**: Opting for a Vemlp machine allows you to launch tasks elastically.
+
+Here is a template to assist you. This template is designed for using Vemlp machines. Currently, for each workflow, you need to create a `setup` and a `cleanup` job. When using this template, the main parts you need to modify are the `IMAGE` environment variable and the specific `job steps`.
+
+```yaml
+name: Your Default Workflow
+
+on:
+  push:
+    branches:
+      - main
+      - v0.*
+  pull_request:
+    branches:
+      - main
+      - v0.*
+    paths:
+      - "**/*.py"
+      - ".github/workflows/template.yml"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+
+permissions:
+  contents: read
+
+env:
+  IMAGE: "your vemlp image" # e.g. "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1"
+  DYNAMIC_RUNNER_URL: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner" # public veFaas api
+
+jobs:
+  setup:
+    if: github.repository_owner == 'volcengine'
+    runs-on: ubuntu-latest
+    outputs:
+      runner-label: ${{ steps.create-runner.outputs.runner-label }}
+      task-id: ${{ steps.create-runner.outputs.task-id }}
+    steps:
+      - uses: actions/checkout@v4
+      - id: create-runner
+        uses: volcengine/vemlp-github-runner@v1 
+        with:
+          mode: "create"
+          faas-url: "${{ env.DYNAMIC_RUNNER_URL }}"
+          image: "${{ env.IMAGE }}"
+
+  your_job:
+    needs: setup
+    runs-on: ["${{ needs.setup.outputs.runner-label || 'default-runner' }}"]
+    steps:
+      xxxx # your jobs
+
+  cleanup:
+    runs-on: ubuntu-latest
+    needs: [setup, your_job]
+    if: always()
+    steps:
+      - id: destroy-runner
+        uses: volcengine/vemlp-github-runner@v1
+        with:
+          mode: "destroy"
+          faas-url: "${{ env.DYNAMIC_RUNNER_URL }}"
+          task-id: "${{ needs.setup.outputs.task-id }}"
diff --git a/.github/workflows/check-pr-title.yml b/.github/workflows/check-pr-title.yml
@@ -0,0 +1,58 @@
+# # Tests layout
+
+# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+# - `tests/trainer` for testing functionality related to `verl/trainer`
+# - `tests/models` for testing functionality related to `verl/models`
+# - ...
+
+# There are a few folders with `special_` prefix, created for special purposes:
+# - `special_distributed`: unit tests that must run with multiple GPUs
+# - `special_e2e`: end-to-end tests with training/generation scripts
+# - `special_npu`: tests for NPUs
+# - `special_sanity`: a suite of quick sanity tests
+# - `special_standalone`: a set of tests that are designed to run in dedicated environments
+
+# Accelerators for tests
+# - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+# # Workflow layout
+
+# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+# 3. End-to-end tests: `e2e_*.yml`
+# 4. Unit tests
+#   - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+#   - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
+#   - Since the cpu/gpu unit tests run all tests under `tests` by default, please make sure tests are manually excluded from them when
+#     - a new workflow yaml is added to `.github/workflows`
+#     - new tests are added to the workflows mentioned in 2.
+
+
+on:
+  pull_request:
+    types: [opened, edited, synchronize]
+
+jobs:
+  check-title:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Run PR title checker
+        run: python3 tests/special_sanity/check_pr_title.py
+        env:
+          PR_TITLE: ${{ github.event.pull_request.title }}
+
+      - name: Run PR description checker
+        run: python3 tests/special_sanity/check_pr_description.py
+        env:
+          PR_TITLE: ${{ github.event.pull_request.title }}
+          GITHUB_EVENT_PATH: ${{ github.event_path }}
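
The accelerator convention described in this workflow's header comment (tests under `special_npu` go to NPUs, scripts whose names end with `on_cpu.py` go to CPU, everything else runs with GPUs) can be sketched as a small classifier. This is an illustration of the naming rule only: the `classify_test` helper and the example paths are hypothetical and are not part of the repository's CI.

```python
from pathlib import PurePosixPath


def classify_test(path: str) -> str:
    """Return 'npu', 'cpu', or 'gpu' for a test script path under tests/."""
    p = PurePosixPath(path)
    # Anything under a `special_npu` folder is reserved for NPU runs.
    if "special_npu" in p.parts:
        return "npu"
    # Scripts whose file name ends with `on_cpu.py` run on CPU resources.
    if p.name.endswith("on_cpu.py"):
        return "cpu"
    # Every other test is assumed to run with GPUs available.
    return "gpu"


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the three branches.
    for example in (
        "tests/trainer/test_config_on_cpu.py",
        "tests/special_npu/run_example.sh",
        "tests/models/test_engine.py",
    ):
        print(example, "->", classify_test(example))
```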
diff --git a/.github/workflows/checkpoint_converter.yml b/.github/workflows/checkpoint_converter.yml
@@ -1,3 +1,36 @@
+# # Tests layout
+
+# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
+# - `tests/trainer` for testing functionality related to `verl/trainer`
+# - `tests/models` for testing functionality related to `verl/models`
+# - ...
+
+# There are a few folders with `special_` prefix, created for special purposes:
+# - `special_distributed`: unit tests that must run with multiple GPUs
+# - `special_e2e`: end-to-end tests with training/generation scripts
+# - `special_npu`: tests for NPUs
+# - `special_sanity`: a suite of quick sanity tests
+# - `special_standalone`: a set of tests that are designed to run in dedicated environments
+
+# Accelerators for tests
+# - By default, tests are run with GPUs available, except for the ones under `special_npu` and any test script whose name ends with `on_cpu.py`.
+# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
+
+# # Workflow layout
+
+# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
+# 1. A list of always triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
+# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
+# 3. End-to-end tests: `e2e_*.yml`
+# 4. Unit tests
+#   - `cpu_unit_tests.yml` runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
+#   - `gpu_unit_tests.yml` runs pytest on all test scripts without the `on_cpu.py` suffix.
+#   - Since the cpu/gpu unit tests run all tests under `tests` by default, please make sure tests are manually excluded from them when
+#     - a new workflow yaml is added to `.github/workflows`
+#     - new tests are added to the workflows mentioned in 2.
+
+
+
 name: checkpoint_converter
 # latest version: Megatron-LM core_r0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0
 
@@ -27,7 +60,7 @@ on:
       - ".github/workflows/checkpoint_converter.yml"
       - ".github/workflows/e2e_ppo_trainer_megatron.yml"
       - "examples/data_preprocess/gsm8k.py"
-      - "tests/e2e/run_ppo_trainer_megatron.sh"
+      - "tests/special_e2e/run_ppo_trainer_megatron.sh"
       - "verl/trainer/main_ppo.py"
       - "verl/trainer/config/ppo_megatron_trainer.yaml"
 
@@ -51,7 +84,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -63,11 +96,11 @@ jobs:
       - name: Running Huggingface to Megatron dist_ckpt converter (Qwen/Qwen2.5-0.5B)
         run: |
           ray stop --force
-          python scripts/converter_hf_to_mcore.py --hf_model_path=${HOME}/models/Qwen/Qwen2.5-0.5B --output_path checkpoints/Qwen/Qwen2.5-0.5B
+          python scripts/converter_hf_to_mcore.py --hf_model_path=${HOME}/models/Qwen/Qwen2.5-0.5B --output_path checkpoints/Qwen/Qwen2.5-0.5B --test
       - name: Running Huggingface to Megatron dist_ckpt converter (deepseek-ai/deepseek-coder-1.3b-instruct)
         run: |
           ray stop --force
-          python scripts/converter_hf_to_mcore.py --hf_model_path=${HOME}/models/deepseek-ai/deepseek-coder-1.3b-instruct --output_path checkpoints/deepseek-ai/deepseek-coder-1.3b-instruct
+          python scripts/converter_hf_to_mcore.py --hf_model_path=${HOME}/models/deepseek-ai/deepseek-coder-1.3b-instruct --output_path checkpoints/deepseek-ai/deepseek-coder-1.3b-instruct --test
       - name: Clean up
         run: |
           rm -rf checkpoints
@@ -81,7 +114,7 @@ jobs:
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
       HF_ENDPOINT: "https://hf-mirror.com"
     container:
-      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -98,6 +131,10 @@ jobs:
         run: |
           ray stop --force
           python scripts/converter_hf_to_mcore.py --hf_model_path=${HOME}/models/Qwen/Qwen1.5-MoE-A2.7B-Chat --output_path checkpoints/Qwen/Qwen1.5-MoE-A2.7B-Chat --use_cpu_initialization
+      - name: Running distributed Huggingface to Megatron dist_ckpt CPU converter (Qwen/Qwen1.5-MoE-A2.7B-Chat)
+        run: |
+          ray stop --force
+          torchrun --nproc_per_node 8 --nnodes 1 scripts/converter_hf_to_mcore.py --hf_model_path=${HOME}/models/Qwen/Qwen1.5-MoE-A2.7B-Chat --output_path checkpoints/Qwen/Qwen1.5-MoE-A2.7B-Chat_dist --use_cpu_initialization
       - name: clean up
         run: |
           rm -rf checkpoints