37 changes: 19 additions & 18 deletions examples/README.md
@@ -4,21 +4,22 @@ These examples provide concrete examples to leverage Miles in your own RL workfl

## Directory Structure

- **[DrGRPO](./DrGRPO)**: Custom reducer for Dr.GRPO algorithm.
- **[eval](./eval)**: Documentation and setup for evaluation environments using NeMo-Skills.
- **[eval_multi_task](./eval_multi_task)**: Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench.
- **[formal_math](./formal_math)**: Examples related to formal math reasoning tasks, including a single round demo.
- **[fully_async](./fully_async)**: Demonstrates fully asynchronous rollout generation for higher efficiency.
- **[geo3k_vlm](./geo3k_vlm)**: Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset.
- **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)**: VLM multi-turn training (FSDP backend) on Geo3k dataset.
- **[low_precision](./low_precision)**: Examples of FP8 training and inference for improved throughput and stability.
- **[multi_agent](./multi_agent)**: Example of running multi-agent RL with `miles`.
- **[on_policy_distillation](./on_policy_distillation)**: Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training.
- **[reproducibility](./reproducibility)**: Guides on achieving bitwise experiment reproduction using deterministic modes.
- **[retool](./retool)**: Demonstrates the retool functionality for tool-enabled language model generation.
- **[search-r1](./search-r1)**: A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling.
- **[strands-agents](./strands-agents)**: Integration example with the Strands-Agents scaffolding framework.
- **[tau-bench](./tau-bench)**: Training in an agentic multi-turn tool use environment (Tau-bench).
- **[train_infer_mismatch_helper](./train_infer_mismatch_helper)**: Algorithmic methods for rollout correction (e.g., TIS, MIS).
- **[true_on_policy](./true_on_policy)**: Ensures strictly equal log probabilities between inference (SGLang) and training engines.
- **[true_on_policy_vlm](./true_on_policy_vlm)**: "True On-Policy" training demonstration for VLM (Qwen3-VL).
| Example | Description | W&B |
| :--- | :--- | :--- |
| **[DrGRPO](./DrGRPO)** | Custom reducer for the Dr.GRPO algorithm. | |
| **[eval](./eval)** | Documentation and setup for evaluation environments using NeMo-Skills. | [link](https://wandb.ai/zijie_xia-n-a/miles-eval) |
| **[eval_multi_task](./eval_multi_task)** | Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench. | [link](https://wandb.ai/zijie_xia-n-a/miles-eval-multi-task) |
| **[formal_math](./formal_math)** | Examples related to formal math reasoning tasks, including a single-round demo. | [link](https://wandb.ai/zijie_xia-n-a/miles-formal-math-run-minimal) |
| **[fully_async](./fully_async)** | Demonstrates fully asynchronous rollout generation for higher efficiency. | [link](https://wandb.ai/zijie_xia-n-a/miles-fully-async) |
| **[geo3k_vlm](./geo3k_vlm)** | Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm) |
| **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)** | VLM multi-turn training (FSDP backend) on the GEO3K dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm-multi-turn) |
| **[low_precision](./low_precision)** | Examples of FP8 training and inference for improved throughput and stability. | [link](https://wandb.ai/zijie_xia-n-a/miles-low-precision) |
| **[multi_agent](./multi_agent)** | Example of running multi-agent RL with `miles`. | [link](https://wandb.ai/zijie_xia-n-a/miles-multi-agent) |
| **[on_policy_distillation](./on_policy_distillation)** | Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training. | [link](https://wandb.ai/zijie_xia-n-a/miles-on-policy-distillation) |
| **[reproducibility](./reproducibility)** | Guides on achieving bitwise experiment reproduction using deterministic modes. | [link](https://wandb.ai/zijie_xia-n-a/miles-reproducibility) |
| **[retool](./retool)** | Demonstrates the retool functionality for tool-enabled language model generation. | [link](https://wandb.ai/zijie_xia-n-a/miles-retool) |
| **[search-r1](./search-r1)** | A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling. | [link](https://wandb.ai/zijie_xia-n-a/miles-search-r1) |
| **[strands-agents](./strands-agents)** | Integration example with the Strands-Agents scaffolding framework. | [link](https://wandb.ai/zijie_xia-n-a/miles-strands-agents) |
| **[train_infer_mismatch_helper](./train_infer_mismatch_helper)** | Algorithmic methods for rollout correction (e.g., TIS, MIS). | [link](https://wandb.ai/zijie_xia-n-a/miles-train-infer-mismatch) |
| **[true_on_policy](./true_on_policy)** | Ensures strictly equal log probabilities between inference (SGLang) and training engines. | [link](https://wandb.ai/zijie_xia-n-a/miles-true-on-policy) |
| **[true_on_policy_vlm](./true_on_policy_vlm)** | "True On-Policy" training demonstration for VLM (Qwen3-VL). | |
4 changes: 2 additions & 2 deletions examples/eval/scripts/multi_tasks.yaml
@@ -6,11 +6,11 @@ eval:
top_k: -1
max_response_len: 24576
datasets: # these eval tasks go through miles dataset config and default rollout function (miles.rollout.sglang_rollout.generate_rollout)
- name: gpqa # huggingface-cli download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa
- name: gpqa # hf download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa
path: /root/gpqa/gpqa_eval.jsonl
rm_type: gpqa
n_samples_per_eval_prompt: 2
- name: ifbench # huggingface-cli download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench
- name: ifbench # hf download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench
path: /root/ifbench/IFBench_eval.jsonl
rm_type: ifbench
n_samples_per_eval_prompt: 1
2 changes: 1 addition & 1 deletion examples/eval_multi_task/multi_task.sh
@@ -97,7 +97,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project eval
--wandb-project miles-eval-multi-task
--wandb-group multi_task
--wandb-key ${WANDB_KEY}
)
10 changes: 9 additions & 1 deletion examples/fully_async/run-qwen3-4b-fully_async.sh
@@ -35,7 +35,7 @@ CKPT_ARGS=(
--save-interval 20
)

PROMPT_SET=/path/to/dapo-math-17k.jsonl
PROMPT_SET=/root/dapo-math-17k/dapo-math-17k.jsonl

ROLLOUT_ARGS=(
--rollout-function-path fully_async_rollout.generate_rollout_fully_async
@@ -96,6 +96,13 @@ OPTIMIZER_ARGS=(
--adam-beta2 0.98
)

WANDB_ARGS=(
--use-wandb
--wandb-project miles-fully-async
--wandb-group qwen3-4b-fully_async
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
)
@@ -134,6 +141,7 @@ ray job submit --address="http://127.0.0.1:8265" \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
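The `WANDB_ARGS` block added in this diff assumes `WANDB_KEY` is already exported and always passes `--use-wandb`. A defensive variant (a sketch, not part of this PR) would enable the flags only when the key is actually set, so the launcher still runs for users without a W&B account:

```shell
# Sketch (assumption, not in this diff): populate WANDB_ARGS only when
# WANDB_KEY is present in the environment; otherwise leave it empty so
# ${WANDB_ARGS[@]} expands to nothing in the ray job submit command.
WANDB_ARGS=()
if [ -n "${WANDB_KEY:-}" ]; then
  WANDB_ARGS=(
    --use-wandb
    --wandb-project miles-fully-async
    --wandb-group qwen3-4b-fully_async
    --wandb-key "${WANDB_KEY}"
  )
fi
echo "wandb flag count: ${#WANDB_ARGS[@]}"
```

Because an empty bash array expands to zero words, no other line of the script needs to change for this guard to work.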
2 changes: 1 addition & 1 deletion examples/geo3k_vlm/run_geo3k_vlm.sh
@@ -129,7 +129,7 @@ OPTIMIZER_ARGS=(
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
--sglang-mem-fraction-static 0.6
--sglang-cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248 256
--sglang-disable-cuda-graph
)

# Wandb args (only if WANDB_API_KEY is set)
8 changes: 2 additions & 6 deletions examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py
@@ -45,7 +45,7 @@ def execute():
wandb_args = (
(
"--use-wandb "
"--wandb-project miles-dev "
"--wandb-project miles-geo3k-vlm-multi-turn "
"--wandb-group geo3k_vlm_multi_turn "
f"--wandb-key '{wandb_api_key}' "
)
@@ -98,11 +98,7 @@ def execute():
"--adam-beta2 0.98 "
)

sglang_args = (
"--rollout-num-gpus-per-engine 1 "
"--sglang-mem-fraction-static 0.6 "
f"--sglang-cuda-graph-bs {' '.join(map(str, [1, 2, 4, 8] + list(range(16, 257, 8))))} "
)
sglang_args = "--rollout-num-gpus-per-engine 1 " "--sglang-mem-fraction-static 0.6 " "--sglang-disable-cuda-graph "

fsdp_args = (
"--train-backend fsdp "
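For reference, the literal `--sglang-cuda-graph-bs` list this PR removes from the shell launcher and the Python expression `[1, 2, 4, 8] + list(range(16, 257, 8))` it removes from the Python launcher describe the same 35 capture batch sizes. A quick sketch of rebuilding that list in shell, to check the equivalence:

```shell
# Rebuild the removed capture-batch-size list programmatically:
# 1 2 4 8, then every multiple of 8 from 16 through 256.
sizes="1 2 4 8"
for bs in $(seq 16 8 256); do
  sizes="$sizes $bs"
done
echo "--sglang-cuda-graph-bs $sizes"
```

With CUDA graphs now disabled via `--sglang-disable-cuda-graph`, neither form is needed, which is why both launchers drop the list entirely.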
10 changes: 5 additions & 5 deletions examples/low_precision/run-qwen3-4b-fp8.sh
@@ -98,10 +98,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
# --use-wandb
# --wandb-project miles-dev
# --wandb-group qwen3-4B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-low-precision
--wandb-group qwen3-4B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
@@ -125,7 +125,7 @@ PRECISE_ARGS=(
--bf16
--fp8-format e4m3
--fp8-recipe blockwise
--fp8-param-gather
# --fp8-param-gather
)


8 changes: 4 additions & 4 deletions examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh
@@ -104,10 +104,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
#--use-wandb
# --wandb-project miles-dev
# --wandb-group qwen3-30B-A3B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-multi-agent
--wandb-group qwen3-30B-A3B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
25 changes: 17 additions & 8 deletions examples/on_policy_distillation/run-qwen3-8B-opd.sh
@@ -2,6 +2,15 @@

# usage: bash examples/on_policy_distillation/run-qwen3-8B-opd.sh

pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python

set -ex


@@ -48,10 +57,10 @@ source "/root/miles/scripts/models/qwen3-8B.sh"


CKPT_ARGS=(
--hf-checkpoint /root/Qwen3-8B
--ref-load /root/Qwen3-8B_torch_dist
--load /root/Qwen3-8B_miles/
--save /root/Qwen3-8B_miles/
--hf-checkpoint /root/models/Qwen3-8B
--ref-load /root/models/Qwen3-8B_torch_dist
--load /root/models/Qwen3-8B_miles/
--save /root/models/Qwen3-8B_miles/
--save-interval 20
)

@@ -119,10 +128,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
#--use-wandb
# --wandb-project miles-dev
# --wandb-group qwen3-8B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-on-policy-distillation
--wandb-group qwen3-8B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
8 changes: 3 additions & 5 deletions examples/reproducibility/run-qwen2.5-0.5B-gsm8k.sh
@@ -81,9 +81,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-host https://wandb.ai/
--wandb-team glm-zero
--wandb-project miles-dev
--wandb-project miles-reproducibility
--wandb-group qwen2.5-0.5B-gsm8k-deterministic
)

@@ -109,7 +107,7 @@ MISC_ARGS=(
)

# launch the master node of ray in container
ray start --head --node-ip-address 127.0.0.1 --num-gpus 8 --disable-usage-stats
ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats

ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
@@ -123,7 +121,7 @@ ray job submit --address="http://127.0.0.1:8265" \
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--actor-num-gpus-per-node 2 \
--colocate \
--calculate-per-token-loss \
--use-miles-router \
2 changes: 1 addition & 1 deletion examples/retool/retool_qwen3_4b_rl.sh
@@ -98,7 +98,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dapo
--wandb-project miles-retool
--wandb-group qwen3-4B-test-multi-turn
--wandb-key ${WANDB_KEY}
)
2 changes: 1 addition & 1 deletion examples/retool/retool_qwen3_4b_sft.sh
@@ -80,7 +80,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev
--wandb-project miles-retool
--wandb-group qwen3-4B-base-sft
--wandb-key ${WANDB_KEY}
)
8 changes: 4 additions & 4 deletions examples/search-r1/run_qwen2.5_3B.sh
@@ -90,10 +90,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
# --use-wandb
# --wandb-project miles-dev
# --wandb-group search-r1_qwen2.5-3B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-search-r1
--wandb-group search-r1_qwen2.5-3B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
2 changes: 1 addition & 1 deletion examples/strands-agents/strands_qwen3_4b.sh
@@ -99,7 +99,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project strands-miles
--wandb-project miles-strands-agents
--wandb-group Qwen3-4B-Instruct-2507-strands-dapo
--wandb-key ${WANDB_KEY}
)
12 changes: 6 additions & 6 deletions examples/train_infer_mismatch_helper/run-qwen3-4b-fsdp-mis.sh
@@ -50,8 +50,8 @@ ROLLOUT_ARGS=(
--num-rollout 100
--rollout-batch-size 8
--n-samples-per-prompt 8
--rollout-max-response-len 4096
--rollout-temperature 0.8
--rollout-max-response-len 8192
--rollout-temperature 1.0
--global-batch-size 64
)

@@ -78,7 +78,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev-mcore-fsdp
--wandb-project miles-train-infer-mismatch
--wandb-group qwen3-4B-fsdp-1130-ref
--wandb-key ${WANDB_API_KEY}
)
@@ -106,7 +106,7 @@ PERF_ARGS=(

MISC_ARGS=(
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--actor-num-gpus-per-node 4
--colocate
--use-fault-tolerance
--dump-details /root/shared_data/qwen3-4B-fsdp-1116-noref/dump_details
@@ -118,9 +118,9 @@ CUSTOM_ARGS=(
--custom-tis-function-path examples.train_infer_mismatch_helper.mis.compute_mis_weights_fsdp
)

# launch the master node of ray in container - 8 GPUs for training
# launch the master node of ray in container - 4 GPUs for training
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 4 --disable-usage-stats


RUNTIME_ENV_JSON="{
2 changes: 1 addition & 1 deletion examples/train_infer_mismatch_helper/run-qwen3-4b-mis.sh
@@ -99,7 +99,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-mis
--wandb-project miles-train-infer-mismatch
--wandb-group qwen3-4B-mis
--wandb-key ${WANDB_KEY}
)