37 changes: 19 additions & 18 deletions examples/README.md
@@ -4,21 +4,22 @@ These examples provide concrete examples to leverage Miles in your own RL workfl

## Directory Structure

- **[DrGRPO](./DrGRPO)**: Custom reducer for Dr.GRPO algorithm.
- **[eval](./eval)**: Documentation and setup for evaluation environments using NeMo-Skills.
- **[eval_multi_task](./eval_multi_task)**: Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench.
- **[formal_math](./formal_math)**: Examples related to formal math reasoning tasks, including a single round demo.
- **[fully_async](./fully_async)**: Demonstrates fully asynchronous rollout generation for higher efficiency.
- **[geo3k_vlm](./geo3k_vlm)**: Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset.
- **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)**: VLM multi-turn training (FSDP backend) on Geo3k dataset.
- **[low_precision](./low_precision)**: Examples of FP8 training and inference for improved throughput and stability.
- **[multi_agent](./multi_agent)**: Example of running multi-agent RL with `miles`.
- **[on_policy_distillation](./on_policy_distillation)**: Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training.
- **[reproducibility](./reproducibility)**: Guides on achieving bitwise experiment reproduction using deterministic modes.
- **[retool](./retool)**: Demonstrates the retool functionality for tool-enabled language model generation.
- **[search-r1](./search-r1)**: A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling.
- **[strands-agents](./strands-agents)**: Integration example with the Strands-Agents scaffolding framework.
- **[tau-bench](./tau-bench)**: Training in an agentic multi-turn tool use environment (Tau-bench).
- **[train_infer_mismatch_helper](./train_infer_mismatch_helper)**: Algorithmic methods for rollout correction (e.g., TIS, MIS).
- **[true_on_policy](./true_on_policy)**: Ensures strictly equal log probabilities between inference (SGLang) and training engines.
- **[true_on_policy_vlm](./true_on_policy_vlm)**: "True On-Policy" training demonstration for VLM (Qwen3-VL).
| Example | Description | W&B |
| :--- | :--- | :--- |
| **[DrGRPO](./DrGRPO)** | Custom reducer for the Dr.GRPO algorithm. | |
| **[eval](./eval)** | Documentation and setup for evaluation environments using NeMo-Skills. | [link](https://wandb.ai/zijie_xia-n-a/miles-eval) |
| **[eval_multi_task](./eval_multi_task)** | Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench. | [link](https://wandb.ai/zijie_xia-n-a/miles-eval-multi-task) |
| **[formal_math](./formal_math)** | Examples related to formal math reasoning tasks, including a single-round demo. | [link](https://wandb.ai/zijie_xia-n-a/miles-formal-math-run-minimal) |
| **[fully_async](./fully_async)** | Demonstrates fully asynchronous rollout generation for higher efficiency. | [link](https://wandb.ai/zijie_xia-n-a/miles-fully-async) |
| **[geo3k_vlm](./geo3k_vlm)** | Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm) |
| **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)** | VLM multi-turn training (FSDP backend) on the GEO3K dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm-multi-turn) |
| **[low_precision](./low_precision)** | Examples of FP8 training and inference for improved throughput and stability. | [link](https://wandb.ai/zijie_xia-n-a/miles-low-precision) |
| **[multi_agent](./multi_agent)** | Example of running multi-agent RL with `miles`. | [link](https://wandb.ai/zijie_xia-n-a/miles-multi-agent) |
| **[on_policy_distillation](./on_policy_distillation)** | Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training. | [link](https://wandb.ai/zijie_xia-n-a/miles-on-policy-distillation) |
| **[reproducibility](./reproducibility)** | Guides on achieving bitwise experiment reproduction using deterministic modes. | [link](https://wandb.ai/zijie_xia-n-a/miles-reproducibility) |
| **[retool](./retool)** | Demonstrates the retool functionality for tool-enabled language model generation. | [link](https://wandb.ai/zijie_xia-n-a/miles-retool) |
| **[search-r1](./search-r1)** | A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling. | [link](https://wandb.ai/zijie_xia-n-a/miles-search-r1) |
| **[strands-agents](./strands-agents)** | Integration example with the Strands-Agents scaffolding framework. | [link](https://wandb.ai/zijie_xia-n-a/miles-strands-agents) |
| **[train_infer_mismatch_helper](./train_infer_mismatch_helper)** | Algorithmic methods for rollout correction (e.g., TIS, MIS). | [link](https://wandb.ai/zijie_xia-n-a/miles-train-infer-mismatch) |
| **[true_on_policy](./true_on_policy)** | Ensures strictly equal log probabilities between inference (SGLang) and training engines. | [link](https://wandb.ai/zijie_xia-n-a/miles-true-on-policy) |
| **[true_on_policy_vlm](./true_on_policy_vlm)** | "True On-Policy" training demonstration for VLM (Qwen3-VL). | |
4 changes: 2 additions & 2 deletions examples/eval/scripts/multi_tasks.yaml
@@ -6,11 +6,11 @@ eval:
top_k: -1
max_response_len: 24576
datasets: # these eval tasks go through miles dataset config and default rollout function (miles.rollout.sglang_rollout.generate_rollout)
- name: gpqa # huggingface-cli download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa
- name: gpqa # hf download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa
path: /root/gpqa/gpqa_eval.jsonl
rm_type: gpqa
n_samples_per_eval_prompt: 2
- name: ifbench # huggingface-cli download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench
- name: ifbench # hf download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench
path: /root/ifbench/IFBench_eval.jsonl
rm_type: ifbench
n_samples_per_eval_prompt: 1
2 changes: 1 addition & 1 deletion examples/eval_multi_task/multi_task.sh
@@ -97,7 +97,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project eval
--wandb-project miles-eval-multi-task
--wandb-group multi_task
--wandb-key ${WANDB_KEY}
)
10 changes: 9 additions & 1 deletion examples/fully_async/run-qwen3-4b-fully_async.sh
@@ -35,7 +35,7 @@ CKPT_ARGS=(
--save-interval 20
)

PROMPT_SET=/path/to/dapo-math-17k.jsonl
PROMPT_SET=/root/dapo-math-17k/dapo-math-17k.jsonl

ROLLOUT_ARGS=(
--rollout-function-path fully_async_rollout.generate_rollout_fully_async
@@ -96,6 +96,13 @@ OPTIMIZER_ARGS=(
--adam-beta2 0.98
)

WANDB_ARGS=(
--use-wandb
--wandb-project miles-fully-async
--wandb-group qwen3-4b-fully_async
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
)
@@ -134,6 +141,7 @@ ray job submit --address="http://127.0.0.1:8265" \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
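The `WANDB_ARGS` block added in this diff assumes `WANDB_KEY` is already exported and always passes `--use-wandb`. A defensive variant (a sketch, not part of this PR) would enable the flags only when the key is actually set, so the launcher still runs for users without a W&B account:

```shell
# Sketch (assumption, not in this diff): populate WANDB_ARGS only when
# WANDB_KEY is present in the environment; otherwise leave it empty so
# ${WANDB_ARGS[@]} expands to nothing in the ray job submit command.
WANDB_ARGS=()
if [ -n "${WANDB_KEY:-}" ]; then
  WANDB_ARGS=(
    --use-wandb
    --wandb-project miles-fully-async
    --wandb-group qwen3-4b-fully_async
    --wandb-key "${WANDB_KEY}"
  )
fi
echo "wandb flag count: ${#WANDB_ARGS[@]}"
```

Because an empty bash array expands to zero words, no other line of the script needs to change for this guard to work.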
2 changes: 1 addition & 1 deletion examples/geo3k_vlm/run_geo3k_vlm.sh
@@ -129,7 +129,7 @@ OPTIMIZER_ARGS=(
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
--sglang-mem-fraction-static 0.6
--sglang-cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248 256
--sglang-disable-cuda-graph
)

# Wandb args (only if WANDB_API_KEY is set)
8 changes: 2 additions & 6 deletions examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py
@@ -45,7 +45,7 @@ def execute():
wandb_args = (
(
"--use-wandb "
"--wandb-project miles-dev "
"--wandb-project miles-geo3k-vlm-multi-turn "
"--wandb-group geo3k_vlm_multi_turn "
f"--wandb-key '{wandb_api_key}' "
)
@@ -98,11 +98,7 @@ def execute():
"--adam-beta2 0.98 "
)

sglang_args = (
"--rollout-num-gpus-per-engine 1 "
"--sglang-mem-fraction-static 0.6 "
f"--sglang-cuda-graph-bs {' '.join(map(str, [1, 2, 4, 8] + list(range(16, 257, 8))))} "
)
sglang_args = "--rollout-num-gpus-per-engine 1 " "--sglang-mem-fraction-static 0.6 " "--sglang-disable-cuda-graph "

fsdp_args = (
"--train-backend fsdp "
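For reference, the literal `--sglang-cuda-graph-bs` list this PR removes from the shell launcher and the Python expression `[1, 2, 4, 8] + list(range(16, 257, 8))` it removes from the Python launcher describe the same 35 capture batch sizes. A quick sketch of rebuilding that list in shell, to check the equivalence:

```shell
# Rebuild the removed capture-batch-size list programmatically:
# 1 2 4 8, then every multiple of 8 from 16 through 256.
sizes="1 2 4 8"
for bs in $(seq 16 8 256); do
  sizes="$sizes $bs"
done
echo "--sglang-cuda-graph-bs $sizes"
```

With CUDA graphs now disabled via `--sglang-disable-cuda-graph`, neither form is needed, which is why both launchers drop the list entirely.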
10 changes: 5 additions & 5 deletions examples/low_precision/run-qwen3-4b-fp8.sh
@@ -98,10 +98,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
# --use-wandb
# --wandb-project miles-dev
# --wandb-group qwen3-4B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-low-precision
--wandb-group qwen3-4B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
@@ -125,7 +125,7 @@ PRECISE_ARGS=(
--bf16
--fp8-format e4m3
--fp8-recipe blockwise
--fp8-param-gather
# --fp8-param-gather
)


8 changes: 4 additions & 4 deletions examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh
@@ -104,10 +104,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
#--use-wandb
# --wandb-project miles-dev
# --wandb-group qwen3-30B-A3B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-multi-agent
--wandb-group qwen3-30B-A3B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
25 changes: 17 additions & 8 deletions examples/on_policy_distillation/run-qwen3-8B-opd.sh
@@ -2,6 +2,15 @@

# usage: bash examples/on_policy_distillation/run-qwen3-8B-opd.sh

pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python

set -ex


@@ -48,10 +57,10 @@ source "/root/miles/scripts/models/qwen3-8B.sh"


CKPT_ARGS=(
--hf-checkpoint /root/Qwen3-8B
--ref-load /root/Qwen3-8B_torch_dist
--load /root/Qwen3-8B_miles/
--save /root/Qwen3-8B_miles/
--hf-checkpoint /root/models/Qwen3-8B
--ref-load /root/models/Qwen3-8B_torch_dist
--load /root/models/Qwen3-8B_miles/
--save /root/models/Qwen3-8B_miles/
--save-interval 20
)

@@ -119,10 +128,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
#--use-wandb
# --wandb-project miles-dev
# --wandb-group qwen3-8B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-on-policy-distillation
--wandb-group qwen3-8B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
8 changes: 3 additions & 5 deletions examples/reproducibility/run-qwen2.5-0.5B-gsm8k.sh
@@ -81,9 +81,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-host https://wandb.ai/
--wandb-team glm-zero
--wandb-project miles-dev
--wandb-project miles-reproducibility
--wandb-group qwen2.5-0.5B-gsm8k-deterministic
)

@@ -109,7 +107,7 @@ MISC_ARGS=(
)

# launch the master node of ray in container
ray start --head --node-ip-address 127.0.0.1 --num-gpus 8 --disable-usage-stats
ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats

ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
@@ -123,7 +121,7 @@ ray job submit --address="http://127.0.0.1:8265" \
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--actor-num-gpus-per-node 2 \
--colocate \
--calculate-per-token-loss \
--use-miles-router \
2 changes: 1 addition & 1 deletion examples/retool/retool_qwen3_4b_rl.sh
@@ -98,7 +98,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dapo
--wandb-project miles-retool
--wandb-group qwen3-4B-test-multi-turn
--wandb-key ${WANDB_KEY}
)
2 changes: 1 addition & 1 deletion examples/retool/retool_qwen3_4b_sft.sh
@@ -80,7 +80,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev
--wandb-project miles-retool
--wandb-group qwen3-4B-base-sft
--wandb-key ${WANDB_KEY}
)
8 changes: 4 additions & 4 deletions examples/search-r1/run_qwen2.5_3B.sh
@@ -90,10 +90,10 @@ OPTIMIZER_ARGS=(
)

WANDB_ARGS=(
# --use-wandb
# --wandb-project miles-dev
# --wandb-group search-r1_qwen2.5-3B-test
# --wandb-key ${WANDB_KEY}
--use-wandb
--wandb-project miles-search-r1
--wandb-group search-r1_qwen2.5-3B-test
--wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
2 changes: 1 addition & 1 deletion examples/strands-agents/strands_qwen3_4b.sh
@@ -99,7 +99,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project strands-miles
--wandb-project miles-strands-agents
--wandb-group Qwen3-4B-Instruct-2507-strands-dapo
--wandb-key ${WANDB_KEY}
)
12 changes: 6 additions & 6 deletions examples/train_infer_mismatch_helper/run-qwen3-4b-fsdp-mis.sh
@@ -50,8 +50,8 @@ ROLLOUT_ARGS=(
--num-rollout 100
--rollout-batch-size 8
--n-samples-per-prompt 8
--rollout-max-response-len 4096
--rollout-temperature 0.8
--rollout-max-response-len 8192
--rollout-temperature 1.0
--global-batch-size 64
)

@@ -78,7 +78,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev-mcore-fsdp
--wandb-project miles-train-infer-mismatch
--wandb-group qwen3-4B-fsdp-1130-ref
--wandb-key ${WANDB_API_KEY}
)
@@ -106,7 +106,7 @@ PERF_ARGS=(

MISC_ARGS=(
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--actor-num-gpus-per-node 4
--colocate
--use-fault-tolerance
--dump-details /root/shared_data/qwen3-4B-fsdp-1116-noref/dump_details
@@ -118,9 +118,9 @@ CUSTOM_ARGS=(
--custom-tis-function-path examples.train_infer_mismatch_helper.mis.compute_mis_weights_fsdp
)

# launch the master node of ray in container - 8 GPUs for training
# launch the master node of ray in container - 4 GPUs for training
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 4 --disable-usage-stats


RUNTIME_ENV_JSON="{
2 changes: 1 addition & 1 deletion examples/train_infer_mismatch_helper/run-qwen3-4b-mis.sh
@@ -99,7 +99,7 @@ OPTIMIZER_ARGS=(

WANDB_ARGS=(
--use-wandb
--wandb-project miles-mis
--wandb-project miles-train-infer-mismatch
--wandb-group qwen3-4B-mis
--wandb-key ${WANDB_KEY}
)