From 1db2b7ba6b453b8afb575228370e0686109a4e60 Mon Sep 17 00:00:00 2001 From: Zijie Xia Date: Fri, 9 Jan 2026 20:45:42 -0800 Subject: [PATCH 1/3] Update example scripts and README for improved clarity and organization --- examples/README.md | 39 ++++++++++--------- examples/eval/scripts/multi_tasks.yaml | 4 +- examples/eval_multi_task/multi_task.sh | 2 +- .../fully_async/run-qwen3-4b-fully_async.sh | 10 ++++- examples/geo3k_vlm/run_geo3k_vlm.sh | 2 +- .../run_geo3k_vlm_multi_turn.py | 4 +- examples/low_precision/run-qwen3-4b-fp8.sh | 10 ++--- .../run-qwen3-30B-A3B-multi-agent.sh | 8 ++-- .../run-qwen3-8B-opd.sh | 25 ++++++++---- .../reproducibility/run-qwen2.5-0.5B-gsm8k.sh | 8 ++-- examples/retool/retool_qwen3_4b_rl.sh | 2 +- examples/retool/retool_qwen3_4b_sft.sh | 2 +- examples/search-r1/run_qwen2.5_3B.sh | 8 ++-- examples/strands-agents/strands_qwen3_4b.sh | 2 +- .../run-qwen3-4b-fsdp-mis.sh | 12 +++--- .../run-qwen3-4b-mis.sh | 2 +- 16 files changed, 79 insertions(+), 61 deletions(-) diff --git a/examples/README.md b/examples/README.md index c38642fbd..346447248 100644 --- a/examples/README.md +++ b/examples/README.md @@ -4,21 +4,24 @@ These examples provide concrete examples to leverage Miles in your own RL workfl ## Directory Structure -- **[DrGRPO](./DrGRPO)**: Custom reducer for Dr.GRPO algorithm. -- **[eval](./eval)**: Documentation and setup for evaluation environments using NeMo-Skills. -- **[eval_multi_task](./eval_multi_task)**: Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench. -- **[formal_math](./formal_math)**: Examples related to formal math reasoning tasks, including a single round demo. -- **[fully_async](./fully_async)**: Demonstrates fully asynchronous rollout generation for higher efficiency. -- **[geo3k_vlm](./geo3k_vlm)**: Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset. -- **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)**: VLM multi-turn training (FSDP backend) on Geo3k dataset. 
-- **[low_precision](./low_precision)**: Examples of FP8 training and inference for improved throughput and stability. -- **[multi_agent](./multi_agent)**: Example of running multi-agent RL with `miles`. -- **[on_policy_distillation](./on_policy_distillation)**: Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training. -- **[reproducibility](./reproducibility)**: Guides on achieving bitwise experiment reproduction using deterministic modes. -- **[retool](./retool)**: Demonstrates the retool functionality for tool-enabled language model generation. -- **[search-r1](./search-r1)**: A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling. -- **[strands-agents](./strands-agents)**: Integration example with the Strands-Agents scaffolding framework. -- **[tau-bench](./tau-bench)**: Training in an agentic multi-turn tool use environment (Tau-bench). -- **[train_infer_mismatch_helper](./train_infer_mismatch_helper)**: Algorithmic methods for rollout correction (e.g., TIS, MIS). -- **[true_on_policy](./true_on_policy)**: Ensures strictly equal log probabilities between inference (SGLang) and training engines. -- **[true_on_policy_vlm](./true_on_policy_vlm)**: "True On-Policy" training demonstration for VLM (Qwen3-VL). +| Example | Description | W&B | +| :--- | :--- | :--- | +| **[DrGRPO](./DrGRPO)** | Custom reducer for Dr.GRPO algorithm. | | +| **[eval](./eval)** | Documentation and setup for evaluation environments using NeMo-Skills. | [link](https://wandb.ai/zijie_xia-n-a/miles-eval) | +| **[eval_multi_task](./eval_multi_task)** | Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench. | [link](https://wandb.ai/zijie_xia-n-a/miles-eval-multi-task) | +| **[formal_math](./formal_math)** | Examples related to formal math reasoning tasks, including a single round demo. 
| [link](https://wandb.ai/zijie_xia-n-a/miles-formal-math-run-minimal) | +| **[fully_async](./fully_async)** | Demonstrates fully asynchronous rollout generation for higher efficiency. | [link](https://wandb.ai/zijie_xia-n-a/miles-fully-async) | +| **[geo3k_vlm](./geo3k_vlm)** | Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm) | +| **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)** | VLM multi-turn training (FSDP backend) on Geo3k dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm-multi-turn) | +| **[low_precision](./low_precision)** | Examples of FP8 training and inference for improved throughput and stability. | [link](https://wandb.ai/zijie_xia-n-a/miles-low-precision) | +| **[multi_agent](./multi_agent)** | Example of running multi-agent RL with `miles`. | [link](https://wandb.ai/zijie_xia-n-a/miles-multi-agent) | +| **[on_policy_distillation](./on_policy_distillation)** | Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training. | [link](https://wandb.ai/zijie_xia-n-a/miles-on-policy-distillation) | +| **[reproducibility](./reproducibility)** | Guides on achieving bitwise experiment reproduction using deterministic modes. | [link](https://wandb.ai/zijie_xia-n-a/miles-reproducibility) | +| **[retool](./retool)** | Demonstrates the retool functionality for tool-enabled language model generation. | [link](https://wandb.ai/zijie_xia-n-a/miles-retool) | +| **[search-r1](./search-r1)** | A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling. | [link](https://wandb.ai/zijie_xia-n-a/miles-search-r1) | +| **[strands-agents](./strands-agents)** | Integration example with the Strands-Agents scaffolding framework. 
| [link](https://wandb.ai/zijie_xia-n-a/miles-strands-agents) | +| **[swe-agent](./swe-agent)** | Example of SWE-agent training using Nvidia's Nemo-Gym and SWE-Gym. | [link](https://wandb.ai/zijie_xia-n-a/miles-swe-agent) | +| **[tau-bench](./tau-bench)** | Training in an agentic multi-turn tool use environment (Tau-bench). | | +| **[train_infer_mismatch_helper](./train_infer_mismatch_helper)** | Algorithmic methods for rollout correction (e.g., TIS, MIS). | [link](https://wandb.ai/zijie_xia-n-a/miles-train-infer-mismatch-helper) | +| **[true_on_policy](./true_on_policy)** | Ensures strictly equal log probabilities between inference (SGLang) and training engines. | [link](https://wandb.ai/zijie_xia-n-a/miles-true-on-policy) | +| **[true_on_policy_vlm](./true_on_policy_vlm)** | "True On-Policy" training demonstration for VLM (Qwen3-VL). | | diff --git a/examples/eval/scripts/multi_tasks.yaml b/examples/eval/scripts/multi_tasks.yaml index 1a22b8c9a..0718c794f 100644 --- a/examples/eval/scripts/multi_tasks.yaml +++ b/examples/eval/scripts/multi_tasks.yaml @@ -6,11 +6,11 @@ eval: top_k: -1 max_response_len: 24576 datasets: # these eval tasks go through miles dataset config and default rollout function (miles.rollout.sglang_rollout.generate_rollout) - - name: gpqa # huggingface-cli download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa + - name: gpqa # hf download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa path: /root/gpqa/gpqa_eval.jsonl rm_type: gpqa n_samples_per_eval_prompt: 2 - - name: ifbench # huggingface-cli download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench + - name: ifbench # hf download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench path: /root/ifbench/IFBench_eval.jsonl rm_type: ifbench n_samples_per_eval_prompt: 1 diff --git a/examples/eval_multi_task/multi_task.sh b/examples/eval_multi_task/multi_task.sh index 8236e6d2f..c781aebd1 100644 --- 
a/examples/eval_multi_task/multi_task.sh +++ b/examples/eval_multi_task/multi_task.sh @@ -97,7 +97,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-project eval + --wandb-project miles-eval-multi-task --wandb-group multi_task --wandb-key ${WANDB_KEY} ) diff --git a/examples/fully_async/run-qwen3-4b-fully_async.sh b/examples/fully_async/run-qwen3-4b-fully_async.sh index 026e48608..96774c884 100644 --- a/examples/fully_async/run-qwen3-4b-fully_async.sh +++ b/examples/fully_async/run-qwen3-4b-fully_async.sh @@ -35,7 +35,7 @@ CKPT_ARGS=( --save-interval 20 ) -PROMPT_SET=/path/to/dapo-math-17k.jsonl +PROMPT_SET=/root/dapo-math-17k/dapo-math-17k.jsonl ROLLOUT_ARGS=( --rollout-function-path fully_async_rollout.generate_rollout_fully_async @@ -96,6 +96,13 @@ OPTIMIZER_ARGS=( --adam-beta2 0.98 ) +WANDB_ARGS=( + --use-wandb + --wandb-project miles-fully-async + --wandb-group qwen3-4b-fully_async + --wandb-key ${WANDB_KEY} +) + SGLANG_ARGS=( --rollout-num-gpus-per-engine 1 ) @@ -134,6 +141,7 @@ ray job submit --address="http://127.0.0.1:8265" \ ${ROLLOUT_ARGS[@]} \ ${OPTIMIZER_ARGS[@]} \ ${GRPO_ARGS[@]} \ + ${WANDB_ARGS[@]} \ ${PERF_ARGS[@]} \ ${SGLANG_ARGS[@]} \ ${MISC_ARGS[@]} diff --git a/examples/geo3k_vlm/run_geo3k_vlm.sh b/examples/geo3k_vlm/run_geo3k_vlm.sh index 051efc285..2bd2503d3 100644 --- a/examples/geo3k_vlm/run_geo3k_vlm.sh +++ b/examples/geo3k_vlm/run_geo3k_vlm.sh @@ -129,7 +129,7 @@ OPTIMIZER_ARGS=( SGLANG_ARGS=( --rollout-num-gpus-per-engine 1 --sglang-mem-fraction-static 0.6 - --sglang-cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248 256 + --sglang-disable-cuda-graph ) # Wandb args (only if WANDB_API_KEY is set) diff --git a/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py b/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py index 5c32d33d3..64a89758b 100644 --- a/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py +++ 
b/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py @@ -45,7 +45,7 @@ def execute(): wandb_args = ( ( "--use-wandb " - "--wandb-project miles-dev " + "--wandb-project miles-geo3k-vlm-multi-turn " "--wandb-group geo3k_vlm_multi_turn " f"--wandb-key '{wandb_api_key}' " ) @@ -101,7 +101,7 @@ def execute(): sglang_args = ( "--rollout-num-gpus-per-engine 1 " "--sglang-mem-fraction-static 0.6 " - f"--sglang-cuda-graph-bs {' '.join(map(str, [1, 2, 4, 8] + list(range(16, 257, 8))))} " + f"--sglang-disable-cuda-graph " ) fsdp_args = ( diff --git a/examples/low_precision/run-qwen3-4b-fp8.sh b/examples/low_precision/run-qwen3-4b-fp8.sh index b196ba606..1a5a05074 100644 --- a/examples/low_precision/run-qwen3-4b-fp8.sh +++ b/examples/low_precision/run-qwen3-4b-fp8.sh @@ -98,10 +98,10 @@ OPTIMIZER_ARGS=( ) WANDB_ARGS=( - # --use-wandb - # --wandb-project miles-dev - # --wandb-group qwen3-4B-test - # --wandb-key ${WANDB_KEY} + --use-wandb + --wandb-project miles-low-precision + --wandb-group qwen3-4B-test + --wandb-key ${WANDB_KEY} ) SGLANG_ARGS=( @@ -125,7 +125,7 @@ PRECISE_ARGS=( --bf16 --fp8-format e4m3 --fp8-recipe blockwise - --fp8-param-gather + # --fp8-param-gather ) diff --git a/examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh b/examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh index f3e5f1466..c2be7e2ed 100644 --- a/examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh +++ b/examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh @@ -104,10 +104,10 @@ OPTIMIZER_ARGS=( ) WANDB_ARGS=( - #--use-wandb - # --wandb-project miles-dev - # --wandb-group qwen3-30B-A3B-test - # --wandb-key ${WANDB_KEY} + --use-wandb + --wandb-project miles-multi-agent + --wandb-group qwen3-30B-A3B-test + --wandb-key ${WANDB_KEY} ) SGLANG_ARGS=( diff --git a/examples/on_policy_distillation/run-qwen3-8B-opd.sh b/examples/on_policy_distillation/run-qwen3-8B-opd.sh index f45c2634b..9850d6cde 100644 --- a/examples/on_policy_distillation/run-qwen3-8B-opd.sh +++ 
b/examples/on_policy_distillation/run-qwen3-8B-opd.sh @@ -2,6 +2,15 @@ # usage: bash examples/on_policy_distillation/run-qwen3-8B-opd.sh +pkill -9 sglang +sleep 3 +ray stop --force +pkill -9 ray +pkill -9 python +sleep 3 +pkill -9 ray +pkill -9 python + set -ex @@ -48,10 +57,10 @@ source "/root/miles/scripts/models/qwen3-8B.sh" CKPT_ARGS=( - --hf-checkpoint /root/Qwen3-8B - --ref-load /root/Qwen3-8B_torch_dist - --load /root/Qwen3-8B_miles/ - --save /root/Qwen3-8B_miles/ + --hf-checkpoint /root/models/Qwen3-8B + --ref-load /root/models/Qwen3-8B_torch_dist + --load /root/models/Qwen3-8B_miles/ + --save /root/models/Qwen3-8B_miles/ --save-interval 20 ) @@ -119,10 +128,10 @@ OPTIMIZER_ARGS=( ) WANDB_ARGS=( - #--use-wandb - # --wandb-project miles-dev - # --wandb-group qwen3-8B-test - # --wandb-key ${WANDB_KEY} + --use-wandb + --wandb-project miles-on-policy-distillation + --wandb-group qwen3-8B-test + --wandb-key ${WANDB_KEY} ) SGLANG_ARGS=( diff --git a/examples/reproducibility/run-qwen2.5-0.5B-gsm8k.sh b/examples/reproducibility/run-qwen2.5-0.5B-gsm8k.sh index 9fab2cd2f..48f166a0d 100644 --- a/examples/reproducibility/run-qwen2.5-0.5B-gsm8k.sh +++ b/examples/reproducibility/run-qwen2.5-0.5B-gsm8k.sh @@ -81,9 +81,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-host https://wandb.ai/ - --wandb-team glm-zero - --wandb-project miles-dev + --wandb-project miles-reproducibility --wandb-group qwen2.5-0.5B-gsm8k-deterministic ) @@ -109,7 +107,7 @@ MISC_ARGS=( ) # launch the master node of ray in container -ray start --head --node-ip-address 127.0.0.1 --num-gpus 8 --disable-usage-stats +ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats ray job submit --address="http://127.0.0.1:8265" \ --runtime-env-json='{ @@ -123,7 +121,7 @@ ray job submit --address="http://127.0.0.1:8265" \ }' \ -- python3 train.py \ --actor-num-nodes 1 \ - --actor-num-gpus-per-node 8 \ + --actor-num-gpus-per-node 2 \ --colocate \ --calculate-per-token-loss \ 
--use-miles-router \ diff --git a/examples/retool/retool_qwen3_4b_rl.sh b/examples/retool/retool_qwen3_4b_rl.sh index 838ce0e2c..1c774d2e5 100644 --- a/examples/retool/retool_qwen3_4b_rl.sh +++ b/examples/retool/retool_qwen3_4b_rl.sh @@ -98,7 +98,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-project miles-dapo + --wandb-project miles-retool --wandb-group qwen3-4B-test-multi-turn --wandb-key ${WANDB_KEY} ) diff --git a/examples/retool/retool_qwen3_4b_sft.sh b/examples/retool/retool_qwen3_4b_sft.sh index 871f4bccf..73d5f58cd 100644 --- a/examples/retool/retool_qwen3_4b_sft.sh +++ b/examples/retool/retool_qwen3_4b_sft.sh @@ -80,7 +80,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-project miles-dev + --wandb-project miles-retool --wandb-group qwen3-4B-base-sft --wandb-key ${WANDB_KEY} ) diff --git a/examples/search-r1/run_qwen2.5_3B.sh b/examples/search-r1/run_qwen2.5_3B.sh index d118fc42b..654539efa 100644 --- a/examples/search-r1/run_qwen2.5_3B.sh +++ b/examples/search-r1/run_qwen2.5_3B.sh @@ -90,10 +90,10 @@ OPTIMIZER_ARGS=( ) WANDB_ARGS=( - # --use-wandb - # --wandb-project miles-dev - # --wandb-group search-r1_qwen2.5-3B-test - # --wandb-key ${WANDB_KEY} + --use-wandb + --wandb-project miles-search-r1 + --wandb-group search-r1_qwen2.5-3B-test + --wandb-key ${WANDB_KEY} ) SGLANG_ARGS=( diff --git a/examples/strands-agents/strands_qwen3_4b.sh b/examples/strands-agents/strands_qwen3_4b.sh index 647c8e2f5..2406fcc58 100644 --- a/examples/strands-agents/strands_qwen3_4b.sh +++ b/examples/strands-agents/strands_qwen3_4b.sh @@ -99,7 +99,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-project strands-miles + --wandb-project miles-strands-agents --wandb-group Qwen3-4B-Instruct-2507-strands-dapo --wandb-key ${WANDB_KEY} ) diff --git a/examples/train_infer_mismatch_helper/run-qwen3-4b-fsdp-mis.sh b/examples/train_infer_mismatch_helper/run-qwen3-4b-fsdp-mis.sh index df3848038..c527edc39 100644 --- 
a/examples/train_infer_mismatch_helper/run-qwen3-4b-fsdp-mis.sh +++ b/examples/train_infer_mismatch_helper/run-qwen3-4b-fsdp-mis.sh @@ -50,8 +50,8 @@ ROLLOUT_ARGS=( --num-rollout 100 --rollout-batch-size 8 --n-samples-per-prompt 8 - --rollout-max-response-len 4096 - --rollout-temperature 0.8 + --rollout-max-response-len 8192 + --rollout-temperature 1.0 --global-batch-size 64 ) @@ -78,7 +78,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-project miles-dev-mcore-fsdp + --wandb-project miles-train-infer-mismatch --wandb-group qwen3-4B-fsdp-1130-ref --wandb-key ${WANDB_API_KEY} ) @@ -106,7 +106,7 @@ PERF_ARGS=( MISC_ARGS=( --actor-num-nodes 1 - --actor-num-gpus-per-node 8 + --actor-num-gpus-per-node 4 --colocate --use-fault-tolerance --dump-details /root/shared_data/qwen3-4B-fsdp-1116-noref/dump_details @@ -118,9 +118,9 @@ CUSTOM_ARGS=( --custom-tis-function-path examples.train_infer_mismatch_helper.mis.compute_mis_weights_fsdp ) -# launch the master node of ray in container - 8 GPUs for training +# launch the master node of ray in container - 4 GPUs for training export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"} -ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats +ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 4 --disable-usage-stats RUNTIME_ENV_JSON="{ diff --git a/examples/train_infer_mismatch_helper/run-qwen3-4b-mis.sh b/examples/train_infer_mismatch_helper/run-qwen3-4b-mis.sh index 300e8ac75..e9c7b5215 100644 --- a/examples/train_infer_mismatch_helper/run-qwen3-4b-mis.sh +++ b/examples/train_infer_mismatch_helper/run-qwen3-4b-mis.sh @@ -99,7 +99,7 @@ OPTIMIZER_ARGS=( WANDB_ARGS=( --use-wandb - --wandb-project miles-mis + --wandb-project miles-train-infer-mismatch --wandb-group qwen3-4B-mis --wandb-key ${WANDB_KEY} ) From 1181bcaca1e830c7654660c19aaca35116d75042 Mon Sep 17 00:00:00 2001 From: Zijie Xia Date: Fri, 9 Jan 2026 20:52:27 -0800 Subject: [PATCH 2/3] fix --- 
 examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py b/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py
index 64a89758b..3f24f09ea 100644
--- a/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py
+++ b/examples/geo3k_vlm_multi_turn/run_geo3k_vlm_multi_turn.py
@@ -98,11 +98,7 @@ def execute():
         "--adam-beta2 0.98 "
     )
 
-    sglang_args = (
-        "--rollout-num-gpus-per-engine 1 "
-        "--sglang-mem-fraction-static 0.6 "
-        f"--sglang-disable-cuda-graph "
-    )
+    sglang_args = "--rollout-num-gpus-per-engine 1 " "--sglang-mem-fraction-static 0.6 " "--sglang-disable-cuda-graph "
 
     fsdp_args = (
         "--train-backend fsdp "

From e746448c6fb55b4ec747f0c5e6ec304338753f18 Mon Sep 17 00:00:00 2001
From: Zijie Xia
Date: Mon, 12 Jan 2026 16:34:26 -0800
Subject: [PATCH 3/3] Fix README example table

---
 examples/README.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/examples/README.md b/examples/README.md
index 346447248..e12bfa7df 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -14,14 +14,12 @@ These examples provide concrete examples to leverage Miles in your own RL workfl
 | **[geo3k_vlm](./geo3k_vlm)** | Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm) |
 | **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)** | VLM multi-turn training (FSDP backend) on Geo3k dataset. | [link](https://wandb.ai/zijie_xia-n-a/miles-geo3k-vlm-multi-turn) |
 | **[low_precision](./low_precision)** | Examples of FP8 training and inference for improved throughput and stability. | [link](https://wandb.ai/zijie_xia-n-a/miles-low-precision) |
-| **[multi_agent](./multi_agent)** | Example of running multi-agent RL with `miles`. | [link](https://wandb.ai/zijie_xia-n-a/miles-multi-agent) |
+| **[multi_agent](./multi_agent)** | Example of running multi-agent RL with miles. | [link](https://wandb.ai/zijie_xia-n-a/miles-multi-agent) |
 | **[on_policy_distillation](./on_policy_distillation)** | Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training. | [link](https://wandb.ai/zijie_xia-n-a/miles-on-policy-distillation) |
 | **[reproducibility](./reproducibility)** | Guides on achieving bitwise experiment reproduction using deterministic modes. | [link](https://wandb.ai/zijie_xia-n-a/miles-reproducibility) |
 | **[retool](./retool)** | Demonstrates the retool functionality for tool-enabled language model generation. | [link](https://wandb.ai/zijie_xia-n-a/miles-retool) |
 | **[search-r1](./search-r1)** | A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling. | [link](https://wandb.ai/zijie_xia-n-a/miles-search-r1) |
 | **[strands-agents](./strands-agents)** | Integration example with the Strands-Agents scaffolding framework. | [link](https://wandb.ai/zijie_xia-n-a/miles-strands-agents) |
-| **[swe-agent](./swe-agent)** | Example of SWE-agent training using Nvidia's Nemo-Gym and SWE-Gym. | [link](https://wandb.ai/zijie_xia-n-a/miles-swe-agent) |
-| **[tau-bench](./tau-bench)** | Training in an agentic multi-turn tool use environment (Tau-bench). | |
-| **[train_infer_mismatch_helper](./train_infer_mismatch_helper)** | Algorithmic methods for rollout correction (e.g., TIS, MIS). | [link](https://wandb.ai/zijie_xia-n-a/miles-train-infer-mismatch-helper) |
+| **[train_infer_mismatch_helper](./train_infer_mismatch_helper)** | Algorithmic methods for rollout correction (e.g., TIS, MIS). | [link](https://wandb.ai/zijie_xia-n-a/miles-train-infer-mismatch) |
 | **[true_on_policy](./true_on_policy)** | Ensures strictly equal log probabilities between inference (SGLang) and training engines. | [link](https://wandb.ai/zijie_xia-n-a/miles-true-on-policy) |
 | **[true_on_policy_vlm](./true_on_policy_vlm)** | "True On-Policy" training demonstration for VLM (Qwen3-VL). | |