Pr upstream verl merge diffaware #137

Jianshu-She · 2025-09-29T22:12:25Z

What does this PR do?

Update IFbench reward model

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace.

### Checklist Before Starting - [x] Searched for similar PR(s). - [x] Checked PR Title format - In format of: [modules] type: Title - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data` - type is in `feat, fix, refactor, chore, test` - can involve multiple modules, seperated by `,` or space, like `[megatron, fsdp, doc] feat: xxx` ### What does this PR do? Move entropy reward to the entropy recipe, and kl_cov anf clip_cov to README > Add one-line overview of what this PR aims to achieve or accomplish. Reference related github issues and PRs if that help review. ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] New CI unit test(s) are added to cover the code path. - [x] Rely on existing unit tests on CI that covers the code path. --------- Co-authored-by: Jiacheng Chen <jackchan9345@gmail.com> Co-authored-by: H <linhaibin.eric@gmail.com>

### Checklist Before Starting - [x] Searched for similar PR(s). - [x] Checked PR Title format - In format of: [modules] type: Title - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data` - type is in `feat, fix, refactor, chore, test` - can involve multiple modules, seperated by `,` or space, like `[megatron, fsdp, doc] feat: xxx` ### What does this PR do? For any function or class included in `__all__`, there must be docstring associated. ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [ ] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path.

…on func defs (#2113) ### Checklist Before Starting - [x] Searched for similar PR(s). - [x] Checked PR Title format - In format of: [modules] type: Title - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data` - type is in `feat, fix, refactor, chore, test` - can involve multiple modules, seperated by `,` or space, like `[megatron, fsdp, doc] feat: xxx` ### What does this PR do? Per volcengine/verl#2112, type annotation should be encouraged to increase readability. In previous PRs, the type check script does not really take effect (either too strict or too loose). In this PR, the check is limited to only function definitions, with a default threshold. By default on CI it only inspect the files changed in the current PR. For reference, below is a glimpse of failure cases if we force it to inspect all files under `verl`. Upon failure, it prints: ``` f"Please add type annotations for inputs and outputs to meet threshold {args.threshold}. Cases exempt from checking:" "1. Private methods." "2. Args with name in ('self', 'cls'), or *args / **kwargs" "3. Files under tests/" ``` ``` verl/trainer/main_generation.py:44: def main(config): verl/trainer/main_generation.py:48: def run_generation(config) -> None: verl/trainer/main_generation.py:60: def main_task(config): verl/trainer/main_eval.py:33: def process_item(reward_fn, data_source, response_lst, reward_data): verl/trainer/main_eval.py:40: def main(config): verl/trainer/main_ppo.py:26: def main(config): verl/trainer/main_ppo.py:31: def run_ppo(config) -> None: verl/trainer/main_ppo.py:182: def create_rl_dataset(data_paths, data_config, tokenizer, processor): verl/trainer/main_ppo.py:224: def create_rl_sampler(data_config, dataset): verl/trainer/main_ppo.py:57: def run(self, config): verl/trainer/fsdp_sft_trainer.py:71: def extract_step(path): verl/trainer/fsdp_sft_trainer.py:549: def run_sft(config): verl/trainer/fsdp_sft_trainer.py:572: def main(config): verl/trainer/fsdp_sft_trainer.py:576: def create_sft_dataset(data_paths, data_config, tokenizer): verl/trainer/fsdp_sft_trainer.py:384: def training_step(self, batch: TensorDict): verl/trainer/fsdp_sft_trainer.py:433: def validation_step(self, batch: TensorDict): verl/trainer/fsdp_sft_trainer.py:444: def save_checkpoint(self, step): verl/trainer/fsdp_sft_trainer.py:486: def fit(self): verl/trainer/ppo/reward.py:25: def get_custom_reward_fn(config): verl/trainer/ppo/reward.py:60: def load_reward_manager(config, tokenizer, num_examine, **reward_kwargs): verl/trainer/ppo/reward.py:111: def compute_reward(data: DataProto, reward_fn): verl/trainer/ppo/reward.py:133: def compute_reward_async(data: DataProto, config, tokenizer): verl/trainer/ppo/reward.py:54: def wrapped_fn(*args, **kwargs): verl/trainer/ppo/ray_trainer.py:132: def apply_kl_penalty(data: DataProto, kl_ctrl: core_algos.AdaptiveKLController, kl_penalty="kl", multi_turn=Fals verl/trainer/ppo/ray_trainer.py:181: def compute_response_mask(data: DataProto): verl/trainer/ppo/ray_trainer.py:199: def compute_advantage(data: DataProto, adv_estimator, gamma=1.0, lam=1.0, num_repeat=1, multi_turn=False, norm_a verl/trainer/ppo/ray_trainer.py:89: def create_resource_pool(self): verl/trainer/ppo/ray_trainer.py:710: def init_workers(self): verl/trainer/ppo/ray_trainer.py:892: def fit(self): verl/trainer/ppo/ray_trainer.py:381: def check_mutually_exclusive(mbs, mbs_per_gpu, name: str): verl/trainer/ppo/core_algos.py:34: def register_adv_est(name_or_enum): verl/trainer/ppo/core_algos.py:53: def get_adv_estimator_fn(name_or_enum): verl/trainer/ppo/core_algos.py:116: def get_kl_controller(kl_ctrl): verl/trainer/ppo/core_algos.py:127: def compute_gae_advantage_return( verl/trainer/ppo/core_algos.py:174: def compute_grpo_outcome_advantage( verl/trainer/ppo/core_algos.py:231: def compute_grpo_passk_outcome_advantage( verl/trainer/ppo/core_algos.py:291: def compute_reinforce_plus_plus_baseline_outcome_advantage(token_level_rewards: torch.Tensor, response_mask: torch.Tensor, verl/trainer/ppo/core_algos.py:336: def compute_rloo_outcome_advantage(token_level_rewards: torch.Tensor, response_mask: torch.Tensor, index: np.ndarray, verl/trainer/ppo/core_algos.py:379: def compute_opo_outcome_advantage(token_level_rewards: torch.Tensor, response_mask: torch.Tensor, index: np.ndarray, verl/trainer/ppo/core_algos.py:426: def compute_reinforce_plus_plus_outcome_advantage(token_level_rewards: torch.Tensor, response_mask: torch.Tensor, verl/trainer/ppo/core_algos.py:463: def compute_remax_outcome_advantage(token_level_rewards: torch.Tensor, reward_baselines: torch.Tensor, response_mask: verl/trainer/ppo/core_algos.py:492: def compute_rewards(token_level_scores, old_log_prob, ref_log_prob, kl_ratio): verl/trainer/ppo/core_algos.py:497: def agg_loss(loss_mat: torch.Tensor, loss_mask: torch.Tensor, loss_agg_mode: str): verl/trainer/ppo/core_algos.py:533: def compute_policy_loss( verl/trainer/ppo/core_algos.py:599: def compute_entropy_loss(logits, response_mask, loss_agg_mode: str = "token-mean"): verl/trainer/ppo/core_algos.py:616: def compute_value_loss(vpreds: torch.Tensor, returns: torch.Tensor, values: torch.Tensor, response_mask: torch.Tensor, verl/trainer/ppo/core_algos.py:651: def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> torch.FloatTensor: verl/trainer/ppo/core_algos.py:689: def compute_pf_ppo_reweight_data( verl/trainer/ppo/core_algos.py:43: def decorator(fn): verl/trainer/ppo/core_algos.py:99: def update(self, current_kl, n_steps): verl/trainer/ppo/core_algos.py:112: def update(self, current_kl, n_steps): verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:329: def init_cache_engine(self): verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:334: def free_cache_engine(self): verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:355: def from_engine_args( ``` ### Usage Example For current git diffs compared to `main`: ``` python3 tests/special_sanity/type_coverage_check.py ``` For inspecting all files under `verl/` ``` find verl -type f -name "*.py" | xargs -n 1 python3 tests/special_sanity/type_coverage_check.py --all-lines --debug --target-file ``` ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [ ] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path.

…rofiler to DistProfiler, add unit test based on ProfilerConfig (#2117) ### Checklist Before Starting - [x] Searched for similar PR(s). - [x] Checked PR Title format - In format of: [modules] type: Title - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data` - type is in `feat, fix, refactor, chore, test` - can involve multiple modules, seperated by `,` or space, like `[megatron, fsdp, doc] feat: xxx` ### What does this PR do? Previously, most of individual components in verl takes omega conf dict as one of the input, making it tedious to setup unit tests. Now verl is gradually introducing dataclass for each sub module for configuration, with `verl.utils.omega_conf_to_dataclass` to make the conversion easier. This PR also provide example unit tests on how standalone classes with config as the input should be tested before using them end-to-end. Finally, this PR also renames WorkerProfiler to DistProfiler for clarity. ### Test Test cases for configuration utilities on CPU. 1. Test basic OmegaConf to dataclass conversion for simple nested structures 2. Test nested OmegaConf to dataclass conversion for complex hierarchical configurations 3. Verify all configuration values are correctly converted and accessible Test suite for NsightSystemsProfiler functionality 1. Initialization: Verify profiler state after creation 2. Basic Profiling: Test start/stop functionality 3. Discrete Mode: Test discrete profiling behavior 4. Annotation: Test the annotate decorator in both normal and discrete modes 5. Config Validation: Verify proper config initialization from OmegaConf ### Usage Example > Provide usage example(s) for easier usage. ```python def omega_conf_to_dataclass(config: Union[DictConfig, dict], dataclass_type: Type[Any]) -> Any: """ Convert an OmegaConf DictConfig to a dataclass. Args: config: The OmegaConf DictConfig or dict to convert. dataclass_type: The dataclass type to convert to. Returns: The dataclass instance. """ ``` ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [ ] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path.

…(#2049) ### Checklist Before Starting - [x] Searched for similar PR(s). - [x] Checked PR Title format - In format of: [modules] type: Title - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data` - type is in `feat, fix, refactor, chore` - can involve multiple modules, seperated by `,` or space, like `[megatron, fsdp, doc] feat: xxx` ### What does this PR do? Add support for dynamic batch size (data packing) of multimodal dataset. Add an example script `examples/grpo_trainer/run_qwen2_5_vl-7b_seq_balance.sh`. ### Test The console log from training Qwen2.5-VL-7B with PPO on the Geo3K dataset (`examples/grpo_trainer/run_qwen2_5_vl-7b_seq_balance.sh`). The experiment was conducted on a single node with 8 NVIDIA A800 GPUs. ``` [2025-06-17 02:42:10] (WorkerDict pid=13539) Skipping monkey patch for Qwen2_5_VLForConditionalGeneration as use_fused_kernels is False or fused_kernels_backend is torch [repeated 7x across cluster] [2025-06-17 02:42:10] (WorkerDict pid=13361) Model config after override: Qwen2_5_VLConfig { [2025-06-17 02:42:10] (WorkerDict pid=13361) "architectures": [ [2025-06-17 02:42:10] (WorkerDict pid=13361) "Qwen2_5_VLForConditionalGeneration" [2025-06-17 02:42:10] (WorkerDict pid=13361) ], [2025-06-17 02:42:10] (WorkerDict pid=13361) "attention_dropout": 0.0, [2025-06-17 02:42:10] (WorkerDict pid=13361) "eos_token_id": 151645, [2025-06-17 02:42:10] (WorkerDict pid=13361) "hidden_act": "silu", [2025-06-17 02:42:10] (WorkerDict pid=13361) "hidden_size": 3584, [2025-06-17 02:42:10] (WorkerDict pid=13361) "image_token_id": 151655, [2025-06-17 02:42:10] (WorkerDict pid=13361) "initializer_range": 0.02, [2025-06-17 02:42:10] (WorkerDict pid=13361) "intermediate_size": 18944, [2025-06-17 02:42:10] (WorkerDict pid=13361) "max_position_embeddings": 128000, [2025-06-17 02:42:10] (WorkerDict pid=13361) "max_window_layers": 28, [2025-06-17 02:42:10] (WorkerDict pid=13361) "model_type": "qwen2_5_vl", [2025-06-17 02:42:10] (WorkerDict pid=13361) "num_attention_heads": 28, [2025-06-17 02:42:10] (WorkerDict pid=13361) "num_hidden_layers": 28, [2025-06-17 02:42:10] (WorkerDict pid=13361) "num_key_value_heads": 4, [2025-06-17 02:42:10] (WorkerDict pid=13361) "pad_token_id": 151643, [2025-06-17 02:42:10] (WorkerDict pid=13361) "rms_norm_eps": 1e-06, [2025-06-17 02:42:10] (WorkerDict pid=13361) "rope_scaling": { [2025-06-17 02:42:10] (WorkerDict pid=13361) "mrope_section": [ [2025-06-17 02:42:10] (WorkerDict pid=13361) 16, [2025-06-17 02:42:10] (WorkerDict pid=13361) 24, [2025-06-17 02:42:10] (WorkerDict pid=13361) 24 [2025-06-17 02:42:10] (WorkerDict pid=13361) ], [2025-06-17 02:42:10] (WorkerDict pid=13361) "rope_type": "default", [2025-06-17 02:42:10] (WorkerDict pid=13361) "type": "default" [2025-06-17 02:42:10] (WorkerDict pid=13361) }, [2025-06-17 02:42:10] (WorkerDict pid=13361) "rope_theta": 1000000.0, [2025-06-17 02:42:10] (WorkerDict pid=13361) "sliding_window": 32768, [2025-06-17 02:42:10] (WorkerDict pid=13361) "tie_word_embeddings": false, [2025-06-17 02:42:10] (WorkerDict pid=13361) "torch_dtype": "bfloat16", [2025-06-17 02:42:10] (WorkerDict pid=13361) "transformers_version": "4.51.0", [2025-06-17 02:42:10] (WorkerDict pid=13361) "use_cache": true, [2025-06-17 02:42:10] (WorkerDict pid=13361) "use_sliding_window": false, [2025-06-17 02:42:10] (WorkerDict pid=13361) "video_token_id": 151656, [2025-06-17 02:42:10] (WorkerDict pid=13361) "vision_config": { [2025-06-17 02:42:10] (WorkerDict pid=13361) "depth": 32, [2025-06-17 02:42:10] (WorkerDict pid=13361) "fullatt_block_indexes": [ [2025-06-17 02:42:10] (WorkerDict pid=13361) 7, [2025-06-17 02:42:10] (WorkerDict pid=13361) 15, [2025-06-17 02:42:10] (WorkerDict pid=13361) 23, [2025-06-17 02:42:10] (WorkerDict pid=13361) 31 [2025-06-17 02:42:10] (WorkerDict pid=13361) ], [2025-06-17 02:42:10] (WorkerDict pid=13361) "hidden_act": "silu", [2025-06-17 02:42:10] (WorkerDict pid=13361) "hidden_size": 1280, [2025-06-17 02:42:10] (WorkerDict pid=13361) "in_channels": 3, [2025-06-17 02:42:10] (WorkerDict pid=13361) "in_chans": 3, [2025-06-17 02:42:10] (WorkerDict pid=13361) "intermediate_size": 3420, [2025-06-17 02:42:10] (WorkerDict pid=13361) "model_type": "qwen2_5_vl", [2025-06-17 02:42:10] (WorkerDict pid=13361) "num_heads": 16, [2025-06-17 02:42:10] (WorkerDict pid=13361) "out_hidden_size": 3584, [2025-06-17 02:42:10] (WorkerDict pid=13361) "patch_size": 14, [2025-06-17 02:42:10] (WorkerDict pid=13361) "spatial_merge_size": 2, [2025-06-17 02:42:10] (WorkerDict pid=13361) "spatial_patch_size": 14, [2025-06-17 02:42:10] (WorkerDict pid=13361) "temporal_patch_size": 2, [2025-06-17 02:42:10] (WorkerDict pid=13361) "tokens_per_second": 2, [2025-06-17 02:42:10] (WorkerDict pid=13361) "torch_dtype": "float32", [2025-06-17 02:42:10] (WorkerDict pid=13361) "window_size": 112 [2025-06-17 02:42:10] (WorkerDict pid=13361) }, [2025-06-17 02:42:10] (WorkerDict pid=13361) "vision_end_token_id": 151653, [2025-06-17 02:42:10] (WorkerDict pid=13361) "vision_start_token_id": 151652, [2025-06-17 02:42:10] (WorkerDict pid=13361) "vision_token_id": 151654, [2025-06-17 02:42:10] (WorkerDict pid=13361) "vocab_size": 152064 [2025-06-17 02:42:10] (WorkerDict pid=13361) } [2025-06-17 02:42:10] (WorkerDict pid=13361) [2025-06-17 02:42:10] (WorkerDict pid=13361) Monkey patch FlashAttention2.forward in Qwen2.5VL [2025-06-17 02:42:10] (WorkerDict pid=13361) Monkey patch _flash_attention_forward in transformers.models.qwen2_5_vl.modeling_qwen2_5_vl [2025-06-17 02:42:10] (WorkerDict pid=13361) Skipping monkey patch for Qwen2_5_VLForConditionalGeneration as use_fused_kernels is False or fused_kernels_backend is torch [2025-06-17 02:42:10] (WorkerDict pid=13541) Monkey patch FlashAttention2.forward in Qwen2.5VL [2025-06-17 02:42:10] (WorkerDict pid=13541) Monkey patch _flash_attention_forward in transformers.models.qwen2_5_vl.modeling_qwen2_5_vl [2025-06-17 02:42:10] (WorkerDict pid=13541) Skipping monkey patch for Qwen2_5_VLForConditionalGeneration as use_fused_kernels is False or fused_kernels_backend is torch [2025-06-17 02:42:10] (WorkerDict pid=13361) Qwen2_5_VLForConditionalGeneration contains 8.29B parameters [2025-06-17 02:42:10] (WorkerDict pid=13361) wrap_policy: functools.partial(<function _or_policy at 0x7f8504485b40>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x7f8504485a20>, transformer_layer_cls={<class 'transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2_5_VLDecoderLayer'>, <class 'transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2_5_VLVisionBlock'>})]) [2025-06-17 02:42:10] (WorkerDict pid=13361) Total steps: 60, num_warmup_steps: 0 [2025-06-17 02:42:10] (WorkerDict pid=13361) Actor use_remove_padding=True [2025-06-17 02:42:10] (WorkerDict pid=13361) Actor use_fused_kernels=False [2025-06-17 02:42:10] (WorkerDict pid=13543) Monkey patch FlashAttention2.forward in Qwen2.5VL [repeated 6x across cluster] [2025-06-17 02:42:10] (WorkerDict pid=13543) Monkey patch _flash_attention_forward in transformers.models.qwen2_5_vl.modeling_qwen2_5_vl [repeated 6x across cluster] [2025-06-17 02:42:10] (WorkerDict pid=13543) Skipping monkey patch for Qwen2_5_VLForConditionalGeneration as use_fused_kernels is False or fused_kernels_backend is torch [repeated 6x across cluster] [2025-06-17 02:42:10] (WorkerDict pid=13361) WARNING 06-16 18:40:12 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f830d065330> [2025-06-17 02:42:10] (WorkerDict pid=13540) NCCL version 2.21.5+cuda12.4 Training Progress: 0%| | 0/60 [00:00<?, ?it/s] [2025-06-17 02:42:18] (WorkerDict pid=13539) /**********/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 5x across cluster] [2025-06-17 02:42:18] (WorkerDict pid=13539) warnings.warn( [repeated 5x across cluster] Training Progress: 2%|▏ | 1/60 [04:09<4:05:26, 249.60s/it] [2025-06-17 02:46:27] (WorkerDict pid=13537) /**********/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 2x across cluster] [2025-06-17 02:46:27] (WorkerDict pid=13537) warnings.warn( [repeated 2x across cluster] Training Progress: 3%|▎ | 2/60 [08:04<3:52:47, 240.81s/it] (TaskRunner pid=9331) Training Progress: 5%|▌ | 3/60 [11:53<3:43:33, 235.33s/it] (WorkerDict pid=13540) kwargs: {'n': 5, 'logprobs': 0, 'max_tokens': 2048, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} (WorkerDict pid=13539) WARNING 06-16 18:40:12 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f97d7cc92d0> [repeated 7x across cluster] (WorkerDict pid=13542) NCCL version 2.21.5+cuda12.4 [repeated 2x across cluster] (TaskRunner pid=9331) Using LocalLogger is deprecated. The constructor API will change (WorkerDict pid=13539) kwargs: {'n': 5, 'logprobs': 0, 'max_tokens': 2048, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 5x across cluster] (TaskRunner pid=9331) step:1 - global_seqlen/min:194004.000 - global_seqlen/max:215990.000 - global_seqlen/minmax_diff:21986.000 - global_seqlen/balanced_min:203335.000 - global_seqlen/balanced_max:203336.000 - global_seqlen/mean:203335.125 - actor/entropy:0.467 - training/rollout_probs_diff_max:0.378 - training/rollout_probs_diff_mean:0.005 - training/rollout_probs_diff_std:0.011 - actor/kl_loss:0.001 - actor/kl_coef:0.010 - actor/pg_loss:-0.005 - actor/pg_clipfrac:0.001 - actor/ppo_kl:-0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.230 - perf/mfu/actor:0.323 - perf/max_memory_allocated_gb:62.271 - perf/max_memory_reserved_gb:81.812 - perf/cpu_memory_used_gb:0.000 - actor/lr:0.000 - training/global_step:1.000 - training/epoch:0.000 - critic/score/mean:0.394 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.394 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.008 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.008 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:380.995 - response_length/max:2048.000 - response_length/min:25.000 - response_length/clip_ratio:0.007 - prompt_length/mean:254.428 - prompt_length/max:996.000 - prompt_length/min:102.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:66.493 - timing_s/reshard:1.879 - timing_s/gen:70.929 - timing_s/reward:3.603 - timing_s/old_log_prob:34.632 - timing_s/ref:33.643 - timing_s/adv:0.095 - timing_s/update_actor:95.425 - timing_s/step:238.697 - timing_per_token_ms/gen:0.073 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.021 - timing_per_token_ms/update_actor:0.059 - perf/total_num_tokens:1626681.000 - perf/time_per_step:238.697 - perf/throughput:851.856 (WorkerDict pid=13537) kwargs: {'n': 5, 'logprobs': 0, 'max_tokens': 2048, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 2x across cluster] (TaskRunner pid=9331) step:2 - global_seqlen/min:190581.000 - global_seqlen/max:220843.000 - global_seqlen/minmax_diff:30262.000 - global_seqlen/balanced_min:209057.000 - global_seqlen/balanced_max:209058.000 - global_seqlen/mean:209057.500 - actor/entropy:0.458 - training/rollout_probs_diff_max:0.415 - training/rollout_probs_diff_mean:0.005 - training/rollout_probs_diff_std:0.011 - actor/kl_loss:0.001 - actor/kl_coef:0.010 - actor/pg_loss:0.017 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.252 - perf/mfu/actor:0.327 - perf/max_memory_allocated_gb:62.280 - perf/max_memory_reserved_gb:85.205 - perf/cpu_memory_used_gb:0.000 - actor/lr:0.000 - training/global_step:2.000 - training/epoch:0.000 - critic/score/mean:0.403 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.403 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.016 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.016 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:390.521 - response_length/max:2048.000 - response_length/min:18.000 - response_length/clip_ratio:0.009 - prompt_length/mean:262.783 - prompt_length/max:996.000 - prompt_length/min:103.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:63.223 - timing_s/reshard:2.093 - timing_s/gen:70.164 - timing_s/reward:3.706 - timing_s/old_log_prob:30.945 - timing_s/ref:30.190 - timing_s/adv:0.088 - timing_s/update_actor:96.829 - timing_s/step:232.303 - timing_per_token_ms/gen:0.070 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.018 - timing_per_token_ms/update_actor:0.058 - perf/total_num_tokens:1672460.000 - perf/time_per_step:232.303 - perf/throughput:899.936 (TaskRunner pid=9331) step:3 - global_seqlen/min:197140.000 - global_seqlen/max:212951.000 - global_seqlen/minmax_diff:15811.000 - global_seqlen/balanced_min:205956.000 - global_seqlen/balanced_max:205957.000 - global_seqlen/mean:205956.250 - actor/entropy:0.418 - training/rollout_probs_diff_max:0.319 - training/rollout_probs_diff_mean:0.005 - training/rollout_probs_diff_std:0.011 - actor/kl_loss:0.005 - actor/kl_coef:0.010 - actor/pg_loss:0.065 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.199 - perf/mfu/actor:0.332 - perf/max_memory_allocated_gb:62.414 - perf/max_memory_reserved_gb:85.205 - perf/cpu_memory_used_gb:0.000 - actor/lr:0.000 - training/global_step:3.000 - training/epoch:0.000 - critic/score/mean:0.392 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.392 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.004 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.004 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:379.654 - response_length/max:2048.000 - response_length/min:20.000 - response_length/clip_ratio:0.003 - prompt_length/mean:263.959 - prompt_length/max:776.000 - prompt_length/min:103.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:60.097 - timing_s/reshard:2.019 - timing_s/gen:69.763 - timing_s/reward:3.414 - timing_s/old_log_prob:30.005 - timing_s/ref:30.284 - timing_s/adv:0.090 - timing_s/update_actor:93.705 - timing_s/step:227.641 - timing_per_token_ms/gen:0.072 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.018 - timing_per_token_ms/update_actor:0.057 - perf/total_num_tokens:1647650.000 - perf/time_per_step:227.641 - perf/throughput:904.741 (TaskRunner pid=9331) Training Progress: 7%|▋ | 4/60 [15:41<3:37:00, 232.51s/it] (TaskRunner pid=9331) step:4 - global_seqlen/min:190149.000 - global_seqlen/max:224987.000 - global_seqlen/minmax_diff:34838.000 - global_seqlen/balanced_min:207060.000 - global_seqlen/balanced_max:207061.000 - global_seqlen/mean:207060.250 - actor/entropy:0.429 - training/rollout_probs_diff_max:0.299 - training/rollout_probs_diff_mean:0.004 - training/rollout_probs_diff_std:0.011 - actor/kl_loss:0.002 - actor/kl_coef:0.010 - actor/pg_loss:0.036 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.210 - perf/mfu/actor:0.330 - perf/max_memory_allocated_gb:62.977 - perf/max_memory_reserved_gb:87.430 - perf/cpu_memory_used_gb:0.000 - actor/lr:0.000 - training/global_step:4.000 - training/epoch:0.000 - critic/score/mean:0.406 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.406 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.019 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.019 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:392.973 - response_length/max:2048.000 - response_length/min:25.000 - response_length/clip_ratio:0.010 - prompt_length/mean:254.090 - prompt_length/max:996.000 - prompt_length/min:103.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:64.229 - timing_s/reshard:2.136 - timing_s/gen:71.688 - timing_s/reward:3.684 - timing_s/old_log_prob:28.621 - timing_s/ref:28.663 - timing_s/adv:0.088 - timing_s/update_actor:94.804 - timing_s/step:227.898 - timing_per_token_ms/gen:0.071 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.017 - timing_per_token_ms/update_actor:0.057 - perf/total_num_tokens:1656482.000 - perf/time_per_step:227.898 - perf/throughput:908.567 (TaskRunner pid=9331) test_gen_batch meta info: {'eos_token_id': 151645, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (TaskRunner pid=9331) validation generation end (TaskRunner pid=9331) [prompt] system (TaskRunner pid=9331) You are a helpful assistant. (TaskRunner pid=9331) user (TaskRunner pid=9331) Chords $\overline{A C}$ and $\overline{D F}$ are equidistant from the center. If the radius of $\odot G$ is 26 find $A C$ You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}. (TaskRunner pid=9331) assistant (TaskRunner pid=9331) (TaskRunner pid=9331) [response] <think>Let's start by analyzing the given information and the diagram. We know that chords $\overline{AC}$ and $\overline{DF}$ are equidistant from the center $G$. This means that the perpendicular distances from $G$ to $\overline{AC}$ and $\overline{DF}$ are the same. (TaskRunner pid=9331) (TaskRunner pid=9331) First, let's consider the radius of the circle, which is 26. The perpendicular distance from the center $G$ to a chord is the length of the segment from $G$ to the midpoint of the chord. This distance is also the length of the segment from $G$ to the chord, which is the same for both chords $\overline{AC}$ and $\overline{DF}$. (TaskRunner pid=9331) (TaskRunner pid=9331) Since the chords are equidistant from the center, the perpendicular distance from $G$ to $\overline{AC}$ is the same as the perpendicular distance from $G$ to $\overline{DF}$. This distance is 10, as given in the problem. (TaskRunner pid=9331) (TaskRunner pid=9331) Now, let's use the Pythagorean theorem in the right triangle formed by the radius, the perpendicular distance, and half the length of the chord. The radius is 26, the perpendicular distance is 10, and half the length of the chord is $ \frac{AC}{2} $. (TaskRunner pid=9331) (TaskRunner pid=9331) The Pythagorean theorem states: (TaskRunner pid=9331) \[ 26^2 = 10^2 + \left( \frac{AC}{2} \right)^2 \] (TaskRunner pid=9331) \[ 676 = 100 + \left( \frac{AC}{2} \right)^2 \] (TaskRunner pid=9331) \[ 576 = \left( \frac{AC}{2} \right)^2 \] (TaskRunner pid=9331) \[ \frac{AC}{2} = \sqrt{576} \] (TaskRunner pid=9331) \[ \frac{AC}{2} = 24 \] (TaskRunner pid=9331) \[ AC = 48 \] (TaskRunner pid=9331) (TaskRunner pid=9331) So, the length of $AC$ is 48.</think> (TaskRunner pid=9331) \boxed{48} (TaskRunner pid=9331) [ground_truth] 48 (TaskRunner pid=9331) [score] 1.0 (TaskRunner pid=9331) Training Progress: 8%|▊ | 5/60 [20:34<3:53:09, 254.36s/it] (TaskRunner pid=9331) Training Progress: 10%|█ | 6/60 [24:24<3:41:25, 246.02s/it] (TaskRunner pid=9331) step:5 - global_seqlen/min:196253.000 - global_seqlen/max:210637.000 - global_seqlen/minmax_diff:14384.000 - global_seqlen/balanced_min:205432.000 - global_seqlen/balanced_max:205432.000 - global_seqlen/mean:205432.000 - actor/entropy:0.383 - training/rollout_probs_diff_max:0.349 - training/rollout_probs_diff_mean:0.004 - training/rollout_probs_diff_std:0.011 - actor/kl_loss:0.003 - actor/kl_coef:0.010 - actor/pg_loss:-0.022 - actor/pg_clipfrac:0.001 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.218 - perf/mfu/actor:0.327 - perf/max_memory_allocated_gb:62.977 - perf/max_memory_reserved_gb:87.430 - perf/cpu_memory_used_gb:0.000 - actor/lr:0.000 - val-aux/hiyouga/geometry3k/reward/mean@1:0.450 - val-aux/hiyouga/geometry3k/reward/mean@24:0.550 - val-aux/hiyouga/geometry3k/reward/std@24:0.450 - val-aux/hiyouga/geometry3k/reward/best@2/mean:0.637 - val-aux/hiyouga/geometry3k/reward/best@2/std:0.238 - val-aux/hiyouga/geometry3k/reward/worst@2/mean:0.367 - val-aux/hiyouga/geometry3k/reward/worst@2/std:0.229 - val-aux/hiyouga/geometry3k/reward/best@4/mean:0.789 - val-aux/hiyouga/geometry3k/reward/best@4/std:0.255 - val-aux/hiyouga/geometry3k/reward/worst@4/mean:0.137 - val-aux/hiyouga/geometry3k/reward/worst@4/std:0.153 - val-aux/hiyouga/geometry3k/reward/best@8/mean:0.964 - val-aux/hiyouga/geometry3k/reward/best@8/std:0.118 - val-aux/hiyouga/geometry3k/reward/worst@8/mean:0.097 - val-aux/hiyouga/geometry3k/reward/worst@8/std:0.056 - val-aux/hiyouga/geometry3k/reward/best@16/mean:1.000 - val-aux/hiyouga/geometry3k/reward/best@16/std:0.000 - val-aux/hiyouga/geometry3k/reward/worst@16/mean:0.064 - val-aux/hiyouga/geometry3k/reward/worst@16/std:0.022 - val-aux/hiyouga/geometry3k/reward/best@24/mean:1.000 - val-aux/hiyouga/geometry3k/reward/best@24/std:0.000 - val-aux/hiyouga/geometry3k/reward/worst@24/mean:0.100 - val-aux/hiyouga/geometry3k/reward/worst@24/std:0.000 - val-aux/hiyouga/geometry3k/reward/mean@14:0.550 - val-aux/hiyouga/geometry3k/reward/std@14:0.450 - val-aux/hiyouga/geometry3k/reward/best@14/mean:1.000 - val-aux/hiyouga/geometry3k/reward/best@14/std:0.000 - val-aux/hiyouga/geometry3k/reward/worst@14/mean:0.100 - val-aux/hiyouga/geometry3k/reward/worst@14/std:0.000 - val-aux/hiyouga/geometry3k/reward/mean@2:0.548 - val-aux/hiyouga/geometry3k/reward/std@2:0.210 - val-aux/hiyouga/geometry3k/reward/mean@3:0.455 - val-aux/hiyouga/geometry3k/reward/std@3:0.309 - val-aux/hiyouga/geometry3k/reward/best@3/mean:0.664 - val-aux/hiyouga/geometry3k/reward/best@3/std:0.192 - val-aux/hiyouga/geometry3k/reward/worst@3/mean:0.231 - val-aux/hiyouga/geometry3k/reward/worst@3/std:0.235 - val-aux/hiyouga/geometry3k/reward/mean@6:0.475 - val-aux/hiyouga/geometry3k/reward/std@6:0.437 - val-aux/hiyouga/geometry3k/reward/best@6/mean:0.958 - val-aux/hiyouga/geometry3k/reward/best@6/std:0.174 - val-aux/hiyouga/geometry3k/reward/worst@6/mean:0.105 - val-aux/hiyouga/geometry3k/reward/worst@6/std:0.061 - val-core/hiyouga/geometry3k/reward/mean@26:0.612 - val-aux/hiyouga/geometry3k/reward/std@26:0.454 - val-core/hiyouga/geometry3k/reward/best@26/mean:1.000 - val-core/hiyouga/geometry3k/reward/best@26/std:0.000 - val-aux/hiyouga/geometry3k/reward/worst@26/mean:0.012 - val-aux/hiyouga/geometry3k/reward/worst@26/std:0.032 - val-aux/hiyouga/geometry3k/reward/mean@8:0.438 - val-aux/hiyouga/geometry3k/reward/std@8:0.420 - val-aux/hiyouga/geometry3k/reward/mean@5:0.460 - val-aux/hiyouga/geometry3k/reward/std@5:0.400 - val-aux/hiyouga/geometry3k/reward/best@5/mean:0.856 - val-aux/hiyouga/geometry3k/reward/best@5/std:0.255 - val-aux/hiyouga/geometry3k/reward/worst@5/mean:0.135 - val-aux/hiyouga/geometry3k/reward/worst@5/std:0.134 - val-aux/hiyouga/geometry3k/reward/mean@9:0.300 - val-aux/hiyouga/geometry3k/reward/std@9:0.374 - val-aux/hiyouga/geometry3k/reward/best@9/mean:0.908 - val-aux/hiyouga/geometry3k/reward/best@9/std:0.272 - val-aux/hiyouga/geometry3k/reward/worst@9/mean:0.100 - val-aux/hiyouga/geometry3k/reward/worst@9/std:0.000 - val-aux/hiyouga/geometry3k/reward/mean@4:0.100 - val-aux/hiyouga/geometry3k/reward/std@4:0.000 - training/global_step:5.000 - training/epoch:1.000 - critic/score/mean:0.388 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.388 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.010 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.010 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:384.739 - response_length/max:2048.000 - response_length/min:18.000 - response_length/clip_ratio:0.007 - prompt_length/mean:257.236 - prompt_length/max:996.000 - prompt_length/min:103.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:63.336 - timing_s/reshard:2.105 - timing_s/gen:69.227 - timing_s/reward:3.572 - timing_s/old_log_prob:29.942 - timing_s/ref:29.623 - timing_s/adv:0.087 - timing_s/update_actor:94.945 - timing_s/testing:51.987 - timing_s/step:279.773 - timing_per_token_ms/gen:0.070 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.018 - timing_per_token_ms/update_actor:0.058 - perf/total_num_tokens:1643456.000 - perf/time_per_step:279.773 - perf/throughput:734.280 (TaskRunner pid=9331) step:6 - global_seqlen/min:200473.000 - global_seqlen/max:216599.000 - global_seqlen/minmax_diff:16126.000 - global_seqlen/balanced_min:207366.000 - global_seqlen/balanced_max:207367.000 - global_seqlen/mean:207366.250 - actor/entropy:0.346 - training/rollout_probs_diff_max:0.239 - training/rollout_probs_diff_mean:0.004 - training/rollout_probs_diff_std:0.011 - actor/kl_loss:0.004 - actor/kl_coef:0.010 - actor/pg_loss:0.013 - actor/pg_clipfrac:0.001 - actor/ppo_kl:-0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.257 - perf/mfu/actor:0.328 - perf/max_memory_allocated_gb:62.977 - perf/max_memory_reserved_gb:87.430 - perf/cpu_memory_used_gb:0.000 - actor/lr:0.000 - training/global_step:6.000 - training/epoch:1.000 - critic/score/mean:0.443 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.443 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.010 - critic/advantages/max:1.789 - critic/advantages/min:-1.789 - critic/returns/mean:-0.010 - critic/returns/max:1.789 - critic/returns/min:-1.789 - response_length/mean:381.082 - response_length/max:2048.000 - response_length/min:22.000 - response_length/clip_ratio:0.005 - prompt_length/mean:266.938 - prompt_length/max:996.000 - prompt_length/min:102.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:60.989 - timing_s/reshard:1.788 - timing_s/gen:67.473 - timing_s/reward:3.320 - timing_s/old_log_prob:30.357 - timing_s/ref:31.241 - timing_s/adv:0.090 - timing_s/update_actor:95.860 - timing_s/step:228.698 - timing_per_token_ms/gen:0.069 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.019 - timing_per_token_ms/update_actor:0.058 - perf/total_num_tokens:1658930.000 - perf/time_per_step:228.698 - perf/throughput:906.726 (TaskRunner pid=9331) Training Progress: 12%|█▏ | 7/60 [28:12<3:32:05, 240.11s/it] (TaskRunner pid=9331) Training Progress: 13%|█▎ | 8/60 [31:55<3:23:25, 234.72s/it] (TaskRunner pid=9331) Training Progress: 15%|█▌ | 9/60 [35:50<3:19:30, 234.71s/it] ... ... ``` ### Usage Example ```bash bash examples/grpo_trainer/run_qwen2_5_vl-7b_seq_balance.sh ``` ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [ ] Rely on existing unit tests on CI that covers the code path. - [ ] New CI unit test(s) are added to cover the code path.

…#2130) ### What does this PR do? Set actor's strategy as the default strategy for critic, ref and reward model. In principle, all actors should use the same strategy. With this change, we can set `STRATEGY=fsdp2` in `run_function_reward.sh` and all models can use fsdp2 as strategy, instead of setting it for each role individually. ### Checklist Before Describing the Details - [x] Searched for similar PR(s). - [x] PR title is in the format of: `[modules] type: Title` - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg` - type is in `feat, fix, refactor, chore, test` - multiple modules are seperated by `,` or space, such as `[megatron, fsdp, doc] feat: xxx` ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit run --show-diff-on-failure --color=always --all-files` - [x] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] New CI unit test(s) are added to cover the code path. - [x] Rely on existing unit tests on CI that covers the code path.

…126)

### What does this PR do? fix the rendering ### Checklist Before Describing the Details - [x] Searched for similar PR(s). - [x] PR title is in the format of: `[modules] type: Title` - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg` - type is in `feat, fix, refactor, chore, test` - multiple modules are seperated by `,` or space, such as `[megatron, fsdp, doc] feat: xxx` ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit run --show-diff-on-failure --color=always --all-files` - [ ] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [ ] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path.

…n inside wor… (#2131) …ker group ### What does this PR do? As title ### Checklist Before Describing the Details - [ ] Searched for similar PR(s). - [ ] PR title is in the format of: `[modules] type: Title` - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg` - type is in `feat, fix, refactor, chore, test` - multiple modules are seperated by `,` or space, such as `[megatron, fsdp, doc] feat: xxx` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Checklist Before Submitting - [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit run --show-diff-on-failure --color=always --all-files` - [ ] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [ ] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path.

…ack in sglang (#1630)

### Checklist Before Starting - [X] Searched for similar PR(s). ### What does this PR do? This PR adds image input to sglang async rollout. Previously sglang async rollout only support text. There is also a placeholder for video data, will be added as an input when SGLang engine supports it. ### High-Level Design Since sglang engine already handle the image input, just need to properly handling the tokenization. ### Specific Changes Change `self.tokenizer.apply_chat_template()` to `self.processing_class.apply_chat_template()`. `processing_class` could be `tokenizer` or `processor`. ### Usage Example It will automatically using processor to process image when the model's processor supports that. It will use tokenizer if there is no processor available ### Checklist Before Submitting - [X] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [X] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [X] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [X] New CI unit test(s) are added to cover the code path. - [X] Rely on existing unit tests on CI that covers the code path. --------- Co-authored-by: xieck13 <xieck13@gmail.com>

### What does this PR do? Support FSDP2 save HF model. Previously only supported FSDP1, and FSDP2 will lead to error in volcengine/verl#1703. Fix volcengine/verl#1703. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

### What does this PR do? > fix : batch size and the size of raw_prompt unmatching when setting `data.return_raw_chat=True` fix bug when using `data.return_raw_chat=True` in GRPO algorithm with reward model: ` File "/ossfs/workspace/repository/verl/verl/single_controller/ray/base.py", line 625, in func return getattr(self.worker_dict[key], name)(*args, **kwargs) File "/ossfs/workspace/repository/verl/verl/single_controller/base/decorator.py", line 534, in inner return func(*args, **kwargs) File "/ossfs/workspace/repository/verl/verl/workers/fsdp_workers.py", line 634, in generate_sequences output = self.rollout.generate_sequences(prompts=prompts) File "/ossfs/workspace/repository/verl/verl/utils/debug/performance.py", line 78, in f return self.log(decorated_function, *args, **kwargs) File "/ossfs/workspace/repository/verl/verl/utils/debug/performance.py", line 88, in log output = func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, **kwargs) File "/ossfs/workspace/repository/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 346, in generate_sequences return DataProto(batch=batch, non_tensor_batch=non_tensor_batch) File "<string>", line 6, in __init__ File "/ossfs/workspace/repository/verl/verl/protocol.py", line 214, in __post_init__ self.check_consistency() File "/ossfs/workspace/repository/verl/verl/protocol.py", line 325, in check_consistency assert val.shape[0] == batch_size, f"key {key} length {len(val)} is not equal to batch size {batch_size}" AssertionError: key raw_prompt length 128 is not equal to batch size 640` ### Checklist Before Starting - [ ] Search for similar PRs. Paste at least one query link here: ... - [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

Co-authored with: MrAta (immrata@gmail.com) ### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? ### Motivation In RL Ecosystem which use colocate design like [verl](https://github.com/volcengine/verl/tree/main), we need to offload training model and load serving model & KV Cache frequently. #### Background - Currently SGLang is using [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) to pause and resume. - [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) is a open source repo that provided easy to use api to hack **cudaMalloc** and **cudaFree** to make sure the virtual address could be consistent after pause and resume, which is critical to ensure CUDA Graph work. - CUDA Graph is critical to make sure SGLang runs faster in decoding phases. #### Here is the current behavior of VERL + SGLang ![Image](https://github.com/user-attachments/assets/e87e7dd6-f223-4de6-8f07-915eb2030ea8) 1. During Training, we have training model and optimizer state in the GPU Memory, and once training is done, we will offload optimizer state to cpu and keep the model weights in GPU, which is needed in Update Weight. 2. During Update Weight, we awake the SGLang engine, so those paused memory of Model Weights and KV Cache will come back. Then we update model from training model to serving model on the fly using the api: `update_weights_in_tensor` 3. After Model being updated, we delete the training model from GPU Memory. Above design works pretty well so far, however, this would waste a big chunk of GPU Memory during rollout, which could cause a few issues we've seen so far: - **Small KV Cache**: We need to use relative lower number of mem fraction ratio (e.g: 0.6), hence our KV Cache has less tokens. Given KV Cache has less tokens, we will hit `RuntimeError: Prefill out of memory. Try to lower your batch size.` when we try prefill large number of requests. - **Out of Memory**: If we use mem fraction ratio 0.8 and run RL for 32B model on 8 H100, it will OOM during update weight #### Challenge - `torch_memory_saver` currently only supports Singleton, hence SGLang will pause and resume KV Cache + Weights together, they are treated as the same group of memory controlled by the singleton `torch_memory_saver` instance #### Proposal ![Image](https://github.com/user-attachments/assets/7fda9638-0dc2-4c14-bc64-cd20616f350f) 1. During Training, we do the same 2. During Update Weight Stage 1, we awake the model weights from SGLang and then update weights 3. During Update Weight Stage 2, we delete the training model weights from GPU Memory 4. Awake the SGLang's KV Cache ![Image](https://github.com/user-attachments/assets/f3dab327-dc2e-4ed8-88d7-15e383f77d25) ### Benefit With above feature, we can train larger model with same GPU, we can also make training/rollout more efficient given we can allocate larger KV Cache ### Solution: Keep using Singleton and provide tag based pause/resume - [x] Support tag based resume/pause: fzyzcjy/torch_memory_saver#20 - [x] Support Multiple Stage Awake in SGLang: sgl-project/sglang#7099 - [ ] Support Multiple Stage Awake in verl: volcengine/verl#1911 ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Test ![Screenshot 2025-06-19 at 12 16 19 PM](https://github.com/user-attachments/assets/a95dd57e-43e1-4f28-8a84-003ec5c043fc) ![Screenshot 2025-06-19 at 12 13 14 PM](https://github.com/user-attachments/assets/f1f4a8a8-1845-4fad-9424-5526d4154dd0) ### Additional Info. - **Issue Number**: Fixes issue # or discussion # if any. - **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none] - **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none] ### Checklist Before Submitting - [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [ ] Add `[BREAKING]` to the PR title if it breaks any API. - [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [ ] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path. --------- Co-authored-by: Chayenne <zhaochen20@outlook.com>

### What does this PR do? Allow torch.distributed.init_process_group to fetch "DIST_INIT_METHOD" from os.environ to accelerate single node initialization. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test Before using shared file-system initialization ![企业微信截图_17506726709312](https://github.com/user-attachments/assets/cce62dab-dbea-496e-bb60-5cd4e88f8809) After export DIST_INIT_METHOD='file:///tmp/torch_dist' ![企业微信截图_17506729178154](https://github.com/user-attachments/assets/6ed23d76-dda8-44fc-8cb8-5596da0c606d) ### API and Usage Example Simply add ```export DIST_INIT_METHOD='file:///tmp/some_file'``` to your script, and remember to ```rm -rf /tmp/some_file``` before your next run. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: very simple to reproduce - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? Add MiniCPM-o 2.6 multimodal model support to VERL framework for vision-language RL training. ### Specific Changes - **verl/third_party/vllm/vllm_v_0_5_4/dtensor_weight_loaders.py**: Add MiniCPM-o weight loading - **verl/third_party/vllm/vllm_v_0_6_3/dtensor_weight_loaders.py**: Add MiniCPM-o weight loading - **verl/utils/dataset/vision_utils.py**: Enhanced vision data processing - **verl/utils/dataset/rl_dataset.py**: Multimodal dataset support - **verl/utils/flops_counter.py**: Vision model FLOPS calculation - **verl/workers/actor/dp_actor.py**: Multimodal model compatibility - **examples/grpo_trainer/run_minicpmo2_6.sh**: Complete training example ### Usage Example ```bash # Train MiniCPM-o 2.6 with GRPO bash examples/grpo_trainer/run_minicpmo2_6.sh ``` ### Test - [x] Local testing with MiniCPM-o 2.6 on geo3k dataset - [x] Verified weight loading for both vLLM versions - [x] Training script validation ### Checklist Before Submitting - [x] Read the Contribute Guide - [x] Apply pre-commit checks (will fix in follow-up if needed) - [ ] No breaking API changes - [ ] Documentation updates (if needed) - [x] Rely on existing unit tests --------- Co-authored-by: RanchiZhao <ranchizhao@example.com>

Reverts volcengine/verl#1833

### What does this PR do? fix timer importance error in split_placement, should use `from verl.trainer.ppo.ray_trainer import marked_timer`, but got `from verl.trainer.ppo.ray_trainer import _timer` now. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test Not related. ### API and Usage Example Not related. ### High-Level Design Not related. ### Specific Changes fix timer importance error in split_placement, should use `from verl.trainer.ppo.ray_trainer import marked_timer`, but got `import _timer` now. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

…_aware

…nto diff_aware

### What does this PR do? Fix bugs introduced by volcengine/verl#2113 Do not skip the e2e tests when pushing changes to the main branch > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

…ecker code to dapo_ray_trainer.py, previous location was wrong; have a specialized launch script, the only modification is having an explicit default_local_dir for saving the checkpoints and the tran_dataset

### Checklist Before Starting - [x] Searched for similar PR(s). - [ ] Checked PR Title format - In format of: [modules] type: Title - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data` - type is in `feat, fix, refactor, chore, test` - can involve multiple modules, seperated by `,` or space, like `[megatron, fsdp, doc] feat: xxx` ### What does this PR do? fix megatron vllm async rollout, releated to volcengine/verl#2008 and volcengine/verl#2001 > We are from the Large Model Post-Training Team of **📕 Xiaohongshu's AI Platform Technology Department** , dedicated to developing high-performance, easily-scalable distributed post-training engines. ### Test ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Checklist Before Submitting - [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [ ] Add `[BREAKING]` to the PR title `description` if it breaks any API. - [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [ ] New CI unit test(s) are added to cover the code path. - [ ] Rely on existing unit tests on CI that covers the code path.

### What does this PR do? Fix a typo in the profiler's document, `use_profiler` should be `use_profile`. https://github.com/volcengine/verl/blob/9b7bb69ea3165b691cc908d7f3f2f14c4a65a59e/verl/utils/debug/profile.py#L49 > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

…t; Simplify the code for length budget assignment

…Reasoning360 into pr-upstream-verl-merge-diffaware

(1) Fix the generation length for validation to 32k; for training, the max response length still query value from max_response_length. (2) Log generated response lengths to a CSV file during validation. This can be removed for production release.

…Reasoning360 into pr-upstream-verl-merge-diffaware

Raf-Chen and others added 30 commits June 20, 2025 17:27

[BREAKING][ci] feat: add CI request channel & improve PR template (#2…

9b7bb69

…126)

[sglang] feat: Support async multi-turn rollout with simulation feedb…

c7aa5e8

…ack in sglang (#1630)

[tool] feat: Add memory limit configuration for sandbox fusion (#2105)

e67ee86

debug script

bcda81d

[model] fix: Revert "[model] feat: Add MiniCPM-o 2.6 support" (#2176)

24707f6

Reverts volcengine/verl#1833

feature:appending pass rate after each gradient updates

29857aa

7b-inst 4node script

0b11244

Merge branch 'diff_aware' of github.com:LLM360/Reasoning360 into diff…

1c16fec

…_aware

bugfix: add modifications for the pass rate appending code

d707c21

Merge branch 'diff_aware' of https://github.com/LLM360/Reasoning360 i…

d406a13

…nto diff_aware

bug fix and launch script update: move the train_dataset.dataframe ch…

556a7c5

…ecker code to dapo_ray_trainer.py, previous location was wrong; have a specialized launch script, the only modification is having an explicit default_local_dir for saving the checkpoints and the tran_dataset

[Feature] Finalize the code and scripts forpriority sampling

a922689

LiqunMa and others added 25 commits August 12, 2025 18:29

del

fe2e413

add all synlogic verifier

9bcf1d4

add train scripts

de34d41

Upload 70b long-cot fsdp example [tested]

c85d51e

del the code for debug

b0b6347

nemotron stem

0450b70

init dapo

c9e7b3e

add quantile statistics and address the data type warning on logger

1e5e5cd

[Bug fix and improve] Avoid the seoncd time repeating for vllm rollou…

0494be5

…t; Simplify the code for length budget assignment

correct dataset name

ef4a8ce

fix names of test data

d0cfea9

merge syn

924d94f

merge nemotron_stem

7f998e0

fix the data dir

ab00656

fix bugs

e13c7ab

small

52f9c09

Merge branch 'pr-upstream-verl-merge-diffaware' of github.com:LLM360/…

4ee85d4

…Reasoning360 into pr-upstream-verl-merge-diffaware

dalu recipe

ae4a18a

update data_process

adf641b

revise testset

249d242

m1 script & better dalu data shuffle

c84c707

update ifbench test

a4ceab1

debug length

7c02165

Merge branch 'pr-upstream-verl-merge-diffaware' of github.com:LLM360/…

863cb25

…Reasoning360 into pr-upstream-verl-merge-diffaware

Jianshu-She requested a review from ZYHowell September 30, 2025 11:12

haonan-li added 4 commits September 30, 2025 18:57

support wandb resume by adding +trainer.run_id={RUNIT} in recipe

1f98532

update 32b training script

a630dab

update data_process script

c25c64d

update scripts

1afe759

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pr upstream verl merge diffaware #137

Pr upstream verl merge diffaware #137

Uh oh!

Jianshu-She commented Sep 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Pr upstream verl merge diffaware #137

Are you sure you want to change the base?

Pr upstream verl merge diffaware #137

Uh oh!

Conversation

Jianshu-She commented Sep 29, 2025

What does this PR do?

Checklist Before Starting

Test

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants