opooladz commented Oct 4, 2025

Summary

  • Expand GRPOTrainer to expose ProRL/DAPO-style controls (asymmetric clipping, per-token weighting, length shaping, KL/entropy guards, selectable advantage estimators)
  • Enhance DPOTrainer metrics with KL/entropy/logit margin for consistent dashboards
  • Add shared utilities for group-normalised advantages and DAPO length shaping, plus accompanying unit tests (a sketch of the helpers follows this list)
  • Document the overall RL pipeline (RL_ALGORITHMS.md) and the inference engines (INFERENCE_ENGINES.md), add release notes (RL_UPDATES.md), and provide a VinePPO roadmap (VINEPPO_IMPLEMENTATION_PLAN.md)
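
A minimal sketch of the two shared helpers, assuming illustrative names (`group_normalized_advantages`, `dapo_length_penalty`, `max_len`, `cache_len`) rather than the exact easydel signatures:

```python
# Illustrative sketch only; names and shapes are assumptions, not the easydel API.
import jax.numpy as jnp


def group_normalized_advantages(rewards: jnp.ndarray, eps: float = 1e-6) -> jnp.ndarray:
    """Normalise rewards within each group of samples drawn from the same prompt.

    rewards: [num_prompts, group_size] scalar rewards, one row per prompt.
    """
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)


def dapo_length_penalty(lengths: jnp.ndarray, max_len: int, cache_len: int) -> jnp.ndarray:
    """DAPO-style soft overlong punishment.

    Zero penalty up to (max_len - cache_len) tokens, then a linear ramp down
    to -1 at max_len; anything longer is clipped at -1.
    """
    overflow = lengths - (max_len - cache_len)
    return jnp.clip(-overflow / cache_len, -1.0, 0.0)
```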

Changes

  • easydel/trainers/group_relative_policy_optimization/grpo_config.py: New config knobs (advantage estimator, clip bounds, length shaping, KL resets, sampling safeguards)
  • easydel/trainers/group_relative_policy_optimization/grpo_trainer.py: Preprocess rewards with length shaping, compute advanced metrics, manage reference resets (ProRL-style)
  • easydel/trainers/group_relative_policy_optimization/_fn.py: PPO-style ratio loss with asymmetric clipping and token weighting (sketched after this list)
  • easydel/trainers/training_utils.py: Shared helpers for group advantages, length shaping, EMA tracking
  • easydel/trainers/direct_preference_optimization_trainer/_fn.py: Richer logging metrics
  • tests/trainers/grpo_utils_test.py: Unit coverage for new helpers
  • Documentation updates: docs/trainers/grpo.md, RL_ALGORITHMS.md, INFERENCE_ENGINES.md, RL_UPDATES.md, RL.md, VINEPPO_IMPLEMENTATION_PLAN.md
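
For context, a hedged sketch of the loss shape in _fn.py — a PPO-style ratio objective with asymmetric (DAPO "clip-higher") bounds and token-level weighting; the argument names (`eps_low`, `eps_high`, `mask`) are assumptions, not the actual signature:

```python
# Sketch of an asymmetrically clipped, token-weighted PPO ratio loss;
# argument names are illustrative, not the actual _fn.py signature.
import jax.numpy as jnp


def clipped_ratio_loss(log_probs, old_log_probs, advantages, mask,
                       eps_low=0.2, eps_high=0.28):
    """log_probs, old_log_probs, mask: [batch, seq]; advantages: [batch]."""
    ratio = jnp.exp(log_probs - old_log_probs)            # per-token importance ratio
    adv = advantages[:, None]                             # broadcast to token level
    unclipped = ratio * adv
    clipped = jnp.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = -jnp.minimum(unclipped, clipped)          # pessimistic PPO objective
    # Token-level weighting: average over every valid token in the batch,
    # so long completions are not down-weighted relative to short ones.
    return (per_token * mask).sum() / jnp.maximum(mask.sum(), 1.0)
```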

Testing

python3 -m pytest tests/trainers/grpo_utils_test.py

Follow-ups

  • Decide on VinePPO-Lite vs. full VinePPO implementation (see VINEPPO_IMPLEMENTATION_PLAN.md)
  • Consider renaming documentation references from "VinePPO advantages" to "GAE advantages" after feature flag decisions

Notes

  • No behaviour changes for existing configs; defaults preserve legacy GRPO behaviour
  • The inference docs mention the external ejkernel optimisations

…stack

Enhance GRPOTrainer with advanced RL features:
- Add asymmetric clipping, per-token weighting, length shaping
- Support multiple advantage estimators (mean, median, GAE)
- Implement KL/entropy guards and reference model resets (reset guard sketched after this list)
- Add group-normalized advantages and DAPO length shaping
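
As a rough illustration of the ProRL-style reset mentioned above — the threshold name and update mechanics here are assumptions, not the shipped API:

```python
# Hypothetical guard: hard-reset the reference policy when the running KL
# drifts past a threshold, keeping the KL penalty meaningful over long runs.
def maybe_reset_reference(ref_params, policy_params, mean_kl,
                          kl_reset_threshold=10.0):
    if mean_kl > kl_reset_threshold:
        return policy_params  # reference snaps to the current policy weights
    return ref_params
```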

Improve DPO metrics logging:
- Add KL divergence, entropy, and logit margin tracking
- Ensure consistent dashboard metrics across trainers (a metrics sketch follows this list)
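
A sketch of what those metrics could look like; the inputs are per-example summed log-probs, and the exact definitions in _fn.py may differ:

```python
# Illustrative metric computations; tensor names and formulas are assumptions.
import jax.numpy as jnp


def dpo_extra_metrics(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps):
    # Monte Carlo estimate of KL(policy || reference) over both completions.
    kl = 0.5 * jnp.mean((policy_chosen_logps - ref_chosen_logps)
                        + (policy_rejected_logps - ref_rejected_logps))
    # Entropy proxy: negative mean log-prob the policy assigns to completions.
    entropy = -jnp.mean(policy_chosen_logps)
    # Margin between chosen and rejected under the policy's own log-probs.
    logit_margin = jnp.mean(policy_chosen_logps - policy_rejected_logps)
    return {"kl": kl, "entropy": entropy, "logit_margin": logit_margin}
```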

Documentation updates:
- Add RL_ALGORITHMS.md documenting the supported RL algorithms
- Add INFERENCE_ENGINES.md documenting vInference/vSurge/vWhisper
- Add RL_UPDATES.md with release notes
- Add VINEPPO_IMPLEMENTATION_PLAN.md for future work
- Update docs/trainers/grpo.md with new configuration options

Testing:
- Add tests/trainers/grpo_utils_test.py for utility functions
opooladz (Author) commented Oct 4, 2025

Let's see how Codex did.
