feat: extend GRPO with ProRL/DAPO controls and document RL/inference stack #228
Summary
- Extend `GRPOTrainer` to expose ProRL/DAPO-style controls (asymmetric clipping, per-token weighting, length shaping, KL/entropy guards, selectable advantage estimators)
- `DPOTrainer` metrics with KL/entropy/logit margin for consistent dashboards
- Document RL algorithms (`RL_ALGORITHMS.md`), inference engines (`INFERENCE_ENGINES.md`), release notes (`RL_UPDATES.md`), and provide a VinePPO roadmap (`VINEPPO_IMPLEMENTATION_PLAN.md`)

Changes
- `easydel/trainers/group_relative_policy_optimization/grpo_config.py`: New config knobs (advantage estimator, clip bounds, length shaping, KL resets, sampling safeguards)
- `easydel/trainers/group_relative_policy_optimization/grpo_trainer.py`: Preprocess rewards with length shaping, compute advanced metrics, manage reference resets (ProRL-style)
- `easydel/trainers/group_relative_policy_optimization/_fn.py`: PPO-style ratio loss with asymmetric clipping and token weighting
- `easydel/trainers/training_utils.py`: Shared helpers for group advantages, length shaping, EMA tracking
- `easydel/trainers/direct_preference_optimization_trainer/_fn.py`: Richer logging metrics
- `tests/trainers/grpo_utils_test.py`: Unit coverage for new helpers
- `docs/trainers/grpo.md`, `RL_ALGORITHMS.md`, `INFERENCE_ENGINES.md`, `RL_UPDATES.md`, `RL.md`, `VINEPPO_IMPLEMENTATION_PLAN.md`

Testing
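As an illustration of the kind of helper the new unit tests cover, a group-relative advantage computation could be sketched as below. This is a framework-free sketch in plain Python, not the code in `training_utils.py`: the function name `group_advantages`, its parameters, and the mean/std normalization are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, group_size, eps=1e-6):
    """Hypothetical helper: normalize each reward against its own prompt group.

    `rewards` is a flat list where every consecutive `group_size` entries
    belong to completions sampled for the same prompt.
    """
    if len(rewards) % group_size != 0:
        raise ValueError("rewards must contain a whole number of groups")
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population std over the group
        # eps keeps the division finite when every reward in a group is equal
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts, four completions each; the second group has identical rewards.
adv = group_advantages([1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 2.0, 2.0], group_size=4)
```

A group whose completions all receive the same reward yields zero advantages, so it contributes no gradient signal. Sampling safeguards of the kind mentioned in the config would presumably target that case.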
Follow-ups
- VinePPO roadmap (`VINEPPO_IMPLEMENTATION_PLAN.md`)

Notes
- `ejkernel` optimisations
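For readers unfamiliar with the asymmetric clipping and token weighting named in the summary, a minimal sketch of a PPO-style ratio loss with separate lower and upper clip bounds follows. It is written in plain Python for clarity and is not the implementation in `_fn.py`: the function name, the argument names, and the default bounds (0.2 and 0.28, the values commonly quoted for DAPO) are assumptions for illustration only.

```python
import math


def clipped_policy_loss(logp_new, logp_old, advantages, token_weights,
                        clip_low=0.2, clip_high=0.28):
    """Hypothetical per-token PPO-style loss with asymmetric clip bounds.

    All inputs are flat lists with one entry per token. A weight of 0.0
    masks a token out (for example, padding).
    """
    total, weight_sum = 0.0, 0.0
    for lp_new, lp_old, adv, w in zip(logp_new, logp_old, advantages, token_weights):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        # The ratio may rise to 1 + clip_high but fall only to 1 - clip_low.
        clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        # Pessimistic (minimum) surrogate, as in standard PPO.
        surrogate = min(ratio * adv, clipped * adv)
        total += w * surrogate
        weight_sum += w
    # Weighted mean over tokens; negated because optimizers minimize.
    return -total / max(weight_sum, 1e-8)


# One token whose probability doubled, with a positive advantage:
loss_up = clipped_policy_loss([math.log(2.0)], [0.0], [1.0], [1.0])
# One token whose probability halved, with a negative advantage:
loss_down = clipped_policy_loss([math.log(0.5)], [0.0], [-1.0], [1.0])
```

With a wider upper bound, tokens with positive advantage can be up-weighted further before clipping takes effect, while the tighter lower bound still limits how far a ratio can fall.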