opooladz commented Oct 4, 2025

Summary

  • Expand GRPOTrainer to expose ProRL/DAPO-style controls (asymmetric clipping, per-token weighting, length shaping, KL/entropy guards, selectable advantage estimators)
  • Enhance DPOTrainer metrics with KL/entropy/logit margin for consistent dashboards
  • Add shared utilities for group-normalised advantages and DAPO length shaping, plus accompanying unit tests (a sketch of the helpers follows this list)
  • Document the overall RL pipeline (RL_ALGORITHMS.md) and the inference engines (INFERENCE_ENGINES.md), add release notes (RL_UPDATES.md), and provide a VinePPO roadmap (VINEPPO_IMPLEMENTATION_PLAN.md)
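
A minimal sketch of the two shared helpers, assuming illustrative names (`group_normalized_advantages`, `dapo_length_penalty`, `max_len`, `cache_len`) rather than the exact easydel signatures:

```python
# Illustrative sketch only; names and shapes are assumptions, not the easydel API.
import jax.numpy as jnp


def group_normalized_advantages(rewards: jnp.ndarray, eps: float = 1e-6) -> jnp.ndarray:
    """Normalise rewards within each group of samples drawn from the same prompt.

    rewards: [num_prompts, group_size] scalar rewards, one row per prompt.
    """
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)


def dapo_length_penalty(lengths: jnp.ndarray, max_len: int, cache_len: int) -> jnp.ndarray:
    """DAPO-style soft overlong punishment.

    Zero penalty up to (max_len - cache_len) tokens, then a linear ramp down
    to -1 at max_len; anything longer is clipped at -1.
    """
    overflow = lengths - (max_len - cache_len)
    return jnp.clip(-overflow / cache_len, -1.0, 0.0)
```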

Changes

  • easydel/trainers/group_relative_policy_optimization/grpo_config.py: New config knobs (advantage estimator, clip bounds, length shaping, KL resets, sampling safeguards)
  • easydel/trainers/group_relative_policy_optimization/grpo_trainer.py: Preprocess rewards with length shaping, compute advanced metrics, manage reference resets (ProRL-style)
  • easydel/trainers/group_relative_policy_optimization/_fn.py: PPO-style ratio loss with asymmetric clipping and token weighting (sketched after this list)
  • easydel/trainers/training_utils.py: Shared helpers for group advantages, length shaping, EMA tracking
  • easydel/trainers/direct_preference_optimization_trainer/_fn.py: Richer logging metrics
  • tests/trainers/grpo_utils_test.py: Unit coverage for new helpers
  • Documentation updates: docs/trainers/grpo.md, RL_ALGORITHMS.md, INFERENCE_ENGINES.md, RL_UPDATES.md, RL.md, VINEPPO_IMPLEMENTATION_PLAN.md
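
For context, a hedged sketch of the loss shape in _fn.py — a PPO-style ratio objective with asymmetric (DAPO "clip-higher") bounds and token-level weighting; the argument names (`eps_low`, `eps_high`, `mask`) are assumptions, not the actual signature:

```python
# Sketch of an asymmetrically clipped, token-weighted PPO ratio loss;
# argument names are illustrative, not the actual _fn.py signature.
import jax.numpy as jnp


def clipped_ratio_loss(log_probs, old_log_probs, advantages, mask,
                       eps_low=0.2, eps_high=0.28):
    """log_probs, old_log_probs, mask: [batch, seq]; advantages: [batch]."""
    ratio = jnp.exp(log_probs - old_log_probs)            # per-token importance ratio
    adv = advantages[:, None]                             # broadcast to token level
    unclipped = ratio * adv
    clipped = jnp.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = -jnp.minimum(unclipped, clipped)          # pessimistic PPO objective
    # Token-level weighting: average over every valid token in the batch,
    # so long completions are not down-weighted relative to short ones.
    return (per_token * mask).sum() / jnp.maximum(mask.sum(), 1.0)
```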

Testing

python3 -m pytest tests/trainers/grpo_utils_test.py

Follow-ups

  • Decide on VinePPO-Lite vs. full VinePPO implementation (see VINEPPO_IMPLEMENTATION_PLAN.md)
  • Consider renaming documentation references from "VinePPO advantages" to "GAE advantages" after feature flag decisions

Notes

  • No behaviour changes for existing configs; defaults preserve legacy GRPO behaviour
  • The inference docs mention the external ejkernel optimisations

…stack

Enhance GRPOTrainer with advanced RL features:
- Add asymmetric clipping, per-token weighting, length shaping
- Support multiple advantage estimators (mean, median, GAE)
- Implement KL/entropy guards and reference model resets (reset guard sketched after this list)
- Add group-normalized advantages and DAPO length shaping
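
As a rough illustration of the ProRL-style reset mentioned above — the threshold name and update mechanics here are assumptions, not the shipped API:

```python
# Hypothetical guard: hard-reset the reference policy when the running KL
# drifts past a threshold, keeping the KL penalty meaningful over long runs.
def maybe_reset_reference(ref_params, policy_params, mean_kl,
                          kl_reset_threshold=10.0):
    if mean_kl > kl_reset_threshold:
        return policy_params  # reference snaps to the current policy weights
    return ref_params
```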

Improve DPO metrics logging:
- Add KL divergence, entropy, and logit margin tracking
- Ensure consistent dashboard metrics across trainers (a metrics sketch follows this list)
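
A sketch of what those metrics could look like; the inputs are per-example summed log-probs, and the exact definitions in _fn.py may differ:

```python
# Illustrative metric computations; tensor names and formulas are assumptions.
import jax.numpy as jnp


def dpo_extra_metrics(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps):
    # Monte Carlo estimate of KL(policy || reference) over both completions.
    kl = 0.5 * jnp.mean((policy_chosen_logps - ref_chosen_logps)
                        + (policy_rejected_logps - ref_rejected_logps))
    # Entropy proxy: negative mean log-prob the policy assigns to completions.
    entropy = -jnp.mean(policy_chosen_logps)
    # Margin between chosen and rejected under the policy's own log-probs.
    logit_margin = jnp.mean(policy_chosen_logps - policy_rejected_logps)
    return {"kl": kl, "entropy": entropy, "logit_margin": logit_margin}
```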

Documentation updates:
- Add RL_ALGORITHMS.md documenting the supported RL algorithms
- Add INFERENCE_ENGINES.md documenting vInference/vSurge/vWhisper
- Add RL_UPDATES.md with release notes
- Add VINEPPO_IMPLEMENTATION_PLAN.md for future work
- Update docs/trainers/grpo.md with new configuration options

Testing:
- Add tests/trainers/grpo_utils_test.py for utility functions
opooladz (Author) commented Oct 4, 2025

Let's see how Codex did.
