@lintangsutawika
Collaborator

  1. Add more rewards, including cosine rewards.
  2. Handle both step-wise and non-step-wise training.
  3. Add support for non-thinking.
  4. Maintain the max length but train on a shorter, adjustable train length, so that rollout length and training length can be decoupled.
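
To make items 1 and 4 concrete, here is a minimal sketch, not the code in this PR. It assumes a cosine reward means a cosine interpolation of the reward over response length, and that decoupling means truncating a rollout to a shorter train length before the training step. All function names, parameters, and default values below are hypothetical.

```python
import math


def cosine_interpolate(length, max_length, value_at_zero, value_at_max):
    """Cosine-interpolate from value_at_zero (length 0) to value_at_max (length == max_length)."""
    # Clamp so that over-long rollouts get the same value as max_length.
    t = min(max(length, 0), max_length) / max_length
    return value_at_max + 0.5 * (value_at_zero - value_at_max) * (1.0 + math.cos(math.pi * t))


def cosine_length_reward(is_correct, length, max_length,
                         correct_short=2.0, correct_long=1.0,
                         wrong_short=-1.0, wrong_long=0.0):
    """Hypothetical length-aware reward; the default values are illustrative only."""
    if is_correct:
        # Correct answers: shorter responses earn more.
        return cosine_interpolate(length, max_length, correct_short, correct_long)
    # Wrong answers: shorter responses are penalized more.
    return cosine_interpolate(length, max_length, wrong_short, wrong_long)


def truncate_for_training(token_ids, train_length):
    """Keep at most train_length tokens of a rollout generated under a larger max length."""
    return token_ids[:train_length]
```

With these assumed defaults, a correct response scores between 2.0 (length 0) and 1.0 (at the max length), and a rollout generated under the max length can be cut to any shorter `train_length` before it is used for training.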

3 participants