Open
Description
Thanks for your fantastic work!
It seems that the Wandb link only provides training information for SFT and none for PPO. Does the provided Wandb link cover the PPO training process, or is PPO training run separately after SFT?
I have had difficulty reproducing the PPO results using the default parameters in train_ppo.sh and the same environment; training does not even converge.
Can you give me some advice?