
⚡ Group Contrastive Policy Optimization (GCPO)

arXiv: https://arxiv.org/abs/2510.07790 | Hugging Face: https://huggingface.co/Ach0/GCPO-R1-1.5B

Official repository for the paper "GCPO: When Contrast Fails, Go Gold".


About

GCPO (Group Contrastive Policy Optimization) is a novel reinforcement learning algorithm designed to enhance the reasoning capabilities of language models, especially in scenarios where the model fails to generate any correct response. Unlike prior methods such as GRPO, which rely solely on the model's own rollouts, GCPO introduces Golden Answers (GAs), external reference answers that guide the model's updates when all sampled responses are incorrect (a minimal sketch follows the list below).

This approach ensures:

✅ Full sample utilization: no training data is wasted
🧠 Knowledge transfer: small models learn reasoning strategies from larger models
🚀 Faster convergence and better generalization
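
To make the mechanism above concrete, here is a minimal sketch of the golden-answer injection step under our reading of the paper; the function names, the binary reward, and the GRPO-style group normalization are illustrative assumptions, not the official implementation.

```python
# Illustrative sketch only: names and reward scheme are assumptions,
# not the paper's released code.
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    reward: float  # e.g. 1.0 if the final answer is correct, else 0.0


def build_training_group(rollouts: list[Rollout], golden_answer: str) -> list[Rollout]:
    """If every sampled rollout is wrong, swap one rollout for the golden
    answer so the group still carries a positive learning signal."""
    if all(r.reward == 0.0 for r in rollouts):
        return rollouts[:-1] + [Rollout(text=golden_answer, reward=1.0)]
    return rollouts


def group_advantages(rollouts: list[Rollout]) -> list[float]:
    """GRPO-style group-normalized advantages: reward minus the group mean,
    divided by the group standard deviation (epsilon-guarded)."""
    rewards = [r.reward for r in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5
    return [(x - mean) / max(std, 1e-6) for x in rewards]
```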


🎯 Key Features

  • ✅ Golden Answer Injection: handles failure rollouts by injecting correct reference solutions
  • ⚖️ Sequence-Level Importance Sampling: stabilizes training under sparse-reward settings (see the sketch after this list)
  • 🔥 Contrastive Optimization: enhances the separation between good and bad reasoning traces
  • ✨ No KL Penalty Needed: encourages diverse yet effective reasoning behaviors
  • 📚 Generalizable: works on math, code, and logical QA tasks
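
As a rough illustration of the sequence-level importance sampling and KL-free clipped objective named above, here is a hedged PyTorch sketch; the length normalization and the clipping range are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: an assumed reading of "sequence-level importance
# sampling", not the official GCPO loss.
import torch


def sequence_level_ratio(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """One importance ratio per sequence: exp of the summed per-token
    log-prob difference, length-normalized for stability.
    All inputs have shape (batch, seq_len); mask is 1 for real tokens."""
    diff = (logp_new - logp_old) * mask        # zero out padding positions
    lengths = mask.sum(dim=-1).clamp(min=1.0)  # avoid division by zero
    return torch.exp(diff.sum(dim=-1) / lengths)  # shape: (batch,)


def clipped_policy_loss(ratio: torch.Tensor,
                        advantage: torch.Tensor,
                        eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective applied at the sequence level; note the
    absence of a KL penalty term, matching the feature list above."""
    unclipped = ratio * advantage
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```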

🚀 Coming Soon

| Item | Status |
| --- | --- |
| Paper | ✅ Released |
| Model Checkpoints | ✅ Released |
| GCPO Dataset | ⏳ Coming soon |
| Code (Training + Evaluation) | ⏳ Coming soon |

πŸ› οΈ Model Use

We provide the model weights of GCPO-R1-1.5B, trained from DeepSeek-R1-Distill-Qwen-1.5B with the GCPO algorithm. The model is available at https://huggingface.co/Ach0/GCPO-R1-1.5B.
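
For a quick local test of the checkpoint, a standard Hugging Face transformers loading snippet should work; the prompt and generation settings below are illustrative, and device_map="auto" assumes accelerate is installed.

```python
# Quick-start sketch (standard transformers API; settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ach0/GCPO-R1-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Solve: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```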

βš–οΈ Evaluation

To evaluate the model on AIME 2024, run:

python3 vllm_eval.py --model_path Ach0/GCPO-R1-1.5B --test_file dataset/AIME24/aime_2024.jsonl --output_path aime2024_result.jsonl --tensor_parallel_size 4 --mode all

📊 GCPO Improves Reasoning Performance

GCPO consistently outperforms DAPO.

Performance Comparison


🔧 GCPO Training Pipeline

GCPO Pipeline


✍️ Citation

If you find this work useful, please cite:

@article{wu2025gcpo,
  title={GCPO: When Contrast Fails, Go Gold},
  author={Hao Wu and Wei Liu},
  journal={arXiv preprint arXiv:2510.07790},
  year={2025}
}
