⚡ Group Contrastive Policy Optimization (GCPO)
Official repository of the paper: GCPO: When Contrast Fails, Go Gold
GCPO (Group Contrastive Policy Optimization) is a novel reinforcement learning algorithm designed to enhance the reasoning capabilities of language models, especially in scenarios where the model fails to generate correct responses. Unlike previous methods such as GRPO, which rely solely on the model's own rollouts, GCPO introduces Golden Answers (GAs), external reference answers that guide the model's updates when all sampled responses are incorrect (a minimal sketch follows the list below).
This approach ensures:
- ✅ Full sample utilization: no training data is wasted
- 🧠 Knowledge transfer: small models learn reasoning strategies from larger models
- 🚀 Faster convergence and better generalization
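The golden-answer mechanism described above can be summarized in a few lines. Below is a minimal sketch, assuming binary correctness rewards and a GRPO-style group-normalized advantage; the function and variable names are illustrative and are not the released training code.

```python
# Minimal sketch of Golden Answer (GA) injection, assuming binary correctness
# rewards and a GRPO-style group-normalized advantage. Names are illustrative,
# not the official GCPO training code.
from statistics import mean, pstdev

def build_group(rollouts, rewards, golden_answer, eps=1e-6):
    """rollouts: list of sampled responses; rewards: 1.0 if correct else 0.0."""
    # When every rollout in the group is wrong, contrast carries no signal:
    # inject the external golden answer as a positive sample ("go gold").
    if all(r == 0.0 for r in rewards):
        rollouts = rollouts[:-1] + [golden_answer]
        rewards = rewards[:-1] + [1.0]

    # Group-relative advantage: standardize rewards within the group so the
    # golden answer is pushed up and incorrect rollouts are pushed down.
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return rollouts, advantages
```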
- ✅ Golden Answer Injection: handles failure rollouts by injecting correct reference solutions
- ⚖️ Sequence-Level Importance Sampling: stabilizes training under sparse reward settings (see the sketch after this list)
- 🔥 Contrastive Optimization: enhances separation between good and bad reasoning traces
- ✨ No KL Penalty Needed: encourages diverse yet effective reasoning behaviors
- 🌍 Generalizable: works on math, code, and logical QA tasks
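For the sequence-level importance sampling and the KL-free clipped objective listed above, here is a minimal PyTorch sketch. The length-normalized sequence ratio, the clip range, and the tensor layout are assumptions made for illustration rather than the exact formulation in the paper.

```python
# Sketch of a sequence-level, clipped policy-gradient loss without a KL term.
# Per-token log-prob differences are aggregated into one importance ratio per
# sequence before clipping; shapes and the clip range are assumptions.
import torch

def gcpo_style_loss(logp_new, logp_old, mask, advantages, clip_eps=0.2):
    """
    logp_new, logp_old: [batch, seq_len] token log-probs under the current
                        and behavior policies; mask marks response tokens.
    advantages: [batch] group-normalized sequence-level advantages.
    """
    # Sequence-level importance ratio: exponentiate the mean (masked) token
    # log-ratio so one ratio is assigned per sequence, not per token.
    log_ratio = ((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    ratio = log_ratio.exp()

    # PPO-style clipping applied at the sequence level; no KL penalty term.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```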
| Item | Status |
|---|---|
| Paper | ✅ Released |
| Model Checkpoints | ✅ Released |
| GCPO Dataset | ⏳ Coming soon |
| Code (Training + Evaluation) | ⏳ Coming soon |
We provide the model weights of GCPO-R1-1.5B, trained from DeepSeek-R1-Distill-Qwen-1.5B with the GCPO algorithm. The model is available at https://huggingface.co/Ach0/GCPO-R1-1.5B.
To evaluate the model on AIME 2024, run:

```bash
python3 vllm_eval.py \
  --model_path Ach0/GCPO-R1-1.5B \
  --test_file dataset/AIME24/aime_2024.jsonl \
  --output_path aime2024_result.jsonl \
  --tensor_parallel_size 4 \
  --mode all
```
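As a quick sanity check before running the full evaluation, you can also load the checkpoint directly with vLLM and generate a single completion. The prompt and sampling settings below are placeholders, not the configuration used in the paper.

```python
# Optional sanity check with vLLM (not the official eval script): load the
# released checkpoint and generate one completion. Sampling settings here
# are placeholders, not the values used in the paper.
from vllm import LLM, SamplingParams

llm = LLM(model="Ach0/GCPO-R1-1.5B", tensor_parallel_size=1)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

prompt = "Solve: If 3x + 5 = 20, what is x? Think step by step."
output = llm.generate([prompt], params)[0]
print(output.outputs[0].text)
```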
In our experiments, GCPO consistently outperforms DAPO.
If you find this work useful, please cite:
@article{wu2025gcpo,
  title={GCPO: When Contrast Fails, Go Gold},
  author={Hao Wu and Wei Liu},
  journal={arXiv preprint arXiv:2510.07790},
  year={2025}
}
