This work introduces Reverse Thinking Policy Optimization (RTPO), a new RL training method for LLMs built on top of TRL's GRPOTrainer.
Current GRPO-based RL methods require the model to autonomously generate a full chain-of-thought before producing the final answer. However, many training datasets already contain complete, high-quality reasoning traces that the model could benefit from.
RTPO is designed to:
- Utilize existing reasoning traces as auxiliary CoT to support early-stage rollouts.
- Force the model to gradually reconstruct its own reasoning by shortening the auxiliary CoT step by step.
- Enable a reverse learning schedule: the model first learns to output correct answers, then progressively learns how to reason.
RTPO modifies the standard GRPO rollout process:
At rollout step 0, the full reasoning chain from the dataset is concatenated into the input prompt.
Model behavior:
- Only needs to generate the final answer.
- Benefits from a high-quality reasoning scaffold.
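As a minimal sketch of this step-0 prompt construction (the field names `question`, `reasoning`, and `answer`, and the prompt template, are illustrative assumptions, not the repo's actual schema):

```python
example = {
    "question": "What is 17 * 24?",
    "reasoning": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "answer": "408",
}

def build_prompt(example: dict, aux_cot: str) -> str:
    """Concatenate the (possibly truncated) auxiliary CoT into the prompt."""
    return (
        f"Question: {example['question']}\n"
        f"Reasoning: {aux_cot}\n"
        f"Answer:"
    )

# Rollout step 0: the full dataset trace is supplied, so the policy only
# has to complete the final answer.
prompt = build_prompt(example, aux_cot=example["reasoning"])
```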
As training steps increase, RTPO gradually removes tokens from the end of the auxiliary CoT based on a configurable schedule:
full_reasoning → partial_reasoning → short_hint → empty
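The schedule shape is configurable; below is a minimal sketch of one possible linear, token-level variant (the function names and the linear form are assumptions, not necessarily the repo's schedule):

```python
def keep_fraction(step: int, total_steps: int) -> float:
    """Fraction of auxiliary-CoT tokens kept at a given training step:
    1.0 at step 0, annealing linearly to 0.0 by the final step."""
    return max(0.0, 1.0 - step / total_steps)

def truncate_cot(cot_tokens: list, step: int, total_steps: int) -> list:
    """Drop tokens from the END of the auxiliary CoT, passing through
    full_reasoning -> partial_reasoning -> short_hint -> empty."""
    n_keep = int(len(cot_tokens) * keep_fraction(step, total_steps))
    return cot_tokens[:n_keep]

# Halfway through training, roughly half of the trace remains as a hint.
partial = truncate_cot("340 + 68 = 408 .".split(), step=500, total_steps=1000)
```

Feeding the truncated trace back through a prompt builder like `build_prompt` above yields the per-step rollout prompt.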
Expected model behavior:
- "Fills in" the removed reasoning process.
- Learns to produce longer reasoning as annealing progresses.
Unexpectedly, RTPO also teaches the model to shorten its reasoning:
- When the model does not regenerate the removed tokens and instead outputs the correct final answer directly, that shortcut is reinforced rather than penalized.
- Over training, the model consistently generates shorter, more efficient reasoning chains.
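A plausible mechanism, sketched below: if the reward checks only answer correctness, a completion that skips the removed reasoning but still answers correctly scores as well as a long one, so GRPO's group-relative advantage never penalizes brevity. A TRL-style correctness reward of that shape (assuming a standard-format dataset with an `answer` column; not necessarily RTPO's exact reward):

```python
def correctness_reward(completions: list[str], answer: list[str], **kwargs) -> list[float]:
    """TRL-style reward function: 1.0 iff the gold answer appears in the
    completion. It is length-agnostic, so a short correct completion scores
    as well as a long one, consistent with the shortening effect above."""
    return [1.0 if gold in completion else 0.0
            for completion, gold in zip(completions, answer)]
```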
More experiments are ongoing and will be included later.
Install the development version of TRL from https://github.com/huggingface/trl (e.g. `pip install git+https://github.com/huggingface/trl.git`).
Check train_rtpo.py for the training entry point.
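For orientation, here is a minimal sketch of how the pieces could be wired through TRL's `GRPOTrainer` (the toy dataset, model checkpoint, and the reuse of `correctness_reward` from the sketch above are illustrative assumptions; the real setup, including the per-step CoT truncation, lives in train_rtpo.py):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy standard-format dataset: GRPO expects a "prompt" column; extra columns
# such as "answer" are forwarded to the reward function as kwargs.
train_dataset = Dataset.from_list([
    {"prompt": "Question: What is 17 * 24?\nReasoning:", "answer": "408"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative checkpoint
    reward_funcs=correctness_reward,      # from the reward sketch above
    args=GRPOConfig(output_dir="rtpo-output"),
    train_dataset=train_dataset,
)
trainer.train()
```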