This repository contains the official implementation for the paper "Answer Convergence as a Signal for Early Stopping in Reasoning" (EMNLP 2025).
Chain-of-thought (CoT) prompting in large language models often leads to verbose and redundant outputs, significantly increasing inference cost. This work investigates when models converge to their final answers during reasoning and proposes three inference-time strategies for efficient early stopping:
- Early Stopping via Answer Consistency: Stops reasoning when answers converge across partial reasoning trajectories
- Boost Think Sampling: Enhances end-of-reasoning signals to encourage earlier convergence
- Learn to Stop (LSTM-based): A supervised method that predicts optimal stopping points using internal model activations
Our experiments show significant token reduction (up to 40%) with little to no accuracy drop, and even improved accuracy on NaturalQuestions.
Title: Answer Convergence as a Signal for Early Stopping in Reasoning
Authors: Xin Liu, Lu Wang
Affiliation: Computer Science and Engineering, University of Michigan, Ann Arbor
Paper: arXiv:2506.02536
Requirements:

- Python 3.8+
- CUDA-capable GPU(s)
- PyTorch
- vLLM
- Other dependencies (see below)
To install:

- Clone this repository:

```bash
git clone <repository-url>
cd <repository-name>
```

- Install dependencies:

```bash
pip install torch transformers vllm tqdm numpy scikit-learn nltk openai datasets
```

- Download NLTK data (required for sentence tokenization):
```python
import nltk
nltk.download('punkt')
```

For answer evaluation (correctness label generation), you need to run a vLLM server with an OpenAI-compatible API:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --port 8000
```

Update the API endpoint in `src/get_accuracy_label.py` if your server runs on a different host/port:
```python
openai_api_base = "http://localhost:8000/v1"  # Update to your server endpoint
```

This repository contains three main modules, each implementing a different early stopping strategy:
- Early Stop via Answer Consistency: Stops when answers converge across partial reasoning steps
- Boost Think Sampling: Enhances end-of-reasoning signals
- Learn to Stop (LSTM): Supervised learning approach using hidden states
See the respective README files for detailed usage instructions for each module.
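For intuition, here is a minimal, self-contained sketch of the answer-consistency idea. It is illustrative only: `generate_next_step` and `answer_from_partial` are hypothetical stand-ins for the model calls made by the actual implementation in `src/early_stopping_via_consistency.py`, which may differ in its details.

```python
# Minimal sketch of early stopping via answer consistency (illustrative only).
# generate_next_step() and answer_from_partial() are hypothetical stand-ins for
# the model calls made by the actual implementation.

def early_stop_via_consistency(question, generate_next_step, answer_from_partial,
                               patience=2, max_steps=64):
    """Stop once the answer induced by partial reasoning stays the same for
    `patience` consecutive steps."""
    trajectory = []                      # reasoning sentences generated so far
    last_answer, streak = None, 0
    for _ in range(max_steps):
        step = generate_next_step(question, trajectory)
        if step is None:                 # model ended its reasoning on its own
            break
        trajectory.append(step)
        answer = answer_from_partial(question, trajectory)  # force an answer now
        if answer is not None and answer == last_answer:
            streak += 1
            if streak >= patience:       # answers have converged -> stop early
                break
        else:
            last_answer, streak = answer, 0
    return last_answer, trajectory
```

The key knob in such a scheme is the `patience` threshold: requiring the induced answer to repeat across a few consecutive partial trajectories trades a little extra reasoning for a more reliable convergence signal.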
The repository is organized as follows:

```
.
├── script/                              # Execution scripts for all modules
│   ├── basic_LLMs_sampling_only_answer.sh
│   ├── boost_think_sampling.sh
│   ├── collect_training_trajectories.sh
│   ├── collect_eval_trajectories.sh
│   ├── eval.sh
│   ├── evaluate_lstm.sh
│   └── ...
├── src/                                 # Core implementation code
│   ├── sample_generation_only_answer.py
│   ├── sample_generation_boost_think.py
│   ├── collect_partial_reasoning_res.py
│   ├── get_partial_reasoning_hidden_states.py
│   ├── get_accuracy_label.py
│   ├── early_stopping_via_consistency.py
│   ├── train_lstm.py
│   ├── evaluate_lstm.py
│   ├── calculate_accuracy.py
│   ├── math_util.py
│   └── math_equivalence.py
├── chat_template/                       # Chat templates for model inference
│   └── r1_qwen.jinja
├── data/
│   ├── processed_datasets/              # Input datasets (train/eval splits)
│   └── output/                          # Output directory for all results
├── README.md                            # This file
├── README_boost_think.md                # Boost Think module documentation
├── README_early_stop_consistency.md     # Answer Consistency module documentation
└── README_learn_to_stop.md              # LSTM Learn-to-Stop module documentation
```
The code supports the following datasets:
- GSM8K: Math word problems
- MATH: Competition mathematics problems
- NaturalQuestions (NQ): Open-domain QA
- GPQA: Graduate-level science questions
- AIME: American Invitational Mathematics Examination problems
Place your dataset files in `data/processed_datasets/` with the following naming convention:

- `{dataset_name}_train.jsonl` for training data
- `{dataset_name}_eval.jsonl` for evaluation data

Each line should be a JSON object with `question` and `answer` fields.
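For example, one line of a `gsm8k_eval.jsonl` file might look like the following (the exact contents of the `answer` field, e.g. final answer versus full solution, depend on how you preprocess each dataset):

```json
{"question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "answer": "72"}
```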
To run the Answer Consistency pipeline:

```bash
cd script

# Step 1: Collect full trajectories
bash basic_LLMs_sampling_only_answer.sh

# Step 2: Extract partial reasoning and hidden states
bash get_partial_res_hidden_states_1gpu_eval.sh

# Step 3: Generate correctness labels (requires vLLM server)
bash eval.sh

# Step 4: Sample via consistency
bash sample_via_answer_consistency.sh
```

To run Boost Think Sampling:

```bash
cd script
bash boost_think_sampling.sh
```

To run Learn to Stop (LSTM):

```bash
cd script
# Training phase
bash collect_training_trajectories.sh
bash collect_partial_reasoning_training.sh
bash collect_training_hidden_states.sh
bash train_lstm.sh
# Evaluation phase
bash collect_eval_trajectories.sh
bash collect_eval_hidden_states.sh
bash eval.sh # Requires vLLM server
bash evaluate_lstm.sh
```

Most configuration parameters can be modified in the shell scripts:

- Models: Edit the `MODELS` array in each script
- Datasets: Edit the `DATASET_PATHS` array
- GPU Configuration: Modify `TOTAL_GPUS` and the GPU assignment logic
- Output Paths: Update the `OUTPUT_PATH` variables
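For instance, a configuration block at the top of one of the scripts might look roughly like this; the variable names match the scripts, but the model, dataset, and path values below are placeholders:

```bash
# Hypothetical configuration excerpt -- adjust values to your setup.
MODELS=("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")               # models to evaluate
DATASET_PATHS=("../data/processed_datasets/gsm8k_eval.jsonl")    # evaluation data
TOTAL_GPUS=4                                                     # GPUs available for the run
OUTPUT_PATH="../data/output"                                     # where results are written
```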
For LSTM training, hyperparameters can be adjusted in src/train_lstm.py.
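To make the Learn-to-Stop setup concrete, below is a rough PyTorch sketch of an LSTM that reads per-step hidden states and predicts whether reasoning can stop. The layer sizes, batching, and loss here are assumptions for illustration, not the exact architecture or hyperparameters in `src/train_lstm.py`:

```python
# Illustrative LSTM stop-predictor over per-step LLM hidden states (assumed shapes).
import torch
import torch.nn as nn

class StopPredictor(nn.Module):
    def __init__(self, hidden_size=4096, lstm_size=512):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, lstm_size, batch_first=True)
        self.head = nn.Linear(lstm_size, 1)

    def forward(self, states):
        # states: (batch, num_partial_steps, hidden_size), one hidden state per partial step
        out, _ = self.lstm(states)
        return self.head(out).squeeze(-1)  # (batch, num_partial_steps) stop logits

# One training step on dummy data; real labels would come from the correctness-label
# generation step (1 = stopping after this partial step already yields the right answer).
model = StopPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

states = torch.randn(8, 20, 4096)            # dummy batch: 8 questions, 20 partial steps each
labels = (torch.rand(8, 20) > 0.5).float()   # dummy stop/continue labels

loss = loss_fn(model(states), labels)
loss.backward()
optimizer.step()
```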
Key features:

- Three Early Stopping Strategies: Answer consistency, boost think, and LSTM-based learning
- Efficient Inference: Significant token reduction (up to 40%) with minimal accuracy loss
- Flexible Configuration: Easy to adapt for different models and datasets
- Comprehensive Evaluation: Includes correctness label generation and accuracy calculation tools
Our methods achieve:
- Token Reduction: Up to 40% reduction in generated tokens
- Accuracy: Maintained or improved accuracy across datasets
- Efficiency: Real-time early stopping without model retraining
See the paper for detailed experimental results and analysis.
If you use this code in your research, please cite:
```bibtex
@article{liu2025answer,
  title={Answer Convergence as a Signal for Early Stopping in Reasoning},
  author={Liu, Xin and Wang, Lu},
  journal={arXiv preprint arXiv:2506.02536},
  year={2025}
}
```

For questions or issues, please contact:
- Xin Liu: liuxincs@umich.edu
We thank the vLLM team for their excellent inference framework and the open-source community for their valuable tools and datasets.