This repository contains the official implementation for the paper "Answer Convergence as a Signal for Early Stopping in Reasoning" (EMNLP 2025).
Chain-of-thought (CoT) prompting in large language models often leads to verbose and redundant outputs, significantly increasing inference cost. This work investigates when models converge to their final answers during reasoning and proposes three inference-time strategies for efficient early stopping:
- Early Stopping via Answer Consistency: Stops reasoning when answers converge across partial reasoning trajectories
- Boost Think Sampling: Enhances end-of-reasoning signals to encourage earlier convergence
- Learn to Stop (LSTM-based): A supervised method that predicts optimal stopping points using internal model activations
Our experiments show significant token reduction (up to 40%) with little to no accuracy drop, and even improved accuracy on NaturalQuestions.
Title: Answer Convergence as a Signal for Early Stopping in Reasoning
Authors: Xin Liu, Lu Wang
Affiliation: Computer Science and Engineering, University of Michigan, Ann Arbor
Paper: arXiv:2506.02536
Requirements:

- Python 3.8+
- CUDA-capable GPU(s)
- PyTorch
- vLLM
- Other dependencies (see below)
To install:

- Clone this repository:

```bash
git clone <repository-url>
cd <repository-name>
```

- Install dependencies:

```bash
pip install torch transformers vllm tqdm numpy scikit-learn nltk openai datasets
```

- Download NLTK data (required for sentence tokenization):
```python
import nltk
nltk.download('punkt')
```

For answer evaluation (correctness label generation), you need to run a vLLM server with an OpenAI-compatible API:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --port 8000
```

Update the API endpoint in `src/get_accuracy_label.py` if your server runs on a different host/port:
```python
openai_api_base = "http://localhost:8000/v1"  # Update to your server endpoint
```

This repository contains three main modules, each implementing a different early stopping strategy:
- Early Stop via Answer Consistency: Stops when answers converge across partial reasoning steps
- Boost Think Sampling: Enhances end-of-reasoning signals
- Learn to Stop (LSTM): Supervised learning approach using hidden states
See the respective README files for detailed usage instructions for each module.
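For intuition, here is a minimal, self-contained sketch of the answer-consistency idea. It is illustrative only: `generate_next_step` and `answer_from_partial` are hypothetical stand-ins for the model calls made by the actual implementation in `src/early_stopping_via_consistency.py`, which may differ in its details.

```python
# Minimal sketch of early stopping via answer consistency (illustrative only).
# generate_next_step() and answer_from_partial() are hypothetical stand-ins for
# the model calls made by the actual implementation.

def early_stop_via_consistency(question, generate_next_step, answer_from_partial,
                               patience=2, max_steps=64):
    """Stop once the answer induced by partial reasoning stays the same for
    `patience` consecutive steps."""
    trajectory = []                      # reasoning sentences generated so far
    last_answer, streak = None, 0
    for _ in range(max_steps):
        step = generate_next_step(question, trajectory)
        if step is None:                 # model ended its reasoning on its own
            break
        trajectory.append(step)
        answer = answer_from_partial(question, trajectory)  # force an answer now
        if answer is not None and answer == last_answer:
            streak += 1
            if streak >= patience:       # answers have converged -> stop early
                break
        else:
            last_answer, streak = answer, 0
    return last_answer, trajectory
```

The key knob in such a scheme is the `patience` threshold: requiring the induced answer to repeat across a few consecutive partial trajectories trades a little extra reasoning for a more reliable convergence signal.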
The repository is organized as follows:

```
.
├── script/                              # Execution scripts for all modules
│   ├── basic_LLMs_sampling_only_answer.sh
│   ├── boost_think_sampling.sh
│   ├── collect_training_trajectories.sh
│   ├── collect_eval_trajectories.sh
│   ├── eval.sh
│   ├── evaluate_lstm.sh
│   └── ...
├── src/                                 # Core implementation code
│   ├── sample_generation_only_answer.py
│   ├── sample_generation_boost_think.py
│   ├── collect_partial_reasoning_res.py
│   ├── get_partial_reasoning_hidden_states.py
│   ├── get_accuracy_label.py
│   ├── early_stopping_via_consistency.py
│   ├── train_lstm.py
│   ├── evaluate_lstm.py
│   ├── calculate_accuracy.py
│   ├── math_util.py
│   └── math_equivalence.py
├── chat_template/                       # Chat templates for model inference
│   └── r1_qwen.jinja
├── data/
│   ├── processed_datasets/              # Input datasets (train/eval splits)
│   └── output/                          # Output directory for all results
├── README.md                            # This file
├── README_boost_think.md                # Boost Think module documentation
├── README_early_stop_consistency.md     # Answer Consistency module documentation
└── README_learn_to_stop.md              # LSTM Learn-to-Stop module documentation
```
The code supports the following datasets:
- GSM8K: Math word problems
- MATH: Competition mathematics problems
- NaturalQuestions (NQ): Open-domain QA
- GPQA: Graduate-level science questions
- AIME: American Invitational Mathematics Examination problems
Place your dataset files in `data/processed_datasets/` with the following naming convention:

- `{dataset_name}_train.jsonl` for training data
- `{dataset_name}_eval.jsonl` for evaluation data

Each line should be a JSON object with `question` and `answer` fields.
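For example, one line of a `gsm8k_eval.jsonl` file might look like the following (the exact contents of the `answer` field, e.g. final answer versus full solution, depend on how you preprocess each dataset):

```json
{"question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "answer": "72"}
```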
To run the Answer Consistency pipeline:

```bash
cd script

# Step 1: Collect full trajectories
bash basic_LLMs_sampling_only_answer.sh

# Step 2: Extract partial reasoning and hidden states
bash get_partial_res_hidden_states_1gpu_eval.sh

# Step 3: Generate correctness labels (requires vLLM server)
bash eval.sh

# Step 4: Sample via consistency
bash sample_via_answer_consistency.sh
```

To run Boost Think Sampling:

```bash
cd script
bash boost_think_sampling.sh
```

To run Learn to Stop (LSTM):

```bash
cd script
# Training phase
bash collect_training_trajectories.sh
bash collect_partial_reasoning_training.sh
bash collect_training_hidden_states.sh
bash train_lstm.sh
# Evaluation phase
bash collect_eval_trajectories.sh
bash collect_eval_hidden_states.sh
bash eval.sh # Requires vLLM server
bash evaluate_lstm.sh
```

Most configuration parameters can be modified in the shell scripts:

- Models: Edit the `MODELS` array in each script
- Datasets: Edit the `DATASET_PATHS` array
- GPU Configuration: Modify `TOTAL_GPUS` and the GPU assignment logic
- Output Paths: Update the `OUTPUT_PATH` variables
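For instance, a configuration block at the top of one of the scripts might look roughly like this; the variable names match the scripts, but the model, dataset, and path values below are placeholders:

```bash
# Hypothetical configuration excerpt -- adjust values to your setup.
MODELS=("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")               # models to evaluate
DATASET_PATHS=("../data/processed_datasets/gsm8k_eval.jsonl")    # evaluation data
TOTAL_GPUS=4                                                     # GPUs available for the run
OUTPUT_PATH="../data/output"                                     # where results are written
```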
For LSTM training, hyperparameters can be adjusted in src/train_lstm.py.
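To make the Learn-to-Stop setup concrete, below is a rough PyTorch sketch of an LSTM that reads per-step hidden states and predicts whether reasoning can stop. The layer sizes, batching, and loss here are assumptions for illustration, not the exact architecture or hyperparameters in `src/train_lstm.py`:

```python
# Illustrative LSTM stop-predictor over per-step LLM hidden states (assumed shapes).
import torch
import torch.nn as nn

class StopPredictor(nn.Module):
    def __init__(self, hidden_size=4096, lstm_size=512):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, lstm_size, batch_first=True)
        self.head = nn.Linear(lstm_size, 1)

    def forward(self, states):
        # states: (batch, num_partial_steps, hidden_size), one hidden state per partial step
        out, _ = self.lstm(states)
        return self.head(out).squeeze(-1)  # (batch, num_partial_steps) stop logits

# One training step on dummy data; real labels would come from the correctness-label
# generation step (1 = stopping after this partial step already yields the right answer).
model = StopPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

states = torch.randn(8, 20, 4096)            # dummy batch: 8 questions, 20 partial steps each
labels = (torch.rand(8, 20) > 0.5).float()   # dummy stop/continue labels

loss = loss_fn(model(states), labels)
loss.backward()
optimizer.step()
```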
Key features:

- Three Early Stopping Strategies: Answer consistency, boost think, and LSTM-based learning
- Efficient Inference: Significant token reduction (up to 40%) with minimal accuracy loss
- Flexible Configuration: Easy to adapt for different models and datasets
- Comprehensive Evaluation: Includes correctness label generation and accuracy calculation tools
Our methods achieve:
- Token Reduction: Up to 40% reduction in generated tokens
- Accuracy: Maintained or improved accuracy across datasets
- Efficiency: Real-time early stopping without model retraining
See the paper for detailed experimental results and analysis.
If you use this code in your research, please cite:
```bibtex
@article{liu2025answer,
  title={Answer Convergence as a Signal for Early Stopping in Reasoning},
  author={Liu, Xin and Wang, Lu},
  journal={arXiv preprint arXiv:2506.02536},
  year={2025}
}
```

For questions or issues, please contact:
- Xin Liu: liuxincs@umich.edu
We thank the vLLM team for their excellent inference framework and the open-source community for their valuable tools and datasets.