
Answer Convergence as a Signal for Early Stopping in Reasoning

This repository contains the official implementation for the paper "Answer Convergence as a Signal for Early Stopping in Reasoning" (EMNLP 2025).

Overview

Chain-of-thought (CoT) prompting in large language models often leads to verbose and redundant outputs, significantly increasing inference cost. This work investigates when models converge to their final answers during reasoning and proposes three inference-time strategies for efficient early stopping:

  1. Early Stopping via Answer Consistency: Stops reasoning when answers converge across partial reasoning trajectories
  2. Boost Think Sampling: Enhances end-of-reasoning signals to encourage earlier convergence
  3. Learn to Stop (LSTM-based): A supervised method that predicts optimal stopping points using internal model activations
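The first strategy above can be sketched in a few lines: keep extracting an intermediate answer after each partial reasoning step, and stop once the last few answers agree. This is a minimal illustration of the idea, not the paper's exact implementation; the `window` size and the equality check on answer strings are illustrative assumptions.

```python
def should_stop(answers, window=3):
    """Return True once the last `window` intermediate answers agree.

    `answers` is the list of answers extracted so far, one per partial
    reasoning trajectory. The window size is an illustrative default.
    """
    if len(answers) < window:
        return False
    tail = answers[-window:]
    return all(a == tail[0] for a in tail)


def first_stop_step(answers, window=3):
    """Return the earliest step index at which reasoning could have
    stopped under the consistency rule, or None if it never converges."""
    for i in range(window, len(answers) + 1):
        if should_stop(answers[:i], window):
            return i - 1
    return None
```

For example, with intermediate answers `["6", "7", "7", "7", "7"]` and `window=3`, the rule fires at step index 3, so the remaining reasoning tokens would never be generated.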

Our experiments show significant token reduction (up to 40%) with little to no accuracy drop, and even improved accuracy on NaturalQuestions.

Paper

Title: Answer Convergence as a Signal for Early Stopping in Reasoning
Authors: Xin Liu, Lu Wang
Affiliation: Computer Science and Engineering, University of Michigan, Ann Arbor
Paper: arXiv:2506.02536

Installation

Requirements

  • Python 3.8+
  • CUDA-capable GPU(s)
  • PyTorch
  • vLLM
  • Other dependencies (see below)

Setup

  1. Clone this repository:
git clone <repository-url>
cd <repository-name>
  2. Install dependencies:
pip install torch transformers vllm tqdm numpy scikit-learn nltk openai datasets
  3. Download NLTK data (required for sentence tokenization):
import nltk
nltk.download('punkt')

Quick Start

Setting Up vLLM Server

For answer evaluation (correctness label generation), you need to run a vLLM server with an OpenAI-compatible API:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --port 8000

Update the API endpoint in src/get_accuracy_label.py if your server runs on a different host/port:

openai_api_base = "http://localhost:8000/v1"  # Update to your server endpoint
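Requests to the server follow the standard OpenAI chat-completions format. The snippet below is a hypothetical sketch of how a correctness-judging request could be assembled; the prompt wording and the `build_judge_request` helper are illustrative, not the exact logic in `src/get_accuracy_label.py`.

```python
# Hypothetical sketch of a chat-completions payload for an
# OpenAI-compatible vLLM server; prompt wording is illustrative.
def build_judge_request(question, gold, predicted,
                        model="Qwen/Qwen2.5-32B-Instruct"):
    """Build a request body asking the judge model whether `predicted`
    matches the reference answer `gold` for `question`."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Model answer: {predicted}\n"
        "Are the two answers equivalent? Answer yes or no."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic judging
    }
```

The payload would be POSTed to `{openai_api_base}/chat/completions` (e.g. via the `openai` Python client pointed at your server's base URL).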

Running Experiments

This repository contains three main modules, each implementing a different early stopping strategy:

  1. Early Stop via Answer Consistency: Stops when answers converge across partial reasoning steps
  2. Boost Think Sampling: Enhances end-of-reasoning signals
  3. Learn to Stop (LSTM): Supervised learning approach using hidden states

See the respective README files for detailed usage instructions for each module.

Project Structure

.
├── script/                          # Execution scripts for all modules
│   ├── basic_LLMs_sampling_only_answer.sh
│   ├── boost_think_sampling.sh
│   ├── collect_training_trajectories.sh
│   ├── collect_eval_trajectories.sh
│   ├── eval.sh
│   ├── evaluate_lstm.sh
│   └── ...
├── src/                             # Core implementation code
│   ├── sample_generation_only_answer.py
│   ├── sample_generation_boost_think.py
│   ├── collect_partial_reasoning_res.py
│   ├── get_partial_reasoning_hidden_states.py
│   ├── get_accuracy_label.py
│   ├── early_stopping_via_consistency.py
│   ├── train_lstm.py
│   ├── evaluate_lstm.py
│   ├── calculate_accuracy.py
│   ├── math_util.py
│   └── math_equivalence.py
├── chat_template/                   # Chat templates for model inference
│   └── r1_qwen.jinja
├── data/
│   ├── processed_datasets/          # Input datasets (train/eval splits)
│   └── output/                       # Output directory for all results
├── README.md                         # This file
├── README_boost_think.md            # Boost Think module documentation
├── README_early_stop_consistency.md # Answer Consistency module documentation
└── README_learn_to_stop.md         # LSTM Learn-to-Stop module documentation

Datasets

The code supports the following datasets:

  • GSM8K: Math word problems
  • MATH: Competition mathematics problems
  • NaturalQuestions (NQ): Open-domain QA
  • GPQA: Graduate-level science questions
  • AIME: American Invitational Mathematics Examination problems

Place your dataset files in data/processed_datasets/ with the naming convention:

  • {dataset_name}_train.jsonl for training data
  • {dataset_name}_eval.jsonl for evaluation data

Each line should be a JSON object with question and answer fields.
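A minimal loader for this format might look as follows; the validation is a sketch of the expected schema (`question` and `answer` fields per line), not code from this repository.

```python
import json

def load_jsonl(path):
    """Yield one record per non-empty line of a JSONL dataset file.

    Each record is expected to carry the `question` and `answer`
    fields the pipeline consumes.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            if "question" not in rec or "answer" not in rec:
                raise ValueError(f"missing required fields in: {line[:80]}")
            yield rec
```

For example, a valid line would be `{"question": "What is 2+2?", "answer": "4"}`.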

Usage Examples

Example 1: Early Stopping via Answer Consistency

cd script
# Step 1: Collect full trajectories
bash basic_LLMs_sampling_only_answer.sh

# Step 2: Extract partial reasoning and hidden states
bash get_partial_res_hidden_states_1gpu_eval.sh

# Step 3: Generate correctness labels (requires vLLM server)
bash eval.sh

# Step 4: Sample via consistency
bash sample_via_answer_consistency.sh

Example 2: Boost Think Sampling

cd script
bash boost_think_sampling.sh

Example 3: Learn to Stop (LSTM)

cd script
# Training phase
bash collect_training_trajectories.sh
bash collect_partial_reasoning_training.sh
bash collect_training_hidden_states.sh
bash train_lstm.sh

# Evaluation phase
bash collect_eval_trajectories.sh
bash collect_eval_hidden_states.sh
bash eval.sh  # Requires vLLM server
bash evaluate_lstm.sh

Configuration

Most configuration parameters can be modified in the shell scripts:

  • Models: Edit the MODELS array in each script
  • Datasets: Edit the DATASET_PATHS array
  • GPU Configuration: Modify TOTAL_GPUS and GPU assignment logic
  • Output Paths: Update OUTPUT_PATH variables

For LSTM training, hyperparameters can be adjusted in src/train_lstm.py.
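At a high level, the LSTM-based stopper consumes per-step hidden states and scores each reasoning step as a potential stopping point. The module below is a hypothetical sketch of that shape; the layer sizes, single-layer LSTM, and sigmoid head are illustrative assumptions, not the exact architecture in `src/train_lstm.py`.

```python
import torch
import torch.nn as nn

class StopPredictor(nn.Module):
    """Illustrative sketch: an LSTM over per-step model hidden states
    that outputs, for each step, the probability that stopping there
    is safe. Dimensions are placeholders, not the paper's values."""

    def __init__(self, hidden_dim=4096, lstm_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, lstm_dim, batch_first=True)
        self.head = nn.Linear(lstm_dim, 1)

    def forward(self, states):
        # states: (batch, num_steps, hidden_dim)
        out, _ = self.lstm(states)                 # (batch, num_steps, lstm_dim)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, num_steps)
```

At inference time, generation would halt at the first step whose predicted probability exceeds a chosen threshold.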

Key Features

  • Three Early Stopping Strategies: Answer consistency, boost think, and LSTM-based learning
  • Efficient Inference: Significant token reduction (up to 40%) with minimal accuracy loss
  • Flexible Configuration: Easy to adapt for different models and datasets
  • Comprehensive Evaluation: Includes correctness label generation and accuracy calculation tools

Results

Our methods achieve:

  • Token Reduction: Up to 40% reduction in generated tokens
  • Accuracy: Maintained or improved accuracy across datasets
  • Efficiency: Real-time early stopping without model retraining

See the paper for detailed experimental results and analysis.

Citation

If you use this code in your research, please cite:

@article{liu2025answer,
  title={Answer Convergence as a Signal for Early Stopping in Reasoning},
  author={Liu, Xin and Wang, Lu},
  journal={arXiv preprint arXiv:2506.02536},
  year={2025}
}

Contact

For questions or issues, please contact the authors.

Acknowledgments

We thank the vLLM team for their excellent inference framework and the open-source community for their valuable tools and datasets.
