UC Berkeley
- July 2025: Paper accepted at SCALR Workshop @ COLM 2025!
This repository provides a framework for evaluating large language models on multilingual reasoning tasks with language compliance validation. The system supports cross-lingual evaluation where models can be prompted in one language and asked to reason in another, enabling research on language-specific reasoning capabilities.
Key Features:
- Multilingual Evaluation: Support for 10 languages (English, Chinese, French, Japanese, Latvian, Swahili, Telugu, Thai, Afrikaans, Marathi)
- Local & Cloud Generation: Local inference via vLLM and OpenAI API integration for answer verification
- Language Compliance: Robust language validation using Lingua with script-specific heuristics for non-Latin scripts
- Multiple Benchmarks: MT-MATH-500, MT-AIME (2024/2025), and MMLU-ProX-Lite datasets
- Two-Phase Evaluation: Separate raw generation and labeling phases for efficient resumption and verification
- Cross-Lingual Evaluation: Prompt in one language while enforcing reasoning in another
Repository Structure:
Multilingual-Long-CoT/
├── eval/
│ ├── eval.py # Main evaluation script (vLLM + OpenAI verification)
│ ├── score.py # Scoring script for computing accuracy
│ └── benchmarks/ # Local dataset files
│ ├── mt-math-500/ # 500 math problems × 10 languages
│ ├── mt-aime-2025/ # AIME 2025 problems × 10 languages
│ └── mmlu-prox-lite/ # MMLU subset × 9 languages
│ # Note: mt-aime-2024 is downloaded from HuggingFace on demand
├── utils/
│ ├── generate.py # Unified generation (vLLM + OpenAI)
│ ├── language_validation.py # Language compliance validation
│ └── prompts.py # Multilingual prompt templates
└── environment.yml # Conda environment specification
- Python 3.12
- CUDA-capable GPU(s) for local model inference
- Conda or Mamba package manager
- Clone the repository:
git clone https://github.com/Berkeley-NLP/Multilingual-Long-CoT.git
cd Multilingual-Long-CoT
- Create the conda environment:
conda env create -f environment.yml
conda activate multilingual
- Set up the OpenAI API key (for answer verification):
# Create a .env file in the repository root
echo "OPENAI_API_KEY=your_api_key_here" > .envThe evaluation script uses gpt-5-mini for answer verification. If you don't have an API key, you can still run evaluations, but fuzzy answer verification will fail (strict verification still works).
All datasets are available on HuggingFace: Multilingual-S1K Collection
| Dataset | eval.py Name | Problems | Languages | Description |
|---|---|---|---|---|
| MT-MATH-500 | math500 | 500 | 10 | Math reasoning problems across difficulty levels |
| MT-AIME-2024 | aime2024 | 30 | 10 | American Invitational Mathematics Examination 2024 |
| MT-AIME-2025 | aime2025 | 30 | 10 | American Invitational Mathematics Examination 2025 |
| MMLU-ProX-Lite | mmlu | Varied | 9 | MMLU subset (excludes Latvian) |
| AIME Combined | aime-combined | 60 | 10 | Combined AIME 2024 + 2025 datasets |
Supported Languages: English (en), Chinese (zh), French (fr), Japanese (ja), Latvian (lv), Swahili (sw), Telugu (te), Thai (th), Afrikaans (af), Marathi (mr)
Dataset Availability:
- MT-MATH-500, MT-AIME-2025, MMLU-ProX-Lite: Included locally in eval/benchmarks/
- MT-AIME-2024: Downloaded from HuggingFace automatically when first used (amphora/MCLM dataset)
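If you want to inspect the MT-AIME-2024 source data directly, it can be loaded with the `datasets` library. A minimal sketch follows; the `MT-AIME2024` config name is an assumption, so check the dataset card on HuggingFace for the exact configuration names.

```python
from datasets import load_dataset

# Load the AIME 2024 subset of amphora/MCLM; the config name is assumed here.
aime_2024 = load_dataset("amphora/MCLM", "MT-AIME2024")
print(aime_2024)
```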
Evaluate a model on English math problems with English reasoning:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results
After evaluation completes, compute accuracy scores:
python -m eval.score \
--root_path eval/results \
--output_dir eval/score_result
This generates CSV files with accuracy metrics for each dataset and model.
The main evaluation script performs two-phase evaluation:
- Phase 1 (Generation): Generates model responses using vLLM and saves raw outputs
- Phase 2 (Labeling): Verifies answers and validates language compliance
--model MODEL_PATH # Model name or path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
--dataset DATASET # Dataset: math500, mmlu, aime2024, aime2025, aime-combined
--query_language LANGUAGE # Language of the input problem
--reasoning_language LANGUAGE # Language the model should reason in
--output_path PATH # Output directory for results
--sample # Use only 5 examples for testing
--max_model_len N # Maximum sequence length (default: 32768)
--n_samples N # Number of samples per question (default: 1)
--language_forcing # Force model to reason in specified language
--tp_size N # Tensor parallel size (default: auto-detect GPUs)
--confidence_threshold FLOAT # Language detection confidence (default: 0.9)
--compliance_threshold FLOAT # Language compliance threshold (default: 0.9)
--echo_prompt # Print sample prompt for debugging
Prompt the model in French but have it reason in English:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language French \
--reasoning_language English \
--output_path eval/results
Force the model to reason in Japanese:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language Japanese \
--reasoning_language Japanese \
--language_forcing \
--output_path eval/results
Note: --language_forcing is required when:
- `reasoning_language` differs from `query_language`, OR
- `reasoning_language` is not English
Run a quick test with only 5 examples:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--sample
Generate 5 responses per question:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--n_samples 5
Evaluate on both AIME years together:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset aime-combined \
--query_language English \
--reasoning_language English \
--output_path eval/results
Compute accuracy metrics from evaluation results.
python -m eval.score \
--root_path eval/results \
--output_dir eval/score_result
Score specific models and datasets:
python -m eval.score \
--root_path eval/results \
--models deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--datasets math500 aime2025 \
--output_dir eval/score_result
Score specific language pairs:
python -m eval.score \
--root_path eval/results \
--languages English-English French-French English-Japanese \
--output_dir eval/score_result
Each line in the .jsonl files contains:
{
"row_index": 0,
"sample_idx": 0,
"question": "Problem statement...",
"response": "Model's full response...",
"answer": "Ground truth answer",
"language": "French",
"reasoning_language": "English",
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
"dataset": "math500",
"n_samples": 1,
"answer_alone": 1,
"language_compliance": 1,
"language_compliance_score": 0.95,
"language_filtered_chunks": 42,
"language_invalid_chunks": [],
"correct": 1
}
Key Fields:
- `answer_alone`: 1 if the extracted answer matches the ground truth (regardless of language)
- `language_compliance`: 1 if the response is in the target language (≥ compliance threshold)
- `language_compliance_score`: Fraction of text chunks in the target language
- `correct`: 1 if both `answer_alone` and `language_compliance` are 1
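As an example of working with these labeled files, the sketch below loads one results file and aggregates the key fields. The file path shown is hypothetical; substitute your own output under --output_path.

```python
import json
from pathlib import Path

def summarize(results_path: str) -> dict:
    """Aggregate answer accuracy, language compliance, and joint correctness
    from a labeled results file (one JSON object per line)."""
    lines = Path(results_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    n = len(records)
    return {
        "n": n,
        "answer_accuracy": sum(r["answer_alone"] for r in records) / n,
        "language_compliance": sum(r["language_compliance"] for r in records) / n,
        "correct": sum(r["correct"] for r in records) / n,
    }

# Example (hypothetical path):
# print(summarize("eval/results/math500/English-English.jsonl"))
```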
Each CSV contains a row per model with columns for each language pair:
model,English-English,French-French,Japanese-Japanese,AVG
meta-llama_Llama-3.3-70B-Instruct,0.856,0.782,0.734,0.791
Create a .env file in the repository root:
# Required for answer verification
OPENAI_API_KEY=your_api_key_here
# Optional: vLLM tuning
VLLM_TP_SIZE=8 # Tensor parallel size (default: auto-detect)
VLLM_GPU_MEM_UTIL=0.90 # GPU memory utilization (default: 0.90)
VLLM_MAX_MODEL_LEN=32768 # Max model length (default: 32768)
The framework uses two thresholds for language compliance:
- `confidence_threshold` (default: 0.9): Minimum Lingua confidence for a text chunk to be considered in the target language
- `compliance_threshold` (default: 0.9): Minimum fraction of chunks that must be in the target language
For non-Latin scripts (Chinese, Japanese, Thai, Telugu, Marathi), the system also uses script-based heuristics:
- Japanese: Requires ≥5% Kana characters to avoid misclassifying pure Kanji as Chinese
- Chinese: Rejects text with significant Kana presence
- Others: Script ratio must be ≥75% for the target script
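For intuition, here is a minimal sketch of chunk-level compliance scoring with Lingua. It is illustrative only; the actual chunking and script heuristics live in utils/language_validation.py and are more involved.

```python
from lingua import Language, LanguageDetectorBuilder

# Build a detector restricted to the languages of interest.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.JAPANESE, Language.CHINESE
).build()

def compliance_score(text: str, target: Language, confidence_threshold: float = 0.9) -> float:
    """Fraction of text chunks whose Lingua confidence for the target language
    clears the confidence threshold (simple sentence splits stand in for the real chunking)."""
    chunks = [c.strip() for c in text.split(".") if c.strip()]
    if not chunks:
        return 0.0
    compliant = sum(
        detector.compute_language_confidence(chunk, target) >= confidence_threshold
        for chunk in chunks
    )
    return compliant / len(chunks)

# A response is compliant if its score clears the compliance threshold, e.g.:
# compliance_score(response_text, Language.ENGLISH) >= 0.9
```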
The evaluation automatically saves progress:
- Raw generation (`*.raw.jsonl`): Saved immediately after generation
- Labeled results (`*.jsonl`): Saved after answer verification and language validation
If evaluation is interrupted:
- If only raw generation completed: Resume from labeling phase (no model loading)
- If labeling completed: Skip entirely
- Otherwise: Restart from generation
The script automatically detects and resumes incomplete evaluations. Simply re-run the same command.
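The decision roughly amounts to checking which output files already exist. A simplified sketch, assuming the file-naming patterns above; the helper and argument names are illustrative, not the script's actual internals.

```python
from pathlib import Path

def resume_phase(output_path: str, run_name: str) -> str:
    """Decide where an interrupted evaluation should resume, based on which files exist."""
    labeled = Path(output_path) / f"{run_name}.jsonl"
    raw = Path(output_path) / f"{run_name}.raw.jsonl"
    if labeled.exists():
        return "skip"      # labeling already finished
    if raw.exists():
        return "label"     # reuse raw generations; no model loading needed
    return "generate"      # start from scratch
```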
Print a sample formatted prompt to verify prompt construction:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--sample \
--echo_prompt
Use datasets from a custom location:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--benchmarks_dir /path/to/custom/benchmarks
Please consider citing our paper if you find our work or dataset helpful for your research:
@inproceedings{barua2025long,
  title={Long Chain-of-Thought Reasoning Across Languages},
  author={Josh Barua and Seun Eisape and Kayo Yin and Alane Suhr},
  year={2025},
  eprint={2508.14828},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.14828}
}