UC Berkeley
- July 2025: Paper accepted at SCALR Workshop @ COLM 2025!
This repository provides a framework for evaluating large language models on multilingual reasoning tasks with language compliance validation. The system supports cross-lingual evaluation where models can be prompted in one language and asked to reason in another, enabling research on language-specific reasoning capabilities.
Key Features:
- Multilingual Evaluation: Support for 10 languages (English, Chinese, French, Japanese, Latvian, Swahili, Telugu, Thai, Afrikaans, Marathi)
- Local & Cloud Generation: Local inference via vLLM and OpenAI API integration for answer verification
- Language Compliance: Robust language validation using Lingua with script-specific heuristics for non-Latin scripts
- Multiple Benchmarks: MT-MATH-500, MT-AIME (2024/2025), and MMLU-ProX-Lite datasets
- Two-Phase Evaluation: Separate raw generation and labeling phases for efficient resumption and verification
- Cross-Lingual Evaluation: Prompt in one language while enforcing reasoning in another
Repository Structure:
Multilingual-Long-CoT/
├── eval/
│ ├── eval.py # Main evaluation script (vLLM + OpenAI verification)
│ ├── score.py # Scoring script for computing accuracy
│ └── benchmarks/ # Local dataset files
│ ├── mt-math-500/ # 500 math problems × 10 languages
│ ├── mt-aime-2025/ # AIME 2025 problems × 10 languages
│ └── mmlu-prox-lite/ # MMLU subset × 9 languages
│ # Note: mt-aime-2024 is downloaded from HuggingFace on demand
├── utils/
│ ├── generate.py # Unified generation (vLLM + OpenAI)
│ ├── language_validation.py # Language compliance validation
│ └── prompts.py # Multilingual prompt templates
└── environment.yml # Conda environment specification
- Python 3.12
- CUDA-capable GPU(s) for local model inference
- Conda or Mamba package manager
- Clone the repository:
git clone https://github.com/Berkeley-NLP/Multilingual-Long-CoT.git
cd Multilingual-Long-CoT
- Create the conda environment:
conda env create -f environment.yml
conda activate multilingual
- Set up the OpenAI API key (for answer verification):
# Create a .env file in the repository root
echo "OPENAI_API_KEY=your_api_key_here" > .envThe evaluation script uses gpt-5-mini for answer verification. If you don't have an API key, you can still run evaluations, but fuzzy answer verification will fail (strict verification still works).
All datasets are available on HuggingFace: Multilingual-S1K Collection
| Dataset | eval.py Name | Problems | Languages | Description |
|---|---|---|---|---|
| MT-MATH-500 | math500 | 500 | 10 | Math reasoning problems across difficulty levels |
| MT-AIME-2024 | aime2024 | 30 | 10 | American Invitational Mathematics Examination 2024 |
| MT-AIME-2025 | aime2025 | 30 | 10 | American Invitational Mathematics Examination 2025 |
| MMLU-ProX-Lite | mmlu | Varied | 9 | MMLU subset (excludes Latvian) |
| AIME Combined | aime-combined | 60 | 10 | Combined AIME 2024 + 2025 datasets |
Supported Languages: English (en), Chinese (zh), French (fr), Japanese (ja), Latvian (lv), Swahili (sw), Telugu (te), Thai (th), Afrikaans (af), Marathi (mr)
Dataset Availability:
- MT-MATH-500, MT-AIME-2025, MMLU-ProX-Lite: Included locally in eval/benchmarks/
- MT-AIME-2024: Downloaded from HuggingFace automatically when first used (amphora/MCLM dataset)
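If you want to inspect the MT-AIME-2024 source data directly, it can be loaded with the `datasets` library. A minimal sketch follows; the `MT-AIME2024` config name is an assumption, so check the dataset card on HuggingFace for the exact configuration names.

```python
from datasets import load_dataset

# Load the AIME 2024 subset of amphora/MCLM; the config name is assumed here.
aime_2024 = load_dataset("amphora/MCLM", "MT-AIME2024")
print(aime_2024)
```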
Evaluate a model on English math problems with English reasoning:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results
After evaluation completes, compute accuracy scores:
python -m eval.score \
--root_path eval/results \
--output_dir eval/score_result
This generates CSV files with accuracy metrics for each dataset and model.
The main evaluation script performs two-phase evaluation:
- Phase 1 (Generation): Generates model responses using vLLM and saves raw outputs
- Phase 2 (Labeling): Verifies answers and validates language compliance
--model MODEL_PATH # Model name or path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
--dataset DATASET # Dataset: math500, mmlu, aime2024, aime2025, aime-combined
--query_language LANGUAGE # Language of the input problem
--reasoning_language LANGUAGE # Language the model should reason in
--output_path PATH # Output directory for results
--sample # Use only 5 examples for testing
--max_model_len N # Maximum sequence length (default: 32768)
--n_samples N # Number of samples per question (default: 1)
--language_forcing # Force model to reason in specified language
--tp_size N # Tensor parallel size (default: auto-detect GPUs)
--confidence_threshold FLOAT # Language detection confidence (default: 0.9)
--compliance_threshold FLOAT # Language compliance threshold (default: 0.9)
--echo_prompt # Print sample prompt for debugging
Prompt the model in French but have it reason in English:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language French \
--reasoning_language English \
--output_path eval/results
Force the model to reason in Japanese:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language Japanese \
--reasoning_language Japanese \
--language_forcing \
--output_path eval/results
Note: --language_forcing is required when:
- `reasoning_language` differs from `query_language`, OR
- `reasoning_language` is not English
Run a quick test with only 5 examples:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--sample
Generate 5 responses per question:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--n_samples 5
Evaluate on both AIME years together:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset aime-combined \
--query_language English \
--reasoning_language English \
--output_path eval/results
Compute accuracy metrics from evaluation results.
python -m eval.score \
--root_path eval/results \
--output_dir eval/score_result
Score specific models and datasets:
python -m eval.score \
--root_path eval/results \
--models deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--datasets math500 aime2025 \
--output_dir eval/score_result
Score specific language pairs:
python -m eval.score \
--root_path eval/results \
--languages English-English French-French English-Japanese \
--output_dir eval/score_result
Each line in the .jsonl files contains:
{
"row_index": 0,
"sample_idx": 0,
"question": "Problem statement...",
"response": "Model's full response...",
"answer": "Ground truth answer",
"language": "French",
"reasoning_language": "English",
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
"dataset": "math500",
"n_samples": 1,
"answer_alone": 1,
"language_compliance": 1,
"language_compliance_score": 0.95,
"language_filtered_chunks": 42,
"language_invalid_chunks": [],
"correct": 1
}
Key Fields:
- `answer_alone`: 1 if the extracted answer matches the ground truth (regardless of language)
- `language_compliance`: 1 if the response is in the target language (≥ compliance threshold)
- `language_compliance_score`: Fraction of text chunks in the target language
- `correct`: 1 if both `answer_alone` and `language_compliance` are 1
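As an example of working with these labeled files, the sketch below loads one results file and aggregates the key fields. The file path shown is hypothetical; substitute your own output under --output_path.

```python
import json
from pathlib import Path

def summarize(results_path: str) -> dict:
    """Aggregate answer accuracy, language compliance, and joint correctness
    from a labeled results file (one JSON object per line)."""
    lines = Path(results_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    n = len(records)
    return {
        "n": n,
        "answer_accuracy": sum(r["answer_alone"] for r in records) / n,
        "language_compliance": sum(r["language_compliance"] for r in records) / n,
        "correct": sum(r["correct"] for r in records) / n,
    }

# Example (hypothetical path):
# print(summarize("eval/results/math500/English-English.jsonl"))
```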
Each CSV contains a row per model with columns for each language pair:
model,English-English,French-French,Japanese-Japanese,AVG
meta-llama_Llama-3.3-70B-Instruct,0.856,0.782,0.734,0.791
Create a .env file in the repository root:
# Required for answer verification
OPENAI_API_KEY=your_api_key_here
# Optional: vLLM tuning
VLLM_TP_SIZE=8 # Tensor parallel size (default: auto-detect)
VLLM_GPU_MEM_UTIL=0.90 # GPU memory utilization (default: 0.90)
VLLM_MAX_MODEL_LEN=32768 # Max model length (default: 32768)
The framework uses two thresholds for language compliance:
- `confidence_threshold` (default: 0.9): Minimum Lingua confidence for a text chunk to be considered in the target language
- `compliance_threshold` (default: 0.9): Minimum fraction of chunks that must be in the target language
For non-Latin scripts (Chinese, Japanese, Thai, Telugu, Marathi), the system also uses script-based heuristics:
- Japanese: Requires ≥5% Kana characters to avoid misclassifying pure Kanji as Chinese
- Chinese: Rejects text with significant Kana presence
- Others: Script ratio must be ≥75% for the target script
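For intuition, here is a minimal sketch of chunk-level compliance scoring with Lingua. It is illustrative only; the actual chunking and script heuristics live in utils/language_validation.py and are more involved.

```python
from lingua import Language, LanguageDetectorBuilder

# Build a detector restricted to the languages of interest.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.JAPANESE, Language.CHINESE
).build()

def compliance_score(text: str, target: Language, confidence_threshold: float = 0.9) -> float:
    """Fraction of text chunks whose Lingua confidence for the target language
    clears the confidence threshold (simple sentence splits stand in for the real chunking)."""
    chunks = [c.strip() for c in text.split(".") if c.strip()]
    if not chunks:
        return 0.0
    compliant = sum(
        detector.compute_language_confidence(chunk, target) >= confidence_threshold
        for chunk in chunks
    )
    return compliant / len(chunks)

# A response is compliant if its score clears the compliance threshold, e.g.:
# compliance_score(response_text, Language.ENGLISH) >= 0.9
```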
The evaluation automatically saves progress:
- Raw generation (`*.raw.jsonl`): Saved immediately after generation
- Labeled results (`*.jsonl`): Saved after answer verification and language validation
If evaluation is interrupted:
- If only raw generation completed: Resume from labeling phase (no model loading)
- If labeling completed: Skip entirely
- Otherwise: Restart from generation
The script automatically detects and resumes incomplete evaluations. Simply re-run the same command.
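The decision roughly amounts to checking which output files already exist. A simplified sketch, assuming the file-naming patterns above; the helper and argument names are illustrative, not the script's actual internals.

```python
from pathlib import Path

def resume_phase(output_path: str, run_name: str) -> str:
    """Decide where an interrupted evaluation should resume, based on which files exist."""
    labeled = Path(output_path) / f"{run_name}.jsonl"
    raw = Path(output_path) / f"{run_name}.raw.jsonl"
    if labeled.exists():
        return "skip"      # labeling already finished
    if raw.exists():
        return "label"     # reuse raw generations; no model loading needed
    return "generate"      # start from scratch
```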
Print a sample formatted prompt to verify prompt construction:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--sample \
--echo_prompt
Use datasets from a custom location:
python -m eval.eval \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dataset math500 \
--query_language English \
--reasoning_language English \
--output_path eval/results \
--benchmarks_dir /path/to/custom/benchmarks
Please consider citing our paper if you find our work or dataset helpful for your research:
@inproceedings{barua2025long,
  title={Long Chain-of-Thought Reasoning Across Languages},
  author={Josh Barua and Seun Eisape and Kayo Yin and Alane Suhr},
  year={2025},
  eprint={2508.14828},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.14828}
}