Dayeon Ki1, Kevin Duh2, Marine Carpuat1
1University of Maryland, 2Johns Hopkins University
This repository contains the code and dataset for our paper *What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features*.
Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English‑derived reasoning features genuinely help in other languages?
- **2026-04-03**: We release our code and dataset!
We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use (1) logistic regression to quantify how each feature associates with final answer accuracy. We further train (2) sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as (3) test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning.
- MGSM-Rev2 dataset: `data/mgsm_revised`
  - The list of IDs with updated questions can be found in the `replaced_questions.json` file.
- AIME 2024-25 dataset: `data/aime`
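For a quick sanity check, the released `.jsonl` files can be loaded with a few lines of Python (the paths and field names in the commented usage are assumptions; check the actual files in `data/`):

```python
import json

def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, one record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical usage; adjust paths and field names to the released files.
# mgsm = load_jsonl("data/mgsm_revised/en.jsonl")
# print(mgsm[0]["question"])
```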
You will first need to prompt Large Reasoning Models (LRMs) with queries in each target language to produce reasoning traces and final answers, which serve as inputs to all subsequent analyses.
For Distill-DeepSeek series models:
```shell
python -u code/run_distill_vllm_aime/mgsmv2.py \
  --model_name $MODEL_NAME \
  --num_gpu $NUM_GPU \
  --seed $SEED \
  --iteration $ITERATION \
  --save_name $SAVE_NAME
```

Arguments for the code are:
- `model_name`: The HuggingFace identifier of the LRM
- `num_gpu`: Number of GPUs needed to run the model (we used 2 A5000 GPUs for the 7B-sized models)
- `seed`: Seed for the vLLM setup
- `iteration`: Name of the iteration (e.g., 1, 2, 3, ...)
- `save_name`: Name of the saved file
For Qwen3 series models:
```shell
python -u code/run_qwen_vllm_aime/mgsmv2.py \
  --model_name $MODEL_NAME \
  --num_gpu $NUM_GPU \
  --seed $SEED \
  --iteration $ITERATION \
  --save_name $SAVE_NAME
```

The arguments are identical to the code above. You can set your save directory in `output_path` in the code.
Then, you can evaluate the generated reasoning traces on the basis of (1) final answer accuracy, (2) language, and (3) length using `python -u code/evaluate.py`.
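The exact logic in `evaluate.py` may differ, but final-answer accuracy conceptually reduces to extracting the last number from each trace and comparing it against the gold answer. A minimal sketch:

```python
import re

def extract_final_number(trace: str):
    """Return the last number mentioned in a reasoning trace, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", trace.replace(",", ""))
    return float(matches[-1]) if matches else None

def answer_accuracy(traces, gold_answers):
    """Fraction of traces whose final number matches the gold answer."""
    correct = sum(
        1 for t, g in zip(traces, gold_answers)
        if (p := extract_final_number(t)) is not None and abs(p - float(g)) < 1e-6
    )
    return correct / len(traces)
```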
We then segment each generated reasoning trace using `\n\n` as the separator and classify each reasoning step into a cognitive-behavioral function tag with GPT-4o.
- Segmentation: `python -u code/feature/1_segment.py`
- Classification: You will need to set your `OPENAI_API_KEY`.
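The segmentation step amounts to splitting a trace on blank lines; a sketch of the core operation (`1_segment.py` may apply additional cleaning):

```python
def segment_trace(trace: str):
    """Split a reasoning trace into steps on blank lines (double-newline separator)."""
    return [step.strip() for step in trace.split("\n\n") if step.strip()]
```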
```shell
python -u code/feature/2_classify.py \
  --input_file $PATH_TO_INPUT_FILE \
  --en_input_file $PATH_TO_EN_INPUT_FILE \
  --language $LANG \
  --output_file $PATH_TO_OUTPUT_FILE
```

Arguments for running the classification code are as follows:
- `--input_file`: Path to the input jsonl file (saved from the segmentation code)
- `--en_input_file`: Path to the English input file
- `--language`: ISO code of the target language
- `--output_file`: Path to the output jsonl file
You can check the distribution of the annotated cognitive-behavioral tags using `python -u code/feature/get_distribution.py` and the frequency of each tag using `python -u code/feature/tag_frequency.py`. These provide the values for the Reasoning Flow features.
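Conceptually, both scripts reduce to counting tags over the annotated traces. A sketch, assuming each trace is represented as a sequence of tag strings:

```python
from collections import Counter

def tag_distribution(tag_sequences):
    """Normalized frequency of each cognitive-behavioral tag across traces."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```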
For generating Multilingual Alignment features:
- COMET-QE: `python -u code/mt/mt_comet_qe.py`
- Structural similarity: `python -u code/feature/3_struct_alignment.py`
For generating Reasoning Step features:
- Num. Steps: You will get a summary when running `python -u code/feature/1_segment.py`
- Validity: `python -u code/feature/4_validity_utility.py`
- Direct/Indirect Utility: `python -u code/feature/4_validity_utility.py`
- V-Information:
```shell
python -u code/feature/5_vi.py \
  --model_name $MODEL_NAME \
  --num_gpu $NUM_GPU \
  --seed $SEED \
  --input_path $PATH_TO_INPUT_FILE \
  --output_path $PATH_TO_OUTPUT_FILE \
  --max_logprobs $MAX_LOGPROBS
```

Arguments are mostly similar to Step 1, with the following additions:
- `--input_path`: Path to the input jsonl file
- `--output_path`: Path to the output jsonl file
- `--max_logprobs`: Maximum number of log probabilities to return per request (a vLLM setting)
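For intuition, pointwise V-information contrasts the model's log-probability of the answer with and without the reasoning trace as context (a sketch of the standard PVI formula; `5_vi.py` obtains these log-probabilities from vLLM):

```python
import math

def pointwise_v_information(logprob_with_context: float,
                            logprob_without_context: float) -> float:
    """PVI(x -> y) = log2 p(y | x) - log2 p(y | null), in bits.
    Positive values mean the context makes the answer easier to predict."""
    # Convert natural-log probabilities (as returned by most LMs) to bits.
    return (logprob_with_context - logprob_without_context) / math.log(2)
```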
Each script will automatically compute (1) the change in answer-accuracy association, (2) the pooled interaction for comparing reasoning traces from English vs. non-English queries, and (3) the visualization used in Figure 1 of the paper.
- Running the univariate logistic regression model: `python -u code/analysis/univariate.py`
- Running the multivariate logistic regression model: `python -u code/analysis/multivariate.py`
- Running the per-language univariate logistic regression model: `python -u code/analysis/univariate_perlang.py`
- Running the per-language multivariate logistic regression model: `python -u code/analysis/multivariate_perlang.py`
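As a minimal illustration of the univariate association (the repo scripts use their own implementation), one can fit accuracy against a single feature with a dependency-free logistic regression; a positive slope means the feature associates with higher answer accuracy:

```python
import math

def fit_univariate_logreg(xs, ys, lr=0.1, steps=2000):
    """Fit accuracy ~ sigmoid(w*x + b) by batch gradient descent; returns (w, b).
    xs: feature values; ys: 0/1 answer correctness labels."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x  # gradient of log-loss w.r.t. w
            gb += (p - y)      # gradient of log-loss w.r.t. b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b
```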
Please refer to the HypotheSAEs GitHub repo for detailed instructions.
We generate 8 traces for each temperature (0.3, 0.6, 0.8, 1.0), then later combine them and select among them at test time using each feature value as the selection policy.
```shell
python -u code/steer/1_generate.py \
  --model_name $MODEL_NAME \
  --num_gpu $NUM_GPU \
  --seed $SEED \
  --data_type $DATASET_NAME \
  --save_name $SAVE_NAME \
  --n $NUM_TRACES \
  --temperature $TEMPERATURE \
  --languages $LANGUAGES
```

Arguments are mostly similar to the Measurable Reasoning Features section, with the following additions:
- `--data_type`: Name of the dataset (either `aime` or `mgsm_revised`)
- `--n`: Number of candidate reasoning traces to generate (in the paper, we generated 8 traces for each temperature)
- `--temperature`: Temperature value (0.3, 0.6, 0.8, 1.0)
- `--languages`: List of languages to run the code for
We re-rank the generated candidates using each feature value.
```shell
python -u code/steer/2_selection.py \
  --model_name $MODEL_NAME \
  --data_type $DATASET_NAME \
  --temperature $TEMPERATURE
```

We combine the re-ranked traces from each temperature and then make the final selection.
```shell
python -u code/steer/3_combine.py \
  --model_name $MODEL_NAME \
  --data_type $DATASET_NAME \
  --temperatures $TEMPERATURE_LIST \
  --seed $SEED \
  --output_path $PATH_TO_OUTPUT_FILE \
  --stats $RUN_STATISTICAL_SIGNIFICANCE_TEST
```

The newly added arguments are:
- `--temperatures`: List of temperatures to consider in the final selection
- `--stats`: Whether to run statistical significance testing (True if the flag is passed)
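The selection policy itself is simple in concept: among a question's candidate traces, keep the one that maximizes the chosen feature value. A sketch, where each candidate is a dict of feature scores (field names here are assumptions):

```python
def select_by_feature(candidates, feature_name):
    """Pick the candidate trace with the highest value of the given feature."""
    return max(candidates, key=lambda c: c[feature_name])
```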
If you find our work useful in your research, please consider citing our paper:
TBD
For questions, issues, or collaborations, please reach out to dayeonki@umd.edu.

