Math-VR Benchmark & CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Home • Paper • Leaderboard

Chengqi Duan1*, Kaiyue Sun1*, Rongyao Fang3*, Manyuan Zhang2†, Yan Feng2, Ying Luo2, Yufang Liu2, Ke Wang3, Peng Pei2, Xunliang Cai2, Hongsheng Li3, Yi Ma1, Xihui Liu1 ✉️

1HKU, 2Meituan, 3CUHK

*Equal contribution, †Project Lead, ✉️Corresponding author

Paper • Introduction • Math-VR • Model • Usage • Evaluation • Benchmark results • License • Citation

Introduction

Recent advances in Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems that require visual assistance, such as drawing auxiliary lines or plotting functions. Most VLMs are constrained to text-only reasoning, while unified models that generate interleaved text and images often lack the precision required for mathematical tasks.

We present CodePlot-CoT, a code-driven Chain-of-Thought (CoT) paradigm that enables models to "think with images" in mathematics. Our approach leverages a VLM to generate both textual reasoning and executable plotting code. This code is then rendered into an image, serving as a "visual thought" that is fed back into the model to aid problem solving. To facilitate this, we introduce Math-VR, the first large-scale, bilingual dataset and benchmark for mathematical problems requiring visual reasoning, comprising 178K samples. We also developed MatplotCode, a specialized image-to-code converter to generate high-quality training data. We benchmark SOTA models on our Math-VR benchmark. Our experiments show that CodePlot-CoT achieves up to a 21% performance increase over its base model, demonstrating the effectiveness of our code-driven reasoning paradigm.

The main contributions of our work can be summarized as follows:

  • We propose a novel and efficient paradigm that enables VLMs to engage in visual reasoning through code generation.
  • We construct Math-VR, the first large-scale, bilingual dataset and benchmark (178K samples) for Mathematical problems with Visual Reasoning.
  • We develop MatplotCode, a state-of-the-art image-to-code converter for mathematical figures, and train CodePlot-CoT model, a specialized model that achieves up to a 21% performance increase over strong baselines.

Released Data: Math-VR-train and Math-VR-bench

| Dataset | Link |
| --- | --- |
| Math-VR-train | 🤗 HuggingFace |
| Math-VR-bench | 🤗 HuggingFace |

Released Model: MatPlotCode and CodePlot-CoT

| Model | Link |
| --- | --- |
| MatPlotCode | 🤗 HuggingFace |
| CodePlot-CoT | 🤗 HuggingFace |

Math-VR

Math-VR is the first large-scale, bilingual (English and Chinese) dataset and benchmark specifically designed to evaluate and advance the visual reasoning capabilities of AI models in mathematics. While traditional benchmarks have focused on text-centric problem-solving, Math-VR targets the critical domain of problems that require "reasoning with images," such as drawing auxiliary lines or plotting functions to find a solution.

The Math-VR dataset contains 178,000 samples, each consisting of a question, a detailed reasoning process, and a final answer. A key feature of this dataset is that the reasoning process for each problem includes at least one image, providing a rich resource for training models to integrate visual information into their problem-solving steps. The dataset spans multiple mathematical domains, including Geometry, Algebra, and Calculus.
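
To make the sample structure concrete, here is a minimal sketch of one training sample. The field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# A hypothetical Math-VR training sample; field names are assumptions
# for illustration, not the released dataset's schema.
sample = {
    "question": "In triangle ABC, ... find the length of AD.",
    "reasoning": [
        {"type": "text", "value": "Draw the auxiliary line AD..."},
        {"type": "image", "value": "figure_0.png"},  # >= 1 image per sample
        {"type": "text", "value": "By the Pythagorean theorem..."},
    ],
    "answer": "AD = 5",
}

def image_count(sample: dict) -> int:
    """Count image steps; every Math-VR reasoning trace has at least one."""
    return sum(step["type"] == "image" for step in sample["reasoning"])
```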

The Math-VR benchmark consists of 5,000 bilingual (English and Chinese) mathematical questions. To ensure a deterministic and reliable evaluation, questions were carefully selected; for instance, proof-based questions were excluded to avoid the difficulty of assessing logical validity, and most multiple-choice questions were removed so that random guessing cannot earn credit. The benchmark is divided into two subsets: a Text subset with 2,000 text-only questions, and a Multimodal subset with 3,000 questions presented with both text and images. Both question types require models to reason or use imagination in the visual domain.

We designed a comprehensive evaluation pipeline that uses two core metrics to measure a model's performance:

  • Answer Correctness (AC): This metric provides a reliable binary judgment by strictly checking whether the model's final answer perfectly matches the ground-truth answer. Any error or omission results in a score of 0.
  • Process Score (PS): Recognizing that the reasoning process can be valuable even if the final answer is incorrect, this metric awards partial credit. It assesses whether the model hits critical "scoring points", such as applying theorems or performing necessary calculations, within its reasoning steps. This fine-grained assessment more accurately reflects a model's true problem-solving abilities.
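
The two metrics can be sketched in a few lines (a minimal sketch; the exact aggregation in the released evaluation scripts may differ):

```python
def answer_correctness(judged_perfect_match: bool) -> int:
    """AC is all-or-nothing: any error or omission scores 0."""
    return 1 if judged_perfect_match else 0

def process_score(hit_points: list[bool]) -> float:
    """PS: fraction of the problem's scoring points the judge marked as hit."""
    return sum(hit_points) / len(hit_points) if hit_points else 0.0
```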

Model Overview

CodePlot-CoT: Mathematical Visual Reasoning with Code-Driven Images

We introduce CodePlot-CoT, an innovative code-driven Chain-of-Thought (CoT) paradigm designed to enable Vision Language Models to "think with images" when solving mathematical problems. Rather than generating pixel-based images directly, the model outputs executable plotting code to represent its "visual thoughts". This code is executed to render a precise figure, which is then fed back to the model as visual input for subsequent reasoning steps.
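
The render step of this loop can be sketched as follows. This is a minimal sketch, not the repository's implementation; the model call itself is omitted, and the generated code is assumed to save its figure to a known path:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def render_visual_thought(code: str, out_path: str, timeout: float = 30.0) -> bool:
    """Run model-generated plotting code in a subprocess; True on success.

    The generated code is expected to save its figure to `out_path`
    (e.g. with matplotlib's plt.savefig). A subprocess isolates the
    solver loop from crashes or hangs in the generated code.
    """
    script = Path(tempfile.mkdtemp()) / "visual_thought.py"
    script.write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and Path(out_path).exists()
```

On success, the rendered image at `out_path` is appended to the conversation as a new visual input before the next reasoning step.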

MatplotCode: A High-Fidelity Converter for Mathematical Figures

To train the CodePlot-CoT model, we require high-quality data pairing images with corresponding plotting code. Since such resources are rare and existing general models are unreliable for this specialized task, we developed MatplotCode, a state-of-the-art image-to-code converter designed specifically for mathematical figures. It specializes in converting complex mathematical figures into high-fidelity Python plotting code. In our evaluation, MatplotCode achieves a 100% code execution success rate, and its image reconstruction fidelity is significantly higher than that of SOTA models including GPT-o3 and Gemini-2.5-Pro. MatplotCode is the key to enabling the large-scale curation of our code-driven training data, laying the foundation for the successful training of the CodePlot-CoT model.
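
As a rough illustration of how an execution success rate can be screened cheaply, here is a sketch of a syntax-level first pass. Note this is only an assumption-laden stand-in: the paper's 100% figure refers to actually executing the generated plotting code, whereas `compile()` only checks that it parses:

```python
def execution_success_rate(snippets: list[str]) -> float:
    """Fraction of generated snippets that are at least valid Python.

    A cheap first-pass filter: compile() checks syntax only; a full
    check would execute each snippet (e.g. in a subprocess sandbox).
    """
    ok = 0
    for src in snippets:
        try:
            compile(src, "<generated>", "exec")
            ok += 1
        except SyntaxError:
            pass
    return ok / len(snippets) if snippets else 0.0
```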

Usage

Installation

Clone the repository and install its dependencies.

conda create -n codeplot python=3.10
conda activate codeplot
git clone git@github.com:HKU-MMLab/Math-VR-CodePlot-CoT.git
cd Math-VR-CodePlot-CoT
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1

For benchmark evaluation only, additionally install:

pip install openai==4.1.1
pip install datasets==2.0.0

Model Weights

Expected directory structure:

CodePlot-CoT
├── ckpts
│   ├── CodePlot-CoT
│   ├── MatPlotCode
├── ...

Inference

# Convert image to python code with MatPlotCode
python image_to_code.py
# Solve math problems with CodePlot-CoT
python math_infer.py

Math-VR Benchmark Evaluation

To evaluate a model on the Math-VR benchmark, please follow these steps:

  1. Download the Math-VR benchmark dataset from Hugging Face. This dataset contains our 2,500 English test questions.
  2. Store the downloaded file in a data/ directory.
  3. Create a JSON file containing the model's solutions.
    • Please refer to Math-VR-Infer.py for guidance on the generation process. If you only wish to evaluate on the text or multimodal subset, pass --type text or --type multimodal to the inference script.
    • The answer file must be a JSON object mapping each <question_id> to the model's corresponding solution string. An example is provided in examples/answer.json.

The required format for the answer file is as follows:

{
  "<question_id>": "<Model's solution>",
  ...
}
  4. Run the evaluation script with the following command:
python Math-VR-Eval.py --answer_dir "<The Json file that contains model's solutions>" --result_dir "<The path to save the evaluation result in Json format>" --data_path "<The path to the benchmark dataset>" --api_key "<Your openai api key>"

The script leverages GPT-4.1 to evaluate the model's responses and generates a result.json containing its judgements.

  5. Summarize all scores by running:

python summarize_score.py --result_dir "<The path to the saved result>" --data_path "<The path to the benchmark dataset>"
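
The answer file consumed in step 3 can be produced with a small helper (a sketch; any JSON writer that emits the same mapping works):

```python
import json

def write_answer_file(solutions: dict, path: str) -> None:
    """Write solutions in the format the evaluator expects:
    a JSON object mapping each <question_id> to the model's
    solution string."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(solutions, f, ensure_ascii=False, indent=2)
```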

Benchmark

The leaderboard is available here. We benchmark a suite of SOTA VLMs (Vision Language Models), UMs (Unified Models), and LLMs (Large Language Models) on the English subset of the Math-VR benchmark, comprising 2,500 unique questions.

Math-VR benchmark (English) on VLMs and UMs

| # | Model | Link | Version | #Params | Type | Thinking | Overall (AC) | Overall (PS) | Text (AC) | Text (PS) | Multimodal (AC) | Multimodal (PS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen3-VL-235B-A22B-Thinking 🥇 | Link | – | 235B | VLM | ✓ | 66.8 | 81.0 | 58.9 | 77.4 | 72.1 | 83.4 |
| 2 | Qwen3-VL-235B-A22B-Instruct 🥈 | Link | – | 235B | VLM | ✗ | 65.0 | 80.1 | 59.4 | 77.8 | 68.8 | 81.6 |
| 3 | Gemini-2.5-Pro 🥉 | Link | – | – | VLM | ✓ | 64.7 | 80.8 | 58.7 | 77.9 | 68.7 | 82.8 |
| 4 | Gemini-2.5-Flash | Link | 2025-06-17 | – | VLM | ✓ | 60.5 | 78.4 | 57.0 | 77.5 | 62.9 | 79.0 |
| 5 | GPT-o3 | Link | 2025-04-16 | – | VLM | ✓ | 59.3 | 76.4 | 52.9 | 72.9 | 63.7 | 78.6 |
| 6 | Seed-1.6-Thinking | Link | 2025-06-15 | – | VLM | ✓ | 58.4 | 75.2 | 53.0 | 73.0 | 62.0 | 76.6 |
| 7 | GPT-5-Thinking | Link | – | – | VLM | ✓ | 58.1 | 70.6 | 53.2 | 68.0 | 61.4 | 72.3 |
| 8 | Claude Opus4.1 | Link | – | – | VLM | ✓ | 54.3 | 70.6 | 53.1 | 70.5 | 55.1 | 70.6 |
| 9 | Nano Banana | Link | 2025-08-26 | – | UM | ✗ | 53.4 | 73.8 | 49.1 | 72.3 | 56.3 | 74.7 |
| 10 | Gemini-2.5-Flash-No-Thinking | Link | 2025-06-17 | – | VLM | ✗ | 52.3 | 73.7 | 44.6 | 70.9 | 57.5 | 75.5 |
| 11 | GLM-4.5V | Link | – | 108B | VLM | ✓ | 49.6 | 69.7 | 48.0 | 70.5 | 50.6 | 69.1 |
| 12 | Mimo-VL-7B-RL | Link | 2508 | 7B | VLM | ✓ | 48.3 | 68.8 | 43.5 | 68.4 | 51.3 | 69.0 |
| 13 | InternVL-3.5-8B | Link | – | 8B | VLM | ✓ | 40.8 | 62.8 | 38.5 | 64.0 | 42.2 | 62.0 |
| 14 | GPT-4.1-mini | Link | – | – | VLM | ✗ | 33.3 | 60.0 | 33.3 | 62.0 | 33.3 | 58.6 |
| 15 | GLM-4.1V-9B | Link | – | 9B | VLM | ✓ | 29.0 | 53.4 | 27.8 | 54.4 | 29.9 | 52.7 |
| 16 | Claude-Sonnet-4 | Link | 2025-05-23 | – | VLM | ✗ | 28.1 | 56.4 | 31.5 | 60.9 | 25.8 | 53.4 |
| 17 | GPT-4.1 | Link | – | – | VLM | ✗ | 26.0 | 53.9 | 26.6 | 56.5 | 25.6 | 52.2 |
| 18 | CodePlot-CoT | Link | – | 32B | VLM | ✗ | 22.1 | 47.0 | 31.6 | 53.8 | 15.8 | 42.4 |
| 19 | Gemini-2.0-Flash | Link | – | – | VLM | ✗ | 20.6 | 50.7 | 24.1 | 56.1 | 18.3 | 47.0 |
| 20 | Keye-VL-1.5 | Link | – | 8B | VLM | ✗ | 17.3 | 38.2 | 20.2 | 44.4 | 15.4 | 34.0 |
| 21 | Gemma3 | Link | – | 27B | VLM | ✗ | 16.1 | 44.8 | 19.2 | 50.8 | 14.1 | 40.8 |
| 22 | Qwen-2.5-VL-72B | Link | – | 72B | VLM | ✗ | 13.7 | 40.8 | 15.3 | 44.6 | 12.7 | 38.2 |
| 23 | Bagel-Zebra-CoT | Link | – | 7B | UM | ✗ | 10.1 | 34.1 | 13.9 | 41.5 | 7.6 | 29.1 |
| 24 | Qwen-2.5-VL-32B | Link | – | 32B | VLM | ✗ | 10.0 | 33.7 | 10.6 | 36.9 | 9.6 | 31.5 |
| 25 | GPT-4.1-nano | Link | – | – | VLM | ✗ | 9.1 | 38.5 | 13.1 | 45.9 | 6.4 | 33.6 |
| 26 | InternVL-3.5-8B-No-Thinking | Link | – | 8B | VLM | ✗ | 7.9 | 31.4 | 9.2 | 35.6 | 7.0 | 28.6 |
| 27 | Bagel | Link | – | 7B | UM | ✗ | 7.6 | 27.6 | 8.5 | 32.9 | 7.0 | 24.0 |
| 28 | Qwen-2.5-VL-3B | Link | – | 3B | VLM | ✗ | 5.3 | 27.5 | 7.9 | 33.4 | 3.6 | 23.6 |
| 29 | GPT-4o | Link | 2024-11-20 | – | VLM | ✗ | 4.3 | 30.4 | 5.7 | 34.6 | 3.4 | 27.6 |

Math-VR benchmark (English) on LLMs

| # | Model | Link | #Params | Type | Thinking | Text (PS) | Text (AC) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Deepseek-R1 | Link | 671B | LLM | ✓ | 69.9 | 49.5 |

License

This code is released under the MIT License.

Citation

If you find this work helpful, please consider citing our paper:

@article{duan2025codeplot,
  title={CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images},
  author={Duan, Chengqi and Sun, Kaiyue and Fang, Rongyao and Zhang, Manyuan and Feng, Yan and Luo, Ying and Liu, Yufang and Wang, Ke and Pei, Peng and Cai, Xunliang and others},
  journal={arXiv preprint arXiv:2510.11718},
  year={2025}
}

Contact

If you have any questions, please raise an issue or contact us at duancq24@connect.hku.hk.
