🤗 Hugging Face | 📄 arXiv
We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks:
- Fault Coverage: Measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes.
- Fault Exposure: Evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation.
We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
The dataset is available on Hugging Face with the following components (a loading sketch follows the list):
- problem: 500 competitive programming problems from CodeForces
- submission_log: Complete history of 13.1M user-submitted incorrect solutions for 500 problems, including error types, the test case index where the error occurred, and programming language
- submission_all: Complete dataset with 118,611 human-written incorrect solutions (averaging 237 per problem) used in Task 1 (Fault Coverage Task)
- submission_lite: Proportionally sampled subset with 10,000 human-written incorrect solutions (averaging 20 per problem) used in Task 2 (Fault Exposure Task)
- TestCase-Eval-Task1: Chain-of-thought prompts for the Fault Coverage task.
- TestCase-Eval-Task2: Chain-of-thought prompts for the Fault Exposure task.
- TestCase-Eval-Task1-DO: Direct-output prompts for the Fault Coverage task.
- TestCase-Eval-Task2-DO: Direct-output prompts for the Fault Exposure task.
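The components can be loaded with the 🤗 `datasets` library. Below is a minimal sketch, assuming the dataset is published under the repository ID `FlowRays/TestCase-Eval` with one configuration per component and a `train` split; the exact identifiers, split names, and field names should be checked against the Hugging Face page.

```python
from datasets import load_dataset

# Assumed repository ID, configuration names, and split -- verify against
# the actual Hugging Face dataset page before running.
DATASET_ID = "FlowRays/TestCase-Eval"

# 500 competitive programming problems.
problems = load_dataset(DATASET_ID, "problem", split="train")

# Chain-of-thought prompts for the Fault Coverage task.
task1_prompts = load_dataset(DATASET_ID, "TestCase-Eval-Task1", split="train")

print(len(problems))
print(problems[0])  # inspect the problem schema
```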
```bash
git clone https://github.com/FlowRays/TestCase-Eval.git
cd TestCase-Eval
pip install -r requirements.txt
```

The evaluation process consists of four main steps. Let's use Qwen/Qwen3-8B as an example model.
This step uses an LLM to generate raw responses for the test case generation tasks.
```bash
# Generate for both tasks
python generate.py --model "Qwen/Qwen3-8B" --all

# Or generate for a single task
python generate.py --model "Qwen/Qwen3-8B" --task 1
```

This will create results/Qwen/Qwen3-8B/task1_gen.jsonl and/or task2_gen.jsonl.
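For reference, generating a response for a single task prompt can be sketched with the `transformers` library as below. This is an illustration only: generate.py may use a different backend (e.g. vLLM) and different decoding parameters, and the prompt string here is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A single task prompt (in practice, taken from TestCase-Eval-Task1).
prompt = "Generate test cases for the following problem: ..."
messages = [{"role": "user", "content": prompt}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Decoding settings are illustrative, not those used by generate.py.
output = model.generate(inputs, max_new_tokens=1024)
response = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```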
This script processes the raw LLM responses and extracts clean, executable test cases. It uses a regex-based approach with an LLM fallback (gpt-4.1-mini) for robustness.
```bash
python extract.py --model "Qwen/Qwen3-8B"
```

This reads the _gen.jsonl files and creates new files, task1_ext.jsonl and task2_ext.jsonl, in the same directory, containing the extracted tests.
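To illustrate the regex-first, LLM-fallback idea, a simplified extractor might look like the sketch below. This is not the actual extract.py logic: the fence pattern and the fallback interface are assumptions.

```python
import re
from typing import Callable, Optional

# Matches fenced code blocks in an LLM response; the pattern is an assumption.
CODE_FENCE = re.compile(r"`{3}[a-zA-Z]*\s*\n(.*?)`{3}", re.DOTALL)

def extract_tests(
    raw_response: str,
    llm_fallback: Optional[Callable[[str], str]] = None,
) -> list[str]:
    """Return candidate test inputs found in a raw LLM response.

    Tries a regex over fenced code blocks first; if nothing matches and an
    LLM fallback (e.g. a gpt-4.1-mini call) is supplied, uses it instead.
    """
    blocks = [m.strip() for m in CODE_FENCE.findall(raw_response) if m.strip()]
    if blocks:
        return blocks
    if llm_fallback is not None:
        cleaned = llm_fallback(
            "Extract only the raw test input from this response:\n" + raw_response
        ).strip()
        return [cleaned] if cleaned else []
    return []
```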
This is the core evaluation step. The script runs the extracted test cases against the correct and incorrect code submissions in a secure sandboxed environment to record the execution outcomes.
The provided sandbox.py is a self-contained executor for convenience. For the most accurate and secure evaluation, we recommend using dedicated sandboxing frameworks such as ExecEval or SandboxFusion.
Note: This is a computationally intensive process that will take a significant amount of time.
```bash
# Run evaluation for both tasks
python sandbox.py --model "Qwen/Qwen3-8B" --all

# Or run for a single task
python sandbox.py --model "Qwen/Qwen3-8B" --task 1
```

Execution results will be saved to results/Qwen/Qwen3-8B/task1_exec.jsonl and task2_exec.jsonl.
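Conceptually, this step feeds each extracted test to a submission under a time limit and records the outcome. The sketch below is a simplified illustration of that idea, not sandbox.py itself: it assumes Python submissions invoked via subprocess, and real evaluation should use an isolated sandbox as recommended above.

```python
import subprocess

def run_submission(source_path: str, test_input: str, timeout_s: float = 2.0) -> dict:
    """Run one Python submission on one test input and record the outcome."""
    try:
        proc = subprocess.run(
            ["python", source_path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else "runtime_error"
        return {"status": status, "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        return {"status": "time_limit_exceeded", "stdout": ""}

def exposes_fault(correct_src: str, wrong_src: str, test_input: str) -> bool:
    """A test exposes a fault if the incorrect solution's status or output
    differs from the correct solution's output on that input."""
    ref = run_submission(correct_src, test_input)
    hyp = run_submission(wrong_src, test_input)
    if ref["status"] != "ok":
        return False  # invalid test: the reference solution must pass it
    return hyp["status"] != "ok" or hyp["stdout"].strip() != ref["stdout"].strip()
```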
Finally, run the statistics scripts to calculate the Fault Coverage (Task 1) and Fault Exposure (Task 2) scores from the execution logs.
```bash
# Calculate and print Task 1 (Coverage) results
python stat_task1.py --model "Qwen/Qwen3-8B"

# Calculate and print Task 2 (Exposure) results
python stat_task2.py --model "Qwen/Qwen3-8B"
```

The final aggregated results for the specified model will be written to stat/task1_cov_results.json and stat/task2_exp_results.json.
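For intuition, a coverage-style score can be aggregated from such execution logs roughly as below. Both the log schema (field names) and the metric definition here are assumptions for illustration, not the scoring implemented by stat_task1.py.

```python
import json
from collections import defaultdict

def coverage_from_log(exec_log_path: str) -> float:
    """Fraction of incorrect solutions exposed by at least one generated test,
    averaged over problems. Assumes one JSON record per (problem, solution,
    test) execution with a boolean "exposed" field; field names are assumptions.
    """
    exposed = defaultdict(set)   # problem_id -> solutions caught by some test
    seen = defaultdict(set)      # problem_id -> all solutions evaluated
    with open(exec_log_path) as f:
        for line in f:
            rec = json.loads(line)
            seen[rec["problem_id"]].add(rec["solution_id"])
            if rec["exposed"]:
                exposed[rec["problem_id"]].add(rec["solution_id"])
    per_problem = [len(exposed[p]) / len(seen[p]) for p in seen if seen[p]]
    return sum(per_problem) / len(per_problem) if per_problem else 0.0
```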
- All datasets are collected from CodeForces
If you find TestCase-Eval useful in your research, please cite our paper:
```bibtex
@inproceedings{yang-etal-2025-llms,
    title = "Can {LLM}s Generate High-Quality Test Cases for Algorithm Problems? {T}est{C}ase-Eval: A Systematic Evaluation of Fault Coverage and Exposure",
    author = "Yang, Zheyuan and
      Kuang, Zexi and
      Xia, Xue and
      Zhao, Yilun",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-short.82/",
    pages = "1050--1063",
    ISBN = "979-8-89176-252-7",
    abstract = "We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems."
}
```