TestCase-Eval

[ACL 2025] Code for paper "Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure"

🤗 Hugging Face   |   📄 arXiv
📝 Abstract

We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks:

  1. Fault Coverage: Measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes.
  2. Fault Exposure: Evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation.

We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.

Figure: Overview of the TestCase-Eval benchmark.

📊 Dataset

The dataset is available on Hugging Face with the following components:

  • problem: 500 competitive programming problems from Codeforces
  • submission_log: Complete submission history of 13.1M user-submitted incorrect solutions for the 500 problems, including error types, the index of the test case where the error occurred, and the programming language
  • submission_all: Complete set of 118,611 human-written incorrect solutions (averaging 237 per problem), used in Task 1 (Fault Coverage)
  • submission_lite: Proportionally sampled subset of 10,000 human-written incorrect solutions (averaging 20 per problem), used in Task 2 (Fault Exposure)
  • TestCase-Eval-Task1: Chain-of-thought prompts for the Fault Coverage task
  • TestCase-Eval-Task2: Chain-of-thought prompts for the Fault Exposure task
  • TestCase-Eval-Task1-DO: Direct-output prompts for the Fault Coverage task
  • TestCase-Eval-Task2-DO: Direct-output prompts for the Fault Exposure task
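
Loading the dataset (a minimal sketch, assuming the components above are exposed as Hugging Face configurations under a repository ID such as FlowRays/TestCase-Eval; verify the exact names on the Hugging Face dataset page):

# Minimal sketch; repository ID and configuration names are assumptions.
from datasets import load_dataset

problems = load_dataset("FlowRays/TestCase-Eval", "problem", split="train")
task1_prompts = load_dataset("FlowRays/TestCase-Eval", "TestCase-Eval-Task1", split="train")

print(problems[0].keys())  # inspect the available fields before relying on them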

🚀 Getting Started

Installation

git clone https://github.com/FlowRays/TestCase-Eval.git
cd TestCase-Eval
pip install -r requirements.txt

Evaluation Process

The evaluation process consists of four main steps. Let's use Qwen/Qwen3-8B as an example model.

Step 1: Generate Test Cases

This step uses an LLM to generate raw responses for the test case generation tasks.

# Generate for both tasks
python generate.py --model "Qwen/Qwen3-8B" --all

# Or generate for a single task
python generate.py --model "Qwen/Qwen3-8B" --task 1

This will create results/Qwen/Qwen3-8B/task1_gen.jsonl and/or task2_gen.jsonl.
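
A quick sanity check of the generation output (a minimal sketch; only the file path above is taken as given, the fields inside each record depend on generate.py):

import json

# Count well-formed records in the Task 1 generation output.
path = "results/Qwen/Qwen3-8B/task1_gen.jsonl"
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(f"{len(records)} generated responses in {path}")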

Step 2: Extract Test Cases

This script processes the raw LLM responses and extracts clean, executable test cases. It uses a regex-based approach with an LLM fallback (gpt-4.1-mini) for robustness.

python extract.py --model "Qwen/Qwen3-8B"

This reads the _gen.jsonl files and writes the extracted tests to task1_ext.jsonl and task2_ext.jsonl in the same directory.
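
The extraction idea can be sketched as follows (an illustration only, not the repository's extract.py; it assumes test inputs are returned inside fenced code blocks and defers to the LLM fallback when the regex finds nothing):

import re

def extract_test_input(response: str) -> str | None:
    """Pull the first fenced code block out of a raw LLM response."""
    match = re.search(r"```(?:\w+)?\s*\n(.*?)```", response, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # Regex found nothing: extract.py falls back to an LLM (gpt-4.1-mini) here.
    return None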

Step 3: Run Evaluation Sandbox

This is the core evaluation step. The script runs the extracted test cases against the correct and incorrect code submissions in a secure sandboxed environment to record the execution outcomes.

The provided sandbox.py is a self-contained executor for convenience. For the most accurate and secure evaluation, we recommend using dedicated sandboxing frameworks such as ExecEval or SandboxFusion.

Note: This is a computationally intensive process that will take a significant amount of time.

# Run evaluation for both tasks
python sandbox.py --model "Qwen/Qwen3-8B" --all

# Or run for a single task
python sandbox.py --model "Qwen/Qwen3-8B" --task 1

Execution results will be saved to results/Qwen/Qwen3-8B/task1_exec.jsonl and task2_exec.jsonl.
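
Conceptually, each extracted test is fed to both a correct reference solution and an incorrect submission, and a fault is recorded when the incorrect program crashes, times out, or produces different output. A minimal, unsandboxed sketch of that idea (illustration only; the provided sandbox.py and the dedicated frameworks above add isolation and resource limits, and some problems require a special checker rather than exact output comparison):

import subprocess

def run(cmd: list[str], test_input: str, timeout: float = 5.0) -> str | None:
    """Run one solution on one test input; return stdout, or None on crash/timeout."""
    try:
        proc = subprocess.run(cmd, input=test_input, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout.strip() if proc.returncode == 0 else None

def exposes_fault(reference_cmd: list[str], buggy_cmd: list[str], test_input: str) -> bool:
    expected = run(reference_cmd, test_input)
    actual = run(buggy_cmd, test_input)
    # Fault exposed: the buggy program fails or disagrees with the reference.
    return expected is not None and actual != expected

# Example with hypothetical file names:
# exposes_fault(["python", "correct.py"], ["python", "buggy.py"], "3\n1 2 3\n")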

Step 4: Calculate Final Metrics

Finally, run the statistics scripts to calculate the Fault Coverage (Task 1) and Fault Exposure (Task 2) scores from the execution logs.

# Calculate and print Task 1 (Coverage) results
python stat_task1.py --model "Qwen/Qwen3-8B"

# Calculate and print Task 2 (Exposure) results
python stat_task2.py --model "Qwen/Qwen3-8B"

The final aggregated results for the specified model will be written to stat/task1_cov_results.json and stat/task2_exp_results.json.
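
As a rough illustration of the Task 1 statistic, fault coverage for one problem can be computed as the fraction of incorrect submissions that fail on at least one generated test. The record fields below (solution_id, exposed) are assumptions made for the sketch; stat_task1.py defines the actual log schema and metric:

import json
from collections import defaultdict

# Hypothetical schema: one record per (submission, test) execution with an
# "exposed" flag; the real task1_exec.jsonl layout may differ.
exposed_by_solution = defaultdict(bool)
with open("results/Qwen/Qwen3-8B/task1_exec.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        exposed_by_solution[rec["solution_id"]] |= bool(rec["exposed"])

coverage = sum(exposed_by_solution.values()) / max(len(exposed_by_solution), 1)
print(f"Fault coverage: {coverage:.2%}")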

🙏 Acknowledgements

📚 Citation

If you find TestCase-Eval useful in your research, please cite our paper:

@inproceedings{yang-etal-2025-llms,
    title = "Can {LLM}s Generate High-Quality Test Cases for Algorithm Problems? {T}est{C}ase-Eval: A Systematic Evaluation of Fault Coverage and Exposure",
    author = "Yang, Zheyuan  and
      Kuang, Zexi  and
      Xia, Xue  and
      Zhao, Yilun",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-short.82/",
    pages = "1050--1063",
    ISBN = "979-8-89176-252-7",
    abstract = "We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems."
}
