TuRTLe is a framework for systematically assessing LLMs across key RTL generation tasks. It integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion.
| Benchmarks | EDA Tools and Metrics |
|---|---|
| VerilogEval v2.0 - Spec-to-RTL & Module Completion | Icarus Verilog and Verilator - STX & FNC |
| RTLLM v1.1/v2.0 - Spec-to-RTL | Yosys - SYN |
| VGen - Module Completion | OpenROAD - PPA |
| RTL-Repo - Single Line Completion | OpenLane - PPA |
For more details about our work, refer to our paper on arXiv.
- [2025-11-09] We release TuRTLe v2 with API inference support and local Docker-based evaluation for easy reproducibility
- [2025-07-03] TuRTLe now supports Verilator as a simulator to check for Syntax and Functionality
- [2025-06-12] We add support for multi-node inference with Ray and the configurations for bigger models
- [2025-05-19] The project's source code is now publicly released. We'd love to hear your feedback, so give it a try!
- [2025-03-31] Our paper "TuRTLe: A Unified Evaluation of LLMs for RTL Generation" is now available on ArXiv!
- [2025-03-20] The leaderboard is now live! Check it out on our Huggingface Space
Check the TuRTLe Leaderboard to see the best-performing open-source models for each task.
Make sure you have installed TuRTLe and its dependencies. See Installation Guide for detailed setup instructions.
TuRTLe supports API-based inference, which works out of the box with any OpenAI-compatible API (OpenRouter, OpenAI, Azure, etc.), paired with a Docker-based evaluation that runs the EDA tools locally.
```shell
export TURTLE_BASE_URL=https://openrouter.ai/api/v1
export TURTLE_API_KEY=sk-or-...
```

```shell
uv run turtle/src/turtle.py --use_api \
    --model google/gemini-2.5-flash \
    --task rtllm \
    --max_tokens 18432 \
    --temperature 0.2 \
    --top_p 0.95 \
    --n_samples 5 \
    --reasoning_effort medium \
    --save_generations \
    --save_generations_path './results/gemini-2.5-flash/rtllm.json' \
    --generation_only
```

Available tasks: `rtllm`, `verilog_eval_rtl`, `verilog_eval_cc`, `verigen`, `rtlrepo`
Evaluate the generated RTL designs using our bundled EDA tools (OpenLane, Verilator, Icarus Verilog):
```shell
docker run --rm -v $(pwd):/work -w /work ggcr0/turtle-eval:2.3.4 \
    python3 turtle/src/turtle.py --use_api \
    --task rtllm \
    --model gemini-2.5-flash \
    --n_samples 5 \
    --load_generations_path ./results/gemini-2.5-flash/rtllm.json
```

This will automatically pull the Docker image with all the EDA tooling and evaluate your designs for syntax, functionality, synthesis, and PPA metrics.
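Scores over the `--n_samples` generations are typically aggregated with the unbiased pass@k estimator introduced with HumanEval. As a minimal sketch (the standard Chen et al. formulation, not necessarily TuRTLe's exact implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n total (c of which pass) is correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a passing sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. with --n_samples 5, if 2 of 5 generations pass the functional check:
print(pass_at_k(5, 2, 1))  # estimated pass@1 = 0.4
```

The same counts feed each metric independently, so a sample can pass syntax (STX) but fail functionality (FNC).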
If you have access to a GPU cluster and want to run local inference with vLLM or perform multi-node inference, see LOCAL_INFERENCE.md for detailed instructions on using SLURM and Singularity.
The process to implement a benchmark is very similar to the one described in the bigcode-evaluation-harness guide. Follow these steps:

- Copy `turtle/tasks/template/new_task.py` into `turtle/tasks/` and rename it after your benchmark: `<benchmark_name>.py`.
- Complete all the TODO comments in the template file.
- Update the `_load_new_modules()` and `_create_extended_registry()` methods within `turtle/src/utils/task_updater.py`.
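Since the template follows the bigcode-evaluation-harness task interface, a new task mostly amounts to building prompts and post-processing generations. The sketch below is hypothetical (class name, field names, and the base class are placeholders; the real signatures live in `turtle/tasks/template/new_task.py`):

```python
import re

class MyBenchmark:  # in the real template this subclasses the harness's Task base class
    """Hypothetical sketch of a TuRTLe task; dataset loading, stop words,
    and metric hooks are defined by the template's TODOs."""

    def get_prompt(self, doc: dict) -> str:
        # Build the generation prompt from one benchmark record.
        # "spec" is an assumed field name for illustration.
        return doc["spec"]

    def postprocess_generation(self, generation: str, idx: int) -> str:
        # Strip chat wrapping and keep only the RTL, a common pattern when
        # models answer inside triple-backtick ``verilog`` fences.
        match = re.search(r"`{3}(?:verilog)?\n(.*?)`{3}", generation, re.DOTALL)
        return match.group(1).strip() if match else generation.strip()
```

The registry update in `task_updater.py` then makes the new `--task` name resolvable from the CLI.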
@inproceedings{garciagasulla2025turtleunifiedevaluationllms,
title={TuRTLe: A Unified Evaluation of LLMs for RTL Generation},
  author={Dario Garcia-Gasulla and Gokcen Kestor and Emanuele Parisi and Miquel Albert{\'\i}-Binimelis and Cristian Gutierrez and Razine Moundir Ghorab and Orlando Montenegro and Bernat Homs and Miquel Moreto},
booktitle = {Proceedings of the 2025 ACM/IEEE International Symposium on Machine Learning for CAD},
  series = {MLCAD '25},
year={2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
location = {Santa Cruz, CA, USA},
url={https://arxiv.org/abs/2504.01986},
}
If you have any inquiries or wish to collaborate: hpai@bsc.es
This work was born as a fork of bigcode-evaluation-harness and vllm-code-harness, and has grown into its own framework for RTL code generation evaluation. We remain grateful to these projects.
We acknowledge the open-source EDA tools: Icarus Verilog, Verilator, Yosys, OpenROAD and LibreLane.
We also thank the authors of the benchmarks integrated in TuRTLe: VerilogEval, RTLLM, VGen, and RTL-Repo.
