DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
- 07/01/2025 - Our paper is now available on arXiv!
- 06/28/2025 - Dataset released on HuggingFace!
- 06/25/2025 - Initial public release of DICE-BENCH including data generation, scoring utilities, and vLLM inference scripts.
- 05/16/2025 - Our paper DICE-BENCH has been accepted to ACL 2025. See you in Vienna, Austria!
DICE-BENCH is a benchmark that tests how well large language models can call external functions in realistic group-chat scenarios.
Key points at a glance:
- DICE-BENCH synthesizes realistic group chats that span up to four dialogue rounds with two to four speakers.
- The released dataset contains 1,607 dialogues and 124 distinct tools.
- DICE-SCORE measures how difficult an input is by quantifying how widely the tool-related clues are dispersed across it; higher scores mean harder inputs (see the toy sketch after this list).
- Even GPT-4o averages only about 64 percent exact match, with performance falling as rounds or participants increase.
- DICE-BENCH is the first benchmark to combine multi-round, multi-party dialogue with inter-tool dependencies, and its code, data, and generation pipeline are fully open.
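The exact DICE-SCORE formula is defined in the paper and implemented in `src/get_dice_score.py`. Purely as an illustration of the dispersion idea, here is a toy sketch; the function name and normalization below are made up and are not the official metric:

```python
# Toy illustration of the intuition behind DICE-SCORE: dialogues whose
# tool-call clues are spread over many turns are harder to resolve.
# This is NOT the paper's formula; see src/get_dice_score.py for the real one.

def toy_dispersion_score(clue_turns: list[int], num_turns: int) -> float:
    """0 when all clues sit in a single turn, 1 when they span every turn."""
    if num_turns <= 1 or not clue_turns:
        return 0.0
    distinct_turns = len(set(clue_turns))
    return (distinct_turns - 1) / (num_turns - 1)

# Example: clues appear in turns 2, 7, and 11 of a 12-turn group chat.
print(toy_dispersion_score([2, 7, 11], num_turns=12))  # ~0.18
```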
| Path | Description |
|---|---|
| `src/` | Core Python package (agents, prompts, utils, graphs, inference) |
| `data/` | Pre-generated sample datasets (`round_*.json`) |
| `scripts/` | Bash helpers to generate data & run inference |
| `outputs/` | Generated outputs (`all_rounds`, `selected_round`, `inf_vllm`). Note: the output files committed here are demo-sized samples only; see the Hugging Face repository for the full dataset. |
| Script | Purpose | Key CLI flags / variables |
|---|---|---|
| `scripts/gen_all_round.sh` | Quickly generate a small dataset across rounds 1–4, multiple agent counts & domains. | `AGENT_NUM_LIST`, `DOMAIN_LIST`, `ROUND_LIST`, `DATASET_NUM`; outputs to `outputs/all_rounds/round_<n>.json` |
| `scripts/gen_selected_round.sh` | Generate many samples for one specific round (`SELECTED_ROUND`). | `DATASET_NUM`, `SELECTED_ROUND`; outputs to `outputs/selected_round/round_<n>.json` |
| `scripts/inf_vllm.sh` | Run vLLM inference over generated dialogues. | `MODEL_NAME`, `FUNCTION_DOCS`, `MAX_TOKENS`; results in `outputs/inf_vllm/<model>/` |
All scripts rely on uv to launch python modules reproducibly (uv run ...). Feel free to edit variables at the top of each file.
The repository ships with a sample dataset under data/sample/ so you can explore the JSON structure without running generation.
data/
├── round_1.json # full dataset (available at Huggingface)
├── round_2.json
├── ...
└── sample/
├── round_1.json # tiny subset (≈2 dialogues) for quick inspection
└── ...
- `round_<n>.json` – gold dialogues used for evaluation (can be regenerated).
- `sample/round_<n>.json` – miniature versions bundled with git to keep the repo lightweight.
The tool graph and function docs used during generation live in src/graph/tool_graph.json and src/graph/tool_docs.json respectively.
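To peek at the JSON structure without committing to a particular schema, here is a minimal sketch (run from the repository root; it prints the field names rather than assuming them, since the exact schema may differ):

```python
# Quick inspection of the bundled sample data and tool graph.
# Run from the repository root; prints keys instead of assuming a schema.
import json
from pathlib import Path

sample = json.loads(Path("data/sample/round_1.json").read_text())
tool_graph = json.loads(Path("src/graph/tool_graph.json").read_text())

print("Sample type:", type(sample).__name__)
if isinstance(sample, list) and sample:
    print("Dialogues in sample:", len(sample))
    print("Per-dialogue keys:", list(sample[0].keys()))
elif isinstance(sample, dict):
    print("Top-level keys:", list(sample.keys()))

print("Tool-graph top-level entries:", list(tool_graph)[:10])
```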
uv is a super-fast Rust-based package manager & virtual-env tool that fully understands pyproject.toml.
If you do not have it yet:
```bash
curl -Ls https://astral.sh/uv/install.sh | bash   # installs to ~/.cargo/bin/uv
```

Create the environment and install all dependencies with a single command:

```bash
# From repository root
uv sync   # creates .venv and installs deps from pyproject.toml
```

Need an extra library? Just do:

```bash
uv add <package-name>
```

Fallback: you can still use plain pip, but all examples below assume uv.
```bash
cd scripts
./gen_all_round.sh        # all rounds, small size (≈ a few minutes)
./gen_selected_round.sh   # generate many samples for a single round
```

Outputs are written under `outputs/all_rounds/` and `outputs/selected_round/` respectively.
```bash
cd scripts
./inf_vllm.sh   # requires CUDA + vLLM installation
```

Results will appear in `outputs/inf_vllm/<model_name>/`.
- Prepare Data - use the generation scripts above or supply your own tool-graph JSON.
- Fine-tune / Inference - leverage `src/inference/inference_vllm.py` for fast decoding.
- Evaluate - employ `src/get_dice_score.py` to calculate the DICE metric.
Detailed configs (model path, dataset size, TP degree, etc.) can be edited directly in each bash script or via CLI flags.
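The headline number reported in the paper is exact-match accuracy over predicted tool calls. Here is a minimal sketch of that check only; the official scoring in `src/get_dice_score.py` may normalize calls differently, and the call strings below are hypothetical examples, not taken from the dataset:

```python
# Toy exact-match accuracy between predicted and gold tool-call strings.
# The official evaluation (src/get_dice_score.py) may apply extra normalization.

def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    if not gold:
        return 0.0
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return hits / len(gold)

# Hypothetical example calls:
preds = ['book_flight(date="2025-07-01")', 'get_weather(city="Vienna")']
golds = ['book_flight(date="2025-07-01")', 'get_weather(city="Wien")']
print(exact_match_accuracy(preds, golds))  # -> 0.5
```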
@misc{jang2025dicebenchevaluatingtoolusecapabilities,
title={DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues},
author={Kyochul Jang and Donghyeon Lee and Kyusik Kim and Dongseok Heo and Taewhoo Lee and Woojeong Kim and Bongwon Suh},
year={2025},
eprint={2506.22853},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.22853},
}

Questions / ideas? Open an issue or email kyochul@snu.ac.kr. Pull requests are welcome!
Please visit kyochul[dot]com for more information about the first author!


