DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
- 07/01/2025 - Our paper is now available on arXiv!
- 06/28/2025 - Dataset released on HuggingFace!
- 06/25/2025 - Initial public release of DICE-BENCH including data generation, scoring utilities, and vLLM inference scripts.
- 05/16/2025 - Our paper DICE-BENCH has been accepted to ACL 2025. See you in Vienna, Austria!
DICE-BENCH is a benchmark that tests how well large language models can call external functions in realistic group-chat scenarios.
Key points at a glance:
- DICE-BENCH synthesizes realistic group chats that span up to four dialogue rounds with two to four speakers.
- The released dataset contains 1,607 dialogues and 124 distinct tools.
- DICE-SCORE measures how difficult an input is by quantifying how widely the tool-related clues are dispersed across it; higher scores mean harder inputs (see the toy sketch after this list).
- Even GPT-4o averages only about 64 percent exact match, with performance falling as rounds or participants increase.
- DICE-BENCH is the first benchmark to combine multi-round, multi-party dialogue with inter-tool dependencies, and its code, data, and generation pipeline are fully open.
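The exact DICE-SCORE formula is defined in the paper and implemented in `src/get_dice_score.py`. Purely as an illustration of the dispersion idea, here is a toy sketch; the function name and normalization below are made up and are not the official metric:

```python
# Toy illustration of the intuition behind DICE-SCORE: dialogues whose
# tool-call clues are spread over many turns are harder to resolve.
# This is NOT the paper's formula; see src/get_dice_score.py for the real one.

def toy_dispersion_score(clue_turns: list[int], num_turns: int) -> float:
    """0 when all clues sit in a single turn, 1 when they span every turn."""
    if num_turns <= 1 or not clue_turns:
        return 0.0
    distinct_turns = len(set(clue_turns))
    return (distinct_turns - 1) / (num_turns - 1)

# Example: clues appear in turns 2, 7, and 11 of a 12-turn group chat.
print(toy_dispersion_score([2, 7, 11], num_turns=12))  # ~0.18
```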
| Path | Description |
|---|---|
| `src/` | Core Python package (agents, prompts, utils, graphs, inference) |
| `data/` | Pre-generated sample datasets (`round_*.json`) |
| `scripts/` | Bash helpers to generate data & run inference |
| `outputs/` | Generated outputs (`all_rounds`, `selected_round`, `inf_vllm`). Note: the output files committed here are demo-sized samples only; see the Hugging Face repository for the full dataset. |
| Script | Purpose | Key CLI flags / variables |
|---|---|---|
| `scripts/gen_all_round.sh` | Quickly generate a small dataset across rounds 1–4, multiple agent counts & domains. | `AGENT_NUM_LIST`, `DOMAIN_LIST`, `ROUND_LIST`, `DATASET_NUM`; outputs to `outputs/all_rounds/round_<n>.json` |
| `scripts/gen_selected_round.sh` | Generate many samples for one specific round (`SELECTED_ROUND`). | `DATASET_NUM`, `SELECTED_ROUND`; outputs to `outputs/selected_round/round_<n>.json` |
| `scripts/inf_vllm.sh` | Run vLLM inference over generated dialogues. | `MODEL_NAME`, `FUNCTION_DOCS`, `MAX_TOKENS`; results in `outputs/inf_vllm/<model>/` |
All scripts rely on uv to launch python modules reproducibly (uv run ...). Feel free to edit variables at the top of each file.
The repository ships with a sample dataset under data/sample/ so you can explore the JSON structure without running generation.
data/
├── round_1.json # full dataset (available at Huggingface)
├── round_2.json
├── ...
└── sample/
├── round_1.json # tiny subset (≈2 dialogues) for quick inspection
└── ...
- `round_<n>.json` – gold dialogues used for evaluation (can be regenerated).
- `sample/round_<n>.json` – miniature versions bundled with git to keep the repo lightweight.
The tool graph and function docs used during generation live in src/graph/tool_graph.json and src/graph/tool_docs.json respectively.
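To peek at the JSON structure without committing to a particular schema, here is a minimal sketch (run from the repository root; it prints the field names rather than assuming them, since the exact schema may differ):

```python
# Quick inspection of the bundled sample data and tool graph.
# Run from the repository root; prints keys instead of assuming a schema.
import json
from pathlib import Path

sample = json.loads(Path("data/sample/round_1.json").read_text())
tool_graph = json.loads(Path("src/graph/tool_graph.json").read_text())

print("Sample type:", type(sample).__name__)
if isinstance(sample, list) and sample:
    print("Dialogues in sample:", len(sample))
    print("Per-dialogue keys:", list(sample[0].keys()))
elif isinstance(sample, dict):
    print("Top-level keys:", list(sample.keys()))

print("Tool-graph top-level entries:", list(tool_graph)[:10])
```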
uv is a super-fast Rust-based package manager & virtual-env tool that fully understands pyproject.toml.
If you do not have it yet:
```bash
curl -Ls https://astral.sh/uv/install.sh | bash   # installs to ~/.cargo/bin/uv
```

Create the environment and install all dependencies with a single command:

```bash
# From repository root
uv sync   # creates .venv and installs deps from pyproject.toml
```

Need an extra library? Just do:

```bash
uv add <package-name>
```

Fallback: you can still use plain pip, but all examples below assume uv.
```bash
cd scripts
./gen_all_round.sh        # all rounds, small size (≈ a few minutes)
./gen_selected_round.sh   # generate many samples for a single round
```

Outputs are written under `outputs/all_rounds/` and `outputs/selected_round/` respectively.
```bash
cd scripts
./inf_vllm.sh   # requires CUDA + vLLM installation
```

Results will appear in `outputs/inf_vllm/<model_name>/`.
- Prepare Data - use the generation scripts above or supply your own tool-graph JSON.
- Fine-tune / Inference - leverage `src/inference/inference_vllm.py` for fast decoding.
- Evaluate - employ `src/get_dice_score.py` to calculate the DICE metric.
Detailed configs (model path, dataset size, TP degree, etc.) can be edited directly in each bash script or via CLI flags.
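The headline number reported in the paper is exact-match accuracy over predicted tool calls. Here is a minimal sketch of that check only; the official scoring in `src/get_dice_score.py` may normalize calls differently, and the call strings below are hypothetical examples, not taken from the dataset:

```python
# Toy exact-match accuracy between predicted and gold tool-call strings.
# The official evaluation (src/get_dice_score.py) may apply extra normalization.

def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    if not gold:
        return 0.0
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return hits / len(gold)

# Hypothetical example calls:
preds = ['book_flight(date="2025-07-01")', 'get_weather(city="Vienna")']
golds = ['book_flight(date="2025-07-01")', 'get_weather(city="Wien")']
print(exact_match_accuracy(preds, golds))  # -> 0.5
```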
@misc{jang2025dicebenchevaluatingtoolusecapabilities,
title={DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues},
author={Kyochul Jang and Donghyeon Lee and Kyusik Kim and Dongseok Heo and Taewhoo Lee and Woojeong Kim and Bongwon Suh},
year={2025},
eprint={2506.22853},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.22853},
}

Questions / ideas? Open an issue or email kyochul@snu.ac.kr. Pull requests are welcome!
Please visit kyochul[dot]com for more information about the first author!


