TRUEBench (v0.6.1)

TRUEBench is a novel benchmark designed to evaluate LLM-based productivity assistants in real-world scenarios. It addresses limitations in existing benchmarks by incorporating 12 languages, nuanced implicit constraints, and dynamic multi-turn dialogues with context switches. TRUEBench evaluates LLMs across 10 categories.

*We will continue to update TRUEBench as new language models are released.

Prerequisites

  • Install Python 3.10.12 or later
  • Install uv
    pip install uv
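
You can verify both prerequisites from the shell:

    # Check that a suitable interpreter and uv are on PATH
    python3 --version   # should report 3.10.12 or later
    uv --version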

Quick Start

Set up

  • Clone & enter the project:
    git clone https://github.com/SamsungLabs/TRUEBench.git
    cd TRUEBench
  • Create and activate a virtual environment:
    uv venv .venv
    source .venv/bin/activate
  • Install dependencies:
    uv pip sync requirements.lock

Prepare configuration files

Refer to the Inference Adaptor Configuration Guide.
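
The exact schema is covered in the guide; a practical starting point is to copy one of the bundled example configs and adapt it to your model and credentials (the target filename below is just an example, not a required name):

    # Start from a bundled example config and edit it for your setup
    cp configs/vllm-Qwen3-8B.json configs/my-model.json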

Inference

Run inference on the dataset with:

python inference.py --config {config_filename} --inference_adaptor {vllm/openai/vertexai/anthropic_vertexai} --dataset_path {dataset_path}
  1. inference_adaptor: Selects an inference adaptor. Supported options are vllm, openai, vertexai, and anthropic_vertexai (see the example invocation after this list).
  • vllm: Local inference for open models (example configs: "configs/vllm-Qwen3-8B.json", "configs/vllm-Qwen3-8B-think.json").
  • openai: OpenAI API (example configs: "configs/azure_openai-gpt-o3.json", "configs/azure_openai-gpt-5.json", "configs/openai-Qwen3-32B.json", "configs/openai-Qwen3-32B-think.json").
  • vertexai: Vertex AI API (example configs: "configs/vertexai-gemini-2.5-flash.json", "configs/vertexai-gemini-3-pro-preview.json").
  • anthropic_vertexai: Anthropic on Vertex AI API (example configs: "configs/anthropic_vertexai-claude-haiku-4.5.json", "configs/anthropic_vertexai-claude-haiku-4.5-think.json").
  2. config: Path to a model configuration file in the "configs/" folder.
  3. dataset_path: Path to the evaluation dataset.
  4. sample_cnt: Number of samples to run inference on (by default, inference runs on all test cases).
  5. output_path: Output directory; results are written to "{output_path}/{config_name}_{dataset_name}.jsonl" (default: "results").
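
For example, the following runs local inference with the bundled Qwen3-8B vLLM config; the dataset path is left as a placeholder for your actual evaluation dataset:

    # Local vLLM inference; output is written to
    # {output_path}/{config_name}_{dataset_name}.jsonl (default output_path: "results")
    python inference.py \
      --config configs/vllm-Qwen3-8B.json \
      --inference_adaptor vllm \
      --dataset_path {dataset_path}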

Judge

Judge inference results with:

python judge.py --config {config_filename} --eval_file {eval_filename} --output_path {output_path}
  1. config: Path to a judge model configuration file in the "configs/" folder. The judge model must use the openai adaptor.
  2. eval_file: Model output file to evaluate.
  3. output_path: Folder to save evaluation results (default output path is "eval_results").

We recommend gpt-5 (2025-08-07) with default sampling parameters as the judge model.
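
For example, assuming the bundled "configs/azure_openai-gpt-5.json" config points at the recommended judge model, a judge run might look like this (the eval file name follows the inference output pattern above):

    # Evaluate an inference result file with an openai-adaptor judge;
    # evaluation results are saved under "eval_results" by default
    python judge.py \
      --config configs/azure_openai-gpt-5.json \
      --eval_file results/{config_name}_{dataset_name}.jsonl \
      --output_path eval_results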

Get Scores

Get scores from eval_results with:

python get_score.py --target_dir eval_results 
  1. target_dir: Directory containing evaluation results (default: eval_results). Outputs stats.csv and stats_lang.csv in the target directory.
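
Putting the three steps together, a minimal end-to-end sketch looks like this (the dataset path and file-name placeholders are illustrative, not fixed values):

    # 1) Inference: generate model outputs for the benchmark
    python inference.py --config configs/vllm-Qwen3-8B.json --inference_adaptor vllm --dataset_path {dataset_path}
    # 2) Judge: score the generated outputs with an openai-adaptor judge model
    python judge.py --config configs/azure_openai-gpt-5.json --eval_file results/{config_name}_{dataset_name}.jsonl
    # 3) Aggregate: produce stats.csv and stats_lang.csv in eval_results/
    python get_score.py --target_dir eval_results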