To facilitate generalized evaluation of various Omni benchmarks, we have constructed a lightweight Omni evaluation framework and released a high-performance scoring model to support it. You can freely and easily add new datasets or evaluation models based on this framework. Below, we will use UNO-Bench and Qwen-2.5-Omni-7B as examples to demonstrate how to run the framework.
Before running, please ensure the following Python core dependencies are installed. Note: Since vLLM installation involves PyTorch, CUDA, and other complex dependencies, it is recommended to set up the environment in a fresh virtual environment to avoid potential conflicts.
```bash
pip install -r requirements.txt
```

Download the necessary models and datasets using the following commands:
```bash
huggingface-cli download meituan-longcat/UNO-Bench --repo-type dataset --local-dir /path/to/UNO-Bench
huggingface-cli download AGI-Eval/UNO-Scorer-Qwen3-14B --local-dir /path/to/UNO-Scorer
huggingface-cli download Qwen/Qwen2.5-Omni-7B --local-dir /path/to/Qwen2.5-Omni
```

By executing the following command, you can reproduce the experimental results of Qwen-2.5-Omni-7B presented in the paper. Remember to replace MODEL_PATH, DATASET_LOCAL_DIR, and SCORER_MODEL_PATH with your local paths.
```bash
bash examples/run_unobench_qwen_omni_hf.sh
```

For better performance, we recommend running the vLLM version of the inference service:
```bash
bash examples/run_unobench_qwen_omni_vllm.sh
```

- The program runs the evaluation sequentially, in the following order: Start Inference Service -> Generate Results -> Release Resources -> Start Scoring Service -> Calculate Scores -> Release Resources.
- It supports resuming from checkpoints: both inference progress and scoring progress are saved locally at regular intervals.
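The checkpoint-resume behavior can be sketched as follows. This is an illustrative assumption, not the framework's actual implementation: the checkpoint filename, JSON format, and save interval are all hypothetical.

```python
import json
import os

CHECKPOINT_FILE = "inference_progress.json"  # hypothetical checkpoint filename

def run_with_checkpointing(items, process, save_every=10):
    """Process items sequentially, skipping any already completed
    and flushing progress to a local file at regular intervals."""
    done = {}
    if os.path.exists(CHECKPOINT_FILE):  # resume from a previous run
        with open(CHECKPOINT_FILE) as f:
            done = json.load(f)
    for i, item in enumerate(items):
        key = str(i)
        if key in done:              # already processed; skip on resume
            continue
        done[key] = process(item)
        if (i + 1) % save_every == 0:  # periodic local save
            with open(CHECKPOINT_FILE, "w") as f:
                json.dump(done, f)
    with open(CHECKPOINT_FILE, "w") as f:  # final save
        json.dump(done, f)
    return done
```

If the process is interrupted, re-running it loads the saved progress and only the unfinished items are processed.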
To reproduce the fitting curve of the Compositional Law, you can refer to the following code:
```bash
python3 compositional_law.py
```

We recommend using vLLM for higher efficiency. You can refer to:
```bash
bash examples/test_scorer_vllm.sh
```

Or use the transformers-based approach, which is less efficient:

```bash
python3 examples/test_scorer_hf.py
```

Before running, you must modify the configuration section at the top of `run_unobench_qwen_omni_*.sh` to adapt it to your environment.
| Variable Name | Description | Example |
|---|---|---|
| `MODEL_NAME` | Model registration name (corresponds to the name defined in the `models` code) | `"Qwen-2.5-Omni-7B"`, `"VLLMClient"` |
| `MODEL_PATH` | Local absolute path to the model weights | `/path/to/Qwen2.5-Omni` |
| `INFERENCE_BACKEND` | Inference backend selection: `"vllm"` or `"hf"` | `"vllm"` |
| `TARGET_GPU_IDS` | GPU IDs used for the inference stage | `"0,1"` |
| `TARGET_TP_SIZE` | Tensor parallelism size for the inference model | `2` |
| `TARGET_PORT` | vLLM service port | `8000` |
| Variable Name | Description | Example |
|---|---|---|
| `SCORER_MODEL_PATH` | Path to the scoring model (e.g., UNO-Scorer) | `/path/to/UNO-Scorer` |
| `SCORER_GPU_IDS` | GPU IDs used for the scoring stage | `"0,1"` |
| `SCORER_PORT` | vLLM service port for the scorer | `8001` |
| Variable Name | Description |
|---|---|
| `DATASET_NAME` | Evaluation dataset name (e.g., `"UNO-Bench"`) |
| `HF_CACHE_DIR` | HuggingFace cache or multimedia data directory; automatically downloaded datasets are saved here |
| `DATASET_LOCAL_DIR` | Local path for the dataset. The program reads from `DATASET_LOCAL_DIR` first; otherwise it automatically downloads to `HF_CACHE_DIR` |
| `EXP_MARKING` | Experiment marking suffix (e.g., `_20251024`), used to distinguish experimental settings and output filenames |
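Putting the three tables together, the configuration section might look like the following. All values are illustrative placeholders (in particular, the `HF_CACHE_DIR` default is an assumption); edit them to match your setup.

```shell
# --- Target model ---
MODEL_NAME="Qwen-2.5-Omni-7B"
MODEL_PATH="/path/to/Qwen2.5-Omni"
INFERENCE_BACKEND="vllm"            # "vllm" or "hf"
TARGET_GPU_IDS="0,1"
TARGET_TP_SIZE=2
TARGET_PORT=8000

# --- Scorer ---
SCORER_MODEL_PATH="/path/to/UNO-Scorer"
SCORER_GPU_IDS="0,1"
SCORER_PORT=8001

# --- Dataset & experiment ---
DATASET_NAME="UNO-Bench"
HF_CACHE_DIR="$HOME/.cache/huggingface"   # assumed default location
DATASET_LOCAL_DIR="/path/to/UNO-Bench"
EXP_MARKING="_20251024"
```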
After configuration, grant execution permissions to the script and run it:
```bash
chmod +x run_eval.sh
bash run_eval.sh
```

- Stage 1: Inference
  - If `vllm` mode is selected, the script starts the target model's API Server in the background.
  - Runs `eval.py --mode inference` to perform data inference.
  - Key Step: After inference is complete, the script automatically kills the target model's vLLM process to fully release GPU memory.
- Stage 2: Scorer Setup
  - Starts the scoring model's (Scorer) vLLM service in the background.
- Stage 3: Evaluation (Scoring)
  - Runs `eval.py --mode scoring` to send the generated results to the scoring model for evaluation.
- Cleanup
  - Upon task completion, automatically shuts down the scoring model service.
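The stages above can be sketched roughly as follows. This is an illustrative outline only, not the actual contents of `run_eval.sh`; it assumes the vLLM backend is served via the standard `vllm serve` CLI.

```shell
# Illustrative outline only -- not the actual run_eval.sh.
# Stage 1: start the target model, run inference, then free the GPUs.
CUDA_VISIBLE_DEVICES="$TARGET_GPU_IDS" vllm serve "$MODEL_PATH" \
    --port "$TARGET_PORT" --tensor-parallel-size "$TARGET_TP_SIZE" &
TARGET_PID=$!
python3 eval.py --mode inference
kill "$TARGET_PID"            # key step: fully release GPU memory

# Stages 2-3: start the scorer service, then run scoring.
CUDA_VISIBLE_DEVICES="$SCORER_GPU_IDS" vllm serve "$SCORER_MODEL_PATH" \
    --port "$SCORER_PORT" &
SCORER_PID=$!
python3 eval.py --mode scoring

# Cleanup: shut down the scoring model service.
kill "$SCORER_PID"
```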
Evaluation results will be generated as JSON files, saved by default in the ./eval_results/ directory.
- Filename Format: `{MODEL_NAME}{EXP_MARKING}:{DATASET_NAME}.json`
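For example, with the example values used earlier in this guide, the output filename is assembled as follows (a quick sketch in Python):

```python
model_name = "Qwen-2.5-Omni-7B"  # MODEL_NAME
exp_marking = "_20251024"        # EXP_MARKING
dataset_name = "UNO-Bench"       # DATASET_NAME

# {MODEL_NAME}{EXP_MARKING}:{DATASET_NAME}.json
filename = f"{model_name}{exp_marking}:{dataset_name}.json"
print(filename)  # Qwen-2.5-Omni-7B_20251024:UNO-Bench.json
```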
```
.
├── run_eval.sh    # [Main Program] Manages config parameters, service lifecycle, and flow control
├── eval.py        # [Execution Script] Handles data loading, API interaction, and result storage
├── utils/         # [Dependencies] General utility functions
├── models/        # [Dependencies] Model registration and loading
└── benchmarks/    # [Dependencies] Dataset registration and loading
```
The project is mainly divided into benchmarks (evaluation sets) and evaluation models. You can register new datasets in `benchmarks/` and new models in `models/`.
- Create a new dataset `.py` file in `benchmarks/`, such as `unobench.py`. Inherit from the `BaseDataset` class and implement the abstract methods:
  - `load_and_prepare`: Download and load the dataset, organizing each item into the `utils.EvaluationRecord` format.
  - `build_message`: Construct the message sent to the model side (OpenAI Chat Message format).
  - `build_score_message`: Construct the message sent to the scoring model (OpenAI Chat Message format).
  - `compute_score`: Calculate the score for a single data item.
  - `compute_metrics`: Calculate metrics for the entire dataset.
- Register the dataset in `__init__.py`.
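A new dataset might look like the following minimal sketch. The method signatures, the record fields, and the scoring heuristic are all assumptions for illustration: a stand-in dataclass replaces `utils.EvaluationRecord`, and the real class should inherit from `BaseDataset` instead of standing alone.

```python
from dataclasses import dataclass

@dataclass
class EvaluationRecord:  # stand-in for utils.EvaluationRecord (fields assumed)
    question: str
    answer: str

class MyBenchDataset:  # in the real framework, inherit from BaseDataset
    def load_and_prepare(self):
        # Download/load the raw data and wrap each item as an EvaluationRecord.
        return [EvaluationRecord(question="What is 2+2?", answer="4")]

    def build_message(self, record):
        # OpenAI Chat Message sent to the model under evaluation.
        return [{"role": "user", "content": record.question}]

    def build_score_message(self, record, model_output):
        # OpenAI Chat Message sent to the scoring model.
        prompt = (f"Question: {record.question}\n"
                  f"Reference answer: {record.answer}\n"
                  f"Model output: {model_output}")
        return [{"role": "user", "content": prompt}]

    def compute_score(self, scorer_output):
        # Parse the scorer's verdict for a single item (heuristic is assumed).
        return 1.0 if "correct" in scorer_output.lower() else 0.0

    def compute_metrics(self, scores):
        # Aggregate per-item scores into dataset-level metrics.
        return {"accuracy": sum(scores) / len(scores)}
```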
- Create a new model `.py` file in `models/`, such as `qwen_2d5_omni_7b.py`. Inherit from the `BaseModel` class and implement the abstract methods:
  - `load_model`: Load the model.
  - `generate`: Call the model interface once to generate text.
  - `generate_batch`: Batch-call the model interface to generate text.
- Register the model in `__init__.py`.
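A new model wrapper might look like the following minimal sketch. The method signatures are assumptions for illustration, the generation logic is a stub, and the real class should inherit from `BaseModel` instead of standing alone.

```python
class MyOmniModel:  # in the real framework, inherit from BaseModel
    def load_model(self, model_path):
        # Load local weights or open a client to the serving endpoint.
        self.model_path = model_path

    def generate(self, messages):
        # One chat-completion call; a real implementation would call the
        # HF pipeline or a vLLM OpenAI-compatible endpoint here (stubbed).
        return f"[stub reply to: {messages[-1]['content']}]"

    def generate_batch(self, batch_of_messages):
        # Batched calls; a vLLM backend could issue these concurrently.
        return [self.generate(messages) for messages in batch_of_messages]
```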
- Path Check: Please ensure that the paths in the script have been modified to match the actual paths on your server.