Problem
When running benchmarks with the vLLM backend, the evaluation crashes with the following error if max_gen_toks exceeds the model's maximum sequence length:
ValueError: please provide at least one prompt
ERROR: Engine core proc EngineCore_0 died unexpectedly, shutting down client.
This does not occur with the HuggingFace (hf) backend.
Root Cause Analysis
The vLLM integration in lm-evaluation-harness calculates the available prompt space as:
available_prompt_space = max_model_len - max_gen_toks
When max_gen_toks >= max_model_len, this results in available_prompt_space <= 0, so the prompt is truncated to empty. vLLM then raises a ValueError because there is no prompt to generate from.
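A minimal sketch of the failure mode (variable names follow the harness, but the truncation shown here is a simplification, not the exact lm-evaluation-harness code):

max_model_len = 4096    # model context window
max_gen_toks = 32768    # benchmark default max_tokens

available_prompt_space = max_model_len - max_gen_toks  # -28672, i.e. <= 0
prompt_tokens = list(range(1500))                      # any tokenized prompt

# Left-truncating with a non-positive budget empties the prompt entirely:
truncated = prompt_tokens[-available_prompt_space:] if available_prompt_space > 0 else []
assert truncated == []  # vLLM then fails with "please provide at least one prompt"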
Environment
- evalchemy
- vLLM: 0.10.1.1
Reproduction Steps
# Using ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3 (4096 context window)
# with MATH500 benchmark (32768 default max_tokens)
# This crashes with vLLM:
python -m eval.eval --model vllm \
--tasks MATH500 \
--model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"
# This works with HuggingFace:
python -m eval.eval --model hf \
--tasks MATH500 \
--model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"Affected Benchmarks
The issue occurs whenever a benchmark's default max_tokens exceeds the model's context window. Benchmarks I tested and confirmed to fail with their default settings:
- AIME24
- AIME25
- AMC23
- MATH500
- LiveCodeBench
- GPQADiamond
- JEEBench
This is not an exhaustive list: any benchmark can be affected if max_tokens (whether the default or a value passed via the --max_tokens argument) exceeds the model's context window.
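As a stopgap, explicitly capping the budget below the model's context window via the --max_tokens argument mentioned above avoids the crash (an untested command sketch, mirroring the reproduction steps):

python -m eval.eval --model vllm \
--tasks MATH500 \
--max_tokens 2048 \
--model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"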
Expected Behavior
The evaluation should gracefully cap max_gen_toks to fit within the available context window instead of crashing.
Proposed Solution
Dynamically cap max_gen_toks per prompt, based on the actual prompt length, in _normalize_model_args:
max_allowed = max_model_len - prompt_length - 16  # reserve a 16-token safety buffer
capped_max_new_tokens = min(max_new_tokens, max(1, max_allowed))  # always request at least 1 token
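A self-contained sketch of the proposed capping (the helper name and signature are illustrative, not the actual harness API):

def cap_max_gen_toks(prompt_length: int, max_gen_toks: int,
                     max_model_len: int, safety_buffer: int = 16) -> int:
    # Return a generation budget that fits the context window, reserving a
    # small safety buffer and guaranteeing at least 1 token so vLLM never
    # receives an empty or impossible request.
    max_allowed = max_model_len - prompt_length - safety_buffer
    return min(max_gen_toks, max(1, max_allowed))

# With the numbers from this issue (4096-token context, a 1500-token prompt,
# 32768 default max_tokens), the budget is capped at 4096 - 1500 - 16 = 2580:
print(cap_max_gen_toks(1500, 32768, 4096))  # 2580

With this cap, over-long requests degrade to shorter generations rather than killing the engine core, matching the non-crashing behavior observed with the hf backend.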