
vLLM backend crashes with "please provide at least one prompt" when max_gen_toks exceeds model context window #152

@Ali-Elganzory

Description

Problem

When running benchmarks with the vLLM backend, the evaluation crashes with the following error whenever max_gen_toks exceeds the model's maximum sequence length:

ValueError: please provide at least one prompt
ERROR: Engine core proc EngineCore_0 died unexpectedly, shutting down client.

This does not occur with the HuggingFace (hf) backend.

Root Cause Analysis

The vLLM integration in lm-evaluation-harness calculates available prompt space as:

available_prompt_space = max_model_len - max_gen_toks

When max_gen_toks >= max_model_len, this results in available_prompt_space <= 0, so the prompt is truncated to an empty sequence. vLLM then raises a ValueError because there is no prompt left to generate from.
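
For illustration, here is a minimal sketch of how a non-positive budget ends up sending vLLM an empty prompt (the variable names below are hypothetical stand-ins, not the harness's actual identifiers):

max_model_len = 4096          # model context window
max_gen_toks = 32768          # benchmark default max_tokens
available_prompt_space = max_model_len - max_gen_toks   # -28672, i.e. <= 0

prompt_token_ids = list(range(500))   # stand-in for a tokenized prompt
# Keeping only the last `available_prompt_space` tokens keeps nothing once the budget is <= 0
truncated = prompt_token_ids[-available_prompt_space:] if available_prompt_space > 0 else []
assert truncated == []   # vLLM then rejects the request: "please provide at least one prompt"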

Environment

  • evalchemy
  • vLLM: 0.10.1.1

Reproduction Steps

# Using ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3 (4096 context window)
# with MATH500 benchmark (32768 default max_tokens)

# This crashes with vLLM:
python -m eval.eval --model vllm \
  --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

# This works with HuggingFace:
python -m eval.eval --model hf \
  --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

Affected Benchmarks

The issue occurs whenever a benchmark's default max_tokens exceeds the model's context window. Benchmarks I tested and confirmed to fail with their default settings:

  • AIME24
  • AIME25
  • AMC23
  • MATH500
  • LiveCodeBench
  • GPQADiamond
  • JEEBench

This is not an exhaustive list. Any benchmark can be affected if max_tokens (default or via --max_tokens argument) exceeds the model's context window.
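
As a possible workaround until a cap is implemented, explicitly passing a smaller value via the --max_tokens argument mentioned above (exact flag placement is an assumption here) keeps max_gen_toks below the context window, e.g.:

python -m eval.eval --model vllm \
  --tasks MATH500 \
  --max_tokens 3072 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"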

Expected Behavior

The evaluation should gracefully cap max_gen_toks to fit within the available context window instead of crashing.

Proposed Solution

Dynamically cap max_gen_toks per prompt, based on the actual prompt length, in _normalize_model_args:

max_allowed = max_model_len - prompt_length - 16  # 16 token safety buffer
capped_max_new_tokens = min(max_new_tokens, max(1, max_allowed))
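
A slightly fuller sketch of the same idea (the helper name and the example prompt length are assumptions; the issue proposes hooking this logic into _normalize_model_args):

def cap_max_gen_toks(max_model_len: int, prompt_length: int,
                     requested_max_new_tokens: int, safety_buffer: int = 16) -> int:
    """Cap the generation budget so prompt + output always fit the context window."""
    max_allowed = max_model_len - prompt_length - safety_buffer
    return min(requested_max_new_tokens, max(1, max_allowed))

# With the reproduction setup (4096-token context, 32768-token benchmark default):
print(cap_max_gen_toks(max_model_len=4096, prompt_length=900, requested_max_new_tokens=32768))
# -> 3180: the request is capped per prompt instead of producing an empty prompt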
