Closed
Commits
57 commits
33173b0
decouple base engine client logic from engine client
sagearc Feb 24, 2026
ef863e0
split openapi serving class
sagearc Feb 24, 2026
cc71837
render client
sagearc Feb 25, 2026
a762964
remove inheritance between engine client and render client
sagearc Feb 25, 2026
e331104
remove OpenAIServingInference
sagearc Feb 25, 2026
559ba40
Split EngineClient into RendererClient + EngineClient sibling ABCs
sagearc Feb 25, 2026
f38e0bc
AsyncRenderer
sagearc Feb 25, 2026
6d46f31
Merge branch 'main' into base-engine-client
sagearc Feb 25, 2026
4afc924
fix after merge
sagearc Feb 25, 2026
cb3d2a8
revert disable_frontend_multiprocessing
sagearc Feb 25, 2026
7d267c2
revert disable_frontend_multiprocessing flag in benchmarks
sagearc Feb 25, 2026
9fb01a7
fix get_world_size
sagearc Mar 2, 2026
0e8de93
merge main
sagearc Mar 2, 2026
f516e9b
asyncllm backward compatibility
sagearc Mar 2, 2026
b976c3c
render client optional arg in init_app_state
sagearc Mar 2, 2026
2adc96c
fix io_processor init
sagearc Mar 2, 2026
8dc756f
Add co-author
sagearc Mar 2, 2026
1952ce9
build_async_clients_from_engine_args
sagearc Mar 3, 2026
b8401cd
add regression test (#35834)
hallerite Mar 3, 2026
fd02984
decouple AsyncRenderer to a separate file
sagearc Mar 3, 2026
8d3f4ee
import fix
sagearc Mar 3, 2026
4beebfd
[CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 (#…
simonreginis Mar 3, 2026
ad9d09e
[Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not…
tdoublep Mar 3, 2026
fd4a90f
[CI] And PPL test for Qwen3.5. (#35853)
noooop Mar 3, 2026
440f0e7
[Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict (#35754)
bigPYJ1151 Mar 3, 2026
ea46397
[Frontend][1/n] Improve pooling entrypoints | classify. (#35604)
noooop Mar 3, 2026
fb7fdc4
[ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Op…
tjtanaa Mar 3, 2026
28ef9ba
[BugFix] Add support for MTP num_speculative_tokens > 1 with sparse M…
LucasWilkinson Mar 3, 2026
e05cb3b
TRTLLM gen-full attn Test Coverage (#34986)
ojhaanshika Mar 3, 2026
ae88468
fix: Ensure invalid audio files return 400 error (#34715)
jasonozuzu-cohere Mar 3, 2026
8e1fd5b
[CI] Bump `num_speculative_tokens` to 3 in nightly DeepSeek tests (#3…
MatthewBonanni Mar 3, 2026
881a6b0
[CI] Temporarily Disable Llama4 MoE Refactor Test (#35870)
robertgshaw2-redhat Mar 3, 2026
97995f6
[MoE Refactor] Create MK for TRTLLM Kernels (#32564)
robertgshaw2-redhat Mar 3, 2026
3a8eef5
[ROCm][Bugfix]: Disable AITER Triton ROPE by default (#35601)
Rohan138 Mar 3, 2026
e721300
[ROCm][CI] Fix TP size issue for `test_gpt_oss` (#35887)
micah-wil Mar 3, 2026
a9b8b13
[Bugfix] Fix misnamed parameter in compressed_tensors_moe.py (#35813)
bnellnm Mar 3, 2026
467886a
[Model Runner V2] Fix inputs_embeds=None bug for MM models (#35917)
WoosukKwon Mar 3, 2026
12b38c0
[CI/Build] Allow mounting AWS credentials for sccache S3 auth (#35912)
amrmahdi Mar 3, 2026
97286a2
[Model Runner V2] support dp & ep for spec decoding (#35294)
izhuhaoran Mar 3, 2026
d15c3b9
[Core] Move save_tensorized_model logic to Worker (#35825)
njhill Mar 3, 2026
f22ff29
[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline D…
jaewonlee-fb Mar 4, 2026
f7da9cd
[ROCm][CI] Support async weight transfer example with platform-aware …
AndreasKaratzas Mar 4, 2026
9a9d442
Enable bnb for multiple indices weight (#35838)
flutist Mar 4, 2026
70c73df
[Bugfix] Fix EVS implementation for Qwen3 VL (#33607)
2ez4bz Mar 4, 2026
77e6dcb
[PluggableLayer][MM] Add PluggableLayer for RelPosAttention (#33753)
shen-shanshan Mar 4, 2026
c1d9634
[model] support FireRedASR2 (#35727)
AllenDou Mar 4, 2026
6e9f21e
[Chore] Remove debug code in model implementation (#35883)
Isotr0py Mar 4, 2026
e379396
[Refactor] Clean up processor kwargs extraction (#35872)
DarkLight1337 Mar 4, 2026
edba150
[Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions (…
AndreasKaratzas Mar 4, 2026
3c85cd9
[Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) (#35913)
charlifu Mar 4, 2026
7cdba98
[BugFix] Support tool_choice=none in the Anthropic API (#35835)
ZhongsJie Mar 4, 2026
097eb54
[Bugfix] Improve engine ready timeout error message (#35616)
lailoo Mar 4, 2026
9e0f44b
[cohere][fix][spec-decode]: fix crash when allowed_token_ids is set w…
kkt-cohere Mar 4, 2026
5d199ac
Support Audio Extraction from MP4 Video for Nemotron Nano VL (#35539)
askliar Mar 4, 2026
4bb967b
Merge branch 'main' into base-engine-client
sagearc Mar 4, 2026
8f6e033
merge 34551 and resolve conflicts
sagearc Mar 4, 2026
0acf167
make engine client optional for OpenAIServing
sagearc Mar 4, 2026
1 change: 0 additions & 1 deletion .buildkite/lm-eval-harness/configs/models-large-rocm.txt
@@ -1,2 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml
Qwen3-235B-A22B-Instruct-2507-FP8.yaml
51 changes: 51 additions & 0 deletions .buildkite/performance-benchmarks/tests/latency-tests-hpu.json
@@ -51,5 +51,56 @@
"max-model-len": 256,
"async-scheduling": ""
}
},
{
"test_name": "latency_deepseek_r1",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "deepseek-ai/DeepSeek-R1",
"tensor_parallel_size": 8,
"load_format": "dummy",
"max-model-len": 2048,
"dtype": "bfloat16"
}
},
{
"test_name": "latency_llama4_maverick_17b128e_instruct_fp8",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"tensor_parallel_size": 8,
"max-model-len": 512,
"max-num-seqs": 128,
"async-scheduling": "",
"gpu-memory-utilization": 0.95,
"enable_expert_parallel": ""
}
},
{
"test_name": "latency_qwen3_8b",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "Qwen/Qwen3-8B",
"tensor_parallel_size": 1,
"max-model-len": 2048,
"max-num-seqs": 128,
"dtype": "bfloat16",
"async-scheduling": ""
}
}
]
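Each latency test entry pairs an `environment_variables` map with a `parameters` map that the benchmark harness turns into a CLI invocation; note that keys appear both underscore-style (`tensor_parallel_size`) and hyphen-style (`max-model-len`), and an empty-string value denotes a boolean switch. A minimal sketch of that translation, assuming a hypothetical helper `build_latency_cmd` (the real harness lives in the benchmark scripts, not shown in this diff):

```python
import json

def build_latency_cmd(test: dict) -> tuple[dict, list[str]]:
    """Translate one test entry into (env, argv).

    Hypothetical helper sketching what the benchmark runner does with
    these JSON entries; the actual runner may differ.
    """
    # Environment variables are exported as strings.
    env = {k: str(v) for k, v in test.get("environment_variables", {}).items()}
    argv = ["vllm", "bench", "latency"]
    for key, value in test["parameters"].items():
        flag = "--" + key.replace("_", "-")
        if value == "":
            # Empty string marks a boolean switch, e.g. "async-scheduling": ""
            argv.append(flag)
        else:
            argv.extend([flag, str(value)])
    return env, argv

entry = json.loads("""
{
  "test_name": "latency_qwen3_8b",
  "environment_variables": {"PT_HPU_LAZY_MODE": 1},
  "parameters": {"model": "Qwen/Qwen3-8B", "tensor_parallel_size": 1,
                 "max-model-len": 2048, "async-scheduling": ""}
}
""")
env, argv = build_latency_cmd(entry)
print(env)
print(" ".join(argv))
```

This also illustrates why the mixed underscore/hyphen spellings in the JSON are harmless: both normalize to the same hyphenated CLI flag.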
79 changes: 79 additions & 0 deletions .buildkite/performance-benchmarks/tests/serving-tests-hpu.json
@@ -78,5 +78,84 @@
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_deepseek_r1",
"qps_list": [1, 4, 16, "inf"],
"server_environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"server_parameters": {
"model": "deepseek-ai/DeepSeek-R1",
"tensor_parallel_size": 8,
"swap_space": 16,
"disable_log_stats": "",
"load_format": "dummy",
"max-model-len": 2048,
"max-num-seqs": 200,
"async-scheduling": "",
"dtype": "bfloat16"
},
"client_parameters": {
"model": "deepseek-ai/DeepSeek-R1",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama4_maverick_17b128e_instruct_fp8",
"qps_list": [1, 4, 16, "inf"],
"server_environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"server_parameters": {
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"tensor_parallel_size": 8,
"disable_log_stats": "",
"max-model-len": 2048,
"max-num-seqs": 128,
"async-scheduling": "",
"enable_expert_parallel": "",
"max-num-batched-tokens": 4096
},
"client_parameters": {
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_qwen3_8b",
"qps_list": [1, 4, 10, "inf"],
"server_environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"server_parameters": {
"model": "Qwen/Qwen3-8B",
"tensor_parallel_size": 1,
"dtype": "bfloat16",
"disable_log_stats": "",
"async-scheduling": ""
},
"client_parameters": {
"model": "Qwen/Qwen3-8B",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
}
]
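Each serving test sweeps the request rates in `qps_list`, where `"inf"` means issue all requests with no pacing. A common way to pace a client at a target QPS is to draw Poisson inter-arrival delays; this is a sketch under that assumption (the actual bench client may pace differently):

```python
import random

def inter_arrival_delays(qps, num_requests: int, seed: int = 0) -> list[float]:
    """Delays in seconds between consecutive request launches.

    'inf' (as used in qps_list) means fire all requests immediately.
    Poisson arrivals (exponential gaps with mean 1/qps) are one common
    choice; this is an assumption, not the verbatim client logic.
    """
    if qps == "inf":
        return [0.0] * num_requests
    rng = random.Random(seed)
    return [rng.expovariate(qps) for _ in range(num_requests)]

delays = inter_arrival_delays(4, 200)
print(len(delays), sum(delays) / len(delays))  # mean gap near 1/4 s
```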
62 changes: 62 additions & 0 deletions .buildkite/performance-benchmarks/tests/throughput-tests-hpu.json
@@ -57,5 +57,67 @@
"max-num-seqs": 512,
"async-scheduling": ""
}
},
{
"test_name": "throughput_deepseek_r1",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "deepseek-ai/DeepSeek-R1",
"tensor_parallel_size": 8,
"load_format": "dummy",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"dataset_name": "sharegpt",
"num_prompts": 1000,
"backend": "vllm",
"max-model-len": 2048,
"max-num-seqs": 384,
"async-scheduling": ""
}
},
{
"test_name": "throughput_llama4_maverick_17b128e_instruct_fp8",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"tensor_parallel_size": 8,
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"dataset_name": "sharegpt",
"num_prompts": 1000,
"backend": "vllm",
"max-model-len": 2048,
"max-num-seqs": 512,
"async-scheduling": "",
"enable_expert_parallel": ""
}
},
{
"test_name": "throughput_qwen3_8b",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "Qwen/Qwen3-8B",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"dataset_name": "sharegpt",
"num_prompts": 1000,
"max-num-seqs": 512,
"backend": "vllm",
"async-scheduling": ""
}
}
]
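The throughput tests feed `num_prompts` requests through the engine as fast as it will take them, and the headline number is tokens per wall-clock second. The metric itself is simple; a sketch (the averaged-token figures in the usage example are illustrative, not measured):

```python
def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """Throughput as total (prompt + generated) tokens over wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed_s

# e.g. 1000 prompts averaging 2048 tokens each, processed in 512 s
print(tokens_per_second(1000 * 2048, 512.0))  # → 4000.0
```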
25 changes: 21 additions & 4 deletions .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh
@@ -1,26 +1,43 @@
#!/bin/bash
set -euox pipefail
export VLLM_CPU_CI_ENV=0

echo "--- PP+TP"
vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
timeout 600 bash -c "until curl localhost:8000/v1/models > /dev/null 2>&1; do sleep 1; done" || exit 1
vllm bench serve \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--result-dir ./test_results \
--result-filename tp_pp.json \
--save-result \
--endpoint /v1/completions
kill -s SIGTERM $server_pid &
kill -s SIGTERM $server_pid; wait $server_pid || true
failed_req=$(jq '.failed' ./test_results/tp_pp.json)
if [ "$failed_req" -ne 0 ]; then
echo "Some requests failed!"
exit 1
fi

echo "--- DP+TP"
vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -dp=2 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
timeout 600 bash -c "until curl localhost:8000/v1/models > /dev/null 2>&1; do sleep 1; done" || exit 1
vllm bench serve \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--result-dir ./test_results \
--result-filename dp_pp.json \
--save-result \
--endpoint /v1/completions
kill -s SIGTERM $server_pid &
kill -s SIGTERM $server_pid; wait $server_pid || true
failed_req=$(jq '.failed' ./test_results/dp_pp.json)
if [ "$failed_req" -ne 0 ]; then
echo "Some requests failed!"
exit 1
fi
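The `jq '.failed'` gate above can be mirrored in Python when post-processing saved results, which is handy if the smoke test grows more checks. A minimal sketch, assuming only that the `--save-result` JSON exposes a top-level `failed` counter as the jq query implies:

```python
import json
from pathlib import Path

def assert_no_failed_requests(result_path: str) -> int:
    """Fail the run if any benchmark request errored.

    Reads a `vllm bench serve --save-result` output file; the top-level
    'failed' key is assumed from the jq query in the script above.
    """
    data = json.loads(Path(result_path).read_text())
    failed = int(data.get("failed", 0))
    if failed != 0:
        raise SystemExit(f"{failed} requests failed!")
    return failed
```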