4 changes: 2 additions & 2 deletions Qwen/Qwen3-VL.md
@@ -12,7 +12,7 @@ Available in Dense and MoE architectures that scale from edge to cloud, with Ins
uv venv
source .venv/bin/activate

-# Install vLLM >=0.11.0
+# Install vLLM >=0.11.1
uv pip install -U vllm

# Install Qwen-VL utility library (recommended for offline inference)
@@ -113,7 +113,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
- If your workload involves mostly **unique** multimodal inputs, pass `--mm-processor-cache-gb 0` to avoid caching overhead. Otherwise, `--mm-processor-cache-type shm` enables an experimental feature that uses host shared memory to cache preprocessed input images and/or videos, which performs better at high TP settings.
- vLLM supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out [Expert Parallelism Deployment](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment.html) for more details.
- You can use [benchmark_moe](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py) to perform MoE Triton kernel tuning for your hardware.
-- You can further extend the model's context window with `YaRN` by passing `--rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}' --max-model-len 1000000`
+- You can further extend the model's context window with `YaRN` by passing `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}}' --max-model-len 1000000`
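Since the `--hf-overrides` value is parsed as JSON, a mis-quoted payload fails at server startup. A minimal sketch (not part of the official docs) that validates the payload from the flag above before launching `vllm serve`:

```python
import json

# The exact JSON string passed to --hf-overrides in the updated flag above.
overrides = (
    '{"rope_parameters": {"rope_type":"yarn","factor":3.0,'
    '"original_max_position_embeddings": 262144,'
    '"mrope_section":[24,20,20],"mrope_interleaved": true}}'
)

# json.loads raises json.JSONDecodeError on malformed quoting or braces,
# so this catches shell-escaping mistakes before the server ever starts.
cfg = json.loads(overrides)
print(cfg["rope_parameters"]["rope_type"], cfg["rope_parameters"]["factor"])
```

Note the outer single quotes in the shell flag: the JSON itself must use double quotes throughout, and the whole `rope_parameters` object is nested one level deeper than the old `--rope-scaling` form.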


### Benchmark on VisionArena-Chat Dataset