4 changes: 2 additions & 2 deletions Qwen/Qwen3-VL.md
@@ -12,7 +12,7 @@ Available in Dense and MoE architectures that scale from edge to cloud, with Ins
uv venv
source .venv/bin/activate

-# Install vLLM >=0.11.0
+# Install vLLM >=0.11.1
uv pip install -U vllm

# Install Qwen-VL utility library (recommended for offline inference)
@@ -113,7 +113,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
- If your workload involves mostly **unique** multimodal inputs, pass `--mm-processor-cache-gb 0` to avoid caching overhead. Otherwise, `--mm-processor-cache-type shm` enables an experimental feature that uses host shared memory to cache preprocessed input images and/or videos, which performs better at high TP settings.
- vLLM supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out [Expert Parallelism Deployment](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment.html) for more details.
- You can use [benchmark_moe](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py) to perform MoE Triton kernel tuning for your hardware.
-- You can further extend the model's context window with `YaRN` by passing `--rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}' --max-model-len 1000000`
+- You can further extend the model's context window with `YaRN` by passing `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}}' --max-model-len 1000000`
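Since the `--hf-overrides` value is parsed as JSON, a mis-quoted payload fails at server startup. A minimal sketch (not part of the official docs) that validates the payload from the flag above before launching `vllm serve`:

```python
import json

# The exact JSON string passed to --hf-overrides in the updated flag above.
overrides = (
    '{"rope_parameters": {"rope_type":"yarn","factor":3.0,'
    '"original_max_position_embeddings": 262144,'
    '"mrope_section":[24,20,20],"mrope_interleaved": true}}'
)

# json.loads raises json.JSONDecodeError on malformed quoting or braces,
# so this catches shell-escaping mistakes before the server ever starts.
cfg = json.loads(overrides)
print(cfg["rope_parameters"]["rope_type"], cfg["rope_parameters"]["factor"])
```

Note the outer single quotes in the shell flag: the JSON itself must use double quotes throughout, and the whole `rope_parameters` object is nested one level deeper than the old `--rope-scaling` form.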


### Benchmark on VisionArena-Chat Dataset