diff --git a/Qwen/Qwen3-VL.md b/Qwen/Qwen3-VL.md
index fcf703dd..a08918d3 100644
--- a/Qwen/Qwen3-VL.md
+++ b/Qwen/Qwen3-VL.md
@@ -12,7 +12,7 @@ Available in Dense and MoE architectures that scale from edge to cloud, with Ins
 uv venv
 source .venv/bin/activate

-# Install vLLM >=0.11.0
+# Install vLLM >=0.11.1
 uv pip install -U vllm

 # Install Qwen-VL utility library (recommended for offline inference)
@@ -113,7 +113,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
 - If your workload involves mostly **unique** multimodal inputs only, it is recommended to pass `--mm-processor-cache-gb 0` to avoid caching overhead. Otherwise, specifying `--mm-processor-cache-type shm` enables this experimental feature which utilizes host shared memory to cache preprocessed input images and/or videos which shows better performance at a high TP setting.
 - vLLM supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out [Expert Parallelism Deployment](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment.html) for more details.
 - You can use [benchmark_moe](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py) to perform MoE Triton kernel tuning for your hardware.
-- You can further extend the model's context window with `YaRN` by passing `--rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}' --max-model-len 1000000`
+- You can further extend the model's context window with `YaRN` by passing `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}}' --max-model-len 1000000`

 ### Benchmark on VisionArena-Chat Dataset
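
For reference, the updated YaRN override introduced in the second hunk is passed to `vllm serve` roughly as in the sketch below. The command is assembled from the lines touched by this diff (the model name comes from the hunk header and the override JSON from the added line); the `--tensor-parallel-size 8` value is an assumed example deployment setting, not something specified by this change.

```bash
# Sketch: extend the context window with YaRN via --hf-overrides.
# Model name and override JSON are taken from the diff above;
# --tensor-parallel-size 8 is an assumed example value.
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --hf-overrides '{"rope_parameters": {"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}}' \
  --max-model-len 1000000
```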