Commit 22be263

Lower gpu-memory-utilization to 0.80 for 6 GiB GPUs
vLLM defaults to --gpu-memory-utilization 0.9 when the flag is omitted; on a 6 GiB GPU that budgets 0.9 * 6.0 = 5.4 GiB, which exceeds the available free memory (4.94/6.0 GiB). Explicitly set it to 0.80 alongside --max-model-len 8096 so the utilization check passes on startup.
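The arithmetic behind the failing check can be sketched as below. The values are taken from the commit message (6 GiB total, ~4.94 GiB free, rounded to MiB); the helper is illustrative only, not vLLM's actual memory profiler:

```shell
# Sketch of the startup budget check described in the commit message.
# Values are from the message itself, converted to MiB (4.94 GiB ~= 5058 MiB).
total_mib=6144   # 6.0 GiB total GPU memory
free_mib=5058    # ~4.94 GiB free at startup

util_default=90  # the 0.9 default, as a percentage
util_patched=80  # the value set by this commit

# Memory vLLM tries to claim: utilization * total GPU memory.
requested_default=$(( total_mib * util_default / 100 ))  # 5529 MiB
requested_patched=$(( total_mib * util_patched / 100 ))  # 4915 MiB

if [ "$requested_default" -gt "$free_mib" ]; then
  echo "default 0.9 requests ${requested_default} MiB > ${free_mib} MiB free: startup fails"
fi
if [ "$requested_patched" -le "$free_mib" ]; then
  echo "0.80 requests ${requested_patched} MiB <= ${free_mib} MiB free: startup passes"
fi
```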
1 parent a17bbdf commit 22be263

2 files changed: 5 additions & 5 deletions

README.md

Lines changed: 4 additions & 4 deletions
@@ -108,15 +108,15 @@ See [deployments/README.md](deployments/README.md) for the full guide.
 pip install qr-sampler[grpc]
 
 # Start vLLM — qr-sampler registers automatically via entry points
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 Configure the entropy source via environment variables:
 
 ```bash
 export QR_ENTROPY_SOURCE_TYPE=quantum_grpc
 export QR_GRPC_SERVER_ADDRESS=localhost:50051
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 ### System entropy fallback
@@ -125,7 +125,7 @@ Without an external entropy source, qr-sampler falls back to `os.urandom()`. Thi
 
 ```bash
 pip install qr-sampler
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 ### Per-request parameter overrides
@@ -430,7 +430,7 @@ Or configure directly via environment variables (bare-metal):
 ```bash
 export QR_ENTROPY_SOURCE_TYPE=quantum_grpc
 export QR_GRPC_SERVER_ADDRESS=localhost:50051
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 The template handles all gRPC boilerplate (unary + bidirectional streaming, health checks, graceful shutdown). You only write the hardware-specific code.

examples/docker/Dockerfile.vllm

Lines changed: 1 addition & 1 deletion
@@ -50,4 +50,4 @@ ENTRYPOINT []
 
 # Start vLLM. The qr-sampler plugin is auto-discovered via entry points.
 # Shell form so environment variables are resolved at runtime.
-CMD vllm serve ${HF_MODEL} --host 0.0.0.0 --port 8000 --dtype half --max-model-len 8096
+CMD vllm serve ${HF_MODEL} --host 0.0.0.0 --port 8000 --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
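The "shell form" comment in this diff matters because Docker's exec-form CMD bypasses the shell, so variable references are passed through literally. A minimal illustration (not part of the commit, and simplified to drop the other flags):

```dockerfile
# Exec form: no shell runs, so ${HF_MODEL} reaches vllm as the literal
# string "${HF_MODEL}" — the variable is never expanded.
# CMD ["vllm", "serve", "${HF_MODEL}"]

# Shell form: Docker wraps the command in /bin/sh -c, which expands
# ${HF_MODEL} from the container environment at start time.
CMD vllm serve ${HF_MODEL}
```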

0 commit comments
