Commit 22be263

Lower gpu-memory-utilization to 0.80 for 6 GiB GPUs
vLLM defaults to --gpu-memory-utilization 0.9 when the flag is omitted; on a 6 GiB GPU that budgets 0.9 * 6.0 = 5.4 GiB, which exceeds the available free memory (4.94/6.0 GiB). Explicitly set it to 0.80 alongside --max-model-len 8096 so the utilization check passes on startup.
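The arithmetic behind the failing check can be sketched as below. The values are taken from the commit message (6 GiB total, ~4.94 GiB free, rounded to MiB); the helper is illustrative only, not vLLM's actual memory profiler:

```shell
# Sketch of the startup budget check described in the commit message.
# Values are from the message itself, converted to MiB (4.94 GiB ~= 5058 MiB).
total_mib=6144   # 6.0 GiB total GPU memory
free_mib=5058    # ~4.94 GiB free at startup

util_default=90  # the 0.9 default, as a percentage
util_patched=80  # the value set by this commit

# Memory vLLM tries to claim: utilization * total GPU memory.
requested_default=$(( total_mib * util_default / 100 ))  # 5529 MiB
requested_patched=$(( total_mib * util_patched / 100 ))  # 4915 MiB

if [ "$requested_default" -gt "$free_mib" ]; then
  echo "default 0.9 requests ${requested_default} MiB > ${free_mib} MiB free: startup fails"
fi
if [ "$requested_patched" -le "$free_mib" ]; then
  echo "0.80 requests ${requested_patched} MiB <= ${free_mib} MiB free: startup passes"
fi
```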
1 parent a17bbdf commit 22be263

2 files changed: 5 additions & 5 deletions

README.md

Lines changed: 4 additions & 4 deletions
@@ -108,15 +108,15 @@ See [deployments/README.md](deployments/README.md) for the full guide.
 pip install qr-sampler[grpc]
 
 # Start vLLM — qr-sampler registers automatically via entry points
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 Configure the entropy source via environment variables:
 
 ```bash
 export QR_ENTROPY_SOURCE_TYPE=quantum_grpc
 export QR_GRPC_SERVER_ADDRESS=localhost:50051
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 ### System entropy fallback
@@ -125,7 +125,7 @@ Without an external entropy source, qr-sampler falls back to `os.urandom()`. Thi
 
 ```bash
 pip install qr-sampler
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 ### Per-request parameter overrides
@@ -430,7 +430,7 @@ Or configure directly via environment variables (bare-metal):
 ```bash
 export QR_ENTROPY_SOURCE_TYPE=quantum_grpc
 export QR_GRPC_SERVER_ADDRESS=localhost:50051
-vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
 The template handles all gRPC boilerplate (unary + bidirectional streaming, health checks, graceful shutdown). You only write the hardware-specific code.

examples/docker/Dockerfile.vllm

Lines changed: 1 addition & 1 deletion
@@ -50,4 +50,4 @@ ENTRYPOINT []
 
 # Start vLLM. The qr-sampler plugin is auto-discovered via entry points.
 # Shell form so environment variables are resolved at runtime.
-CMD vllm serve ${HF_MODEL} --host 0.0.0.0 --port 8000 --dtype half --max-model-len 8096
+CMD vllm serve ${HF_MODEL} --host 0.0.0.0 --port 8000 --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
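The "shell form" comment in this diff matters because Docker's exec-form CMD bypasses the shell, so variable references are passed through literally. A minimal illustration (not part of the commit, and simplified to drop the other flags):

```dockerfile
# Exec form: no shell runs, so ${HF_MODEL} reaches vllm as the literal
# string "${HF_MODEL}" — the variable is never expanded.
# CMD ["vllm", "serve", "${HF_MODEL}"]

# Shell form: Docker wraps the command in /bin/sh -c, which expands
# ${HF_MODEL} from the container environment at start time.
CMD vllm serve ${HF_MODEL}
```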

0 commit comments
