Lower gpu-memory-utilization to 0.80 for 6 GiB GPUs
vLLM defaults to 0.9 when the flag is omitted, which still exceeds
available free memory (4.94/6.0 GiB). Explicitly set to 0.80 alongside
--max-model-len 8096 so the utilization check passes on startup.
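For reference, a launch command reflecting these flags might look like the following sketch; the model path is a placeholder, not taken from this commit.

```shell
# Hypothetical invocation; replace <model> with the actual model path.
# 0.80 * 6.0 GiB ≈ 4.8 GiB, which fits within the 4.94 GiB reported free,
# whereas the 0.9 default would request ≈ 5.4 GiB and fail the startup check.
vllm serve <model> \
  --gpu-memory-utilization 0.80 \
  --max-model-len 8096
```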
The template handles all gRPC boilerplate (unary + bidirectional streaming, health checks, graceful shutdown). You only write the hardware-specific code.