
Commit beb1bbc

Merge pull request #2 from ereid7/fix/apple-silicon-support
Add Apple Silicon support via vllm-metal
2 parents bcbcd46 + c4053fb commit beb1bbc

3 files changed

Lines changed: 81 additions & 8 deletions

File tree

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -52,3 +52,4 @@ node_modules/
 
 # Setuptools SCM
 src/qr_sampler/_version.py
+.webui_secret_key
```

README.md

Lines changed: 78 additions & 3 deletions
````diff
@@ -119,6 +119,68 @@ export QR_GRPC_SERVER_ADDRESS=localhost:50051
 vllm serve Qwen/Qwen2.5-1.5B-Instruct --dtype half --max-model-len 8096 --gpu-memory-utilization 0.80
 ```
 
+### Apple Silicon (macOS)
+
+qr-sampler works on Apple Silicon via [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained vLLM plugin under the official `vllm-project` GitHub org. It uses MLX under the hood but exposes the same vLLM API and plugin system — same entry points, same endpoints, same `curl` commands.
+
+vllm-metal works with MLX-format models from the [mlx-community](https://huggingface.co/mlx-community) collection on Hugging Face. These are pre-converted and quantized for Apple Silicon — pick one that fits your available memory.
+
+> **Prerequisite:** vllm-metal currently does not load custom logits processors registered via entry points — it creates an empty `LogitsProcessors()` instead of calling `build_logitsprocs()`. [PR #124](https://github.com/vllm-project/vllm-metal/pull/124) fixes this with a 9-line patch that mirrors `GPUModelRunner`'s pattern. Until it is merged, you will need to apply the patch manually or install from the PR branch. Without it, qr-sampler's plugin will be silently skipped.
+
+#### 1. Install vllm-metal
+
+```bash
+curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
+```
+
+This creates a virtual environment at `~/.venv-vllm-metal` with vLLM and all dependencies. Requires Python 3.12+.
+
+#### 2. Install qr-sampler
+
+```bash
+source ~/.venv-vllm-metal/bin/activate
+pip install qr-sampler
+```
+
+#### 3. Start the server
+
+```bash
+source ~/.venv-vllm-metal/bin/activate
+vllm serve mlx-community/Qwen3-0.6B-4bit
+```
+
+qr-sampler registers automatically via the same `vllm.logits_processors` entry point — no additional configuration needed. Look for this line in the server logs to confirm the plugin is active:
+
+```
+QRSamplerLogitsProcessor initialized: vocab_size=..., entropy_source=system+system, amplifier=zscore_mean, temperature=fixed
+```
+
+#### 4. Send a request
+
+```bash
+# Completions
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "mlx-community/Qwen3-0.6B-4bit",
+    "prompt": "The nature of consciousness is",
+    "max_tokens": 100
+  }'
+
+# Chat completions
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "mlx-community/Qwen3-0.6B-4bit",
+    "messages": [{"role": "user", "content": "Tell me about quantum randomness"}],
+    "max_tokens": 100
+  }'
+```
+
+All configuration (entropy sources, temperature strategies, per-request overrides) works identically to the NVIDIA setup. The only difference is how vLLM itself is installed.
+
+> **Note:** The Docker deployment profiles are not compatible with Apple Silicon. Docker on macOS runs a Linux VM with no Metal GPU passthrough, so vllm-metal must run natively. To use Open WebUI on Apple Silicon, see the [Web UI](#web-ui) section.
+
 ### System entropy fallback
 
 Without an external entropy source, qr-sampler falls back to `os.urandom()`. This is useful for development and testing but does not provide the quantum randomness needed for consciousness-research experiments. To use system entropy, set `QR_ENTROPY_SOURCE_TYPE=system` (this is the default).
````
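For a concrete picture of what the system-entropy fallback does, here is a minimal sketch (illustrative only; these function names are not qr-sampler's actual API): bytes from `os.urandom()` are turned into uniform floats, and one such draw selects a token by inverse-CDF sampling.

```python
import os

import numpy as np


def system_entropy_uniform(n: int) -> np.ndarray:
    """Derive n uniform floats in [0, 1) from os.urandom()."""
    # 8 bytes per sample; keep the top 53 bits so each value is exact in float64.
    raw = np.frombuffer(os.urandom(8 * n), dtype=np.uint64)
    return (raw >> np.uint64(11)) * (1.0 / (1 << 53))


def sample_token(probs: np.ndarray) -> int:
    """Pick a token index by inverse-CDF sampling with one entropy draw."""
    cdf = np.cumsum(probs)
    u = system_entropy_uniform(1)[0] * cdf[-1]  # scaling guards against float drift
    return int(min(np.searchsorted(cdf, u, side="right"), len(probs) - 1))
```

With `QR_ENTROPY_SOURCE_TYPE=system`, randomness of this kind drives token selection; a quantum source would supply hardware-derived bytes in place of `os.urandom()`.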
````diff
@@ -149,16 +211,29 @@ curl http://localhost:8000/v1/completions
 
 qr-sampler works with [Open WebUI](https://github.com/open-webui/open-webui), a
 self-hosted ChatGPT-style interface that connects to vLLM's OpenAI-compatible
-API. Every deployment profile includes it as an optional service — add
+API.
+
+**NVIDIA / Linux:** Every deployment profile includes Open WebUI as an optional service — add
 `--profile ui` to start it alongside vLLM:
 
 ```bash
 cd deployments/urandom
 docker compose --profile ui up --build
 ```
 
-Then open http://localhost:3000 to start chatting. Without `--profile ui`, Open
-WebUI does not start and nothing changes.
+**Apple Silicon:** The deployment profiles use NVIDIA GPU images, but Open WebUI itself is just a web app. Run it standalone in Docker and point it at your vllm-metal server:
+
+```bash
+docker run -d -p 3000:8080 \
+  --add-host=host.docker.internal:host-gateway \
+  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
+  -e OPENAI_API_KEY=not-needed \
+  -e WEBUI_AUTH=false \
+  --name open-webui \
+  ghcr.io/open-webui/open-webui:main
+```
+
+Then open http://localhost:3000 to start chatting.
 
 ### Controlling qr-sampler from the UI
 
````
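If you prefer `docker compose` over a raw `docker run`, the standalone Open WebUI container for Apple Silicon translates to a compose service like this (a sketch assuming the same image, ports, and environment as the `docker run` command above; this is not one of the repo's deployment profiles):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      OPENAI_API_BASE_URL: http://host.docker.internal:8000/v1
      OPENAI_API_KEY: not-needed
      WEBUI_AUTH: "false"
```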

src/qr_sampler/processor.py

Lines changed: 2 additions & 5 deletions
```diff
@@ -497,12 +497,9 @@ def _to_numpy(tensor: Any) -> np.ndarray:
     """
     if isinstance(tensor, np.ndarray):
         return tensor
-    # torch.Tensor — use .numpy() for zero-copy on CPU.
+    # .cpu() moves GPU tensors (CUDA/MPS) to host memory; no-op on CPU.
     try:
-        if tensor.is_cuda:
-            result: np.ndarray = tensor.detach().cpu().numpy()
-        else:
-            result = tensor.detach().numpy()
+        result: np.ndarray = tensor.detach().cpu().numpy()
        return result
     except AttributeError:
         return np.asarray(tensor)
```
