# tokensmith: Stop Paying the Token Tax
tokensmith is a Rust CLI that detects your hardware, recommends a local model/runtime configuration, manages model downloads, starts local serving, and exposes OpenAI-compatible APIs. It focuses on safe defaults, OOM avoidance, and operator controls for monitoring and safe stop.
Running local LLMs is usually manual and brittle: model choice is unclear, memory limits are easy to exceed, and runtime operations are fragmented. tokensmith orchestrates this workflow:

- Hardware profiling (`doctor`)
- Explainable model/config recommendation (`recommend`)
- Model artifact management (`pull`)
- Runtime/server lifecycle (`up`, `stop`, `ps`, `logs`)
- OpenAI-compatible serving (`/v1/chat/completions`, `/v1/completions`, SSE streaming)
- Resource monitoring with warning thresholds (`status`, `monitor`)
## Quickstart

```sh
tokensmith doctor
tokensmith recommend --task code --mode balanced
tokensmith pull <id>
tokensmith up --task code --mode balanced --detach
tokensmith status
tokensmith monitor --watch
tokensmith stop
```

## CLI reference

```sh
tokensmith doctor
tokensmith recommend --task code|chat [--mode fast|balanced|quality]
tokensmith pull <model_id>
tokensmith up --task code|chat [--mode fast|balanced|quality] [--ctx 4096] [--port 8000] [--host 127.0.0.1] [--detach]
tokensmith status
tokensmith monitor [--interval 1s] [--watch] [--json] [--warn-mem 80%] [--warn-cpu 300%]
tokensmith throttle --mode fast|balanced|quality
tokensmith stop [--force-after 5s]
tokensmith kill
tokensmith ps
tokensmith logs [--follow]
```
## Monitoring

Use `tokensmith monitor` to inspect:

- RSS memory (MB)
- CPU %
- thread count
- uptime
- system total/free memory when available
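The `--warn-mem`/`--warn-cpu` thresholds amount to simple comparisons against these metrics. The sketch below is purely illustrative (a hypothetical Python helper, not tokensmith's actual Rust code; the function name and signature are invented):

```python
def threshold_warnings(rss_mb: float, total_mb: float, cpu_pct: float,
                       warn_mem_pct: float = 80.0,
                       warn_cpu_pct: float = 300.0) -> list[str]:
    """Return human-readable warnings when usage crosses the configured limits."""
    warnings = []
    mem_pct = 100.0 * rss_mb / total_mb
    if mem_pct >= warn_mem_pct:
        warnings.append(f"memory {mem_pct:.0f}% >= {warn_mem_pct:.0f}%")
    # CPU% is summed across cores, so it can exceed 100; hence defaults like 300%.
    if cpu_pct >= warn_cpu_pct:
        warnings.append(f"cpu {cpu_pct:.0f}% >= {warn_cpu_pct:.0f}%")
    return warnings
```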
Threshold warnings:

```sh
tokensmith monitor --watch --warn-mem 80% --warn-cpu 300%
```

When warnings trigger, use:

```sh
tokensmith throttle --mode fast
tokensmith stop
```

Safe stop behavior:
- Send SIGTERM
- Wait `--force-after` (default `5s`)
- Escalate to SIGKILL if still alive
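The SIGTERM-then-SIGKILL escalation above can be sketched in a few lines. This is an illustrative stand-in written in Python (not tokensmith's Rust internals), with `force_after` playing the role of `--force-after`:

```python
import signal
import subprocess
import sys
import time

def stop_gracefully(proc: subprocess.Popen, force_after: float = 5.0) -> None:
    """Send SIGTERM, wait up to force_after seconds, then escalate to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=force_after)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the process ignored or outlived SIGTERM
        proc.wait()

# Demo: a child that ignores SIGTERM, forcing the SIGKILL escalation.
stubborn = subprocess.Popen([
    sys.executable, "-c",
    "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)",
])
time.sleep(0.5)  # give the child time to install its SIGTERM handler
stop_gracefully(stubborn, force_after=1.0)
```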
## Use with OpenAI SDKs

```sh
export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=local
```

Then use your normal OpenAI SDK pointing at `OPENAI_BASE_URL`.
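With those variables set, any OpenAI-compatible client works. Here is a minimal stdlib-only Python sketch (no SDK) against the `/v1/chat/completions` endpoint; the payload builder is the portable part, while the HTTP call assumes a server started by `tokensmith up` is listening, and the model id is just an example:

```python
import json
import os
import urllib.request

def build_chat_request(model: str, messages: list[dict]) -> bytes:
    # Standard OpenAI chat-completions payload; works against any
    # OpenAI-compatible endpoint, including tokensmith's.
    return json.dumps({"model": model, "messages": messages}).encode("utf-8")

def chat(prompt: str, model: str = "qwen2.5-3b-instruct") -> str:
    base = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8000/v1")
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=build_chat_request(model, [{"role": "user", "content": prompt}]),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'local')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```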
## VS Code integration (Continue)

Use a dedicated VS Code profile/workspace so this does not affect your main cloud OpenAI setup.
- Start tokensmith:

```sh
tokensmith up --task code --mode fast --ctx 4096 --port 8000 --host 127.0.0.1 --detach
```

- Point Continue at tokensmith (OpenAI-compatible endpoint):
  - `apiBase`: `http://127.0.0.1:8000/v1`
  - `apiKey`: `local`
  - `model`: your loaded local model id (for example `qwen2.5-3b-instruct`)
Example Continue config snippet:
```yaml
models:
  - title: Local Qwen (tokensmith)
    provider: openai
    model: qwen2.5-3b-instruct
    apiBase: http://127.0.0.1:8000/v1
    apiKey: local
```

- Validate quickly:

```sh
tokensmith logs --calls --follow
```

You should see `model_call proxied` / `model_call proxied_stream` lines during Continue requests.
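Streamed responses use standard OpenAI-style SSE: the body is a series of `data: {...}` lines terminated by `data: [DONE]`. A minimal client-side parser sketch (illustrative only; assumes the standard chunk shape with `choices[0].delta.content`):

```python
import json
from typing import Iterable, Iterator

def sse_text_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style chat-completions SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```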
## Troubleshooting

- `502 ... exceed_context_size_error`: Request context (chat history + attached files + instructions) is larger than runtime `n_ctx`. Fix by lowering context in Continue, starting a new chat, or increasing startup context with `--ctx`.
- `502 ... 503 Loading model`: Runtime is still loading. Wait and retry.
- `502 ... error sending request for url (http://127.0.0.1:8001...)`: `llama-server` backend is not reachable. Check startup logs and runtime process.
- `llama-server exited early ... model is corrupted or incomplete`: Re-pull the model artifact.
## Platform notes

- macOS Metal path (MVP) is designed for `llama.cpp` (`llama-server`) first.
- Binary search order: `~/.tokensmith/bin/llama-server`, then `PATH`.
- If `llama-server` is missing, `doctor` provides guidance and the adapter can still respond with placeholder behavior for local testing.
## Roadmap

- CUDA backend support
- Better Windows process/metrics parity
- LAN sharing mode
- Larger curated model registry
- Multi-process management (`tokensmith ps` for more than one active server)
## Development

```sh
cargo fmt
cargo test
cargo run -- recommend --task code --mode balanced
```