```shell
curl -sSf https://raw.githubusercontent.com/brontoguana/krasis/main/install.sh | bash -s -- prerelease
```

This installs the latest pre-release build. Running install.sh without the prerelease flag always installs the latest stable release.
Krasis has two entry points:
- `krasis` — the installed command (use for production, release testing)
- `./dev` — the development entry point (handles conda, auto-rebuild, GPU cleanup)
Never run Python scripts directly. Always use one of these commands.
BF16 validation policy:
- BF16-heavy configs are validation-only. Use them to prove correctness or isolate quantization from logic bugs.
- Production runs must use the normal Rust serving path with quantized configs.
`gpu_expert_bits = 16` is not a production mode.
| Command | Description |
|---|---|
| `./dev build` | Rebuild Rust extension (`maturin develop --release`) |
| `./dev run <config> [flags]` | Launch server from a test config |
| `./dev benchmark <config>` | Run standard benchmark and exit |
| `./dev release-test <model>` | Run full release test (4 configs, produces markdown report) |
| `./dev test <config>` | Short model test (benchmark + network tests) |
| `./dev test <config> --thorough` | Thorough test (+ stress + large prompts) |
| `./dev network <port> [--large] [--quick]` | Run network tests against a running server |
| `./dev perplexity <config>` | Run perplexity eval (WikiText-2) and exit |
| `./dev awq-calibrate <config>` | Run AWQ attention calibration (produces template) |
| `./dev kill` | Kill all krasis/GPU processes and reset |
Add `--timing` to `run` or `benchmark` for a per-layer decode timing breakdown. This adds ~30-50% overhead, so do not use it for speed benchmarks — only for profiling.
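For example, assuming a test config named `qcn` exists (the config name here is illustrative):

```shell
# Profiling run: per-layer decode timing breakdown (~30-50% overhead)
./dev benchmark qcn --timing

# Speed benchmark: leave --timing off so the numbers are representative
./dev benchmark qcn
```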
The preferred way to run Krasis is with a config file:
```shell
krasis --config path/to/config.conf
./dev run qcn        # resolves to testconfigs/qcn.conf
./dev benchmark qcn  # same
```

Config files use KEY=VALUE format. CLI flags override config file values.
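A minimal config file sketch. The key names below are assumptions — they mirror the CLI flag names in upper snake case; check the files in testconfigs/ for the exact keys your version expects:

```shell
# testconfigs/example.conf — illustrative only; key names are assumed
MODEL_PATH=/models/SomeModel   # HuggingFace model directory
NUM_GPUS=2
PORT=8012
GPU_EXPERT_BITS=4
KV_CACHE_MB=1000
```

Any value can still be overridden at launch, since CLI flags take precedence over the config file.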
| Flag | Default | Description |
|---|---|---|
| `--config PATH` | — | Config file (KEY=VALUE format), CLI flags override |
| `--model-path PATH` | — | HuggingFace model directory (safetensors + config.json) |
| `--num-gpus N` | all | Number of GPUs to use |
| `--selected-gpus IDX` | all | Comma-separated GPU indices (e.g. `0,2`) |
| `--pp-partition STR` | auto | Layer partition across GPUs (e.g. `24,24`) |
| `--host ADDR` | 0.0.0.0 | Server bind address |
| `--port PORT` | 8012 | Server port |
| Flag | Default | Description |
|---|---|---|
| `--gpu-expert-bits` | 4 | GPU Marlin expert bits: 4 or 8 |
| `--cpu-expert-bits` | 4 | CPU decode expert bits: 4 or 8 |
| `--attention-quant` | bf16 | Attention weight precision: bf16 or awq |
| `--shared-expert-quant` | int8 | Shared expert quant: int8 or bf16 |
| `--dense-mlp-quant` | int8 | Dense MLP quant: int8 or bf16 |
| `--lm-head-quant` | int8 | LM head quant: int8 or bf16 |
| `--kv-dtype` | fp8_e4m3 | KV cache dtype: fp8_e4m3 or bf16 |
Legacy `int4`/`int8` values for `--attention-quant` are auto-migrated to `awq`.
When BF16 is selected for experts or major components, treat that run as validation-only rather than production.
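Per the BF16 validation policy, a run that lifts every component these flags allow to full precision can isolate quantization from logic bugs. A sketch (assuming a test config named `qcn`; not a production configuration):

```shell
# Validation-only: BF16 attention, shared experts, dense MLPs,
# LM head, and KV cache
./dev run qcn --attention-quant bf16 --shared-expert-quant bf16 \
  --dense-mlp-quant bf16 --lm-head-quant bf16 --kv-dtype bf16
```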
| Flag | Default | Description |
|---|---|---|
| `--kv-cache-mb N` | 1000 | KV cache size in MB |
| `--hcs` / `--no-hcs` | on | Hot Cache Strategy for expert pinning |
| `--multi-gpu-hcs` | off | Pin HCS experts across all GPUs |
| `--vram-safety-margin N` | 600 | Reserved VRAM in MB below which warnings fire |
| `--stream-attention` | off | Stream attention weights from CPU (for very large models) |
| `--force-load` | — | Override RAM safety checks and load anyway |
| `--force-rebuild-cache` | — | Delete existing expert caches and rebuild from safetensors |
| `--build-cache` | — | Build expert caches (if missing) and exit without starting server |
| `--heatmap-path PATH` | — | Path to expert_heatmap.json for HCS init |
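The cache flags above compose into a two-step workflow — build once, then serve:

```shell
# Step 1: build expert caches and exit without starting the server
krasis --config path/to/config.conf --build-cache

# Step 2: launch normally; the prebuilt caches are reused
krasis --config path/to/config.conf
```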
| Flag | Default | Description |
|---|---|---|
| `--layer-group-size N` | 2 | MoE layers to load per group during prefill |
| `--gpu-prefill-threshold N` | 300 | Minimum tokens to use GPU prefill |
| `--krasis-threads N` | 40 | CPU threads for expert computation |
| `--gguf-path PATH` | — | GGUF file for CPU experts (instead of native cache) |
| Flag | Default | Description |
|---|---|---|
| `--draft-model PATH` | — | Draft model for speculative decoding (e.g. ~/.krasis/models/Qwen3-0.6B) |
| `--draft-k N` | 3 | Tokens to draft per speculative round |
| `--draft-context N` | 512 | Context window for draft model warmup |
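A speculative-decoding launch sketch combining the flags above (the draft model path follows the table's example; the non-default `--draft-k` value is illustrative):

```shell
krasis --config path/to/config.conf \
  --draft-model ~/.krasis/models/Qwen3-0.6B \
  --draft-k 4
```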
| Flag | Default | Description |
|---|---|---|
| `--temperature F` | 0.6 | Sampling temperature |
| `--enable-thinking` / `--no-enable-thinking` | on | Enable thinking/reasoning mode |
| `--session-enabled` / `--no-session-enabled` | off | Enable Session messenger bridge |
| Flag | Default | Description |
|---|---|---|
| `--benchmark` | — | Run benchmark before launching server |
| `--benchmark-only` | — | Run benchmark and exit (no server) |
| `--timing` | — | Enable per-layer decode timing instrumentation |
| `--stress-test` | — | Run stress test (diverse prompts) and exit |
| `--perplexity` | — | Run perplexity evaluation and exit |
| `--note TEXT` | — | Description note written to log file header |
Krasis lets you quantize each component independently. The defaults are a good starting point — increase precision if you need better quality, decrease if you need to fit in less VRAM/RAM.
| Component | Options | Default |
|---|---|---|
| GPU experts | INT4, INT8 | INT4 |
| CPU experts | INT4, INT8 | INT4 |
| Attention | AWQ, BF16 | BF16 |
| Shared expert | INT8, BF16 | INT8 |
| Dense MLP | INT8, BF16 | INT8 |
| LM head | INT8, BF16 | INT8 |
| KV cache | FP8, BF16 | FP8 |
Embeddings, norms, and routing gates are always kept at BF16.
AWQ attention uses calibrated per-tensor quantization (run `./dev awq-calibrate <config>` to generate the template). BF16 is full precision with no calibration needed.
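A typical AWQ workflow, assuming a test config named `qcn`:

```shell
# 1. Calibrate once per config to produce the AWQ template
./dev awq-calibrate qcn

# 2. Serve with AWQ attention
./dev run qcn --attention-quant awq
```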