- Dev environment: see
DEV.mdfor full setup details - Conda env:
/home/main/miniconda3/envs/krasis/(Python 3.11) - Rust toolchain (maturin builds the PyO3 extension)
- Models stored in
~/.krasis/models/<ModelName>/(safetensors + config.json) - 1x NVIDIA RTX 5090 (32 GB) + 1x NVIDIA RTX 2000 Ada (16 GB), AMD EPYC 7742 (AVX2, 64 cores, 995 GB RAM)
Build the Rust extension:
./dev build
python -m krasis.launcher
Arrow-key driven menu to select model, configure quantization, GPU count, and launch
the server. Saves config to testconfigs/<model>.conf for reuse.
python -m krasis.launcher --non-interactive
Reads the last saved config and launches without the TUI.
python -m krasis.server \
--model-path ~/.krasis/models/Qwen3-Coder-Next \
--pp-partition 48 \
--gpu-expert-bits 4 \
--cpu-expert-bits 4 \
--kv-dtype fp8_e4m3 \
--host 0.0.0.0 --port 8012
Serves an OpenAI-compatible API at /v1/chat/completions (streaming and blocking).
./run_benchmark.sh --model-path ~/.krasis/models/Qwen3-Coder-Next --pp-partition 48
Or directly:
python -m krasis.server --benchmark --model-path ~/.krasis/models/Qwen3-Coder-Next
Runs standardized prefill (20K/35K/50K tokens) and decode (64 tokens x 3 runs)
benchmarks, then exits. Output is logged and archived to benchmarks/.
python -m krasis.suite # default config
python -m krasis.suite --config benchmarks/qcn_1gpu_int4.toml
Reads a TOML file defining model x config combinations, runs each as a subprocess,
and writes a summary markdown table to benchmarks/suite_logs/.
Suite config format (see benchmarks/benchmark_suite.toml):
[[config]]
num_gpus = 1
gpu_expert_bits = 4
cpu_expert_bits = 4
[[model]]
name = "Qwen3-Coder-Next"Saved configs live in testconfigs/. Do not modify these files -- they are
reference configs for reproducible runs.
qcn-4-4.conf-- Qwen3-Coder-Next, 1 GPU, INT4/INT4v2lite-4-4.conf-- DeepSeek-V2-Lite, 1 GPU, INT4/INT4
All tests are in tests/. Run individual tests with:
python tests/<test_file>.py
These load a real model and verify output is coherent:
| File | What it tests |
|---|---|
test_v2lite_sanity.py |
V2-Lite basic generation (MLA model) |
test_v2lite_thorough.py |
V2-Lite extended generation checks |
test_v2lite_dual_format.py |
Dual GPU/CPU format correctness |
test_v2lite_gpu_prefill.py |
V2-Lite GPU prefill at various lengths |
test_v2lite_10k.py |
V2-Lite with 10K token prompt |
test_qwen3_next_generate.py |
QCN hybrid model generation (linear attn + GQA) |
test_kimi_k25.py |
Kimi K2.5 generation tests |
test_pp2_v2lite.py |
V2-Lite with pipeline parallelism (2 GPUs) |
test_pp2_qcn.py |
QCN with pipeline parallelism |
test_pp2_qwen235b.py |
Qwen3-235B with pipeline parallelism |
| File | What it tests |
|---|---|
test_pyo3.py |
Rust-Python binding works |
test_bridge.py |
Rust MoE engine bridge |
test_rust_decode.py |
Full Rust decode loop |
test_rust_vs_python.py |
Rust vs Python decode output match |
test_moe_sanity.py |
MoE forward pass correctness |
test_gpu_prefill.py |
GPU prefill kernel correctness |
test_fp8_kv.py |
FP8 KV cache precision |
test_quant_config.py |
Quantization config parsing |
test_la_graph.py |
Linear attention CUDA graph |
test_la_inplace.py |
Linear attention in-place ops |
test_parallel_prefill.py |
Multi-GPU parallel prefill |
test_gpu_decode.py |
GPU decode path (M=1 Marlin) |
test_attn_verify.py |
Attention output verification |
test_linear_attn_compare.py |
Linear attention vs reference |
test_gqa_compare.py |
GQA attention vs reference |
| File | What it measures |
|---|---|
bench_engine_isolated.py |
Raw Rust MoE throughput (no model load) |
decode_harness.py |
Synthetic GPU decode compute harness via ./dev decode-harness |
bench_8k_decode.py |
Decode with 8K context |
bench_combined.py |
Combined prefill + decode timing |
bench_model.py |
Full model benchmark |
bench_prefill_only.py |
Prefill-only throughput |
bench_prefill_timed.py |
Prefill with wall-clock timing |
bench_prefill_verify.py |
Prefill correctness + speed |
bench_prefill_10k.py |
Prefill at 10K tokens |
bench_prefill_order.py |
Prefill ordering effects |
bench_hot_cached_static.py |
HCS expert cache hit rates |
token_scaling_bench.py |
Decode scaling vs token count |
| File | What it tests |
|---|---|
test_network.py |
HTTP API validation against running server |
Requires a running server. Usage:
python tests/test_network.py --port 8012
python tests/test_network.py --port 8012 --large # include large-prompt tests
python tests/test_network.py --port 8012 --quick # known-answer only
| File | Purpose |
|---|---|
profile_decode.py |
Full decode profiling with instrumentation |
profile_decode_nopin.py |
Decode profiling without expert pinning |
test_decode_timing.py |
Decode per-component timing breakdown |
test_pure_cpu_timing.py |
Pure CPU decode timing |
test_qwen3_next_decode_timing.py |
QCN decode timing |
benchmarks/ contains:
bench_decode.py-- Server decode benchmarkbench.py-- Old MoE throughput benchmarkkt_benchmark.py-- KTransformers comparison benchmarkBENCHMARKS.md-- Summary table of all benchmark runs with links to full logs*.log-- Full benchmark output logssuite_logs/-- Benchmark suite output*.toml-- Suite config files- Prompt files bundled in package (
python/krasis/prompts/):prefill_prompt_{1-6},decode_prompt_{1-6}- Files 1-3 used for warmup, files 4-6 used for timed runs
| File | Purpose | Requires sudo |
|---|---|---|
run_benchmark.sh |
Run benchmark and exit cleanly | No |
gpu_cleanup.sh |
Kill zombie GPU processes, reclaim VRAM | Yes |
gpu_reset.sh |
Full NVIDIA driver reload (stops/restarts Xorg) | Yes |
setup_pcie.sh |
GPU persistence mode, disable ASPM, max power | Yes |
fix-oomd.sh |
Raise systemd-oomd kill threshold to 95% | Yes |
dump_trace.sh |
Dump Python + native stack trace of running server | No |
In the root scripts/ directory:
| File | Purpose |
|---|---|
analyze_expert_rank.py |
Analyze expert weight distributions |
analyze_heatmap.py |
Analyze expert activation heatmaps |
generate_heatmap.py |
Generate heatmap JSON from model runs |
run_heatmap_prompts.py |
Run prompts to collect heatmap data |
clear_page_cache.py |
Evict OS page cache (for cold benchmarks) |
krasis_monitor.py |
Live monitoring of running Krasis instance |
| Variable | Default | Purpose |
|---|---|---|
KRASIS_DECODE_TIMING |
0 |
Enable per-component decode timing |
KRASIS_PREFILL_TIMING |
0 |
Enable prefill timing |
KRASIS_CPU_DECODE_TIMING |
0 |
Enable CPU decode step timing |
KRASIS_TIMING_INTERVAL |
20 |
Steps between timing reports |
KRASIS_BENCH_RUNS |
3 |
Number of decode runs in benchmark |
KRASIS_DEBUG_DECODE |
0 |
Debug logging for decode path |
KRASIS_DEBUG_SYNC |
0 |
Synchronous CUDA error checking |
KRASIS_DIAG |
0 |
Diagnostic logging (per-MoE-layer) |
KRASIS_NO_PIPELINE |
0 |
Disable pipeline parallelism |
KRASIS_FUSED_LINEAR_ATTN |
1 |
Use fused linear attention kernel |
KRASIS_HOME |
~/.krasis |
Model and cache storage root |
KRASIS_LAYER_TIMING |
0 |
Per-layer timing in prefill |
KRASIS_PREFILL_THRESHOLD |
500 |
Token count above which GPU prefill is used |
python tests/test_v2lite_sanity.py
python -m krasis.suite --config benchmarks/qcn_1gpu_int4.toml
# Terminal 1:
python -m krasis.launcher
# Terminal 2 (after server is up):
python tests/test_network.py --port 8012
sudo ./gpu_cleanup.sh # try soft cleanup first
sudo ./gpu_reset.sh # if cleanup doesn't work, full driver reload