Osmium

High-performance MoE inference engine for consumer GPUs. Osmium is a fork of Krasis that adds WriteCombined DMA staging, which raises decode throughput by about 49% on PCIe Gen5.

Runs Qwen3.5-122B-A10B at 57 tok/s decode on a single RTX 5090 (vs 38 tok/s upstream).

Benchmarks

RTX 5090 32GB, PCIe Gen5 x16

Qwen3.5-122B-A10B (122B params, 234 GB BF16 / 56 GB INT4). Single GPU, AWQ attention, FP8 KV cache, WriteCombined DMA.

Headline: 38.7 → 57.5 tok/s (+49%) at 42K context vs Krasis baseline.

Decode throughput across context depths

Single 5090, INT4 experts, AWQ attention, FP8 KV cache, --wc-alloc:

KV Cache	Max Context	HCS Coverage	Decode
200 MB	17K tokens	40.2%	53.1 tok/s
500 MB	42K tokens	39.6%	57.5 tok/s
1000 MB	85K tokens	38.7%	54.8 tok/s
1500 MB	128K tokens	37.7%	50.3 tok/s
2000 MB	170K tokens	36.6%	53.1 tok/s
3400 MB	290K tokens	34.1%	56.8 tok/s

Decode stays between 50 and 57 tok/s from 17K to 290K tokens of context. The full 122B model stays in system RAM: only the top-K experts per layer are DMA'd into VRAM for each token, and the WriteCombined buffers feed the PCIe link at close to line rate.
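The KV cache sizes in the table above scale almost linearly with maximum context. The following is a minimal sketch, not part of Osmium, that fits a tokens-per-MB slope to the table rows and uses it to estimate a `--kv-cache-mb` value for a target context. It assumes "K" means 1,000 tokens and that the relationship stays linear between the measured points; the function names are hypothetical.

```python
# (KV cache MB, max context tokens) pairs copied from the benchmark table.
# Assumption: "17K" means 17,000 tokens (it could also mean 17 * 1024).
TABLE = [
    (200, 17_000),
    (500, 42_000),
    (1000, 85_000),
    (1500, 128_000),
    (2000, 170_000),
    (3400, 290_000),
]


def tokens_per_mb(rows):
    """Least-squares slope through the origin: tokens ~= slope * MB."""
    num = sum(mb * tok for mb, tok in rows)
    den = sum(mb * mb for mb, _ in rows)
    return num / den


def kv_cache_mb_for(context_tokens, rows=TABLE):
    """Estimate the KV cache size (MB) needed for a target context length."""
    return context_tokens / tokens_per_mb(rows)


if __name__ == "__main__":
    slope = tokens_per_mb(TABLE)
    print(f"{slope:.1f} tokens per MB of KV cache")
    print(f"{kv_cache_mb_for(100_000):.0f} MB estimated for a 100K-token context")
```

The fitted slope comes out at roughly 85 tokens per MB, which matches every row of the table to within rounding. Treat the estimate as a starting point and confirm the actual maximum context in the server's startup output.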

What makes this fast

WriteCombined DMA staging (--wc-alloc): Expert weights are allocated with cuMemHostAlloc using the write-combined flag (CU_MEMHOSTALLOC_WRITECOMBINED), so they bypass the CPU cache hierarchy. PCIe Gen5 DMA reads reach ~46 GB/s, versus ~28 GB/s for regular pinned memory. A per-component layout with incremental heap freeing keeps peak RAM manageable.
HCS (Hot Cache Strategy): Frequently accessed MoE experts are cached in VRAM, reducing PCIe DMA traffic on hot paths.
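The DMA bandwidth improves by about 1.64x (28 to 46 GB/s) while decode improves by about 1.49x (38.7 to 57.5 tok/s). One way to reconcile the two numbers is Amdahl's law: only part of each token's time is spent waiting on expert DMA. The sketch below solves for that fraction. It is a back-of-the-envelope model under the assumption that everything other than DMA takes the same time in both builds; it is not a measurement from Osmium, and the function names are hypothetical.

```python
def dma_bound_fraction(baseline_tps, new_tps, old_bw, new_bw):
    """Solve Amdahl's law for the fraction f of baseline time spent on DMA.

    1 / overall = (1 - f) + f / local
    => f = (1 - 1 / overall) / (1 - 1 / local)
    """
    overall = new_tps / baseline_tps  # observed end-to-end speedup
    local = new_bw / old_bw           # speedup of the DMA portion alone
    return (1 - 1 / overall) / (1 - 1 / local)


def predicted_tps(baseline_tps, f, old_bw, new_bw):
    """Predict decode throughput if a fraction f of time scales with bandwidth."""
    return baseline_tps / ((1 - f) + f * old_bw / new_bw)


if __name__ == "__main__":
    # Figures quoted in this README: tok/s at 42K context, GB/s for DMA reads.
    f = dma_bound_fraction(38.7, 57.5, old_bw=28.0, new_bw=46.0)
    print(f"Implied DMA-bound fraction of baseline decode time: {f:.0%}")
    print(f"Round-trip check: {predicted_tps(38.7, f, 28.0, 46.0):.1f} tok/s")
```

Under this simple model, a little over four fifths of the baseline per-token time would be DMA-bound, which is consistent with decode being dominated by expert transfers over PCIe.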

Everything else (prefill, attention, quantization, KV cache) comes directly from upstream Krasis.

Requirements

Linux (Fedora 43, Ubuntu 24.04+, or WSL2)
NVIDIA GPU with CUDA 12.8+ drivers
Python 3.13 (3.10+ should work)
System RAM: ~74 GB with --wc-alloc for 122B models (91 GB physical + 64 GB NVMe swap recommended)
NVMe swap: Required for WC mode. To create and enable a 64 GB swap file:

sudo fallocate -l 64G /swapfile_osmium
sudo chmod 600 /swapfile_osmium
sudo mkswap /swapfile_osmium
sudo swapon /swapfile_osmium

Quick Start

git clone https://github.com/3spky5u-oss/osmium.git
cd osmium
git checkout osmium-v0.3.0

# Create venv and install dependencies
python3.13 -m venv .venv
.venv/bin/pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
.venv/bin/pip install "sglang[all]==0.5.9" flashinfer-python==0.6.3

# Build (requires Rust toolchain + CUDA toolkit)
CUDA_HOME=/usr/local/cuda .venv/bin/pip install maturin
CUDA_HOME=/usr/local/cuda .venv/bin/maturin develop --release

# Download a model (install the Hugging Face CLI into the same venv)
.venv/bin/pip install huggingface-hub
.venv/bin/huggingface-cli download Qwen/Qwen3.5-122B-A10B

# Ensure swap is active
sudo swapon /swapfile_osmium

# Run with benchmark
CUDA_VISIBLE_DEVICES=0 .venv/bin/python -m krasis.server \
  --model-path ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B/snapshots/* \
  --benchmark --wc-alloc \
  --gpu-expert-bits 4 --attention-quant awq --kv-dtype fp8_e4m3 --kv-cache-mb 1000

What Osmium Changes vs Krasis

Feature	Krasis v0.1.66	Osmium v0.3
Expert DMA	Pinned memory (~28 GB/s)	WriteCombined (~46 GB/s)
Decode (122B, 1 GPU)	38.7 tok/s	57.5 tok/s
RAM management	Standard allocation	Incremental per-layer WC migration
RAM watchdog	5% floor, no swap awareness	0.5% floor, counts SwapFree
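The RAM watchdog row can be illustrated with a small sketch. The README only states the floors (5% vs 0.5%) and that Osmium counts SwapFree; the exact formula is not given. The code below therefore assumes headroom is measured as (MemAvailable + SwapFree) / MemTotal from /proc/meminfo. The function names and the choice of denominator are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def headroom_fraction(info, count_swap=True):
    """Free memory as a fraction of MemTotal, optionally counting SwapFree.

    Assumption: the denominator is MemTotal; the real watchdog may differ.
    """
    free_kb = info["MemAvailable"]
    if count_swap:
        free_kb += info.get("SwapFree", 0)
    return free_kb / info["MemTotal"]


def should_abort(info, floor=0.005, count_swap=True):
    """True when headroom drops below the floor (0.005 means 0.5%)."""
    return headroom_fraction(info, count_swap) < floor


SAMPLE = """MemTotal:       100000000 kB
MemAvailable:     300000 kB
SwapTotal:      64000000 kB
SwapFree:       40000000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    # Swap-unaware 5% floor would trip on this sample.
    print("5% floor, no swap:", should_abort(info, floor=0.05, count_swap=False))
    # Swap-aware 0.5% floor would not.
    print("0.5% floor, with swap:", should_abort(info, floor=0.005, count_swap=True))
```

On a live system the same functions can be fed the contents of /proc/meminfo. The sample shows why swap awareness matters in WC mode: physical RAM is nearly exhausted, yet plenty of swap remains.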

Osmium is a single commit (--wc-alloc) on top of upstream Krasis. PR #19 is open upstream.

Key Flags

Flag	Description
`--wc-alloc`	WriteCombined DMA staging. +49% decode speed, requires swap.
`--gpu-expert-bits 4`	INT4 Marlin experts on GPU (default)
`--attention-quant awq`	AWQ attention quantization (frees VRAM for HCS)
`--kv-dtype fp8_e4m3`	FP8 KV cache (halves KV memory vs BF16)
`--kv-cache-mb N`	KV cache size in MB. 1000 = 85K ctx, 3400 = 290K ctx
`--benchmark`	Run prefill + decode benchmark after loading
`--hcs`	Hot Cache Strategy (on by default with `--wc-alloc`)
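To make the idea behind `--hcs` concrete, here is a toy sketch of frequency-based expert caching. This README does not describe how HCS actually selects experts or how the "HCS Coverage" column is computed, so everything below is an assumption for illustration: experts are identified by hypothetical (layer, expert) pairs, the cache holds a fixed number of experts, and the hottest ones by activation count are kept in VRAM.

```python
from collections import Counter


def pick_hot_experts(activations, capacity):
    """Return the `capacity` most frequently activated (layer, expert) ids."""
    counts = Counter(activations)
    return {eid for eid, _ in counts.most_common(capacity)}


def hit_rate(activations, cached):
    """Fraction of expert activations that would be served from the cache."""
    hits = sum(1 for eid in activations if eid in cached)
    return hits / len(activations)


if __name__ == "__main__":
    # Hypothetical activation trace: expert 1 of layer 0 is the hottest.
    trace = [(0, 1), (0, 1), (0, 2), (0, 3), (0, 1), (0, 2)]
    for capacity in (1, 2):
        cached = pick_hot_experts(trace, capacity)
        print(capacity, sorted(cached), f"{hit_rate(trace, cached):.2f}")
```

Every activation that hits the cached set avoids a PCIe transfer for that expert. In the benchmark table, HCS coverage falls as the KV cache grows, which is consistent with the expert cache and the KV cache sharing the same VRAM budget.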

API

OpenAI-compatible at http://localhost:8012/v1/chat/completions with SSE streaming. Works with Cursor, OpenCode, Continue, and any OpenAI SDK client.

curl http://localhost:8012/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
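The same request can be made from Python with streaming enabled. The sketch below uses only the standard library and assumes the server follows the usual OpenAI streaming format: a `"stream": true` request field, `data:` lines carrying JSON chunks with `choices[0].delta.content`, and a final `data: [DONE]` line. Those details come from the OpenAI API convention rather than from this README, and the helper names are hypothetical.

```python
import json
import urllib.request

URL = "http://localhost:8012/v1/chat/completions"


def build_request(prompt, max_tokens=100, url=URL):
    """Build a streaming chat completion request (same payload as the curl call)."""
    payload = {
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,  # assumption: standard OpenAI streaming switch
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_text(lines):
    """Yield content deltas from an iterable of SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt):
    """Send the prompt and print tokens as they arrive (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        for piece in iter_sse_text(resp):
            print(piece, end="", flush=True)
    print()
```

Calling `stream_chat("Hello!")` against a running server should print the reply incrementally. Any OpenAI SDK client pointed at the same base URL is an alternative to this hand-rolled parser.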

License

SSPL-1.0 (inherited from Krasis)
