Releases: brontoguana/krasis
v0.1.66-rc2
Recreated rc2 on commit 418e3ce after switching the manylinux FLA link step to use the resolved CUDA stub file path directly.
v0.1.66-rc1
Pre-release for multi-GPU testing.
Changes since v0.1.65-rc6:
- 122B FLA fix: multi-H cubins and scratch buffer sizing
- Cross-compiled FLA kernels for sm80/sm89/sm90/sm120
- FLA kernel arg signature and block size fix
- Arch-specific FLA .so files ship in wheel (no first-run JIT)
- GPU arch auto-detection with forward/backward compat fallback
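The arch auto-detection with fallback can be sketched roughly as below. This is an illustrative sketch, not krasis internals: the function name, the arch list, and the fallback policy are assumptions based on the release notes (wheel ships sm80/sm89/sm90/sm120 builds; a GPU without an exact match gets the newest build it can still run).

```python
# Illustrative sketch; names and policy are assumptions, not krasis code.
SHIPPED_ARCHS = [80, 89, 90, 120]  # sm80/sm89/sm90/sm120 .so files in the wheel

def pick_kernel_arch(detected_sm: int) -> int:
    """Pick the best prebuilt FLA kernel arch for the detected GPU.

    Prefers an exact match, then the newest shipped arch not exceeding
    the detected one (backward compat: a newer GPU can run an older-arch
    binary). If the GPU predates every shipped arch, fall back to the
    oldest build as a last resort.
    """
    compatible = [arch for arch in SHIPPED_ARCHS if arch <= detected_sm]
    if compatible:
        return max(compatible)
    return min(SHIPPED_ARCHS)

pick_kernel_arch(86)  # an sm86 (Ampere) GPU selects the sm80 build
```

The point of shipping per-arch binaries plus this selection step is that no first-run JIT is needed: the wheel already contains a loadable kernel for each supported generation.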
v0.1.65-rc6
Prerelease for installed-package sidecar fixes and FP8-only KV cache on Ampere.
v0.1.65-rc5
Prerelease with vendored CUDA sidecars injected into release wheels, prerelease installer force-reinstall handling, and FP8-only KV cache on Ampere and in the interactive launcher.
v0.1.65-rc4
Prerelease with release-wheel sidecar injection, prerelease installer force-reinstall handling, and FP8-only KV cache on Ampere and in the interactive launcher.
v0.1.65-rc3
Prerelease with vendored CUDA sidecars packaged for installed wheels, prerelease installer force-reinstall handling, and FP8-only KV cache on Ampere and in the interactive launcher.
v0.1.65-rc2
Full Changelog: v0.1.65-rc1...v0.1.65-rc2
v0.1.65-rc1
Changes
- Return HTTP 413 with a structured OpenAI-format error (context_length_exceeded) when the KV cache is full, instead of a generic 500
- Expose max_context_tokens in the /v1/models endpoint so clients can see the actual hardware-constrained context limit
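The structured 413 body follows the OpenAI error envelope. A sketch of its shape, assuming illustrative message text and helper name (only the "type"/"code" fields and the context_length_exceeded code come from the release notes):

```python
import json

# Illustrative sketch of the 413 response body; the helper name and
# message wording are assumptions, the envelope follows the OpenAI
# error format with code "context_length_exceeded".
def context_length_exceeded_error(requested_tokens: int, max_context_tokens: int) -> str:
    body = {
        "error": {
            "message": (
                f"Request needs {requested_tokens} tokens but the KV cache "
                f"supports at most {max_context_tokens}."
            ),
            "type": "invalid_request_error",
            "code": "context_length_exceeded",
        }
    }
    return json.dumps(body)

payload = json.loads(context_length_exceeded_error(40000, 32768))
```

A machine-readable code lets OpenAI-compatible clients distinguish "shrink the prompt" from a genuine server failure, which a generic 500 cannot express.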
v0.1.64
Changes since v0.1.63
- VRAM budget fix: Dense MLP workspace now included in prefill budget (max of MoE vs dense intermediates)
- KV cache auto-cap: Clamps KV cache to available VRAM after weight loading (downward only, warns when capped)
- HCS pool fix: hard_budget_mb=0 detection now correct
- CPU cache building disabled: No longer builds unused CPU expert caches; deleted 159 GB of stale CPU cache files
- Release test report improved: TOC with links, compact benchmark summaries with model/config info, full sanity responses for manual verification
- GLM-4.7 graph capture segfault fixed
- AWQ attention template added for GLM-4.7
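The KV cache auto-cap above can be sketched as a downward-only clamp applied after weights are loaded. This is a simplified illustration; the function name and parameters are assumptions (the release notes say only that the cap is downward-only, computed against available VRAM, and warns when it fires):

```python
# Illustrative sketch of a downward-only KV cache clamp; names are
# assumptions, not krasis internals.
def cap_kv_cache_mb(configured_mb: int, free_vram_mb: int, safety_margin_mb: int) -> int:
    """Clamp the KV cache budget to what fits in free VRAM after
    weight loading. Downward only: a cache that already fits is
    left unchanged."""
    available_mb = max(free_vram_mb - safety_margin_mb, 0)
    if configured_mb > available_mb:
        print(f"warning: KV cache capped from {configured_mb} MB to {available_mb} MB")
        return available_mb
    return configured_mb
```

Clamping only downward preserves user intent when the configured size already fits, while preventing an over-sized cache from pushing allocation past free VRAM.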
Release test results
QCN (Qwen3-Coder-Next) on RTX 5090:
- INT4/4 BF16: 71.4 tok/s decode, 3518 tok/s prefill, 60.6% HCS
- INT4/4 AWQ: 65.9 tok/s decode, 3536 tok/s prefill, 66.5% HCS
- INT8/8 BF16: 38.0 tok/s decode, 3368 tok/s prefill, 30.7% HCS
- INT8/8 AWQ: 35.7 tok/s decode, 3430 tok/s prefill, 33.5% HCS
- Multi-GPU INT4 BF16: 41.0 tok/s decode, 3544 tok/s prefill, 82.1% HCS
Q3.5-35B on RTX 5090:
- INT4/4 BF16: 110.7 tok/s decode, 4495 tok/s prefill, 100% HCS
- INT4/4 AWQ: 95.1 tok/s decode, 4407 tok/s prefill
- INT8/8 BF16: 60.9 tok/s decode, 4391 tok/s prefill
- INT8/8 AWQ: 57.1 tok/s decode, 4282 tok/s prefill
- Multi-GPU INT4 BF16: 49.7 tok/s decode, 4402 tok/s prefill
All sanity tests (14 prompts, multi-turn) passed across all configs.
v0.1.64-rc2
Changes since rc1
- Fix VRAM budget: dense MLP workspace now included in prefill budget calculation (fixes GLM-4.7 OOM, no impact on existing models)
- Fix HCS pool: hard_budget_mb=0 no longer triggers auto-detect (was allocating all VRAM as hard pool)
- KV cache auto-cap: clamps to available VRAM after weight loading (safety margin computed from model dimensions)
- Per-layer expert storage: eliminate duplicate pinned RAM, fix AWQ calibration
- Model discovery: fix for nested dirs, skip TUI when --model-path set
- CPU expert cache building fully removed (no longer used)
- Added Qwen3.5-397B-A17B to supported models list
- Remove per-token overhead from decode loop (Vec clones, cuMemGetInfo)
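The hard_budget_mb=0 fix is a classic sentinel-value bug: zero was being treated as "unset" and fell through to auto-detection, which grabbed all VRAM as hard pool. A minimal sketch of the corrected check, assuming an illustrative function name and a simplified auto-detect path (only the field name and the failure mode come from the release notes):

```python
# Illustrative sketch; only hard_budget_mb and the bug description come
# from the release notes, the rest is assumed for demonstration.
from typing import Optional

def resolve_hard_budget_mb(hard_budget_mb: Optional[int], total_vram_mb: int) -> int:
    """Resolve the HCS hard-pool budget.

    Before the fix, a configured value of 0 was treated the same as
    "unset" and triggered auto-detection (allocating all VRAM as hard
    pool). After the fix, only None/absent triggers auto-detection;
    an explicit 0 means "no hard pool".
    """
    if hard_budget_mb is None:
        return total_vram_mb  # auto-detect path (simplified stand-in)
    return hard_budget_mb
```

Distinguishing None from 0 keeps "explicitly disabled" and "not configured" as separate states, which is the crux of the fix.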
Tested
- QCN release test: all 5 configs passed (70 tok/s INT4/BF16, baselines intact)
- Q3.5-35B release test: all 5 configs passed (110 tok/s INT4/BF16)