Releases: brontoguana/krasis
v0.1.66-rc2
Recreated rc2 on commit 418e3ce after switching the manylinux FLA link step to use the resolved CUDA stub file path directly.
v0.1.66-rc1
Pre-release for multi-GPU testing.
Changes since v0.1.65-rc6:
- 122B FLA fix: multi-H cubins and scratch buffer sizing
- Cross-compiled FLA kernels for sm80/sm89/sm90/sm120
- FLA kernel arg signature and block size fix
- Arch-specific FLA .so files ship in wheel (no first-run JIT)
- GPU arch auto-detection with forward/backward compat fallback
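The arch auto-detection with fallback can be sketched roughly as below. This is an illustrative sketch, not krasis internals: the function name, the arch list, and the fallback policy are assumptions based on the release notes (wheel ships sm80/sm89/sm90/sm120 builds; a GPU without an exact match gets the newest build it can still run).

```python
# Illustrative sketch; names and policy are assumptions, not krasis code.
SHIPPED_ARCHS = [80, 89, 90, 120]  # sm80/sm89/sm90/sm120 .so files in the wheel

def pick_kernel_arch(detected_sm: int) -> int:
    """Pick the best prebuilt FLA kernel arch for the detected GPU.

    Prefers an exact match, then the newest shipped arch not exceeding
    the detected one (backward compat: a newer GPU can run an older-arch
    binary). If the GPU predates every shipped arch, fall back to the
    oldest build as a last resort.
    """
    compatible = [arch for arch in SHIPPED_ARCHS if arch <= detected_sm]
    if compatible:
        return max(compatible)
    return min(SHIPPED_ARCHS)

pick_kernel_arch(86)  # an sm86 (Ampere) GPU selects the sm80 build
```

The point of shipping per-arch binaries plus this selection step is that no first-run JIT is needed: the wheel already contains a loadable kernel for each supported generation.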
v0.1.65-rc6
Prerelease for installed-package sidecar fixes and FP8-only KV cache on Ampere.
v0.1.65-rc5
Prerelease with vendored CUDA sidecars injected into release wheels, prerelease installer force-reinstall handling, and FP8-only KV cache on Ampere and in the interactive launcher.
v0.1.65-rc4
Prerelease with release-wheel sidecar injection, prerelease installer force-reinstall handling, and FP8-only KV cache on Ampere and in the interactive launcher.
v0.1.65-rc3
Prerelease with vendored CUDA sidecars packaged for installed wheels, prerelease installer force-reinstall handling, and FP8-only KV cache on Ampere and in the interactive launcher.
v0.1.65-rc2
Full Changelog: v0.1.65-rc1...v0.1.65-rc2
v0.1.65-rc1
Changes
- Return HTTP 413 with a structured OpenAI-format error (context_length_exceeded) when the KV cache is full, instead of a generic 500
- Expose max_context_tokens in the /v1/models endpoint so clients can see the actual hardware-constrained context limit
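The structured 413 body follows the OpenAI error envelope. A sketch of its shape, assuming illustrative message text and helper name (only the "type"/"code" fields and the context_length_exceeded code come from the release notes):

```python
import json

# Illustrative sketch of the 413 response body; the helper name and
# message wording are assumptions, the envelope follows the OpenAI
# error format with code "context_length_exceeded".
def context_length_exceeded_error(requested_tokens: int, max_context_tokens: int) -> str:
    body = {
        "error": {
            "message": (
                f"Request needs {requested_tokens} tokens but the KV cache "
                f"supports at most {max_context_tokens}."
            ),
            "type": "invalid_request_error",
            "code": "context_length_exceeded",
        }
    }
    return json.dumps(body)

payload = json.loads(context_length_exceeded_error(40000, 32768))
```

A machine-readable code lets OpenAI-compatible clients distinguish "shrink the prompt" from a genuine server failure, which a generic 500 cannot express.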
v0.1.64
Changes since v0.1.63
- VRAM budget fix: Dense MLP workspace now included in prefill budget (max of MoE vs dense intermediates)
- KV cache auto-cap: Clamps KV cache to available VRAM after weight loading (downward only, warns when capped)
- HCS pool fix: hard_budget_mb=0 detection now correct
- CPU cache building disabled: No longer builds unused CPU expert caches; deleted 159 GB of stale CPU cache files
- Release test report improved: TOC with links, compact benchmark summaries with model/config info, full sanity responses for manual verification
- GLM-4.7 graph capture segfault fixed
- AWQ attention template added for GLM-4.7
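The KV cache auto-cap above can be sketched as a downward-only clamp applied after weights are loaded. This is a simplified illustration; the function name and parameters are assumptions (the release notes say only that the cap is downward-only, computed against available VRAM, and warns when it fires):

```python
# Illustrative sketch of a downward-only KV cache clamp; names are
# assumptions, not krasis internals.
def cap_kv_cache_mb(configured_mb: int, free_vram_mb: int, safety_margin_mb: int) -> int:
    """Clamp the KV cache budget to what fits in free VRAM after
    weight loading. Downward only: a cache that already fits is
    left unchanged."""
    available_mb = max(free_vram_mb - safety_margin_mb, 0)
    if configured_mb > available_mb:
        print(f"warning: KV cache capped from {configured_mb} MB to {available_mb} MB")
        return available_mb
    return configured_mb
```

Clamping only downward preserves user intent when the configured size already fits, while preventing an over-sized cache from pushing allocation past free VRAM.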
Release test results
QCN (Qwen3-Coder-Next) on RTX 5090:
- INT4/4 BF16: 71.4 tok/s decode, 3518 tok/s prefill, 60.6% HCS
- INT4/4 AWQ: 65.9 tok/s decode, 3536 tok/s prefill, 66.5% HCS
- INT8/8 BF16: 38.0 tok/s decode, 3368 tok/s prefill, 30.7% HCS
- INT8/8 AWQ: 35.7 tok/s decode, 3430 tok/s prefill, 33.5% HCS
- Multi-GPU INT4 BF16: 41.0 tok/s decode, 3544 tok/s prefill, 82.1% HCS
Q3.5-35B on RTX 5090:
- INT4/4 BF16: 110.7 tok/s decode, 4495 tok/s prefill, 100% HCS
- INT4/4 AWQ: 95.1 tok/s decode, 4407 tok/s prefill
- INT8/8 BF16: 60.9 tok/s decode, 4391 tok/s prefill
- INT8/8 AWQ: 57.1 tok/s decode, 4282 tok/s prefill
- Multi-GPU INT4 BF16: 49.7 tok/s decode, 4402 tok/s prefill
All sanity tests (14 prompts, multi-turn) passed across all configs.
v0.1.64-rc2
Changes since rc1
- Fix VRAM budget: dense MLP workspace now included in prefill budget calculation (fixes GLM-4.7 OOM, no impact on existing models)
- Fix HCS pool: hard_budget_mb=0 no longer triggers auto-detect (was allocating all VRAM as hard pool)
- KV cache auto-cap: clamps to available VRAM after weight loading (safety margin computed from model dimensions)
- Per-layer expert storage: eliminate duplicate pinned RAM, fix AWQ calibration
- Model discovery: fix for nested dirs, skip TUI when --model-path set
- CPU expert cache building fully removed (no longer used)
- Added Qwen3.5-397B-A17B to supported models list
- Remove per-token overhead from decode loop (Vec clones, cuMemGetInfo)
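The hard_budget_mb=0 fix is a classic sentinel-value bug: zero was being treated as "unset" and fell through to auto-detection, which grabbed all VRAM as hard pool. A minimal sketch of the corrected check, assuming an illustrative function name and a simplified auto-detect path (only the field name and the failure mode come from the release notes):

```python
# Illustrative sketch; only hard_budget_mb and the bug description come
# from the release notes, the rest is assumed for demonstration.
from typing import Optional

def resolve_hard_budget_mb(hard_budget_mb: Optional[int], total_vram_mb: int) -> int:
    """Resolve the HCS hard-pool budget.

    Before the fix, a configured value of 0 was treated the same as
    "unset" and triggered auto-detection (allocating all VRAM as hard
    pool). After the fix, only None/absent triggers auto-detection;
    an explicit 0 means "no hard pool".
    """
    if hard_budget_mb is None:
        return total_vram_mb  # auto-detect path (simplified stand-in)
    return hard_budget_mb
```

Distinguishing None from 0 keeps "explicitly disabled" and "not configured" as separate states, which is the crux of the fix.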
Tested
- QCN release test: all 5 configs passed (70 tok/s INT4/BF16, baselines intact)
- Q3.5-35B release test: all 5 configs passed (110 tok/s INT4/BF16)