Releases: brontoguana/krasis

v0.1.66-rc2

18 Apr 05:34

v0.1.66-rc2 Pre-release

Recreated rc2 on commit 418e3ce after switching the manylinux FLA link step to use the resolved CUDA stub file path directly.

v0.1.66-rc1

07 Apr 23:06

v0.1.66-rc1 Pre-release

Pre-release for multi-GPU testing.

Changes since v0.1.65-rc6:

  • 122B FLA fix: multi-H cubins and scratch buffer sizing
  • Cross-compiled FLA kernels for sm80/sm89/sm90/sm120
  • FLA kernel arg signature and block size fix
  • Arch-specific FLA .so files ship in wheel (no first-run JIT)
  • GPU arch auto-detection with forward/backward compat fallback
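The arch auto-detection with forward/backward compat fallback described in the last bullet can be sketched roughly as below. This is an illustrative sketch, not the krasis implementation: the names `SHIPPED_ARCHS` and `pick_fla_arch` are hypothetical, and the policy (prefer exact match, else the newest shipped arch not exceeding the device, else the oldest shipped arch) is an assumption.

```rust
/// SM architectures whose FLA .so files ship in the wheel
/// (sm80/sm89/sm90/sm120 per the release notes).
const SHIPPED_ARCHS: [u32; 4] = [80, 89, 90, 120];

/// Pick the best shipped arch for a detected compute capability:
/// exact match if available, otherwise the newest shipped arch that
/// does not exceed the device (backward compat, e.g. sm86 -> sm80),
/// otherwise fall back to the oldest shipped arch.
fn pick_fla_arch(detected_sm: u32) -> u32 {
    SHIPPED_ARCHS
        .iter()
        .copied()
        .filter(|&a| a <= detected_sm)
        .max()
        .unwrap_or(SHIPPED_ARCHS[0])
}

fn main() {
    assert_eq!(pick_fla_arch(90), 90);   // exact match (Hopper)
    assert_eq!(pick_fla_arch(86), 80);   // sm86 runs the sm80 build
    assert_eq!(pick_fla_arch(120), 120); // Blackwell
    println!("ok");
}
```

Shipping per-arch binaries plus a selection rule like this is what removes the first-run JIT step mentioned above.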

v0.1.65-rc6

29 Mar 22:22

v0.1.65-rc6 Pre-release

Pre-release with installed-package sidecar fixes and FP8-only KV cache on Ampere.

v0.1.65-rc5

29 Mar 22:11

v0.1.65-rc5 Pre-release

Pre-release with vendored CUDA sidecars injected into release wheels, force-reinstall handling in the pre-release installer, and FP8-only KV cache on Ampere and in the interactive launcher.

v0.1.65-rc4

29 Mar 22:03

v0.1.65-rc4 Pre-release

Pre-release with release-wheel sidecar injection, force-reinstall handling in the pre-release installer, and FP8-only KV cache on Ampere and in the interactive launcher.

v0.1.65-rc3

29 Mar 21:53

v0.1.65-rc3 Pre-release

Pre-release with vendored CUDA sidecars packaged for installed wheels, force-reinstall handling in the pre-release installer, and FP8-only KV cache on Ampere and in the interactive launcher.

v0.1.65-rc2

29 Mar 20:34

v0.1.65-rc2 Pre-release

v0.1.65-rc1

19 Mar 15:15

v0.1.65-rc1 Pre-release

Changes

  • Return HTTP 413 with structured OpenAI-format error (context_length_exceeded) when KV cache is full, instead of generic 500
  • Expose max_context_tokens in /v1/models endpoint so clients can see actual hardware-constrained context limit
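On the wire, a structured OpenAI-format error means the body follows the standard OpenAI error envelope rather than a bare 500. The `code` value below matches the release note; the message text and `param` value are illustrative, not copied from krasis:

```json
{
  "error": {
    "message": "Request exceeds the available KV cache; see max_context_tokens in /v1/models.",
    "type": "invalid_request_error",
    "param": null,
    "code": "context_length_exceeded"
  }
}
```

A body in this shape, returned with HTTP status 413, lets OpenAI-compatible clients detect the condition from the `code` field instead of parsing a generic server error.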

v0.1.64

19 Mar 14:56

Changes since v0.1.63

  • VRAM budget fix: Dense MLP workspace now included in prefill budget (max of MoE vs dense intermediates)
  • KV cache auto-cap: Clamps KV cache to available VRAM after weight loading (downward only, warns when capped)
  • HCS pool fix: hard_budget_mb=0 detection now correct
  • CPU cache building disabled: No longer builds unused CPU expert caches; deleted 159 GB of stale CPU cache files
  • Release test report improved: TOC with links, compact benchmark summaries with model/config info, full sanity responses for manual verification
  • GLM-4.7 graph capture segfault fixed
  • AWQ attention template added for GLM-4.7
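The KV cache auto-cap bullet above describes a downward-only clamp applied after weight loading. A minimal sketch of that behavior, with hypothetical names and a made-up safety margin (the real margin is computed from model dimensions):

```rust
// Illustrative sketch of a downward-only KV cache cap; not the actual
// krasis code. `requested_mb` is the configured KV budget, `free_mb`
// is VRAM remaining after weight loading.
fn cap_kv_cache_mb(requested_mb: u64, free_mb: u64, safety_margin_mb: u64) -> u64 {
    let available = free_mb.saturating_sub(safety_margin_mb);
    if requested_mb > available {
        // Warn when capped, as the release note describes.
        eprintln!("warning: KV cache capped from {requested_mb} MiB to {available} MiB");
        available // clamp downward only
    } else {
        requested_mb // never grown beyond the configured size
    }
}

fn main() {
    assert_eq!(cap_kv_cache_mb(32_000, 24_000, 1_000), 23_000); // capped
    assert_eq!(cap_kv_cache_mb(8_000, 24_000, 1_000), 8_000);   // unchanged
    println!("ok");
}
```

Clamping only downward keeps a deliberately small configured cache small while preventing an oversized one from OOMing after the weights land.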

Release test results

QCN (Qwen3-Coder-Next) on RTX 5090:

  • INT4/4 BF16: 71.4 tok/s decode, 3518 tok/s prefill, 60.6% HCS
  • INT4/4 AWQ: 65.9 tok/s decode, 3536 tok/s prefill, 66.5% HCS
  • INT8/8 BF16: 38.0 tok/s decode, 3368 tok/s prefill, 30.7% HCS
  • INT8/8 AWQ: 35.7 tok/s decode, 3430 tok/s prefill, 33.5% HCS
  • Multi-GPU INT4 BF16: 41.0 tok/s decode, 3544 tok/s prefill, 82.1% HCS

Q3.5-35B on RTX 5090:

  • INT4/4 BF16: 110.7 tok/s decode, 4495 tok/s prefill, 100% HCS
  • INT4/4 AWQ: 95.1 tok/s decode, 4407 tok/s prefill
  • INT8/8 BF16: 60.9 tok/s decode, 4391 tok/s prefill
  • INT8/8 AWQ: 57.1 tok/s decode, 4282 tok/s prefill
  • Multi-GPU INT4 BF16: 49.7 tok/s decode, 4402 tok/s prefill

All sanity tests (14 prompts, multi-turn) passed across all configs.

v0.1.64-rc2

19 Mar 11:40

v0.1.64-rc2 Pre-release

Changes since rc1

  • Fix VRAM budget: dense MLP workspace now included in prefill budget calculation (fixes GLM-4.7 OOM, no impact on existing models)
  • Fix HCS pool: hard_budget_mb=0 no longer triggers auto-detect (was allocating all VRAM as hard pool)
  • KV cache auto-cap: clamps to available VRAM after weight loading (safety margin computed from model dimensions)
  • Per-layer expert storage: eliminate duplicate pinned RAM, fix AWQ calibration
  • Model discovery: fix for nested dirs, skip TUI when --model-path set
  • CPU expert cache building fully removed (no longer used)
  • Added Qwen3.5-397B-A17B to supported models list
  • Remove per-token overhead from decode loop (Vec clones, cuMemGetInfo)
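The HCS pool fix in this list (and in v0.1.64 above) is easy to misread, so here is a sketch of the distinction it draws. The function name, `Option`-based config shape, and half-of-VRAM auto-detect policy are all hypothetical, not krasis internals:

```rust
// Illustrative sketch of the hard_budget_mb fix: an explicit 0 now
// means "no hard pool", and only an *unset* value auto-detects.
fn resolve_hard_budget_mb(configured: Option<u64>, detected_vram_mb: u64) -> u64 {
    match configured {
        // Explicit value wins, including 0 (disabled). The bug was that
        // 0 fell through to auto-detect and claimed all VRAM as hard pool.
        Some(mb) => mb,
        // Auto-detect only when the setting is absent; the half-of-VRAM
        // policy here is purely illustrative.
        None => detected_vram_mb / 2,
    }
}

fn main() {
    assert_eq!(resolve_hard_budget_mb(Some(0), 32_768), 0);       // disabled
    assert_eq!(resolve_hard_budget_mb(Some(4_096), 32_768), 4_096);
    assert_eq!(resolve_hard_budget_mb(None, 32_768), 16_384);     // auto-detect
    println!("ok");
}
```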

Tested

  • QCN release test: all 5 configs passed (70 tok/s INT4/BF16, baselines intact)
  • Q3.5-35B release test: all 5 configs passed (110 tok/s INT4/BF16)