- Bump SPIR-V version to 1.3 (required for GroupNonUniform ops)
- Add Float16 capability to SPIR-V headers
- Fix loop-invariant expression hoisting in GIOCompiler:
  - Expressions defined outside loops but first referenced inside
    were being compiled in the loop body, causing SPIR-V dominance errors
  - Now hoists invariant expressions before loop structure
  - Filters out loop-dependent and scope-dependent (When, etc.) expressions
- Add AGENTS.md with codebase documentation
💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
For repeated execute() calls with the same pipeline and buffers, skip the
O(n) GExecution tree traversal by caching interpret results.
Keyed by (execution identity, layout bindings identity hash).

💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
On cache hit (same pipeline + same buffers):
- Wait for previous GPU fence
- Resubmit same command buffer
- Skip tree traversal, descriptor allocation, command recording

Eliminates O(n) per-call overhead in decode loop.

💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
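The cache-hit fast path described above can be sketched roughly as follows. This is a minimal Python model, not the project's code: `ExecutionCache`, `Recorded`, and `_interpret` are hypothetical names, Python's `id()` stands in for an identity hash, and "resubmitting" is reduced to bumping a counter.

```python
from dataclasses import dataclass


@dataclass
class Recorded:
    """Hypothetical stand-in for a recorded command buffer."""
    commands: list
    submissions: int = 0


class ExecutionCache:
    """Caches interpret results by (execution identity, bindings identity)."""

    def __init__(self):
        self._cache = {}
        self.traversals = 0  # counts slow-path tree walks

    def _interpret(self, execution, bindings):
        # Slow path: O(n) walk over the execution steps (assumed shape).
        self.traversals += 1
        return Recorded([f"dispatch {step}" for step in execution])

    def execute(self, execution, bindings):
        # Identity-based key: same objects, not merely equal contents.
        key = (id(execution), tuple(id(b) for b in bindings))
        entry = self._cache.get(key)
        if entry is None:
            recorded = self._interpret(execution, bindings)
            # Keep strong references to the keyed objects so their ids
            # cannot be reused by unrelated objects while cached.
            entry = (recorded, execution, bindings)
            self._cache[key] = entry
        recorded = entry[0]
        recorded.submissions += 1  # models "resubmit same command buffer"
        return recorded
```

Calling `execute` twice with the same pipeline object and the same buffer objects walks the tree once. Passing different buffer objects produces a different key and takes the slow path again.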
Replace per-submission fences with a single timeline semaphore:
- GPU-GPU sync via semaphore wait/signal (no CPU involvement)
- Only sync to CPU when reading results (unavoidable for sampling)
- Eliminates fence creation/destruction overhead per token

Performance: 57.8 -> 62.0 tok/s (+8%)

💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
Optimize F16MatmulVecHybridProgram to compute 2 output rows per workgroup
instead of 8 warps computing 1 row each. This matches llama.cpp's approach.

Key changes:
- Change WARPS_PER_WORKGROUP=8 to NUM_ROWS=2, BLOCK_SIZE=32 (single warp)
- Each workgroup computes 2 consecutive output rows sharing input loads
- Use separate accumulation loops per row (DSL limitation with Vec2)
- Use for-comprehension to sequence writes (DSL bug workaround)

Performance: 20 tok/s → 70 tok/s on RTX 2070 Max-Q

💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
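To illustrate why sharing input loads across rows helps, here is a small sequential Python model of the multi-row layout. It is only a sketch under assumptions: `matvec_multirow` is a hypothetical name, each outer-loop iteration stands in for one workgroup, and the rows are accumulated in one fused loop for brevity (the commit uses separate accumulation loops per row).

```python
NUM_ROWS = 2  # output rows per workgroup, as in the commit message


def matvec_multirow(weights, x, num_rows=NUM_ROWS):
    """Compute y = W @ x; return (y, number of input-vector loads)."""
    n_out = len(weights)
    y = [0.0] * n_out
    input_loads = 0
    # One iteration of this loop models one workgroup.
    for first_row in range(0, n_out, num_rows):
        rows = range(first_row, min(first_row + num_rows, n_out))
        acc = [0.0] * len(rows)
        for k, xk in enumerate(x):
            input_loads += 1  # x[k] is loaded once per workgroup...
            for i, r in enumerate(rows):
                acc[i] += weights[r][k] * xk  # ...and reused for every row
        for i, r in enumerate(rows):
            y[r] = acc[i]
    return y, input_loads
```

With `num_rows=1` every output row reloads the whole input vector. With `num_rows=2` the number of input loads is halved while the result is unchanged.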
Apply same multi-row optimization to output projection kernel.
Each workgroup now computes 4 vocab logits instead of 8 warps computing 1 each.

Tested NUM_ROWS values:
- NUM_ROWS=2: ~70 tok/s (too many workgroups)
- NUM_ROWS=4: ~73 tok/s (best)
- NUM_ROWS=8: ~67 tok/s (too much loop overhead)

Performance: 70 → 73 tok/s (+4%)

💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
Previously, ExecutionHandler tracked ALL buffer bindings as "dirty" after each
dispatch, causing unnecessary barriers between operations that only READ the
same buffer. This serialized Q/K/V matmuls that should run in parallel.

Changes:
- Add getWrittenBuffers() to DSLCompiler to extract written buffers from GIO
- Extend SpirvProgram to track per-binding Read/Write/ReadWrite operations
- Modify ExecutionHandler to only mark WRITTEN bindings as dirty
- Read-after-read is now barrier-free, enabling parallel execution

This reduces unnecessary pipeline barriers and allows independent read
operations (like Q/K/V matmuls reading from attnNormOut) to overlap on GPU.

Benchmark: ~35% improvement in tok/s on LLM inference (64.7 → 87.2 tok/s)

💘 Generated with Crush
Assisted-by: Claude Opus 4.5 via Crush <crush@charm.land>
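The change in dirty tracking can be modeled with a few lines of Python. This is a simplified sketch, not the ExecutionHandler logic itself: `plan_barriers` and the buffer names other than `attnNormOut` are hypothetical, a barrier is assumed to clear all dirty state, and write-after-read hazards are not modeled.

```python
def plan_barriers(dispatches, track_reads_as_dirty=False):
    """Return the names of dispatches that need a barrier before them.

    dispatches: list of (name, reads, writes), where reads and writes
    are sets of buffer names.
    """
    dirty = set()
    needs_barrier = []
    for name, reads, writes in dispatches:
        touched = reads | writes
        if touched & dirty:
            needs_barrier.append(name)
            dirty.clear()  # assumed: the barrier flushes all pending writes
        # Old behaviour marks every binding dirty; new behaviour only writes.
        dirty |= touched if track_reads_as_dirty else writes
    return needs_barrier


# Q/K/V matmuls all read attnNormOut and write distinct outputs.
layer = [
    ("attnNorm", {"x"}, {"attnNormOut"}),
    ("q", {"attnNormOut", "wq"}, {"qOut"}),
    ("k", {"attnNormOut", "wk"}, {"kOut"}),
    ("v", {"attnNormOut", "wv"}, {"vOut"}),
]
old_plan = plan_barriers(layer, track_reads_as_dirty=True)
new_plan = plan_barriers(layer)
```

In this model the old scheme places a barrier before each of `q`, `k`, and `v`, while the write-only scheme needs a single barrier before `q`, leaving the three matmuls free to overlap.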
WIP (shows weird conflicts because it still needs to be rebased).
The DSL still needs to be cleaned up; GShared is weird now.
Some sources are to be removed.