- Estimated release date:
  - Public preview (alpha): 9/1
  - Public preview (beta): 9/30
  - Refactor kernels & TeSA: 10/15
P0
- Tune more steps to show the speedup gain of the PyTorch sparse modules
- Support the OpenAI kernel/template
- Code review
- Usage interface (8/19); update one version on 8/26
- Fix Triton speed (8/19)
- Sparse Softmax Kernel
- Biased OpenAI MatMul Kernel
- Fine-grained 99% + block size 8x8 95% + block size 32x32
- Documentation (test)
- Package data (test)
- sparta.tune(): hook, set search space (see the usage sketch after this list)
- Fix sparse softmax
- Integration test/example: Linear, Softmax
- Fix JIT latency
- Read the docs
- SparTA DDS MatMul kernel
- Batch MatMul & Softmax
- Sparse Attention
- Add sparse matmul kernel: transpose_A
- Functional
- Support backward
- Add performance test: compare with Triton 1.1.2 (upload test scripts)
- Test current tuner
- Test Sparse Attention
- Update kernel PyCUDA interface
- Profile layout conversion
- Construct sparse attention op with linear & softmax ops
- Beta version: docs, docstrings & examples
- Test on V100; backward
- Fix kernel output
- Module tuner: get combined search space of connected ops automatically
- Connect to NNI's new tuner
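
A minimal sketch of the usage interface that the tuning items above aim at, assuming a `SparseLinear` wrapper and a `sparta.tune()` entry point; the class name, keyword arguments, and return value are assumptions for illustration, not the finalized API.

```python
import torch
import sparta  # the SparTA package this roadmap tracks

# Dense module to be sparsified and a block-sparse weight mask
# (32x32 blocks, roughly 95% of blocks pruned).
dense_linear = torch.nn.Linear(1024, 1024).cuda()
block_mask = torch.rand(1024 // 32, 1024 // 32) > 0.95
weight_mask = block_mask.repeat_interleave(32, 0).repeat_interleave(32, 1).cuda()

# Wrap the dense op into a sparse operator; the class name is an assumption.
sparse_linear = sparta.nn.SparseLinear(dense_linear, weight_mask=weight_mask)

# sparta.tune() hooks into the op, builds the kernel search space
# (block sizes, thread tiling, kernel template) and returns the best config.
sample_input = torch.rand(4096, 1024).cuda()
best_config = sparta.tune(sparse_linear, sample_inputs=[sample_input])
```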
P1
- Apply Roller's rules
- Support multi-process tuning
- BCSR kernel: convert(), inverse(), swapaxes(), sum(); rebuild the TeSA converter when set_mask() is called (see the BCSR sketch after this list)
- Auto converter: support value mask in matmul kernels
- PyCUDA device context register & operator.to() (multiple cards)
- Support multiple sparse formats: SDD, DSD, DDS for linear
- Support the block quantization kernel/fp16/bf16
- Compare Sparse Softmax with Triton's Sparse Softmax and keep improving.
- Unit tests
- Model tuning interface / documents / examples
- Common mask patterns
- Refactor TeSA (Meta, linter)
- Fuse layout conversion into kernels
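
For the BCSR kernel item above, a minimal, framework-free sketch of what convert() could do for a block-sparse matrix; the function name and the row_ptr/col_idx/values layout are assumptions, and inverse()/swapaxes()/sum()/set_mask() are left out.

```python
import torch

def bcsr_convert(dense: torch.Tensor, mask: torch.Tensor, block: int = 32):
    """Pack the masked blocks of `dense` into a BCSR-style layout (sketch only)."""
    H, W = dense.shape
    bh, bw = H // block, W // block
    # A block is kept if any element of its mask is nonzero.
    block_mask = mask.reshape(bh, block, bw, block).sum(dim=(1, 3)) > 0

    row_ptr, col_idx, blocks = [0], [], []
    for i in range(bh):
        for j in range(bw):
            if block_mask[i, j]:
                col_idx.append(j)
                blocks.append(dense[i * block:(i + 1) * block,
                                    j * block:(j + 1) * block])
        row_ptr.append(len(col_idx))  # prefix sum of kept blocks per block row

    values = torch.stack(blocks) if blocks else dense.new_zeros(0, block, block)
    return torch.tensor(row_ptr), torch.tensor(col_idx), values
```

Usage would be along the lines of `row_ptr, col_idx, values = bcsr_convert(weight, weight_mask)`, with the three tensors handed to the kernel.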
P2
- Support the offline LUT or the kernel cache/DB (see the sketch below)
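
The offline LUT / kernel cache item could look roughly like the following: tuned kernel configurations keyed by operator, shapes, and sparsity config, persisted to disk so later runs can skip tuning. The cache location, key format, and JSON payload are all assumptions.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "sparta_kernels"  # assumed location

def _cache_key(op_name: str, shapes, sparsity_config) -> str:
    # Stable hash over the operator name, tensor shapes, and sparsity config.
    payload = json.dumps([op_name, shapes, sparsity_config], sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()

def load_best_config(op_name, shapes, sparsity_config):
    path = CACHE_DIR / f"{_cache_key(op_name, shapes, sparsity_config)}.json"
    return json.loads(path.read_text()) if path.exists() else None

def save_best_config(op_name, shapes, sparsity_config, best_config):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{_cache_key(op_name, shapes, sparsity_config)}.json"
    path.write_text(json.dumps(best_config))
```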