@gmlwns2000 gmlwns2000 commented Oct 1, 2025

Intro

Add Delta Sparse Attention.

Example Fault Tolerance Command

while true; do BSA_K=32 \
BSA_EXACT_K=32 \
BSA_BLOCK_K=64 \
HIP_DEBUG_DELTA_QSA=1 \
HIP_DEBUG_RECOMPUTE_SPLIT=0 \
TRITON_PRINT_AUTOTUNING=1 \
SRT_WARMUP_ALL_SEQ_LENS=0 \
HIP_DEBUG_FA3_MIXING_LEN=0 \
PASSKEY_DECODE_LEN=128 \
PASSKEY_LEN=150 \
SA_BLOCK_SIZE=128 \
SA_DECODE_BLOCK_SIZE=128 \
HIP_DISABLE_AUTOTUNE=0 \
HIP_DEBUG=0 \
HIP_DEBUG_BENCH=0 \
HIP_DEBUG_CAPTURE_DECORATOR=1 \
CUDA_LAUNCH_BLOCKING=0 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
$(which python) -m sglang.launch_server \
--host 0.0.0.0 \
--port 8000 \
--model-path Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
--kv-cache-dtype auto \
--ep-size 8 \
--tp-size 8 \
--chunked-prefill-size 65536 \
--max-prefill-tokens 65536 \
--cuda-graph-bs 1 2 4 8 16 24 32 48 64 96 128 160 192 256 \
--context-length 256000 \
--max-total-tokens 256000 \
--attention-backend hip_attention \
--hip-attention-config ./configs/mixed_landmark_0814_no_extend_qsa.json \
--hip-attention-config-override-json '{"__seq_thresh_fa3": 65536}' \
--json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}, "max_position_embeddings": 262144}' \
--max-running-requests 64 \
--trust-remote-code \
--tool-call-parser qwen25 \
--dist-timeout 10; done;
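The `while true; do ...; done` wrapper above is the whole fault-tolerance mechanism: whenever the server process crashes, the loop relaunches it. The same pattern can be sketched in Python; `supervise`, `max_restarts`, and `backoff_s` are illustrative names, not sglang options.

```python
import subprocess
import time

def supervise(cmd, max_restarts=None, backoff_s=1.0):
    """Relaunch cmd whenever it exits non-zero, like the shell
    `while true; do ...; done` wrapper above.

    Returns the number of launches performed. With max_restarts=None
    the loop runs until the command exits cleanly."""
    launches = 0
    while max_restarts is None or launches <= max_restarts:
        launches += 1
        rc = subprocess.call(cmd)   # block until the server process exits
        if rc == 0:
            break                   # clean shutdown: stop restarting
        time.sleep(backoff_s)       # brief pause before relaunching a crashed server
    return launches
```

A clean exit stops the loop, so a deliberate shutdown is not endlessly restarted, unlike the bare shell loop.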

gmlwns2000 and others added 30 commits August 11, 2025 07:51
- this allows for tracking high scoring blocks when calculating
  query sparse attention

- we may then use this information in a later block sparse
  attention kernel.
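The commit above records per-block scores during query sparse attention so that a later block-sparse attention kernel can restrict itself to the highest-scoring KV blocks. A minimal sketch of that selection step (`select_topk_blocks` is a hypothetical helper, not the kernel API):

```python
def select_topk_blocks(block_scores, k):
    """Return the indices of the k highest-scoring KV blocks, in
    ascending block order.

    Illustrative only: the QSA pass would record one score per KV
    block, and a later block-sparse attention kernel would attend
    only to the blocks selected here."""
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    return sorted(order[:k])  # ascending order keeps block access contiguous
```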
delta_w was in the mask stride, but the current code expects queries to be
pruned for delta before going to the QSA kernel.
added a NotImplementedError for setting the bsa_block_size_q kwarg > 1;
this should be implemented later if possible.
implementation for both the minheap and plain QSA with block indices seems
broken here.
Triton autotune causes it to fail. Investigating the autotuning bug further.
more cleanup on code and tests is needed once the final form of the
function is decided
winner update tree works here and is shown to outperform linear online
top-k for k>128.

Code still needs verification, fixing and cleanup
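The winner-update-tree idea can be illustrated with a minimal Python sketch (the actual implementation is a Triton kernel; the class below is hypothetical). A min-tournament tree over k slots lets each candidate be checked against the current minimum in O(1), and a replacement only replays the O(log k) path to the root, versus the O(k) rescan of a linear online top-k.

```python
class WinnerTree:
    """Min-winner tournament tree over k slots (k a power of two)."""

    def __init__(self, k, fill=float("-inf")):
        assert k > 0 and (k & (k - 1)) == 0, "k must be a power of two"
        self.k = k
        self.vals = [fill] * k        # current top-k candidate values
        # tree[i] (1 <= i < k) holds the leaf index of the minimum in its subtree
        self.tree = [0] * k
        for i in range(k - 1, 0, -1):
            lw = self._child_winner(2 * i)
            rw = self._child_winner(2 * i + 1)
            self.tree[i] = lw if self.vals[lw] <= self.vals[rw] else rw

    def _child_winner(self, c):
        # positions k..2k-1 are leaves; smaller positions are internal nodes
        return c - self.k if c >= self.k else self.tree[c]

    def min(self):
        return self.vals[self.tree[1]]  # tree[1] is the overall minimum's leaf

    def push(self, v):
        """Keep v if it beats the current minimum; replay matches to the root."""
        leaf = self.tree[1]
        if v <= self.vals[leaf]:
            return                      # O(1) reject on the common path
        self.vals[leaf] = v
        i = (leaf + self.k) // 2        # parent of the replaced leaf
        while i >= 1:                   # O(log k) path update
            lw = self._child_winner(2 * i)
            rw = self._child_winner(2 * i + 1)
            self.tree[i] = lw if self.vals[lw] <= self.vals[rw] else rw
            i //= 2
```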
the operation now computes the global min at the end of the operation so
that we can do a smaller compare on the next loop to check for top-k min
and avoid unnecessary computation
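The "global min at the end of the operation" optimization can be sketched as follows (`online_topk` is an illustrative helper, not the kernel code): the retained set's minimum is cached so the common path is a single compare, and the O(k) rescan for the new minimum runs only after an actual replacement.

```python
def online_topk(scores, k):
    """Linear online top-k with a cached minimum of the retained set.

    Assumes len(scores) >= k. Each later candidate is first compared
    against the cached minimum; the O(k) min rescan happens only when
    a candidate actually displaces it."""
    top = list(scores[:k])
    cur_min = min(top)
    min_idx = top.index(cur_min)
    for s in scores[k:]:
        if s > cur_min:              # cheap compare on the common path
            top[min_idx] = s         # displace the old minimum
            cur_min = min(top)       # recompute global min only on replacement
            min_idx = top.index(cur_min)
    return top
```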
kbumsik commented Oct 3, 2025

Fault tolerance implemented by a while loop 😆

@kbumsik kbumsik left a comment

GOD

@gmlwns2000 gmlwns2000 merged commit 30eb0d1 into deepauto/dev Oct 3, 2025
1 check passed
@gmlwns2000 gmlwns2000 deleted the research/delta-qsa branch October 3, 2025 16:28
