@gmlwns2000 gmlwns2000 commented Oct 1, 2025

Intro

Add Delta Sparse Attention.

Example Fault Tolerance Command

while true; do BSA_K=32 \
BSA_EXACT_K=32 \
BSA_BLOCK_K=64 \
HIP_DEBUG_DELTA_QSA=1 \
HIP_DEBUG_RECOMPUTE_SPLIT=0 \
TRITON_PRINT_AUTOTUNING=1 \
SRT_WARMUP_ALL_SEQ_LENS=0 \
HIP_DEBUG_FA3_MIXING_LEN=0 \
PASSKEY_DECODE_LEN=128 \
PASSKEY_LEN=150 \
SA_BLOCK_SIZE=128 \
SA_DECODE_BLOCK_SIZE=128 \
HIP_DISABLE_AUTOTUNE=0 \
HIP_DEBUG=0 \
HIP_DEBUG_BENCH=0 \
HIP_DEBUG_CAPTURE_DECORATOR=1 \
CUDA_LAUNCH_BLOCKING=0 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
$(which python) -m sglang.launch_server \
--host 0.0.0.0 \
--port 8000 \
--model-path Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
--kv-cache-dtype auto \
--ep-size 8 \
--tp-size 8 \
--chunked-prefill-size 65536 \
--max-prefill-tokens 65536 \
--cuda-graph-bs 1 2 4 8 16 24 32 48 64 96 128 160 192 256 \
--context-length 256000 \
--max-total-tokens 256000 \
--attention-backend hip_attention \
--hip-attention-config ./configs/mixed_landmark_0814_no_extend_qsa.json \
--hip-attention-config-override-json '{"__seq_thresh_fa3": 65536}' \
--json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}, "max_position_embeddings": 262144}' \
--max-running-requests 64 \
--trust-remote-code \
--tool-call-parser qwen25 \
--dist-timeout 10; done;
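The `while true; do ...; done` wrapper above is the whole fault-tolerance mechanism: whenever the server process crashes, the loop relaunches it. The same pattern can be sketched in Python; `supervise`, `max_restarts`, and `backoff_s` are illustrative names, not sglang options.

```python
import subprocess
import time

def supervise(cmd, max_restarts=None, backoff_s=1.0):
    """Relaunch cmd whenever it exits non-zero, like the shell
    `while true; do ...; done` wrapper above.

    Returns the number of launches performed. With max_restarts=None
    the loop runs until the command exits cleanly."""
    launches = 0
    while max_restarts is None or launches <= max_restarts:
        launches += 1
        rc = subprocess.call(cmd)   # block until the server process exits
        if rc == 0:
            break                   # clean shutdown: stop restarting
        time.sleep(backoff_s)       # brief pause before relaunching a crashed server
    return launches
```

A clean exit stops the loop, so a deliberate shutdown is not endlessly restarted, unlike the bare shell loop.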

gmlwns2000 and others added 30 commits August 11, 2025 07:51
- this allows for tracking high scoring blocks when calculating
  query sparse attention

- we may then use this information in a later block sparse
  attention kernel.
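The commit above records per-block scores during query sparse attention so that a later block-sparse attention kernel can restrict itself to the highest-scoring KV blocks. A minimal sketch of that selection step (`select_topk_blocks` is a hypothetical helper, not the kernel API):

```python
def select_topk_blocks(block_scores, k):
    """Return the indices of the k highest-scoring KV blocks, in
    ascending block order.

    Illustrative only: the QSA pass would record one score per KV
    block, and a later block-sparse attention kernel would attend
    only to the blocks selected here."""
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    return sorted(order[:k])  # ascending order keeps block access contiguous
```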
delta_w was in the mask stride, but the current code expects queries to be
pruned for delta before going to the QSA kernel.
added a NotImplementedError for setting the bsa_block_size_q kwarg > 1;
this should be implemented later if possible.
implementation for both the minheap and plain QSA with block indices seems
broken here.
Triton autotune causes it to fail. Investigating the autotuning bug further.
more cleanup on code and tests is needed once the final form of the
function is decided
winner update tree works here and is shown to outperform linear online
top-k for k>128.

Code still needs verification, fixing and cleanup
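The winner-update-tree idea can be illustrated with a minimal Python sketch (the actual implementation is a Triton kernel; the class below is hypothetical). A min-tournament tree over k slots lets each candidate be checked against the current minimum in O(1), and a replacement only replays the O(log k) path to the root, versus the O(k) rescan of a linear online top-k.

```python
class WinnerTree:
    """Min-winner tournament tree over k slots (k a power of two)."""

    def __init__(self, k, fill=float("-inf")):
        assert k > 0 and (k & (k - 1)) == 0, "k must be a power of two"
        self.k = k
        self.vals = [fill] * k        # current top-k candidate values
        # tree[i] (1 <= i < k) holds the leaf index of the minimum in its subtree
        self.tree = [0] * k
        for i in range(k - 1, 0, -1):
            lw = self._child_winner(2 * i)
            rw = self._child_winner(2 * i + 1)
            self.tree[i] = lw if self.vals[lw] <= self.vals[rw] else rw

    def _child_winner(self, c):
        # positions k..2k-1 are leaves; smaller positions are internal nodes
        return c - self.k if c >= self.k else self.tree[c]

    def min(self):
        return self.vals[self.tree[1]]  # tree[1] is the overall minimum's leaf

    def push(self, v):
        """Keep v if it beats the current minimum; replay matches to the root."""
        leaf = self.tree[1]
        if v <= self.vals[leaf]:
            return                      # O(1) reject on the common path
        self.vals[leaf] = v
        i = (leaf + self.k) // 2        # parent of the replaced leaf
        while i >= 1:                   # O(log k) path update
            lw = self._child_winner(2 * i)
            rw = self._child_winner(2 * i + 1)
            self.tree[i] = lw if self.vals[lw] <= self.vals[rw] else rw
            i //= 2
```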
the operation now computes the global min at the end of the operation so
that we can do a smaller compare on the next loop to check for top-k min
and avoid unnecessary computation
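The "global min at the end of the operation" optimization can be sketched as follows (`online_topk` is an illustrative helper, not the kernel code): the retained set's minimum is cached so the common path is a single compare, and the O(k) rescan for the new minimum runs only after an actual replacement.

```python
def online_topk(scores, k):
    """Linear online top-k with a cached minimum of the retained set.

    Assumes len(scores) >= k. Each later candidate is first compared
    against the cached minimum; the O(k) min rescan happens only when
    a candidate actually displaces it."""
    top = list(scores[:k])
    cur_min = min(top)
    min_idx = top.index(cur_min)
    for s in scores[k:]:
        if s > cur_min:              # cheap compare on the common path
            top[min_idx] = s         # displace the old minimum
            cur_min = min(top)       # recompute global min only on replacement
            min_idx = top.index(cur_min)
    return top
```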
kbumsik commented Oct 3, 2025

Fault tolerance implemented by a while loop 😆

@kbumsik kbumsik left a comment

GOD

@gmlwns2000 gmlwns2000 merged commit 30eb0d1 into deepauto/dev Oct 3, 2025
1 check passed
@gmlwns2000 gmlwns2000 deleted the research/delta-qsa branch October 3, 2025 16:28
