Add attention token visualization and manifold exploration #1
…sgl-project#16698) Co-authored-by: Muqi Li <muqi1029@gmail.com>
…roject#16637) Co-authored-by: Ziang Li <ziangli@humansand.ai>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: hxie <haiyong.xie@ustc.edu>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
…kV3.2's NSAIndexer (sgl-project#15227)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
…sgl-project#16737) Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: DavisTao <dwt614707404@163.com> Co-authored-by: niehen6174 <nihen6174@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
…ion (sgl-project#16667) Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com>
Format logits_processor.py, fp8_utils.py, radix_cache.py, memory_pool.py, common.py, server_args.py, and others. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove unused imports from Python files to pass lint checks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Run black-jupyter and isort on script files
- Make scripts with shebangs executable
- Fix trailing newlines in JSON files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Make example scripts executable
- Fix codespell typo in scenarios.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore blank lines between third-party and first-party imports in MOE files to match the expected formatting. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add blank line between third-party (flashinfer) and first-party (sglang) imports to match expected isort formatting. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mark all Python scripts with shebangs as executable to pass pre-commit check-shebang-scripts-are-executable hook. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Make chat template auto-detect from model name (qwen3, llama3, chatml, raw)
- Batch tokenization API calls using Promise.all for better performance
- Implement word-level aggregation toggle with edge aggregation
- Note: --attention-chunk-size server arg already exists (2048 default)

Fixes:
- Hardcoded Qwen3 template now auto-detects model family
- Per-token detokenize calls now batched in groups of 50
- Word mode toggle now functional with proper edge aggregation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Measures:
- Memory reduction vs LRU baseline
- Quality preservation (text similarity)
- Latency impact of spectral computation

Usage:
  python scripts/benchmark_spectral_eviction.py --port 30000
  python scripts/benchmark_spectral_eviction.py --port 30001 --baseline --output baseline.json
  python scripts/benchmark_spectral_eviction.py --port 30000 --compare baseline.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…plorer

Features:
- Layer-by-layer filtering: Filter attention by layer with dropdown and heatmap modal
- Token importance scoring: New module for smart context truncation based on attention patterns
- Streaming fingerprint monitor: Real-time dashboard for monitoring attention fingerprints via ZMQ

New files:
- python/sglang/srt/mem_cache/token_importance.py: TokenImportanceScorer and SmartTruncator
- examples/attention_explorer/fingerprint_monitor.py: Dashboard, JSONL, Prometheus modes
- examples/attention_explorer/token_importance_demo.py: Demo script with synthetic data option

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix import ordering in fingerprint_monitor.py (isort)
- Remove unused imports in token_importance_demo.py (ruff)
- Remove unused Union import in token_importance.py (ruff)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements AttentionGuidedRetriever that uses attention fingerprints to:
- Identify query tokens that need retrieval (semantic_bridge zone)
- Score chunks by fingerprint similarity and target quality
- Make smart retrieval decisions based on a retrieval need score

Key features:
- Query analysis: Detects retrieval anchors from attention patterns
- Chunk scoring: Combines fingerprint similarity + target quality
- Retrieval decision: Skip retrieval for self-contained queries

Files:
- attention_guided_rag.py: Core AttentionGuidedRetriever class
- attention_rag_demo.py: Demo script with synthetic and live modes
- test_attention_guided_rag.py: 19 unit tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
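The scoring and decision logic described in this commit can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the 0.6/0.4 blend weight, the 0.3 threshold, and both function names are assumptions.

```python
import math

def score_chunk(query_fp: list, chunk_fp: list,
                target_quality: float, alpha: float = 0.6) -> float:
    """Illustrative chunk score: cosine similarity between fingerprint
    vectors blended with a chunk-quality prior. The alpha weighting and
    all names here are assumptions, not the PR's actual values."""
    dot = sum(a * b for a, b in zip(query_fp, chunk_fp))
    na = math.sqrt(sum(a * a for a in query_fp))
    nb = math.sqrt(sum(b * b for b in chunk_fp))
    sim = dot / (na * nb) if na > 0 and nb > 0 else 0.0
    return alpha * sim + (1 - alpha) * target_quality

def should_retrieve(semantic_bridge_ratio: float,
                    threshold: float = 0.3) -> bool:
    """Skip retrieval for self-contained queries: retrieve only when
    enough query tokens fall in the semantic_bridge zone. The threshold
    is illustrative."""
    return semantic_bridge_ratio >= threshold
```

A query whose tokens mostly sit in the syntax floor would score below the threshold and skip retrieval entirely.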
Web-based UI that demonstrates attention fingerprint-based RAG:
- Query input with sample queries for quick testing
- Retrieval need meter showing semantic_bridge ratio
- Token classification with zone highlighting (semantic_bridge, syntax_floor, etc.)
- Document ranking with visual score breakdown bars
- Fingerprint visualization for selected chunks
- Works in synthetic mode or connects to a live SGLang server

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tions
- Add tooltip system with hover explanations for all technical terms
- Add Glossary & Help modal in both explorer.html and rag_explorer.html
- Document attention zone derivation from the softmax(QK^T/√d) operation
- Explain how RoPE creates distance decay for local attention patterns
- Document semantic attention heads for mid-range (semantic bridge) patterns
- Explain attention sinks and global heads for long-range/steering patterns
- Add fingerprint computation pipeline documentation (binning, mass, entropy)
- Document causal attention mask for Attends To vs Attended By directions
- Add Playwright-based UI test suite (17 tests, 100% pass rate)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
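The fingerprint pipeline mentioned above (binning attended distances, per-bin mass, entropy) can be sketched as a small function. The bin edges follow the local ≤ 7 / mid ≤ 255 thresholds fixed in a later commit; the function name and output fields are illustrative, not SGLang's actual API.

```python
import math

def attention_fingerprint(query_pos: int, attended_positions: list,
                          weights: list) -> dict:
    """Hedged sketch of the fingerprint computation: bin attention mass
    by distance band and compute the entropy of the distribution.
    Names and output schema are assumptions."""
    total = sum(weights) or 1.0
    local = mid = long_ = 0.0
    for pos, w in zip(attended_positions, weights):
        d = query_pos - pos  # causal mask => d >= 0
        if d <= 7:           # local band (aligned with histogram bins)
            local += w
        elif d <= 255:       # mid band
            mid += w
        else:                # long-range / steering band
            long_ += w
    # Shannon entropy of the normalized attention distribution
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "local_mass": local / total,
        "mid_mass": mid / total,
        "long_mass": long_ / total,
        "entropy": entropy,
    }
```

A token attending only to its immediate neighbors yields high local mass and low entropy; a "semantic bridge" token spreads mass into the mid band with higher entropy.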
Documentation now explains concepts at the most basic level:
- What is a Token? (word pieces the AI reads)
- What is Attention? (where the AI "looks")
- What is Temperature? (creativity dial 0-2)
- What is Top-K? ("show me the top K things")
- What is an Edge? (connection line between tokens)
- What is a Layer? (processing floor in the AI factory)
- What is RAG? (giving AI a "cheat sheet" of documents)
Added "Ask the AI to Explain" sections with example prompts:
- Users can copy attention patterns and ask the LLM to interpret
- Provides template prompts for understanding results
- Teaches users to use the tool iteratively with the AI
All tooltips rewritten for complete beginners:
- No jargon or assumed knowledge
- Real-world analogies (factory floors, creativity dial)
- Examples with each explanation
- Color coding explained in plain terms
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add visual ASCII diagrams showing the 32-layer transformer architecture
- Explain how layers progressively build understanding:
  - Layers 1-4: Tokenization & basic patterns (Syntax Floor)
  - Layers 5-12: Phrase-level understanding
  - Layers 13-24: Semantic understanding (Semantic Bridge)
  - Layers 25-32: Abstract reasoning
- Add Query/Key/Value attention flow diagram
- Add RAG pipeline visual flow diagram
- Add ui_demo_runner.py for capturing UI screenshots with Playwright

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Tests every button, input, slider, and modal
- Tests responsive layouts (desktop/laptop/tablet/mobile)
- Tests keyboard navigation and accessibility
- Stress tests (rapid clicks, typing, modal spam)
- Extended query tests with 40+ different inputs
- Generates an HTML report with a screenshot gallery
- Runs in cycles until time expires

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add attention_ws_server.py: WebSocket bridge to the SGLang API
- Add explorer_stream.html: Real-time streaming UI with live token display
- Fix explorer.html: Parse both fingerprints AND edges (not either/or)

The WebSocket server streams tokens as they're generated for live visualization. The UI shows token-by-token generation with a ~65 tok/s throughput display.

Usage:
  python attention_ws_server.py --sglang-url http://localhost:8000 --port 8765
  Open explorer_stream.html in a browser

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix WebSocket server to use non-streaming API for full attention data
- Add zone distribution histogram with real-time updates
- Add enhanced token tooltips (zone, entropy, top attention %)
- Add zone badges in attention list items
- Create UI integration plan for progressive feature rollout

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Get only the first text node from token elements to avoid including tooltip HTML content in the "Attends To" panel display. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace word-based heuristic with character-based distribution to properly show token content in streaming UI. This eliminates [tok_N] placeholder tokens that appeared when token count exceeded word count. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Preserve word boundaries when distributing text across tokens
- Only split words when there are more tokens than words
- Replace placeholder markers with a middle dot for empty tokens
- Much more readable token display in streaming UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test was looking for attention_tokens in meta_info, but the API returns it directly on choices[0]. Updated all three test functions to check both locations for compatibility. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
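The dual-location lookup this fix describes can be captured in a small helper. The key names come from the commit message; the helper name itself is hypothetical.

```python
def get_attention_tokens(choice: dict):
    """Read attention tokens from either location the API has used:
    directly on choices[0], or nested under meta_info. Returns None
    if neither location is present."""
    if "attention_tokens" in choice:
        return choice["attention_tokens"]
    return choice.get("meta_info", {}).get("attention_tokens")
```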
- Add timeline slider with prev/next/start/end navigation buttons - Add arrow key navigation (←/→) and Home/End shortcuts - Implement word click to zoom and highlight in 3D graph - Scale selected tokens 1.5x with bright cyan edge highlighting - Add thick 3D tube edges (TubeGeometry) for better visibility - Add 🔥 Top Attention Words clickable list in sidebar - Fix text spacing using full_content from WebSocket server - Add camera zoom limits (15-200 units) to prevent invisibility - Add comprehensive Playwright E2E tests (21 tests passing) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the KV cache is offloaded to CPU RAM (common for 1M+ token contexts), Triton kernels cannot access CPU memory and would segfault.

Changes:
- Add device assertions in compute_topk_attention_chunked() and compute_topk_attention_tokens() as safety nets
- Add smart CPU→GPU transfer in _compute_attention_token_info():
  - Detects CPU-resident k_buffer
  - Transfers only unique referenced KV entries (not the full buffer)
  - Logs a warning with the transfer size for debugging
  - Syncs before kernel launch

This enables attention visualization on setups with CPU-offloaded KV cache (e.g., 96GB VRAM + 1TB RAM) without crashes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
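The "transfer only unique referenced entries" step can be sketched with PyTorch's `torch.unique`. This is a hedged illustration of the idea, not SGLang's `_compute_attention_token_info`; the helper name and signature are assumptions.

```python
import torch

def gather_kv_for_kernel(k_buffer: torch.Tensor, kv_indices: torch.Tensor,
                         device: str = "cuda"):
    """Hypothetical helper sketching the smart CPU->GPU transfer:
    rather than copying the entire (possibly huge) CPU-resident
    k_buffer, move only the unique KV entries the top-k kernel will
    actually read, and remap indices into the compact copy."""
    if k_buffer.device.type != "cpu":
        return k_buffer, kv_indices  # already kernel-accessible
    # Deduplicate referenced entries; `remapped` indexes into the
    # compact copy so compact[remapped] == k_buffer[kv_indices].
    unique_idx, remapped = torch.unique(kv_indices, return_inverse=True)
    compact = k_buffer[unique_idx].to(device, non_blocking=True)
    if device != "cpu":
        torch.cuda.synchronize()  # ensure copy lands before kernel launch
    return compact, remapped.to(device)
```

For a 1M-token context where a query references only a few thousand distinct KV slots, this moves kilobytes-to-megabytes instead of the whole buffer.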
- Add Playwright-based demo recording script (record_3d_demo.js)
- Include MP4 and WebM video recordings of the 3D attention tree visualization
- Add 12 screenshots showing different views and features
- Add README with instructions and screenshot gallery

Features demonstrated:
- Real-time 3D tree building during token generation
- Camera views: Tree, Top, Side, Front
- Token navigation with attention highlighting
- Edge toggling for attention flow visualization
- Manifold zone coloring (semantic_bridge, syntax_floor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Reduce attention sink threshold from 5 to 1 for better zone visibility
- Enable sink edge display
- Use a code generation prompt for more diverse attention patterns
- Fix fingerprint local/mid/long thresholds to align with histogram bins:
  - Local: distance <= 7 (was 16)
  - Mid: distance <= 255 (was 256)
- Update demo video and screenshots with new recording

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR adds comprehensive attention token visualization and manifold exploration capabilities to SGLang, enabling real-time analysis of attention patterns during inference.
Key Features
Exploration Results: Qwen3 80B vs 4B
We conducted a 5-hour exploration comparing attention patterns between Qwen3-80B (MoE) and Qwen3-4B (Dense):
Key Finding: Smaller dense models rely ~20% more on structural attention patterns, while larger MoE models achieve more balanced semantic/structural attention through expert routing.
New Components
- examples/attention_explorer/ - Full-stack attention exploration application
  - ui/ - React frontend with manifold visualization
  - discovery/ - Manifold discovery pipeline with checkpointing
  - schemas/ - Database schemas for fingerprint storage
  - exploration_outputs/ - Reports and visualizations from model explorations

API Extensions
Server Arguments
- --return-attention-tokens: Enable attention token capture
- --attention-tokens-top-k: Number of top attended positions to capture (default: 32)

Test Plan
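As a minimal sketch of exercising the two flags above: the server flags are taken from this PR, but the request field names and endpoint shape below are assumptions for illustration, not SGLang's exact schema.

```python
# Launch (flags from this PR; model path is a placeholder):
#   python -m sglang.launch_server --model-path <model> \
#       --return-attention-tokens --attention-tokens-top-k 16
import json

def build_request(prompt: str, top_k: int = 16) -> str:
    """Build a generate-request payload asking for attention tokens.
    Field names here are illustrative assumptions."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 64},
        "return_attention_tokens": True,
        "attention_tokens_top_k": top_k,
    }
    return json.dumps(payload)
```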
Documentation
🤖 Generated with Claude Code