Add attention token visualization and manifold exploration #1
…sgl-project#16698) Co-authored-by: Muqi Li <muqi1029@gmail.com>
…roject#16637) Co-authored-by: Ziang Li <ziangli@humansand.ai>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: hxie <haiyong.xie@ustc.edu>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
…kV3.2's NSAIndexer (sgl-project#15227)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
…sgl-project#16737) Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: DavisTao <dwt614707404@163.com> Co-authored-by: niehen6174 <nihen6174@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
…ion (sgl-project#16667) Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com>
Format logits_processor.py, fp8_utils.py, radix_cache.py, memory_pool.py, common.py, server_args.py, and others. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove unused imports from Python files to pass lint checks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Run black-jupyter and isort on script files
- Make scripts with shebangs executable
- Fix trailing newlines in JSON files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Make example scripts executable
- Fix codespell typo in scenarios.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore blank lines between third-party and first-party imports in MOE files to match the expected formatting. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add blank line between third-party (flashinfer) and first-party (sglang) imports to match expected isort formatting. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mark all Python scripts with shebangs as executable to pass pre-commit check-shebang-scripts-are-executable hook. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Make chat template auto-detect from model name (qwen3, llama3, chatml, raw)
- Batch tokenization API calls using Promise.all for better performance
- Implement word-level aggregation toggle with edge aggregation
- Note: --attention-chunk-size server arg already exists (2048 default)

Fixes:
- Hardcoded Qwen3 template now auto-detects model family
- Per-token detokenize calls now batched in groups of 50
- Word mode toggle now functional with proper edge aggregation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Measures:
- Memory reduction vs LRU baseline
- Quality preservation (text similarity)
- Latency impact of spectral computation

Usage:
  python scripts/benchmark_spectral_eviction.py --port 30000
  python scripts/benchmark_spectral_eviction.py --port 30001 --baseline --output baseline.json
  python scripts/benchmark_spectral_eviction.py --port 30000 --compare baseline.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…plorer

Features:
- Layer-by-layer filtering: Filter attention by layer with dropdown and heatmap modal
- Token importance scoring: New module for smart context truncation based on attention patterns
- Streaming fingerprint monitor: Real-time dashboard for monitoring attention fingerprints via ZMQ

New files:
- python/sglang/srt/mem_cache/token_importance.py: TokenImportanceScorer and SmartTruncator
- examples/attention_explorer/fingerprint_monitor.py: Dashboard, JSONL, Prometheus modes
- examples/attention_explorer/token_importance_demo.py: Demo script with synthetic data option

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix import ordering in fingerprint_monitor.py (isort)
- Remove unused imports in token_importance_demo.py (ruff)
- Remove unused Union import in token_importance.py (ruff)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements AttentionGuidedRetriever that uses attention fingerprints to:
- Identify query tokens that need retrieval (semantic_bridge zone)
- Score chunks by fingerprint similarity and target quality
- Make smart retrieval decisions based on a retrieval need score

Key features:
- Query analysis: Detects retrieval anchors from attention patterns
- Chunk scoring: Combines fingerprint similarity + target quality
- Retrieval decision: Skip retrieval for self-contained queries

Files:
- attention_guided_rag.py: Core AttentionGuidedRetriever class
- attention_rag_demo.py: Demo script with synthetic and live modes
- test_attention_guided_rag.py: 19 unit tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
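The scoring and decision logic described in this commit can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the 0.6/0.4 blend weight, the 0.3 threshold, and both function names are assumptions.

```python
import math

def score_chunk(query_fp: list, chunk_fp: list,
                target_quality: float, alpha: float = 0.6) -> float:
    """Illustrative chunk score: cosine similarity between fingerprint
    vectors blended with a chunk-quality prior. The alpha weighting and
    all names here are assumptions, not the PR's actual values."""
    dot = sum(a * b for a, b in zip(query_fp, chunk_fp))
    na = math.sqrt(sum(a * a for a in query_fp))
    nb = math.sqrt(sum(b * b for b in chunk_fp))
    sim = dot / (na * nb) if na > 0 and nb > 0 else 0.0
    return alpha * sim + (1 - alpha) * target_quality

def should_retrieve(semantic_bridge_ratio: float,
                    threshold: float = 0.3) -> bool:
    """Skip retrieval for self-contained queries: retrieve only when
    enough query tokens fall in the semantic_bridge zone. The threshold
    is illustrative."""
    return semantic_bridge_ratio >= threshold
```

A query whose tokens mostly sit in the syntax floor would score below the threshold and skip retrieval entirely.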
Web-based UI that demonstrates attention fingerprint-based RAG:
- Query input with sample queries for quick testing
- Retrieval need meter showing semantic_bridge ratio
- Token classification with zone highlighting (semantic_bridge, syntax_floor, etc.)
- Document ranking with visual score breakdown bars
- Fingerprint visualization for selected chunks
- Works in synthetic mode or connects to a live SGLang server

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tions
- Add tooltip system with hover explanations for all technical terms
- Add Glossary & Help modal in both explorer.html and rag_explorer.html
- Document attention zone derivation from the softmax(QK^T/√d) operation
- Explain how RoPE creates distance decay for local attention patterns
- Document semantic attention heads for mid-range (semantic bridge) patterns
- Explain attention sinks and global heads for long-range/steering patterns
- Add fingerprint computation pipeline documentation (binning, mass, entropy)
- Document causal attention mask for Attends To vs Attended By directions
- Add Playwright-based UI test suite (17 tests, 100% pass rate)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
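The fingerprint pipeline mentioned above (binning attended distances, per-bin mass, entropy) can be sketched as a small function. The bin edges follow the local ≤ 7 / mid ≤ 255 thresholds fixed in a later commit; the function name and output fields are illustrative, not SGLang's actual API.

```python
import math

def attention_fingerprint(query_pos: int, attended_positions: list,
                          weights: list) -> dict:
    """Hedged sketch of the fingerprint computation: bin attention mass
    by distance band and compute the entropy of the distribution.
    Names and output schema are assumptions."""
    total = sum(weights) or 1.0
    local = mid = long_ = 0.0
    for pos, w in zip(attended_positions, weights):
        d = query_pos - pos  # causal mask => d >= 0
        if d <= 7:           # local band (aligned with histogram bins)
            local += w
        elif d <= 255:       # mid band
            mid += w
        else:                # long-range / steering band
            long_ += w
    # Shannon entropy of the normalized attention distribution
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "local_mass": local / total,
        "mid_mass": mid / total,
        "long_mass": long_ / total,
        "entropy": entropy,
    }
```

A token attending only to its immediate neighbors yields high local mass and low entropy; a "semantic bridge" token spreads mass into the mid band with higher entropy.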
Documentation now explains concepts at the most basic level:
- What is a Token? (word pieces the AI reads)
- What is Attention? (where the AI "looks")
- What is Temperature? (creativity dial 0-2)
- What is Top-K? ("show me the top K things")
- What is an Edge? (connection line between tokens)
- What is a Layer? (processing floor in the AI factory)
- What is RAG? (giving AI a "cheat sheet" of documents)
Added "Ask the AI to Explain" sections with example prompts:
- Users can copy attention patterns and ask the LLM to interpret
- Provides template prompts for understanding results
- Teaches users to use the tool iteratively with the AI
All tooltips rewritten for complete beginners:
- No jargon or assumed knowledge
- Real-world analogies (factory floors, creativity dial)
- Examples with each explanation
- Color coding explained in plain terms
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add visual ASCII diagrams showing the 32-layer transformer architecture
- Explain how layers progressively build understanding:
  - Layers 1-4: Tokenization & basic patterns (Syntax Floor)
  - Layers 5-12: Phrase-level understanding
  - Layers 13-24: Semantic understanding (Semantic Bridge)
  - Layers 25-32: Abstract reasoning
- Add Query/Key/Value attention flow diagram
- Add RAG pipeline visual flow diagram
- Add ui_demo_runner.py for capturing UI screenshots with Playwright

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Tests every button, input, slider, and modal
- Tests responsive layouts (desktop/laptop/tablet/mobile)
- Tests keyboard navigation and accessibility
- Stress tests (rapid clicks, typing, modal spam)
- Extended query tests with 40+ different inputs
- Generates an HTML report with a screenshot gallery
- Runs in cycles until time expires

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add attention_ws_server.py: WebSocket bridge to the SGLang API
- Add explorer_stream.html: Real-time streaming UI with live token display
- Fix explorer.html: Parse both fingerprints AND edges (not either/or)

The WebSocket server streams tokens as they're generated for live visualization. The UI shows token-by-token generation with a ~65 tok/s throughput display.

Usage:
  python attention_ws_server.py --sglang-url http://localhost:8000 --port 8765
  Open explorer_stream.html in a browser

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix WebSocket server to use non-streaming API for full attention data
- Add zone distribution histogram with real-time updates
- Add enhanced token tooltips (zone, entropy, top attention %)
- Add zone badges in attention list items
- Create UI integration plan for progressive feature rollout

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Get only the first text node from token elements to avoid including tooltip HTML content in the "Attends To" panel display. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace word-based heuristic with character-based distribution to properly show token content in streaming UI. This eliminates [tok_N] placeholder tokens that appeared when token count exceeded word count. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Preserve word boundaries when distributing text across tokens
- Only split words when there are more tokens than words
- Replace placeholder markers with a middle dot for empty tokens
- Much more readable token display in streaming UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test was looking for attention_tokens in meta_info, but the API returns it directly on choices[0]. Updated all three test functions to check both locations for compatibility. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
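The dual-location lookup this fix describes can be captured in a small helper. The key names come from the commit message; the helper name itself is hypothetical.

```python
def get_attention_tokens(choice: dict):
    """Read attention tokens from either location the API has used:
    directly on choices[0], or nested under meta_info. Returns None
    if neither location is present."""
    if "attention_tokens" in choice:
        return choice["attention_tokens"]
    return choice.get("meta_info", {}).get("attention_tokens")
```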
- Add timeline slider with prev/next/start/end navigation buttons - Add arrow key navigation (←/→) and Home/End shortcuts - Implement word click to zoom and highlight in 3D graph - Scale selected tokens 1.5x with bright cyan edge highlighting - Add thick 3D tube edges (TubeGeometry) for better visibility - Add 🔥 Top Attention Words clickable list in sidebar - Fix text spacing using full_content from WebSocket server - Add camera zoom limits (15-200 units) to prevent invisibility - Add comprehensive Playwright E2E tests (21 tests passing) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the KV cache is offloaded to CPU RAM (common for 1M+ token contexts), Triton kernels cannot access CPU memory and would segfault.

Changes:
- Add device assertions in compute_topk_attention_chunked() and compute_topk_attention_tokens() as safety nets
- Add smart CPU→GPU transfer in _compute_attention_token_info():
  - Detects CPU-resident k_buffer
  - Transfers only unique referenced KV entries (not the full buffer)
  - Logs a warning with the transfer size for debugging
  - Syncs before kernel launch

This enables attention visualization on setups with CPU-offloaded KV cache (e.g., 96GB VRAM + 1TB RAM) without crashes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
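The "transfer only unique referenced entries" step can be sketched with PyTorch's `torch.unique`. This is a hedged illustration of the idea, not SGLang's `_compute_attention_token_info`; the helper name and signature are assumptions.

```python
import torch

def gather_kv_for_kernel(k_buffer: torch.Tensor, kv_indices: torch.Tensor,
                         device: str = "cuda"):
    """Hypothetical helper sketching the smart CPU->GPU transfer:
    rather than copying the entire (possibly huge) CPU-resident
    k_buffer, move only the unique KV entries the top-k kernel will
    actually read, and remap indices into the compact copy."""
    if k_buffer.device.type != "cpu":
        return k_buffer, kv_indices  # already kernel-accessible
    # Deduplicate referenced entries; `remapped` indexes into the
    # compact copy so compact[remapped] == k_buffer[kv_indices].
    unique_idx, remapped = torch.unique(kv_indices, return_inverse=True)
    compact = k_buffer[unique_idx].to(device, non_blocking=True)
    if device != "cpu":
        torch.cuda.synchronize()  # ensure copy lands before kernel launch
    return compact, remapped.to(device)
```

For a 1M-token context where a query references only a few thousand distinct KV slots, this moves kilobytes-to-megabytes instead of the whole buffer.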
- Add Playwright-based demo recording script (record_3d_demo.js)
- Include MP4 and WebM video recordings of the 3D attention tree visualization
- Add 12 screenshots showing different views and features
- Add README with instructions and screenshot gallery

Features demonstrated:
- Real-time 3D tree building during token generation
- Camera views: Tree, Top, Side, Front
- Token navigation with attention highlighting
- Edge toggling for attention flow visualization
- Manifold zone coloring (semantic_bridge, syntax_floor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Reduce attention sink threshold from 5 to 1 for better zone visibility
- Enable sink edge display
- Use a code generation prompt for more diverse attention patterns
- Fix fingerprint local/mid/long thresholds to align with histogram bins:
  - Local: distance <= 7 (was 16)
  - Mid: distance <= 255 (was 256)
- Update demo video and screenshots with new recording

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR adds comprehensive attention token visualization and manifold exploration capabilities to SGLang, enabling real-time analysis of attention patterns during inference.
Key Features
Exploration Results: Qwen3 80B vs 4B
We conducted a 5-hour exploration comparing attention patterns between Qwen3-80B (MoE) and Qwen3-4B (Dense):
Key Finding: Smaller dense models rely ~20% more on structural attention patterns, while larger MoE models achieve more balanced semantic/structural attention through expert routing.
New Components
- examples/attention_explorer/ - Full-stack attention exploration application
  - ui/ - React frontend with manifold visualization
  - discovery/ - Manifold discovery pipeline with checkpointing
  - schemas/ - Database schemas for fingerprint storage
  - exploration_outputs/ - Reports and visualizations from model explorations

API Extensions
Server Arguments
- --return-attention-tokens: Enable attention token capture
- --attention-tokens-top-k: Number of top attended positions to capture (default: 32)

Test Plan
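As a minimal sketch of exercising the two flags above: the server flags are taken from this PR, but the request field names and endpoint shape below are assumptions for illustration, not SGLang's exact schema.

```python
# Launch (flags from this PR; model path is a placeholder):
#   python -m sglang.launch_server --model-path <model> \
#       --return-attention-tokens --attention-tokens-top-k 16
import json

def build_request(prompt: str, top_k: int = 16) -> str:
    """Build a generate-request payload asking for attention tokens.
    Field names here are illustrative assumptions."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 64},
        "return_attention_tokens": True,
        "attention_tokens_top_k": top_k,
    }
    return json.dumps(payload)
```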
Documentation
🤖 Generated with Claude Code