
Add attention token visualization and manifold exploration #1

Open
architehc wants to merge 526 commits into main from feature/attention-token-visualization

Conversation

@architehc
Owner

Summary

This PR adds comprehensive attention token visualization and manifold exploration capabilities to SGLang, enabling real-time analysis of attention patterns during inference.

Key Features

  • Top-K Attention Capture: Server-side capture of top-k attended positions per token
  • Attention Fingerprinting: 20-dimensional fingerprint vectors summarizing attention patterns
  • Manifold Discovery: UMAP/HDBSCAN clustering for discovering attention pattern zones
  • Real-time Visualization: React UI for exploring attention patterns and manifold structure
  • Zone Classification: Automatic classification into syntax_floor, semantic_bridge, structure_ripple zones
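The fingerprint and zone ideas above can be made concrete with a small sketch. This is an illustration only: the 17-bin distance histogram plus three summary values, the bin edges, and the band cutoffs below are assumptions layered on the description elsewhere in this PR (distance binning, band mass, entropy), not the actual discovery pipeline.

```python
import math

# Illustrative sketch of a 20-dimensional attention fingerprint:
# summarize one token's top-k attention as a distance histogram
# plus local/mid band masses and entropy. Bin edges and layout
# here are assumptions, not the PR's exact pipeline.
def fingerprint(token_pos, topk):  # topk: [(attended_pos, weight), ...]
    total = sum(w for _, w in topk) or 1.0
    dists = [(token_pos - p, w / total) for p, w in topk]
    # Power-of-two distance bins: d=1, [2,4), [4,8), ... capped at bin 16.
    hist = [0.0] * 17
    for d, w in dists:
        hist[min(max(d, 1).bit_length() - 1, 16)] += w
    local = sum(w for d, w in dists if d <= 7)       # syntax-floor band
    mid = sum(w for d, w in dists if 7 < d <= 255)   # semantic-bridge band
    entropy = -sum(w * math.log(w) for _, w in dists if w > 0)
    return hist + [local, mid, entropy]              # 20 dimensions

print(len(fingerprint(100, [(99, 0.5), (50, 0.3), (0, 0.2)])))  # 20
```

Vectors of this shape are what the UMAP/HDBSCAN stage would cluster into zones.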

Exploration Results: Qwen3 80B vs 4B

We conducted a 5-hour exploration comparing attention patterns between Qwen3-80B (MoE) and Qwen3-4B (Dense):

Zone               80B (MoE)   4B (Dense)   Difference (4B - 80B)
structure_ripple       64.0%       84.3%                  +20.3%
semantic_bridge        35.3%       15.5%                  -19.8%
syntax_floor            0.7%        0.1%                   -0.6%

Key Finding: The smaller dense model relies roughly 20 percentage points more on structural attention patterns, while the larger MoE model achieves more balanced semantic/structural attention through expert routing.

Manifold Comparison

New Components

  • examples/attention_explorer/ - Full-stack attention exploration application
    • ui/ - React frontend with manifold visualization
    • discovery/ - Manifold discovery pipeline with checkpointing
    • schemas/ - Database schemas for fingerprint storage
    • exploration_outputs/ - Reports and visualizations from model explorations

API Extensions

```python
# Enable attention capture
response = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"return_attention_tokens": True},
)
```
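
Once capture is enabled, the attended positions come back with the completion. Per the test fix later in this PR, the attention_tokens payload appears directly on choices[0], with meta_info as a fallback location; the payload shape shown below is illustrative, not the exact schema:

```python
# Sketch: pull captured attention tokens out of a chat completion
# response dict. Per this PR's tests, the payload may appear directly
# on choices[0] or under meta_info, so check both locations.
def extract_attention_tokens(response: dict):
    choice = response["choices"][0]
    if "attention_tokens" in choice:
        return choice["attention_tokens"]
    return choice.get("meta_info", {}).get("attention_tokens")

# Illustrative payload shape (assumed): one entry per generated token,
# listing top-k attended positions and their weights.
demo = {
    "choices": [
        {"attention_tokens": [{"positions": [0, 3], "weights": [0.6, 0.2]}]}
    ]
}
print(extract_attention_tokens(demo))
```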

Server Arguments

  • --return-attention-tokens: Enable attention token capture
  • --attention-tokens-top-k: Number of top attended positions to capture (default: 32)

Test Plan

  • Unit tests for attention modules
  • Integration tests for fingerprinting pipeline
  • E2E tests for UI features
  • Manual testing with Qwen3-80B and Qwen3-4B models

Documentation

🤖 Generated with Claude Code

ruokee and others added 30 commits January 10, 2026 11:38
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
…sgl-project#16737)

Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: DavisTao <dwt614707404@163.com>
Co-authored-by: niehen6174 <nihen6174@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
…ion (sgl-project#16667)

Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
architehc and others added 30 commits January 12, 2026 20:55
Format logits_processor.py, fp8_utils.py, radix_cache.py,
memory_pool.py, common.py, server_args.py, and others.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove unused imports from Python files to pass lint checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Run black-jupyter and isort on script files
- Make scripts with shebangs executable
- Fix trailing newlines in JSON files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Make example scripts executable
- Fix codespell typo in scenarios.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore blank lines between third-party and first-party imports
in MOE files to match the expected formatting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add blank line between third-party (flashinfer) and first-party (sglang)
imports to match expected isort formatting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mark all Python scripts with shebangs as executable to pass
pre-commit check-shebang-scripts-are-executable hook.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Make chat template auto-detect from model name (qwen3, llama3, chatml, raw)
- Batch tokenization API calls using Promise.all for better performance
- Implement word-level aggregation toggle with edge aggregation
- Note: --attention-chunk-size server arg already exists (2048 default)

Fixes:
- Hardcoded Qwen3 template now auto-detects model family
- Per-token detokenize calls now batched in groups of 50
- Word mode toggle now functional with proper edge aggregation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Measures:
- Memory reduction vs LRU baseline
- Quality preservation (text similarity)
- Latency impact of spectral computation

Usage:
  python scripts/benchmark_spectral_eviction.py --port 30000
  python scripts/benchmark_spectral_eviction.py --port 30001 --baseline --output baseline.json
  python scripts/benchmark_spectral_eviction.py --port 30000 --compare baseline.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…plorer

Features:
- Layer-by-layer filtering: Filter attention by layer with dropdown and heatmap modal
- Token importance scoring: New module for smart context truncation based on attention patterns
- Streaming fingerprint monitor: Real-time dashboard for monitoring attention fingerprints via ZMQ

New files:
- python/sglang/srt/mem_cache/token_importance.py: TokenImportanceScorer and SmartTruncator
- examples/attention_explorer/fingerprint_monitor.py: Dashboard, JSONL, Prometheus modes
- examples/attention_explorer/token_importance_demo.py: Demo script with synthetic data option

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
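
The token-importance idea above (score context positions by the attention they receive, then drop the lowest scorers when truncating) can be sketched as follows; the function names and the order-preserving truncation policy are assumptions for illustration, not the actual TokenImportanceScorer/SmartTruncator API:

```python
# Sketch: rank context positions by total attention paid to them by
# later tokens, then keep the highest-scoring positions in original
# order when truncating.
def importance_scores(num_ctx, attention_steps):
    # attention_steps: per generated token, [(attended_pos, weight), ...]
    scores = [0.0] * num_ctx
    for step in attention_steps:
        for pos, w in step:
            if pos < num_ctx:
                scores[pos] += w
    return scores

def smart_truncate(tokens, scores, keep):
    # Keep the `keep` highest-scoring tokens, preserving original order.
    keep_idx = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:keep])
    return [tokens[i] for i in keep_idx]

scores = importance_scores(4, [[(0, 0.7), (2, 0.3)], [(2, 0.9), (1, 0.1)]])
print(smart_truncate(["a", "b", "c", "d"], scores, 2))  # ['a', 'c']
```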
- Fix import ordering in fingerprint_monitor.py (isort)
- Remove unused imports in token_importance_demo.py (ruff)
- Remove unused Union import in token_importance.py (ruff)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements AttentionGuidedRetriever that uses attention fingerprints to:
- Identify query tokens that need retrieval (semantic_bridge zone)
- Score chunks by fingerprint similarity and target quality
- Make smart retrieval decisions based on retrieval need score

Key features:
- Query analysis: Detects retrieval anchors from attention patterns
- Chunk scoring: Combines fingerprint similarity + target quality
- Retrieval decision: Skip retrieval for self-contained queries

Files:
- attention_guided_rag.py: Core AttentionGuidedRetriever class
- attention_rag_demo.py: Demo script with synthetic and live modes
- test_attention_guided_rag.py: 19 unit tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
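
The retrieval decision described above can be sketched under assumed scoring rules: retrieval need as the fraction of query tokens classified into the semantic_bridge zone, and chunks ranked by cosine similarity of fingerprints. The threshold and names below are illustrative, not the AttentionGuidedRetriever API:

```python
import math

# Sketch: decide whether to retrieve at all, then rank candidate
# chunks by fingerprint similarity. Threshold is an assumption.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def should_retrieve(query_zones, threshold=0.3):
    # Skip retrieval for self-contained queries (few semantic_bridge tokens).
    need = query_zones.count("semantic_bridge") / max(len(query_zones), 1)
    return need >= threshold

def rank_chunks(query_fp, chunk_fps):
    return sorted(range(len(chunk_fps)), key=lambda i: -cosine(query_fp, chunk_fps[i]))

zones = ["syntax_floor", "semantic_bridge", "semantic_bridge", "structure_ripple"]
print(should_retrieve(zones))  # True (need = 0.5 >= 0.3)
print(rank_chunks([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1]]))  # [1, 0]
```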
Web-based UI that demonstrates attention fingerprint-based RAG:
- Query input with sample queries for quick testing
- Retrieval need meter showing semantic_bridge ratio
- Token classification with zone highlighting (semantic_bridge, syntax_floor, etc.)
- Document ranking with visual score breakdown bars
- Fingerprint visualization for selected chunks
- Works in synthetic mode or connects to live SGLang server

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tions

- Add tooltips system with hover explanations for all technical terms
- Add Glossary & Help modal in both explorer.html and rag_explorer.html
- Document attention zone derivation from softmax(QK^T/√d) operation
- Explain how RoPE creates distance decay for local attention patterns
- Document semantic attention heads for mid-range (semantic bridge) patterns
- Explain attention sinks and global heads for long-range/steering patterns
- Add fingerprint computation pipeline documentation (binning, mass, entropy)
- Document causal attention mask for Attends To vs Attended By directions
- Add Playwright-based UI test suite (17 tests, 100% pass rate)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documentation now explains concepts at the most basic level:
- What is a Token? (word pieces the AI reads)
- What is Attention? (where the AI "looks")
- What is Temperature? (creativity dial 0-2)
- What is Top-K? ("show me the top K things")
- What is an Edge? (connection line between tokens)
- What is a Layer? (processing floor in the AI factory)
- What is RAG? (giving AI a "cheat sheet" of documents)

Added "Ask the AI to Explain" sections with example prompts:
- Users can copy attention patterns and ask the LLM to interpret
- Provides template prompts for understanding results
- Teaches users to use the tool iteratively with the AI

All tooltips rewritten for complete beginners:
- No jargon or assumed knowledge
- Real-world analogies (factory floors, creativity dial)
- Examples with each explanation
- Color coding explained in plain terms

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add visual ASCII diagrams showing 32-layer transformer architecture
- Explain how layers progressively build understanding:
  - Layer 1-4: Tokenization & basic patterns (Syntax Floor)
  - Layer 5-12: Phrase-level understanding
  - Layer 13-24: Semantic understanding (Semantic Bridge)
  - Layer 25-32: Abstract reasoning
- Add Query/Key/Value attention flow diagram
- Add RAG pipeline visual flow diagram
- Add ui_demo_runner.py for capturing UI screenshots with Playwright

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Tests every button, input, slider, and modal
- Tests responsive layouts (desktop/laptop/tablet/mobile)
- Tests keyboard navigation and accessibility
- Stress tests (rapid clicks, typing, modal spam)
- Extended query tests with 40+ different inputs
- Generates beautiful HTML report with screenshot gallery
- Runs in cycles until time expires

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add attention_ws_server.py: WebSocket bridge to SGLang API
- Add explorer_stream.html: Real-time streaming UI with live token display
- Fix explorer.html: Parse both fingerprints AND edges (not either/or)

WebSocket server streams tokens as they're generated for live visualization.
UI shows token-by-token generation with ~65 tok/s throughput display.

Usage:
  python attention_ws_server.py --sglang-url http://localhost:8000 --port 8765
  Open explorer_stream.html in browser

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix WebSocket server to use non-streaming API for full attention data
- Add zone distribution histogram with real-time updates
- Add enhanced token tooltips (zone, entropy, top attention %)
- Add zone badges in attention list items
- Create UI integration plan for progressive feature rollout

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Get only the first text node from token elements to avoid including
tooltip HTML content in the "Attends To" panel display.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace word-based heuristic with character-based distribution to
properly show token content in streaming UI. This eliminates [tok_N]
placeholder tokens that appeared when token count exceeded word count.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Preserve word boundaries when distributing text across tokens
- Only split words when there are more tokens than words
- Replace placeholder markers with middle dot for empty tokens
- Much more readable token display in streaming UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
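
The word-boundary heuristic above can be sketched like this; it mirrors the streaming-UI display fallback (whole-word slots, middle-dot padding for empty slots), with the grouping details assumed:

```python
# Sketch: spread display text across N token slots without breaking
# words when there are at least as many words as slots; pad empty
# slots with a middle dot. Display heuristic only, not detokenization.
def distribute_text(text, n_tokens):
    words = text.split()
    if len(words) >= n_tokens:
        # Group whole words into n_tokens roughly equal slots.
        per, rem = divmod(len(words), n_tokens)
        out, i = [], 0
        for slot in range(n_tokens):
            take = per + (1 if slot < rem else 0)
            out.append(" ".join(words[i:i + take]))
            i += take
        return out
    # More slots than words: pad the tail with a middle dot.
    return words + ["·"] * (n_tokens - len(words))

print(distribute_text("the quick brown fox", 2))  # ['the quick', 'brown fox']
print(distribute_text("hi", 3))  # ['hi', '·', '·']
```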
The test was looking for attention_tokens in meta_info, but the API
returns it directly on choices[0]. Updated all three test functions
to check both locations for compatibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add timeline slider with prev/next/start/end navigation buttons
- Add arrow key navigation (←/→) and Home/End shortcuts
- Implement word click to zoom and highlight in 3D graph
- Scale selected tokens 1.5x with bright cyan edge highlighting
- Add thick 3D tube edges (TubeGeometry) for better visibility
- Add 🔥 Top Attention Words clickable list in sidebar
- Fix text spacing using full_content from WebSocket server
- Add camera zoom limits (15-200 units) to prevent invisibility
- Add comprehensive Playwright E2E tests (21 tests passing)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When KV cache is offloaded to CPU RAM (common for 1M+ token contexts),
Triton kernels cannot access CPU memory and would segfault.

Changes:
- Add device assertions in compute_topk_attention_chunked() and
  compute_topk_attention_tokens() as safety nets
- Add smart CPU→GPU transfer in _compute_attention_token_info():
  - Detects CPU-resident k_buffer
  - Transfers only unique referenced KV entries (not full buffer)
  - Logs warning with transfer size for debugging
  - Syncs before kernel launch

This enables attention visualization on setups with CPU-offloaded
KV cache (e.g., 96GB VRAM + 1TB RAM) without crashes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
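
The "transfer only unique referenced KV entries" step can be illustrated in plain Python; the real code operates on torch tensors in _compute_attention_token_info() before a Triton kernel launch, so the names and list-based gather below are illustrative only:

```python
# Sketch: instead of copying the whole CPU-resident KV buffer to GPU,
# gather only the unique rows the kernel will reference and remap the
# reference indices into that small slice.
def gather_unique(kv_rows, ref_indices):
    uniq = sorted(set(ref_indices))
    remap = {pos: i for i, pos in enumerate(uniq)}
    small = [kv_rows[pos] for pos in uniq]          # the slice to transfer
    new_refs = [remap[pos] for pos in ref_indices]  # indices into the slice
    return small, new_refs

kv = [[0.0], [1.0], [2.0], [3.0]]
small, refs = gather_unique(kv, [3, 1, 3])
print(small, refs)  # [[1.0], [3.0]] [1, 0, 1]
```

The remapped indices select the same rows from the slice that the originals selected from the full buffer, so the kernel sees identical data at a fraction of the transfer cost.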
- Add Playwright-based demo recording script (record_3d_demo.js)
- Include MP4 and WebM video recordings of 3D attention tree visualization
- Add 12 screenshots showing different views and features
- Add README with instructions and screenshot gallery

Features demonstrated:
- Real-time 3D tree building during token generation
- Camera views: Tree, Top, Side, Front
- Token navigation with attention highlighting
- Edge toggling for attention flow visualization
- Manifold zone coloring (semantic_bridge, syntax_floor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Reduce attention sink threshold from 5 to 1 for better zone visibility
- Enable sink edge display
- Use code generation prompt for more diverse attention patterns
- Fix fingerprint local/mid/long thresholds to align with histogram bins:
  - Local: distance <= 7 (was 16)
  - Mid: distance <= 255 (was 256)
- Update demo video and screenshots with new recording

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
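
The corrected thresholds give a simple banding rule, sketched below; mapping bands onto the named zones is loose, and the function name is illustrative:

```python
# Sketch of the distance banding after the threshold fix: local covers
# distances <= 7, mid covers 8..255, anything farther is long-range.
def distance_band(d):
    if d <= 7:
        return "local"
    if d <= 255:
        return "mid"
    return "long"

print([distance_band(d) for d in (1, 7, 8, 255, 256)])
# ['local', 'local', 'mid', 'mid', 'long']
```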
