feat: Implement proper sparse attention in API server by AlexCuadron · Pull Request #2 · AlexCuadron/DoubleSparse

AlexCuadron · 2025-02-22T19:49:35Z

This PR updates the API server to properly implement sparse attention from the core DoubleSparse implementation.

Changes

Core Changes

Replace basic model with proper sparse attention implementation from perplexity_eval.py
Add support for different model architectures (LLaMA, Mistral)
Add channel configuration support
Implement proper token streaming with past key/value caching

Configuration

Add MODEL_ARCHITECTURE setting for model type selection
Add CHANNEL setting for channel selection (q, k, qk)
Improve configuration documentation

Implementation Details

Uses the same sparse attention mechanism as in perplexity_eval.py
Properly handles channel configuration
Memory-efficient implementation with proper caching
Improved error handling for model-specific issues

- Add detailed client guide with examples in multiple languages - Add performance considerations and best practices - Add detailed configuration documentation - Add error handling documentation

- Replace basic model with proper sparse attention implementation - Add support for different model architectures (LLaMA, Mistral) - Add channel configuration support - Update API to be fully OpenAI-compatible - Add proper token streaming implementation - Add configuration for sparse attention parameters

- Add Qwen2 sparse attention implementation - Update configuration to support Qwen2 - Update documentation with Qwen2 support - Improve architecture selection documentation

- Create single server script that matches perplexity_eval.py usage - Automatic architecture detection from model config - Same command-line arguments as perplexity_eval.py - Remove need for manual architecture configuration

AlexCuadron added 4 commits February 22, 2025 19:44

docs: Add comprehensive client guide and update server documentation

4a6184e

- Add detailed client guide with examples in multiple languages - Add performance considerations and best practices - Add detailed configuration documentation - Add error handling documentation

feat: Add Qwen2 architecture support

1a615ae

- Add Qwen2 sparse attention implementation - Update configuration to support Qwen2 - Update documentation with Qwen2 support - Improve architecture selection documentation

refactor: Simplify server to match perplexity_eval.py

0d1db06

- Create single server script that matches perplexity_eval.py usage - Automatic architecture detection from model config - Same command-line arguments as perplexity_eval.py - Remove need for manual architecture configuration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement proper sparse attention in API server#2

feat: Implement proper sparse attention in API server#2
AlexCuadron wants to merge 4 commits intomainfrom
feature/sparse-api-server-v2

AlexCuadron commented Feb 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AlexCuadron commented Feb 22, 2025

Changes

Core Changes

Configuration

Implementation Details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant