SGLang Cookbook Community Contribution Roadmap
Reference: DeepSeek-V3 Cookbook
Maintainers: We have a Claude Code skill (`.claude/skills/add-model/SKILL.md`) that automates most of this workflow, from scaffolding docs, config generators, and YAML configs to sidebar updates. Run `/add-model` in Claude Code to use it.
1. Model Introduction
- Overview: Brief description of model purpose and capabilities
- Variants: List versions with specific use cases
- Key Features: Unique capabilities (reasoning, tool calling, multimodal)
- Links: HuggingFace model page and official documentation
2. Installation
Refer to the official installation guide.
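For reference, a typical pip-based install looks like the following. This is a sketch, not the authoritative command; check the official installation guide for the current extras and CUDA requirements:

```shell
# Illustrative install; consult the official installation guide for
# the currently supported Python/CUDA versions and extras.
pip install --upgrade pip
pip install "sglang[all]"
```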
3. Deployment
Basic Configuration
sglang serve \
--model-path [model-path] \
--tp [tensor-parallel-size]
Optimization Tips
- Parallelism: Recommended TP/DP settings for different GPU counts
- Memory: KV cache, quantization (`--quantization fp8`)
- Performance: Attention backends, speculative decoding (draft models available at SpecBundle)
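As a sketch, a tuned launch might combine several of these flags. The values below are illustrative placeholders for an assumed 8-GPU node, not verified recommendations for any specific model:

```shell
# Illustrative only: tensor parallelism across 8 GPUs with FP8 quantization.
# [model-path] and all flag values are placeholders, not tested settings.
sglang serve \
  --model-path [model-path] \
  --tp 8 \
  --quantization fp8 \
  --mem-fraction-static 0.9
```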
4. API Usage
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="[model-path]",
messages=[{"role": "user", "content": "Your question"}],
temperature=0.7,
max_tokens=2048,
)
print(response.choices[0].message.content)
Document model-specific features: reasoning mode, tool calling, multimodal, streaming.
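For streaming, the same OpenAI client yields incremental chunks when `stream=True`. A minimal sketch of accumulating the streamed text (the helper name `collect_stream` is mine, not part of any API; the commented lines show how it would pair with a live server):

```python
from types import SimpleNamespace  # used only to build stand-in chunks for testing


def collect_stream(chunks) -> str:
    """Join the incremental text deltas of a chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # the final chunk may carry no content
            parts.append(delta.content)
    return "".join(parts)


# Against a live server you would iterate the streaming response directly:
# stream = client.chat.completions.create(
#     model="[model-path]",
#     messages=[{"role": "user", "content": "Your question"}],
#     stream=True,
# )
# print(collect_stream(stream))
```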
5. Benchmarks
Environment
| Item | Value |
|---|---|
| Hardware | [GPU Type] × [Number] |
| Model | [Name/Variant] |
| Tensor Parallelism | [TP Size] |
| SGLang Version | [Version] |
Test Scenarios
| Scenario | Input | Output | Use Case |
|---|---|---|---|
| Chat | 1K | 1K | Conversational AI |
| Reasoning | 1K | 8K | Long-form generation |
| Summarization | 8K | 1K | Document Q&A |
Concurrency Levels
| Level | Concurrency | Goal |
|---|---|---|
| Low | 1 | Best latency |
| Medium | 16 | Balanced |
| High | 64–100 | Max throughput |
Benchmark Commands
# Launch Server
sglang serve \
--model-path [model-path] \
--tp [tp-size]
For LLM (text-only models)
Use --dataset-name random:
# Chat (1K/1K) - Priority
python3 -m sglang.bench_serving \
--backend sglang \
--model [model-path] \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts [10|80|500] \
--max-concurrency [1|16|100] \
--request-rate inf
# Reasoning (1K/8K): Change to --random-output-len 8000, --max-concurrency [1|16|64]
# Summarization (8K/1K): Change to --random-input-len 8000
For MLLM (multimodal/vision-language models)
Use --dataset-name random-image --random-image-count 1:
python3 -m sglang.bench_serving \
--backend sglang \
--model [model-path] \
--dataset-name random-image \
--random-image-count 1 \
--random-input-len 128 \
--random-output-len 1024 \
--num-prompts [10|80|500] \
--max-concurrency [1|16|100]
Check the Bench Serving Guide for more details.
Key Metrics
| Metric | Description |
|---|---|
| Request Throughput (req/s) | Requests completed per second |
| Output Throughput (tok/s) | Output tokens generated per second |
| TTFT (ms) | Time to First Token |
| TPOT (ms) | Time Per Output Token |
| ITL (ms) | Inter-Token Latency |
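These latency metrics relate simply: for a request with time-to-first-token TTFT, end-to-end latency E2E, and N output tokens, TPOT is roughly (E2E − TTFT) / (N − 1), and ITL is the gap between consecutive token arrivals. A hedged sketch (function names are mine, not SGLang's; `bench_serving` computes its own versions of these):

```python
def tpot_ms(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time Per Output Token: decode time spread over the tokens after the first."""
    assert output_tokens > 1, "TPOT is undefined for a single output token"
    return (e2e_ms - ttft_ms) / (output_tokens - 1)


def itl_ms(token_timestamps_ms: list[float]) -> list[float]:
    """Inter-Token Latency: gaps between consecutive token arrival times."""
    return [b - a for a, b in zip(token_timestamps_ms, token_timestamps_ms[1:])]
```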
Accuracy Benchmarks
Add accuracy benchmarks. Some integrated benchmarks can be found here.
Contribution Checklist
- Follow template structure
- Include all three scenarios × three concurrency levels
- Document hardware specifications
- Link to official resources
- Verify all commands work
Resources: SGLang Docs | SGLang GitHub | Cookbook Repo