SGLang Cookbook Community Contribution Roadmap #16

@Richardczl98

Description


Reference: DeepSeek-V3 Cookbook

Maintainers: We have a Claude Code skill (.claude/skills/add-model/SKILL.md) that automates most of this workflow: scaffolding docs, config generators, and YAML configs, through to sidebar updates. Run /add-model in Claude Code to use it.


1. Model Introduction

  • Overview: Brief description of model purpose and capabilities
  • Variants: List versions with specific use cases
  • Key Features: Unique capabilities (reasoning, tool calling, multimodal)
  • Links: HuggingFace model page and official documentation

2. Installation

Refer to the official installation guide.

3. Deployment

Basic Configuration

sglang serve \
  --model-path [model-path] \
  --tp [tensor-parallel-size]

Optimization Tips

  • Parallelism: Recommended TP/DP settings for different GPU counts
  • Memory: KV cache, quantization (--quantization fp8)
  • Performance: Attention backends, speculative decoding (draft models available at SpecBundle)
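Once the server is launched, it is worth confirming it is ready before sending traffic. A minimal sketch, assuming the default port 8000 and the OpenAI-compatible /v1/models endpoint used in the API examples (this helper is illustrative, not part of SGLang):

```python
# Hypothetical readiness check: poll the OpenAI-compatible /v1/models
# endpoint until the server answers (it may take minutes to load weights).
import json
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 600.0) -> dict:
    """Block until the server lists its models, or raise TimeoutError."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/v1/models",
                                        timeout=5) as resp:
                return json.load(resp)  # e.g. {"object": "list", "data": [...]}
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # server still starting up; retry
    raise TimeoutError(f"no response from {base_url} within {timeout_s}s")
```
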

4. API Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="[model-path]",
    messages=[{"role": "user", "content": "Your question"}],
    temperature=0.7,
    max_tokens=2048,
)
print(response.choices[0].message.content)

Document model-specific features: reasoning mode, tool calling, multimodal, streaming.

5. Benchmarks

Environment

| Item | Value |
| --- | --- |
| Hardware | [GPU Type] × [Number] |
| Model | [Name/Variant] |
| Tensor Parallelism | [TP Size] |
| SGLang Version | [Version] |

Test Scenarios

| Scenario | Input | Output | Use Case |
| --- | --- | --- | --- |
| Chat | 1K | 1K | Conversational AI |
| Reasoning | 1K | 8K | Long-form generation |
| Summarization | 8K | 1K | Document Q&A |

Concurrency Levels

| Level | Concurrency | Goal |
| --- | --- | --- |
| Low | 1 | Best latency |
| Medium | 16 | Balanced |
| High | 64–100 | Max throughput |

Benchmark Commands

# Launch Server
sglang serve \
  --model-path [model-path] \
  --tp [tp-size]

For LLM (text-only models)

Use --dataset-name random:

# Chat (1K/1K) - Priority
python3 -m sglang.bench_serving \
  --backend sglang \
  --model [model-path] \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts [10|80|500] \
  --max-concurrency [1|16|100] \
  --request-rate inf

# Reasoning (1K/8K): Change to --random-output-len 8000, --max-concurrency [1|16|64]
# Summarization (8K/1K): Change to --random-input-len 8000
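The scenario and concurrency tables above can be expanded mechanically. A hypothetical helper (not part of SGLang) that emits one bench_serving invocation per scenario × concurrency cell:

```python
# Expand the scenario and concurrency tables into one
# bench_serving command line per combination.
SCENARIOS = {  # name: (input_len, output_len)
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}
LEVELS = {1: 10, 16: 80, 100: 500}  # max_concurrency -> num_prompts

def bench_commands(model_path: str) -> list[str]:
    """Return the nine benchmark invocations (for reasoning, the
    high-concurrency level is usually lowered to 64)."""
    cmds = []
    for scenario, (in_len, out_len) in SCENARIOS.items():
        for concurrency, num_prompts in LEVELS.items():
            cmds.append(
                "python3 -m sglang.bench_serving --backend sglang "
                f"--model {model_path} --dataset-name random "
                f"--random-input-len {in_len} --random-output-len {out_len} "
                f"--num-prompts {num_prompts} "
                f"--max-concurrency {concurrency} --request-rate inf"
            )
    return cmds
```
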

For MLLM (multimodal/vision-language models)

Use --dataset-name random-image --random-image-count 1:

python3 -m sglang.bench_serving \
  --backend sglang \
  --model [model-path] \
  --dataset-name random-image \
  --random-image-count 1 \
  --random-input-len 128 \
  --random-output-len 1024 \
  --num-prompts [10|80|500] \
  --max-concurrency [1|16|100]

Check Bench Serving Guide for more details.

Key Metrics

| Metric | Description |
| --- | --- |
| Request Throughput (req/s) | Requests completed per second |
| Output Throughput (tok/s) | Output tokens generated per second |
| TTFT (ms) | Time to First Token |
| TPOT (ms) | Time Per Output Token |
| ITL (ms) | Inter-Token Latency |
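The latency metrics relate to each other in a simple way. A sketch computing them from per-token arrival timestamps (how timestamps are captured is up to the client; the function name is ours):

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, mean TPOT, and ITL (all in ms) from the wall-clock
    time the request was sent and the arrival time of each output token."""
    ttft = (token_times[0] - request_start) * 1000.0
    # Mean TPOT averages the gaps after the first token:
    # (e2e latency - TTFT) / (output_tokens - 1)
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000.0
    # ITL is the individual gap between consecutive tokens.
    itl = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    return {"ttft_ms": ttft, "tpot_ms": tpot, "itl_ms": itl}
```

For example, tokens arriving at 0.1 s, 0.15 s, 0.2 s, and 0.25 s after a request at t = 0 give a TTFT of 100 ms and a mean TPOT of 50 ms.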

Accuracy Benchmarks

Add accuracy benchmarks. Some integrated benchmarks can be found here.

Contribution Checklist

  • Follow template structure
  • Include all three scenarios × three concurrency levels
  • Document hardware specifications
  • Link to official resources
  • Verify all commands work

Resources: SGLang Docs | SGLang GitHub | Cookbook Repo
