A multi-model registration and streaming inference platform (Gin + Go 1.22).
- model-registry: model registration, version management, routing metadata, and backend inference execution (gRPC streaming).
- model-inference: inference gateway; exposes HTTP + SSE externally and forwards requests to the registry internally.
- Internal communication: gRPC (streaming inference against the registry)
- External interface: HTTP + SSE (event: + data: frames)
- Go 1.22
- protoc, protoc-gen-go, protoc-gen-go-grpc (only needed when updating the proto files)
model-registry/
model-inference/
shared/
scripts/
Start the registry (defaults: gRPC :9090, HTTP :8081):
go run ./model-registry/cmd/server/main.go
Start the inference gateway (default HTTP :8080):
go run ./model-inference/cmd/server/main.go -registry-addr 127.0.0.1:9090
Backend address configuration (takes effect on the registry side):
go run ./model-registry/cmd/server/main.go \
-openai-base-url https://api.openai.com \
-openai-api-key $OPENAI_API_KEY \
-ollama-base-url http://127.0.0.1:11434 \
-qwen-base-url http://127.0.0.1:8000 \
-qwen-api-key $QWEN_API_KEY
To reach an Ollama instance running on Windows from inside WSL2, configure Ollama to listen on 0.0.0.0, obtain the Windows host IP (WIN_IP), and set -ollama-base-url to WIN_IP plus the corresponding port:
WIN_IP=$(ip route | awk '/default/ {print $3; exit}')
curl http://$WIN_IP:11434/api/tags
After starting the registry, open: http://localhost:8081/ui
- POST /models: register a model
- GET /models: list models (includes status, revision, active)
- PUT /models/{name}/versions/{version}: hot-update a model version
- DELETE /models/{name}/versions/{version}: delete a model version (only allowed when idle)
POST /infer (SSE), request body:
{
"model": "chat-bot",
"version": "v1",
"input": "Tell me a joke",
"hash_key": "user-123"
}
SSE format:
event: token
data: {"token":"Hello"}
event: done
data: {}
Register a mock model:
curl -X POST http://localhost:8081/models \
-H 'Content-Type: application/json' \
-d '{"name":"chat-bot","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'
Inference call (SSE):
curl -N http://localhost:8080/infer \
-H 'Content-Type: application/json' \
-d '{"model":"chat-bot","version":"v1","input":"hello world"}'
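The hash_key field in the request body can be used for sticky routing, so repeated requests from one key land on the same backend replica. A minimal sketch assuming FNV-1a hashing (pickReplica is illustrative, not the project's actual routing code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickReplica maps a hash_key to a stable replica index: the same key
// always hashes to the same replica, giving session stickiness.
func pickReplica(hashKey string, replicas int) int {
	h := fnv.New32a()
	h.Write([]byte(hashKey))
	return int(h.Sum32() % uint32(replicas))
}

func main() {
	// Repeated calls with the same key pick the same replica.
	fmt.Println(pickReplica("user-123", 3) == pickReplica("user-123", 3)) // true
}
```

Note this simple modulo scheme reshuffles most keys when the replica count changes; consistent hashing would avoid that.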
- scripts/test_integration.sh: registry inference + inference gateway integration test
- scripts/test_hot_update.sh: hot update (mock -> ollama), verifies that old streams are not interrupted
- scripts/stream_load.py: sustained streaming load test with 6 concurrent streams (3 per model)
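The hot-update guarantee (existing streams uninterrupted) can be implemented by swapping the active version behind an atomic pointer while in-flight streams keep the reference they captured at start. A simplified sketch, not the actual registry code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type modelVersion struct {
	Name    string
	Version string
}

// registry holds the active version behind an atomic pointer. A stream
// loads the pointer once when it starts, so a later swap never changes
// the version that an already-running stream is using.
type registry struct {
	active atomic.Pointer[modelVersion]
}

func (r *registry) swap(v *modelVersion)    { r.active.Store(v) }
func (r *registry) acquire() *modelVersion  { return r.active.Load() }

func main() {
	r := &registry{}
	r.swap(&modelVersion{Name: "chat-bot", Version: "v1"})

	inFlight := r.acquire() // a long-lived stream captures v1

	r.swap(&modelVersion{Name: "chat-bot", Version: "v2"}) // hot update

	fmt.Println(inFlight.Version)    // v1: the old stream is unaffected
	fmt.Println(r.acquire().Version) // v2: new requests see the new version
}
```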
Before running stream_load.py, register the models (concurrency cap 2, used to verify rejection above the limit):
curl -X POST http://localhost:8081/models \
-H 'Content-Type: application/json' \
-d '{"name":"chat-bot","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'
curl -X POST http://localhost:8081/models \
-H 'Content-Type: application/json' \
-d '{"name":"gemma3:4b","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'
Run the script:
python3 scripts/stream_load.py --models chat-bot,gemma3:4b --per-model 3
Notes:
- For each model, the first 2 threads should keep streaming continuously
- The 3rd thread should report capacity exceeded (verifying the concurrency limit)
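This per-model concurrency cap can be enforced with a channel-based semaphore that rejects rather than blocks when full; a minimal sketch with illustrative names:

```go
package main

import "fmt"

// limiter enforces max_concurrency with a buffered channel used as a
// counting semaphore; tryAcquire fails fast instead of blocking.
type limiter chan struct{}

func newLimiter(max int) limiter { return make(limiter, max) }

func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false // capacity exceeded
	}
}

func (l limiter) release() { <-l }

func main() {
	l := newLimiter(2) // max_concurrency: 2, as in the registrations above
	fmt.Println(l.tryAcquire()) // true  (stream 1)
	fmt.Println(l.tryAcquire()) // true  (stream 2)
	fmt.Println(l.tryAcquire()) // false (stream 3 rejected)
	l.release()                 // stream 1 finishes
	fmt.Println(l.tryAcquire()) // true  (slot freed)
}
```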
Start a vLLM OpenAI-compatible server as the qwen backend (matching the default -qwen-base-url port 8000):
python -m vllm.entrypoints.openai.api_server \
--model /path/to/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000
## Self Report
- Total time spent: __about 5__ hours
- Actual working periods: not one continuous block; see the commit history, since there were interviews and other matters in between
- Completion status:
- [x] Model registration / update / listing
- [x] Streaming inference endpoint
- [x] Hot update without affecting existing connections
- [x] Capacity management / concurrency support
- [x] Simple admin UI
- [ ] Multi-version traffic splitting
- [ ] Prometheus metrics
- [ ] Canary release
- Remarks:
- (e.g. how model state isolation is implemented / key design decisions / which parts feel unsatisfactory / planned next optimizations)