LLM Gateway Interview Assignment

A multi-model registration and streaming inference platform (Gin + Go 1.22).

Architecture

  • model-registry: model registration / version management / routing metadata / backend inference execution (gRPC streaming).
  • model-inference: the inference gateway; exposes HTTP + SSE externally and forwards requests to the registry internally.
  • Internal communication: gRPC (streaming inference against the registry).
  • External interface: HTTP + SSE (event: and data: lines).
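
As a quick orientation once both services are up (see Running below): the registry serves the control-plane HTTP API on :8081 and the gateway serves the streaming data plane on :8080. The model name here is whichever one you registered (see Quick examples):

# Control plane: registry HTTP on :8081
curl http://localhost:8081/models

# Data plane: inference gateway (HTTP + SSE) on :8080
curl -N http://localhost:8080/infer \
  -H 'Content-Type: application/json' \
  -d '{"model":"chat-bot","version":"v1","input":"hi"}'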

Dependencies

  • Go 1.22
  • protoc, protoc-gen-go, protoc-gen-go-grpc (only needed when regenerating the proto files)

Repository layout

model-registry/
model-inference/
shared/
scripts/

Running

Start the registry (defaults: gRPC :9090, HTTP :8081):

go run ./model-registry/cmd/server/main.go

Start the inference gateway (default HTTP :8080):

go run ./model-inference/cmd/server/main.go -registry-addr 127.0.0.1:9090

Backend address configuration (takes effect on the registry side):

go run ./model-registry/cmd/server/main.go \
  -openai-base-url https://api.openai.com \
  -openai-api-key $OPENAI_API_KEY \
  -ollama-base-url http://127.0.0.1:11434 \
  -qwen-base-url http://127.0.0.1:8000 \
  -qwen-api-key $QWEN_API_KEY

If you are running inside WSL2 and want to reach an Ollama instance on the Windows host, configure Ollama to listen on 0.0.0.0, look up the Windows host IP (WIN_IP), and point -ollama-base-url at WIN_IP with the corresponding port:

WIN_IP=$(ip route | awk '/default/ {print $3; exit}')
curl http://$WIN_IP:11434/api/tags
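
Putting it together, reusing the -ollama-base-url flag from the configuration above (the port assumes Ollama's default 11434):

go run ./model-registry/cmd/server/main.go \
  -ollama-base-url http://$WIN_IP:11434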

Admin panel

After starting the registry, open: http://localhost:8081/ui

API

Registry HTTP

  • POST /models: register a model
  • GET /models: list models (includes status, revision, active)
  • PUT /models/{name}/versions/{version}: hot-update a model version (see examples below)
  • DELETE /models/{name}/versions/{version}: delete a model version (only allowed when the version is idle)
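
Both operations can be exercised with curl. The PUT body here is an assumption that mirrors the registration payload (adjust it to the actual handler's schema):

# Hot-update chat-bot v1 from mock to the ollama backend (body format assumed)
curl -X PUT http://localhost:8081/models/chat-bot/versions/v1 \
  -H 'Content-Type: application/json' \
  -d '{"backend_type":"ollama","mock":false,"max_concurrency":2,"weight":100,"shadow":false}'

# Delete a version (only succeeds when the version is idle)
curl -X DELETE http://localhost:8081/models/chat-bot/versions/v1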

Inference HTTP

  • POST /infer (SSE), request body:
{
  "model": "chat-bot",
  "version": "v1",
  "input": "Tell me a joke",
  "hash_key": "user-123"
}

SSE format:

event: token
data: {"token":"Hello"}

event: done
data: {}
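
The stream can also be consumed from the command line by stripping the data: prefixes; a minimal sketch, assuming GNU sed (for -u) and jq are installed:

curl -sN http://localhost:8080/infer \
  -H 'Content-Type: application/json' \
  -d '{"model":"chat-bot","version":"v1","input":"hello"}' \
  | sed -un 's/^data: //p' \
  | jq -r '.token? // empty'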

Quick examples

Register a mock model:

curl -X POST http://localhost:8081/models \
  -H 'Content-Type: application/json' \
  -d '{"name":"chat-bot","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'

Inference call (SSE):

curl -N http://localhost:8080/infer \
  -H 'Content-Type: application/json' \
  -d '{"model":"chat-bot","version":"v1","input":"hello world"}'

Test scripts

  • scripts/test_integration.sh: registry inference + inference gateway end-to-end test
  • scripts/test_hot_update.sh: hot update (mock -> ollama); verifies that existing streams are not interrupted
  • scripts/stream_load.py: sustained streaming load test, 6 concurrent streams (3 per model); see below
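
The two shell scripts can be run directly, presumably with both services already running on the default ports from Running above:

bash scripts/test_integration.sh
bash scripts/test_hot_update.sh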

Before running stream_load.py, register the models first (max_concurrency of 2, used to verify that excess concurrency is rejected):

curl -X POST http://localhost:8081/models \
  -H 'Content-Type: application/json' \
  -d '{"name":"chat-bot","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'

curl -X POST http://localhost:8081/models \
  -H 'Content-Type: application/json' \
  -d '{"name":"gemma3:4b","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'

Run the script:

python3 scripts/stream_load.py --models chat-bot,gemma3:4b --per-model 3

Expected behavior:

  • For each model, the first 2 threads should keep streaming continuously
  • The 3rd thread should report capacity exceeded (verifying the concurrency limit)

vLLM (Qwen backend) example

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000
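
With vLLM serving on :8000, point the registry at it via the -qwen-* flags shown earlier and register a model against that backend. The backend_type value "qwen" and the model name below are assumptions inferred from those flag names (adjust to the actual enum):

curl -X POST http://localhost:8081/models \
  -H 'Content-Type: application/json' \
  -d '{"name":"qwen2.5-7b","version":"v1","backend_type":"qwen","mock":false,"max_concurrency":2,"weight":100,"shadow":false}'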

Self Report

- Total time spent: __about 5__ hours
- Actual working periods: not one continuous block (see the commit history); interrupted by other interviews and commitments
- Completion status:
- [x] Model registration / update / listing
- [x] Streaming inference API
- [x] Hot updates that do not affect existing connections
- [x] Capacity management / concurrency support
- [x] Simple admin UI
- [ ] Multi-version traffic splitting
- [ ] Prometheus metrics
- [ ] Canary releases
- Notes:
- (e.g. how model state isolation is implemented / key design decisions / parts of the logic you are not happy with / what to optimize next)
