A multi-model registration and streaming inference platform (Gin + Go 1.22).
- model-registry: model registration, version management, routing metadata, and backend inference execution (gRPC streaming).
- model-inference: inference gateway; exposes HTTP + SSE externally and forwards requests to the registry internally.
- Internal communication: gRPC (streaming inference against the registry)
- External interface: HTTP + SSE (event: + data: frames)
- Go 1.22
- protoc, protoc-gen-go, protoc-gen-go-grpc (only needed when updating the proto files)
model-registry/
model-inference/
shared/
scripts/
Start the registry (defaults: gRPC :9090, HTTP :8081):
go run ./model-registry/cmd/server/main.go
Start the inference gateway (default HTTP :8080):
go run ./model-inference/cmd/server/main.go -registry-addr 127.0.0.1:9090
Backend address configuration (takes effect on the registry side):
go run ./model-registry/cmd/server/main.go \
-openai-base-url https://api.openai.com \
-openai-api-key $OPENAI_API_KEY \
-ollama-base-url http://127.0.0.1:11434 \
-qwen-base-url http://127.0.0.1:8000 \
-qwen-api-key $QWEN_API_KEY
To reach an Ollama instance running on Windows from inside WSL2, configure Ollama to listen on 0.0.0.0, obtain the Windows host IP (WIN_IP), and set -ollama-base-url to WIN_IP plus the corresponding port:
WIN_IP=$(ip route | awk '/default/ {print $3; exit}')
curl http://$WIN_IP:11434/api/tags
After starting the registry, open: http://localhost:8081/ui
- POST /models: register a model
- GET /models: list models (includes status, revision, active)
- PUT /models/{name}/versions/{version}: hot-update a model version
- DELETE /models/{name}/versions/{version}: delete a model version (only allowed when idle)
POST /infer (SSE), request body:
{
"model": "chat-bot",
"version": "v1",
"input": "Tell me a joke",
"hash_key": "user-123"
}
SSE format:
event: token
data: {"token":"Hello"}
event: done
data: {}
Register a mock model:
curl -X POST http://localhost:8081/models \
-H 'Content-Type: application/json' \
-d '{"name":"chat-bot","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'
Inference call (SSE):
curl -N http://localhost:8080/infer \
-H 'Content-Type: application/json' \
-d '{"model":"chat-bot","version":"v1","input":"hello world"}'
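The hash_key field in the request body can be used for sticky routing, so repeated requests from one key land on the same backend replica. A minimal sketch assuming FNV-1a hashing (pickReplica is illustrative, not the project's actual routing code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickReplica maps a hash_key to a stable replica index: the same key
// always hashes to the same replica, giving session stickiness.
func pickReplica(hashKey string, replicas int) int {
	h := fnv.New32a()
	h.Write([]byte(hashKey))
	return int(h.Sum32() % uint32(replicas))
}

func main() {
	// Repeated calls with the same key pick the same replica.
	fmt.Println(pickReplica("user-123", 3) == pickReplica("user-123", 3)) // true
}
```

Note this simple modulo scheme reshuffles most keys when the replica count changes; consistent hashing would avoid that.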
- scripts/test_integration.sh: registry inference + inference gateway integration test
- scripts/test_hot_update.sh: hot update (mock -> ollama), verifies that old streams are not interrupted
- scripts/stream_load.py: sustained streaming load test with 6 concurrent streams (3 per model)
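The hot-update guarantee (existing streams uninterrupted) can be implemented by swapping the active version behind an atomic pointer while in-flight streams keep the reference they captured at start. A simplified sketch, not the actual registry code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type modelVersion struct {
	Name    string
	Version string
}

// registry holds the active version behind an atomic pointer. A stream
// loads the pointer once when it starts, so a later swap never changes
// the version that an already-running stream is using.
type registry struct {
	active atomic.Pointer[modelVersion]
}

func (r *registry) swap(v *modelVersion)    { r.active.Store(v) }
func (r *registry) acquire() *modelVersion  { return r.active.Load() }

func main() {
	r := &registry{}
	r.swap(&modelVersion{Name: "chat-bot", Version: "v1"})

	inFlight := r.acquire() // a long-lived stream captures v1

	r.swap(&modelVersion{Name: "chat-bot", Version: "v2"}) // hot update

	fmt.Println(inFlight.Version)    // v1: the old stream is unaffected
	fmt.Println(r.acquire().Version) // v2: new requests see the new version
}
```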
Before running stream_load.py, register the models (concurrency cap 2, used to verify rejection above the limit):
curl -X POST http://localhost:8081/models \
-H 'Content-Type: application/json' \
-d '{"name":"chat-bot","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'
curl -X POST http://localhost:8081/models \
-H 'Content-Type: application/json' \
-d '{"name":"gemma3:4b","version":"v1","backend_type":"mock","mock":true,"max_concurrency":2,"weight":100,"shadow":false}'
Run the script:
python3 scripts/stream_load.py --models chat-bot,gemma3:4b --per-model 3
Notes:
- For each model, the first 2 threads should keep streaming continuously
- The 3rd thread should report capacity exceeded (verifying the concurrency limit)
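This per-model concurrency cap can be enforced with a channel-based semaphore that rejects rather than blocks when full; a minimal sketch with illustrative names:

```go
package main

import "fmt"

// limiter enforces max_concurrency with a buffered channel used as a
// counting semaphore; tryAcquire fails fast instead of blocking.
type limiter chan struct{}

func newLimiter(max int) limiter { return make(limiter, max) }

func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false // capacity exceeded
	}
}

func (l limiter) release() { <-l }

func main() {
	l := newLimiter(2) // max_concurrency: 2, as in the registrations above
	fmt.Println(l.tryAcquire()) // true  (stream 1)
	fmt.Println(l.tryAcquire()) // true  (stream 2)
	fmt.Println(l.tryAcquire()) // false (stream 3 rejected)
	l.release()                 // stream 1 finishes
	fmt.Println(l.tryAcquire()) // true  (slot freed)
}
```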
Start a vLLM OpenAI-compatible server as the qwen backend (matching the default -qwen-base-url port 8000):
python -m vllm.entrypoints.openai.api_server \
--model /path/to/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000
## Self Report
- Total time spent: __about 5__ hours
- Actual working periods: not one continuous block; see the commit history, since there were interviews and other matters in between
- Completion status:
- [x] Model registration / update / listing
- [x] Streaming inference endpoint
- [x] Hot update without affecting existing connections
- [x] Capacity management / concurrency support
- [x] Simple admin UI
- [ ] Multi-version traffic splitting
- [ ] Prometheus metrics
- [ ] Canary release
- Remarks:
- (e.g. how model state isolation is implemented / key design decisions / which parts feel unsatisfactory / planned next optimizations)