Clearing file locks or cleaning up workers changes the API provider configuration from a custom URL to the official OpenAI URL, causing LLM requests to time out #507
Description
Bug Description
Clearing file locks or cleaning up workers changes the API provider configuration from a custom URL to the official OpenAI URL, which causes LLM requests to time out.
Steps to Reproduce
- Configure the API provider with Alibaba Cloud's coding-plan address.
- Ask the manager to clear the file locks or clean up the workers.
- The manager times out and stops responding.
- The API provider configuration has changed from the custom URL to the official OpenAI URL, so LLM requests time out.
AI Analysis
Root cause identified.
Key problem code
setup-higress.sh, line 262:
| .upstreams[0].provider = "'"${LLM_PROVIDER}"'" # Forcibly overwrites the provider!
Problem flow
- The user is on the openai-compat provider (custom URL).
- Clearing file locks or cleaning up workers triggers a Manager restart (the PID changes from 3209 to 6358).
- On startup, the Manager runs setup-higress.sh.
- Line 262 forces the provider to ${LLM_PROVIDER} (default: qwen).
- The AI route now points at the qwen provider, but the user has no qwen API key configured.
- All LLM requests time out or fail.
Log evidence
05:40:39 - session file locked (PID=3209) ← First lock problem
05:42:42 - session file locked (PID=6358) ← PID change = Manager restarted!
05:47:37 - HTTP 503 Error ← LLM service unavailable
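The restart inferred from these log lines can be checked mechanically. The following is a minimal sketch, assuming the lock messages carry a `PID=<number>` token as in the excerpt above; `count_lock_pids` and `sample_log` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the distinct PIDs that appear in
# "session file locked" lines read from stdin.
# Assumes each such line contains a token of the form PID=<digits>.
count_lock_pids() {
  grep 'session file locked' \
    | grep -oE 'PID=[0-9]+' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}

# Sample input copied from the log excerpt in this issue.
sample_log='05:40:39 - session file locked (PID=3209)
05:42:42 - session file locked (PID=6358)
05:47:37 - HTTP 503 Error'

distinct_pids=$(printf '%s\n' "$sample_log" | count_lock_pids)
if [ "$distinct_pids" -gt 1 ]; then
  echo "lock reported by ${distinct_pids} different PIDs: the process likely restarted"
fi
```

More than one distinct PID in the lock messages suggests the locking process was restarted between the two errors, which matches the timeline described above.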
Root cause
setup-higress.sh runs every time the Manager starts:
LLM_PROVIDER="${HICLAW_LLM_PROVIDER:-qwen}" # defaults to qwen
# Unconditionally overwrites the AI route's provider
patched=$(echo "${existing_route_resp}" | jq '
.data
| .upstreams[0].provider = "'"${LLM_PROVIDER}"'" # Here's the problem!
...
')
Even when the existing configuration is already correct, the provider is forced to the value of HICLAW_LLM_PROVIDER. If the user relies on a custom provider but has not set this environment variable, the configuration is overwritten with qwen.
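The fallback behavior behind this can be reproduced in isolation. This sketch only exercises the `${VAR:-default}` expansion quoted above; `resolve_provider` is a hypothetical wrapper and not part of setup-higress.sh.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the expansion used in the script:
# prints HICLAW_LLM_PROVIDER, or "qwen" when it is unset or empty.
resolve_provider() {
  printf '%s\n' "${HICLAW_LLM_PROVIDER:-qwen}"
}

unset HICLAW_LLM_PROVIDER
unset_result=$(resolve_provider)      # variable unset: falls back to qwen

HICLAW_LLM_PROVIDER=""
empty_result=$(resolve_provider)      # ":-" treats an empty value like unset

HICLAW_LLM_PROVIDER="openai-compat"
explicit_result=$(resolve_provider)   # an explicit value is kept as-is

echo "unset=${unset_result} empty=${empty_result} explicit=${explicit_result}"
# prints: unset=qwen empty=qwen explicit=openai-compat
```

So a user who selected a custom provider through some other path, without exporting HICLAW_LLM_PROVIDER, ends up with qwen after every restart.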
Fix
Keep the existing provider configuration instead of forcibly overwriting it:
# Fix: keep the existing provider; set it only when the route is first created
patched=$(echo "${existing_route_resp}" | jq --argjson domains "${AI_ROUTE_DOMAINS}" '
.data
| .domains = $domains
# Remove this line: | .upstreams[0].provider = "..."
| .headerControl.enabled = true
...
')
Alternatively, set the default value only when no provider is configured yet.
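The second option could look roughly like the sketch below. It assumes the route response has the `.data.upstreams[0].provider` shape implied by the snippet above; the sample JSON documents and the `set_default_provider` function are made up for illustration and are not taken from setup-higress.sh.

```shell
#!/usr/bin/env bash
# Sketch only: the JSON below is a made-up stand-in for the real route response.
LLM_PROVIDER="${HICLAW_LLM_PROVIDER:-qwen}"

# Hypothetical helper: reads a route response on stdin and fills in the
# provider only when none is configured.
set_default_provider() {
  # --arg passes the shell value in as a jq string, avoiding quote splicing.
  # "//=" assigns only when the current value is null or false (i.e. missing).
  jq --arg provider "$1" '.data | .upstreams[0].provider //= $provider'
}

existing='{"data":{"upstreams":[{"provider":"openai-compat"}]}}'
missing='{"data":{"upstreams":[{}]}}'

kept=$(printf '%s' "$existing" | set_default_provider "$LLM_PROVIDER" | jq -r '.upstreams[0].provider')
filled=$(printf '%s' "$missing" | set_default_provider "$LLM_PROVIDER" | jq -r '.upstreams[0].provider')

echo "existing route keeps: ${kept}"
echo "empty route gets:     ${filled}"
```

With this approach a route that already names a provider is left alone across restarts, while a freshly created route still receives the default. One trade-off is that changing HICLAW_LLM_PROVIDER later would no longer update an existing route, so an explicit override path may still be needed.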
Relevant Logs
Component
Manager Agent
Version/Commit
No response