Clearing file locks or cleaning up workers resets the API provider configuration from a custom URL to the official OpenAI URL, causing LLM requests to time out #507

@emersonli

Bug Description

Clearing file locks or cleaning up workers causes the API provider configuration to change from a custom URL to the official OpenAI URL, after which LLM requests time out.

Steps to Reproduce

  1. Configure the API provider with the Alibaba Cloud coding-plan address.
  2. Ask the manager to clear the file locks or clean up the workers.
  3. The manager times out and stops responding.
  4. The API provider configuration turns out to have changed from the custom URL to the official OpenAI URL, so LLM requests time out.

AI Analysis

The root cause has been identified.

Key problem code

setup-higress.sh, line 262:

| .upstreams[0].provider = "'"${LLM_PROVIDER}"'" # forcibly overwrites the provider!

Problem flow

  1. The user uses the openai-compat provider (custom URL).
  2. Clearing file locks or cleaning up workers triggers a Manager restart (the PID changes from 3209 to 6358).
  3. On startup, the Manager runs setup-higress.sh.
  4. Line 262 forcibly sets the provider to ${LLM_PROVIDER} (default: qwen).
  5. The AI route now points to the qwen provider, but the user has not configured a qwen API key.
  6. All LLM requests time out or fail.

Log evidence
05:40:39 - session file locked (PID=3209) ← first lock error
05:42:42 - session file locked (PID=6358) ← PID changed, so the Manager restarted
05:47:37 - HTTP 503 error ← LLM service unavailable

Root cause
setup-higress.sh runs every time the Manager starts. It first resolves the provider:

LLM_PROVIDER="${HICLAW_LLM_PROVIDER:-qwen}" # defaults to qwen

It then unconditionally overwrites the AI route's provider:

patched=$(echo "${existing_route_resp}" | jq '
.data
| .upstreams[0].provider = "'"${LLM_PROVIDER}"'" # this is the problem!
...
')

Even when the existing configuration is already correct, the provider is forced to the value of HICLAW_LLM_PROVIDER. If the user relies on a custom provider but has not set this environment variable, the configuration is overwritten with qwen.
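The fallback behaviour described above can be reproduced in isolation. The following is a minimal sketch, not code from setup-higress.sh: `resolve_provider` is a hypothetical helper that only mirrors the `${HICLAW_LLM_PROVIDER:-qwen}` expansion quoted in the analysis.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the assignment quoted above;
# it is not part of setup-higress.sh.
resolve_provider() {
    # ${VAR:-default} expands to the default when VAR is unset or empty.
    echo "${HICLAW_LLM_PROVIDER:-qwen}"
}

# Case 1: the variable is not set, as for a user who configured
# a custom provider some other way.
unset HICLAW_LLM_PROVIDER
unset_result="$(resolve_provider)"

# Case 2: the variable is set explicitly.
HICLAW_LLM_PROVIDER="openai-compat"
set_result="$(resolve_provider)"

echo "unset: ${unset_result}, set: ${set_result}"
# prints "unset: qwen, set: openai-compat"
```

If this reading of the script is right, a restart with the variable unset would write qwen into the route regardless of what was stored there before.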

Fix
Keep the existing provider configuration instead of forcing an overwrite.

Fix: keep the existing provider and set it only when the route is first created:

patched=$(echo "${existing_route_resp}" | jq --argjson domains "${AI_ROUTE_DOMAINS}" '
.data
| .domains = $domains
# Remove this line: | .upstreams[0].provider = "..."
| .headerControl.enabled = true
...
')

Alternatively, set the default value only when no provider is configured.
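The "default only when missing" option could look roughly like the sketch below. It assumes jq is installed and uses a simplified stand-in for the route JSON; the real response from Higress has more fields. `patch_route` and `DEFAULT_PROVIDER` are hypothetical names chosen for illustration. The sketch relies on jq's `//=` update operator, which assigns only when the current value is null or false, and on `--arg` to pass the shell value in as a jq variable rather than splicing it into the filter text.

```shell
#!/usr/bin/env bash
# Sketch only: the JSON below is a simplified stand-in for the real
# route response, and patch_route is a hypothetical helper.
DEFAULT_PROVIDER="qwen"   # stands in for ${HICLAW_LLM_PROVIDER:-qwen}

patch_route() {
    # --arg binds the shell value to the jq string variable $provider.
    # `//=` keeps an existing provider and fills in the default
    # only when the field is missing or null.
    jq -c --arg provider "${DEFAULT_PROVIDER}" '
        .data
        | .upstreams[0].provider //= $provider
    '
}

existing='{"data":{"upstreams":[{"provider":"openai-compat"}]}}'
fresh='{"data":{"upstreams":[{}]}}'

kept="$(echo "${existing}" | patch_route | jq -r '.upstreams[0].provider')"
defaulted="$(echo "${fresh}" | patch_route | jq -r '.upstreams[0].provider')"

echo "existing route keeps: ${kept}"
echo "fresh route gets: ${defaulted}"
```

With this approach a restart leaves a user-chosen provider such as openai-compat untouched, while a route without a provider still receives the default.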

Relevant Logs

Component

Manager Agent

Version/Commit

No response

Labels

bug