fix(proxy): implement endpoint stickiness on retries #651

ding113 · 2026-01-24T12:09:22Z

Summary

Implements endpoint stickiness on retries to improve retry reliability and reduce unnecessary endpoint cycling. This PR differentiates between network errors (which should try a different endpoint) and HTTP errors (which should retry the same endpoint).

Related Issues:

Partially addresses 渠道返回 503/400 错误时，系统在原渠道重复重试而非切换至其他备用渠道 #629 - System was repeatedly retrying the same provider instead of switching to backup providers when encountering 503/400 errors. This PR improves the retry logic by implementing intelligent endpoint selection based on error type.
Follow-up to Fix provider exhaustion after model redirect (refs #629) #633 - Builds on the provider exhaustion fix by adding endpoint-level retry intelligence

Problem

The previous retry logic used attemptCount to cycle through endpoints, which caused two issues:

Network errors and HTTP errors were treated the same: Both would advance to the next endpoint, even when HTTP errors (like 500) might be transient and worth retrying on the same endpoint
Endpoint wrap-around: When endpoints were exhausted, the system would wrap around to the first endpoint again, potentially retrying already-failed endpoints

This led to suboptimal retry behavior where:

Transient HTTP errors would unnecessarily cycle through all endpoints
Network connectivity issues wouldn't properly advance to alternative endpoints
Endpoint exhaustion could result in infinite retry loops

Solution

Introduced currentEndpointIndex variable to track endpoint selection independently from attemptCount, enabling different retry behaviors based on error category:

SYSTEM_ERROR (network failures like ECONNREFUSED, ETIMEDOUT): Advances to next lowest-latency endpoint
PROVIDER_ERROR (HTTP 4xx/5xx responses): Maintains same endpoint across retries (sticky behavior)
Endpoint exhaustion: Stays at last endpoint without wrap-around using Math.min(currentEndpointIndex, endpointCandidates.length - 1)

Changes

Core Changes

src/app/v1/_lib/proxy/forwarder.ts (+25/-2):
- Added currentEndpointIndex variable to track endpoint selection independently (line 283)
- Modified endpoint selection logic to use currentEndpointIndex instead of attemptCount (lines 320-326)
- Added endpoint advancement logic for SYSTEM_ERROR only (lines 796-804)
- Added no wrap-around protection with Math.min() clamping (line 325)
- Added detailed inline comments explaining the stickiness behavior

Supporting Changes

tests/unit/proxy/proxy-forwarder-retry-limit.test.ts (+397/-19):
- Added 4 new comprehensive test cases:
  - SYSTEM_ERROR: should switch to next endpoint on each network error retry
  - PROVIDER_ERROR: should keep same endpoint on non-network error retry
  - SYSTEM_ERROR: should not wrap around when endpoints exhausted
  - mixed errors: PROVIDER_ERROR should not advance endpoint index
- Updated 4 existing tests to use SYSTEM_ERROR for endpoint switching scenarios
- All 10 proxy-forwarder-retry-limit tests pass

Testing

Automated Tests

Added 4 new test cases validating endpoint stickiness behavior
Updated 4 existing tests to properly use SYSTEM_ERROR for endpoint switching
All 10 tests in proxy-forwarder-retry-limit.test.ts pass
Test coverage includes: network errors, HTTP errors, endpoint exhaustion, and mixed error scenarios

Manual Testing Scenarios

Network error scenario: Verify SYSTEM_ERROR advances through endpoints (ep1 → ep2 → ep3)
HTTP error scenario: Verify PROVIDER_ERROR stays on same endpoint (ep1 → ep1 → ep1)
Exhaustion scenario: Verify no wrap-around when endpoints exhausted (ep1 → ep2 → ep2 → ep2)
Mixed error scenario: Verify correct endpoint progression with alternating error types

Checklist

Code follows project conventions
Self-review completed
Tests pass locally (all 10 tests passing)
Documentation updated (inline comments added)
No breaking changes

Description enhanced by Claude AI

🤖 Generated with Claude Code

Greptile Overview

Greptile Summary

This PR implements endpoint stickiness on retries to improve retry reliability and reduce unnecessary endpoint cycling. The implementation introduces a currentEndpointIndex variable that tracks endpoint selection independently from attemptCount, enabling different retry behaviors based on error type.

Key Changes:

Network errors (SYSTEM_ERROR) now advance to the next lowest-latency endpoint on retry
HTTP errors (PROVIDER_ERROR like 4xx/5xx) maintain the same endpoint across retries
Endpoints no longer wrap around when exhausted - stays at the last endpoint using Math.min(currentEndpointIndex, endpointCandidates.length - 1)
Added comprehensive test coverage with 4 new test cases validating the stickiness behavior

Implementation Quality:
The code change is clean and well-documented with inline comments explaining the behavior. The existing tests were properly updated to use SYSTEM_ERROR for endpoint switching scenarios, and the new tests thoroughly validate all edge cases including mixed error scenarios.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation is straightforward, well-tested with comprehensive test coverage (4 new tests + 4 updated tests), and the logic correctly implements the intended endpoint stickiness behavior. All changes are isolated to the retry logic without affecting other parts of the system.
No files require special attention

Important Files Changed

Filename	Overview
src/app/v1/_lib/proxy/forwarder.ts	Implements endpoint stickiness by tracking `currentEndpointIndex` separately from `attemptCount`, advancing only on SYSTEM_ERROR (network failures)
tests/unit/proxy/proxy-forwarder-retry-limit.test.ts	Added 4 comprehensive test cases validating endpoint stickiness behavior for SYSTEM_ERROR, PROVIDER_ERROR, no wrap-around, and mixed scenarios

Sequence Diagram

sequenceDiagram
    participant Client
    participant ProxyForwarder
    participant Endpoint1
    participant Endpoint2
    participant Endpoint3

    Note over ProxyForwarder: currentEndpointIndex = 0
    Note over ProxyForwarder: attemptCount = 0

    Client->>ProxyForwarder: send(request)
    ProxyForwarder->>Endpoint1: attempt 1 (index=0)
    Endpoint1-->>ProxyForwarder: SYSTEM_ERROR (network)
    Note over ProxyForwarder: currentEndpointIndex++ (0→1)
    Note over ProxyForwarder: attemptCount++ (0→1)
    Note over ProxyForwarder: wait 100ms

    ProxyForwarder->>Endpoint2: attempt 2 (index=1)
    Endpoint2-->>ProxyForwarder: PROVIDER_ERROR (HTTP 500)
    Note over ProxyForwarder: currentEndpointIndex stays at 1
    Note over ProxyForwarder: attemptCount++ (1→2)
    Note over ProxyForwarder: wait 100ms

    ProxyForwarder->>Endpoint2: attempt 3 (index=1, sticky)
    Endpoint2-->>ProxyForwarder: SYSTEM_ERROR (network)
    Note over ProxyForwarder: currentEndpointIndex++ (1→2)
    Note over ProxyForwarder: attemptCount++ (2→3)
    Note over ProxyForwarder: wait 100ms

    ProxyForwarder->>Endpoint3: attempt 4 (index=2)
    Endpoint3-->>ProxyForwarder: Success (200 OK)
    ProxyForwarder-->>Client: Response

- Non-network errors (PROVIDER_ERROR): keep same endpoint on retry - Network errors (SYSTEM_ERROR): switch to next lowest-latency endpoint - Endpoint exhaustion: stay at last endpoint without wrap-around - Scope: single request only, no cross-request stickiness Introduced `currentEndpointIndex` variable to track endpoint independently from `attemptCount`. Only network errors (SYSTEM_ERROR) advance the index; HTTP errors (PROVIDER_ERROR) keep the same endpoint for retry. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

gemini-code-assist · 2026-01-24T12:09:40Z

Summary of Changes

Hello @ding113, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the proxy's resilience and efficiency by implementing a more sophisticated retry strategy. It introduces 'endpoint stickiness,' where the system intelligently decides whether to retry on the same endpoint or switch to a new one based on the nature of the error encountered. This optimization aims to improve the success rate of requests by adapting to different failure modes, such as transient network issues versus persistent provider-side problems.

Highlights

Intelligent Endpoint Selection on Retries: The proxy's retry mechanism now differentiates between error types to determine endpoint selection. Network errors (SYSTEM_ERROR) will trigger a switch to the next available endpoint, while provider-specific errors (PROVIDER_ERROR, e.g., HTTP 4xx/5xx) will cause retries to occur on the same endpoint.
No Endpoint Wrap-Around: When all available endpoints have been attempted due to SYSTEM_ERRORs, the retry mechanism will no longer wrap around to the first endpoint. Instead, it will stick to the last available endpoint.
New Endpoint Tracking Variable: A new variable, currentEndpointIndex, has been introduced in ProxyForwarder to manage endpoint selection independently from the attemptCount, facilitating the new sticky behavior.
Comprehensive Test Coverage: Four new test cases have been added to specifically validate the endpoint stickiness behavior, and four existing tests were updated to use SYSTEM_ERROR for scenarios requiring endpoint switching, ensuring robust validation of the new logic.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-01-24T12:09:54Z

📝 Walkthrough

Walkthrough

代理转发器引入了端点粘性机制，通过独立跟踪当前端点索引来管理重试中的端点选择。在系统错误时，端点索引会递进到下一个可用端点，而非固定重试同一端点。相应的测试套件已更新以验证新的端点切换语义。

Changes

组织 / 文件	变更摘要
代理转发器端点粘性 `src/app/v1/_lib/proxy/forwarder.ts`	添加 `currentEndpointIndex` 用于独立跟踪每个端点的推进；修改重试逻辑，在系统错误发生时递进端点索引；移除模运算，改为固定到最后可用端点；更新注释以反映粘性端点行为
代理转发器重试限制测试 `tests/unit/proxy/proxy-forwarder-retry-limit.test.ts`	更新导入以包含 `ErrorCategory` 和 `categorizeErrorAsync`；将错误模拟从 `ProxyError` 改为网络错误类型 (TypeError ECONNREFUSED)；新增 `端点粘性重试` 测试套件，涵盖系统错误驱动的端点递进、提供商错误的端点固定、端点耗尽时无回绕等场景；调整现有测试以符合新的端点切换语义

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fix(proxy): detect SSE error block in HTTP 200 response and trigger retry #649: 两个PR都涉及代理重试/错误处理路径的修改，包括 forwarder.ts 文件和错误分类流程的更新。

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	PR标题 'fix(proxy): implement endpoint stickiness on retries' 准确地概括了主要变更：在重试时实现端点粘性的功能。标题简洁明确，与变更集的核心目标相符。
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check	✅ Passed	PR描述清晰地说明了所有关键变更内容，包括不同错误类型的端点处理逻辑、核心实现变量和测试计划。

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/endpoint-stickiness-on-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request correctly implements endpoint stickiness for retries, distinguishing between network and provider errors. The logic to advance the endpoint index only on SYSTEM_ERROR and to clamp the index to avoid wrap-around is sound. The accompanying tests are thorough and cover the new behaviors well, including scenarios with mixed errors and endpoint exhaustion. The changes improve the proxy's resilience and retry strategy. I have one suggestion to improve the readability of a test mock setup.

github-actions

Code Review Summary

This PR implements endpoint stickiness on retries by distinguishing between network errors (SYSTEM_ERROR) and HTTP errors (PROVIDER_ERROR). The implementation is clean and well-tested.

PR Size: S

Lines changed: 443 (422 additions, 21 deletions)
Files changed: 2

Issues Found

Category	Critical	High	Medium	Low
Logic/Bugs	0	0	0	0
Security	0	0	0	0
Error Handling	0	0	0	0
Types	0	0	0	0
Comments/Docs	0	0	0	0
Tests	0	0	0	0
Simplification	0	0	0	0

Review Coverage

Analysis

Core Logic: The PR introduces currentEndpointIndex to track endpoint selection independently from attemptCount. This correctly implements the desired behavior:

SYSTEM_ERROR (network failures) advances to the next endpoint
PROVIDER_ERROR (HTTP 4xx/5xx) retries on the same endpoint
No wrap-around when endpoints are exhausted (stays at last endpoint)

Test Coverage: The PR adds 4 new test cases and updates 4 existing tests to properly use SYSTEM_ERROR for endpoint switching scenarios. All tests correctly mock categorizeErrorAsync and use appropriate error types (network errors vs ProxyError).

Error Handling: The implementation correctly calls categorizeErrorAsync to determine error type before deciding whether to advance the endpoint index. The advancement logic is only triggered in the SYSTEM_ERROR branch (line 793-805 in forwarder.ts).

Code Quality: Comments are accurate and explain the stickiness behavior clearly. The Math.min(currentEndpointIndex, endpointCandidates.length - 1) clamp correctly prevents index out of bounds.

Automated review by Claude AI

github-project-automation bot added this to Claude Code Hub Roadmap Jan 24, 2026

github-project-automation bot moved this to Backlog in Claude Code Hub Roadmap Jan 24, 2026

github-actions bot added bug Something isn't working area:provider labels Jan 24, 2026

gemini-code-assist bot reviewed Jan 24, 2026

View reviewed changes

github-actions bot added the size/S Small PR (< 200 lines) label Jan 24, 2026

coderabbitai bot approved these changes Jan 24, 2026

View reviewed changes

ding113 merged commit 8fe1b7f into dev Jan 24, 2026
14 of 18 checks passed

github-project-automation bot moved this from Backlog to Done in Claude Code Hub Roadmap Jan 24, 2026

github-actions bot reviewed Jan 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(proxy): implement endpoint stickiness on retries #651

fix(proxy): implement endpoint stickiness on retries #651

Uh oh!

ding113 commented Jan 24, 2026 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot commented Jan 24, 2026

Uh oh!

coderabbitai bot commented Jan 24, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

github-actions bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(proxy): implement endpoint stickiness on retries #651

fix(proxy): implement endpoint stickiness on retries #651

Uh oh!

Conversation

ding113 commented Jan 24, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Changes

Core Changes

Supporting Changes

Testing

Automated Tests

Manual Testing Scenarios

Checklist

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

gemini-code-assist bot commented Jan 24, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai bot commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Code Review Summary

PR Size: S

Issues Found

Review Coverage

Analysis

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ding113 commented Jan 24, 2026 •

edited by github-actions bot

Loading

coderabbitai bot commented Jan 24, 2026 •

edited

Loading