Skip to content

Conversation

@ding113
Copy link
Owner

@ding113 ding113 commented Jan 24, 2026

Summary

Implements endpoint stickiness on retries to improve retry reliability and reduce unnecessary endpoint cycling. This PR differentiates between network errors (which should try a different endpoint) and HTTP errors (which should retry the same endpoint).

Related Issues:

Problem

The previous retry logic used attemptCount to cycle through endpoints, which caused two issues:

  1. Network errors and HTTP errors were treated the same: Both would advance to the next endpoint, even when HTTP errors (like 500) might be transient and worth retrying on the same endpoint
  2. Endpoint wrap-around: When endpoints were exhausted, the system would wrap around to the first endpoint again, potentially retrying already-failed endpoints

This led to suboptimal retry behavior where:

  • Transient HTTP errors would unnecessarily cycle through all endpoints
  • Network connectivity issues wouldn't properly advance to alternative endpoints
  • Endpoint exhaustion could result in infinite retry loops

Solution

Introduced currentEndpointIndex variable to track endpoint selection independently from attemptCount, enabling different retry behaviors based on error category:

  • SYSTEM_ERROR (network failures like ECONNREFUSED, ETIMEDOUT): Advances to next lowest-latency endpoint
  • PROVIDER_ERROR (HTTP 4xx/5xx responses): Maintains same endpoint across retries (sticky behavior)
  • Endpoint exhaustion: Stays at last endpoint without wrap-around using Math.min(currentEndpointIndex, endpointCandidates.length - 1)

Changes

Core Changes

  • src/app/v1/_lib/proxy/forwarder.ts (+25/-2):
    • Added currentEndpointIndex variable to track endpoint selection independently (line 283)
    • Modified endpoint selection logic to use currentEndpointIndex instead of attemptCount (lines 320-326)
    • Added endpoint advancement logic for SYSTEM_ERROR only (lines 796-804)
    • Added no wrap-around protection with Math.min() clamping (line 325)
    • Added detailed inline comments explaining the stickiness behavior

Supporting Changes

  • tests/unit/proxy/proxy-forwarder-retry-limit.test.ts (+397/-19):
    • Added 4 new comprehensive test cases:
      • SYSTEM_ERROR: should switch to next endpoint on each network error retry
      • PROVIDER_ERROR: should keep same endpoint on non-network error retry
      • SYSTEM_ERROR: should not wrap around when endpoints exhausted
      • mixed errors: PROVIDER_ERROR should not advance endpoint index
    • Updated 4 existing tests to use SYSTEM_ERROR for endpoint switching scenarios
    • All 10 proxy-forwarder-retry-limit tests pass

Testing

Automated Tests

  • Added 4 new test cases validating endpoint stickiness behavior
  • Updated 4 existing tests to properly use SYSTEM_ERROR for endpoint switching
  • All 10 tests in proxy-forwarder-retry-limit.test.ts pass
  • Test coverage includes: network errors, HTTP errors, endpoint exhaustion, and mixed error scenarios

Manual Testing Scenarios

  1. Network error scenario: Verify SYSTEM_ERROR advances through endpoints (ep1 → ep2 → ep3)
  2. HTTP error scenario: Verify PROVIDER_ERROR stays on same endpoint (ep1 → ep1 → ep1)
  3. Exhaustion scenario: Verify no wrap-around when endpoints exhausted (ep1 → ep2 → ep2 → ep2)
  4. Mixed error scenario: Verify correct endpoint progression with alternating error types

Checklist

  • Code follows project conventions
  • Self-review completed
  • Tests pass locally (all 10 tests passing)
  • Documentation updated (inline comments added)
  • No breaking changes

Description enhanced by Claude AI

🤖 Generated with Claude Code

Greptile Overview

Greptile Summary

This PR implements endpoint stickiness on retries to improve retry reliability and reduce unnecessary endpoint cycling. The implementation introduces a currentEndpointIndex variable that tracks endpoint selection independently from attemptCount, enabling different retry behaviors based on error type.

Key Changes:

  • Network errors (SYSTEM_ERROR) now advance to the next lowest-latency endpoint on retry
  • HTTP errors (PROVIDER_ERROR like 4xx/5xx) maintain the same endpoint across retries
  • Endpoints no longer wrap around when exhausted - stays at the last endpoint using Math.min(currentEndpointIndex, endpointCandidates.length - 1)
  • Added comprehensive test coverage with 4 new test cases validating the stickiness behavior

Implementation Quality:
The code change is clean and well-documented with inline comments explaining the behavior. The existing tests were properly updated to use SYSTEM_ERROR for endpoint switching scenarios, and the new tests thoroughly validate all edge cases including mixed error scenarios.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is straightforward, well-tested with comprehensive test coverage (4 new tests + 4 updated tests), and the logic correctly implements the intended endpoint stickiness behavior. All changes are isolated to the retry logic without affecting other parts of the system.
  • No files require special attention

Important Files Changed

Filename Overview
src/app/v1/_lib/proxy/forwarder.ts Implements endpoint stickiness by tracking currentEndpointIndex separately from attemptCount, advancing only on SYSTEM_ERROR (network failures)
tests/unit/proxy/proxy-forwarder-retry-limit.test.ts Added 4 comprehensive test cases validating endpoint stickiness behavior for SYSTEM_ERROR, PROVIDER_ERROR, no wrap-around, and mixed scenarios

Sequence Diagram

sequenceDiagram
    participant Client
    participant ProxyForwarder
    participant Endpoint1
    participant Endpoint2
    participant Endpoint3

    Note over ProxyForwarder: currentEndpointIndex = 0
    Note over ProxyForwarder: attemptCount = 0

    Client->>ProxyForwarder: send(request)
    ProxyForwarder->>Endpoint1: attempt 1 (index=0)
    Endpoint1-->>ProxyForwarder: SYSTEM_ERROR (network)
    Note over ProxyForwarder: currentEndpointIndex++ (0→1)
    Note over ProxyForwarder: attemptCount++ (0→1)
    Note over ProxyForwarder: wait 100ms

    ProxyForwarder->>Endpoint2: attempt 2 (index=1)
    Endpoint2-->>ProxyForwarder: PROVIDER_ERROR (HTTP 500)
    Note over ProxyForwarder: currentEndpointIndex stays at 1
    Note over ProxyForwarder: attemptCount++ (1→2)
    Note over ProxyForwarder: wait 100ms

    ProxyForwarder->>Endpoint2: attempt 3 (index=1, sticky)
    Endpoint2-->>ProxyForwarder: SYSTEM_ERROR (network)
    Note over ProxyForwarder: currentEndpointIndex++ (1→2)
    Note over ProxyForwarder: attemptCount++ (2→3)
    Note over ProxyForwarder: wait 100ms

    ProxyForwarder->>Endpoint3: attempt 4 (index=2)
    Endpoint3-->>ProxyForwarder: Success (200 OK)
    ProxyForwarder-->>Client: Response
Loading

- Non-network errors (PROVIDER_ERROR): keep same endpoint on retry
- Network errors (SYSTEM_ERROR): switch to next lowest-latency endpoint
- Endpoint exhaustion: stay at last endpoint without wrap-around
- Scope: single request only, no cross-request stickiness

Introduced `currentEndpointIndex` variable to track endpoint independently
from `attemptCount`. Only network errors (SYSTEM_ERROR) advance the index;
HTTP errors (PROVIDER_ERROR) keep the same endpoint for retry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist
Copy link

Summary of Changes

Hello @ding113, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the proxy's resilience and efficiency by implementing a more sophisticated retry strategy. It introduces 'endpoint stickiness,' where the system intelligently decides whether to retry on the same endpoint or switch to a new one based on the nature of the error encountered. This optimization aims to improve the success rate of requests by adapting to different failure modes, such as transient network issues versus persistent provider-side problems.

Highlights

  • Intelligent Endpoint Selection on Retries: The proxy's retry mechanism now differentiates between error types to determine endpoint selection. Network errors (SYSTEM_ERROR) will trigger a switch to the next available endpoint, while provider-specific errors (PROVIDER_ERROR, e.g., HTTP 4xx/5xx) will cause retries to occur on the same endpoint.
  • No Endpoint Wrap-Around: When all available endpoints have been attempted due to SYSTEM_ERRORs, the retry mechanism will no longer wrap around to the first endpoint. Instead, it will stick to the last available endpoint.
  • New Endpoint Tracking Variable: A new variable, currentEndpointIndex, has been introduced in ProxyForwarder to manage endpoint selection independently from the attemptCount, facilitating the new sticky behavior.
  • Comprehensive Test Coverage: Four new test cases have been added to specifically validate the endpoint stickiness behavior, and four existing tests were updated to use SYSTEM_ERROR for scenarios requiring endpoint switching, ensuring robust validation of the new logic.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link

coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

Walkthrough

代理转发器引入了端点粘性机制,通过独立跟踪当前端点索引来管理重试中的端点选择。在系统错误时,端点索引会递进到下一个可用端点,而非固定重试同一端点。相应的测试套件已更新以验证新的端点切换语义。

Changes

组织 / 文件 变更摘要
代理转发器端点粘性
src/app/v1/_lib/proxy/forwarder.ts
添加 currentEndpointIndex 用于独立跟踪每个端点的推进;修改重试逻辑,在系统错误发生时递进端点索引;移除模运算,改为固定到最后可用端点;更新注释以反映粘性端点行为
代理转发器重试限制测试
tests/unit/proxy/proxy-forwarder-retry-limit.test.ts
更新导入以包含 ErrorCategorycategorizeErrorAsync;将错误模拟从 ProxyError 改为网络错误类型 (TypeError ECONNREFUSED);新增 端点粘性重试 测试套件,涵盖系统错误驱动的端点递进、提供商错误的端点固定、端点耗尽时无回绕等场景;调整现有测试以符合新的端点切换语义

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed PR标题 'fix(proxy): implement endpoint stickiness on retries' 准确地概括了主要变更:在重试时实现端点粘性的功能。标题简洁明确,与变更集的核心目标相符。
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed PR描述清晰地说明了所有关键变更内容,包括不同错误类型的端点处理逻辑、核心实现变量和测试计划。

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/endpoint-stickiness-on-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added bug Something isn't working area:provider labels Jan 24, 2026
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly implements endpoint stickiness for retries, distinguishing between network and provider errors. The logic to advance the endpoint index only on SYSTEM_ERROR and to clamp the index to avoid wrap-around is sound. The accompanying tests are thorough and cover the new behaviors well, including scenarios with mixed errors and endpoint exhaustion. The changes improve the proxy's resilience and retry strategy. I have one suggestion to improve the readability of a test mock setup.

@github-actions github-actions bot added the size/S Small PR (< 200 lines) label Jan 24, 2026
@ding113 ding113 merged commit 8fe1b7f into dev Jan 24, 2026
14 of 18 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in Claude Code Hub Roadmap Jan 24, 2026
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR implements endpoint stickiness on retries by distinguishing between network errors (SYSTEM_ERROR) and HTTP errors (PROVIDER_ERROR). The implementation is clean and well-tested.

PR Size: S

  • Lines changed: 443 (422 additions, 21 deletions)
  • Files changed: 2

Issues Found

Category Critical High Medium Low
Logic/Bugs 0 0 0 0
Security 0 0 0 0
Error Handling 0 0 0 0
Types 0 0 0 0
Comments/Docs 0 0 0 0
Tests 0 0 0 0
Simplification 0 0 0 0

Review Coverage

  • Logic and correctness - Clean
  • Security (OWASP Top 10) - Clean
  • Error handling - Clean
  • Type safety - Clean
  • Documentation accuracy - Clean
  • Test coverage - Adequate
  • Code clarity - Good

Analysis

Core Logic: The PR introduces currentEndpointIndex to track endpoint selection independently from attemptCount. This correctly implements the desired behavior:

  • SYSTEM_ERROR (network failures) advances to the next endpoint
  • PROVIDER_ERROR (HTTP 4xx/5xx) retries on the same endpoint
  • No wrap-around when endpoints are exhausted (stays at last endpoint)

Test Coverage: The PR adds 4 new test cases and updates 4 existing tests to properly use SYSTEM_ERROR for endpoint switching scenarios. All tests correctly mock categorizeErrorAsync and use appropriate error types (network errors vs ProxyError).

Error Handling: The implementation correctly calls categorizeErrorAsync to determine error type before deciding whether to advance the endpoint index. The advancement logic is only triggered in the SYSTEM_ERROR branch (line 793-805 in forwarder.ts).

Code Quality: Comments are accurate and explain the stickiness behavior clearly. The Math.min(currentEndpointIndex, endpointCandidates.length - 1) clamp correctly prevents index out of bounds.


Automated review by Claude AI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:provider bug Something isn't working size/S Small PR (< 200 lines)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants