trigger ReplayLog fallback only after candidate no-connection timeout by MrGuin · Pull Request #459 · eloqdata/tx_service

MrGuin · 2026-03-16T10:36:43Z

Here are some reminders before you submit the pull request

Add tests for the change
Document changes
Reference the link of issue using fixes eloqdb/tx_service#issue_id
Reference the link of RFC if exists
Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

Chores
- Optimized internal recovery/replay mechanisms: reduced monitoring timeout and added per-entity tracking with interval-gated retry logic to avoid duplicate replay attempts.
- Improved efficiency and responsiveness of fault-recovery decisions while preserving existing external behavior and public interfaces.

coderabbitai · 2026-03-16T10:37:03Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb08d935-3db2-4d2f-8095-794e30f8c1b7

📥 Commits

Reviewing files that changed from the base of the PR and between 348b91f and 22203da.

📒 Files selected for processing (1)

tx_service/src/fault/log_replay_service.cpp

Walkthrough

Adds per-node-group replay-tracking to the log replay service: introduces a CandidateReplayWatch struct and a map to track per-ng term and timestamps; reduces notify timeout from 10s to 5s; changes timeout handler to apply interval-based gating and avoid redundant ReplayLog requests when inbound recovery streams exist.

Changes

Cohort / File(s)	Summary
Header definitions `tx_service/include/fault/log_replay_service.h`	Adds `CandidateReplayWatch` (term, first_seen_ts, last_replay_ts), `candidate_replay_watch_` map keyed by cc_ng_id, and `kReplayCheckIntervalUs` constant for replay retry gating.
Retry & timeout logic `tx_service/src/fault/log_replay_service.cpp`	Reduces notify timeout (10s→5s); replaces global orphan check with per-ng CandidateReplayWatch-driven logic that: refreshes on term change, skips if a replay connection exists, initializes/uses first_seen_ts and last_replay_ts to enforce kReplayCheckIntervalUs before issuing ReplayLog, and updates last_replay_ts when replay is sent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

retry log replay when stream closed on error #182 — Modifies log_replay_service replay-triggering/gating logic and checks inbound recovery streams before issuing ReplayLog; strongly related to per-ng replay-request behavior.

Poem

🐰 I watched the logs as leaders hopped by,
I marked their terms beneath the sky,
If streams were present I’d quietly rest,
Else I’d wait my interval, then try my best.
Replay or not, I keep time in my chest.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description contains only a pre-submit checklist template with all items unchecked, providing no substantive explanation of the changes or their rationale.	Fill in the description with details about what changed, why it was changed, and the impact; check off completed checklist items or explain why they don't apply.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: gating ReplayLog fallback triggering on a candidate no-connection timeout, which aligns with the code modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix_recovery

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can enable review details to help with troubleshooting, context usage and more.

Enable the reviews.review_details setting to include review details such as the model used, the time taken for each step and more in the review comments.

MrGuin · 2026-03-16T10:37:11Z

Fixes https://github.com/eloqdata/project_tracker/issues/236, https://github.com/eloqdata/project_tracker/issues/235.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tx_service/src/fault/log_replay_service.cpp (1)
129-144: Avoid scanning every stream under inbound_mux_ once per node group.

This new check is O(node_groups × streams) every timeout cycle because inbound_connections_ is keyed by StreamId. On larger clusters, that can turn the fallback path into a lock-heavy poller and delay Connect/on_closed. A small secondary index or per-<cc_ng_id, cc_ng_term> recovering-stream count would make this much cheaper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/fault/log_replay_service.cpp` around lines 129 - 144, The loop
scanning inbound_connections_ under inbound_mux_ for every node group timeout is
inefficient; add a secondary index (e.g., an unordered_map or nested map keyed
by std::pair<cc_ng_id, cc_ng_term> or a struct key) that tracks the count of
recovering streams to avoid O(node_groups×streams) scans. Maintain this index
whenever entries are inserted/erased or when an entry's recovering_ flag changes
(update in the same critical sections that currently touch
inbound_connections_), then replace the scan in the has_replay_connection check
with a constant-time lookup against that new map; keep using inbound_mux_ for
synchronization and update the index in the same protected scope as
inbound_connections_ to preserve correctness.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tx_service/src/fault/log_replay_service.cpp`:
- Around line 145-167: The scan loop should reset the "no-connection" timer when
a healthy replay stream exists: when has_replay_connection is true, update
watch.first_seen_ts (and/or clear it) to the current timestamp (now_ts) so
first_seen_ts reflects the most recent connected time rather than the original
observation; this prevents immediately enqueuing ReplayLog (ReplayLog(ng_id,
candidate_term, ...)) on the next scan after a short-lived drop. Locate the
check using has_replay_connection, watch.first_seen_ts, watch.last_replay_ts,
and ReplayLog and add logic to set watch.first_seen_ts = now_ts when a replay
stream is present.

---

Nitpick comments:
In `@tx_service/src/fault/log_replay_service.cpp`:
- Around line 129-144: The loop scanning inbound_connections_ under inbound_mux_
for every node group timeout is inefficient; add a secondary index (e.g., an
unordered_map or nested map keyed by std::pair<cc_ng_id, cc_ng_term> or a struct
key) that tracks the count of recovering streams to avoid O(node_groups×streams)
scans. Maintain this index whenever entries are inserted/erased or when an
entry's recovering_ flag changes (update in the same critical sections that
currently touch inbound_connections_), then replace the scan in the
has_replay_connection check with a constant-time lookup against that new map;
keep using inbound_mux_ for synchronization and update the index in the same
protected scope as inbound_connections_ to preserve correctness.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 09db1326-0dda-43d4-8e1e-30ecf2609d85

📥 Commits

Reviewing files that changed from the base of the PR and between a64a31d and 348b91f.

📒 Files selected for processing (2)

tx_service/include/fault/log_replay_service.h
tx_service/src/fault/log_replay_service.cpp

tx_service/src/fault/log_replay_service.cpp

MrGuin added 2 commits March 16, 2026 18:31

tx_service: gate replay fallback on candidate no-connection timeout

e2419a0

format

348b91f

MrGuin requested a review from liunyl March 16, 2026 10:36

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

tx_service/src/fault/log_replay_service.cpp Show resolved Hide resolved

liunyl approved these changes Mar 17, 2026

View reviewed changes

tx_service: reset no-connection timer when replay stream is present

22203da

MrGuin merged commit f74e1f7 into main Mar 17, 2026
3 of 4 checks passed

MrGuin deleted the fix_recovery branch March 17, 2026 09:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trigger ReplayLog fallback only after candidate no-connection timeout#459

trigger ReplayLog fallback only after candidate no-connection timeout#459
MrGuin merged 3 commits intomainfrom
fix_recovery

MrGuin commented Mar 16, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 16, 2026 •

edited

Loading

Review failed

❌ Failed checks (2 warnings)

Uh oh!

MrGuin commented Mar 16, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MrGuin commented Mar 16, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Here are some reminders before you submit the pull request

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (2 warnings)

Uh oh!

MrGuin commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MrGuin commented Mar 16, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 16, 2026 •

edited

Loading

MrGuin commented Mar 16, 2026 •

edited

Loading