Skip to content

trigger ReplayLog fallback only after candidate no-connection timeout#459

Merged
MrGuin merged 3 commits intomainfrom
fix_recovery
Mar 17, 2026
Merged

trigger ReplayLog fallback only after candidate no-connection timeout#459
MrGuin merged 3 commits intomainfrom
fix_recovery

Conversation

@MrGuin
Copy link
Collaborator

@MrGuin MrGuin commented Mar 16, 2026

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Chores
    • Optimized internal recovery/replay mechanisms: reduced monitoring timeout and added per-entity tracking with interval-gated retry logic to avoid duplicate replay attempts.
    • Improved efficiency and responsiveness of fault-recovery decisions while preserving existing external behavior and public interfaces.

@MrGuin MrGuin requested a review from liunyl March 16, 2026 10:36
@coderabbitai
Copy link

coderabbitai bot commented Mar 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb08d935-3db2-4d2f-8095-794e30f8c1b7

📥 Commits

Reviewing files that changed from the base of the PR and between 348b91f and 22203da.

📒 Files selected for processing (1)
  • tx_service/src/fault/log_replay_service.cpp

Walkthrough

Adds per-node-group replay-tracking to the log replay service: introduces a CandidateReplayWatch struct and a map to track per-ng term and timestamps; reduces notify timeout from 10s to 5s; changes timeout handler to apply interval-based gating and avoid redundant ReplayLog requests when inbound recovery streams exist.

Changes

Cohort / File(s) Summary
Header definitions
tx_service/include/fault/log_replay_service.h
Adds CandidateReplayWatch (term, first_seen_ts, last_replay_ts), candidate_replay_watch_ map keyed by cc_ng_id, and kReplayCheckIntervalUs constant for replay retry gating.
Retry & timeout logic
tx_service/src/fault/log_replay_service.cpp
Reduces notify timeout (10s→5s); replaces global orphan check with per-ng CandidateReplayWatch-driven logic that: refreshes on term change, skips if a replay connection exists, initializes/uses first_seen_ts and last_replay_ts to enforce kReplayCheckIntervalUs before issuing ReplayLog, and updates last_replay_ts when replay is sent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I watched the logs as leaders hopped by,
I marked their terms beneath the sky,
If streams were present I’d quietly rest,
Else I’d wait my interval, then try my best.
Replay or not, I keep time in my chest.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description contains only a pre-submit checklist template with all items unchecked, providing no substantive explanation of the changes or their rationale. Fill in the description with details about what changed, why it was changed, and the impact; check off completed checklist items or explain why they don't apply.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: gating ReplayLog fallback triggering on a candidate no-connection timeout, which aligns with the code modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix_recovery
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can enable review details to help with troubleshooting, context usage and more.

Enable the reviews.review_details setting to include review details such as the model used, the time taken for each step and more in the review comments.

@MrGuin
Copy link
Collaborator Author

MrGuin commented Mar 16, 2026

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tx_service/src/fault/log_replay_service.cpp (1)

129-144: Avoid scanning every stream under inbound_mux_ once per node group.

This new check is O(node_groups × streams) every timeout cycle because inbound_connections_ is keyed by StreamId. On larger clusters, that can turn the fallback path into a lock-heavy poller and delay Connect/on_closed. A small secondary index or per-<cc_ng_id, cc_ng_term> recovering-stream count would make this much cheaper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/fault/log_replay_service.cpp` around lines 129 - 144, The loop
scanning inbound_connections_ under inbound_mux_ for every node group timeout is
inefficient; add a secondary index (e.g., an unordered_map or nested map keyed
by std::pair<cc_ng_id, cc_ng_term> or a struct key) that tracks the count of
recovering streams to avoid O(node_groups×streams) scans. Maintain this index
whenever entries are inserted/erased or when an entry's recovering_ flag changes
(update in the same critical sections that currently touch
inbound_connections_), then replace the scan in the has_replay_connection check
with a constant-time lookup against that new map; keep using inbound_mux_ for
synchronization and update the index in the same protected scope as
inbound_connections_ to preserve correctness.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tx_service/src/fault/log_replay_service.cpp`:
- Around line 145-167: The scan loop should reset the "no-connection" timer when
a healthy replay stream exists: when has_replay_connection is true, update
watch.first_seen_ts (and/or clear it) to the current timestamp (now_ts) so
first_seen_ts reflects the most recent connected time rather than the original
observation; this prevents immediately enqueuing ReplayLog (ReplayLog(ng_id,
candidate_term, ...)) on the next scan after a short-lived drop. Locate the
check using has_replay_connection, watch.first_seen_ts, watch.last_replay_ts,
and ReplayLog and add logic to set watch.first_seen_ts = now_ts when a replay
stream is present.

---

Nitpick comments:
In `@tx_service/src/fault/log_replay_service.cpp`:
- Around line 129-144: The loop scanning inbound_connections_ under inbound_mux_
for every node group timeout is inefficient; add a secondary index (e.g., an
unordered_map or nested map keyed by std::pair<cc_ng_id, cc_ng_term> or a struct
key) that tracks the count of recovering streams to avoid O(node_groups×streams)
scans. Maintain this index whenever entries are inserted/erased or when an
entry's recovering_ flag changes (update in the same critical sections that
currently touch inbound_connections_), then replace the scan in the
has_replay_connection check with a constant-time lookup against that new map;
keep using inbound_mux_ for synchronization and update the index in the same
protected scope as inbound_connections_ to preserve correctness.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 09db1326-0dda-43d4-8e1e-30ecf2609d85

📥 Commits

Reviewing files that changed from the base of the PR and between a64a31d and 348b91f.

📒 Files selected for processing (2)
  • tx_service/include/fault/log_replay_service.h
  • tx_service/src/fault/log_replay_service.cpp

@MrGuin MrGuin merged commit f74e1f7 into main Mar 17, 2026
3 of 4 checks passed
@MrGuin MrGuin deleted the fix_recovery branch March 17, 2026 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants