Fix CcStream SendMessageToNode #282

Merged
MrGuin merged 7 commits into main from fix_stream on Dec 15, 2025

Conversation

MrGuin (Collaborator) commented Dec 10, 2025

  • introduce a dedicated SendStandbyMessageToNode API, since standby-forward messages require their own retry logic;
  • clarify SendMessageResult return value for cc stream sends.

Related PR:
eloqdata/eloqkv#333
eloqdata/eloqsql#183
eloqdata/eloqdoc#360

Summary by CodeRabbit

  • Refactor
    • Restructured message delivery system with improved retry and reconnection handling for enhanced reliability
    • Added optimized message delivery path for standby node replication

coderabbitai bot commented Dec 10, 2025

Walkthrough

This PR refactors message-sending APIs in CcStreamSender by introducing a structured SendMessageResult type to replace boolean return values. Three core methods now return this struct containing sent, queued_for_retry, and need_reconnect fields. A new SendStandbyMessageToNode method adds a best-effort standby replication path. All call sites are updated to interpret the new result type.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| API Definition | tx_service/include/remote/cc_stream_sender.h | Introduced struct SendMessageResult with fields and static constructors (Sent(), Queued(bool), Failed(bool)). Updated signatures for SendMessageToNg, SendMessageToNode, SendScanRespToNode to return SendMessageResult instead of bool. Added new public method SendStandbyMessageToNode(uint32_t, const CcMessage&) returning bool. |
| Core Implementation | tx_service/src/remote/cc_stream_sender.cpp | Implemented new SendMessageResult return types for SendMessageToNode, SendScanRespToNode, SendMessageToNg. Simplified parameter lists (removed retry flags, log_verbose, out-parameters). Introduced dedicated SendStandbyMessageToNode implementation. Refactored retry/reconnect logic to inspect result fields (sent, queued_for_retry, need_reconnect). Updated error handling paths (unknown destination, EAGAIN, resend failures) to return appropriate SendMessageResult values. |
| Call Site Updates | tx_service/src/cc/cc_shard.cpp | Replaced SendMessageToNode(...) with SendStandbyMessageToNode(...) in ForwardStandbyMessage and ResendFailedForwardMessages. |
| Result Interpretation Updates | tx_service/src/dead_lock_check.cpp, tx_service/src/fault/fault_inject.cpp, tx_service/src/remote/remote_cc_handler.cpp | Updated call sites to capture SendMessageResult and check (send_res.sent \|\| send_res.queued_for_retry) instead of treating direct boolean returns. Success handling and control flow logic remain unchanged. |
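
For readers without the diff open, the result type summarized above might look roughly like the following. This is a minimal sketch reconstructed from the field and factory names in the summary, not the merged header: the default argument on Failed() and the Accepted() helper are assumptions added for illustration.

```cpp
// Sketch of the structured send result described in the walkthrough.
// Field and factory names come from the PR summary; bodies are assumed.
struct SendMessageResult
{
    bool sent = false;              // written to the stream immediately
    bool queued_for_retry = false;  // buffered by the sender for a later resend
    bool need_reconnect = false;    // stream must be re-established

    static SendMessageResult Sent()
    {
        return {true, false, false};
    }
    static SendMessageResult Queued(bool need_reconnect)
    {
        return {false, true, need_reconnect};
    }
    // Assumption: a defaulted parameter, so that both Failed() and
    // Failed(true) as mentioned in the review would compile.
    static SendMessageResult Failed(bool need_reconnect = false)
    {
        return {false, false, need_reconnect};
    }
};

// Hypothetical helper mirroring the call-site check
// (send_res.sent || send_res.queued_for_retry).
inline bool Accepted(const SendMessageResult &res)
{
    return res.sent || res.queued_for_retry;
}
```

With this shape, a call site that previously tested a bool would test Accepted(result), or the equivalent inline expression, instead.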

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas requiring extra attention:

  • Verify all six call sites correctly interpret SendMessageResult fields, especially the distinction between sent, queued_for_retry, and need_reconnect states
  • Validate retry and reconnect logic in cc_stream_sender.cpp propagates result fields correctly through ResendMessageToNode and ConnectStreams paths
  • Confirm SendStandbyMessageToNode best-effort semantics (no retry/queuing) and its integration with standby replication flow in cc_shard.cpp
  • Check for any inverted or incorrect condition logic in result field checks across call sites

Suggested reviewers

  • liunyl

Poem

🐰 A structured result, now three booleans strong,
Replaces the simple true and false all along,
Sent, queued, and reconnect—oh what a sight!
Messages hop through their paths in the night,
Standby takes shortcuts, swift and direct!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title is vague and generic, using only 'Fix CcStream SendMessageToNode' without indicating the specific nature of the fix (e.g., API redesign, return type refactoring). | Consider a more descriptive title such as 'Refactor CcStream SendMessageToNode return types and add SendStandbyMessageToNode API' to better convey the changeset scope. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description provides context about the main changes and related PRs but lacks specific details about testing, documentation, and issue references as specified in the template. |

@MrGuin MrGuin requested a review from liunyl December 10, 2025 12:12

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tx_service/include/remote/cc_stream_sender.h (1)

74-95: SendMessageResult API looks good; consider small ergonomics tweaks

The SendMessageResult struct cleanly exposes the three relevant states (sent, queued_for_retry, need_reconnect), and updating the three send methods to return it clarifies call‑site logic. The dedicated SendStandbyMessageToNode is a reasonable specialization, even though it returns a plain bool.

If you want to make misuse harder, you could optionally:

  • Mark SendMessageResult (or the send methods) as [[nodiscard]] to catch ignored results.
  • Add a brief comment documenting the semantics of sent, queued_for_retry, and need_reconnect so future call sites interpret them consistently.

Otherwise this header‑level API change looks consistent.

Also applies to: 103-116
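
As an illustration of the [[nodiscard]] suggestion, the attribute can be placed on the type itself so that every function returning it is covered. The sketch below assumes C++17; SendStub is a hypothetical stand-in, not a CcStreamSender method.

```cpp
// Marking the type [[nodiscard]] (C++17) asks the compiler to warn whenever
// a function returning it is called and the result is ignored.
struct [[nodiscard]] SendMessageResult
{
    // True if the message was written to the stream immediately.
    bool sent = false;
    // True if the sender buffered the message and will retry it later.
    bool queued_for_retry = false;
    // True if the stream must be re-established before further sends.
    bool need_reconnect = false;
};

// Hypothetical stand-in for a send method, used only to show the attribute.
SendMessageResult SendStub(bool stream_ok)
{
    SendMessageResult res;
    res.sent = stream_ok;
    return res;
}
```

A bare statement such as `SendStub(true);` would then typically draw a compiler warning, while `auto res = SendStub(true);` would not.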

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1344f8 and b2af488.

📒 Files selected for processing (6)
  • tx_service/include/remote/cc_stream_sender.h (2 hunks)
  • tx_service/src/cc/cc_shard.cpp (2 hunks)
  • tx_service/src/dead_lock_check.cpp (1 hunks)
  • tx_service/src/fault/fault_inject.cpp (1 hunks)
  • tx_service/src/remote/cc_stream_sender.cpp (17 hunks)
  • tx_service/src/remote/remote_cc_handler.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.

Applied to files:

  • tx_service/src/cc/cc_shard.cpp
🧬 Code graph analysis (2)
tx_service/include/remote/cc_stream_sender.h (1)
tx_service/src/remote/cc_stream_sender.cpp (8)
  • SendMessageToNg (507-514)
  • SendMessageToNg (507-510)
  • SendMessageToNode (158-320)
  • SendMessageToNode (158-161)
  • SendScanRespToNode (322-458)
  • SendScanRespToNode (322-323)
  • SendStandbyMessageToNode (460-495)
  • SendStandbyMessageToNode (460-461)
tx_service/src/remote/cc_stream_sender.cpp (1)
tx_service/include/remote/cc_stream_sender.h (2)
  • Queued (85-93)
  • Sent (80-94)
🔇 Additional comments (6)
tx_service/src/remote/remote_cc_handler.cpp (1)

848-872: BroadcastStatistics now treats queued send as success and surfaces hard failures

Using SendMessageResult and calling hres.SetFinished() only when sent || queued_for_retry keeps the fire‑and‑forget behavior for accepted/queued sends, while allowing genuine send failures to propagate via CcStreamSender’s error handling. Verify that callers are fine with this potential behavior change (previously hard failures were effectively ignored here).

tx_service/src/dead_lock_check.cpp (1)

199-218: Deadlock probe send path correctly updated to SendMessageResult

Using send_res.sent || send_res.queued_for_retry to decide whether to track a node in reply_map_ preserves the previous intent (only expect replies from nodes that accepted/queued the request) while aligning with the new API.

tx_service/src/fault/fault_inject.cpp (1)

175-213: Remote fault-inject path correctly interprets SendMessageResult

b = (send_res.sent || send_res.queued_for_retry); keeps the previous meaning of “remote injection accepted” while properly treating queued‑for‑retry as success and falling back to local injection only on hard failure.
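
The call-site pattern shared by the three comments above can be sketched as follows. The function name and the callback parameters are hypothetical; only the (sent || queued_for_retry) check is taken from the review.

```cpp
#include <functional>

struct SendMessageResult
{
    bool sent = false;
    bool queued_for_retry = false;
    bool need_reconnect = false;
};

// Hypothetical call site: try the remote path first and fall back to a local
// action only on a hard failure (neither sent nor queued for retry).
bool InjectRemoteOrLocal(const std::function<SendMessageResult()> &send_remote,
                         const std::function<void()> &inject_local)
{
    const SendMessageResult send_res = send_remote();
    const bool accepted = send_res.sent || send_res.queued_for_retry;
    if (!accepted)
    {
        inject_local();
    }
    return accepted;
}
```

Treating a queued message as accepted keeps the old boolean meaning at the call site; need_reconnect is deliberately ignored here because, per the review, it matters only to the sender's own resend and reconnect paths.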

tx_service/src/remote/cc_stream_sender.cpp (2)

158-320: Core send paths correctly refactored to SendMessageResult

The refactoring of SendMessageToNode, SendScanRespToNode, ResendMessageToNode, and ConnectStreams to use SendMessageResult is internally consistent:

  • Unknown node → Failed() with REQUESTED_NODE_NOT_LEADER set on res when provided.
  • Connecting stream (stream_ver == -1) → buffer into appropriate resend list, wake connector, and return Queued(true).
  • EAGAIN → buffer into EAGAIN resend list and return Queued(false).
  • Non‑EAGAIN errors:
    • initial send (resend == false) → mark stream for reconnect, buffer into resend list, wake connector, Queued(true);
    • resend (resend == true) → Failed() and set error on res.
  • Resend threads treat sent || queued_for_retry as success and use need_reconnect only to handle races where another thread has already marked the stream for reconnect.

This preserves prior behavior while making states explicit at call sites.

Also applies to: 322-458, 617-763, 766-916
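
The state mapping listed in this comment can be written down as a small pure function. This is a sketch of the decision table only: SendAttempt and Classify are hypothetical names, and the real methods also buffer messages and wake the connector thread, which is omitted here.

```cpp
#include <cerrno>
#include <cstdint>

struct SendMessageResult
{
    bool sent = false;
    bool queued_for_retry = false;
    bool need_reconnect = false;

    static SendMessageResult Sent() { return {true, false, false}; }
    static SendMessageResult Queued(bool reconnect) { return {false, true, reconnect}; }
    static SendMessageResult Failed() { return {false, false, false}; }
};

// Hypothetical summary of one send attempt.
struct SendAttempt
{
    bool node_known;     // destination node resolved to a stream
    int64_t stream_ver;  // -1 means the stream is still connecting
    int write_errno;     // 0 on success, otherwise the write error
    bool resend;         // true when called from a resend path
};

// Maps an attempt to the outcome described in the review comment.
SendMessageResult Classify(const SendAttempt &a)
{
    if (!a.node_known)
    {
        return SendMessageResult::Failed();  // unknown node
    }
    if (a.stream_ver == -1)
    {
        return SendMessageResult::Queued(true);  // connecting stream
    }
    if (a.write_errno == 0)
    {
        return SendMessageResult::Sent();
    }
    if (a.write_errno == EAGAIN)
    {
        return SendMessageResult::Queued(false);  // stream full, still valid
    }
    // Non-EAGAIN error: the initial send queues and reconnects, a resend fails.
    return a.resend ? SendMessageResult::Failed()
                    : SendMessageResult::Queued(true);
}
```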


460-495: SendStandbyMessageToNode lacks reconnect/resend integration

SendStandbyMessageToNode does not trigger reconnect or resend logic on stream failure. If the underlying stream breaks, calls fail and log errors, but CcStreamSender never reconnects the stream unless another code path (e.g., SendMessageToNode) touches the same node. This risks leaving standby replicas unreachable.

Consider wiring SendStandbyMessageToNode into the reconnect machinery, either by delegating to SendMessageToNode or by mirroring its reconnect behavior (flipping stream version to -1 and populating to_connect_regular_streams_ on non-EAGAIN errors).
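
One way the second option could look is sketched below: the standby send stays best-effort but marks the stream for reconnect on a non-EAGAIN error. StandbySenderSketch and its members are hypothetical stand-ins for the sender's per-node bookkeeping; locking and connector wake-up are left out.

```cpp
#include <cerrno>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the sender's stream state.
struct StandbySenderSketch
{
    std::unordered_map<uint32_t, int64_t> stream_ver;  // -1 == reconnecting
    std::unordered_set<uint32_t> to_connect;           // nodes awaiting reconnect

    // write_errno models the result of the underlying stream write
    // (0 on success). Returns true only if the message was sent immediately.
    bool SendStandby(uint32_t node_id, int write_errno)
    {
        auto it = stream_ver.find(node_id);
        if (it == stream_ver.end() || it->second == -1)
        {
            return false;  // unknown node, or stream not ready yet
        }
        if (write_errno == 0)
        {
            return true;
        }
        if (write_errno != EAGAIN)
        {
            // Broken stream: flip the version and hand the node to the
            // connector, but do not queue the message itself.
            it->second = -1;
            to_connect.insert(node_id);
        }
        return false;
    }
};
```

The message is never buffered in this sketch; the caller remains responsible for retrying standby messages.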

tx_service/src/cc/cc_shard.cpp (1)

2680-2726: Standby forwarding reconnection concern requires verification

The earlier review comment on cc_stream_sender.cpp flags that SendStandbyMessageToNode may not trigger the same reconnection logic as SendMessageToNode on stream failure, which could leave standby-only nodes with broken connections. Confirming this requires checking both implementations in cc_stream_sender.cpp: their error-handling paths, and whether they enqueue messages for resend or trigger stream reconnection through version management or ConnectStreams().

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tx_service/src/remote/cc_stream_sender.cpp (2)

262-262: Track unresolved design decision about message discarding.

The TODO comments flag an important question: when resend fails (possibly indicating a dead remote node), should the message be definitively discarded? The current behavior returns Failed() without explicit discard logic.

Would you like me to open an issue to track this design decision? It requires clarity on retry semantics and failure handling policy.

Also applies to: 422-422


461-500: Consider API consistency or add documentation for different return type.

SendStandbyMessageToNode returns bool while SendMessageToNode, SendScanRespToNode, and SendMessageToNg all return SendMessageResult. The best-effort semantics (no retry/queuing) justify the simpler return type, but the inconsistency might confuse developers.

Consider one of:

  1. Add a doc comment explaining why standby messages use a simpler bool return (best-effort, no retry infrastructure)
  2. Use SendMessageResult for consistency (though Sent()/Failed() without queuing might be less clear)
/**
 * @brief Send standby message with best-effort semantics (no retry/queuing).
 * 
 * Unlike SendMessageToNode, this does not queue failed messages for retry.
 * Use for standby replication where occasional message loss is acceptable.
 * 
 * @return true if sent immediately, false otherwise
 */
bool SendStandbyMessageToNode(uint32_t dest_node_id, const CcMessage &msg);
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b2af488 and b71b783.

📒 Files selected for processing (2)
  • tx_service/include/remote/cc_stream_sender.h (2 hunks)
  • tx_service/src/remote/cc_stream_sender.cpp (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/include/remote/cc_stream_sender.h
🔇 Additional comments (3)
tx_service/src/remote/cc_stream_sender.cpp (3)

158-321: Excellent refactor to structured return type.

The change from boolean return to SendMessageResult significantly improves clarity by explicitly distinguishing between sent, queued-for-retry, and failed states. Eliminating the output parameter need_reconnect in favor of including it in the result struct is a cleaner design.


697-702: Correct interpretation of SendMessageResult.

The logic correctly treats both sent and queued_for_retry as successful outcomes for the purpose of clearing messages from the EAGAIN resend list. This aligns with the intent that queued messages will be handled by the reconnection/resend infrastructure.

Also applies to: 745-750


829-847: Assertions correctly document subtle invariant.

The assertions at lines 836 and 897 verify that need_reconnect implies queued_for_retry when resend=true. This is a subtle but correct invariant: when resending and need_reconnect is true, it can only occur from the stream-connecting case (line 204 / line 365), which sets both flags. The assertions help document this non-obvious relationship.

Also applies to: 890-909

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b71b783 and 1bb9d53.

📒 Files selected for processing (6)
  • tx_service/include/remote/cc_stream_sender.h (2 hunks)
  • tx_service/src/cc/cc_shard.cpp (2 hunks)
  • tx_service/src/dead_lock_check.cpp (1 hunks)
  • tx_service/src/fault/fault_inject.cpp (1 hunks)
  • tx_service/src/remote/cc_stream_sender.cpp (17 hunks)
  • tx_service/src/remote/remote_cc_handler.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tx_service/src/remote/remote_cc_handler.cpp
  • tx_service/src/dead_lock_check.cpp
  • tx_service/include/remote/cc_stream_sender.h
🔇 Additional comments (12)
tx_service/src/fault/fault_inject.cpp (1)

203-204: LGTM!

The adaptation to the new SendMessageResult return type is correct. The success criteria (sent || queued_for_retry) aligns with the PR's intent to treat queued messages as successfully handled, with the local fault injection fallback still triggered when the remote send definitively fails.

tx_service/src/cc/cc_shard.cpp (2)

2710-2711: LGTM!

The switch to SendStandbyMessageToNode is appropriate for standby forwarding. The boolean return simplifies the logic since standby messages have their own retry mechanism through the message buffer (history_standby_msg_) and retry_fwd_msg_cc_.


2842-2843: LGTM!

Consistent use of SendStandbyMessageToNode in the retry path. The error handling correctly sets all_msgs_sent = false and breaks the loop on failure.

tx_service/src/remote/cc_stream_sender.cpp (9)

158-186: LGTM on the API migration.

The return type change from bool to SendMessageResult provides clearer semantics. The Failed() return when the destination node is unknown is appropriate.


189-205: LGTM!

Returning SendMessageResult::Queued(true) when stream is reconnecting correctly indicates that the message is queued and reconnection is needed.


218-256: LGTM!

EAGAIN handling correctly returns Queued(false) since the message is queued for background resend but no reconnection is needed (the stream is still valid, just temporarily full).


257-321: LGTM!

The non-EAGAIN error handling correctly distinguishes between resend (where further retry is futile) and initial send (where reconnection and retry are attempted). The Queued(true) return for the initial send path correctly signals that reconnection is needed.


512-519: LGTM!

SendMessageToNg correctly delegates to SendMessageToNode and returns the result unchanged.


697-702: LGTM!

The resend logic correctly treats both sent and queued_for_retry as success for counting purposes, since either outcome means the message has been handled.


745-750: LGTM!

Consistent handling with the regular message resend logic above.


829-848: LGTM!

The ConnectStreams logic correctly handles the new SendMessageResult:

  • Checks need_reconnect to determine if the connection loop should break
  • The assertion assert(send_result.queued_for_retry) correctly validates that when need_reconnect is true, the current message was already re-queued internally
  • Remaining messages are re-enqueued for the next connection attempt
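
The loop shape described in these bullets can be sketched as below. FlushPending and the string message type are hypothetical; leaving unsent messages in the pending queue stands in for re-enqueueing them for the next connection attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

struct SendMessageResult
{
    bool sent = false;
    bool queued_for_retry = false;
    bool need_reconnect = false;

    static SendMessageResult Sent() { return {true, false, false}; }
    static SendMessageResult Queued(bool reconnect) { return {false, true, reconnect}; }
};

// Flushes buffered messages after a stream has been (re)connected.
// Returns how many messages were sent before the loop stopped.
std::size_t FlushPending(
    std::deque<std::string> &pending,
    const std::function<SendMessageResult(const std::string &)> &send)
{
    std::size_t sent_cnt = 0;
    while (!pending.empty())
    {
        std::string msg = std::move(pending.front());
        pending.pop_front();
        const SendMessageResult send_result = send(msg);
        if (send_result.need_reconnect)
        {
            // The stream broke again. The sender has already re-queued this
            // message internally, so only the rest stays in `pending`.
            assert(send_result.queued_for_retry);
            break;
        }
        if (send_result.sent)
        {
            ++sent_cnt;
        }
    }
    return sent_cnt;
}
```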

890-909: LGTM!

Consistent handling for long message streams with the same pattern as regular messages.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/src/remote/cc_stream_sender.cpp (1)

713-719: Avoid potential hot-spin when EAGAIN keeps re-queuing.
Right now (sent || queued_for_retry) increments send_cnt, so if sends repeatedly hit EAGAIN (i.e., Queued(false)), the inner resend loop can churn the same node’s queue without backoff. Consider treating Queued(false) as “no progress” (break/yield) or add a small backoff on that outcome.

Also applies to: 761-767
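
A possible shape for the "no progress" handling is sketched here. It is a simplified model rather than the actual resend thread: DrainQueue and ResendOutcome are hypothetical, messages are plain ints, and a message that hits EAGAIN stays at the front of the local queue instead of being re-buffered by the sender.

```cpp
#include <cstddef>
#include <deque>
#include <functional>

struct SendMessageResult
{
    bool sent = false;
    bool queued_for_retry = false;
    bool need_reconnect = false;
};

struct ResendOutcome
{
    std::size_t sent_cnt;  // messages that actually left the queue as sent
    bool should_back_off;  // true if the stream reported EAGAIN again
};

// Drains one node's resend queue, stopping at the first non-sent outcome.
ResendOutcome DrainQueue(std::deque<int> &queue,
                         const std::function<SendMessageResult(int)> &send)
{
    ResendOutcome out{0, false};
    while (!queue.empty())
    {
        const SendMessageResult res = send(queue.front());
        if (res.sent)
        {
            queue.pop_front();
            ++out.sent_cnt;
            continue;
        }
        if (res.queued_for_retry && !res.need_reconnect)
        {
            // Queued(false): the stream is still full. Count no progress and
            // let the caller sleep or yield before retrying this node.
            out.should_back_off = true;
            break;
        }
        // Queued(true) or a hard failure: the message is now owned by the
        // reconnect path or dropped, so stop working on this stream.
        queue.pop_front();
        break;
    }
    return out;
}
```

The caller would check should_back_off and sleep or move on to another node, so repeated EAGAIN results do not turn into a busy loop.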

♻️ Duplicate comments (1)
tx_service/src/remote/cc_stream_sender.cpp (1)

461-516: Standby send reconnection-on-non-EAGAIN is implemented (matches prior feedback).

🧹 Nitpick comments (1)
tx_service/include/remote/cc_stream_sender.h (1)

74-127: API shape looks good: SendMessageResult + dedicated standby send clarifies semantics.
Optional: consider whether SendMessageResult (or the send methods) should be [[nodiscard]] to discourage accidental ignoring of {queued_for_retry, need_reconnect} states.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 1bb9d53 and 94e75a4.

📒 Files selected for processing (6)
  • tx_service/include/remote/cc_stream_sender.h (2 hunks)
  • tx_service/src/cc/cc_shard.cpp (2 hunks)
  • tx_service/src/dead_lock_check.cpp (1 hunks)
  • tx_service/src/fault/fault_inject.cpp (1 hunks)
  • tx_service/src/remote/cc_stream_sender.cpp (17 hunks)
  • tx_service/src/remote/remote_cc_handler.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/src/remote/remote_cc_handler.cpp
🧰 Additional context used
🧬 Code graph analysis (2)
tx_service/include/remote/cc_stream_sender.h (1)
tx_service/src/remote/cc_stream_sender.cpp (8)
  • SendMessageToNg (528-535)
  • SendMessageToNg (528-531)
  • SendMessageToNode (158-321)
  • SendMessageToNode (158-161)
  • SendScanRespToNode (323-459)
  • SendScanRespToNode (323-324)
  • SendStandbyMessageToNode (461-516)
  • SendStandbyMessageToNode (461-462)
tx_service/src/remote/cc_stream_sender.cpp (1)
tx_service/include/remote/cc_stream_sender.h (2)
  • Queued (85-93)
  • Sent (80-94)
🔇 Additional comments (7)
tx_service/src/dead_lock_check.cpp (1)

210-216: Good adaptation to SendMessageResult (treat queued retry as non-fatal).
This aligns with the new send contract and avoids falsely treating “queued for retry” as an immediate hard failure.

tx_service/src/fault/fault_inject.cpp (1)

203-205: Send result handling matches new API semantics.
Using (sent || queued_for_retry) is consistent with the new “queued is not fatal” meaning.

tx_service/src/cc/cc_shard.cpp (2)

2710-2712: Standby forwarding correctly switches to SendStandbyMessageToNode.
Only potential footgun: if stream_sender_ can still be null after lazy init, this would crash—consider a defensive null check if not guaranteed by invariant.


2842-2844: Retry path correctly uses standby-specific send API.
Same note as above: consider guarding against a null stream_sender_ if that’s possible in practice.

tx_service/src/remote/cc_stream_sender.cpp (3)

158-205: SendMessageToNode migration to SendMessageResult looks consistent.
Return values line up with the documented meaning (sent vs queued, reconnect-needed encoded via need_reconnect).

Also applies to: 255-316, 320-321


323-366: SendScanRespToNode migration to SendMessageResult looks consistent.

Also applies to: 415-459


845-864: Reconnect-aware resend loop is directionally good; double-check drop semantics on Failed().
These loops only special-case need_reconnect; if a resend returns Failed() (not queued), the message is effectively dropped after dequeue. If that’s intended, OK; otherwise you may want explicit handling (e.g., re-enqueue remaining batch and/or mark stream for reconnect).

Also applies to: 906-924

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tx_service/src/remote/cc_stream_sender.cpp (2)

262-262: Consider documenting the message discard policy.

The TODO comment indicates uncertainty about whether to discard messages that fail after reconnection. This design decision impacts reliability guarantees and should be documented or resolved.


845-865: Consider documenting or enforcing the invariant that need_reconnect=true requires queued_for_retry=true.

The assertions at lines 853 and 914 assume that when need_reconnect is true, queued_for_retry must also be true. This invariant holds throughout the current codebase—all paths that set need_reconnect=true use Queued(true), which also sets queued_for_retry=true. However, the SendMessageResult API structure allows Failed(true) which would return {sent=false, queued_for_retry=false, need_reconnect=true}, creating a theoretically invalid state.

To prevent this type of violation in future code, consider either:

  1. Adding a comment documenting this invariant on the SendMessageResult struct
  2. Modifying the factory methods to enforce it structurally (e.g., remove the need_reconnect parameter from Failed())
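
Option 2 could be sketched as follows: Failed() loses its parameter, so no factory can produce need_reconnect without queued_for_retry. IsConsistent is a hypothetical checker added for illustration; because the fields stay public in this sketch, it still catches results built by hand.

```cpp
struct SendMessageResult
{
    bool sent = false;
    bool queued_for_retry = false;
    bool need_reconnect = false;

    static SendMessageResult Sent() { return {true, false, false}; }
    static SendMessageResult Queued(bool reconnect) { return {false, true, reconnect}; }
    // No parameter: a failed send can no longer claim need_reconnect.
    static SendMessageResult Failed() { return {false, false, false}; }

    // The invariant the resend-loop assertions rely on:
    // need_reconnect implies queued_for_retry.
    bool IsConsistent() const
    {
        return !need_reconnect || queued_for_retry;
    }
};
```

Every value produced by the three factories satisfies IsConsistent(); a hand-built {false, false, true} does not, which is the state the review calls theoretically invalid.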
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between c52e483 and 50a63e6.

📒 Files selected for processing (6)
  • tx_service/include/remote/cc_stream_sender.h (2 hunks)
  • tx_service/src/cc/cc_shard.cpp (2 hunks)
  • tx_service/src/dead_lock_check.cpp (1 hunks)
  • tx_service/src/fault/fault_inject.cpp (1 hunks)
  • tx_service/src/remote/cc_stream_sender.cpp (17 hunks)
  • tx_service/src/remote/remote_cc_handler.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tx_service/src/cc/cc_shard.cpp
🔇 Additional comments (10)
tx_service/src/remote/remote_cc_handler.cpp (1)

868-873: LGTM! Correctly interprets the new SendMessageResult semantics.

The change properly handles both the sent and queued_for_retry states as success conditions for the fire-and-forget broadcast statistics operation.

tx_service/src/dead_lock_check.cpp (1)

210-213: LGTM! Consistent with the new result semantics.

The change correctly interprets sent || queued_for_retry as successful submission for the deadlock detection coordination message.

tx_service/src/fault/fault_inject.cpp (1)

203-205: LGTM! Proper fallback logic with the new result type.

The change correctly uses sent || queued_for_retry to determine whether the remote fault injection succeeded, falling back to local execution only when both are false.

tx_service/include/remote/cc_stream_sender.h (2)

74-94: LGTM! Well-designed result structure.

The SendMessageResult struct with static factory methods provides a clear, self-documenting API for the multi-state send outcomes (Sent, Queued, Failed). The three boolean fields effectively capture the necessary state for callers to handle retries and reconnections.


104-127: LGTM! Clear API contracts with good documentation.

The updated signatures returning SendMessageResult provide better visibility into send outcomes compared to simple boolean returns. The documentation for SendStandbyMessageToNode clearly distinguishes its best-effort semantics from the retry-enabled SendMessageToNode.

tx_service/src/remote/cc_stream_sender.cpp (5)

158-321: LGTM! Clear result semantics with proper error handling.

The updated SendMessageToNode correctly returns SendMessageResult with appropriate state for each error path:

  • Failed() when destination unknown or resend fails after reconnect
  • Queued(true) when stream is connecting or needs reconnection
  • Queued(false) on EAGAIN (transient backpressure)
  • Sent() on immediate success

The need_reconnect field properly distinguishes between transient (EAGAIN) and persistent errors.


323-459: LGTM! Consistent with SendMessageToNode logic.

The implementation correctly applies the same SendMessageResult semantics to long message streams, maintaining consistency across the codebase.


461-516: LGTM! Best-effort semantics properly implemented.

The SendStandbyMessageToNode correctly implements fire-and-forget semantics without retry queuing. The reconnection logic for non-EAGAIN errors (lines 498-511) properly addresses the past review comment, ensuring prompt stream recovery rather than relying on idle timeout.


528-535: LGTM! Clean delegation pattern.

The wrapper correctly propagates the SendMessageResult from SendMessageToNode.


713-719: LGTM! Proper resend logic with the new result type.

The background resend thread correctly interprets both sent and queued_for_retry as progress, and breaks to the next stream when neither succeeds, preventing wasted retries on a persistently broken connection.

Also applies to: 761-767

@MrGuin MrGuin merged commit 0c6d867 into main Dec 15, 2025
4 checks passed
@MrGuin MrGuin deleted the fix_stream branch December 15, 2025 09:57