[DEV-1437] Add leader failover reconnection tests#475

Merged
w1am merged 1 commit into master from w1am/test-leader-failover
Apr 13, 2026
Conversation

@w1am
Collaborator

@w1am w1am commented Apr 6, 2026

Cover reconnection behavior for both gRPC (write) and Rust bridge (read) code paths during leader failover scenarios including node kill/resurrect, NotLeader error recovery, concurrent operations, and repeated reads with requiresLeader.

test: fix leader failover tests for dual-connection architecture

- Remove redundant write reconnection test (covered by UnavailableError.test.ts)
- Remove test.only marker
- Remove customer name references from comments
- Fix "readStream after kill" test with retry loop for Rust client stabilization
- Fix "mixed read/write" test to trigger NotLeader on both gRPC and bridge paths
@w1am w1am force-pushed the w1am/test-leader-failover branch from e72d149 to 1170335 on April 9, 2026 09:24
@w1am w1am changed the title from "add leader failover reconnection tests" to "Add leader failover reconnection tests" Apr 13, 2026
@w1am w1am marked this pull request as ready for review April 13, 2026 10:21
@qodo-code-review

Review Summary by Qodo

Add leader failover reconnection tests for dual-connection architecture

🧪 Tests


Walkthroughs

Description
• Add comprehensive leader failover reconnection tests for dual-connection architecture
• Cover gRPC write path with leader kill/resurrect scenarios
• Cover Rust bridge read path with NotLeader error recovery
• Test concurrent operations and mixed read/write during failover
Diagram
flowchart LR
  A["Test Suite"] --> B["Write Operations<br/>gRPC Path"]
  A --> C["Read Operations<br/>Rust Bridge Path"]
  A --> D["Concurrent Operations<br/>Mixed Paths"]
  B --> B1["Leader Kill/Resurrect"]
  C --> C1["NotLeader Error Recovery"]
  C --> C2["Cluster Stabilization"]
  D --> D1["Parallel Writes"]
  D --> D2["Mixed Read/Write"]


File Changes

1. packages/test/src/connection/reconnect/leader-failover.test.ts 🧪 Tests +441/-0

Leader failover reconnection test suite

• New test file with 441 lines covering leader failover reconnection scenarios
• Tests write operations (gRPC path) with leader kill/resurrect and reconnection validation
• Tests read operations (Rust bridge path) with NotLeader error handling and recovery
• Tests concurrent write operations and mixed read/write operations during failover
• Includes retry loops for Rust client stabilization and cluster recovery delays




@qodo-code-review

qodo-code-review bot commented Apr 13, 2026

Code Review by Qodo

🐞 Bugs (4)   📘 Rule violations (0)   📎 Requirement gaps (0)   🖥 UI issues (0)   🎨 UX issues (0)
Bugs: ≡ Correctness (2) · ☼ Reliability (2)



Action required

1. Missing failover assertions 🐞
Description
The leader-resurrection write test never asserts that the post-kill append actually failed and never
verifies the client did not reconnect to the resurrected (now follower) node, so it can pass while
not validating the behavior described in the test comments.
Code

packages/test/src/connection/reconnect/leader-failover.test.ts[R57-81]

+      // First operation should fail
+      try {
+        await client.appendToStream(
+          "resurrect-stream",
+          jsonEvent({ type: "should-fail", data: { message: "test" } }),
+          { credentials: { username: "admin", password: "changeit" } }
+        );
+      } catch (error) {
+        // Expected failure
+      }
+
+      // Resurrect the killed leader (it comes back as follower)
+      await cluster.resurrect();
+      await delay(5_000);
+
+      // Subsequent operations should succeed on the new leader
+      const afterResurrect = await client.appendToStream(
+        "resurrect-stream",
+        jsonEvent({
+          type: "after-resurrect",
+          data: { message: "test" },
+        }),
+        { credentials: { username: "admin", password: "changeit" } }
+      );
+      expect(afterResurrect).toBeDefined();
Evidence
The test explicitly claims it must not reconnect to the resurrected follower, but it only swallows
the expected failure and then checks that a later append is defined; it never validates that an
error occurred nor that the connection endpoint changed away from the killed node.

packages/test/src/connection/reconnect/leader-failover.test.ts[29-35]
packages/test/src/connection/reconnect/leader-failover.test.ts[51-66]
packages/test/src/connection/reconnect/leader-failover.test.ts[68-82]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The test `should reconnect after leader is killed and resurrected as follower` can pass even if the post-kill operation does not fail and even if the client reconnects back to the resurrected node, because it does not assert either condition.

### Issue Context
The test’s comment requires: "Client should NOT reconnect to the old (now follower) node", but the test only checks that a later append returns a defined result.

### Fix Focus Areas
- packages/test/src/connection/reconnect/leader-failover.test.ts[29-82]

### Suggested changes
- Replace the post-kill `try/catch` with an explicit rejection assertion (e.g., `await expect(client.appendToStream(...)).rejects.toBeDefined()` or `rejects.toBeInstanceOf(UnavailableError)` if appropriate).
- After the successful post-resurrect append, call `getCurrentConnection(client)` again and assert it is **not** equal to the original `leaderConnection` (or assert it matches the newly elected leader if you can derive that reliably).
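The endpoint check suggested above can be sketched as a plain comparison. This is a hedged illustration only: the `Endpoint` shape and the captured connection values are assumptions based on the review text, not the verified client API.

```typescript
// Hypothetical endpoint shape; the real `getCurrentConnection(client)` return
// type may differ.
interface Endpoint {
  address: string;
  port: number;
}

function sameEndpoint(a: Endpoint, b: Endpoint): boolean {
  return a.address === b.address && a.port === b.port;
}

// Captured before the kill (illustrative values):
const leaderConnection: Endpoint = { address: "node1", port: 2113 };
// Re-read after the post-resurrect append succeeds (illustrative values):
const currentConnection: Endpoint = { address: "node3", port: 2113 };

// The client must have moved off the killed (now follower) node.
console.log(!sameEndpoint(currentConnection, leaderConnection)); // true
```

In the real test, `currentConnection` would come from calling `getCurrentConnection(client)` again after the post-resurrect append, and the assertion would be `expect(sameEndpoint(...)).toBe(false)`.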



2. Caught errors not asserted 🐞
Description
Multiple tests catch and ignore errors without asserting the expected error type (e.g.,
NotLeaderError) or even asserting that an error occurred, which can hide regressions and allow
unintended success/failure modes to slip through.
Code

packages/test/src/connection/reconnect/leader-failover.test.ts[R119-131]

+      // Read with requiresLeader: true should fail on follower
+      try {
+        const events = await collect(
+          client.readStream("read-reconnect-stream", {
+            maxCount: 10,
+            fromRevision: START,
+            requiresLeader: true,
+          })
+        );
+        expect(events).toBe("unreachable");
+      } catch (error) {
+        // Expected: NotLeaderError from the Rust bridge
+      }
Evidence
Several try/catch blocks treat errors as “expected” but never assert what was thrown; for
Rust-bridge reads, the library explicitly maps bridge NotLeader errors into NotLeaderError, so the
tests can and should validate the specific exception.

packages/test/src/connection/reconnect/leader-failover.test.ts[119-131]
packages/test/src/connection/reconnect/leader-failover.test.ts[157-169]
packages/test/src/connection/reconnect/leader-failover.test.ts[220-230]
packages/test/src/connection/reconnect/leader-failover.test.ts[274-285]
packages/db-client/src/utils/convertBridgeError.ts[29-44]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The tests currently swallow “expected” errors (NotLeader, node-down) without asserting they actually happened and/or without asserting the error type. This reduces the tests’ ability to detect regressions and makes failures harder to diagnose.

### Issue Context
For Rust bridge reads, `convertBridgeError` maps bridge errors into typed JS errors (including `NotLeaderError`), so tests can validate behavior precisely.

### Fix Focus Areas
- packages/test/src/connection/reconnect/leader-failover.test.ts[57-66]
- packages/test/src/connection/reconnect/leader-failover.test.ts[119-131]
- packages/test/src/connection/reconnect/leader-failover.test.ts[157-169]
- packages/test/src/connection/reconnect/leader-failover.test.ts[220-230]
- packages/test/src/connection/reconnect/leader-failover.test.ts[274-285]

### Suggested changes
- Import and assert specific error classes where appropriate (e.g., `NotLeaderError`, `UnavailableError`).
- Use `await expect(promise).rejects.toBeInstanceOf(NotLeaderError)` (or `.toThrow(NotLeaderError)`), rather than empty catch blocks.
- Where a rejection is required for the test’s logic, add an assertion that ensures the failure path executed (e.g., `expect.assertions(n)` or a `let failed = false` guard).
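The pattern behind these suggestions can be shown without the real client. The sketch below defines a stand-in `NotLeaderError` and a minimal analogue of Jest's `await expect(p).rejects.toBeInstanceOf(Cls)`; both names are illustrative, not the actual `@db-client` exports.

```typescript
// Hypothetical stand-in for the client's NotLeaderError class.
class NotLeaderError extends Error {
  constructor(public leaderEndpoint?: string) {
    super("not leader");
    this.name = "NotLeaderError";
  }
}

// Minimal analogue of `await expect(promise).rejects.toBeInstanceOf(errorClass)`:
// resolves with the error if it matches, otherwise fails loudly. Unlike an
// empty catch block, a resolved promise is also a failure.
async function assertRejectsWith<T extends Error>(
  promise: Promise<unknown>,
  errorClass: new (...args: any[]) => T
): Promise<T> {
  try {
    await promise;
  } catch (error) {
    if (error instanceof errorClass) return error;
    throw new Error(`rejected with ${String(error)}, expected ${errorClass.name}`);
  }
  throw new Error(`promise resolved, expected rejection with ${errorClass.name}`);
}

// Demo: a fake read that fails the way a follower would.
async function readFromFollower(): Promise<never> {
  throw new NotLeaderError("node2:2113");
}

assertRejectsWith(readFromFollower(), NotLeaderError).then((err) =>
  console.log(err.name) // NotLeaderError
);
```

In the actual suite, Jest's built-in `rejects.toBeInstanceOf(NotLeaderError)` does the same job, and `expect.assertions(n)` guards that the failure path really executed.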




Remediation recommended

3. Resurrect leader readiness assumed 🐞
Description
Tests assume a fixed delay after cluster.resurrect() is sufficient, but resurrect() only waits
for liveness and does not wait for leader election, which can cause intermittent failures when
operations run before the cluster is stable.
Code

packages/test/src/connection/reconnect/leader-failover.test.ts[R68-71]

+      // Resurrect the killed leader (it comes back as follower)
+      await cluster.resurrect();
+      await delay(5_000);
+
Evidence
Cluster.resurrect() returns after waiting for /health/live only, whereas leader election
readiness is a separate check (leaderElected) used during initial up(). The new tests use
hardcoded sleeps after resurrection instead of waiting for leader election or retrying
leader-dependent operations.

packages/test/src/utils/Cluster.ts[385-392]
packages/test/src/utils/Cluster.ts[503-542]
packages/test/src/connection/reconnect/leader-failover.test.ts[68-71]
packages/test/src/connection/reconnect/leader-failover.test.ts[232-235]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
After `cluster.resurrect()`, tests use fixed sleeps before proceeding. Since `resurrect()` only checks liveness, leader election may still be in progress, making tests timing-dependent.

### Issue Context
`Cluster.up()` waits for leader election for multi-node clusters, but `Cluster.resurrect()` does not.

### Fix Focus Areas
- packages/test/src/utils/Cluster.ts[385-396]
- packages/test/src/connection/reconnect/leader-failover.test.ts[68-82]
- packages/test/src/connection/reconnect/leader-failover.test.ts[232-254]

### Suggested changes
- Prefer polling/attempt loops (as already done later in this file) for post-resurrect operations rather than a single fixed `delay(5_000)`.
- Optionally extend `Cluster.resurrect()` to wait for leader election when `count > 1` (e.g., call an internal/public leader-election wait) so all tests get consistent semantics.
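A readiness poll of the kind suggested above might look like the following. This is a generic sketch, not the project's `Cluster` API: the `leaderElected` check here is a simulated stand-in for whatever leader-election probe the cluster exposes.

```typescript
// Generic replacement for a fixed `delay(5_000)` after `cluster.resurrect()`:
// poll a condition until it holds or a deadline passes.
async function waitUntil(
  check: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 500 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`condition not met within ${timeoutMs}ms`);
}

// Demo: a condition that becomes true on the third poll, standing in for a
// real leader-election probe.
let polls = 0;
async function leaderElected(): Promise<boolean> {
  polls += 1;
  return polls >= 3;
}

waitUntil(leaderElected, { timeoutMs: 5_000, intervalMs: 10 }).then(() =>
  console.log(`elected after ${polls} polls`) // elected after 3 polls
);
```

Folding such a wait into `Cluster.resurrect()` itself (when `count > 1`) would give every test the same post-resurrect semantics instead of per-test sleeps.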



4. Concurrent reconnect race in test 🐞
Description
The concurrent write test assumes reconnection has completed immediately after the first NotLeader
failure, but reconnection is triggered asynchronously (handleError is not awaited), so parallel
appends can race with reconnection and intermittently reject.
Code

packages/test/src/connection/reconnect/leader-failover.test.ts[R331-379]

+      // Trigger NotLeader error with requiresLeader
+      try {
+        await client.appendToStream(
+          "concurrent-stream",
+          jsonEvent({ type: "trigger-not-leader", data: {} }),
+          {
+            requiresLeader: true,
+            credentials: { username: "admin", password: "changeit" },
+          }
+        );
+        expect("should not reach").toBe("here");
+      } catch (error) {
+        // Expected: NotLeaderError triggers reconnection
+      }
+
+      // Now fire multiple operations rapidly in parallel
+      // If handleError has a race condition, some of these could fail
+      const results = await Promise.allSettled([
+        client.appendToStream(
+          "concurrent-stream",
+          jsonEvent({ type: "concurrent-1", data: {} }),
+          {
+            requiresLeader: true,
+            credentials: { username: "admin", password: "changeit" },
+          }
+        ),
+        client.appendToStream(
+          "concurrent-stream",
+          jsonEvent({ type: "concurrent-2", data: {} }),
+          {
+            requiresLeader: true,
+            credentials: { username: "admin", password: "changeit" },
+          }
+        ),
+        client.appendToStream(
+          "concurrent-stream",
+          jsonEvent({ type: "concurrent-3", data: {} }),
+          {
+            requiresLeader: true,
+            credentials: { username: "admin", password: "changeit" },
+          }
+        ),
+      ]);
+
+      // All operations should succeed since the client should have
+      // reconnected to the leader after the first NotLeader error
+      for (const result of results) {
+        expect(result.status).toBe("fulfilled");
+      }
Evidence
After the NotLeader-triggering append, the test immediately starts parallel leader-required appends
and expects all fulfilled. In the client, execute() calls handleError() without await, meaning
the reconnection state update can still be in flight when subsequent operations begin.

packages/test/src/connection/reconnect/leader-failover.test.ts[331-379]
packages/db-client/src/Client/index.ts[466-478]
packages/db-client/src/Client/index.ts[530-564]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The test fires concurrent `requiresLeader` appends immediately after a NotLeader-triggering call and asserts all are fulfilled. Because reconnection is initiated asynchronously, this can race and produce intermittent failures.

### Issue Context
`Client.execute()` triggers `handleError()` but does not await it, so reconnection may not be fully applied before the next operations run.

### Fix Focus Areas
- packages/test/src/connection/reconnect/leader-failover.test.ts[314-380]
- packages/db-client/src/Client/index.ts[466-478]

### Suggested changes
- Add a small “reconnection gate” before launching parallel writes, e.g. retry a single `appendToStream(... requiresLeader: true ...)` until it succeeds, then run `Promise.allSettled`.
- Alternatively, relax the assertion to allow transient failures and retry rejected results (if the goal is to validate eventual success under concurrency).
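The "reconnection gate" idea can be sketched generically: retry one leader-required operation until it succeeds, and only then fire the parallel batch. `appendOnce` below is a simulated stand-in for a single `appendToStream(..., { requiresLeader: true })` call, not the real client method.

```typescript
// Retry an operation until it fulfills, absorbing transient failures such as
// NotLeader errors thrown while reconnection is still in flight.
async function retryUntilFulfilled<T>(
  op: () => Promise<T>,
  attempts = 10,
  backoffMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (error) {
      lastError = error;
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
  throw lastError;
}

// Demo: the first two attempts race with "reconnection" and fail.
let calls = 0;
async function appendOnce(): Promise<string> {
  calls += 1;
  if (calls < 3) throw new Error("NotLeader");
  return "ok";
}

async function main() {
  // The gate: one operation must succeed before the concurrent batch starts.
  await retryUntilFulfilled(appendOnce, 10, 1);
  const results = await Promise.allSettled([appendOnce(), appendOnce(), appendOnce()]);
  console.log(results.every((r) => r.status === "fulfilled")); // true
}
main();
```

With the gate in place, asserting that every `Promise.allSettled` result is fulfilled no longer races against the unawaited `handleError()` reconnection path.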




@w1am w1am added the cherry-pick:release/v1.1 (Cherry picks PR into v1.1 release branch) and cherry-pick:release/v1.2 labels Apr 13, 2026
@w1am w1am merged commit 72f5dda into master Apr 13, 2026
30 checks passed
@w1am w1am deleted the w1am/test-leader-failover branch April 13, 2026 10:23

@github-actions github-actions bot left a comment


@w1am 👉 Created pull request targeting release/v1.1: #477


@github-actions github-actions bot left a comment


@w1am 👉 Created pull request targeting release/v1.2: #478

@w1am w1am changed the title from "Add leader failover reconnection tests" to "[DEV-1437] Add leader failover reconnection tests" Apr 13, 2026
@linear

linear bot commented Apr 13, 2026

