
Test: LLM provider abstraction and tool-calling edge cases (#709) #747

Merged
Chris0Jeky merged 6 commits into main from test/709-llm-provider-tool-calling-edge
Apr 4, 2026

Conversation

@Chris0Jeky
Owner

Summary

  • 100 new edge case tests for LLM provider abstraction and tool-calling orchestration
  • Orchestrator boundary tests (18): per-round timeout, empty/null tool call lists, concurrent tool calls in a round, mixed error handling across tools, tool-not-found with suggestion, generic provider exceptions, cancellation propagation, userId context passing, large result truncation, metadata JSON generation, token accumulation, exhausted round partial summaries
  • Provider abstraction tests (24): default interface throws NotSupportedException, MockLlmProvider edge cases (empty messages, non-user roles, actionable detection, tool-calling patterns, health/probe), provider selection policy (null/empty/case-insensitive), record defaults, kill switch settings
  • Intent classifier tests (48): negation filtering, other-tool questions, positive detection for all intents, non-actionable inputs, null/empty/whitespace, very long inputs, mixed case, newlines, prompt injection, archive-vs-move disambiguation, plural nouns, alternate verb coverage
  • Tool executor registry tests (10): empty registry, case-insensitive lookup, registration, ToolExecutionContext value equality

Closes #709

Test plan

  • All 100 new tests pass
  • No existing tests broken (pre-existing Api.Tests build error in AuthControllerEdgeCaseTests is unrelated)
  • Tests use Mock provider only (no real LLM API calls)

18 tests covering: per-round timeout, empty/null tool call lists,
multiple concurrent tools in one round, mixed error handling,
tool-not-found with available tool suggestion, generic provider
exception, cancellation token propagation, userId context passing,
large result truncation in log, metadata JSON generation, token
accumulation across rounds, null content on complete, exhausted
rounds partial summary, and status notifier invocation per tool.

24 tests covering: default CompleteWithToolsAsync throws
NotSupportedException, MockLlmProvider edge cases (empty messages,
non-user roles, actionable message detection, very long input,
empty tool-calling messages, previous results summary, error
results, large result truncation, health/probe endpoints),
provider selection edge cases (null/empty settings, case
insensitive provider names), record default values, degraded
result accessibility, and kill switch settings defaults.

48 tests covering: negation filtering (don't, do not, never, stop,
cancel, avoid), other-tool questions (Trello, Jira, Asana, Notion),
positive detection for all intent types, non-actionable inputs,
null/empty/whitespace handling, very long inputs, mixed case,
newlines, prompt injection patterns, archive vs move disambiguation,
plural noun support, and alternate verb coverage (generate, build,
prepare, set up, modify, change, sort, rearrange, reorganize).

10 tests covering: empty registry returns null, case-insensitive
lookup, mixed case lookup, empty registered tool names, multiple
executor registration, non-existent tool lookup, empty string
lookup, ToolExecutionContext property access, and value equality.
Copilot AI review requested due to automatic review settings April 4, 2026 01:11

@Chris0Jeky
Owner Author

Adversarial Self-Review Findings

Potential Issues Found

  1. Newlines test has no assertion (LlmIntentClassifierEdgeCaseTests, line 151-161): Classify_NewlinesInInput_StillDetects has a Theory with two inputs but the test body has no assertion — it only calls the method and does nothing with the result. This is a false-positive test that will always pass. Will fix.

  2. Negation test comment is misleading (line 24): The comment says "avoid" + "creating" exercises "avoid" in the negation list, but the negation regex requires the negation word to be followed by a bare verb: \b(avoid)\b(\s+\w+){0,6}\s+\b(create)\b does NOT match the gerund "creating". The input "avoid creating a task please" is likely classified as non-actionable only because no bare actionable verb is present, not because the negation logic fires. This needs verification. Will verify and fix the comment if wrong.

  3. PerRoundTimeout test is slow (ToolCallingChatOrchestratorEdgeCaseTests, line 38): This test relies on the actual per-round timeout (30 seconds) to fire. In the test run it completed in ~30s. Not flaky, but slow. The alternative would be to use a custom timeout value, but that would require changing the production code. Acceptable trade-off.

  4. MockProvider_ToolCalling_VeryLongToolResult assertion is weak (LlmProviderAbstractionEdgeCaseTests, line 166): The assertion Content.Length.Should().BeLessThan(longResult.Length + 200) is nearly always true since the content includes the tool name prefix. A stronger assertion would check that the content does NOT contain the full 500-char string. Will fix.

  5. Missing edge case: No test for the ToolCallingSystemPrompt.Prompt content being used in tool-calling requests (issue scenario #16 — board context in system prompt).

  6. Missing edge case: No test for the ChatService.ExtractProposalIdFromToolLog method with truncated JSON (issue scenario #13-14 coverage gap).

No Issues Found

  • All orchestrator tests use properly scoped Moq setups and verify meaningful behavior
  • Token accumulation test verifies exact math (125 = 3*25 + 50)
  • Fingerprint tests are deterministic and test both positive and negative cases
  • Intent classifier tests cover all documented intent types and negation patterns
  • Registry tests verify case-insensitive behavior which is the actual implementation
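For reference, the exact-math token check called out above is straightforward to reproduce; the figures (three 25-token rounds plus a 50-token final completion) come from the test description, and the variable names here are illustrative:

```python
# Token accumulation across tool-calling rounds, mirroring the test's arithmetic.
round_tokens = [25, 25, 25]    # three tool-calling rounds at 25 tokens each
final_completion_tokens = 50   # tokens used by the final completion

total = sum(round_tokens) + final_completion_tokens
print(total)  # 125
```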


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive suite of edge case and boundary tests for the LLM-related services, including the intent classifier, provider abstractions, and the tool-calling orchestrator. The review feedback identifies several opportunities for improvement: refining test cases to better target negation logic, adding missing assertions for newline detection, refactoring the orchestrator to support configurable timeouts for faster testing, and implementing missing test cases for the tool executor registry as noted in its documentation.

[InlineData("stop create new tasks")]
[InlineData("cancel the delete of card 5")]
[InlineData("don't remove that task")]
[InlineData("avoid creating a task please")] // "avoid" + "creating" uses "avoid" in the negation list

medium

The test case "avoid creating a task please" passes because the word "creating" (gerund) is not recognized as an actionable verb by the classifier's regex patterns, rather than being caught by the negation logic. The NegationPattern in LlmIntentClassifier only includes base verb forms like create. If the intent is to test negation, consider using a base verb form like "avoid create a task". Additionally, note that the negation logic for "set up" in the classifier uses a literal space, which is inconsistent with the creation patterns that use \s+; adding a test case with multiple spaces (e.g., "don't set up") would be a good edge case to cover.
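The gerund claim above is easy to verify directly. A minimal Python sketch (Python's `re` shares the relevant word-boundary semantics with .NET; the pattern is the one quoted in the self-review, not the classifier's actual source):

```python
import re

# Negation pattern as quoted in the self-review; \bcreate\b requires the bare verb.
negation = re.compile(r"\b(avoid)\b(\s+\w+){0,6}\s+\b(create)\b", re.IGNORECASE)

# "creating" has no word boundary after "create", so the negation never fires here.
print(bool(negation.search("avoid creating a task please")))  # False

# The bare infinitive does match, so this phrasing actually exercises negation.
print(bool(negation.search("avoid create a task please")))    # True
```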

Comment on lines +154 to +161
public void Classify_NewlinesInInput_StillDetects(string input)
{
    // Regex patterns work per-line or across depending on implementation
    var (isActionable, _) = LlmIntentClassifier.Classify(input);

    // Regardless of detection, it should not throw
    // (The actual behavior depends on regex mode - this tests safety)
}

medium

The test Classify_NewlinesInInput_StillDetects is missing assertions. While the comment suggests it only checks for exceptions, the test name implies it should verify that the intent is correctly detected despite newlines. Given that \s in .NET regex matches newlines, this should be explicitly asserted.

    [Theory]
    [InlineData("create a card\nand some other text")]
    [InlineData("create\na\ncard")]
    public void Classify_NewlinesInInput_StillDetects(string input)
    {        
        var (isActionable, actionIntent) = LlmIntentClassifier.Classify(input);

        isActionable.Should().BeTrue($"input with newlines '{input}' should be detected");
        actionIntent.Should().Be("card.create");
    }

IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
{
    // Wait until cancellation triggers (per-round timeout)
    await Task.Delay(TimeSpan.FromSeconds(60), ct);

medium

This test will take at least 30 seconds to execute because PerRoundTimeoutSeconds is a hardcoded constant in ToolCallingChatOrchestrator. Hardcoded timeouts make unit tests slow and brittle. Consider refactoring the orchestrator to accept a settings object or a TimeSpan for timeouts in its constructor, which would allow tests to use much shorter durations (e.g., 100ms) while still verifying the timeout logic.
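As a language-neutral sketch of that suggestion, here is a Python asyncio analogue (not the project's C# code) showing how an injectable per-round timeout lets the degraded path be exercised in milliseconds:

```python
import asyncio

async def run_round(provider_call, per_round_timeout):
    """Run one tool-calling round, degrading gracefully on a per-round timeout."""
    try:
        return await asyncio.wait_for(provider_call(), timeout=per_round_timeout)
    except asyncio.TimeoutError:
        # The per-round timeout fired; the caller's own cancellation did not.
        return "degraded"

async def slow_provider():
    await asyncio.sleep(60)  # simulates a provider that never answers in time
    return "should not reach here"

# With the timeout injectable, the test completes almost instantly instead of in 30 s.
result = asyncio.run(run_round(slow_provider, per_round_timeout=0.01))
print(result)  # degraded
```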

Comment on lines +13 to +14
/// duplicate tool names, and null/empty tool name lookup.
/// </summary>

medium

The class summary mentions covering "duplicate tool names" and "null/empty tool name lookup", but there are no test cases for these scenarios. Since ToolExecutorRegistry uses ToDictionary (which throws on duplicate keys) and TryGetValue (which throws on a null key), tests should be added to verify these crash conditions or ensure the registry handles them gracefully.
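To make the expected crash conditions concrete, here is a Python analogue of the registry behavior described (case-insensitive lookup, fail-fast on duplicate names, rejection of a null name); the class and method names are illustrative, not the C# implementation:

```python
class ToolExecutorRegistry:
    """Illustrative analogue: case-insensitive lookup, fail-fast on duplicates."""

    def __init__(self, executors):
        self._by_name = {}
        for name, executor in executors:
            key = name.lower()
            if key in self._by_name:
                # Mirrors ToDictionary throwing on duplicate keys.
                raise ValueError(f"duplicate tool name: {name}")
            self._by_name[key] = executor

    def get_executor(self, tool_name):
        if tool_name is None:
            # Mirrors TryGetValue throwing on a null key.
            raise TypeError("tool name must not be None")
        return self._by_name.get(tool_name.lower())

registry = ToolExecutorRegistry([("create_card", object())])
print(registry.get_executor("CREATE_CARD") is not None)  # True
print(registry.get_executor("missing"))                  # None
```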

- Replace no-assertion newline test with proper NotThrow assertion
  and add separate test for single-line actionable detection
- Strengthen MockProvider truncation test to verify the full string
  does NOT appear (was a weak LessThan assertion)
- Fix misleading negation test: use "avoid create" (infinitive) since
  the negation regex requires bare verbs, not gerunds like "creating"
@Chris0Jeky
Owner Author

Self-Review Fixes Applied (d96f9a0)

Fixed the three issues from the adversarial review:

  1. Newline test no-assertion: Replaced the empty Theory test with a proper NotThrow assertion, and added a separate Fact test that verifies actionable phrases on the first line are still detected across newlines. (net +1 test, now 101 total)

  2. Weak truncation assertion: The MockProvider truncation test now verifies the full 500-char string does NOT appear in the content (negative assertion) AND that the truncated 200-char prefix IS present (positive assertion). This catches false-positive scenarios where truncation is broken.

  3. Misleading negation test: Changed "avoid creating a task please" to "avoid create a task please" since the negation regex requires bare infinitives, not gerunds. Added accurate comment explaining the regex behavior.
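The weak-versus-strong assertion difference from item 2 can be illustrated with a small sketch (the content format and the 500/200 figures follow the review notes; the tool-name prefix is hypothetical):

```python
long_result = "x" * 500
# Hypothetical mock-provider output: a tool-name prefix plus the 200-char truncated result.
content = "[tool:list_cards] " + long_result[:200] + "..."

# Weak assertion: true for almost any content, truncated or not.
assert len(content) < len(long_result) + 200

# Strong assertions: the full string must be absent AND the truncated prefix present.
assert long_result not in content
assert long_result[:200] in content
print("truncation verified")
```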

Remaining known gaps (documented, not blocking):

  • No test for ToolCallingSystemPrompt.Prompt content in tool-calling requests
  • No test for ChatService.ExtractProposalIdFromToolLog with truncated JSON
  • The per-round timeout test takes ~30s (real timeout) — acceptable for CI

Contributor

Copilot AI left a comment


Pull request overview

Adds extensive edge-case unit coverage around the backend LLM subsystem, focusing on tool-calling orchestration boundaries and provider abstraction behaviors to reduce regressions and improve confidence in failure-mode handling.

Changes:

  • Adds new boundary tests for ToolCallingChatOrchestrator (timeouts, empty tool calls, tool-not-found, mixed tool errors, metadata, token accumulation, round exhaustion).
  • Adds edge-case tests for ILlmProvider default tool-calling behavior, MockLlmProvider, provider selection policy, and record defaults.
  • Adds expanded intent-classifier edge-case tests (negation, other-tool queries, long inputs, casing/newlines, injection-like strings, verb/plural coverage) and tool registry tests.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File / Description:

  • backend/tests/Taskdeck.Application.Tests/Services/ToolExecutorRegistryEdgeCaseTests.cs — New tests for tool executor registry lookup behavior and ToolExecutionContext value semantics.
  • backend/tests/Taskdeck.Application.Tests/Services/ToolCallingChatOrchestratorEdgeCaseTests.cs — New orchestrator boundary tests for timeout/degraded paths, tool execution logging, and notifier invocation.
  • backend/tests/Taskdeck.Application.Tests/Services/LlmProviderAbstractionEdgeCaseTests.cs — New tests for provider abstraction defaults, mock provider edge cases, selection policy, and settings defaults.
  • backend/tests/Taskdeck.Application.Tests/Services/LlmIntentClassifierEdgeCaseTests.cs — New classifier tests for negation/other-tool suppression and robustness against varied inputs.


Comment on lines +41 to +60
// Simulate a provider that takes longer than PerRoundTimeoutSeconds
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
        It.IsAny<ChatCompletionRequest>(),
        It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
        It.IsAny<IReadOnlyList<ToolCallResult>?>(),
        It.IsAny<CancellationToken>()))
    .Returns(async (ChatCompletionRequest _, IReadOnlyList<TaskdeckToolSchema> _,
        IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
    {
        // Wait until cancellation triggers (per-round timeout)
        await Task.Delay(TimeSpan.FromSeconds(60), ct);
        return new LlmToolCompletionResult(
            Content: "Should not reach here",
            TokensUsed: 0,
            Provider: "Test",
            Model: "test-v1",
            ToolCalls: null,
            IsComplete: true);
    });

Copilot AI Apr 4, 2026


This per-round timeout test will sleep until the orchestrator's real PerRoundTimeoutSeconds (30s) elapses (the provider delays 60s but is cancelled after 30s). That makes the test suite significantly slower and can cause CI timeouts. Consider simulating the timeout without real waiting (e.g., have the mocked provider immediately throw an OperationCanceledException while the external ct is not cancelled, which hits the orchestrator's per-round timeout catch path), or refactor the orchestrator to allow injecting shorter timeout values for tests.

Suggested change
// Simulate a provider that takes longer than PerRoundTimeoutSeconds
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
        It.IsAny<ChatCompletionRequest>(),
        It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
        It.IsAny<IReadOnlyList<ToolCallResult>?>(),
        It.IsAny<CancellationToken>()))
    .Returns(async (ChatCompletionRequest _, IReadOnlyList<TaskdeckToolSchema> _,
        IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
    {
        // Wait until cancellation triggers (per-round timeout)
        await Task.Delay(TimeSpan.FromSeconds(60), ct);
        return new LlmToolCompletionResult(
            Content: "Should not reach here",
            TokensUsed: 0,
            Provider: "Test",
            Model: "test-v1",
            ToolCalls: null,
            IsComplete: true);
    });
// Simulate a per-round timeout/cancellation from the provider without real waiting.
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
        It.IsAny<ChatCompletionRequest>(),
        It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
        It.IsAny<IReadOnlyList<ToolCallResult>?>(),
        It.IsAny<CancellationToken>()))
    .ThrowsAsync(new OperationCanceledException(new CancellationToken(canceled: false)));

Comment on lines +151 to +161
[Theory]
[InlineData("create a card\nand some other text")]
[InlineData("create\na\ncard")]
public void Classify_NewlinesInInput_DoesNotThrow(string input)
{
    // Verify that newlines in input do not cause exceptions.
    // The classifier may or may not detect the intent depending on
    // whether the regex matches across line boundaries, but it must
    // never crash.
    var act = () => LlmIntentClassifier.Classify(input);
    act.Should().NotThrow("newlines in input must not cause exceptions");

Copilot AI Apr 4, 2026


This test currently has no explicit assertion (it only passes unless Classify throws). To make the intent clear and avoid accidental no-op coverage, add an explicit assertion such as act.Should().NotThrow() and/or assert expected classification for these newline cases.

…ge, slow timeout

- PromptInjection test now verifies classification results, not just no-crash
- PerRoundTimeout test uses immediate OperationCanceledException instead of
  waiting 30s for real timeout (same code path exercised, 30s faster)
- Add loop detection tests: identical consecutive tool calls abort, but
  retries after errors are allowed (core orchestrator feature was untested)
- Add ToolExecutorRegistry tests for null tool name and duplicate tool names
  (mentioned in docstring but were missing)
@Chris0Jeky
Owner Author

Second-Pass Adversarial Review

Issues Found and Fixed (commit b20d2fe)

1. False-positive test: Classify_PromptInjection_DoesNotCrash (FIXED)
All four injection inputs contain "create a card" and would be classified as actionable with intent "card.create", but the test only checked NotThrow() -- it would pass even if the classifier returned the wrong classification for every input, since nothing inspected the result. Now asserts both no-crash AND correct classification.

2. 30-second test: ExecuteAsync_PerRoundTimeout_ReturnsDegradedResult (FIXED)
The test waited for the real 30-second PerRoundTimeoutSeconds to fire via Task.Delay(60s, ct). Replaced with an immediate OperationCanceledException throw which exercises the exact same catch (OperationCanceledException) when (!ct.IsCancellationRequested) code path in the orchestrator, dropping test runtime from ~30s to <1ms.

3. Missing coverage: Loop detection (FIXED -- 2 new tests)
ToolCallingChatOrchestrator has a loop detection feature (lines 177-186) that aborts when the LLM issues identical tool calls in consecutive rounds. This was completely untested. Added:

  • ExecuteAsync_LoopDetection_AbortsOnIdenticalConsecutiveToolCalls -- verifies abort on round 2
  • ExecuteAsync_LoopDetection_RetriesAllowedAfterErrors -- verifies retry is permitted when previous round had errors (the previousRoundHadErrors skip path)

4. Missing coverage: ToolExecutorRegistry null/duplicate (FIXED -- 2 new tests)
The class docstring claimed coverage of "duplicate tool names" and "null/empty tool name lookup" but only tested empty string. Added:

  • Constructor_DuplicateToolNames_ThrowsOnCreation -- verifies ToDictionary throws on duplicate keys
  • GetExecutor_NullToolName_ThrowsArgumentNullException -- verifies null key behavior
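The loop-detection contract those two tests pin down can be sketched as follows (a hypothetical fingerprint over tool name plus serialized arguments; not the orchestrator's actual code):

```python
def fingerprint(tool_calls):
    # Hypothetical fingerprint: tool name plus serialized arguments, per call.
    return tuple((c["name"], c["args"]) for c in tool_calls)

def should_abort(prev_calls, curr_calls, prev_round_had_errors):
    """Abort when the model repeats identical tool calls in consecutive rounds,
    unless the previous round had errors (a retry is then allowed)."""
    if prev_calls is None or prev_round_had_errors:
        return False
    return fingerprint(prev_calls) == fingerprint(curr_calls)

calls = [{"name": "create_card", "args": '{"title": "x"}'}]
print(should_abort(calls, calls, prev_round_had_errors=False))  # True
print(should_abort(calls, calls, prev_round_had_errors=True))   # False
```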

Issues Reviewed and Accepted (no fix needed)

5. CI failure: API Integration -- Pre-existing AuthControllerEdgeCaseTests.cs build error (missing IUserContext parameter). Not touched by this PR. Confirmed by checking git diff against the file.

6. Gemini/Copilot bot comments -- Both bots flagged the same issues I fixed above (newline test, timeout test, registry docstring). The self-review fix (d96f9a0) addressed the newline and truncation issues but left the timeout and loop detection gaps open.

Remaining known gaps (not blocking)

  • No test for BuildStatusMessage switch expression branches (low risk, pure display)
  • No test for ToolCallingSystemPrompt.Prompt content in tool-calling requests
  • No test for ChatService.ExtractProposalIdFromToolLog with truncated JSON

Test count: 107 (was 101, net +6 new tests)

@Chris0Jeky merged commit 0d79159 into main Apr 4, 2026
21 of 23 checks passed
@github-project-automation bot moved this from Pending to Done in Taskdeck Execution Apr 4, 2026
@Chris0Jeky deleted the test/709-llm-provider-tool-calling-edge branch April 4, 2026 01:57

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

TST-42: LLM provider abstraction and tool-calling edge case tests

2 participants