
Test: LLM provider abstraction and tool-calling edge cases (#709) #747

Merged
Chris0Jeky merged 6 commits into main from test/709-llm-provider-tool-calling-edge
Apr 4, 2026

Conversation

@Chris0Jeky
Owner

Summary

  • 100 new edge case tests for LLM provider abstraction and tool-calling orchestration
  • Orchestrator boundary tests (18): per-round timeout, empty/null tool call lists, concurrent tool calls in a round, mixed error handling across tools, tool-not-found with suggestion, generic provider exceptions, cancellation propagation, userId context passing, large result truncation, metadata JSON generation, token accumulation, exhausted round partial summaries
  • Provider abstraction tests (24): default interface throws NotSupportedException, MockLlmProvider edge cases (empty messages, non-user roles, actionable detection, tool-calling patterns, health/probe), provider selection policy (null/empty/case-insensitive), record defaults, kill switch settings
  • Intent classifier tests (48): negation filtering, other-tool questions, positive detection for all intents, non-actionable inputs, null/empty/whitespace, very long inputs, mixed case, newlines, prompt injection, archive-vs-move disambiguation, plural nouns, alternate verb coverage
  • Tool executor registry tests (10): empty registry, case-insensitive lookup, registration, ToolExecutionContext value equality

Closes #709

Test plan

  • All 100 new tests pass
  • No existing tests broken (pre-existing Api.Tests build error in AuthControllerEdgeCaseTests is unrelated)
  • Tests use Mock provider only (no real LLM API calls)

18 tests covering: per-round timeout, empty/null tool call lists,
multiple concurrent tools in one round, mixed error handling,
tool-not-found with available tool suggestion, generic provider
exception, cancellation token propagation, userId context passing,
large result truncation in log, metadata JSON generation, token
accumulation across rounds, null content on complete, exhausted
rounds partial summary, and status notifier invocation per tool.

24 tests covering: default CompleteWithToolsAsync throws
NotSupportedException, MockLlmProvider edge cases (empty messages,
non-user roles, actionable message detection, very long input,
empty tool-calling messages, previous results summary, error
results, large result truncation, health/probe endpoints),
provider selection edge cases (null/empty settings, case
insensitive provider names), record default values, degraded
result accessibility, and kill switch settings defaults.

48 tests covering: negation filtering (don't, do not, never, stop,
cancel, avoid), other-tool questions (Trello, Jira, Asana, Notion),
positive detection for all intent types, non-actionable inputs,
null/empty/whitespace handling, very long inputs, mixed case,
newlines, prompt injection patterns, archive vs move disambiguation,
plural noun support, and alternate verb coverage (generate, build,
prepare, set up, modify, change, sort, rearrange, reorganize).

10 tests covering: empty registry returns null, case-insensitive
lookup, mixed case lookup, empty registered tool names, multiple
executor registration, non-existent tool lookup, empty string
lookup, ToolExecutionContext property access, and value equality.
Copilot AI review requested due to automatic review settings April 4, 2026 01:11

@Chris0Jeky
Owner Author

Adversarial Self-Review Findings

Potential Issues Found

  1. Newlines test has no assertion (LlmIntentClassifierEdgeCaseTests, line 151-161): Classify_NewlinesInInput_StillDetects has a Theory with two inputs but the test body has no assertion — it only calls the method and does nothing with the result. This is a false-positive test that will always pass. Will fix.

  2. Negation test comment is misleading (line 24): The comment says "avoid" + "creating" exercises "avoid" in the negation list, but the negation regex requires the negation word to be followed by a bare verb: \b(avoid)\b(\s+\w+){0,6}\s+\b(create)\b does NOT match the gerund "creating". The input "avoid creating a task please" is likely classified as non-actionable only because no bare actionable verb is present, not because the negation logic fires. This needs verification. Will verify and fix the comment if wrong.

  3. PerRoundTimeout test is slow (ToolCallingChatOrchestratorEdgeCaseTests, line 38): This test relies on the actual per-round timeout (30 seconds) to fire. In the test run it completed in ~30s. Not flaky, but slow. The alternative would be to use a custom timeout value, but that would require changing the production code. Acceptable trade-off.

  4. MockProvider_ToolCalling_VeryLongToolResult assertion is weak (LlmProviderAbstractionEdgeCaseTests, line 166): The assertion Content.Length.Should().BeLessThan(longResult.Length + 200) is nearly always true since the content includes the tool name prefix. A stronger assertion would check that the content does NOT contain the full 500-char string. Will fix.

  5. Missing edge case: No test for the ToolCallingSystemPrompt.Prompt content being used in tool-calling requests (issue scenario #16 — board context in system prompt).

  6. Missing edge case: No test for the ChatService.ExtractProposalIdFromToolLog method with truncated JSON (issue scenario #13-14 coverage gap).

No Issues Found

  • All orchestrator tests use properly scoped Moq setups and verify meaningful behavior
  • Token accumulation test verifies exact math (125 = 3*25 + 50)
  • Fingerprint tests are deterministic and test both positive and negative cases
  • Intent classifier tests cover all documented intent types and negation patterns
  • Registry tests verify case-insensitive behavior which is the actual implementation
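For reference, the exact-math token check called out above is straightforward to reproduce; the figures (three 25-token rounds plus a 50-token final completion) come from the test description, and the variable names here are illustrative:

```python
# Token accumulation across tool-calling rounds, mirroring the test's arithmetic.
round_tokens = [25, 25, 25]    # three tool-calling rounds at 25 tokens each
final_completion_tokens = 50   # tokens used by the final completion

total = sum(round_tokens) + final_completion_tokens
print(total)  # 125
```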


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive suite of edge case and boundary tests for the LLM-related services, including the intent classifier, provider abstractions, and the tool-calling orchestrator. The review feedback identifies several opportunities for improvement: refining test cases to better target negation logic, adding missing assertions for newline detection, refactoring the orchestrator to support configurable timeouts for faster testing, and implementing missing test cases for the tool executor registry as noted in its documentation.

[InlineData("stop create new tasks")]
[InlineData("cancel the delete of card 5")]
[InlineData("don't remove that task")]
[InlineData("avoid creating a task please")] // "avoid" + "creating" uses "avoid" in the negation list

medium

The test case "avoid creating a task please" passes because the word "creating" (gerund) is not recognized as an actionable verb by the classifier's regex patterns, rather than being caught by the negation logic. The NegationPattern in LlmIntentClassifier only includes base verb forms like create. If the intent is to test negation, consider using a base verb form like "avoid create a task". Additionally, note that the negation logic for "set up" in the classifier uses a literal space, which is inconsistent with the creation patterns that use \s+; adding a test case with multiple spaces (e.g., "don't set up") would be a good edge case to cover.
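The gerund claim above is easy to verify directly. A minimal Python sketch (Python's `re` shares the relevant word-boundary semantics with .NET; the pattern is the one quoted in the self-review, not the classifier's actual source):

```python
import re

# Negation pattern as quoted in the self-review; \bcreate\b requires the bare verb.
negation = re.compile(r"\b(avoid)\b(\s+\w+){0,6}\s+\b(create)\b", re.IGNORECASE)

# "creating" has no word boundary after "create", so the negation never fires here.
print(bool(negation.search("avoid creating a task please")))  # False

# The bare infinitive does match, so this phrasing actually exercises negation.
print(bool(negation.search("avoid create a task please")))    # True
```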

Comment on lines +154 to +161
public void Classify_NewlinesInInput_StillDetects(string input)
{
    // Regex patterns work per-line or across depending on implementation
    var (isActionable, _) = LlmIntentClassifier.Classify(input);

    // Regardless of detection, it should not throw
    // (The actual behavior depends on regex mode - this tests safety)
}

medium

The test Classify_NewlinesInInput_StillDetects is missing assertions. While the comment suggests it only checks for exceptions, the test name implies it should verify that the intent is correctly detected despite newlines. Given that \s in .NET regex matches newlines, this should be explicitly asserted.

    [Theory]
    [InlineData("create a card\nand some other text")]
    [InlineData("create\na\ncard")]
    public void Classify_NewlinesInInput_StillDetects(string input)
    {        
        var (isActionable, actionIntent) = LlmIntentClassifier.Classify(input);

        isActionable.Should().BeTrue($"input with newlines '{input}' should be detected");
        actionIntent.Should().Be("card.create");
    }

IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
{
    // Wait until cancellation triggers (per-round timeout)
    await Task.Delay(TimeSpan.FromSeconds(60), ct);

medium

This test will take at least 30 seconds to execute because PerRoundTimeoutSeconds is a hardcoded constant in ToolCallingChatOrchestrator. Hardcoded timeouts make unit tests slow and brittle. Consider refactoring the orchestrator to accept a settings object or a TimeSpan for timeouts in its constructor, which would allow tests to use much shorter durations (e.g., 100ms) while still verifying the timeout logic.
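As a language-neutral sketch of that suggestion, here is a Python asyncio analogue (not the project's C# code) showing how an injectable per-round timeout lets the degraded path be exercised in milliseconds:

```python
import asyncio

async def run_round(provider_call, per_round_timeout):
    """Run one tool-calling round, degrading gracefully on a per-round timeout."""
    try:
        return await asyncio.wait_for(provider_call(), timeout=per_round_timeout)
    except asyncio.TimeoutError:
        # The per-round timeout fired; the caller's own cancellation did not.
        return "degraded"

async def slow_provider():
    await asyncio.sleep(60)  # simulates a provider that never answers in time
    return "should not reach here"

# With the timeout injectable, the test completes almost instantly instead of in 30 s.
result = asyncio.run(run_round(slow_provider, per_round_timeout=0.01))
print(result)  # degraded
```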

Comment on lines +13 to +14
/// duplicate tool names, and null/empty tool name lookup.
/// </summary>

medium

The class summary mentions covering "duplicate tool names" and "null/empty tool name lookup", but there are no test cases for these scenarios. Since ToolExecutorRegistry uses ToDictionary (which throws on duplicate keys) and TryGetValue (which throws on a null key), tests should be added to verify these crash conditions or ensure the registry handles them gracefully.
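To make the expected crash conditions concrete, here is a Python analogue of the registry behavior described (case-insensitive lookup, fail-fast on duplicate names, rejection of a null name); the class and method names are illustrative, not the C# implementation:

```python
class ToolExecutorRegistry:
    """Illustrative analogue: case-insensitive lookup, fail-fast on duplicates."""

    def __init__(self, executors):
        self._by_name = {}
        for name, executor in executors:
            key = name.lower()
            if key in self._by_name:
                # Mirrors ToDictionary throwing on duplicate keys.
                raise ValueError(f"duplicate tool name: {name}")
            self._by_name[key] = executor

    def get_executor(self, tool_name):
        if tool_name is None:
            # Mirrors TryGetValue throwing on a null key.
            raise TypeError("tool name must not be None")
        return self._by_name.get(tool_name.lower())

registry = ToolExecutorRegistry([("create_card", object())])
print(registry.get_executor("CREATE_CARD") is not None)  # True
print(registry.get_executor("missing"))                  # None
```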

- Replace no-assertion newline test with proper NotThrow assertion
  and add separate test for single-line actionable detection
- Strengthen MockProvider truncation test to verify the full string
  does NOT appear (was a weak LessThan assertion)
- Fix misleading negation test: use "avoid create" (infinitive) since
  the negation regex requires bare verbs, not gerunds like "creating"
@Chris0Jeky
Owner Author

Self-Review Fixes Applied (d96f9a0)

Fixed the three issues from the adversarial review:

  1. Newline test no-assertion: Replaced the empty Theory test with a proper NotThrow assertion, and added a separate Fact test that verifies actionable phrases on the first line are still detected across newlines. (net +1 test, now 101 total)

  2. Weak truncation assertion: The MockProvider truncation test now verifies the full 500-char string does NOT appear in the content (negative assertion) AND that the truncated 200-char prefix IS present (positive assertion). This catches false-positive scenarios where truncation is broken.

  3. Misleading negation test: Changed "avoid creating a task please" to "avoid create a task please" since the negation regex requires bare infinitives, not gerunds. Added accurate comment explaining the regex behavior.
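The weak-versus-strong assertion difference from item 2 can be illustrated with a small sketch (the content format and the 500/200 figures follow the review notes; the tool-name prefix is hypothetical):

```python
long_result = "x" * 500
# Hypothetical mock-provider output: a tool-name prefix plus the 200-char truncated result.
content = "[tool:list_cards] " + long_result[:200] + "..."

# Weak assertion: true for almost any content, truncated or not.
assert len(content) < len(long_result) + 200

# Strong assertions: the full string must be absent AND the truncated prefix present.
assert long_result not in content
assert long_result[:200] in content
print("truncation verified")
```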

Remaining known gaps (documented, not blocking):

  • No test for ToolCallingSystemPrompt.Prompt content in tool-calling requests
  • No test for ChatService.ExtractProposalIdFromToolLog with truncated JSON
  • The per-round timeout test takes ~30s (real timeout) — acceptable for CI

Contributor

Copilot AI left a comment


Pull request overview

Adds extensive edge-case unit coverage around the backend LLM subsystem, focusing on tool-calling orchestration boundaries and provider abstraction behaviors to reduce regressions and improve confidence in failure-mode handling.

Changes:

  • Adds new boundary tests for ToolCallingChatOrchestrator (timeouts, empty tool calls, tool-not-found, mixed tool errors, metadata, token accumulation, round exhaustion).
  • Adds edge-case tests for ILlmProvider default tool-calling behavior, MockLlmProvider, provider selection policy, and record defaults.
  • Adds expanded intent-classifier edge-case tests (negation, other-tool queries, long inputs, casing/newlines, injection-like strings, verb/plural coverage) and tool registry tests.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File / Description:

  • backend/tests/Taskdeck.Application.Tests/Services/ToolExecutorRegistryEdgeCaseTests.cs — New tests for tool executor registry lookup behavior and ToolExecutionContext value semantics.
  • backend/tests/Taskdeck.Application.Tests/Services/ToolCallingChatOrchestratorEdgeCaseTests.cs — New orchestrator boundary tests for timeout/degraded paths, tool execution logging, and notifier invocation.
  • backend/tests/Taskdeck.Application.Tests/Services/LlmProviderAbstractionEdgeCaseTests.cs — New tests for provider abstraction defaults, mock provider edge cases, selection policy, and settings defaults.
  • backend/tests/Taskdeck.Application.Tests/Services/LlmIntentClassifierEdgeCaseTests.cs — New classifier tests for negation/other-tool suppression and robustness against varied inputs.


Comment on lines +41 to +60
// Simulate a provider that takes longer than PerRoundTimeoutSeconds
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
        It.IsAny<ChatCompletionRequest>(),
        It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
        It.IsAny<IReadOnlyList<ToolCallResult>?>(),
        It.IsAny<CancellationToken>()))
    .Returns(async (ChatCompletionRequest _, IReadOnlyList<TaskdeckToolSchema> _,
        IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
    {
        // Wait until cancellation triggers (per-round timeout)
        await Task.Delay(TimeSpan.FromSeconds(60), ct);
        return new LlmToolCompletionResult(
            Content: "Should not reach here",
            TokensUsed: 0,
            Provider: "Test",
            Model: "test-v1",
            ToolCalls: null,
            IsComplete: true);
    });

Copilot AI Apr 4, 2026


This per-round timeout test will sleep until the orchestrator's real PerRoundTimeoutSeconds (30s) elapses (the provider delays 60s but is cancelled after 30s). That makes the test suite significantly slower and can cause CI timeouts. Consider simulating the timeout without real waiting (e.g., have the mocked provider immediately throw an OperationCanceledException while the external ct is not cancelled, which hits the orchestrator's per-round timeout catch path), or refactor the orchestrator to allow injecting shorter timeout values for tests.

Suggested change
// Simulate a provider that takes longer than PerRoundTimeoutSeconds
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
        It.IsAny<ChatCompletionRequest>(),
        It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
        It.IsAny<IReadOnlyList<ToolCallResult>?>(),
        It.IsAny<CancellationToken>()))
    .Returns(async (ChatCompletionRequest _, IReadOnlyList<TaskdeckToolSchema> _,
        IReadOnlyList<ToolCallResult>? _, CancellationToken ct) =>
    {
        // Wait until cancellation triggers (per-round timeout)
        await Task.Delay(TimeSpan.FromSeconds(60), ct);
        return new LlmToolCompletionResult(
            Content: "Should not reach here",
            TokensUsed: 0,
            Provider: "Test",
            Model: "test-v1",
            ToolCalls: null,
            IsComplete: true);
    });
// Simulate a per-round timeout/cancellation from the provider without real waiting.
var mock = new Mock<ILlmProvider>();
mock.Setup(p => p.CompleteWithToolsAsync(
        It.IsAny<ChatCompletionRequest>(),
        It.IsAny<IReadOnlyList<TaskdeckToolSchema>>(),
        It.IsAny<IReadOnlyList<ToolCallResult>?>(),
        It.IsAny<CancellationToken>()))
    .ThrowsAsync(new OperationCanceledException(new CancellationToken(canceled: false)));

Comment on lines +151 to +161
[Theory]
[InlineData("create a card\nand some other text")]
[InlineData("create\na\ncard")]
public void Classify_NewlinesInInput_DoesNotThrow(string input)
{
    // Verify that newlines in input do not cause exceptions.
    // The classifier may or may not detect the intent depending on
    // whether the regex matches across line boundaries, but it must
    // never crash.
    var act = () => LlmIntentClassifier.Classify(input);
    act.Should().NotThrow("newlines in input must not cause exceptions");

Copilot AI Apr 4, 2026


This test currently has no explicit assertion (it only passes unless Classify throws). To make the intent clear and avoid accidental no-op coverage, add an explicit assertion such as act.Should().NotThrow() and/or assert expected classification for these newline cases.

…ge, slow timeout

- PromptInjection test now verifies classification results, not just no-crash
- PerRoundTimeout test uses immediate OperationCanceledException instead of
  waiting 30s for real timeout (same code path exercised, 30s faster)
- Add loop detection tests: identical consecutive tool calls abort, but
  retries after errors are allowed (core orchestrator feature was untested)
- Add ToolExecutorRegistry tests for null tool name and duplicate tool names
  (mentioned in docstring but were missing)
@Chris0Jeky
Owner Author

Second-Pass Adversarial Review

Issues Found and Fixed (commit b20d2fe)

1. False-positive test: Classify_PromptInjection_DoesNotCrash (FIXED)
All four injection inputs contain "create a card" and would be classified as actionable with intent "card.create", but the test only checked NotThrow() -- it would pass even if the classifier returned the wrong classification for every input, since nothing inspected the result. Now asserts both no-crash AND correct classification.

2. 30-second test: ExecuteAsync_PerRoundTimeout_ReturnsDegradedResult (FIXED)
The test waited for the real 30-second PerRoundTimeoutSeconds to fire via Task.Delay(60s, ct). Replaced with an immediate OperationCanceledException throw which exercises the exact same catch (OperationCanceledException) when (!ct.IsCancellationRequested) code path in the orchestrator, dropping test runtime from ~30s to <1ms.

3. Missing coverage: Loop detection (FIXED -- 2 new tests)
ToolCallingChatOrchestrator has a loop detection feature (lines 177-186) that aborts when the LLM issues identical tool calls in consecutive rounds. This was completely untested. Added:

  • ExecuteAsync_LoopDetection_AbortsOnIdenticalConsecutiveToolCalls -- verifies abort on round 2
  • ExecuteAsync_LoopDetection_RetriesAllowedAfterErrors -- verifies retry is permitted when previous round had errors (the previousRoundHadErrors skip path)

4. Missing coverage: ToolExecutorRegistry null/duplicate (FIXED -- 2 new tests)
The class docstring claimed coverage of "duplicate tool names" and "null/empty tool name lookup" but only tested empty string. Added:

  • Constructor_DuplicateToolNames_ThrowsOnCreation -- verifies ToDictionary throws on duplicate keys
  • GetExecutor_NullToolName_ThrowsArgumentNullException -- verifies null key behavior
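The loop-detection contract those two tests pin down can be sketched as follows (a hypothetical fingerprint over tool name plus serialized arguments; not the orchestrator's actual code):

```python
def fingerprint(tool_calls):
    # Hypothetical fingerprint: tool name plus serialized arguments, per call.
    return tuple((c["name"], c["args"]) for c in tool_calls)

def should_abort(prev_calls, curr_calls, prev_round_had_errors):
    """Abort when the model repeats identical tool calls in consecutive rounds,
    unless the previous round had errors (a retry is then allowed)."""
    if prev_calls is None or prev_round_had_errors:
        return False
    return fingerprint(prev_calls) == fingerprint(curr_calls)

calls = [{"name": "create_card", "args": '{"title": "x"}'}]
print(should_abort(calls, calls, prev_round_had_errors=False))  # True
print(should_abort(calls, calls, prev_round_had_errors=True))   # False
```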

Issues Reviewed and Accepted (no fix needed)

5. CI failure: API Integration -- Pre-existing AuthControllerEdgeCaseTests.cs build error (missing IUserContext parameter). Not touched by this PR. Confirmed by checking git diff against the file.

6. Gemini/Copilot bot comments -- Both bots flagged the same issues I fixed above (newline test, timeout test, registry docstring). The self-review fix (d96f9a0) addressed the newline and truncation issues but left the timeout and loop detection gaps open.

Remaining known gaps (not blocking)

  • No test for BuildStatusMessage switch expression branches (low risk, pure display)
  • No test for ToolCallingSystemPrompt.Prompt content in tool-calling requests
  • No test for ChatService.ExtractProposalIdFromToolLog with truncated JSON

Test count: 107 (was 101, net +6 new tests)

@Chris0Jeky merged commit 0d79159 into main Apr 4, 2026
21 of 23 checks passed
@github-project-automation bot moved this from Pending to Done in Taskdeck Execution Apr 4, 2026
@Chris0Jeky deleted the test/709-llm-provider-tool-calling-edge branch April 4, 2026 01:57

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

TST-42: LLM provider abstraction and tool-calling edge case tests

2 participants