feat: relax tool matching on resume #1537

ixchio · 2025-12-29T17:55:40Z

Summary

Only fail if a tool that was used in history is missing. Allow adding new tools freely when resuming conversations.

Changes

Add optional used_tools: set[str] | None param to resolve_diff_from_deserialized()
Extract used tool names from event history in state.py before reconciliation
When used_tools is provided, only validate that those tools exist in runtime
New tools can be added to resumed conversations without breaking anything
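
A minimal sketch of the rule the changes above describe, for readers who want to see it in isolation. This is not the code from the PR: `check_tools_on_resume` is a hypothetical helper that works on plain sets of tool names, whereas the real `resolve_diff_from_deserialized()` operates on agent objects. The tool name `TerminalTool` in the usage note is a placeholder.

```python
from typing import Optional


def check_tools_on_resume(
    runtime_tools: set[str],
    persisted_tools: set[str],
    used_tools: Optional[set[str]] = None,
) -> None:
    """Raise ValueError if the runtime tools cannot serve the persisted conversation.

    Hypothetical helper for illustration only.
    """
    if used_tools is None:
        # Strict mode (assumed to mirror the old behaviour): sets must match exactly.
        if runtime_tools != persisted_tools:
            raise ValueError(
                "Tools don't match between runtime and persisted agents. "
                f"Missing in runtime: {sorted(persisted_tools - runtime_tools)}. "
                f"Missing in persisted: {sorted(runtime_tools - persisted_tools)}."
            )
        return

    # Relaxed mode: only tools that actually appear in history must still exist.
    missing = used_tools - runtime_tools
    if missing:
        raise ValueError(
            f"Tools used in history are missing in runtime: {sorted(missing)}"
        )
```

With `used_tools={"TerminalTool"}`, adding `FileEditorTool` at runtime or dropping a never-used `FileEditorTool` both pass, while dropping `TerminalTool` raises.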

Tests

test_conversation_allows_removing_unused_tools - removing tools that were never used is allowed
test_conversation_allows_adding_new_tools - adding new tools when resuming is allowed

All existing reconciliation tests continue to pass.

Fixes #1533

@ryanhoangt

Closes OpenHands#1417
Implements early stopping mechanism to detect failures early and terminate behavior tests before full trajectory completes, reducing LLM costs.
Changes:
- Add early_stopper.py with 6 pruner classes:
  * EarlyStopperBase (abstract base)
  * FileEditPruner (detect forbidden file edits)
  * BashCommandPruner (detect forbidden commands)
  * TestExecutionPruner (detect excessive test runs)
  * CompositeEarlyStopper (combine multiple pruners)
  * LLMJudgePruner (periodic lightweight LLM checks)
- Integrate early stopping in BaseIntegrationTest callback
- Add get_early_stopper() hook in SoftwareAgentSDKBehaviorTest
- Apply early stoppers to b01, b02, b05 behavior tests
- Add 17 unit tests for early stopper functionality
Per discussion with @ryanhoangt:
- Pattern-based pruning first (zero LLM cost)
- Stop on first failure signal
- Skip final LLM judge when early stopped
- Reusable pruner classes with base interface
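
The pruner design in this commit message (a base interface, pattern-based pruners, and a composite that stops on the first failure signal) might look roughly like the sketch below. Only the class names come from the commit message. The `check()` method, the `EarlyStopResult` fields, and the use of plain dicts as events are assumptions made for illustration, not the contents of early_stopper.py.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EarlyStopResult:
    # Assumed shape: a flag plus an optional human-readable reason.
    should_stop: bool
    reason: Optional[str] = None


class EarlyStopperBase(ABC):
    @abstractmethod
    def check(self, events: Sequence[dict]) -> EarlyStopResult:
        """Inspect the events seen so far and decide whether to stop."""


class BashCommandPruner(EarlyStopperBase):
    """Pattern-based pruner: zero LLM cost, plain substring matching."""

    def __init__(self, forbidden_patterns: Sequence[str]) -> None:
        self.forbidden_patterns = list(forbidden_patterns)

    def check(self, events: Sequence[dict]) -> EarlyStopResult:
        for event in events:
            command = event.get("command", "")
            for pattern in self.forbidden_patterns:
                if pattern in command:
                    return EarlyStopResult(True, f"forbidden command: {pattern!r}")
        return EarlyStopResult(False)


class CompositeEarlyStopper(EarlyStopperBase):
    """Runs several pruners and returns the first failure signal."""

    def __init__(self, stoppers: Sequence[EarlyStopperBase]) -> None:
        self.stoppers = list(stoppers)

    def check(self, events: Sequence[dict]) -> EarlyStopResult:
        for stopper in self.stoppers:
            result = stopper.check(events)
            if result.should_stop:
                return result
        return EarlyStopResult(False)
```

A test callback could call `check()` after each new event and end the run as soon as `should_stop` is true, skipping the final LLM judge.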

…ributeError The early_stopper and early_stop_result attributes were being initialized AFTER LocalConversation was created, causing AttributeError when the callback accessed these attributes during test execution. This fixes the CI failures where all behavior tests failed with: '...Test' object has no attribute 'early_stopper'

- b02: Add TerminalTestAwarePruner that whitelists tests/tools/terminal/ paths
- b05: Add allowed_patterns to skip auto-generated files (.openhands/, __pycache__)
- b05: Increase max_creates to 3 to allow training script, README, and test file
This fixes the false positives where:
- b02 was blocking legitimate targeted terminal test runs
- b05 was blocking auto-generated framework files

- Fix trailing space in 'pytest tests/ ' pattern that wouldn't catch bare commands
- Rewrite tests to use real FileEditorAction/TerminalAction objects
- Fix line-too-long lint error in LLMJudgePruner prompt
- Remove unused imports and variables
All 17 unit tests passing.

- Use CommandLiteral type for file editor commands
- Add cast() for list covariance (list[ActionEvent] -> list[Event])
- Add null checks before string operations on result.reason
- Use getattr() pattern to avoid type narrowing issue with ImageContent
All 17 tests pass, pyright reports 0 errors.

- Remove TestExecutionPruner and LLMJudgePruner from early_stopper.py
- Simplify b02 to rely on LLM judge instead of pattern matching
- Remove RedundantFileCreationPruner from b05
- Keep FileEditPruner/BashCommandPruner/CompositeEarlyStopper as core infra
- b01 still gets early stopping (saves the most cost anyway)

Closes OpenHands#1504
- BaseWorkspace: raise NotImplementedError by default
- LocalWorkspace: no-op (nothing to pause on host filesystem)
- DockerWorkspace: docker pause/unpause commands
- APIRemoteWorkspace: expose /pause and /resume endpoints
- OpenHandsCloudWorkspace: pause() not supported yet, resume() works
Signed-off-by: ixchio <ruzvadpathan7058@gmail.com>
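
The first three workspace behaviours listed in this commit message could be sketched as follows. The class names match the commit message, but the constructor argument `container_id` and the method bodies are assumptions for illustration. The only external calls used are the `docker pause` and `docker unpause` CLI commands.

```python
import subprocess


class BaseWorkspace:
    def pause(self) -> None:
        # Default: subclasses must opt in to pause support.
        raise NotImplementedError(f"{type(self).__name__} does not support pause()")

    def resume(self) -> None:
        raise NotImplementedError(f"{type(self).__name__} does not support resume()")


class LocalWorkspace(BaseWorkspace):
    def pause(self) -> None:
        # No-op: there is nothing to pause on the host filesystem.
        return None

    def resume(self) -> None:
        return None


class DockerWorkspace(BaseWorkspace):
    def __init__(self, container_id: str) -> None:
        # Hypothetical constructor; the real class likely takes more arguments.
        self.container_id = container_id

    def pause(self) -> None:
        # Freezes all processes in the container.
        subprocess.run(["docker", "pause", self.container_id], check=True)

    def resume(self) -> None:
        subprocess.run(["docker", "unpause", self.container_id], check=True)
```

Raising in the base class makes unsupported workspaces fail loudly instead of silently pretending to pause.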

Only fail if a tool that was *used in history* is missing. Allow adding new tools freely when resuming conversations.
Changes:
- Add optional used_tools param to resolve_diff_from_deserialized()
- Extract used tool names from event history in state.py
- Add test_conversation_allows_removing_unused_tools
- Add test_conversation_allows_adding_new_tools
Fixes OpenHands#1533

neubig · 2025-12-29T18:01:56Z

@OpenHands /codereview-roasted including checking if you can (1) reproduce the issue with the previous code, and (2) run the new code and demonstrate that the problem that existed with the previous code is now fixed.

openhands-ai · 2025-12-29T18:02:05Z

I'm on it! neubig can track my progress at all-hands.dev

ixchio · 2025-12-29T18:04:49Z

Closing - recreating with clean single commit

openhands-ai · 2025-12-29T18:08:38Z

Code Review for PR #1537: "feat: relax tool matching on resume"

Verification Results

✅ Issue Reproduction (main branch)

Successfully reproduced the issue on the main branch:

Test 1 (Adding new tool): ❌ FAILED with "Tools don't match between runtime and persisted agents. Missing in persisted: {'FileEditorTool'}."
Test 2 (Removing unused tool): ❌ FAILED with "Tools don't match between runtime and persisted agents. Missing in runtime: {'FileEditorTool'}."

✅ Fix Verification (PR branch)

Successfully verified the fix on the relax-tool-matching-1533 branch:

Test 1 (Adding new tool): ✅ PASS - Adding new tools when resuming is allowed
Test 2 (Removing unused tool): ✅ PASS - Removing unused tools when resuming is allowed
All existing tests: ✅ 8 passed in test_agent_reconciliation.py
New workspace tests: ✅ 13 passed in test_workspace_pause_resume.py

Taste Rating: 🟡 Acceptable - Works but could be cleaner

Linus-Style Analysis

[IMPROVEMENT OPPORTUNITIES] (Should fix - violates good taste)

[base.py, Lines 386-394] Skipping Strict Comparison
When used_tools is provided, the entire strict model comparison is skipped. This means other agent configuration changes (beyond tools) could slip through without validation. Consider only skipping the tools comparison, not the entire model comparison.
[tests/cross/test_agent_reconciliation.py] Missing Critical Test Case
There's no test verifying that removing a tool that WAS actually used in history still fails. This is the critical safety check that ensures the fix doesn't break consistency. Add a test like:
```
def test_conversation_fails_when_removing_used_tool():
    """Test that removing a tool that was used in history fails."""
    # Create conversation, use a tool (create ActionEvent), then try to resume without it
```
[PR Scope] Unrelated Changes
The workspace pause() and resume() methods are unrelated to issue #1533 ("Allow adding new tools when resuming conversations without failing"), which is about tool matching. These should be in a separate PR to maintain clean commit history and easier review.
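
One possible shape for the missing test case called out above is sketched here against stand-in helpers rather than the real SDK. `used_tool_names`, `validate_used_tools`, the dict-based events, and the `TerminalTool` name are all hypothetical. A real test would build a conversation, record an ActionEvent, and resume with a different agent.

```python
def used_tool_names(history: list[dict]) -> set[str]:
    """Collect the names of tools that were actually invoked in history."""
    return {e["tool_name"] for e in history if e.get("kind") == "ActionEvent"}


def validate_used_tools(runtime_tools: set[str], used_tools: set[str]) -> None:
    """Stand-in for the relaxed check: only used tools must still exist."""
    missing = used_tools - runtime_tools
    if missing:
        raise ValueError(
            f"Tools used in history are missing in runtime: {sorted(missing)}"
        )


def test_conversation_fails_when_removing_used_tool() -> None:
    """Removing a tool that was used in history must still fail."""
    history = [
        {"kind": "MessageEvent"},
        {"kind": "ActionEvent", "tool_name": "FileEditorTool"},
    ]
    runtime_tools = {"TerminalTool"}  # FileEditorTool was dropped on resume
    try:
        validate_used_tools(runtime_tools, used_tool_names(history))
    except ValueError as exc:
        assert "FileEditorTool" in str(exc)
    else:
        raise AssertionError("expected resume to fail for a removed used tool")
```

The `try`/`else` form keeps the sketch free of test-framework imports; under pytest the same check would typically use `pytest.raises(ValueError)`.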

[STYLE NOTES] (Minor)

[base.py, Line 371] Comment Clarity
The comment "Update tools to match runtime (allows new tools to be added)" could be clearer about what's happening - the persisted agent's tools are being replaced with runtime tools.

VERDICT:

✅ Worth merging with minor improvements suggested

The core logic is sound and solves the real problem described in issue #1533. The implementation is pragmatic - only fail if tools that were actually used are missing, which provides a better user experience while maintaining consistency.

KEY INSIGHT:

The fix correctly identifies that the strict tool matching was overly conservative. By tracking which tools were actually used in history, the code can now allow flexible tool additions while still preventing the removal of tools that would break conversation continuity.


neubig · 2025-12-29T18:34:58Z

BTW @ixchio it's not necessary to create a single commit -- we squash merge all commits, so they get merged into a single commit anyway.

ixchio · 2025-12-29T18:41:04Z

> BTW @ixchio it's not necessary to create a single commit -- we squash merge all commits, so they get merged into a single commit anyway.

Ah good to know! Thanks for the tip - will skip the cleanup next time since you squash merge anyway. Recreated as #1538 with clean branch, but same code.

ixchio added 17 commits December 18, 2025 17:35

Merge branch 'main' into feat/early-stopping-cost-optimization

2a07ceb

Merge branch 'main' into feat/early-stopping-cost-optimization

00a6ae0

Merge branch 'main' into feat/early-stopping-cost-optimization

e8e9fa9

Merge branch 'main' into feat/early-stopping-cost-optimization

7e64a40

Merge branch 'main' into feat/early-stopping-cost-optimization

11c9bbd

Merge branch 'main' into feat/early-stopping-cost-optimization

f9bb004

Merge branch 'main' into feat/early-stopping-cost-optimization

37f5014

fix: remove redundant get_early_stopper override

be2dd9f

Merge branch 'main' into relax-tool-matching-1533

0a41ae6

neubig self-requested a review December 29, 2025 18:02

ixchio closed this Dec 29, 2025
