Conversation

@ixchio ixchio commented Dec 29, 2025

Summary

Only fail if a tool that was used in history is missing. Allow adding new tools freely when resuming conversations.

Changes

  • Add optional used_tools: set[str] | None param to resolve_diff_from_deserialized()
  • Extract used tool names from event history in state.py before reconciliation
  • When used_tools is provided, only validate that those tools exist in runtime
  • New tools can be added to resumed conversations without triggering a reconciliation failure

Tests

  • test_conversation_allows_removing_unused_tools - removing tools that were never used is allowed
  • test_conversation_allows_adding_new_tools - adding new tools when resuming is allowed

All existing reconciliation tests continue to pass.

Fixes #1533

Closes OpenHands#1417

Implements early stopping mechanism to detect failures early and terminate
behavior tests before full trajectory completes, reducing LLM costs.

Changes:
- Add early_stopper.py with 6 pruner classes:
  * EarlyStopperBase (abstract base)
  * FileEditPruner (detect forbidden file edits)
  * BashCommandPruner (detect forbidden commands)
  * TestExecutionPruner (detect excessive test runs)
  * CompositeEarlyStopper (combine multiple pruners)
  * LLMJudgePruner (periodic lightweight LLM checks)
- Integrate early stopping in BaseIntegrationTest callback
- Add get_early_stopper() hook in SoftwareAgentSDKBehaviorTest
- Apply early stoppers to b01, b02, b05 behavior tests
- Add 17 unit tests for early stopper functionality

Per discussion with @ryanhoangt:
- Pattern-based pruning first (zero LLM cost)
- Stop on first failure signal
- Skip final LLM judge when early stopped
- Reusable pruner classes with base interface
…ributeError

The early_stopper and early_stop_result attributes were being initialized
AFTER LocalConversation was created, causing AttributeError when the
callback accessed these attributes during test execution.

This fixes the CI failures where all behavior tests failed with:
'...Test' object has no attribute 'early_stopper'
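The ordering bug is easy to reproduce in miniature (`FakeConversation` is a stand-in for `LocalConversation`, which may invoke the callback during construction):

```python
class FakeConversation:
    """Stand-in for LocalConversation: may fire the callback during __init__."""

    def __init__(self, callback):
        callback({"kind": "start"})


class BehaviorTest:
    def __init__(self):
        # The fix: initialize these BEFORE the conversation is created, so a
        # callback firing during construction never hits an AttributeError.
        self.early_stopper = None
        self.early_stop_result = None
        self.conversation = FakeConversation(callback=self._on_event)

    def _on_event(self, event):
        # Reads self.early_stopper; this would raise AttributeError if the
        # attribute were assigned only after FakeConversation(...) returned.
        if self.early_stopper is None:
            self.early_stop_result = event
```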
- b02: Add TerminalTestAwarePruner that whitelists tests/tools/terminal/ paths
- b05: Add allowed_patterns to skip auto-generated files (.openhands/, __pycache__)
- b05: Increase max_creates to 3 to allow training script, README, and test file

This fixes the false positives where:
- b02 was blocking legitimate targeted terminal test runs
- b05 was blocking auto-generated framework files
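The whitelist idea can be sketched with `fnmatch` (the patterns below are illustrative, not the exact ones used in the b02/b05 tests):

```python
import fnmatch


def is_allowed(path: str, allowed_patterns: list[str]) -> bool:
    """Paths matching a whitelist pattern are never flagged by a pruner."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in allowed_patterns)


# Illustrative whitelist: targeted terminal test runs plus auto-generated files.
ALLOWED = ["tests/tools/terminal/*", ".openhands/*", "*__pycache__*"]
```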
- Fix trailing space in 'pytest tests/ ' pattern that wouldn't catch bare commands
- Rewrite tests to use real FileEditorAction/TerminalAction objects
- Fix line-too-long lint error in LLMJudgePruner prompt
- Remove unused imports and variables

All 17 unit tests passing.
- Use CommandLiteral type for file editor commands
- Add cast() for list covariance (list[ActionEvent] -> list[Event])
- Add null checks before string operations on result.reason
- Use getattr() pattern to avoid type narrowing issue with ImageContent

All 17 tests pass, pyright reports 0 errors.
- Remove TestExecutionPruner and LLMJudgePruner from early_stopper.py
- Simplify b02 to rely on LLM judge instead of pattern matching
- Remove RedundantFileCreationPruner from b05
- Keep FileEditPruner/BashCommandPruner/CompositeEarlyStopper as core infra
- b01 still gets early stopping (saves the most cost anyway)
Closes OpenHands#1504

- BaseWorkspace: raise NotImplementedError by default
- LocalWorkspace: no-op (nothing to pause on host filesystem)
- DockerWorkspace: docker pause/unpause commands
- APIRemoteWorkspace: expose /pause and /resume endpoints
- OpenHandsCloudWorkspace: pause() not supported yet, resume() works
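The hierarchy reads roughly like this (invoking the docker CLI via subprocess is an assumption; the API-backed and cloud variants are omitted):

```python
import subprocess


class BaseWorkspace:
    def pause(self) -> None:
        raise NotImplementedError

    def resume(self) -> None:
        raise NotImplementedError


class LocalWorkspace(BaseWorkspace):
    # Nothing to pause on the host filesystem: both operations are no-ops.
    def pause(self) -> None:
        pass

    def resume(self) -> None:
        pass


class DockerWorkspace(BaseWorkspace):
    def __init__(self, container_id: str):
        self.container_id = container_id

    def pause(self) -> None:
        subprocess.run(["docker", "pause", self.container_id], check=True)

    def resume(self) -> None:
        subprocess.run(["docker", "unpause", self.container_id], check=True)
```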

Signed-off-by: ixchio <ruzvadpathan7058@gmail.com>
Only fail if a tool that was *used in history* is missing.
Allow adding new tools freely when resuming conversations.

Changes:
- Add optional used_tools param to resolve_diff_from_deserialized()
- Extract used tool names from event history in state.py
- Add test_conversation_allows_removing_unused_tools
- Add test_conversation_allows_adding_new_tools

Fixes OpenHands#1533
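The extraction step can be sketched like so (the dict event shape is a simplification; in state.py the walk is over typed action events):

```python
def extract_used_tools(events: list[dict]) -> set[str]:
    """Collect the names of tools that actually appear in the event history."""
    return {
        event["tool_name"]
        for event in events
        if event.get("kind") == "action" and "tool_name" in event
    }
```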

neubig commented Dec 29, 2025

@OpenHands /codereview-roasted including checking if you can (1) reproduce the issue with the previous code, and (2) run the new code and demonstrate that the problem that existed with the previous code is now fixed.


openhands-ai bot commented Dec 29, 2025

I'm on it! neubig can track my progress at all-hands.dev

@neubig neubig self-requested a review December 29, 2025 18:02

ixchio commented Dec 29, 2025

Closing - recreating with clean single commit

@ixchio ixchio closed this Dec 29, 2025

openhands-ai bot commented Dec 29, 2025

Code Review for PR #1537: "feat: relax tool matching on resume"

Verification Results

✅ Issue Reproduction (main branch)

Successfully reproduced the issue on the main branch:

  • Test 1 (Adding new tool): ❌ FAILED with "Tools don't match between runtime and persisted agents. Missing in persisted: {'FileEditorTool'}."
  • Test 2 (Removing unused tool): ❌ FAILED with "Tools don't match between runtime and persisted agents. Missing in runtime: {'FileEditorTool'}."

✅ Fix Verification (PR branch)

Successfully verified the fix on the relax-tool-matching-1533 branch:

  • Test 1 (Adding new tool): ✅ PASS - Adding new tools when resuming is allowed
  • Test 2 (Removing unused tool): ✅ PASS - Removing unused tools when resuming is allowed
  • All existing tests: ✅ 8 passed in test_agent_reconciliation.py
  • New workspace tests: ✅ 13 passed in test_workspace_pause_resume.py

Taste Rating: 🟡 Acceptable - Works but could be cleaner


Linus-Style Analysis

[IMPROVEMENT OPPORTUNITIES] (Should fix - violates good taste)

  1. [base.py, Lines 386-394] Skipping Strict Comparison
    When used_tools is provided, the entire strict model comparison is skipped. This means other agent configuration changes (beyond tools) could slip through without validation. Consider only skipping the tools comparison, not the entire model comparison.

  2. [tests/cross/test_agent_reconciliation.py] Missing Critical Test Case
    There's no test verifying that removing a tool that WAS actually used in history still fails. This is the critical safety check that ensures the fix doesn't break consistency. Add a test like:

    def test_conversation_fails_when_removing_used_tool():
        """Test that removing a tool that was used in history fails."""
        # Create conversation, use a tool (create ActionEvent), then try to resume without it
  3. [PR Scope] Unrelated Changes
    The workspace pause() and resume() methods are unrelated to issue #1533 ("Allow adding new tools when resuming conversations without failing"), which is about tool matching. They should go in a separate PR to keep the commit history clean and the review focused.
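Points 1 and 2 can be sketched together: keep strict comparison for everything except tools, and add the missing safety test (all names and config shapes here are hypothetical):

```python
def reconcile(runtime: dict, persisted: dict, used_tools: set[str]) -> None:
    """Strictly compare non-tool config; relax only the tools check."""

    def non_tool(cfg: dict) -> dict:
        return {k: v for k, v in cfg.items() if k != "tools"}

    if non_tool(runtime) != non_tool(persisted):
        raise ValueError("Agent configuration changed beyond tools")
    missing = used_tools - set(runtime["tools"])
    if missing:
        raise ValueError(f"Tools used in history missing from runtime: {missing}")


def test_conversation_fails_when_removing_used_tool():
    """Removing a tool that WAS used in history must still fail."""
    persisted = {"model": "some-model", "tools": ["TerminalTool", "FileEditorTool"]}
    runtime = {"model": "some-model", "tools": ["TerminalTool"]}  # used tool removed
    try:
        reconcile(runtime, persisted, used_tools={"FileEditorTool"})
    except ValueError:
        return  # expected: a tool that was used in history is gone
    raise AssertionError("removing a used tool should fail reconciliation")


test_conversation_fails_when_removing_used_tool()
```

This way a changed model or other agent setting still trips validation, while tool additions remain free.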

[STYLE NOTES] (Minor)

  1. [base.py, Line 371] Comment Clarity
    The comment "Update tools to match runtime (allows new tools to be added)" could be clearer about what's happening - the persisted agent's tools are being replaced with runtime tools.

VERDICT:

Worth merging with minor improvements suggested

The core logic is sound and solves the real problem described in issue #1533. The implementation is pragmatic - only fail if tools that were actually used are missing, which provides a better user experience while maintaining consistency.


KEY INSIGHT:

The fix correctly identifies that the strict tool matching was overly conservative. By tracking which tools were actually used in history, the code can now allow flexible tool additions while still preventing the removal of tools that would break conversation continuity.



neubig commented Dec 29, 2025

BTW @ixchio it's not necessary to create a single commit -- we squash merge all commits, so they get merged into a single commit anyway.


ixchio commented Dec 29, 2025

BTW @ixchio it's not necessary to create a single commit -- we squash merge all commits, so they get merged into a single commit anyway.

Ah good to know! Thanks for the tip - will skip the cleanup next time since you squash merge anyway. Recreated as #1538 with clean branch, but same code.

Development

Successfully merging this pull request may close these issues.

Allow adding new tools when resuming conversations without failing

2 participants