-
Notifications
You must be signed in to change notification settings - Fork 104
feat: relax tool matching on resume #1537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Closes OpenHands#1417 Implements early stopping mechanism to detect failures early and terminate behavior tests before full trajectory completes, reducing LLM costs. Changes: - Add early_stopper.py with 6 pruner classes: * EarlyStopperBase (abstract base) * FileEditPruner (detect forbidden file edits) * BashCommandPruner (detect forbidden commands) * TestExecutionPruner (detect excessive test runs) * CompositeEarlyStopper (combine multiple pruners) * LLMJudgePruner (periodic lightweight LLM checks) - Integrate early stopping in BaseIntegrationTest callback - Add get_early_stopper() hook in SoftwareAgentSDKBehaviorTest - Apply early stoppers to b01, b02, b05 behavior tests - Add 17 unit tests for early stopper functionality Per discussion with @ryanhoangt: - Pattern-based pruning first (zero LLM cost) - Stop on first failure signal - Skip final LLM judge when early stopped - Reusable pruner classes with base interface
…ributeError The early_stopper and early_stop_result attributes were being initialized AFTER LocalConversation was created, causing AttributeError when the callback accessed these attributes during test execution. This fixes the CI failures where all behavior tests failed with: '...Test' object has no attribute 'early_stopper'
- b02: Add TerminalTestAwarePruner that whitelists tests/tools/terminal/ paths - b05: Add allowed_patterns to skip auto-generated files (.openhands/, __pycache__) - b05: Increase max_creates to 3 to allow training script, README, and test file This fixes the false positives where: - b02 was blocking legitimate targeted terminal test runs - b05 was blocking auto-generated framework files
- Fix trailing space in 'pytest tests/ ' pattern that wouldn't catch bare commands - Rewrite tests to use real FileEditorAction/TerminalAction objects - Fix line-too-long lint error in LLMJudgePruner prompt - Remove unused imports and variables All 17 unit tests passing.
- Use CommandLiteral type for file editor commands - Add cast() for list covariance (list[ActionEvent] -> list[Event]) - Add null checks before string operations on result.reason - Use getattr() pattern to avoid type narrowing issue with ImageContent All 17 tests pass, pyright reports 0 errors.
- Remove TestExecutionPruner and LLMJudgePruner from early_stopper.py - Simplify b02 to rely on LLM judge instead of pattern matching - Remove RedundantFileCreationPruner from b05 - Keep FileEditPruner/BashCommandPruner/CompositeEarlyStopper as core infra - b01 still gets early stopping (saves the most cost anyway)
Closes OpenHands#1504 - BaseWorkspace: raise NotImplementedError by default - LocalWorkspace: no-op (nothing to pause on host filesystem) - DockerWorkspace: docker pause/unpause commands - APIRemoteWorkspace: expose /pause and /resume endpoints - OpenHandsCloudWorkspace: pause() not supported yet, resume() works Signed-off-by: ixchio <ruzvadpathan7058@gmail.com>
Only fail if a tool that was *used in history* is missing. Allow adding new tools freely when resuming conversations. Changes: - Add optional used_tools param to resolve_diff_from_deserialized() - Extract used tool names from event history in state.py - Add test_conversation_allows_removing_unused_tools - Add test_conversation_allows_adding_new_tools Fixes OpenHands#1533
|
@OpenHands /codereview-roasted including checking if you can (1) reproduce the issue with the previous code, and (2) run the new code and demonstrate that the problem that existed with the previous code is now fixed. |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
Closing - recreating with clean single commit |
Code Review for PR #1537: "feat: relax tool matching on resume"Verification Results✅ Issue Reproduction (main branch)Successfully reproduced the issue on the
✅ Fix Verification (PR branch)Successfully verified the fix on the
Taste Rating: 🟡 Acceptable - Works but could be cleanerLinus-Style Analysis[IMPROVEMENT OPPORTUNITIES] (Should fix - violates good taste)
[STYLE NOTES] (Minor)
VERDICT:✅ Worth merging with minor improvements suggested The core logic is sound and solves the real problem described in issue #1533. The implementation is pragmatic - only fail if tools that were actually used are missing, which provides a better user experience while maintaining consistency. KEY INSIGHT:The fix correctly identifies that the strict tool matching was overly conservative. By tracking which tools were actually used in history, the code can now allow flexible tool additions while still preventing the removal of tools that would break conversation continuity. |
|
BTW @ixchio it's not necessary to create a single commit -- we squash merge all commits, so they get merged into a single commit anyway. |
Summary
Only fail if a tool that was used in history is missing. Allow adding new tools freely when resuming conversations.
Changes
used_tools: set[str] | Noneparam toresolve_diff_from_deserialized()state.pybefore reconciliationused_toolsis provided, only validate that those tools exist in runtimeTests
test_conversation_allows_removing_unused_tools- removing tools that were never used is allowedtest_conversation_allows_adding_new_tools- adding new tools when resuming is allowedAll existing reconciliation tests continue to pass.
Fixes #1533