feat: relax tool matching on resume #1538

ixchio · 2025-12-29T18:05:20Z

Summary

Only fail if a tool that was used in history is missing. Allow adding new tools freely when resuming conversations.

Changes

Add optional used_tools: set[str] | None param to resolve_diff_from_deserialized()
Extract used tool names from event history in state.py before reconciliation
When used_tools is provided, only validate that those tools exist in runtime
New tools can be added to resumed conversations without breaking anything
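
A minimal sketch of the rule described in the changes above, assuming tools are identified by plain name strings. `check_tools_on_resume` is a hypothetical stand-in for illustration, not the SDK's `resolve_diff_from_deserialized()`:

```python
from __future__ import annotations


def check_tools_on_resume(
    runtime_tools: set[str],
    persisted_tools: set[str],
    used_tools: set[str] | None = None,
) -> None:
    """Raise ValueError if the runtime tool set cannot resume the conversation."""
    if used_tools is None:
        # No usage information: fall back to requiring an exact match.
        if runtime_tools != persisted_tools:
            raise ValueError("Tools do not match the persisted conversation")
        return
    # Usage information available: only tools that appear in history must exist.
    missing = used_tools - runtime_tools
    if missing:
        raise ValueError(f"Tools used in history are missing: {sorted(missing)}")
```

Under this sketch, adding a new tool or dropping a never-used one passes, while dropping a tool that appears in history raises.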

Tests

test_conversation_allows_removing_unused_tools - removing tools that were never used is allowed
test_conversation_allows_adding_new_tools - adding new tools when resuming is allowed

All existing reconciliation tests continue to pass.

Fixes #1533

Only fail if a tool that was *used in history* is missing. Allow adding new tools freely when resuming conversations.

Changes:
- Add optional used_tools param to resolve_diff_from_deserialized()
- Extract used tool names from event history in state.py
- Add test_conversation_allows_removing_unused_tools
- Add test_conversation_allows_adding_new_tools

Fixes OpenHands#1533

openhands-sdk/openhands/sdk/conversation/state.py

openhands-sdk/openhands/sdk/agent/base.py

ixchio · 2025-12-31T17:46:28Z

Hey @enyst, great catch on the performance concern!

I've been thinking about this too. The current implementation iterates through all events on resume, which is O(n) in the number of events and could get slow with large conversation histories.

Optimization Approach:

Instead of scanning events on resume, we could cache used tool names incrementally:

Add a used_tool_names: set[str] field to ConversationState
Update it lazily when ActionEvents are appended (via a wrapper method)
This gives us O(1) lookup during reconciliation

Here's the general shape:

# In ConversationState
used_tool_names: set[str] = Field(
    default_factory=set,
    description="Cached tool names used in this conversation"
)

def append_event(self, event: Event) -> None:
    if isinstance(event, ActionEvent):
        if event.tool_name not in self.used_tool_names:
            self.used_tool_names = self.used_tool_names | {event.tool_name}
    self._events.append(event)

This way:

✅ No event iteration on resume
✅ Cached state persists with the conversation
✅ Backward compatible (empty set = strict matching fallback)

Happy to push this optimization if you think it's the right direction. Let me know!

Addresses performance concern raised in PR review:
- Added used_tool_names field to ConversationState that persists with state
- Added append_event() method that maintains the cache incrementally
- On resume, reads from persisted cache instead of iterating all events
- This provides O(1) lookup vs O(n) iteration through all events

ixchio · 2026-01-01T16:49:57Z

@enyst Thanks for the performance feedback! 🙏

I've addressed your concern by adding a used_tool_names cache that:

Persists with state - stored in base_state.json, so on resume it's read directly (O(1))
Updated incrementally - via new append_event() method that updates the cache whenever an ActionEvent is appended
No event iteration needed - on resume, we just read state.used_tool_names instead of scanning all events

This should scale well even with high event counts. Let me know if you have any other concerns!

enyst · 2026-01-03T10:20:02Z

@OpenHands Do a /codereview-roasted on this PR. Don’t assume anything, investigate, and when you have feedback, post it as a comment to this PR.

openhands-ai · 2026-01-03T10:20:11Z

I'm on it! enyst can track my progress at all-hands.dev

enyst · 2026-01-03T10:21:59Z

/codereview-roasted

Taste rating: 🟡 Acceptable. The intent is good (fixes a real UX papercut) and the caching avoids an O(n) scan. But the current implementation pays for that simplicity by quietly turning off an invariant (agent equality) and by introducing a cache that can become stale. Both are fixable.

[CRITICAL ISSUES]

1) You made resolve_diff_from_deserialized() stop validating anything once used_tools is passed.

- Where: openhands-sdk/openhands/sdk/agent/base.py Lines 386–395
- Problem: When used_tools is not None you skip the final strict model_dump equality check entirely. That means any difference between runtime agent and persisted agent (beyond the intended tool list differences) will be silently accepted.
  - Example: different condenser settings, different agent_context (skills), different filter regex, etc.
  - In other words: you didn’t “relax tool matching”, you relaxed agent matching as a whole. That’s a behavior change with real debugging/consistency implications.
- Fix suggestion: Keep the strict comparison, but normalize/ignore only the tools field when used_tools is provided. E.g. compare model_dump(..., exclude={"tools"}) (or equivalent) and still raise on other diffs.
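
The fix suggestion can be sketched roughly as follows. This uses a plain dataclass rather than the SDK's Pydantic model so it runs on its own; `AgentConfig`, its field names, and `reconcile` are hypothetical, and `asdict` with the key dropped plays the role of `model_dump(exclude={"tools"})`:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class AgentConfig:
    # Hypothetical stand-in for the agent model; field names are illustrative.
    condenser: str
    filter_regex: str
    tools: list[str] = field(default_factory=list)


def config_without_tools(agent: AgentConfig) -> dict:
    """Dump every field except `tools` so the rest can be compared strictly."""
    dumped = asdict(agent)
    dumped.pop("tools")
    return dumped


def reconcile(runtime: AgentConfig, persisted: AgentConfig) -> AgentConfig:
    """Accept tool-list differences but reject any other config difference."""
    if config_without_tools(runtime) != config_without_tools(persisted):
        raise ValueError("Agent configuration differs beyond the tools field")
    return runtime
```

The point of the design is that relaxing one field should not disable the comparison for all the others.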

2) used_tool_names is a persisted cache with no backfill/migration path → stale/empty cache will allow unintended removals.

- Where: openhands-sdk/openhands/sdk/conversation/state.py Lines 123–241
- Problem: Existing persisted conversations won’t have used_tool_names. Pydantic will default it to set(). On resume, you pass that empty set into reconciliation, which will happily allow removing tools even if the historical events used them.
  - You’re basically saying “trust the cache” but you don’t guarantee it exists or is correct.
- Fix suggestion: On resume, if the field is missing / empty, recompute it from EventLog (yes, O(n), but only for old states / cache misses). Then persist it. That keeps the optimization without breaking correctness.
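
A rough sketch of the backfill idea, with hypothetical `State`, `ActionEvent`, and `MessageEvent` classes standing in for the SDK types. It rescans only when the cache is empty and events exist:

```python
from dataclasses import dataclass, field


@dataclass
class ActionEvent:
    # Hypothetical event type: an action that invoked a tool.
    tool_name: str


@dataclass
class MessageEvent:
    # Hypothetical event type that does not involve a tool.
    text: str


@dataclass
class State:
    events: list = field(default_factory=list)
    used_tool_names: set = field(default_factory=set)


def backfill_used_tool_names(state: State) -> set:
    """Recompute the cache from events when it is empty but history exists."""
    if not state.used_tool_names and state.events:
        # One O(n) pass, paid only on a cache miss (e.g. a legacy state).
        state.used_tool_names = {
            e.tool_name for e in state.events if isinstance(e, ActionEvent)
        }
    return state.used_tool_names
```

A state whose history contains no tool use will rescan on every resume under this condition, since an empty result is indistinguishable from a missing cache.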

3) append_event() is easy to bypass, so the cache correctness depends on convention, not enforcement.

~~- Where: openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py Line 139 + any other call sites~~
- Problem: You updated the default callback to call append_event(), but nothing prevents other code from doing state.events.append(e) directly. There are plenty of ways for future code to accidentally bypass the cache and reintroduce staleness.
~~- Fix suggestion: Either:~~
~~- Make events append path private (hard in Python), or~~
~~- Add a thin API that centralizes event recording (and use it everywhere), and/or~~
~~- In ConversationState.create() do the backfill scan if cache looks inconsistent.~~

[IMPROVEMENT OPPORTUNITIES] (good taste / maintainability)

4) Unused maps: runtime_tools_map / persisted_tools_map are computed but never used.

- Where: openhands-sdk/openhands/sdk/agent/base.py Lines 352–355
- Why it matters: Dead code signals the API/design drift and adds cognitive noise. If you only need names, compute just names.

5) Error message uses raw set formatting (nondeterministic ordering).

- Where: openhands-sdk/openhands/sdk/agent/base.py Lines 365–369
~~- Why it matters: Makes failures flaky for snapshot-y tests/log parsing and is harder to read. Sort tool names for stable output.~~
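
For illustration, a stable message can be built by sorting the names before formatting. `missing_tools_message` and its wording are hypothetical:

```python
def missing_tools_message(missing: set[str]) -> str:
    """Format missing tool names in sorted order so the text is deterministic."""
    names = ", ".join(sorted(missing))
    return f"Tools used in history are missing from the runtime agent: {names}"
```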

6) Tests don’t prove the new semantic: “only tools used in history must exist”.

- Where: tests/cross/test_agent_reconciliation.py Lines 110–207
- Problem: Both new tests explicitly note “no tools used”. That validates “tool list can change if nothing happened”, which is not the core requirement.
- Fix suggestion: Add a test that actually writes an ActionEvent using a tool (or uses the LocalConversation in a way that triggers one), then verify that resuming without that tool fails, while removing an unused tool still passes.

[STYLE NOTES]

7) Comment bloat / obvious comments.

- Where: state.py around Lines 234–237
- Nit: The code already makes it obvious you’re not scanning events. The repeated commentary reads like a PR description pasted into the source. Keep one tight comment or move rationale to the PR/issue.

Verdict

✅ Worth merging after tightening correctness. The feature request in #1533 is real, and the incremental cache is a sane optimization. But please don’t ship the two correctness footguns:

silently skipping agent reconciliation checks, and
trusting an uninitialized/stale cache.

- Fix agent reconciliation: validate all config except tools when used_tools provided
- Add cache backfill for old conversations without used_tool_names
- Sort tool names in error messages for deterministic output
- Remove unused runtime_tools_map/persisted_tools_map variables
- Add test for core requirement: tools used in history must exist

ixchio · 2026-01-03T10:46:25Z

Hey @enyst! 👋

Thanks for the thorough code review – those were all valid catches! I've pushed a commit addressing the feedback:

Critical fixes:

Agent reconciliation was too permissive – You were right, I accidentally disabled the entire equality check when used_tools was provided. Fixed by using model_dump(exclude={"tools"}) so we still validate condenser, context, etc. while only relaxing tool matching.
Stale cache for old conversations – Added backfill logic in create(): if used_tool_names is empty but there are events, we scan once to populate the cache. This handles the migration path for existing persisted conversations gracefully.
Cache bypass risk – The backfill above also serves as a safety net. If anything bypasses append_event(), we catch it on resume.

Also cleaned up:

Removed unused runtime_tools_map/persisted_tools_map since we only needed the name sets
Sorted tool names in error messages for deterministic output
Added test_conversation_fails_when_used_tool_is_missing to verify the core semantic: tools used in history must exist when resuming

All 9 tests in test_agent_reconciliation.py are passing. Let me know if there's anything else! 🙏

enyst · 2026-01-03T10:55:11Z

openhands-sdk/openhands/sdk/conversation/state.py

+            # Use persisted used_tool_names cache for O(1) lookup on resume.
+            # If the cache is empty but there are events, we need to backfill
+            # for backward compatibility with old persisted conversations.
+            used_tools = state.used_tool_names


I’m not sure about this part of the agent’s feedback. Is it None or is it empty:

if no tools have been used, in a new conversation (one from after the change)

for a legacy conversation?

For a legacy conversation, this seems to be the same case as line 370 in the agent base file, “Legacy behavior: require exact match when used_tools not provided”. If so, maybe we could use that fallback since it seems simpler, WDYT?

ixchio

Good point! 🤔

You're right - both new conversations and legacy ones would have an empty set, but they mean different things...
For a new conversation: set() means "no tools used yet" (correct)
For a legacy conversation: set() is just Pydantic's default (we have no idea what was used)

My current backfill catches legacy conversations by checking if not used_tools and len(events) > 0, but you're right that this is a bit hacky: a new conversation with just a user message (no tools) would trigger an unnecessary scan.

Your idea of using None for legacy → fall back to strict matching is cleaner! Would you prefer I refactor to:

None = legacy, use strict matching (line 370 fallback)

set() = new conversation, no tools used

Let me know and I'll push the update!!
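
The None versus set() distinction discussed above could look something like this, assuming the persisted state is a plain dict. Both function names are hypothetical:

```python
from __future__ import annotations


def load_used_tool_names(persisted: dict) -> set[str] | None:
    """Return None when the field is absent (legacy), else the stored names."""
    raw = persisted.get("used_tool_names")
    return None if raw is None else set(raw)


def matching_mode(used_tool_names: set[str] | None) -> str:
    """Pick strict matching for legacy states, relaxed matching otherwise."""
    if used_tool_names is None:
        return "strict"  # unknown history: require an exact tool match
    return "relaxed"  # known history, possibly empty: check used tools only
```

Keeping the "unknown" case as None means an empty set can safely mean "no tools used yet" without triggering a rescan.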

tofarr

I think the original design behind the EventLog was that it was to behave like a regular python list (Even though it is not, as it is backed by memory!) I think introducing the append_event method opposes that concept.

Would it make sense to pull the used tools from the agent state on the fly when checking the runtime_tools_map vs the persisted_tools_map? (Also, we should check the full definition rather than just the names)

ixchio · 2026-01-03T15:40:35Z

Hey @tofarr, thanks for jumping in!

That's a fair point about EventLog's list-like design. I introduced append_event() as a caching optimization after @enyst raised a performance concern about O(n) event scanning on resume.

To clarify your suggestion:
Are you proposing we skip the cache entirely and just scan the EventLog for ActionEvents at reconciliation time? Something like:

# In create() resume path
used_tools = {
    event.tool_name
    for event in state._events
    if isinstance(event, ActionEvent) and event.tool_name
}

If so, I'm actually okay with that - it's simpler, and the O(n) scan only happens once per resume. The cache was an optimization that added complexity.

On checking full tool definitions vs just names:
Could you elaborate on what you mean? Tool specs can change between sessions (e.g., different params), and the current reconciliation only validates that required tools exist. What additional checks would you like to see?

Happy to rework the approach based on your feedback!

tofarr · 2026-01-03T15:56:39Z

Hi @ixchio !

Thanks for your help with this one, and I think we are in agreement about the caching - once per resume seems acceptable to me too, and keeping it simple is preferable to premature optimization (Though @enyst may know something I don't)

The current code does a check of the tool names only - in theory, we could have a situation where a tool definition changes but retains the old name. I know this is unlikely, but I would have considered any change in the tools to be an unlikely occurrence!

tofarr · 2026-01-03T16:01:06Z

Maybe we could do the actual use check only if the names don't match? (Avoiding the O(n) event scanning in most cases?)

ixchio · 2026-01-03T16:17:57Z

That's a smart optimization @tofarr!

So the flow would be:

If tool names match → All good, no event scanning needed (common case)
If tool names don't match → Scan events to find which tools were actually used, only require those

This gives us the best of both worlds - O(1) for the happy path, O(n) only when tools changed.

I'll refactor the implementation:

Drop the used_tool_names cache and append_event() method
Keep EventLog behaving like a regular list
Add the event scan as a fallback in resolve_diff_from_deserialized() only when names differ
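
A sketch of that flow under stated assumptions: `ActionEvent` and `check_tools` are hypothetical stand-ins, and tools are compared by name only. The event iterable is touched only when the name sets differ:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ActionEvent:
    # Hypothetical event type: an action that invoked a tool.
    tool_name: str


def check_tools(
    runtime_names: set, persisted_names: set, events: Iterable[object]
) -> None:
    """Scan history for used tools only when the tool names have changed."""
    if runtime_names == persisted_names:
        return  # common case: same tools, the events are never iterated
    # Fallback: one pass over the history to find tools that were actually used.
    used = {
        e.tool_name
        for e in events
        if isinstance(e, ActionEvent) and e.tool_name
    }
    missing = used - runtime_names
    if missing:
        raise ValueError(f"Tools used in history are missing: {sorted(missing)}")
```

The happy-path cost depends on the number of tools rather than the number of events, which is what keeps resume cheap for long conversations.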

On the tool definition changes question - I think that's a separate concern from this PR (which focuses on tool presence). We could add full schema validation in a follow-up if needed?

Let me push the simplified implementation then!

@tofarr

- Remove used_tool_names cache and append_event() method
- EventLog behaves like regular list again (per @tofarr's feedback)
- Scan events on-the-fly only when tool names don't match
- O(1) for happy path (same tools), O(n) fallback when tools change

ixchio · 2026-01-03T16:26:59Z

Done! Pushed the simplified implementation

tofarr

Great Job! 🍰

ixchio added 2 commits December 29, 2025 23:34

Merge branch 'main' into relax-tool-matching-clean

caa0e57

enyst self-requested a review December 29, 2025 18:11

ixchio mentioned this pull request Dec 29, 2025

feat: relax tool matching on resume #1537

Closed

ixchio added 3 commits December 30, 2025 11:25

Merge branch 'main' into relax-tool-matching-clean

6248db6

Merge branch 'main' into relax-tool-matching-clean

e47b611

Merge branch 'main' into relax-tool-matching-clean

ac0162e

enyst reviewed Dec 31, 2025

openhands-sdk/openhands/sdk/conversation/state.py

enyst reviewed Dec 31, 2025

openhands-sdk/openhands/sdk/agent/base.py

ixchio mentioned this pull request Dec 31, 2025

Allow adding new tools when resuming conversations without failing #1533

Closed

Merge branch 'main' into relax-tool-matching-clean

3e3aa0d

ixchio requested a review from enyst January 1, 2026 15:47

ixchio added 3 commits January 2, 2026 10:47

Merge branch 'main' into relax-tool-matching-clean

ed69ba0

Merge branch 'main' into relax-tool-matching-clean

2a6bf2a

Merge branch 'main' into relax-tool-matching-clean

7e4ce23

OpenHands deleted a comment from openhands-ai bot Jan 3, 2026

enyst reviewed Jan 3, 2026

Merge branch 'main' into relax-tool-matching-clean

c80d0fc

ixchio requested a review from enyst January 3, 2026 11:39

tofarr requested changes Jan 3, 2026

ixchio requested a review from tofarr January 3, 2026 15:42

tofarr approved these changes Jan 3, 2026

tofarr merged commit 996daa7 into OpenHands:main Jan 4, 2026
14 checks passed
