@ixchio
Contributor

@ixchio ixchio commented Dec 29, 2025

Summary

Only fail if a tool that was used in history is missing. Allow adding new tools freely when resuming conversations.

Changes

  • Add optional used_tools: set[str] | None param to resolve_diff_from_deserialized()
  • Extract used tool names from event history in state.py before reconciliation
  • When used_tools is provided, only validate that those tools exist in runtime
  • New tools can be added to resumed conversations without breaking anything
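
A minimal sketch of the relaxed check listed above, assuming tools are identified by name only. `check_used_tools` and its parameters are hypothetical stand-ins for illustration, not the actual signature of resolve_diff_from_deserialized():

```python
def check_used_tools(runtime_tools: set[str], used_tools: set[str]) -> None:
    """Raise if a tool that appears in history is missing at runtime.

    Hypothetical helper: tools that were never used, and tools that are
    new at runtime, are both accepted.
    """
    missing = used_tools - runtime_tools
    if missing:
        raise ValueError(
            f"Tools used in history are missing at runtime: {sorted(missing)}"
        )


# Adding a new tool ("browser") while keeping the used one passes silently.
check_used_tools(runtime_tools={"terminal", "browser"}, used_tools={"terminal"})
```

Only the set difference `used_tools - runtime_tools` matters here, which is why adding tools can never trigger the error.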

Tests

  • test_conversation_allows_removing_unused_tools - removing tools that were never used is allowed
  • test_conversation_allows_adding_new_tools - adding new tools when resuming is allowed

All existing reconciliation tests continue to pass.

Fixes #1533

Only fail if a tool that was *used in history* is missing.
Allow adding new tools freely when resuming conversations.

Changes:
- Add optional used_tools param to resolve_diff_from_deserialized()
- Extract used tool names from event history in state.py
- Add test_conversation_allows_removing_unused_tools
- Add test_conversation_allows_adding_new_tools

Fixes OpenHands#1533
@enyst enyst self-requested a review December 29, 2025 18:11
@ixchio
Contributor Author

ixchio commented Dec 31, 2025

Hey @enyst, great catch on the performance concern!

I've been thinking about this too. The current implementation iterates through all events on resume, which is O(n) in the number of events and could get slow for large conversation histories.

Optimization Approach:

Instead of scanning events on resume, we could cache used tool names incrementally:

  1. Add a used_tool_names: set[str] field to ConversationState
  2. Update it lazily when ActionEvents are appended (via a wrapper method)
  3. This gives us O(1) lookup during reconciliation

Here's the general shape:

# In ConversationState
used_tool_names: set[str] = Field(
    default_factory=set,
    description="Cached tool names used in this conversation"
)

def append_event(self, event: Event) -> None:
    if isinstance(event, ActionEvent):
        if event.tool_name not in self.used_tool_names:
            self.used_tool_names = self.used_tool_names | {event.tool_name}
    self._events.append(event)

This way:

  • ✅ No event iteration on resume
  • ✅ Cached state persists with the conversation
  • ✅ Backward compatible (empty set = strict matching fallback)

Happy to push this optimization if you think it's the right direction. Let me know!

@ixchio ixchio requested a review from enyst January 1, 2026 15:47
Addresses performance concern raised in PR review:
- Added used_tool_names field to ConversationState that persists with state
- Added append_event() method that maintains the cache incrementally
- On resume, reads from persisted cache instead of iterating all events
- This provides O(1) lookup vs O(n) iteration through all events
@ixchio
Contributor Author

ixchio commented Jan 1, 2026

@enyst Thanks for the performance feedback! 🙏

I've addressed your concern by adding a used_tool_names cache that:

  1. Persists with state - stored in base_state.json, so on resume it's read directly (O(1))
  2. Updated incrementally - via new append_event() method that updates the cache whenever an ActionEvent is appended
  3. No event iteration needed - on resume, we just read state.used_tool_names instead of scanning all events

This should scale well even with high event counts. Let me know if you have any other concerns!

@enyst
Collaborator

enyst commented Jan 3, 2026

@OpenHands Do a /codereview-roasted on this PR. Don’t assume anything, investigate, and when you have feedback, post it as a comment to this PR.

@openhands-ai

openhands-ai bot commented Jan 3, 2026

I'm on it! enyst can track my progress at all-hands.dev

Collaborator

enyst commented Jan 3, 2026

/codereview-roasted

Taste rating: 🟡 Acceptable. The intent is good (fixes a real UX papercut) and the caching avoids an O(n) scan. But the current implementation pays for that simplicity by quietly turning off an invariant (agent equality) and by introducing a cache that can become stale. Both are fixable.


[CRITICAL ISSUES]

  1. You made resolve_diff_from_deserialized() stop validating anything once used_tools is passed.
  • Where: openhands-sdk/openhands/sdk/agent/base.py Lines 386–395
  • Problem: When used_tools is not None you skip the final strict model_dump equality check entirely. That means any difference between runtime agent and persisted agent (beyond the intended tool list differences) will be silently accepted.
    • Example: different condenser settings, different agent_context (skills), different filter regex, etc.
    • In other words: you didn’t “relax tool matching”, you relaxed agent matching as a whole. That’s a behavior change with real debugging/consistency implications.
  • Fix suggestion: Keep the strict comparison, but normalize/ignore only the tools field when used_tools is provided. E.g. compare model_dump(..., exclude={"tools"}) (or equivalent) and still raise on other diffs.
  2. used_tool_names is a persisted cache with no backfill/migration path → stale/empty cache will allow unintended removals.
  • Where: openhands-sdk/openhands/sdk/conversation/state.py Lines 123–241
  • Problem: Existing persisted conversations won’t have used_tool_names. Pydantic will default it to set(). On resume, you pass that empty set into reconciliation, which will happily allow removing tools even if the historical events used them.
    • You’re basically saying “trust the cache” but you don’t guarantee it exists or is correct.
  • Fix suggestion: On resume, if the field is missing / empty, recompute it from EventLog (yes, O(n), but only for old states / cache misses). Then persist it. That keeps the optimization without breaking correctness.
  3. append_event() is easy to bypass, so the cache correctness depends on convention, not enforcement.
  • Where: openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py Line 139 + any other call sites
  • Problem: You updated the default callback to call append_event(), but nothing prevents other code from doing state.events.append(e) directly. There are plenty of ways for future code to accidentally bypass the cache and reintroduce staleness.
  • Fix suggestion: Either:
    • Make events append path private (hard in Python), or
    • Add a thin API that centralizes event recording (and use it everywhere), and/or
    • In ConversationState.create() do the backfill scan if cache looks inconsistent.
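
To illustrate the fix suggested for the first critical issue, here is a rough sketch of "strict comparison on everything except tools". It uses a stdlib dataclass as a stand-in for the Pydantic agent model, so `AgentConfig`, its fields, and both helper functions are assumptions made for this example; in the SDK the equivalent would presumably be a `model_dump(exclude={"tools"})` comparison, as the review suggests:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class AgentConfig:
    """Hypothetical stand-in for the agent model, for illustration only."""

    llm: str
    condenser: str
    tools: list[str] = field(default_factory=list)


def config_without_tools(agent: AgentConfig) -> dict:
    """Dump the agent to a dict and drop only the `tools` field."""
    dumped = asdict(agent)
    dumped.pop("tools")
    return dumped


def same_except_tools(persisted: AgentConfig, runtime: AgentConfig) -> bool:
    """Strict equality on every field except `tools`."""
    return config_without_tools(persisted) == config_without_tools(runtime)


persisted = AgentConfig(llm="model-a", condenser="summary", tools=["terminal"])
more_tools = AgentConfig(llm="model-a", condenser="summary", tools=["terminal", "browser"])
other_condenser = AgentConfig(llm="model-a", condenser="none", tools=["terminal"])

print(same_except_tools(persisted, more_tools))       # True: tool change tolerated
print(same_except_tools(persisted, other_condenser))  # False: other diffs still caught
```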


[IMPROVEMENT OPPORTUNITIES] (good taste / maintainability)

  4. Unused maps: runtime_tools_map / persisted_tools_map are computed but never used.
  • Where: openhands-sdk/openhands/sdk/agent/base.py Lines 352–355
  • Why it matters: Dead code signals the API/design drift and adds cognitive noise. If you only need names, compute just names.
  5. Error message uses raw set formatting (nondeterministic ordering).
  • Where: openhands-sdk/openhands/sdk/agent/base.py Lines 365–369
  • Why it matters: Makes failures flaky for snapshot-y tests/log parsing and is harder to read. Sort tool names for stable output.
  6. Tests don’t prove the new semantic: “only tools used in history must exist”.
  • Where: tests/cross/test_agent_reconciliation.py Lines 110–207
  • Problem: Both new tests explicitly note “no tools used”. That validates “tool list can change if nothing happened”, which is not the core requirement.
  • Fix suggestion: Add a test that actually writes an ActionEvent using a tool (or uses the LocalConversation in a way that triggers one), then verify that resuming without that tool fails, while removing an unused tool still passes.
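
As a small illustration of the error-message point above: formatting a set directly depends on its iteration order, whereas sorting first gives stable text. `missing_tools_message` is a hypothetical helper, not a function from the SDK:

```python
def missing_tools_message(missing: set[str]) -> str:
    """Format missing tool names in sorted order so the text is stable."""
    return "Missing tools: " + ", ".join(sorted(missing))


print(missing_tools_message({"terminal", "browser", "editor"}))
# prints "Missing tools: browser, editor, terminal"
```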

[STYLE NOTES]

  7. Comment bloat / obvious comments.
  • Where: state.py around Lines 234–237
  • Nit: The code already makes it obvious you’re not scanning events. The repeated commentary reads like a PR description pasted into the source. Keep one tight comment or move rationale to the PR/issue.

Verdict

Worth merging after tightening correctness. The feature request in #1533 is real, and the incremental cache is a sane optimization. But please don’t ship the two correctness footguns:

  • silently skipping agent reconciliation checks, and
  • trusting an uninitialized/stale cache.

@OpenHands OpenHands deleted a comment from openhands-ai bot Jan 3, 2026
- Fix agent reconciliation: validate all config except tools when used_tools provided
- Add cache backfill for old conversations without used_tool_names
- Sort tool names in error messages for deterministic output
- Remove unused runtime_tools_map/persisted_tools_map variables
- Add test for core requirement: tools used in history must exist
@ixchio
Contributor Author

ixchio commented Jan 3, 2026

Hey @enyst! 👋

Thanks for the thorough code review – those were all valid catches! I've pushed a commit addressing the feedback:

Critical fixes:

  1. Agent reconciliation was too permissive – You were right, I accidentally disabled the entire equality check when used_tools was provided. Fixed by using model_dump(exclude={"tools"}) so we still validate condenser, context, etc. while only relaxing tool matching.

  2. Stale cache for old conversations – Added backfill logic in create(): if used_tool_names is empty but there are events, we scan once to populate the cache. This handles the migration path for existing persisted conversations gracefully.

  3. Cache bypass risk – The backfill above also serves as a safety net. If anything bypasses append_event(), we catch it on resume.

Also cleaned up:

  • Removed unused runtime_tools_map/persisted_tools_map since we only needed the name sets
  • Sorted tool names in error messages for deterministic output
  • Added test_conversation_fails_when_used_tool_is_missing to verify the core semantic: tools used in history must exist when resuming
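
A rough sketch of the backfill condition described in fix 2, under the assumption that events are plain objects and that tool-using events expose a `tool_name`. The event classes and `backfill_used_tools` are hypothetical and much simpler than the real SDK types:

```python
from dataclasses import dataclass


@dataclass
class ActionEvent:
    """Hypothetical minimal event type; only the tool name matters here."""

    tool_name: str


@dataclass
class MessageEvent:
    """Hypothetical non-tool event."""

    text: str


def backfill_used_tools(cached: set[str], events: list) -> set[str]:
    """Return the cache if populated, otherwise rebuild it from events."""
    if cached or not events:
        return cached
    # One-off O(n) scan, needed only when the cache is empty.
    return {e.tool_name for e in events if isinstance(e, ActionEvent)}


events = [MessageEvent("hi"), ActionEvent("terminal"), ActionEvent("terminal")]
print(backfill_used_tools(set(), events))  # {'terminal'}
```

Note that this sketch rescans only when the cache is empty, so a cache that is populated but incomplete would be returned unchanged.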

All 9 tests in test_agent_reconciliation.py are passing. Let me know if there's anything else! 🙏

# Use persisted used_tool_names cache for O(1) lookup on resume.
# If the cache is empty but there are events, we need to backfill
# for backward compatibility with old persisted conversations.
used_tools = state.used_tool_names
Collaborator

I’m not sure about this part of the agent’s feedback. Is it None or is it empty:

  • if no tools have been used, in a new conversation (one from after the change)
  • for a legacy conversation?

For a legacy conversation, it seems like this is the same with line 370 in agent base file, “ Legacy behavior: require exact match when used_tools not provided”, and if so, we could maybe use that fallback since it seems simpler, WDYT?

Contributor Author

Good point! 🤔

You're right - both new conversations and legacy ones would have an empty set, but they mean different things...
For a new conversation: set() means "no tools used yet" (correct)
For a legacy conversation: set() is just Pydantic's default (we have no idea what was used)

My current backfill catches legacy conversations by checking if not used_tools and len(events) > 0, but you're right that this is a bit hacky: a new conversation with just a user message (no tools) would trigger an unnecessary scan.

Your idea of using None for legacy → fall back to strict matching is cleaner! Would you prefer I refactor to:

  • None = legacy, use strict matching (line 370 fallback)
  • set() = new conversation, no tools used
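
The distinction could be sketched roughly as follows. `reconciliation_mode` is a hypothetical function used only to show why the check has to be `is None` rather than a truthiness test, since `not used_tools` would treat `None` and `set()` the same way:

```python
from typing import Optional


def reconciliation_mode(used_tools: Optional[set[str]]) -> str:
    """Pick a matching strategy from the (hypothetical) persisted field."""
    if used_tools is None:
        # Legacy state: nothing was recorded, so nothing can be assumed.
        return "strict"
    # Known history, possibly empty: only the listed tools are required.
    return "relaxed"


print(reconciliation_mode(None))   # strict
print(reconciliation_mode(set()))  # relaxed
```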

Let me know and I'll push the update!

@ixchio ixchio requested a review from enyst January 3, 2026 11:39
Collaborator

@tofarr tofarr left a comment

I think the original design intent for the EventLog was that it should behave like a regular Python list (even though it is not one, as it is backed by memory!). I think introducing the append_event method goes against that concept.

Would it make sense to pull the used tools from the agent state on the fly when checking the runtime_tools_map vs the persisted_tools_map? (Also, we should check the full definition rather than just the names.)

@ixchio
Contributor Author

ixchio commented Jan 3, 2026

Hey @tofarr, thanks for jumping in!

That's a fair point about EventLog's list-like design. I introduced append_event() as a caching optimization after @enyst raised a performance concern about O(n) event scanning on resume.

To clarify your suggestion:
Are you proposing we skip the cache entirely and just scan the EventLog for ActionEvents at reconciliation time? Something like:

# In create() resume path
used_tools = {
    event.tool_name
    for event in state._events
    if isinstance(event, ActionEvent) and event.tool_name
}

If so, I'm actually okay with that - it's simpler, and the O(n) scan only happens once per resume. The cache was an optimization that added complexity.

On checking full tool definitions vs just names:
Could you elaborate on what you mean? Tool specs can change between sessions (e.g., different params), and the current reconciliation only validates that required tools exist. What additional checks would you like to see?

Happy to rework the approach based on your feedback!

@ixchio ixchio requested a review from tofarr January 3, 2026 15:42
@tofarr
Collaborator

tofarr commented Jan 3, 2026

Hi @ixchio!

Thanks for your help with this one. I think we are in agreement about the caching: one scan per resume seems acceptable to me too, and keeping it simple is preferable to premature optimization (though @enyst may know something I don't).

The current code does a check of the tool names only - in theory, we could have a situation where a tool definition changes but retains the old name. I know this is unlikely, but I would have considered any change in the tools to be an unlikely occurrence!

@tofarr
Collaborator

tofarr commented Jan 3, 2026

Maybe we could do the actual use check only if the names don't match? (Avoiding the O(n) event scanning in most cases?)

@ixchio
Contributor Author

ixchio commented Jan 3, 2026

That's a smart optimization @tofarr!

So the flow would be:

  1. If tool names match → All good, no event scanning needed (common case)
  2. If tool names don't match → Scan events to find which tools were actually used, only require those

This gives us the best of both worlds - O(1) for the happy path, O(n) only when tools changed.
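
A minimal sketch of that two-step flow, assuming tools are compared by name and that the event scan can be passed in as a callable so it runs only when needed. `tools_ok` and `scan_used_tools` are hypothetical names, not the merged implementation:

```python
from typing import Callable, Iterable


def tools_ok(
    persisted: set[str],
    runtime: set[str],
    scan_used_tools: Callable[[], Iterable[str]],
) -> bool:
    """Return True if the runtime tool set is acceptable for resuming."""
    if persisted == runtime:
        return True  # common case: the history is never read
    # Tool sets differ: pay for one scan and require only the used tools.
    return set(scan_used_tools()) <= runtime


# "editor" was dropped and "browser" added, but only "terminal" was ever used.
print(tools_ok({"terminal", "editor"}, {"terminal", "browser"}, lambda: ["terminal"]))
# prints "True"
```

Passing the scan as a callable keeps the expensive path lazy: when the name sets are equal, the history is not touched at all.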

I'll refactor the implementation:

  • Drop the used_tool_names cache and append_event() method
  • Keep EventLog behaving like a regular list
  • Add the event scan as a fallback in resolve_diff_from_deserialized() only when names differ

On the tool definition changes question: I think that's a separate concern from this PR (which focuses on tool presence). We could add full schema validation in a follow-up if needed?

Let me push the simplified implementation then!

- Remove used_tool_names cache and append_event() method
- EventLog behaves like regular list again (per @tofarr's feedback)
- Scan events on-the-fly only when tool names don't match
- O(1) for happy path (same tools), O(n) fallback when tools change
@ixchio
Contributor Author

ixchio commented Jan 3, 2026

Done! Pushed the simplified implementation

Collaborator

@tofarr tofarr left a comment

Great Job! 🍰

@tofarr tofarr merged commit 996daa7 into OpenHands:main Jan 4, 2026
14 checks passed
Development

Successfully merging this pull request may close these issues.

Allow adding new tools when resuming conversations without failing

3 participants