feat: relax tool matching on resume #1538
Conversation
Only fail if a tool that was *used in history* is missing. Allow adding new tools freely when resuming conversations.

Changes:
- Add optional `used_tools` param to `resolve_diff_from_deserialized()`
- Extract used tool names from event history in `state.py`
- Add `test_conversation_allows_removing_unused_tools`
- Add `test_conversation_allows_adding_new_tools`

Fixes OpenHands#1533
Hey @enyst, great catch on the performance concern! I've been thinking about this too. The current implementation iterates through all events, which could indeed be O(n) with large conversation histories.

Optimization approach: instead of scanning events on resume, we could cache used tool names incrementally.
Here's the general shape:

```python
# In ConversationState
used_tool_names: set[str] = Field(
    default_factory=set,
    description="Cached tool names used in this conversation",
)

def append_event(self, event: Event) -> None:
    if isinstance(event, ActionEvent):
        if event.tool_name not in self.used_tool_names:
            self.used_tool_names = self.used_tool_names | {event.tool_name}
    self._events.append(event)
```

This way the cache stays in sync as events are appended, and resume never has to rescan history.
Happy to push this optimization if you think it's the right direction. Let me know!
Addresses performance concern raised in PR review:
- Added `used_tool_names` field to ConversationState that persists with state
- Added `append_event()` method that maintains the cache incrementally
- On resume, reads from persisted cache instead of iterating all events
- This provides O(1) lookup vs O(n) iteration through all events
@enyst Thanks for the performance feedback! 🙏 I've addressed your concern by adding a `used_tool_names` cache that is maintained incrementally and persisted with state. This should scale well even with high event counts. Let me know if you have any other concerns!
@OpenHands Do a /codereview-roasted on this PR. Don’t assume anything, investigate, and when you have feedback, post it as a comment to this PR. |
I'm on it! enyst can track my progress at all-hands.dev |
/codereview-roasted

Taste rating: 🟡 Acceptable. The intent is good (it fixes a real UX papercut) and the caching avoids an O(n) scan. But the current implementation pays for that simplicity by quietly turning off an invariant (agent equality) and by introducing a cache that can become stale. Both are fixable.

[CRITICAL ISSUES]
[IMPROVEMENT OPPORTUNITIES] (good taste / maintainability)
[STYLE NOTES]
Verdict: ✅ Worth merging after tightening correctness. The feature request in #1533 is real, and the incremental cache is a sane optimization. But please don't ship the two correctness footguns:
- Fix agent reconciliation: validate all config except tools when `used_tools` provided
- Add cache backfill for old conversations without `used_tool_names`
- Sort tool names in error messages for deterministic output
- Remove unused `runtime_tools_map`/`persisted_tools_map` variables
- Add test for core requirement: tools used in history must exist
Hey @enyst! 👋 Thanks for the thorough code review – those were all valid catches! I've pushed a commit addressing the feedback.

Critical fixes:

Also cleaned up:

All 9 tests in
```python
# Use persisted used_tool_names cache for O(1) lookup on resume.
# If the cache is empty but there are events, we need to backfill
# for backward compatibility with old persisted conversations.
used_tools = state.used_tool_names
```
I'm not sure about this part of the agent's feedback. Is it `None` or is it empty:

- if no tools have been used, in a new conversation (one from after the change)?
- for a legacy conversation?

For a legacy conversation, it seems like this is the same as line 370 in the agent base file ("Legacy behavior: require exact match when used_tools not provided"), and if so, we could maybe use that fallback since it seems simpler, WDYT?
Good point! 🤔

You're right – both new conversations and legacy ones would have an empty set, but they mean different things:

- For a new conversation: `set()` means "no tools used yet" (correct)
- For a legacy conversation: `set()` is just Pydantic's default (we have no idea what was used)

My current backfill catches legacy by checking `if not used_tools and len(events) > 0`, but you're right that this is a bit hacky – a new conversation with just a user message (no tools) would trigger an unnecessary scan.

Your idea of using `None` for legacy → fall back to strict matching is cleaner! Would you prefer I refactor to:

- `None` = legacy, use strict matching (line 370 fallback)
- `set()` = new conversation, no tools used

Let me know and I'll push the update!
I think the original design behind the EventLog was that it was to behave like a regular Python list (even though it is not, as it is backed by memory!). I think introducing the `append_event` method opposes that concept.

Would it make sense to pull the used tools from the agent state on the fly when checking the `runtime_tools_map` vs the `persisted_tools_map`? (Also, we should check the full definition rather than just the names.)
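Comparing full definitions instead of names could look roughly like this; `Tool` is a hypothetical stand-in for the project's real tool type, used only to illustrate the idea:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for the project's tool type."""
    name: str
    description: str
    params_schema: str  # e.g. the JSON schema serialized to a string


def changed_tools(runtime: dict[str, Tool], persisted: dict[str, Tool]) -> set[str]:
    """Names present in both maps whose full definitions differ."""
    return {
        name
        for name, tool in persisted.items()
        if name in runtime and runtime[name] != tool
    }
```

With `frozen=True` the dataclass gets a generated `__eq__`, so a changed description or schema under an unchanged name would be caught, which a name-only comparison misses.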
Hey @tofarr, thanks for jumping in! That's a fair point about EventLog's list-like design. I introduced `append_event()` to keep the cache in sync, but I can see how it cuts against that concept.

To clarify your suggestion:

```python
# In create() resume path
used_tools = {
    event.tool_name
    for event in state._events
    if isinstance(event, ActionEvent) and event.tool_name
}
```

If so, I'm actually okay with that – it's simpler, and the O(n) scan only happens once per resume. The cache was an optimization that added complexity.

On checking full tool definitions vs just names:

Happy to rework the approach based on your feedback!
Hi @ixchio! Thanks for your help with this one, and I think we are in agreement about the caching – once per resume seems acceptable to me too, and keeping it simple is preferable to premature optimization (though @enyst may know something I don't).

The current code checks the tool names only – in theory, we could have a situation where a tool definition changes but retains the old name. I know this is unlikely, but I would have considered any change in the tools to be an unlikely occurrence!
Maybe we could do the actual use check only if the names don't match? (Avoiding the O(n) event scanning in most cases?)
That's a smart optimization @tofarr! So the flow would be:

This gives us the best of both worlds – O(1) for the happy path, O(n) only when tools changed. I'll refactor the implementation.

On the tool definition changes question – I think that's a separate concern from this PR (which focuses on tool presence)... we could add full schema validation in a follow-up if needed? Let me push the simplified implementation then!
- Remove `used_tool_names` cache and `append_event()` method
- EventLog behaves like regular list again (per @tofarr's feedback)
- Scan events on-the-fly only when tool names don't match
- O(1) for happy path (same tools), O(n) fallback when tools change
Done! Pushed the simplified implementation.
tofarr left a comment
Great Job! 🍰
Summary
Only fail if a tool that was used in history is missing. Allow adding new tools freely when resuming conversations.
Changes
- Add optional `used_tools: set[str] | None` param to `resolve_diff_from_deserialized()`
- Extract used tool names from event history in `state.py` before reconciliation
- When `used_tools` is provided, only validate that those tools exist in runtime
- `test_conversation_allows_removing_unused_tools` – removing tools that were never used is allowed
- `test_conversation_allows_adding_new_tools` – adding new tools when resuming is allowed

All existing reconciliation tests continue to pass.
Fixes #1533