-
Notifications
You must be signed in to change notification settings - Fork 100
Set initial execution status to error if it was running #1554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test that a conversation with RUNNING execution_status becomes ERROR when resumed/restarted. This verifies the fix that prevents conversations from incorrectly remaining in RUNNING state after a crash or unexpected termination. Co-authored-by: openhands <openhands@all-hands.dev>
Coverage Report •
|
|||||||||||||||||||||||||||||||||||
hieptl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! 🙏
enyst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure "it cannot possibly be RUNNING"? I thought we have an autosave, though I could be wrong.
So I understand we may encounter a problem deserializing from RUNNING, but I'm not sure what the best solution is, could we maybe save it as something else, or avoid to save when running. Or set it as IDLE perhaps? The latter makes more sense to me, unless something prevents that.
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
| unmatched_actions = ConversationState.get_unmatched_actions(state.events) | ||
| if unmatched_actions: | ||
| first_action = unmatched_actions[0] | ||
| error_event = AgentErrorEvent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is there a reason why this isn't suitable?
| pending_actions = ConversationState.get_unmatched_actions(state.events) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Workflow is thus:
- Agent executes some tool call which leaks memory
- Agent server pod is evicted from K8 as a result of this.
- Pod restarts -
execution_statuswasrunningbut is nowerror. - User can prompt the agent to run again - but it will run the last Action which does not have an observation - resulting in a repeat of step 1.
After change
4. The action which crashed the pod now has an AgetnErrorObservation, letting the agent know not to run the same action again (Unless prompted with something like "Please try that again!")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thank you. We had this kind of reset in agent controller back in V0. Indeed I think AgentErrorEvent is correct... 🤔
The other question is, is this really server-specific? If the state is RUNNING and it's auto-saved, which I think it is, is there anything preventing it from happening on some who-knows-what stuck process on LocalConversation?
Summary
When a conversation is deserialized if the execution_status is
running, we set theexecution_statustoerror- because this means that the conversation stopped while executing some action - this most commonly means some sort of crash in a process started by the agent.The Web Frontend does not pick this up yet, but the
execution_statusdoes appear aserror, and theruntime_statusappears asSTATUS$ERRORTesting
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:bbc3277-pythonRun
All tags pushed for this build
About Multi-Architecture Support
bbc3277-python) is a multi-arch manifest supporting both amd64 and arm64bbc3277-python-amd64) are also available if needed