
Collaborator

@tofarr tofarr commented Dec 30, 2025

Summary

When a conversation is deserialized with an execution_status of running, we set the execution_status to error. A running status at load time means the conversation stopped while executing some action, most commonly because a process started by the agent crashed.

The Web Frontend does not pick this up yet, but the execution_status does appear as error, and the runtime_status appears as STATUS$ERROR
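
As a rough illustration of the behaviour described above, here is a minimal sketch in Python. It is not the code from this PR: `ExecutionStatus`, `StoredConversation`, and `load_conversation` are hypothetical stand-ins for the SDK's real types, and the persisted state is assumed to be a plain dict.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionStatus(str, Enum):
    # Hypothetical stand-in for the SDK's execution status enum.
    IDLE = "idle"
    RUNNING = "running"
    ERROR = "error"


@dataclass
class StoredConversation:
    # Hypothetical, heavily simplified view of the persisted state.
    id: str
    execution_status: ExecutionStatus


def load_conversation(raw: dict) -> StoredConversation:
    """Deserialize persisted state, turning a stale RUNNING status into ERROR."""
    status = ExecutionStatus(raw["execution_status"])
    if status is ExecutionStatus.RUNNING:
        # A RUNNING status found at load time means the previous process
        # stopped while an action was still executing.
        status = ExecutionStatus.ERROR
    return StoredConversation(id=raw["id"], execution_status=status)
```

Any other status passes through unchanged, so only conversations that were interrupted mid-action are affected.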


Testing

  • Start an agent server instance with

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the github CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bbc3277-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bbc3277-python \
  ghcr.io/openhands/agent-server:bbc3277-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bbc3277-golang-amd64
ghcr.io/openhands/agent-server:bbc3277-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bbc3277-golang-arm64
ghcr.io/openhands/agent-server:bbc3277-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bbc3277-java-amd64
ghcr.io/openhands/agent-server:bbc3277-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bbc3277-java-arm64
ghcr.io/openhands/agent-server:bbc3277-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bbc3277-python-amd64
ghcr.io/openhands/agent-server:bbc3277-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:bbc3277-python-arm64
ghcr.io/openhands/agent-server:bbc3277-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:bbc3277-golang
ghcr.io/openhands/agent-server:bbc3277-java
ghcr.io/openhands/agent-server:bbc3277-python

About Multi-Architecture Support

  • Each variant tag (e.g., bbc3277-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., bbc3277-python-amd64) are also available if needed

tofarr and others added 2 commits December 30, 2025 14:26
Test that a conversation with RUNNING execution_status becomes ERROR
when resumed/restarted. This verifies the fix that prevents conversations
from incorrectly remaining in RUNNING state after a crash or unexpected
termination.

Co-authored-by: openhands <openhands@all-hands.dev>
Contributor

github-actions bot commented Dec 30, 2025

Coverage

Coverage Report
File Stmts Miss Cover Missing
openhands-agent-server/openhands/agent_server
   conversation_service.py 336 209 37% 64, 67, 78–79, 82–85, 87, 91, 93, 96–103, 106–107, 110–114, 117–119, 121–124, 126, 133–134, 136–138, 141, 145, 147, 149, 156, 162, 170–171, 180–183, 192, 201, 206–207, 210, 223–224, 242, 245, 256–260, 262–265, 268–273, 276–279, 281–283, 286, 289–291, 296–299, 307, 312–314, 328–332, 335, 337, 340–342, 344, 348, 352, 359–363, 366–367, 371–375, 378–379, 383–387, 390–391, 397–402, 409–410, 414, 416–417, 422–423, 429–430, 436–438, 456, 480, 508, 510–511, 537, 539, 541–544, 549, 551–552, 556–557, 559–560, 563–565, 568, 574, 579–582, 589–590, 594–598, 600, 605, 609–611, 615–616, 618–620, 622, 624, 637–639, 642, 645, 648–651, 658–659, 663–665, 668–669, 671
   event_service.py 314 158 49% 55–56, 75–77, 81–86, 89–92, 107, 123, 127, 131–132, 139, 141, 148–149, 157–160, 167–169, 186, 210–211, 214–215, 217–219, 221, 226, 229–230, 233–235, 238, 242–244, 246, 248, 259–262, 275–276, 279–280, 283, 286–288, 291–292, 295–296, 300, 303, 307, 311–312, 314, 331–332, 349, 351, 355–357, 361, 370–371, 373, 377, 383, 385, 393–398, 447, 449–452, 461, 477, 484, 488, 499–500, 510–513, 515–516, 520, 522, 526–529, 534–536, 538, 542–545, 549–552, 560–563, 582–583, 585–592, 594–595, 604–605, 607–608, 615–616, 618–619, 623, 629, 639–640, 647
openhands-sdk/openhands/sdk/llm
   llm.py 420 157 62% 359, 364, 368, 372–373, 376, 380–381, 392–393, 395–396, 400, 417, 435–438, 485, 515–517, 538, 542, 557, 563–564, 588–589, 599, 624–629, 650–651, 654, 658, 670, 675–678, 687, 695–702, 706–709, 711, 724, 728–729, 731–732, 737–738, 740, 747, 750–755, 812–817, 874–875, 878–881, 923, 940, 994, 997, 1000–1008, 1012–1014, 1017, 1020–1022, 1029–1030, 1039, 1046–1048, 1052, 1054–1059, 1061–1078, 1081–1085, 1087–1088, 1094–1103, 1116, 1130, 1135
TOTAL 14512 6871 52%

@tofarr tofarr marked this pull request as ready for review December 30, 2025 23:17
Contributor

@hieptl hieptl left a comment

Thank you! 🙏

Collaborator

@enyst enyst left a comment

Are we sure "it cannot possibly be RUNNING"? I thought we have an autosave, though I could be wrong.

So I understand we may encounter a problem deserializing from RUNNING, but I'm not sure what the best solution is. Could we maybe save it as something else, or avoid saving while running? Or perhaps set it to IDLE? The latter makes more sense to me, unless something prevents that.

openhands-ai bot commented Jan 3, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1554 at branch `fix-restarted-conversation`

Feel free to include any additional details that might help me get this PR into a better state.

unmatched_actions = ConversationState.get_unmatched_actions(state.events)
if unmatched_actions:
    first_action = unmatched_actions[0]
    error_event = AgentErrorEvent(
Collaborator

Out of curiosity, is there a reason why this isn't suitable?

pending_actions = ConversationState.get_unmatched_actions(state.events)

Collaborator Author

The workflow is as follows:

  1. Agent executes some tool call which leaks memory.
  2. Agent server pod is evicted from K8s as a result.
  3. Pod restarts. execution_status was running but is now error.
  4. User can prompt the agent to run again, but it will run the last Action which does not have an observation, resulting in a repeat of step 1.

After the change, step 4 becomes:

  4. The action which crashed the pod now has an AgentErrorEvent, letting the agent know not to run the same action again (unless prompted with something like "Please try that again!").
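
To make the effect of that last step concrete, here is a small self-contained sketch in Python. It only mirrors the idea. `Action`, `Observation`, `ErrorEvent`, `unmatched_actions`, and `close_interrupted_action` are hypothetical names invented for this example, and the real `ConversationState.get_unmatched_actions` and `AgentErrorEvent` in the SDK may differ in fields and behaviour.

```python
from dataclasses import dataclass


@dataclass
class Action:
    # Hypothetical event: a tool call issued by the agent.
    id: str
    tool: str


@dataclass
class Observation:
    # Hypothetical event: the result of a completed action.
    action_id: str
    content: str


@dataclass
class ErrorEvent:
    # Hypothetical event: records that an action failed or was interrupted.
    action_id: str
    error: str


def unmatched_actions(events: list) -> list:
    """Return actions that have neither an observation nor an error event."""
    answered = {
        e.action_id for e in events if isinstance(e, (Observation, ErrorEvent))
    }
    return [e for e in events if isinstance(e, Action) and e.id not in answered]


def close_interrupted_action(events: list) -> list:
    """Append an ErrorEvent for the first action that never got a result."""
    pending = unmatched_actions(events)
    if not pending:
        return list(events)
    first = pending[0]
    error = ErrorEvent(
        action_id=first.id,
        error="Action was interrupted before it produced a result.",
    )
    return list(events) + [error]
```

With the error event in place, the interrupted action no longer counts as pending, so resuming the conversation does not silently re-run it.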

Collaborator

Got it, thank you. We had this kind of reset in agent controller back in V0. Indeed I think AgentErrorEvent is correct... 🤔

The other question is, is this really server-specific? If the state is RUNNING and it's auto-saved, which I think it is, is there anything preventing it from happening on some who-knows-what stuck process on LocalConversation?

@tofarr tofarr merged commit 8fb2354 into main Jan 4, 2026
21 checks passed
@tofarr tofarr deleted the fix-restarted-conversation branch January 4, 2026 02:59
5 participants