Release v1.7.2 #1530
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
🔄 Running Examples with
🧪 Integration Tests Results
Overall Success Rate: 98.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
🧪 Integration Tests Results
Overall Success Rate: 73.3%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_gpt_5.1_codex_max
Failed Tests:
litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
What the agent did well:
Critical issues with the agent's approach:
What should have happened:
The agent's core task (updating the constant and running tests) was completed correctly, but the execution pattern violated the evaluation criteria by over-verifying and continuing investigation beyond what was requested. (confidence=0.75) (Cost: $0.38)
litellm_proxy_deepseek_deepseek_chat
litellm_proxy_mistral_devstral_2512
Failed Tests:
The fact that the agent had to modify the ignored checkout to make the change take effect suggests there may have been a legitimate technical reason, but the user's instruction was explicit and should have been respected. The agent could have: (a) stayed only in /tmp/tmpsgfu7_pl and reported that the import system uses the other location, or (b) asked the user for clarification before violating the instruction. The core technical work was sound, but instruction compliance and adherence to the evaluation criteria were not. (confidence=0.85) (Cost: $0.38)
litellm_proxy_vertex_ai_gemini_3_pro_preview
Failed Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
The evaluation criteria explicitly warned: "Verify that the agent did not over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly." The agent violated this by running broader test suites and adding custom verification that was not requested. The core task was completed correctly (constant updated, existing tests pass), but the execution method was excessive and did not follow the specified constraints about verification scope and stopping appropriately. (confidence=0.85) (Cost: $0.31)
The user's explicit request was: "Can you help me take a look at the codebase and relevant files carefully and help me implement that training script?" They asked for help implementing the training script, not for documentation files. The evaluation criteria explicitly state: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." However, the agent created TWO additional markdown files:
The main training script itself (
But the creation of additional markdown files that weren't requested violates the constraint about not creating redundant files. The README could arguably be considered helpful documentation, but the IMPLEMENTATION_SUMMARY.md is clearly beyond the scope of what was asked. (confidence=0.75) (Cost: $1.15)
Evaluation Triggered
Oh, do I see this right? I tried to have OH run the last examples we added "manually" the other day (the hooks example, actually, and then gemini/gpt-5), and it seemed to me that it succeeded. But... I have no idea why this doesn't seem to finish.
Release v1.7.2
This PR prepares the release for version 1.7.2.
Release Checklist
- Run integration tests (integration-test)
- Run behavior tests (behavior-test)
- Run examples (test-examples)
- Create tag v1.7.2 (see sketch below)
- Create release branch rel-1.7.2
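For reference, the tagging step in this checklist can be sketched as follows, assuming the rel-1.7.2 branch already exists and the tag is pushed manually (the project's actual release automation may handle this differently):

```bash
# Hedged sketch: create and push the release tag from the release branch
git checkout rel-1.7.2
git tag v1.7.2
git push origin v1.7.2
```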
Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
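Once the packages are live, consumers can pin the released version. A minimal sketch, assuming the SDK is distributed on PyPI as openhands-sdk (the actual distribution name may differ):

```bash
# Hedged sketch: install the 1.7.2 release from PyPI
pip install openhands-sdk==1.7.2
```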
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
```bash
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:f803eb2-python
```

Run
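The original content of the Run section did not survive extraction. As a minimal sketch, assuming the image's default entrypoint starts the agent server (all flags other than the image tag are illustrative assumptions):

```bash
# Hedged sketch: run the pulled Python variant interactively
docker run --rm -it ghcr.io/openhands/agent-server:f803eb2-python
```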
All tags pushed for this build
About Multi-Architecture Support
- The default tag (f803eb2-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (f803eb2-python-amd64) are also available if needed
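Because the architecture-specific tags are ordinary image tags, they can be pulled directly when a single-architecture image is preferred over the multi-arch manifest:

```bash
# Pull the amd64-specific image instead of the multi-arch manifest
docker pull ghcr.io/openhands/agent-server:f803eb2-python-amd64
```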