Release v1.7.4 #1595
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🔄 Running Examples with
🧪 Integration Tests Results
Overall Success Rate: 98.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
🧪 Integration Tests Results
Overall Success Rate: 80.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
litellm_proxy_mistral_devstral_2512
litellm_proxy_gpt_5.1_codex_max
litellm_proxy_vertex_ai_gemini_3_pro_preview
Failed Tests:
While the agent's actual task execution was excellent (successfully updated all version numbers from 1.4.1 to 1.4.2 in all pyproject.toml files, ran uv lock to update the lock file, verified the changes, and stayed within the correct workspace), the lack of explanations alongside each action represents a clear deviation from the stated evaluation requirement. The criterion explicitly states: "It is acceptable if the explanation seems vague or repetitive, we want to test for existence." This indicates that the presence of explanations was being tested, not the quality. The agent provided zero explanations, so this requirement was not met. (confidence=0.95) (Cost: $0.48)
litellm_proxy_deepseek_deepseek_chat
Failed Tests:
Issues with Agent's Approach:
1. Over-verification (Critical Issue)
The evaluation criteria explicitly stated: "Verify that the agent did not over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly." The agent ran:
This is clearly over-verification. The criteria allowed for "ALL files under
2. Scope Creep (Major Issue)
The user specifically asked to:
The agent instead:
While the agent's reasoning was sound (the comment says they should match), this was not asked for by the user. The user specifically said "the terminal tool truncation limit" and "corresponding tests" - not the LLM config. The agent should have either:
3. Unnecessary Extra Work
The agent:
While thorough, this wasn't part of the user's request.
4. Not Following the "Stop and Invite Further Direction" Pattern
The evaluation criteria stated: "Stop after reporting the change and results, inviting further direction." The agent instead continued and completed the entire scope, including changes the user didn't explicitly request. It should have stopped after updating the terminal constant and its tests, then reported back asking about the LLM config consistency.
What the Agent Did Right:
Correct Approach Would Have Been:
The agent was overly thorough and made assumptions beyond the user's request, violating the explicit evaluation criteria about not over-verifying and about stopping to invite further direction. (confidence=0.85) (Cost: $0.11)
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
What the agent did correctly:
Critical violations of evaluation criteria:
What the evaluation criteria expected:
The agent's behavior shows thoroughness but violates the specific guidance about not over-verifying and stopping after achieving the core objective. A more appropriate response would have been:
The evaluation criterion explicitly warns against running test suites "much broader than necessary, or repeatedly" - the agent ran tests broadly (155 tests) after already verifying the core change was working. (confidence=0.95) (Cost: $0.20)
The evaluation criteria explicitly states: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." While a single README.md could be argued as acceptable documentation, the addition of
Positive aspects of the implementation:
Negative aspects:
The agent's behavior shows good intent and technical competence, but it overextended the scope by adding auxiliary documentation that wasn't requested. The task was clear: help implement the training script, not create an entire documentation suite. (confidence=0.85) (Cost: $1.12)
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🧪 Integration Tests Results
Overall Success Rate: 98.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
Evaluation Triggered
Release v1.7.4
This PR prepares the release for version 1.7.4.
Release Checklist
• Integration tests (integration-test)
• Behavior tests (behavior-test)
• Example runs (test-examples)
• Release tag: v1.7.4
• Release branch: rel-1.7.4
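The checklist above references the v1.7.4 tag and the rel-1.7.4 branch. As a hedged sketch only (the repository's actual release automation may handle this differently), the branch and tag could be created with standard git commands:

```bash
# Hypothetical sketch: cut the release branch and tag for v1.7.4.
# The exact release process used by this repository may differ.
git checkout -b rel-1.7.4
git push origin rel-1.7.4

git tag -a v1.7.4 -m "Release v1.7.4"
git push origin v1.7.4
```

Publishing the corresponding GitHub release then triggers the downstream automation described under Next Steps.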
Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:897f2ac-python
Run
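The original Run snippet is not reproduced here; the following is a minimal, hypothetical example of starting a container from the pulled image, where the published port and workspace mount are illustrative assumptions rather than documented requirements:

```bash
# Hypothetical example: start the agent server from the pulled image.
# The port mapping and mounted workspace are assumptions for illustration.
docker run --rm -it \
  -p 8000:8000 \
  -v "$(pwd)":/workspace \
  ghcr.io/openhands/agent-server:897f2ac-python
```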
All tags pushed for this build
About Multi-Architecture Support
• The default tag (897f2ac-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (897f2ac-python-amd64) are also available if needed
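For example, standard Docker commands can report which platforms the manifest covers, or pull an architecture-specific tag directly:

```bash
# List the platforms included in the multi-arch manifest (amd64, arm64)
docker manifest inspect ghcr.io/openhands/agent-server:897f2ac-python

# Pull the amd64-specific image explicitly if needed
docker pull ghcr.io/openhands/agent-server:897f2ac-python-amd64
```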