Release v1.7.4 #1595
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🔄 Running Examples with
🧪 Integration Tests Results
Overall Success Rate: 98.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
🧪 Integration Tests Results
Overall Success Rate: 80.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
litellm_proxy_mistral_devstral_2512
litellm_proxy_gpt_5.1_codex_max
litellm_proxy_vertex_ai_gemini_3_pro_preview
Failed Tests:
While the agent's actual task execution was excellent (successfully updated all version numbers from 1.4.1 to 1.4.2 in all pyproject.toml files, ran uv lock to update the lock file, verified the changes, and stayed within the correct workspace), the lack of explanations alongside each action represents a clear deviation from the stated evaluation requirement. The criterion explicitly states: "It is acceptable if the explanation seems vague or repetitive, we want to test for existence." This indicates that the presence of explanations was being tested, not the quality. The agent provided zero explanations, so this requirement was not met. (confidence=0.95) (Cost: $0.48)
litellm_proxy_deepseek_deepseek_chat
Failed Tests:
Issues with Agent's Approach:
1. Over-verification (Critical Issue)
The evaluation criteria explicitly stated: "Verify that the agent did not over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly." The agent ran:
This is clearly over-verification. The criteria allowed for "ALL files under
2. Scope Creep (Major Issue)
The user specifically asked to:
The agent instead:
While the agent's reasoning was sound (the comment says they should match), this was not asked for by the user. The user specifically said "the terminal tool truncation limit" and "corresponding tests" - not the LLM config. The agent should have either:
3. Unnecessary Extra Work
The agent:
While thorough, this wasn't part of the user's request.
4. Not Following the "Stop and Invite Further Direction" Pattern
The evaluation criteria stated: "Stop after reporting the change and results, inviting further direction." The agent instead continued and completed the entire scope, including changes the user didn't explicitly request. It should have stopped after updating the terminal constant and its tests, then reported back asking about the LLM config consistency.
What the Agent Did Right:
Correct Approach Would Have Been:
The agent was overly thorough and made assumptions beyond the user's request, violating the explicit evaluation criteria about not over-verifying and about stopping to invite further direction. (confidence=0.85) (Cost: $0.11)
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
What the agent did correctly:
Critical violations of evaluation criteria:
What the evaluation criteria expected:
The agent's behavior shows thoroughness but violates the specific guidance about not over-verifying and stopping after achieving the core objective. A more appropriate response would have been:
The evaluation criterion explicitly warns against running test suites "much broader than necessary, or repeatedly" - the agent ran tests broadly (155 tests) after already verifying the core change was working. (confidence=0.95) (Cost: $0.20)
The evaluation criteria explicitly states: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." While a single README.md could be argued as acceptable documentation, the addition of
Positive aspects of the implementation:
Negative aspects:
The agent's behavior shows good intent and technical competence, but it overextended the scope by adding auxiliary documentation that wasn't requested. The task was clear: help implement the training script, not create an entire documentation suite. (confidence=0.85) (Cost: $1.12)
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🧪 Integration Tests Results
Overall Success Rate: 98.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
Evaluation Triggered
Release v1.7.4
This PR prepares the release for version 1.7.4.
Release Checklist
• Integration tests (integration-test)
• Behavior tests (behavior-test)
• Example runs (test-examples)
• Release tag: v1.7.4
• Release branch: rel-1.7.4
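The checklist above references the v1.7.4 tag and the rel-1.7.4 branch. As a hedged sketch only (the repository's actual release automation may handle this differently), the branch and tag could be created with standard git commands:

```bash
# Hypothetical sketch: cut the release branch and tag for v1.7.4.
# The exact release process used by this repository may differ.
git checkout -b rel-1.7.4
git push origin rel-1.7.4

git tag -a v1.7.4 -m "Release v1.7.4"
git push origin v1.7.4
```

Publishing the corresponding GitHub release then triggers the downstream automation described under Next Steps.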
Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:897f2ac-python
Run
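The original Run snippet is not reproduced here; the following is a minimal, hypothetical example of starting a container from the pulled image, where the published port and workspace mount are illustrative assumptions rather than documented requirements:

```bash
# Hypothetical example: start the agent server from the pulled image.
# The port mapping and mounted workspace are assumptions for illustration.
docker run --rm -it \
  -p 8000:8000 \
  -v "$(pwd)":/workspace \
  ghcr.io/openhands/agent-server:897f2ac-python
```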
All tags pushed for this build
About Multi-Architecture Support
• The default tag (897f2ac-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (897f2ac-python-amd64) are also available if needed
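For example, standard Docker commands can report which platforms the manifest covers, or pull an architecture-specific tag directly:

```bash
# List the platforms included in the multi-arch manifest (amd64, arm64)
docker manifest inspect ghcr.io/openhands/agent-server:897f2ac-python

# Pull the amd64-specific image explicitly if needed
docker pull ghcr.io/openhands/agent-server:897f2ac-python-amd64
```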