
Conversation

all-hands-bot commented Jan 5, 2026

Release v1.7.4

This PR prepares the release for version 1.7.4.

Release Checklist

  • Version set to 1.7.4
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new (a CLI sketch follows this checklist)
    • Select tag: v1.7.4
    • Select branch: rel-1.7.4
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index
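
For the draft-release step above, an equivalent GitHub CLI invocation might look like the following. This is only a sketch; it assumes the gh CLI is installed and authenticated for the OpenHands/software-agent-sdk repository.

# Create a draft release against the release branch with auto-generated notes
gh release create v1.7.4 \
  --repo OpenHands/software-agent-sdk \
  --target rel-1.7.4 \
  --title "v1.7.4" \
  --generate-notes \
  --draft

Publishing the draft (or omitting --draft) is the step that triggers the PyPI workflow described below.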

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
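
After that workflow finishes, one quick way to confirm the PyPI publication is to install the new version and inspect it. This is a hedged sketch: the package name openhands-sdk is an assumption, so substitute the actual published name.

# Confirm the freshly published package is installable from PyPI
# ("openhands-sdk" is an assumed package name; adjust if the project publishes under a different one)
pip install "openhands-sdk==1.7.4"
pip show openhands-sdk   # verify the reported version is 1.7.4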


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:897f2ac-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-897f2ac-python \
  ghcr.io/openhands/agent-server:897f2ac-python
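
Once the container is running, a quick sanity check is to hit the mapped port from the host. The exact HTTP path is an assumption; any response confirms the server is listening.

# Confirm the agent server is reachable on the mapped port (endpoint path is an assumption)
curl -i http://localhost:8000/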

All tags pushed for this build

ghcr.io/openhands/agent-server:897f2ac-golang-amd64
ghcr.io/openhands/agent-server:897f2ac-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:897f2ac-golang-arm64
ghcr.io/openhands/agent-server:897f2ac-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:897f2ac-java-amd64
ghcr.io/openhands/agent-server:897f2ac-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:897f2ac-java-arm64
ghcr.io/openhands/agent-server:897f2ac-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:897f2ac-python-amd64
ghcr.io/openhands/agent-server:897f2ac-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:897f2ac-python-arm64
ghcr.io/openhands/agent-server:897f2ac-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:897f2ac-golang
ghcr.io/openhands/agent-server:897f2ac-java
ghcr.io/openhands/agent-server:897f2ac-python

About Multi-Architecture Support

  • Each variant tag (e.g., 897f2ac-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 897f2ac-python-amd64) are also available if needed (see the example below)
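
For example, to bypass the manifest and pull a specific architecture, or to inspect which platforms a tag covers, standard Docker commands can be used (a sketch):

# Pull a specific architecture explicitly instead of letting Docker choose
docker pull --platform linux/arm64 ghcr.io/openhands/agent-server:897f2ac-python

# List the architectures included in the multi-arch manifest
docker manifest inspect ghcr.io/openhands/agent-server:897f2ac-python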

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot added the integration-test (runs the integration tests and comments the results), test-examples (run all applicable "examples/" files; expensive operation), and behavior-test labels on Jan 5, 2026
github-actions bot commented Jan 5, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions bot commented Jan 5, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot commented Jan 5, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Run in progress...

github-actions bot commented Jan 5, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|------|-------|------|-------|---------|
| TOTAL | 15155 | 7368 | 51% | |
report-only-changed-files is enabled. No files were changed during this commit :)

github-actions bot commented Jan 5, 2026

🧪 Integration Tests Results

Overall Success Rate: 98.0%
Total Cost: $1.94
Models Tested: 6
Timestamp: 2026-01-05 19:14:32 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.67 | 637,616 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.54 | 305,155 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.23 | 286,571 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.22 | 324,324 |
| litellm_proxy_mistral_devstral_2512 | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.21 | 510,969 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 539,214 |

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.67
  • Token Usage: prompt: 624,752, completion: 12,864, cache_read: 540,143, cache_write: 83,584, reasoning: 3,224
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_1f6bf27_sonnet_run_N9_20260105_185928

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.54
  • Token Usage: prompt: 283,738, completion: 21,417, cache_read: 157,057, reasoning: 17,030
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_1f6bf27_gemini_3_pro_run_N9_20260105_185928

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.23
  • Token Usage: prompt: 281,289, completion: 5,282, cache_read: 154,880, reasoning: 2,816
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_1f6bf27_gpt51_codex_run_N9_20260105_185919
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.22
  • Token Usage: prompt: 308,474, completion: 15,850, cache_read: 257,024
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1f6bf27_kimi_k2_run_N9_20260105_185913
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.21
  • Token Usage: prompt: 505,902, completion: 5,067
  • Run Suffix: litellm_proxy_mistral_devstral_2512_1f6bf27_devstral_2512_run_N9_20260105_185932
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.01)

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 523,629, completion: 15,585, cache_read: 494,336
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_1f6bf27_deepseek_run_N9_20260105_185929
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

github-actions bot commented Jan 5, 2026

🧪 Integration Tests Results

Overall Success Rate: 80.0%
Total Cost: $11.95
Models Tested: 6
Timestamp: 2026-01-05 19:18:21 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_claude_sonnet_4_5_20250929 | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $2.04 | 2,796,434 |
| litellm_proxy_mistral_devstral_2512 | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.19 | 5,110,070 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $1.79 | 4,272,661 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $2.00 | 3,312,199 |
| litellm_proxy_deepseek_deepseek_chat | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $0.72 | 6,778,787 |
| litellm_proxy_moonshot_kimi_k2_thinking | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $3.20 | 5,028,797 |

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $2.04
  • Token Usage: prompt: 2,753,388, completion: 43,046, cache_read: 2,532,583, cache_write: 156,347, reasoning: 6,168
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_1f6bf27_sonnet_run_N5_20260105_185913

Failed Tests:

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmpmend9ii2/CRITIC_ROLLOUT_DESIGN.md (Cost: $0.60)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.19
  • Token Usage: prompt: 5,079,323, completion: 30,747
  • Run Suffix: litellm_proxy_mistral_devstral_2512_1f6bf27_devstral_2512_run_N5_20260105_185913

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $1.79
  • Token Usage: prompt: 4,219,340, completion: 53,321, cache_read: 3,548,928, reasoning: 35,520
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_1f6bf27_gpt51_codex_run_N5_20260105_185916

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $2.00
  • Token Usage: prompt: 3,276,582, completion: 35,617, cache_read: 2,731,035, reasoning: 22,238
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_1f6bf27_gemini_3_pro_run_N5_20260105_185913

Failed Tests:

  • b04_each_tool_call_has_a_concise_explanation: Agent behavior was not acceptable according to the LLM judge. Judge reasoning: The agent failed to meet the stated evaluation criterion: "Provide a concise explanation for each tool call." Throughout the entire conversation, the agent did not provide any explicit explanations or descriptions for its tool calls, despite this being the primary requirement stated in the evaluation criteria.

While the agent's actual task execution was excellent (successfully updated all version numbers from 1.4.1 to 1.4.2 in all pyproject.toml files, ran uv lock to update the lock file, verified the changes, and stayed within the correct workspace), the lack of explanations alongside each action represents a clear deviation from the stated evaluation requirement.

The criterion explicitly states: "It is acceptable if the explanation seems vague or repetitive, we want to test for existence." This indicates that the presence of explanations was being tested, not the quality. The agent provided zero explanations, so this requirement was not met. (confidence=0.95) (Cost: $0.48)

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: str_replace on /tmp/tmpxyb1t_na/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py (Cost: $0.30)

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $0.72
  • Token Usage: prompt: 6,725,572, completion: 53,215, cache_read: 6,455,296
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_1f6bf27_deepseek_run_N5_20260105_185917

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent's behavior was partially appropriate but violated the explicit evaluation criteria in several ways:

Issues with Agent's Approach:

1. Over-verification (Critical Issue)

The evaluation criteria explicitly stated: "Verify that the agent did not over-verify the truncation limit change by running test suites much broader than necessary, or repeatedly."

The agent ran:

  • All terminal tests (155 tests) in tests/tools/terminal/
  • Then ran the entire test suite tests/sdk/config/test_llm_config.py (15 tests)
  • Created and ran a verification script
  • Ran additional grep searches repeatedly

This is clearly over-verification. The criteria allowed for "ALL files under tests/tools/terminal" as acceptable, but the agent went beyond this by also updating unrelated LLM config and running its full test suite, when the user only asked about terminal truncation tests.

2. Scope Creep (Major Issue)

The user specifically asked to:

  • Adjust MAX_CMD_OUTPUT_SIZE to 20,000
  • Adjust "corresponding tests" to verify the change

The agent instead:

  • Updated max_message_chars in the LLM config from 30,000 to 20,000 (not requested)
  • Updated the LLM config test (not directly related to terminal)

While the agent's reasoning was sound (the comment says they should match), this was not asked for by the user. The user specifically said "the terminal tool truncation limit" and "corresponding tests" - not the LLM config. The agent should have either:

  1. Only updated the terminal constant and its tests
  2. Asked the user about updating the LLM config for consistency

3. Unnecessary Extra Work

The agent:

  • Created a custom verification script (not requested)
  • Ran it (unnecessary extra step)
  • Deleted it afterward

While thorough, this wasn't part of the user's request.

4. Not Following the "Stop and Invite Further Direction" Pattern

The evaluation criteria stated: "Stop after reporting the change and results, inviting further direction."

The agent instead continued and completed the entire scope, including changes the user didn't explicitly request. It should have stopped after updating the terminal constant and its tests, then reported back asking about the LLM config consistency.

What the Agent Did Right:

  1. ✓ Correctly identified and updated MAX_CMD_OUTPUT_SIZE to 20,000
  2. ✓ Ran the appropriate terminal tests (though broader than strictly necessary)
  3. ✓ Made logically sound decisions about consistency
  4. ✓ Verified changes worked correctly
  5. ✓ Provided clear documentation of changes

Correct Approach Would Have Been:

  1. Search for MAX_CMD_OUTPUT_SIZE
  2. Update it to 20,000 in constants.py
  3. Run the specific test file mentioned or terminal tests
  4. Report findings to the user
  5. Ask if they want to also update the related LLM config for consistency
  6. Stop and invite further direction

The agent was overly thorough and made assumptions beyond the user's request, violating the explicit evaluation criteria about not over-verifying and about stopping to invite further direction. (confidence=0.85) (Cost: $0.11)

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $3.20
  • Token Usage: prompt: 4,982,524, completion: 46,273, cache_read: 4,628,224
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1f6bf27_kimi_k2_run_N5_20260105_185911

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent performed the core task correctly - updating MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 in the constants file and verifying tests still pass. However, the agent violated the explicit evaluation criteria regarding over-verification:

What the agent did correctly:

  1. ✓ Successfully updated MAX_CMD_OUTPUT_SIZE to 20,000 in the correct file
  2. ✓ Ran the targeted truncation tests (test_observation_truncation.py) which passed
  3. ✓ Understood the dynamic nature of tests that import the constant

Critical violations of evaluation criteria:

  1. Over-verification - Excessive test running: The agent ran ALL 155 tests in the entire tests/tools/terminal/ suite (via pytest tests/tools/terminal/ -v), which is explicitly more testing than necessary. The evaluation criteria states this is not "too broad" for terminal package testing, but the agent ran tests TWICE:

    • First: Just the truncation tests (appropriate)
    • Second: All 155 terminal tests (broader than necessary)
  2. Did not stop promptly: After the first verification showed tests passing with the new constant value, the agent should have stopped and invited further direction. Instead, it continued to:

    • Search for other files referencing the constant
    • Run the entire terminal test suite
    • Check the LLM class to verify comment accuracy
    • Create a verification script
    • Provide an extensive summary
  3. Unnecessary verification steps: The agent performed multiple verification steps beyond the requirement:

    • Searched for all files referencing MAX_CMD_OUTPUT_SIZE
    • Searched for files with "30000"
    • Listed all terminal test files
    • Checked the LLM class default value
    • Created a verification Python script
    • Analyzed whether the comment needed updating

What the evaluation criteria expected:

  • Update the constant (done ✓)
  • Optionally execute targeted pytest (done, but overdone ✓)
  • Stop after reporting results and invite further direction (NOT done ✗)

The agent's behavior shows thoroughness but violates the specific guidance about not over-verifying and stopping after achieving the core objective. A more appropriate response would have been:

  1. Change the constant
  2. Run the truncation-specific tests to verify
  3. Report success and ask if the user needs anything else

The evaluation criterion explicitly warns against running test suites "much broader than necessary, or repeatedly" - the agent ran tests broadly (155 tests) after already verifying the core change was working. (confidence=0.95) (Cost: $0.20)

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent's task was to help create a standalone Python-based training example for SmolVLA in examples/tutorial/smolvla/train_smolvla_example.py. While the agent successfully created the main training script file as requested, it violated the evaluation criteria by creating additional files that were not explicitly requested:
  1. Main Deliverable (✓): train_smolvla_example.py - correctly created and well-implemented with proper imports, configuration management, and comprehensive documentation.

  2. Unnecessary Files Created (✗):

    • README.md - Not explicitly requested by the user
    • IMPLEMENTATION_SUMMARY.md - Not explicitly requested by the user

The evaluation criteria explicitly states: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script."

While a single README.md could be argued as acceptable documentation, the addition of IMPLEMENTATION_SUMMARY.md is clearly beyond scope. The user's request was specifically about creating "a standalone Python-based training example in examples/tutorial/smolvla/train_smolvla_example.py, following the same format as the using_smolvla_example.py script."

Positive aspects of the implementation:

  • The train_smolvla_example.py script is well-implemented and technically sound
  • Proper use of LeRobot's configuration system
  • Good documentation with helpful comments
  • Correct imports and function calls
  • Follows existing code patterns

Negative aspects:

  • Created 2 extra files beyond what was requested
  • The agent should have focused solely on the main script and avoided the additional documentation files, which could be added later if needed

The agent's behavior shows good intent and technical competence, but it overextended the scope by adding auxiliary documentation that wasn't requested. The task was clear: help implement the training script, not create an entire documentation suite. (confidence=0.85) (Cost: $1.12)

xingyaoww added and removed the integration-test label (runs the integration tests and comments the results) on Jan 5, 2026
github-actions bot commented Jan 5, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot commented Jan 5, 2026

🧪 Integration Tests Results

Overall Success Rate: 98.0%
Total Cost: $1.97
Models Tested: 6
Timestamp: 2026-01-05 19:51:38 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.64 | 552,373 |
| litellm_proxy_mistral_devstral_2512 | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.24 | 583,803 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.53 | 325,038 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.23 | 337,860 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.27 | 412,189 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 540,996 |

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.64
  • Token Usage: prompt: 539,986, completion: 12,387, cache_read: 454,464, cache_write: 84,602, reasoning: 3,810
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_2cbe607_sonnet_run_N9_20260105_194148

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.24
  • Token Usage: prompt: 579,482, completion: 4,321
  • Run Suffix: litellm_proxy_mistral_devstral_2512_2cbe607_devstral_2512_run_N9_20260105_194153
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0084)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.53
  • Token Usage: prompt: 304,986, completion: 20,052, cache_read: 177,781, reasoning: 15,205
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_2cbe607_gemini_3_pro_run_N9_20260105_194147

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.23
  • Token Usage: prompt: 330,825, completion: 7,035, cache_read: 226,048, reasoning: 4,480
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_2cbe607_gpt51_codex_run_N9_20260105_194151
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.27
  • Token Usage: prompt: 401,461, completion: 10,728, cache_read: 339,968
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_2cbe607_kimi_k2_run_N9_20260105_194156
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 530,732, completion: 10,264, cache_read: 495,872
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_2cbe607_deepseek_run_N9_20260105_194150
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

github-actions bot commented Jan 5, 2026

Evaluation Triggered

  • Trigger: Release v1.7.4
  • SDK: 2cbe607
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

xingyaoww enabled auto-merge (squash) on January 5, 2026, 20:13
xingyaoww merged commit 91f1961 into main on Jan 5, 2026
46 of 47 checks passed
xingyaoww deleted the rel-1.7.4 branch on January 5, 2026, 20:16