simonrosenberg (Collaborator) commented on Jan 1, 2026

Summary

Update integration tests to use the default agent preset from openhands.tools.preset.default so tests validate the same agent configuration shipped to production (GUI/CLI).

Changes:

  • BaseIntegrationTest now uses get_default_tools() and get_default_condenser() by default, with an enable_browser property for tests that need browser support (see the sketch below)
  • Functional tests (t01-t08) now inherit default tools from base class, removing redundant tool registration
  • t05 and t06 (browsing tests) enable browser via enable_browser property
  • t09 keeps custom condenser for token-based condensation testing (special case)
  • behavior_helpers.py uses get_default_tools() for behavior tests
  • b05 uses default tools from base class

This ensures integration tests validate the exact agent configuration that will be used in production.
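
The sketch below illustrates the pattern described above: the base class pulls its tools and condenser from the default preset, and individual tests opt into the browser by overriding a property. It is a minimal stand-alone illustration, not the actual harness code; the stubbed get_default_tools()/get_default_condenser() stand in for the real helpers in openhands.tools.preset.default, whose signatures and return types may differ, and all other names and tool identifiers are hypothetical.

# Minimal stand-alone sketch of the pattern, not the real harness. The two
# preset helpers are stubbed here; in the actual tests they are imported
# from openhands.tools.preset.default and their signatures may differ.
from dataclasses import dataclass, field


def get_default_tools(enable_browser: bool = False) -> list[str]:
    # Stub for the default tool preset shipped to production (illustrative names).
    tools = ["terminal", "file_editor", "task_tracker"]
    if enable_browser:
        tools.append("browser")
    return tools


def get_default_condenser() -> str:
    # Stub for the default condenser preset.
    return "default-condenser"


@dataclass
class BaseIntegrationTest:
    """Every test gets the production preset unless it explicitly overrides it."""

    tools: list[str] = field(default_factory=list)
    condenser: str | None = None

    @property
    def enable_browser(self) -> bool:
        # No browser by default; browsing tests (t05, t06) override this.
        return False

    def setup_agent(self) -> None:
        self.tools = get_default_tools(enable_browser=self.enable_browser)
        # t09 replaces this with a custom condenser for token-based condensation.
        self.condenser = get_default_condenser()


class SimpleBrowsingTest(BaseIntegrationTest):
    """Sketch of a browsing test (t05/t06 style) opting into the browser tool."""

    @property
    def enable_browser(self) -> bool:
        return True


if __name__ == "__main__":
    test = SimpleBrowsingTest()
    test.setup_agent()
    assert "browser" in test.tools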

Fixes #372

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
    • The changes update existing integration tests to use the default preset - no new tests needed as the existing tests now validate the production configuration
  • If there is an example, have you run the example to make sure that it works?
    • N/A - this PR modifies test infrastructure
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
    • Verified imports work correctly and pre-commit hooks pass
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
    • N/A - this is an internal test infrastructure change
  • Is the GitHub CI passing?
    • Awaiting CI results



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d2506a1-python

Run

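# Start the python variant interactively; --rm removes the container on exit
# and -p 8000:8000 publishes the agent server's port on localhost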
docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-d2506a1-python \
  ghcr.io/openhands/agent-server:d2506a1-python

All tags pushed for this build

ghcr.io/openhands/agent-server:d2506a1-golang-amd64
ghcr.io/openhands/agent-server:d2506a1-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:d2506a1-golang-arm64
ghcr.io/openhands/agent-server:d2506a1-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:d2506a1-java-amd64
ghcr.io/openhands/agent-server:d2506a1-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:d2506a1-java-arm64
ghcr.io/openhands/agent-server:d2506a1-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:d2506a1-python-amd64
ghcr.io/openhands/agent-server:d2506a1-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:d2506a1-python-arm64
ghcr.io/openhands/agent-server:d2506a1-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:d2506a1-golang
ghcr.io/openhands/agent-server:d2506a1-java
ghcr.io/openhands/agent-server:d2506a1-python

About Multi-Architecture Support

  • Each variant tag (e.g., d2506a1-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., d2506a1-python-amd64) are also available if needed

@xingyaoww added the integration-test label (Runs the integration tests and comments the results) on Jan 2, 2026

github-actions bot commented Jan 2, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.


github-actions bot commented Jan 2, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.0%
Total Cost: $2.53
Models Tested: 6
Timestamp: 2026-01-02 18:10:07 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_mistral_devstral_2512 | 75.0% | 75.0% | N/A | 6/8 | 1 | 9 | $0.59 | 1,441,140 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.20 | 304,478 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.63 | 382,177 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 595,997 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.86 | 865,925 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.19 | 360,650 |

📋 Detailed Results

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 75.0% (6/8)
  • Integration Tests (Required): 75.0% (6/9)
  • Total Cost: $0.59
  • Token Usage: prompt: 1,435,188, completion: 5,952
  • Run Suffix: litellm_proxy_mistral_devstral_2512_df03e9c_devstral_2512_run_N9_20260102_180259
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.01)
  • t05_simple_browsing ⚠️ REQUIRED: Agent did not find the answer. Response: ... (Cost: $0.37)

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.20
  • Token Usage: prompt: 293,534, completion: 10,944, cache_read: 228,352
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_df03e9c_kimi_k2_run_N9_20260102_180259
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.63
  • Token Usage: prompt: 359,621, completion: 22,556, cache_read: 198,493, reasoning: 15,864
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_df03e9c_gemini_3_pro_run_N9_20260102_180259

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 586,289, completion: 9,708, cache_read: 541,696
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_df03e9c_deepseek_run_N9_20260102_180300
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.86
  • Token Usage: prompt: 852,224, completion: 13,701, cache_read: 736,965, cache_write: 114,204, reasoning: 4,193
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_df03e9c_sonnet_run_N9_20260102_180300

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 356,730, completion: 3,920, cache_read: 259,840, reasoning: 1,536
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_df03e9c_gpt51_codex_run_N9_20260102_180300
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.
