simonrosenberg (Collaborator) commented on Jan 1, 2026

Summary

Update integration tests to use the default agent preset from openhands.tools.preset.default so tests validate the same agent configuration shipped to production (GUI/CLI).

Changes:

  • BaseIntegrationTest now uses get_default_tools() and get_default_condenser() by default, with an enable_browser property for tests that need browser support (see the sketch below)
  • Functional tests (t01-t08) now inherit default tools from base class, removing redundant tool registration
  • t05 and t06 (browsing tests) enable browser via enable_browser property
  • t09 keeps custom condenser for token-based condensation testing (special case)
  • behavior_helpers.py uses get_default_tools() for behavior tests
  • b05 uses default tools from base class

This ensures integration tests validate the exact agent configuration that will be used in production.
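
The sketch below illustrates the pattern described above: the base class pulls its tools and condenser from the default preset, and individual tests opt into the browser by overriding a property. It is a minimal stand-alone illustration, not the actual harness code; the stubbed get_default_tools()/get_default_condenser() stand in for the real helpers in openhands.tools.preset.default, whose signatures and return types may differ, and all other names and tool identifiers are hypothetical.

# Minimal stand-alone sketch of the pattern, not the real harness. The two
# preset helpers are stubbed here; in the actual tests they are imported
# from openhands.tools.preset.default and their signatures may differ.
from dataclasses import dataclass, field


def get_default_tools(enable_browser: bool = False) -> list[str]:
    # Stub for the default tool preset shipped to production (illustrative names).
    tools = ["terminal", "file_editor", "task_tracker"]
    if enable_browser:
        tools.append("browser")
    return tools


def get_default_condenser() -> str:
    # Stub for the default condenser preset.
    return "default-condenser"


@dataclass
class BaseIntegrationTest:
    """Every test gets the production preset unless it explicitly overrides it."""

    tools: list[str] = field(default_factory=list)
    condenser: str | None = None

    @property
    def enable_browser(self) -> bool:
        # No browser by default; browsing tests (t05, t06) override this.
        return False

    def setup_agent(self) -> None:
        self.tools = get_default_tools(enable_browser=self.enable_browser)
        # t09 replaces this with a custom condenser for token-based condensation.
        self.condenser = get_default_condenser()


class SimpleBrowsingTest(BaseIntegrationTest):
    """Sketch of a browsing test (t05/t06 style) opting into the browser tool."""

    @property
    def enable_browser(self) -> bool:
        return True


if __name__ == "__main__":
    test = SimpleBrowsingTest()
    test.setup_agent()
    assert "browser" in test.tools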

Fixes #372

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
    • The changes update existing integration tests to use the default preset - no new tests needed as the existing tests now validate the production configuration
  • If there is an example, have you run the example to make sure that it works?
    • N/A - this PR modifies test infrastructure
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
    • Verified imports work correctly and pre-commit hooks pass
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
    • N/A - this is an internal test infrastructure change
  • Is the GitHub CI passing?
    • Awaiting CI results



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d2506a1-python

Run

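# Start the python variant interactively; --rm removes the container on exit
# and -p 8000:8000 publishes the agent server's port on localhost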
docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-d2506a1-python \
  ghcr.io/openhands/agent-server:d2506a1-python

All tags pushed for this build

ghcr.io/openhands/agent-server:d2506a1-golang-amd64
ghcr.io/openhands/agent-server:d2506a1-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:d2506a1-golang-arm64
ghcr.io/openhands/agent-server:d2506a1-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:d2506a1-java-amd64
ghcr.io/openhands/agent-server:d2506a1-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:d2506a1-java-arm64
ghcr.io/openhands/agent-server:d2506a1-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:d2506a1-python-amd64
ghcr.io/openhands/agent-server:d2506a1-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:d2506a1-python-arm64
ghcr.io/openhands/agent-server:d2506a1-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:d2506a1-golang
ghcr.io/openhands/agent-server:d2506a1-java
ghcr.io/openhands/agent-server:d2506a1-python

About Multi-Architecture Support

  • Each variant tag (e.g., d2506a1-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., d2506a1-python-amd64) are also available if needed

@xingyaoww added the integration-test label (Runs the integration tests and comments the results) on Jan 2, 2026

github-actions bot commented Jan 2, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.


github-actions bot commented Jan 2, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.0%
Total Cost: $2.53
Models Tested: 6
Timestamp: 2026-01-02 18:10:07 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_mistral_devstral_2512 | 75.0% | 75.0% | N/A | 6/8 | 1 | 9 | $0.59 | 1,441,140 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.20 | 304,478 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.63 | 382,177 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 595,997 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.86 | 865,925 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.19 | 360,650 |

📋 Detailed Results

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 75.0% (6/8)
  • Integration Tests (Required): 75.0% (6/9)
  • Total Cost: $0.59
  • Token Usage: prompt: 1,435,188, completion: 5,952
  • Run Suffix: litellm_proxy_mistral_devstral_2512_df03e9c_devstral_2512_run_N9_20260102_180259
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.01)
  • t05_simple_browsing ⚠️ REQUIRED: Agent did not find the answer. Response: ... (Cost: $0.37)

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.20
  • Token Usage: prompt: 293,534, completion: 10,944, cache_read: 228,352
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_df03e9c_kimi_k2_run_N9_20260102_180259
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.63
  • Token Usage: prompt: 359,621, completion: 22,556, cache_read: 198,493, reasoning: 15,864
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_df03e9c_gemini_3_pro_run_N9_20260102_180259

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 586,289, completion: 9,708, cache_read: 541,696
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_df03e9c_deepseek_run_N9_20260102_180300
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.86
  • Token Usage: prompt: 852,224, completion: 13,701, cache_read: 736,965, cache_write: 114,204, reasoning: 4,193
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_df03e9c_sonnet_run_N9_20260102_180300

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 356,730, completion: 3,920, cache_read: 259,840, reasoning: 1,536
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_df03e9c_gpt51_codex_run_N9_20260102_180300
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.
