
Conversation

@XuhuiZhou XuhuiZhou commented Apr 16, 2025

Closes #

📑 Description

✅ Checks

  • My pull request adheres to the code style of this project
  • My code requires changes to the documentation
  • I have updated the documentation as required
  • All the tests have passed
  • Branch name follows type/description (e.g. feature/add-llm-agents)
  • Ready for code review

ℹ Additional Information


Note

Adds custom-agent benchmarking via _benchmark_impl, introduces strict LLM base models and structured eval schemas, implements private message recipients, and streamlines generation/evaluator logic with updated docs/tests.

  • Benchmarking:
    • Introduce _benchmark_impl(models, agent_class, ...) to run benchmarks with custom agent classes; CLI benchmark wraps it with LLMAgent.
    • Support task filters (hard, cooperative, competitive), per-run tag, and batch reruns; write env model to ParallelSotopiaEnv.
    • Results now track agent_classes; display splits test/partner tables and JSONL uses new keys; average reward filters by agent_class.
  • Core Models & Evaluation:
    • Add LLMBaseModel (extra=forbid) and LLMEvalBaseModel; migrate message/env/eval models to these bases.
    • Replace tuple scores with structured {reasoning, score} types across SotopiaDimensions(Plus) and the custom-dimension builder; update validators and tests (a minimal illustrative sketch follows this note).
  • Generation/Evaluation:
    • Simplify temperature handling; remove custom temperature wrappers.
    • format_bad_output derives schema from parser; response schema strictness disabled for LLMEvalBaseModel types.
    • Evaluators use LLMEvalBaseModel; structured output auto-enabled when supported.
  • Environment & Agents:
    • Add private message recipients (AgentAction.to: list[str]); env renders per-viewer visibility; default masked/none actions include to=[] (also covered in the sketch after this note).
    • LLMAgent requests structured action output; Human/Redis agents return actions with to=[].
  • Database/Logs:
    • EpisodeLog gains agent_classes; include in episode existence checks and displays.
  • Docs/Examples:
    • Docs: guide to run benchmarks with custom agents (using _benchmark_impl).
    • Examples updated (model names, temps); .gitignore adds *.rdb.
  • Config/Tests:
    • Move dev deps to [dependency-groups]; pytest asyncio scope config.
    • Comprehensive test updates for new schemas, private messaging, and benchmarking flow.

Written by Cursor Bugbot for commit b8100ec.
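To make the schema changes above concrete, here is a minimal pydantic sketch of the strict base model, a structured score type, and the private-recipient field. Only LLMBaseModel, AgentAction, action_type, argument, and to are named in this PR; ScoreWithReasoning and the field descriptions are illustrative stand-ins, not the actual class definitions.

# Hypothetical, minimal sketch of the schema shapes described in the summary.
from pydantic import BaseModel, ConfigDict, Field


class LLMBaseModel(BaseModel):
    # extra="forbid" rejects unexpected keys coming back from the LLM
    model_config = ConfigDict(extra="forbid")


class ScoreWithReasoning(LLMBaseModel):  # illustrative name
    reasoning: str = Field(description="free-text justification for the score")
    score: int = Field(description="numeric score for this dimension")


class AgentAction(LLMBaseModel):
    action_type: str
    argument: str
    # Private message recipients; an empty list (the default for masked/none
    # actions) means the action is visible to every agent in the episode.
    to: list[str] = Field(default_factory=list)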

@cursor cursor bot left a comment


# subject to future OpenAI changes
- DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-4o-mini"
+ DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-5-mini-2025-08-07"

Bug: Invalid model name will cause API failures

The DEFAULT_BAD_OUTPUT_PROCESS_MODEL is set to "gpt-5-mini-2025-08-07" which is not a valid OpenAI model name. OpenAI does not have a GPT-5 model, and the naming convention doesn't match any known model. The previous value "gpt-4o-mini" was a valid model. This will cause all calls to format_bad_output to fail when they fall back to this default model for reformatting malformed LLM outputs.


f"benchmark_{model}_{partner_model}_{evaluator_model}_{task}_trial0"
if tag == ""
else tag
)

Bug: Tag variable reused incorrectly across model loop iterations

In the benchmark function, the tag variable is only set when it's empty (tag == ""). When looping over multiple models, the tag gets set on the first iteration and is never updated for subsequent models. This causes all models after the first to be tagged with the first model's tag, corrupting benchmark data by mixing episodes from different models under the same tag and preventing correct retrieval of model-specific results.
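One possible shape of a fix, expressed as a hypothetical helper rather than the actual benchmark.py code: derive a fresh tag for each model instead of overwriting the shared variable.

def per_model_tag(model: str, partner_model: str, evaluator_model: str,
                  task: str, cli_tag: str = "") -> str:
    # Hypothetical helper (not present in benchmark.py): compute the tag inside
    # the per-model loop so each model's episodes get their own default tag.
    if cli_tag:
        return cli_tag
    return f"benchmark_{model}_{partner_model}_{evaluator_model}_{task}_trial0"

Calling something like this inside the loop over models, and using its return value rather than reassigning tag, keeps the default tags model-specific.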


print_logs: Annotated[bool, typer.Argument(help="Print logs.")] = False,
only_show_performance: Annotated[
bool, typer.Argument(help="Only show performance.")
] = False,

Bug: Parameter only_show_performance declared but never used

The only_show_performance parameter is declared in the benchmark function signature but the code that previously handled it was removed. When users pass only_show_performance=True expecting to only display existing benchmark results without running new benchmarks, the function will instead execute the full benchmark. This breaks the expected behavior of this flag.
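A hedged sketch of the missing guard; the two callables stand in for the real display-only and full-benchmark code paths, which are not reproduced here.

from typing import Callable


def run_or_display(only_show_performance: bool,
                   display_existing_results: Callable[[], None],
                   run_full_benchmark: Callable[[], None]) -> None:
    # Illustrative control flow only: when the flag is set, display existing
    # benchmark results and return before launching any new episodes.
    if only_show_performance:
        display_existing_results()
        return
    run_full_benchmark()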


# Fallback to JSON validation for backward compatibility
- if "properties" in json_result:
+ # Type narrowing: check that json_result is a dict before accessing "properties"
+ if isinstance(json_result, dict) and "properties" in json_result:

Bug: Transformed JSON discarded when context is not provided

In PydanticOutputParser.parse, the extract_value transformation processes json_result to unwrap nested "value" structures (line 51), and the result is stored in data (line 57). However, when context is None, the code at line 70 uses the original untransformed result string instead of data. This causes inconsistent behavior: with context the transformed data is validated, without context the original untransformed string is used, effectively discarding the extract_value transformation in that code path.
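The inconsistency reduces to a small, self-contained sketch; extract_value below is a simplified stand-in for the real unwrapping logic, and parse_sketch is not the actual parser method.

import json
from typing import Any, Optional


def extract_value(obj: Any) -> Any:
    # Simplified stand-in: unwrap nested {"value": ...} structures.
    while isinstance(obj, dict) and set(obj) == {"value"}:
        obj = obj["value"]
    return obj


def parse_sketch(result: str, context: Optional[str] = None) -> Any:
    json_result = json.loads(result)
    data = extract_value(json_result)
    if context is not None:
        return data               # with context, the transformed data is used
    return json.loads(result)     # without context, the raw result is re-parsed,
                                  # silently dropping the extract_value step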

Additional Locations (1)


codecov bot commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 82.27848% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.80%. Comparing base (8d9b532) to head (b8100ec).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
sotopia/cli/benchmark/benchmark.py 56.75% 32 Missing ⚠️
sotopia/agents/llm_agent.py 33.33% 2 Missing ⚠️
sotopia/agents/redis_agent.py 0.00% 2 Missing ⚠️
sotopia/generation_utils/generate.py 92.00% 2 Missing ⚠️
sotopia/generation_utils/output_parsers.py 87.50% 2 Missing ⚠️
sotopia/envs/parallel.py 50.00% 1 Missing ⚠️
sotopia/server.py 0.00% 1 Missing ⚠️
@@            Coverage Diff             @@
##             main     #312      +/-   ##
==========================================
- Coverage   74.92%   74.80%   -0.12%     
==========================================
  Files          72       72              
  Lines        4821     4827       +6     
==========================================
- Hits         3612     3611       -1     
- Misses       1209     1216       +7     
Files with missing lines Coverage Δ
sotopia/database/__init__.py 58.33% <100.00%> (+0.88%) ⬆️
sotopia/database/base_models.py 90.72% <100.00%> (+0.50%) ⬆️
sotopia/database/evaluation_dimensions.py 85.11% <100.00%> (+4.52%) ⬆️
sotopia/database/logs.py 82.75% <100.00%> (+0.20%) ⬆️
sotopia/database/persistent_profile.py 64.74% <100.00%> (+0.25%) ⬆️
sotopia/envs/evaluators.py 96.45% <100.00%> (ø)
sotopia/generation_utils/__init__.py 100.00% <ø> (ø)
sotopia/messages/message_classes.py 53.44% <100.00%> (+0.26%) ⬆️
tests/database/test_evaluation_dimensions.py 100.00% <100.00%> (ø)
tests/envs/test_evaluators.py 100.00% <ø> (ø)
... and 9 more

... and 1 file with indirect coverage changes


output_to_jsonl=output_to_jsonl,
agent_class=agent_class.__name__,
tag=tag,
)

Bug: save_dir parameter not passed to benchmark_display function

The save_dir parameter is accepted in _benchmark_impl at line 534 but is not passed to benchmark_display when it is called at lines 651-659. Since benchmark_display's save_dir parameter defaults to ".", JSONL output files are always written to the current directory regardless of the --save-dir CLI option the user specifies.
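A minimal sketch of the intended behaviour once save_dir is honoured; the body and output file name are illustrative, not the real benchmark_display implementation.

import json
from pathlib import Path


def benchmark_display_sketch(results: dict[str, float], save_dir: str = ".") -> Path:
    # Illustrative only: the caller (_benchmark_impl) should forward its
    # save_dir argument here so --save-dir controls where the JSONL lands.
    out_path = Path(save_dir) / "benchmark_results.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        for agent_class, reward in results.items():
            f.write(json.dumps({"agent_class": agent_class, "reward": reward}) + "\n")
    return out_path

The corresponding fix in _benchmark_impl would presumably just forward its save_dir argument in that call.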



openhands-ai bot commented Dec 19, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pytest in docker

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #312 at branch `feature/benchmark_with_customed_agents`

Feel free to include any additional details that might help me get this PR into a better state.


)
raise e
- parsed_action = AgentAction(action_type="none", argument="")
+ parsed_action = AgentAction(action_type="none", argument="", to=[])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Unreachable code after raise statement

The code on lines 343-344 (parsed_action = AgentAction(...) and name = agent_names[...]) appears after raise e on line 342, making it unreachable. This appears to be a refactoring error when adding the to=[] parameter. If the intent was error recovery, the raise e should be removed; otherwise these dead lines should be deleted.
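If error recovery was in fact intended, the recovery path could look like the hypothetical helper below, where parse and fallback_action stand in for the real output parser and the AgentAction(action_type="none", argument="", to=[]) default.

from typing import Any, Callable


def parse_action_or_fallback(parse: Callable[[str], Any], raw_output: str,
                             fallback_action: Any) -> Any:
    # Hypothetical recovery path: instead of re-raising and leaving dead code
    # behind, swallow the parse failure and return the caller's no-op action.
    try:
        return parse(raw_output)
    except Exception:
        return fallback_action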


@XuhuiZhou XuhuiZhou merged commit abd4485 into main Dec 19, 2025
10 checks passed
