
Conversation

@XuhuiZhou XuhuiZhou commented Apr 16, 2025

Closes #

📑 Description

✅ Checks

  • My pull request adheres to the code style of this project
  • My code requires changes to the documentation
  • I have updated the documentation as required
  • All the tests have passed
  • Branch name follows type/description (e.g. feature/add-llm-agents)
  • Ready for code review

ℹ Additional Information


Note

Adds custom-agent benchmarking via _benchmark_impl, introduces strict LLM base models and structured eval schemas, implements private message recipients, and streamlines generation/evaluator logic with updated docs/tests.

  • Benchmarking:
    • Introduce _benchmark_impl(models, agent_class, ...) to run benchmarks with custom agent classes; CLI benchmark wraps it with LLMAgent.
    • Support task filters (hard, cooperative, competitive), per-run tag, and batch reruns; write env model to ParallelSotopiaEnv.
    • Results now track agent_classes; display splits test/partner tables and JSONL uses new keys; average reward filters by agent_class.
  • Core Models & Evaluation:
    • Add LLMBaseModel (extra=forbid) and LLMEvalBaseModel; migrate message/env/eval models to these bases.
    • Replace tuple scores with structured {reasoning, score} types across SotopiaDimensions(Plus) and the custom-dimension builder; update validators and tests (a minimal illustrative sketch follows this note).
  • Generation/Evaluation:
    • Simplify temperature handling; remove custom temperature wrappers.
    • format_bad_output derives schema from parser; response schema strictness disabled for LLMEvalBaseModel types.
    • Evaluators use LLMEvalBaseModel; structured output auto-enabled when supported.
  • Environment & Agents:
    • Add private message recipients (AgentAction.to: list[str]); env renders per-viewer visibility; default masked/none actions include to=[] (also covered in the sketch after this note).
    • LLMAgent requests structured action output; Human/Redis agents return actions with to=[].
  • Database/Logs:
    • EpisodeLog gains agent_classes; include in episode existence checks and displays.
  • Docs/Examples:
    • Docs: guide to run benchmarks with custom agents (using _benchmark_impl).
    • Examples updated (model names, temps); .gitignore adds *.rdb.
  • Config/Tests:
    • Move dev deps to [dependency-groups]; pytest asyncio scope config.
    • Comprehensive test updates for new schemas, private messaging, and benchmarking flow.

Written by Cursor Bugbot for commit b8100ec.
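To make the schema changes above concrete, here is a minimal pydantic sketch of the strict base model, a structured score type, and the private-recipient field. Only LLMBaseModel, AgentAction, action_type, argument, and to are named in this PR; ScoreWithReasoning and the field descriptions are illustrative stand-ins, not the actual class definitions.

# Hypothetical, minimal sketch of the schema shapes described in the summary.
from pydantic import BaseModel, ConfigDict, Field


class LLMBaseModel(BaseModel):
    # extra="forbid" rejects unexpected keys coming back from the LLM
    model_config = ConfigDict(extra="forbid")


class ScoreWithReasoning(LLMBaseModel):  # illustrative name
    reasoning: str = Field(description="free-text justification for the score")
    score: int = Field(description="numeric score for this dimension")


class AgentAction(LLMBaseModel):
    action_type: str
    argument: str
    # Private message recipients; an empty list (the default for masked/none
    # actions) means the action is visible to every agent in the episode.
    to: list[str] = Field(default_factory=list)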

@cursor cursor bot left a comment


# subject to future OpenAI changes
- DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-4o-mini"
+ DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-5-mini-2025-08-07"

Bug: Invalid model name will cause API failures

The DEFAULT_BAD_OUTPUT_PROCESS_MODEL is set to "gpt-5-mini-2025-08-07" which is not a valid OpenAI model name. OpenAI does not have a GPT-5 model, and the naming convention doesn't match any known model. The previous value "gpt-4o-mini" was a valid model. This will cause all calls to format_bad_output to fail when they fall back to this default model for reformatting malformed LLM outputs.


f"benchmark_{model}_{partner_model}_{evaluator_model}_{task}_trial0"
if tag == ""
else tag
)

Bug: Tag variable reused incorrectly across model loop iterations

In the benchmark function, the tag variable is only set when it's empty (tag == ""). When looping over multiple models, the tag gets set on the first iteration and is never updated for subsequent models. This causes all models after the first to be tagged with the first model's tag, corrupting benchmark data by mixing episodes from different models under the same tag and preventing correct retrieval of model-specific results.
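One possible shape of a fix, expressed as a hypothetical helper rather than the actual benchmark.py code: derive a fresh tag for each model instead of overwriting the shared variable.

def per_model_tag(model: str, partner_model: str, evaluator_model: str,
                  task: str, cli_tag: str = "") -> str:
    # Hypothetical helper (not present in benchmark.py): compute the tag inside
    # the per-model loop so each model's episodes get their own default tag.
    if cli_tag:
        return cli_tag
    return f"benchmark_{model}_{partner_model}_{evaluator_model}_{task}_trial0"

Calling something like this inside the loop over models, and using its return value rather than reassigning tag, keeps the default tags model-specific.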


print_logs: Annotated[bool, typer.Argument(help="Print logs.")] = False,
only_show_performance: Annotated[
bool, typer.Argument(help="Only show performance.")
] = False,

Bug: Parameter only_show_performance declared but never used

The only_show_performance parameter is declared in the benchmark function signature but the code that previously handled it was removed. When users pass only_show_performance=True expecting to only display existing benchmark results without running new benchmarks, the function will instead execute the full benchmark. This breaks the expected behavior of this flag.
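A hedged sketch of the missing guard; the two callables stand in for the real display-only and full-benchmark code paths, which are not reproduced here.

from typing import Callable


def run_or_display(only_show_performance: bool,
                   display_existing_results: Callable[[], None],
                   run_full_benchmark: Callable[[], None]) -> None:
    # Illustrative control flow only: when the flag is set, display existing
    # benchmark results and return before launching any new episodes.
    if only_show_performance:
        display_existing_results()
        return
    run_full_benchmark()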


# Fallback to JSON validation for backward compatibility
- if "properties" in json_result:
+ # Type narrowing: check that json_result is a dict before accessing "properties"
+ if isinstance(json_result, dict) and "properties" in json_result:

Bug: Transformed JSON discarded when context is not provided

In PydanticOutputParser.parse, the extract_value transformation processes json_result to unwrap nested "value" structures (line 51), and the result is stored in data (line 57). However, when context is None, the code at line 70 uses the original untransformed result string instead of data. This causes inconsistent behavior: with context the transformed data is validated, without context the original untransformed string is used, effectively discarding the extract_value transformation in that code path.
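The inconsistency reduces to a small, self-contained sketch; extract_value below is a simplified stand-in for the real unwrapping logic, and parse_sketch is not the actual parser method.

import json
from typing import Any, Optional


def extract_value(obj: Any) -> Any:
    # Simplified stand-in: unwrap nested {"value": ...} structures.
    while isinstance(obj, dict) and set(obj) == {"value"}:
        obj = obj["value"]
    return obj


def parse_sketch(result: str, context: Optional[str] = None) -> Any:
    json_result = json.loads(result)
    data = extract_value(json_result)
    if context is not None:
        return data               # with context, the transformed data is used
    return json.loads(result)     # without context, the raw result is re-parsed,
                                  # silently dropping the extract_value step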

Additional Locations (1)


codecov bot commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 82.27848% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.80%. Comparing base (8d9b532) to head (b8100ec).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
sotopia/cli/benchmark/benchmark.py 56.75% 32 Missing ⚠️
sotopia/agents/llm_agent.py 33.33% 2 Missing ⚠️
sotopia/agents/redis_agent.py 0.00% 2 Missing ⚠️
sotopia/generation_utils/generate.py 92.00% 2 Missing ⚠️
sotopia/generation_utils/output_parsers.py 87.50% 2 Missing ⚠️
sotopia/envs/parallel.py 50.00% 1 Missing ⚠️
sotopia/server.py 0.00% 1 Missing ⚠️
@@            Coverage Diff             @@
##             main     #312      +/-   ##
==========================================
- Coverage   74.92%   74.80%   -0.12%     
==========================================
  Files          72       72              
  Lines        4821     4827       +6     
==========================================
- Hits         3612     3611       -1     
- Misses       1209     1216       +7     
Files with missing lines Coverage Δ
sotopia/database/__init__.py 58.33% <100.00%> (+0.88%) ⬆️
sotopia/database/base_models.py 90.72% <100.00%> (+0.50%) ⬆️
sotopia/database/evaluation_dimensions.py 85.11% <100.00%> (+4.52%) ⬆️
sotopia/database/logs.py 82.75% <100.00%> (+0.20%) ⬆️
sotopia/database/persistent_profile.py 64.74% <100.00%> (+0.25%) ⬆️
sotopia/envs/evaluators.py 96.45% <100.00%> (ø)
sotopia/generation_utils/__init__.py 100.00% <ø> (ø)
sotopia/messages/message_classes.py 53.44% <100.00%> (+0.26%) ⬆️
tests/database/test_evaluation_dimensions.py 100.00% <100.00%> (ø)
tests/envs/test_evaluators.py 100.00% <ø> (ø)
... and 9 more

... and 1 file with indirect coverage changes


output_to_jsonl=output_to_jsonl,
agent_class=agent_class.__name__,
tag=tag,
)

Bug: save_dir parameter not passed to benchmark_display function

The save_dir parameter is accepted in _benchmark_impl at line 534 but is not passed to benchmark_display when it is called at lines 651-659. Since benchmark_display's save_dir parameter defaults to ".", JSONL output files are always written to the current directory regardless of the --save-dir CLI option the user specifies.
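A minimal sketch of the intended behaviour once save_dir is honoured; the body and output file name are illustrative, not the real benchmark_display implementation.

import json
from pathlib import Path


def benchmark_display_sketch(results: dict[str, float], save_dir: str = ".") -> Path:
    # Illustrative only: the caller (_benchmark_impl) should forward its
    # save_dir argument here so --save-dir controls where the JSONL lands.
    out_path = Path(save_dir) / "benchmark_results.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        for agent_class, reward in results.items():
            f.write(json.dumps({"agent_class": agent_class, "reward": reward}) + "\n")
    return out_path

The corresponding fix in _benchmark_impl would presumably just forward its save_dir argument in that call.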



openhands-ai bot commented Dec 19, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pytest in docker

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #312 at branch `feature/benchmark_with_customed_agents`

Feel free to include any additional details that might help me get this PR into a better state.


)
raise e
- parsed_action = AgentAction(action_type="none", argument="")
+ parsed_action = AgentAction(action_type="none", argument="", to=[])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Unreachable code after raise statement

The code on lines 343-344 (parsed_action = AgentAction(...) and name = agent_names[...]) appears after raise e on line 342, making it unreachable. This appears to be a refactoring error when adding the to=[] parameter. If the intent was error recovery, the raise e should be removed; otherwise these dead lines should be deleted.
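If error recovery was in fact intended, the recovery path could look like the hypothetical helper below, where parse and fallback_action stand in for the real output parser and the AgentAction(action_type="none", argument="", to=[]) default.

from typing import Any, Callable


def parse_action_or_fallback(parse: Callable[[str], Any], raw_output: str,
                             fallback_action: Any) -> Any:
    # Hypothetical recovery path: instead of re-raising and leaving dead code
    # behind, swallow the parse failure and return the caller's no-op action.
    try:
        return parse(raw_output)
    except Exception:
        return fallback_action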


@XuhuiZhou XuhuiZhou merged commit abd4485 into main Dec 19, 2025
10 checks passed
