custom agent #312
Conversation
```diff
  # subject to future OpenAI changes
- DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-4o-mini"
+ DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-5-mini-2025-08-07"
```
Bug: Invalid model name will cause API failures

`DEFAULT_BAD_OUTPUT_PROCESS_MODEL` is set to `"gpt-5-mini-2025-08-07"`, which is not a valid OpenAI model name: OpenAI does not have a GPT-5 model, and the naming convention does not match any known model. The previous value, `"gpt-4o-mini"`, was valid. Every call to `format_bad_output` that falls back to this default model for reformatting malformed LLM outputs will therefore fail.
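If it helps, a minimal guard along these lines would catch this kind of regression at import time instead of on every `format_bad_output` call; the allow-list below is purely an assumption for illustration, not something that exists in the repository:

```python
# Hypothetical guard (illustrative only): fail fast if the fallback model
# name is not in an assumed allow-list of known-good OpenAI model names.
KNOWN_FALLBACK_MODELS = {"gpt-4o-mini", "gpt-4o"}  # assumed allow-list

DEFAULT_BAD_OUTPUT_PROCESS_MODEL = "gpt-4o-mini"

if DEFAULT_BAD_OUTPUT_PROCESS_MODEL not in KNOWN_FALLBACK_MODELS:
    raise ValueError(
        f"Unknown fallback model: {DEFAULT_BAD_OUTPUT_PROCESS_MODEL!r}"
    )
```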
| f"benchmark_{model}_{partner_model}_{evaluator_model}_{task}_trial0" | ||
| if tag == "" | ||
| else tag | ||
| ) |
Bug: Tag variable reused incorrectly across model loop iterations

In the `benchmark` function, the `tag` variable is only set when it is empty (`tag == ""`). When looping over multiple models, `tag` is set on the first iteration and never updated for subsequent models, so every model after the first is tagged with the first model's tag. This corrupts benchmark data by mixing episodes from different models under the same tag and prevents correct retrieval of model-specific results.
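A minimal sketch of a per-iteration fix is below; the variable names mirror the report, and the concrete values are placeholders rather than the actual CLI code:

```python
# Illustrative sketch: derive a fresh tag for each model instead of mutating
# the shared `tag`, so later models are not stamped with the first model's tag.
models = ["model-a", "model-b"]          # placeholder model list
partner_model = "partner-model"          # placeholder
evaluator_model = "evaluator-model"      # placeholder
task = "hard"
tag = ""                                 # empty means "auto-generate"

for model in models:
    run_tag = (
        f"benchmark_{model}_{partner_model}_{evaluator_model}_{task}_trial0"
        if tag == ""
        else tag
    )
    print(run_tag)                       # each model gets its own tag
```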
sotopia/cli/benchmark/benchmark.py
Outdated
```python
    print_logs: Annotated[bool, typer.Argument(help="Print logs.")] = False,
    only_show_performance: Annotated[
        bool, typer.Argument(help="Only show performance.")
    ] = False,
```
Bug: Parameter only_show_performance declared but never used

The `only_show_performance` parameter is still declared in the `benchmark` function signature, but the code that previously handled it was removed. When users pass `only_show_performance=True`, expecting to only display existing benchmark results without running new benchmarks, the function runs the full benchmark anyway, breaking the flag's expected behavior.
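A minimal sketch of the intended early-return behavior follows; `benchmark_display` and its argument are stand-ins inferred from the review, not the repo's verified signatures:

```python
# Illustrative stub: only render stored results when the flag is set.
def benchmark_display(tag: str) -> None:
    print(f"(displaying existing results for tag {tag!r})")

def benchmark(tag: str = "", only_show_performance: bool = False) -> None:
    if only_show_performance:
        # Show results that already exist and skip running new episodes.
        benchmark_display(tag)
        return
    print("(running the full benchmark)")

benchmark(tag="demo", only_show_performance=True)
```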
```diff
  # Fallback to JSON validation for backward compatibility
- if "properties" in json_result:
+ # Type narrowing: check that json_result is a dict before accessing "properties"
+ if isinstance(json_result, dict) and "properties" in json_result:
```
Bug: Transformed JSON discarded when context is not provided

In `PydanticOutputParser.parse`, the `extract_value` transformation processes `json_result` to unwrap nested "value" structures (line 51) and stores the result in `data` (line 57). However, when `context` is None, the code at line 70 uses the original, untransformed `result` string instead of `data`. The behavior is therefore inconsistent: with a context the transformed data is validated, without one the untransformed string is used, effectively discarding the `extract_value` transformation on that code path.
Additional Locations (1)
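A simplified sketch of the fix is below: both branches consume the transformed `data`. The `extract_value` helper here only mimics the unwrapping behavior described above and is not the library's actual implementation:

```python
import json
from typing import Any, Optional

def extract_value(obj: Any) -> Any:
    # Recursively unwrap {"value": ...} wrappers (assumed behavior).
    if isinstance(obj, dict) and set(obj) == {"value"}:
        return extract_value(obj["value"])
    if isinstance(obj, dict):
        return {k: extract_value(v) for k, v in obj.items()}
    return obj

def parse(result: str, context: Optional[dict[str, Any]] = None) -> Any:
    json_result = json.loads(result)
    data = extract_value(json_result)
    if context is not None:
        data = {**context, **data}  # placeholder for context-aware validation
    # Return `data` (not the raw `result`) so both paths stay consistent.
    return data

print(parse('{"action_type": {"value": "speak"}}'))  # {'action_type': 'speak'}
```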
Codecov Report

Patch coverage is

```
@@            Coverage Diff             @@
##             main     #312      +/-   ##
==========================================
- Coverage   74.92%   74.80%   -0.12%
==========================================
  Files          72       72
  Lines        4821     4827       +6
==========================================
- Hits         3612     3611       -1
- Misses       1209     1216       +7
```

... and 1 file with indirect coverage changes
```python
        output_to_jsonl=output_to_jsonl,
        agent_class=agent_class.__name__,
        tag=tag,
    )
```
Bug: save_dir parameter not passed to benchmark_display function

The `save_dir` parameter is accepted by `_benchmark_impl` at line 534 but is not passed to `benchmark_display` when it is called at lines 651-659. Because `benchmark_display` defaults `save_dir` to ".", JSONL output files are always written to the current directory, regardless of the `--save-dir` CLI option specified by the user.
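A sketch of threading `save_dir` through to the display step follows; `benchmark_display` is stubbed here with an assumed signature reconstructed from the comment, not copied from the repo:

```python
from pathlib import Path

# Stub with an assumed signature: save_dir defaults to the current directory.
def benchmark_display(
    output_to_jsonl: bool, agent_class: str, tag: str, save_dir: str = "."
) -> None:
    if output_to_jsonl:
        out = Path(save_dir) / f"{tag}.jsonl"
        print(f"(would write results to {out})")

# Forward the CLI's --save-dir value instead of silently using the default ".".
benchmark_display(
    output_to_jsonl=True,
    agent_class="LLMAgent",
    tag="benchmark_demo_trial0",
    save_dir="./results",
)
```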
Looks like there are a few issues preventing this PR from being merged! If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
```diff
  )
  raise e
- parsed_action = AgentAction(action_type="none", argument="")
+ parsed_action = AgentAction(action_type="none", argument="", to=[])
```
Bug: Unreachable code after raise statement

The code on lines 343-344 (`parsed_action = AgentAction(...)` and `name = agent_names[...]`) appears after `raise e` on line 342, making it unreachable. This looks like a refactoring error introduced while adding the `to=[]` parameter. If the intent was error recovery, the `raise e` should be removed; otherwise these dead lines should be deleted.
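If error recovery is the intent, the shape below is one option; `AgentAction` and the failing parser are simplified stand-ins, not the repo's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class AgentAction:              # simplified stand-in for sotopia's AgentAction
    action_type: str
    argument: str
    to: list[str] = field(default_factory=list)

def parse_action(raw: str) -> AgentAction:
    raise ValueError("malformed output")   # simulate a parse failure

try:
    parsed_action = parse_action("not json")
except ValueError:
    # Recover with a no-op action; note there is no re-raise here, so this
    # assignment is reachable (unlike the lines flagged above).
    parsed_action = AgentAction(action_type="none", argument="", to=[])

print(parsed_action)
```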
Closes #

Description

Checks

type/descript (e.g. feature/add-llm-agents)

Additional Information
Note

Adds custom-agent benchmarking via `_benchmark_impl`, introduces strict LLM base models and structured eval schemas, implements private message recipients, and streamlines generation/evaluator logic with updated docs/tests.

- `_benchmark_impl(models, agent_class, ...)` to run benchmarks with custom agent classes; the CLI `benchmark` command wraps it with `LLMAgent`.
- Tasks (`hard`, `cooperative`, `competitive`), per-run `tag`, and batch reruns; write env model to `ParallelSotopiaEnv.agent_classes`; display splits test/partner tables and JSONL uses new keys; average reward filters by `agent_class`.
- `LLMBaseModel` (extra=forbid) and `LLMEvalBaseModel`; migrate message/env/eval models to these bases.
- `{reasoning, score}` types across `SotopiaDimensions(Plus)` and the custom-dimension builder; update validators and tests.
- `format_bad_output` derives its schema from the parser; response schema strictness disabled for `LLMEvalBaseModel` types.
- `LLMEvalBaseModel`; structured output auto-enabled when supported.
- Private message recipients (`AgentAction.to: list[str]`); env renders per-viewer visibility; default masked/none actions include `to=[]`.
- `LLMAgent` requests structured action output; Human/Redis agents return actions with `to=[]`.
- `EpisodeLog` gains `agent_classes`; include in episode existence checks and displays.
- `_benchmark_impl`).
- `.gitignore` adds `*.rdb`.
- `[dependency-groups]`; pytest asyncio scope config.

Written by Cursor Bugbot for commit b8100ec. This will update automatically on new commits.
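For readers unfamiliar with the recipient mechanism the note describes, here is a self-contained sketch of per-viewer visibility driven by a `to` list; every name in it is illustrative rather than taken from the repo:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    sender: str
    argument: str
    to: list[str] = field(default_factory=list)  # empty list -> visible to all

def visible_to(action: Action, viewer: str) -> bool:
    # Senders always see their own messages; others see them only if the
    # message is public (empty `to`) or they are a listed recipient.
    return not action.to or viewer == action.sender or viewer in action.to

public = Action(sender="alice", argument="hello everyone")
private = Action(sender="alice", argument="just for bob", to=["bob"])

for viewer in ["alice", "bob", "carol"]:
    print(viewer, visible_to(public, viewer), visible_to(private, viewer))
```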