Conversation
There was a problem hiding this comment.
Pull request overview
Refactors the judge CLI entrypoint into the judge package so it can be imported normally (improving testability and removing path-based import hacks).
Changes:
- Added
judge/cli.pycontainingget_parser()and asyncmain(args). - Converted root
judge.pyinto a thin entrypoint delegating tojudge.cli. - Updated pipeline + tests to import/patch
judge.clidirectly instead of loadingjudge.pyviaimportlib+ file paths.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
judge/cli.py |
New module that houses the judge CLI parser + async main logic. |
judge.py |
Now delegates to judge.cli and runs the async entrypoint. |
run_pipeline.py |
Imports judge.cli.main directly instead of loading judge.py by path. |
tests/unit/judge/test_judge_cli.py |
Uses normal imports from judge.cli and patches module objects directly. |
tests/integration/test_pipeline.py |
Patches judge.cli.main directly and removes importlib-based mocking. |
Comments suppressed due to low confidence (1)
run_pipeline.py:195
- The comment says imports are deferred "to allow --debug flag to be set", but
judge.cliis imported beforeset_debug(True)runs. Either move thegenerate_main/judge_mainimports to after the debug flag handling, or update the comment so it doesn’t claim behavior that isn’t true.
# Import generate and judge main functions
# We import here to avoid circular dependencies and to allow --debug flag to be set
from generate import main as generate_main
from judge.cli import main as judge_main
# Set debug mode if flag is provided
if args.debug:
from utils.debug import set_debug
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "Default: temperature=0 (unless overridden)" | ||
| ), | ||
| type=parse_key_value_list, | ||
| default={}, |
There was a problem hiding this comment.
--judge-model-extra-params uses a mutable dict (default={}) as the argparse default. Because LLMJudge mutates this dict (e.g., sets temperature), subsequent parse_args() calls in the same process can inherit mutated defaults. Use default=None (and normalize to {} in main) or ensure a fresh dict per parse (e.g., copy).
| default={}, | |
| default=None, |
There was a problem hiding this comment.
that might be a good suggestion if we are doing testing/multiple calls
at the same time, we have to make sure that setting multiple args in the same call works properly
|
|
||
| # Load rubric configuration once at startup | ||
| print("📚 Loading rubric configuration...") | ||
| rubric_config = await RubricConfig.load(rubric_folder="data") |
There was a problem hiding this comment.
The CLI exposes --rubrics (and run_pipeline.py passes rubrics=args.rubrics), but main() always loads RubricConfig from the hard-coded rubric_folder="data" and never uses args.rubrics. This makes the flag ineffective and can lead to confusing behavior. Either wire args.rubrics into RubricConfig.load (e.g., derive folder/filename from the provided path) or remove/rename the flag if it’s not supported.
| rubric_config = await RubricConfig.load(rubric_folder="data") | |
| rubric_folder = getattr(args, "rubrics", None) or "data" | |
| rubric_config = await RubricConfig.load(rubric_folder=rubric_folder) |
There was a problem hiding this comment.
that's a good idea too, and clearly something I didn't properly refactor
sator-labs
left a comment
There was a problem hiding this comment.
Minor the couple comments from the robot, LGTM
Another thing to think about in the future is if we want to centralize all the parser args in the pipeline script and then delegate the appropriate ones in each submodule (otherwise we are repeating them) OR have the pipeline load the judge ones and the generate ones and merge them
Description
Move the judge CLI (get_parser, main) from the root script into the package so tests can import it normally.
Resolves the concern that tests were hacking script location; tests now rely on normal imports.