Add Evaluation script #13
Yuki-Imajuku
commented
Sep 21, 2025
- Add evaluation scripts
- Update version setting in pyproject.toml
@codex Please review this PR carefully
Pull Request Overview
This PR adds a comprehensive evaluation framework for Large Language Models (LLMs) on ALE-Bench, including supporting scripts and configuration files.
- Introduces the ale_bench_eval package with modules for repeated sampling, self-refinement, and automated evaluation
- Updates version configuration to use dynamic versioning from uv
- Fixes Docker container management issues in input generation and session closing
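The overview names repeated sampling and self-refinement as evaluation modes. As a rough illustration of what those two terms usually mean, here is a minimal sketch. It is not the code from ale_bench_eval: the function names `repeated_sampling` and `self_refine`, their parameters, and the toy scoring functions are all hypothetical, and a real run would call an LLM and the benchmark's scorer instead of the stand-ins used here.

```python
from typing import Callable, Iterator, Tuple


def repeated_sampling(
    generate: Callable[[], int],
    score: Callable[[int], float],
    n: int,
) -> Tuple[float, int]:
    """Draw n independent candidates and return (best_score, best_candidate)."""
    candidates = [generate() for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    # Keep the candidate with the highest score (higher is assumed better).
    return max(scored, key=lambda pair: pair[0])


def self_refine(
    initial: int,
    refine: Callable[[int, float], int],
    score: Callable[[int], float],
    steps: int,
) -> Tuple[float, int]:
    """Iteratively revise one candidate using its own score as feedback."""
    current = initial
    best, best_score = initial, score(initial)
    for _ in range(steps):
        # Hypothetical refine step: receives the candidate and its feedback.
        current = refine(current, score(current))
        current_score = score(current)
        if current_score > best_score:
            best, best_score = current, current_score
    return best_score, best


# Toy stand-ins (assumptions): candidates are integers, the target is 6.
_samples: Iterator[int] = iter([3, 7, 5])
toy_score = lambda c: -abs(c - 6)

sampled = repeated_sampling(lambda: next(_samples), toy_score, n=3)
refined = self_refine(1, lambda c, feedback: c + 2, toy_score, steps=3)
```

The difference between the two is the dependency structure: repeated sampling draws candidates independently and picks the best afterwards, while self-refinement feeds each candidate's result back into the next attempt.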
Reviewed Changes
Copilot reviewed 63 out of 64 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/ale_bench_eval/ | Complete evaluation framework with 12 new modules for LLM benchmarking |
| scripts/run_eval.sh | Bash script for automated evaluation execution |
| llm_configs/ | 26 JSON configuration files for various LLM models |
| docs/evaluation.md | Comprehensive documentation for the evaluation tool |
| pyproject.toml | Version configuration updates and new eval dependencies |
| src/ale_bench/ | Docker container management fixes |
| results/.gitignore | Results directory setup |
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
@codex Review again.
Pull Request Overview
Copilot reviewed 64 out of 65 changed files in this pull request and generated 10 comments.
Codex Review: Here are some suggestions.