Conversation

@Yuki-Imajuku
Collaborator

  • Add evaluation scripts
  • Update version setting in pyproject.toml

@Yuki-Imajuku
Collaborator Author

@codex Please review this PR carefully

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a comprehensive evaluation framework for Large Language Models (LLMs) on ALE-Bench, including supporting scripts and configuration files.

  • Introduces the ale_bench_eval package with modules for repeated sampling, self-refinement, and automated evaluation
  • Updates version configuration to use dynamic versioning from uv
  • Fixes Docker container management issues in input generation and session closing

Reviewed Changes

Copilot reviewed 63 out of 64 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/ale_bench_eval/ | Complete evaluation framework with 12 new modules for LLM benchmarking |
| scripts/run_eval.sh | Bash script for automated evaluation execution |
| llm_configs/ | 26 JSON configuration files for various LLM models |
| docs/evaluation.md | Comprehensive documentation for the evaluation tool |
| pyproject.toml | Version configuration updates and new eval dependencies |
| src/ale_bench/ | Docker container management fixes |
| results/.gitignore | Results directory setup |

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.

About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".

@Yuki-Imajuku
Copy link
Collaborator Author

@codex Review again.

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 64 out of 65 changed files in this pull request and generated 10 comments.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

@Yuki-Imajuku merged commit 343c94b into main on Sep 22, 2025
5 checks passed
@Yuki-Imajuku deleted the feat/eval branch on September 22, 2025 07:52

2 participants