Add Evaluation script #13

Yuki-Imajuku · 2025-09-21T13:25:09Z

Add evaluation scripts
Update version setting in pyproject.toml

Yuki-Imajuku · 2025-09-21T13:25:30Z

@codex Please review this PR carefully

Copilot

Pull Request Overview

This PR adds a comprehensive evaluation framework for Large Language Models (LLMs) on ALE-Bench, including supporting scripts and configuration files.

Introduces the ale_bench_eval package with modules for repeated sampling, self-refinement, and automated evaluation
Updates version configuration to use dynamic versioning from uv
Fixes Docker container management issues in input generation and session closing
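
As a rough illustration of the repeated-sampling idea named in the overview, the sketch below draws several candidates and keeps the best-scoring one. It is not taken from the ale_bench_eval package: `best_of_n`, `generate`, and `score` are hypothetical names, and the toy scorer stands in for whatever evaluation the real framework runs.

```python
from typing import Callable, Optional, Tuple


def best_of_n(
    generate: Callable[[int], str],   # hypothetical: returns the i-th sampled candidate
    score: Callable[[str], float],    # hypothetical: higher is better
    n: int,
) -> Tuple[str, float]:
    """Draw n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best_candidate: Optional[str] = None
    best_score = float("-inf")
    for attempt in range(n):
        candidate = generate(attempt)
        candidate_score = score(candidate)
        # Keep the first candidate seen on ties.
        if best_candidate is None or candidate_score > best_score:
            best_candidate, best_score = candidate, candidate_score
    return best_candidate, best_score


# Toy usage: three fixed "samples", scored by length.
samples = ["a", "abc", "ab"]
best, best_value = best_of_n(lambda i: samples[i], len, n=3)
```

In a real run, `generate` would call an LLM and `score` would come from the benchmark's own judging; self-refinement differs in that each new candidate is conditioned on feedback from the previous one rather than sampled independently.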

Reviewed Changes

Copilot reviewed 63 out of 64 changed files in this pull request and generated 8 comments.

File	Description
src/ale_bench_eval/	Complete evaluation framework with 12 new modules for LLM benchmarking
scripts/run_eval.sh	Bash script for automated evaluation execution
llm_configs/	26 JSON configuration files for various LLM models
docs/evaluation.md	Comprehensive documentation for the evaluation tool
pyproject.toml	Version configuration updates and new eval dependencies
src/ale_bench/	Docker container management fixes
results/.gitignore	Results directory setup

src/ale_bench_eval/selection.py

src/ale_bench_eval/scaffolds.py

src/ale_bench_eval/safe_generation.py

chatgpt-codex-connector

Codex Review: Here are some suggestions.

src/ale_bench_eval/codes/ce_python.py

Yuki-Imajuku · 2025-09-22T06:05:06Z

@codex Review again.

Copilot

Pull Request Overview

Copilot reviewed 64 out of 65 changed files in this pull request and generated 10 comments.

src/ale_bench_eval/selection.py

src/ale_bench_eval/scaffolds.py

src/ale_bench_eval/safe_generation.py

src/ale_bench_eval/prompts/builder.py

src/ale_bench_eval/logger.py

src/ale_bench_eval/__main__.py

src/ale_bench/__init__.py

chatgpt-codex-connector

Codex Review: Here are some suggestions.

src/ale_bench_eval/analyze_results.py

add ale_bench_eval

5a8dac8

Yuki-Imajuku requested a review from Copilot September 21, 2025 13:25

Copilot AI reviewed Sep 21, 2025

chatgpt-codex-connector bot reviewed Sep 21, 2025

src/ale_bench_eval/codes/ce_python.py (resolved)

Yuki-Imajuku added 7 commits September 22, 2025 11:07

update actions config and apply comments

cc61b89

update github actions config

0337077

update actions config

f6079ba

update actions

e4474d2

update actions config

0c54711

update actions config

91f7847

update datetime utc

fab63e9

Yuki-Imajuku requested a review from Copilot September 22, 2025 06:04

Copilot AI reviewed Sep 22, 2025

chatgpt-codex-connector bot reviewed Sep 22, 2025

src/ale_bench_eval/analyze_results.py (resolved)

Yuki-Imajuku merged commit 343c94b into main Sep 22, 2025
5 checks passed

Yuki-Imajuku deleted the feat/eval branch September 22, 2025 07:52

2 participants