An “agent exam” built from a real R codebase benchmark, designed to evaluate autonomous CLI coding agents on repo navigation, safe changes, and Shiny feature work.
- Blog context: https://posit.co/blog/r-llm-evaluation-03/
- Upstream repository: https://github.com/skaltman/model-eval
Repository layout:
- `template_repo/`: a frozen baseline snapshot of the upstream repo used for grading
- `BASELINE_COMMIT.txt`: the upstream commit hash of the snapshot
- `AGENTS.md`: participant instructions (tasks, constraints, deliverables)
- `scripts/`: grading/check utilities used by the reviewer
- `attempts/`: collected submissions (`REPORT.md` + `SUBMISSION.patch`) for each participant
- `docs/`: optional background and rubric (not required for participants)
Grading workflow:
- Apply `attempts/<name>/SUBMISSION.patch` onto `template_repo/`
- Run automated checks via `scripts/run_checks.sh` (T1/T2/T3); see the sketch after this list
- Manually spot-check the most important constraint: T2 safe modes must not trigger any API calls and should not require API keys
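A minimal grading sketch in R, assuming `git apply` is an acceptable way to apply a submission and that `scripts/run_checks.sh` can be run from inside the patched tree; neither detail is prescribed above, and `grade_attempt()` is a hypothetical helper, not part of `scripts/`:

```r
# Hypothetical helper: apply one submission onto a throwaway copy of the
# baseline, then run the automated T1/T2/T3 checks from the patched tree.
grade_attempt <- function(attempt) {
  root   <- normalizePath(".")
  patch  <- file.path(root, "attempts", attempt, "SUBMISSION.patch")
  checks <- file.path(root, "scripts", "run_checks.sh")

  # Copy first so template_repo/ itself stays a frozen baseline.
  workdir <- tempfile("grade-")
  dir.create(workdir)
  file.copy(file.path(root, "template_repo"), workdir, recursive = TRUE)
  tree <- file.path(workdir, "template_repo")

  # --check fails loudly on a malformed patch before any file is touched.
  if (system2("git", c("-C", tree, "apply", "--check", patch)) != 0)
    stop("patch does not apply cleanly: ", patch)
  system2("git", c("-C", tree, "apply", "--whitespace=warn", patch))

  # Run the exam checks from inside the patched tree.
  old <- setwd(tree); on.exit(setwd(old), add = TRUE)
  system2("bash", checks)
}

# Usage: grade_attempt("claude-opus-20260131")
```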
Note: safe modes are ideally implemented so they also avoid importing API-related packages (e.g., `ellmer`, `vitals`) before exiting, keeping `--help`/`--list-models`/`--dry-run` lightweight and dependency-tolerant.
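To illustrate that pattern, a sketch of a safe-mode guard at the top of an Rscript entry point. Only the flag names and the `ellmer`/`vitals` package names come from the exam; the entry-point shape and the `models.txt` config file are assumptions:

```r
# Sketch: handle --help / --list-models / --dry-run and exit *before*
# any library(ellmer) / library(vitals) call, so these modes need
# neither API keys nor the API-related packages to be installed.
args <- commandArgs(trailingOnly = TRUE)

if ("--help" %in% args) {
  cat("usage: run_eval.R [--help | --list-models | --dry-run]\n")
  quit(status = 0)
}
if ("--list-models" %in% args) {
  # Read model names from static config only; no API client is created.
  models <- readLines("models.txt", warn = FALSE)  # hypothetical config file
  cat(models, sep = "\n")
  quit(status = 0)
}
if ("--dry-run" %in% args) {
  cat("dry run: no API calls will be made\n")
  quit(status = 0)
}

# Only the real evaluation path pays the import cost.
library(ellmer)
library(vitals)
```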
Provenance: `template_repo/` is derived from `skaltman/model-eval` (commit in `BASELINE_COMMIT.txt`). The `scripts/` were authored for this exam and adapted to the upstream repo's structure and expected behaviors (they are not upstream-provided tests).
Participants:
- `attempts/claude-opus-20260131/`: Claude Code — Opus 4.5
- `attempts/codex-gpt52-20260131/`: Codex CLI — gpt-5.2-high
- `attempts/gemini-cli-2026-01-31/`: Gemini CLI — gemini-3-pro-preview
- `attempts/opencode-20250131/`: OpenCode — kimi-k2.5
Leaderboard:
| Rank | Attempt | Agent/model | Automated checks (T1/T2/T3) | Autograder pass rate | Final score |
|---|---|---|---|---|---|
| 1 | `attempts/claude-opus-20260131/` | Claude Code/Opus 4.5 | 3/3 | 100% | 100 |
| 2 | `attempts/codex-gpt52-20260131/` | Codex CLI/gpt-5.2-high | 3/3 | 100% | 99 |
| 3 | `attempts/opencode-20250131/` | OpenCode/kimi-k2.5 | 3/3 | 100% | 92 |
| 4 | `attempts/gemini-cli-2026-01-31/` | Gemini CLI/gemini-3-pro-preview | 3/3 | 100% | 89 |
Notes:
- “Autograder pass rate” is the fraction of exam checks passed; it is not the upstream ARE benchmark “code generation accuracy”.
- All four submissions passed the automated checks; the final score includes manual-review deductions (primarily around T2 safe-mode robustness).
- Claude Code (100): Best alignment with T2 safe-mode expectations (early exit before importing API-related packages) and strong T3 edge-case handling (the “no correct models” case).
- Codex CLI (99): Strong overall, with stricter validation in T1; minor deduction for re-implementing unevaluated-model detection in `--dry-run` rather than reusing `find_unevaluated_models()` (see the sketch after these notes).
- OpenCode (92): Functional checks pass, but T2 imports API-related packages at top level, making the safe modes unnecessarily depend on those packages.
- Gemini CLI (89): Same safe-mode dependency issue as OpenCode; the submission patch also triggers trailing-whitespace warnings on apply (`attempts/gemini-cli-2026-01-31/SUBMISSION.patch`).
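For the Codex CLI deduction above, a hedged sketch of the preferred reuse pattern. Only the name `find_unevaluated_models()` appears in these notes; its return shape and the `report_dry_run()` wrapper are assumptions for illustration:

```r
# Preferred: --dry-run reports what *would* run by reusing the same
# helper the real evaluation path uses, rather than re-deriving the
# model list. find_unevaluated_models() is the repo's helper; its
# return value is assumed here to be a character vector of model names.
report_dry_run <- function() {
  pending <- find_unevaluated_models()  # single source of truth
  if (length(pending) == 0) {
    cat("dry run: all models already evaluated; nothing to do\n")
  } else {
    cat("dry run: would evaluate", length(pending), "model(s):\n")
    cat(paste0("  - ", pending), sep = "\n")
  }
  invisible(pending)
}
```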