Description
I'm trying to reproduce the Kimi K2.5 Thinking mode results on the APEX-Agents Law domain using the Archipelago codebase (latest main branch). The Law domain contains 160 tasks from the mercor/apex-agents dataset.
Reproduction setup
Model: Kimi K2.5 (thinking mode)
Judge model: gemini-2.5-flash
max_steps: 50
Other parameters: defaults as per codebase (system prompt from registry, grading config default)
Environment: Docker + uv sync per README instructions
Results obtained
Mean score: 32.0
Pass@1: 10%
Reported on leaderboard/paper: mean score ~40, Pass@1 ~16%
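For clarity on what I'm measuring, this is roughly how I aggregate the per-task results into the two numbers above (a minimal sketch; the field names are my assumptions, not the codebase's actual grading schema):

```python
# Sketch of the aggregation I use over per-task results.
# Each result is assumed to carry a 'score' in [0, 100] and a
# boolean 'passed' flag; the real grading output may differ.

def aggregate(results):
    """Return (mean score, Pass@1 as a percentage) over all tasks."""
    n = len(results)
    mean_score = sum(r["score"] for r in results) / n
    pass_at_1 = 100.0 * sum(1 for r in results if r["passed"]) / n
    return mean_score, pass_at_1
```

Applied over the 160 Law-domain tasks, this yields the 32.0 / 10% figures reported above.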
Additionally, I noticed that the default system_prompt in the codebase (agents/runner/agents/registry.py) appears to differ from the one described in the APEX-Agents paper (arXiv:2601.14242).
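If it helps to pin down what I'd change, this is the kind of override I had in mind (a hypothetical sketch only; the function name and config keys are my assumptions, not the real registry API):

```python
# Hypothetical sketch of substituting the paper's system prompt for
# the registry default. The config keys here are assumptions; the
# actual structure in agents/runner/agents/registry.py may differ.

PAPER_SYSTEM_PROMPT = "..."  # placeholder for the prompt text from the paper


def with_paper_prompt(run_config: dict) -> dict:
    """Return a copy of the run config with the system prompt replaced,
    leaving all other parameters (judge model, max_steps, ...) untouched."""
    patched = dict(run_config)
    patched["system_prompt"] = PAPER_SYSTEM_PROMPT
    return patched
```

Before I patch anything locally, I'd rather confirm which prompt the leaderboard runs actually used.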
Does my configuration (judge model, max_steps, system prompt) match the official leaderboard reproduction setup?
Are there any known reproducibility notes, config overrides, or common pitfalls for Kimi models?
Would you recommend a specific system prompt, judge model, or other parameters to align more closely with the reported scores?