Reproduction discrepancy: Kimi K2.5 Thinking scores lower than reported on Law domain #4

@QorinAI

Description

I'm trying to reproduce the Kimi K2.5 Thinking mode results on the APEX-Agents Law domain using the Archipelago codebase (latest main branch). The Law domain contains 160 tasks from the mercor/apex-agents dataset.
Reproduction setup

Model: Kimi K2.5 (thinking mode)
Judge model: gemini-2.5-flash
max_steps: 50
Other parameters: codebase defaults (system prompt from the registry, default grading config); see the config sketch below
Environment: Docker + uv sync, following the README instructions
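
For reference, this is roughly how I parameterized the run. The field names below are my own shorthand, not necessarily the actual Archipelago config schema:

```python
# Rough sketch of my run parameters (field names are my own shorthand,
# not necessarily the Archipelago config schema).
run_config = {
    "dataset": "mercor/apex-agents",
    "domain": "law",                    # 160 tasks
    "model": "kimi-k2.5",               # thinking mode enabled
    "judge_model": "gemini-2.5-flash",
    "max_steps": 50,
    # Everything else left at codebase defaults: system prompt from
    # agents/runner/agents/registry.py, default grading config.
}
```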

Results obtained

Mean score: 32.0
Pass@1: 10%

Reported on the leaderboard / in the paper: mean score ~40, Pass@1 ~16%.
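
For clarity, this is how I aggregated the per-task grades into those two numbers. It is my own script, not the repo's reporting code, and I count a task as passed only at the maximum score, which may not match the official Pass@1 definition:

```python
# My own aggregation over the 160 Law-domain task scores (0-100 scale assumed);
# "pass" here means a perfect score, which may differ from the official Pass@1 definition.
def summarize(scores: list[float], pass_threshold: float = 100.0) -> tuple[float, float]:
    mean_score = sum(scores) / len(scores)
    pass_at_1 = sum(s >= pass_threshold for s in scores) / len(scores)
    return mean_score, pass_at_1

# mean, p1 = summarize(task_scores)
# print(f"mean={mean:.1f}, pass@1={p1:.0%}")  # my run: mean=32.0, pass@1=10%
```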
Additionally, I noticed that the default system_prompt in the codebase (agents/runner/agents/registry.py) appears different from the one described in the APEX-Agents paper (arXiv:2601.14242).

Does my configuration (judge model, max_steps, system prompt) match the official leaderboard reproduction setup?
Are there any known reproducibility notes, config overrides, or common pitfalls for Kimi models?
Would you recommend a specific system prompt, judge model, or other parameters to align more closely with the reported scores?
