Reproduction discrepancy: Kimi K2.5 Thinking scores lower than reported on Law domain #4

@QorinAI

Description

I'm trying to reproduce the Kimi K2.5 Thinking mode results on the APEX-Agents Law domain using the Archipelago codebase (latest main branch). The Law domain contains 160 tasks from the mercor/apex-agents dataset.
Reproduction setup

Model: Kimi K2.5 (thinking mode)
Judge model: gemini-2.5-flash
max_steps: 50
Other parameters: codebase defaults (system prompt from the registry, default grading config); see the config sketch below
Environment: Docker + uv sync, following the README instructions
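
For reference, this is roughly how I parameterized the run. The field names below are my own shorthand, not necessarily the actual Archipelago config schema:

```python
# Rough sketch of my run parameters (field names are my own shorthand,
# not necessarily the Archipelago config schema).
run_config = {
    "dataset": "mercor/apex-agents",
    "domain": "law",                    # 160 tasks
    "model": "kimi-k2.5",               # thinking mode enabled
    "judge_model": "gemini-2.5-flash",
    "max_steps": 50,
    # Everything else left at codebase defaults: system prompt from
    # agents/runner/agents/registry.py, default grading config.
}
```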

Results obtained

Mean score: 32.0
Pass@1: 10%

Reported on the leaderboard / in the paper: mean score ~40, Pass@1 ~16%.
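
For clarity, this is how I aggregated the per-task grades into those two numbers. It is my own script, not the repo's reporting code, and I count a task as passed only at the maximum score, which may not match the official Pass@1 definition:

```python
# My own aggregation over the 160 Law-domain task scores (0-100 scale assumed);
# "pass" here means a perfect score, which may differ from the official Pass@1 definition.
def summarize(scores: list[float], pass_threshold: float = 100.0) -> tuple[float, float]:
    mean_score = sum(scores) / len(scores)
    pass_at_1 = sum(s >= pass_threshold for s in scores) / len(scores)
    return mean_score, pass_at_1

# mean, p1 = summarize(task_scores)
# print(f"mean={mean:.1f}, pass@1={p1:.0%}")  # my run: mean=32.0, pass@1=10%
```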
Additionally, I noticed that the default system_prompt in the codebase (agents/runner/agents/registry.py) appears different from the one described in the APEX-Agents paper (arXiv:2601.14242).

Does my configuration (judge model, max_steps, system prompt) match the official leaderboard reproduction setup?
Are there any known reproducibility notes, config overrides, or common pitfalls for Kimi models?
Would you recommend a specific system prompt, judge model, or other parameters to align more closely with the reported scores?
