A practical playbook for running evidence-driven autoresearch loops.
Inspired by the high-level logic of Karpathy's autoresearch, but optimized for real-world use:
- stronger baseline discipline
- deeper post-run analysis
- explicit acceptance gates
- durable file-based state
- controlled paper / repo / SOTA intake
- reusable templates for repeated experiment loops
Most research loops fail for boring reasons:
- the baseline was unstable,
- too many things changed at once,
- the evaluation path drifted,
- a weak proxy metric was mistaken for real progress,
- or the whole loop depended on a transient chat/session.
practical-autoresearch is a small public repo that tries to fix that.
It gives you a reusable operating pattern for iterative research work on:
- ML systems
- retrieval systems
- agent systems
- evaluation pipelines
- benchmark optimization loops
- any setup where repeated experiment → analysis → next-step cycles matter
practical-autoresearch is a public methodology repo for running automatic research loops with stronger rigor: baseline-first, evidence-driven, resumable, and designed for real engineering constraints.
Karpathy-style autoresearch gives the top-level spirit:
- iterate fast
- measure tightly
- keep the loop autonomous where possible
This repo adds the practical layer that many teams end up needing anyway:
If the baseline is not trustworthy, all downstream conclusions are suspect.
A round does not end at “metric up”. It ends when you understand:
- what improved,
- what regressed,
- which slices moved,
- how strong the signal is,
- and what the next best move should be.
Fast local evaluation is useful. It is not enough for important acceptance decisions.
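An acceptance decision can be made mechanical. The sketch below is illustrative only: the function name, margin, and slice rule are assumptions, not this repo's spec (see `docs/acceptance-gates.md` for the real gates):

```python
# Hypothetical acceptance gate -- names and thresholds are illustrative,
# not taken from this repo's acceptance-gates spec.
def accept(baseline: float, candidate: float,
           min_gain: float = 0.005, slices_regressed: int = 0) -> bool:
    """Accept a round only if the headline metric improved by a real
    margin AND no evaluation slice regressed."""
    return (candidate - baseline) >= min_gain and slices_regressed == 0
```

The point of encoding the gate is that "metric up by a hair" or "average up, one slice down" both fail automatically instead of being argued about per round.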
A serious loop should be resumable from files, not dependent on one agent session or one person's memory.
This repo encourages continuous scanning of papers, repos, and system reports — but only through explicit evaluation, not novelty impulse.
```
.
├── README.md
├── LICENSE
├── SKILL.md
├── docs/
│   ├── methodology.md
│   ├── acceptance-gates.md
│   ├── experiment-ledger-spec.md
│   └── research-program-template.md
├── research-loop/
│   └── templates/
│       ├── round-log.md
│       └── post-run-analysis.md
└── skill/
    ├── autoresearch-playbook/
    │   └── SKILL.md
    └── research_skill.md
```
A healthy autoresearch loop usually looks like this:
- define objective, constraints, and success metric
- verify a trustworthy baseline
- run one focused experiment round
- analyze deeply
- apply an acceptance gate
- scan external developments
- decide the next move from evidence
- update durable state files
In short:
establish truth → test carefully → analyze deeply → update direction → preserve state → repeat
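The cycle can be sketched as a small driver loop. Every helper below is a hypothetical placeholder for project-specific logic, not an API this repo defines:

```python
# Sketch of the round loop; all helper functions are hypothetical
# placeholders supplied by the project, not part of this repo.
def run_program(n_rounds: int, evaluate, propose, apply_change,
                revert, accept, record):
    baseline = evaluate()              # establish truth first
    for round_id in range(n_rounds):
        change = propose()             # one focused change per round
        apply_change(change)
        result = evaluate()
        ok = accept(baseline, result)  # explicit acceptance gate
        record(round_id, change, baseline, result, ok)  # durable state
        if ok:
            baseline = result          # promote to new trusted baseline
        else:
            revert(change)             # keep the baseline honest
    return baseline
```

The structure enforces two of the principles above by construction: only one change is live per round, and a rejected change is reverted rather than quietly kept.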
For a real project, keep the loop state in files like these:
```
my-autoresearch-worktree/
├── program.md
├── ledger.jsonl
├── CURRENT_STATUS.md
├── experiments/
│   └── NNN/
│       ├── round-log.md
│       └── diff.patch
├── eval/
│   └── eval_harness.py
└── src/
    └── ... system under test ...
```
- `program.md`: goals, constraints, active hypotheses, queue
- `ledger.jsonl`: append-only experiment history
- `CURRENT_STATUS.md`: resumable snapshot for humans/agents
- `experiments/NNN/`: detailed per-round logs and diffs
- `eval_harness.py`: as stable as possible across rounds
- `src/`: the thing being improved
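One way to append a round to `ledger.jsonl` is a single JSON object per line. The field names here are illustrative assumptions; the actual field set is defined in `docs/experiment-ledger-spec.md`:

```python
import json
import time

# Illustrative ledger writer -- field names are assumptions, not the
# repo's ledger spec (see docs/experiment-ledger-spec.md).
def append_round(path: str, round_id: int, hypothesis: str,
                 metric: float, accepted: bool) -> None:
    entry = {
        "round": round_id,
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "hypothesis": hypothesis,
        "metric": metric,
        "accepted": accepted,
    }
    with open(path, "a") as f:   # append-only: never rewrite history
        f.write(json.dumps(entry) + "\n")
```

Appending (rather than editing in place) keeps the history tamper-evident and trivially resumable: any agent or person can replay the file to reconstruct the program state.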
- Methodology → `docs/methodology.md`
- Acceptance gates → `docs/acceptance-gates.md`
- Experiment ledger spec → `docs/experiment-ledger-spec.md`
- Research program template → `docs/research-program-template.md`
- Round log template → `research-loop/templates/round-log.md`
- Post-run analysis template → `research-loop/templates/post-run-analysis.md`
- Skill-style reusable prompt → `SKILL.md`
This repo is especially useful for people working on:
- retrieval and memory systems
- agent evaluation loops
- benchmark optimization programs
- model routing / tool-use systems
- production ML systems where regressions are expensive
- any experimental stack that needs repeated, explainable iteration
This repo is not:
- a benchmark leaderboard
- a dump of private experiments
- a claim of implementation parity with any other repo
- tied to a single dataset, company, model provider, or hardware setup
It is meant to stay public, reusable, and method-oriented.
The point is not “fully autonomous research theater”. The point is practical autoresearch:
- good enough to run in the real world,
- disciplined enough to avoid fake progress,
- simple enough to reuse across projects.