A practical playbook for running evidence-driven autoresearch loops.
Inspired by the high-level logic of Karpathy's autoresearch, but optimized for real-world use:
- stronger baseline discipline
- deeper post-run analysis
- explicit acceptance gates
- durable file-based state
- controlled paper / repo / SOTA intake
- reusable templates for repeated experiment loops
Most research loops fail for boring reasons:
- the baseline was unstable,
- too many things changed at once,
- the evaluation path drifted,
- a weak proxy metric was mistaken for real progress,
- or the whole loop depended on a transient chat/session.
practical-autoresearch is a small public repo that tries to fix that.
It gives you a reusable operating pattern for iterative research work on:
- ML systems
- retrieval systems
- agent systems
- evaluation pipelines
- benchmark optimization loops
- any setup where repeated experiment → analysis → next-step cycles matter
practical-autoresearch is a public methodology repo for running automatic research loops with stronger rigor: baseline-first, evidence-driven, resumable, and designed for real engineering constraints.
Karpathy-style autoresearch gives the top-level spirit:
- iterate fast
- measure tightly
- keep the loop autonomous where possible
This repo adds the practical layer that many teams end up needing anyway:
If the baseline is not trustworthy, all downstream conclusions are suspect.
A round does not end at “metric up”. It ends when you understand:
- what improved,
- what regressed,
- which slices moved,
- how strong the signal is,
- and what the next best move should be.
Fast local evaluation is useful. It is not enough for important acceptance decisions.
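An acceptance decision can be made mechanical. The sketch below is illustrative only: the function name, margin, and slice rule are assumptions, not this repo's spec (see `docs/acceptance-gates.md` for the real gates):

```python
# Hypothetical acceptance gate -- names and thresholds are illustrative,
# not taken from this repo's acceptance-gates spec.
def accept(baseline: float, candidate: float,
           min_gain: float = 0.005, slices_regressed: int = 0) -> bool:
    """Accept a round only if the headline metric improved by a real
    margin AND no evaluation slice regressed."""
    return (candidate - baseline) >= min_gain and slices_regressed == 0
```

The point of encoding the gate is that "metric up by a hair" or "average up, one slice down" both fail automatically instead of being argued about per round.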
A serious loop should be resumable from files, not dependent on one agent session or one person's memory.
This repo encourages continuous scanning of papers, repos, and system reports — but only through explicit evaluation, not novelty impulse.
```
.
├── README.md
├── LICENSE
├── SKILL.md
├── docs/
│   ├── methodology.md
│   ├── acceptance-gates.md
│   ├── experiment-ledger-spec.md
│   └── research-program-template.md
├── research-loop/
│   └── templates/
│       ├── round-log.md
│       └── post-run-analysis.md
└── skill/
    ├── autoresearch-playbook/
    │   └── SKILL.md
    └── research_skill.md
```
A healthy autoresearch loop usually looks like this:
- define objective, constraints, and success metric
- verify a trustworthy baseline
- run one focused experiment round
- analyze deeply
- apply an acceptance gate
- scan external developments
- decide the next move from evidence
- update durable state files
In short:
establish truth → test carefully → analyze deeply → update direction → preserve state → repeat
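The cycle can be sketched as a small driver loop. Every helper below is a hypothetical placeholder for project-specific logic, not an API this repo defines:

```python
# Sketch of the round loop; all helper functions are hypothetical
# placeholders supplied by the project, not part of this repo.
def run_program(n_rounds: int, evaluate, propose, apply_change,
                revert, accept, record):
    baseline = evaluate()              # establish truth first
    for round_id in range(n_rounds):
        change = propose()             # one focused change per round
        apply_change(change)
        result = evaluate()
        ok = accept(baseline, result)  # explicit acceptance gate
        record(round_id, change, baseline, result, ok)  # durable state
        if ok:
            baseline = result          # promote to new trusted baseline
        else:
            revert(change)             # keep the baseline honest
    return baseline
```

The structure enforces two of the principles above by construction: only one change is live per round, and a rejected change is reverted rather than quietly kept.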
For a real project, keep the loop state in files like these:
```
my-autoresearch-worktree/
├── program.md
├── ledger.jsonl
├── CURRENT_STATUS.md
├── experiments/
│   └── NNN/
│       ├── round-log.md
│       └── diff.patch
├── eval/
│   └── eval_harness.py
└── src/
    └── ... system under test ...
```
- `program.md`: goals, constraints, active hypotheses, queue
- `ledger.jsonl`: append-only experiment history
- `CURRENT_STATUS.md`: resumable snapshot for humans/agents
- `experiments/NNN/`: detailed per-round logs and diffs
- `eval_harness.py`: as stable as possible across rounds
- `src/`: the thing being improved
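One way to append a round to `ledger.jsonl` is a single JSON object per line. The field names here are illustrative assumptions; the actual field set is defined in `docs/experiment-ledger-spec.md`:

```python
import json
import time

# Illustrative ledger writer -- field names are assumptions, not the
# repo's ledger spec (see docs/experiment-ledger-spec.md).
def append_round(path: str, round_id: int, hypothesis: str,
                 metric: float, accepted: bool) -> None:
    entry = {
        "round": round_id,
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "hypothesis": hypothesis,
        "metric": metric,
        "accepted": accepted,
    }
    with open(path, "a") as f:   # append-only: never rewrite history
        f.write(json.dumps(entry) + "\n")
```

Appending (rather than editing in place) keeps the history tamper-evident and trivially resumable: any agent or person can replay the file to reconstruct the program state.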
- Methodology → `docs/methodology.md`
- Acceptance gates → `docs/acceptance-gates.md`
- Experiment ledger spec → `docs/experiment-ledger-spec.md`
- Research program template → `docs/research-program-template.md`
- Round log template → `research-loop/templates/round-log.md`
- Post-run analysis template → `research-loop/templates/post-run-analysis.md`
- Skill-style reusable prompt → `SKILL.md`
This repo is especially useful for people working on:
- retrieval and memory systems
- agent evaluation loops
- benchmark optimization programs
- model routing / tool-use systems
- production ML systems where regressions are expensive
- any experimental stack that needs repeated, explainable iteration
This repo is not:
- a benchmark leaderboard
- a dump of private experiments
- a claim of implementation parity with any other repo
- tied to a single dataset, company, model provider, or hardware setup
It is meant to stay public, reusable, and method-oriented.
The point is not “fully autonomous research theater”. The point is practical autoresearch:
- good enough to run in the real world,
- disciplined enough to avoid fake progress,
- simple enough to reuse across projects.