Merged
2 changes: 2 additions & 0 deletions .agents/skills/deep-research/SKILL.md
@@ -46,6 +46,8 @@ Iterate until evidence quality is sufficient:
8. Run contradiction/counter-evidence checks.
9. Synthesize and produce final report.

When the topic has implementation, benchmark, reproduction, or planning implications, also apply [references/codebase-and-data-research-rules.md](references/codebase-and-data-research-rules.md).

## Re-entry Policy (Mid-Run)

When called during an ongoing run (not only at run start):
@@ -0,0 +1,126 @@
# Codebase And Data Research Rules

Use this reference when the research topic has an implementation, benchmark, reproduction, or planning component.

## 1. Goal

Convert vague "look at the repo/data" work into a traceable evidence pass:

1. identify the most relevant code bases
2. identify the most relevant datasets or workloads
3. determine whether the target plan should align with prior work or intentionally deviate
4. record evidence and uncertainty explicitly

## 2. Codebase Search Order

Search in this order unless the user gives a stricter constraint:

1. local target repo or currently checked out project
2. official implementation linked by the paper, project page, or organization
3. author-maintained GitHub repositories
4. widely used third-party reproductions only if official code is missing or incomplete

Do not treat random GitHub repos as equivalent evidence to official implementations.

## 3. Codebase Search Rules

For each high-relevance method or paper, check whether an open-source codebase exists.

Minimum checks:

1. repository URL
2. owner type: official org, author, or third party
3. activity signals: last commit date and issue activity; treat stars as a weak signal only
4. reproducibility signals: environment file, training script, eval script, config examples, checkpoints, README completeness
5. scope match: full training, inference only, eval only, or partial reproduction
6. license and usage constraints

If the work is highly relevant and no code is open-sourced, record that absence explicitly instead of silently skipping it.
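
The minimum checks above can be captured as one structured record per candidate repository. A minimal sketch; the field names and string values are illustrative, not mandated by this skill:

```python
from dataclasses import dataclass, field

@dataclass
class RepoAudit:
    """One audit record per candidate repository (illustrative fields)."""
    url: str
    owner_type: str                  # "official", "author", or "third-party"
    last_commit: str = ""            # activity signal; stars are weak evidence only
    repro_signals: list[str] = field(default_factory=list)  # env file, scripts, checkpoints, ...
    scope: str = "unknown"           # "full-training", "inference-only", "eval-only", "partial"
    license: str = "unknown"
    code_released: bool = True       # set False to record an explicit absence

# Recording the explicit absence of code for a highly relevant work,
# instead of silently skipping it:
missing = RepoAudit(url="", owner_type="author", code_released=False)
```

Keeping one record per repo makes the later "Codebase and Data Audit" output a matter of serializing what was already collected.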

## 4. What To Extract From A Relevant Repo

When a codebase is relevant, extract only the implementation facts needed for reasoning:

1. entry points for training, evaluation, and inference
2. config system and major knobs
3. data manifest or preprocessing path
4. benchmark or metric implementation
5. dependency and runtime assumptions
6. checkpoint availability
7. any mismatch between the paper claim and the released code

Prefer concrete file paths, script names, and config names over generic summaries.

## 5. Dataset And Workload Search Rules

For each candidate dataset, workload, trace, or benchmark, record:

1. what prior work used it
2. whether it is the main comparison target in the literature
3. task definition and label space
4. train/validation/test split policy
5. scale, domain, and freshness
6. license, access, and usage restrictions
7. known contamination, leakage, or annotation-quality concerns
8. preprocessing conventions used by the most relevant prior work

Do not choose datasets only because they are easy to download.

## 6. Dataset Alignment Decision Rule

Default rule:

1. if the task claims improvement over prior work, align first with the datasets or workloads used by the strongest relevant baselines
2. if the literature has no stable comparison set, propose a primary benchmark set and explain why
3. if the target project serves a different domain or product requirement, keep one literature-aligned benchmark and add one target-domain benchmark

You must state which of the following applies:

1. `fully aligned with prior work`
2. `partially aligned with one added domain-specific benchmark`
3. `intentionally different from prior work`

If intentionally different, explain what comparability is lost and what practical validity is gained.
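
The default rule and the three labels can be sketched as a single decision function. This is one possible precedence when several conditions hold at once; the rule text above does not fix an order, so treat this as an assumption:

```python
def alignment_label(claims_improvement_over_prior_work: bool,
                    literature_has_stable_comparison_set: bool,
                    target_domain_differs: bool) -> str:
    """Map the default decision rule onto the three required labels (sketch)."""
    if target_domain_differs:
        # rule 3: keep one literature-aligned benchmark, add a target-domain benchmark
        return "partially aligned with one added domain-specific benchmark"
    if claims_improvement_over_prior_work and literature_has_stable_comparison_set:
        # rule 1: align with the datasets used by the strongest relevant baselines
        return "fully aligned with prior work"
    # rule 2: no stable comparison set -- propose a primary benchmark set and justify it
    return "intentionally different from prior work"
```

Whatever precedence is used, the chosen label must still be stated explicitly in the output, together with the comparability/validity tradeoff when the third label applies.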

## 7. Comparison Set Construction

When planning evaluations, define the comparison set in this order:

1. strongest official or canonical baseline from the literature
2. strongest practical open-source baseline that can actually be run
3. target-project incumbent or current production baseline when applicable
4. ablations of the proposed method

If the canonical baseline cannot be reproduced, say why and choose the closest defensible substitute.

## 8. Evidence Quality Rules

Treat evidence tiers in this order:

1. paper plus official repo plus released configs or checkpoints
2. paper plus author repo
3. paper only
4. third-party reproduction
5. blog posts, tweets, or issue comments as weak supporting evidence only

Do not present tier 4 or 5 evidence as if it were definitive.
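
The tier ordering lends itself to a small enum, which makes the "tier 4 or 5 is never definitive" rule checkable. A sketch; the names are illustrative:

```python
from enum import IntEnum

class EvidenceTier(IntEnum):
    """Lower value = stronger evidence, matching the ordering above."""
    PAPER_OFFICIAL_REPO_CONFIGS = 1   # paper + official repo + configs/checkpoints
    PAPER_AUTHOR_REPO = 2
    PAPER_ONLY = 3
    THIRD_PARTY_REPRO = 4
    WEAK_SECONDARY = 5                # blog posts, tweets, issue comments

def may_present_as_definitive(tier: EvidenceTier) -> bool:
    # tiers 4 and 5 are supporting evidence only
    return tier <= EvidenceTier.PAPER_ONLY
```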

## 9. Required Output When This Reference Applies

Include a compact `Codebase and Data Audit` block in the research output:

1. target repo inspected or not
2. related official repos found or not
3. strongest runnable baseline repo
4. dataset/workload alignment choice
5. main reproducibility risks
6. unresolved gaps
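
A filled-in sketch of the block, with placeholder values only:

```
Codebase and Data Audit
- target repo inspected: yes
- related official repos found: yes (1)
- strongest runnable baseline repo: <official repo URL>
- dataset/workload alignment choice: fully aligned with prior work
- main reproducibility risks: no released checkpoints
- unresolved gaps: license of the eval set unconfirmed
```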

## 10. Failure And Gap Handling

If repository or dataset evidence is incomplete:

1. say what was searched
2. say what was not found
3. state whether the gap changes the recommendation materially
4. provide the least risky fallback
10 changes: 10 additions & 0 deletions .agents/skills/deep-research/references/output-template.md
@@ -46,6 +46,15 @@ Then add short narrative:
- For `conflict-resolution`: dispute map and source-tier arbitration.
- For `idea-exploration`: landscape and opportunity boundaries.

## Codebase and Data Audit (When Applicable)

- target repo inspected or not
- related official repos found or not
- strongest runnable baseline repo
- dataset/workload alignment choice
- main reproducibility risks
- unresolved gaps

## Research Trail Summary

- queries_run=
@@ -101,3 +110,4 @@ Rules:
4. Match output language to user language.
5. If topic is paper-centric, do not skip `Key Works Deep Dive`.
6. If degradation is used, include explicit degradation metadata and `Degrade Log`.
7. If `references/codebase-and-data-research-rules.md` applies, do not skip `Codebase and Data Audit`.
2 changes: 2 additions & 0 deletions .agents/skills/research-plan/SKILL.md
@@ -102,6 +102,8 @@ Follow this order unless the user explicitly asks for a lighter output:
9. Specify the implementation foundation.
10. Specify expected outcomes and decision rules.

When the plan depends on repository choice, reproducibility, benchmark selection, or dataset alignment, apply [references/codebase-and-data-research-rules.md](references/codebase-and-data-research-rules.md).

## Experiment Design Rules

Do not produce vague items such as "run some experiments" or "evaluate performance."
@@ -0,0 +1,103 @@
# Codebase And Data Research Rules

Use this reference when the plan depends on an implementation foundation, benchmark choice, or reproduction strategy.

## 1. Goal

Turn "codebase and dataset requirements" into explicit planning decisions:

1. which repo to build on
2. which related repos to compare against
3. which datasets or workloads to align with
4. what reproducibility risks are accepted

## 2. Codebase Selection Order

Choose the implementation foundation in this order unless the user overrides it:

1. the current target repo when it already matches the project objective
2. official code linked by the key paper or project page
3. author-maintained GitHub repositories
4. strong third-party reproductions only when official code is missing or unusable

State which repo is the primary foundation and why it wins on relevance, maintainability, and reproducibility.

## 3. Codebase Audit Checklist

For each primary or comparison repo, check:

1. owner type: official org, author, or third party
2. repository scope: full training, inference only, eval only, or partial
3. entry points for training, evaluation, and inference
4. config system and experiment launch pattern
5. dataset manifests and preprocessing scripts
6. metric implementation and benchmark coverage
7. environment reproducibility: lock files, Docker, conda, or install docs
8. checkpoint availability
9. license constraints
10. maintenance signals such as last meaningful commit

Do not name a repo in the plan without explaining whether it is actually runnable for the target objective.

## 4. GitHub Search Rule

For every highly relevant prior work, check whether there is an open-source repository.

Record at least:

1. repo URL
2. owner
3. official or unofficial status
4. what part of the paper it covers
5. whether it is suitable as a baseline, implementation reference, or both

If no public repo exists for an important work, say so explicitly.

## 5. Dataset And Workload Alignment Rule

Default planning rule:

1. align first with the datasets or workloads used by the strongest relevant prior work
2. if the target deployment domain differs, keep one literature-aligned benchmark and add one target-domain benchmark
3. if prior work is fragmented, define a benchmark set that covers the main comparison axis and explain the tradeoff

Do not choose datasets only because they are convenient.

## 6. Dataset Audit Checklist

For each chosen dataset, workload, trace, or benchmark, specify:

1. which related works use it
2. task definition and label space
3. split policy
4. scale and domain coverage
5. freshness or temporal boundary when relevant
6. license and access restrictions
7. preprocessing and normalization conventions
8. contamination, leakage, or quality risks

## 7. Planning Decision Labels

Every plan must label dataset/workload strategy as one of:

1. `aligned with prior work`
2. `aligned plus target-domain extension`
3. `new benchmark strategy`

Every plan must label codebase strategy as one of:

1. `extend current repo`
2. `reuse official repo`
3. `port ideas into target repo`
4. `build minimal new repo`
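
Since both label sets are closed vocabularies, a plan-review step can validate them mechanically. A minimal sketch (the label strings are copied verbatim from sections 5 and 7 above):

```python
DATASET_LABELS = {
    "aligned with prior work",
    "aligned plus target-domain extension",
    "new benchmark strategy",
}
CODEBASE_LABELS = {
    "extend current repo",
    "reuse official repo",
    "port ideas into target repo",
    "build minimal new repo",
}

def validate_plan_labels(dataset_label: str, codebase_label: str) -> None:
    """Fail fast if a plan uses a label outside the allowed sets."""
    if dataset_label not in DATASET_LABELS:
        raise ValueError(f"unknown dataset/workload label: {dataset_label!r}")
    if codebase_label not in CODEBASE_LABELS:
        raise ValueError(f"unknown codebase label: {codebase_label!r}")

validate_plan_labels("aligned with prior work", "reuse official repo")  # passes silently
```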

## 8. Required Plan Content

When this reference applies, the plan must make these items explicit:

1. primary implementation repo
2. comparison repos considered and why they were rejected or kept
3. strongest runnable baseline
4. dataset/workload alignment choice
5. benchmark comparability risks
6. reproducibility blockers and fallback path
@@ -44,6 +44,9 @@ For each experiment, fill:

- Language / framework:
- Existing repo to reuse:
- Codebase strategy label (`extend current repo` / `reuse official repo` / `port ideas into target repo` / `build minimal new repo`):
- Comparison repos considered:
- Strongest runnable baseline repo:
- Experiment or execution entry points:
- Evaluation entry points:
- Config system:
@@ -63,10 +66,13 @@ For each experiment, fill:
### Data / Workloads / Inputs

- Primary data source / workload / benchmark:
- Dataset/workload strategy label (`aligned with prior work` / `aligned plus target-domain extension` / `new benchmark strategy`):
- Which related works use the chosen data:
- Validation / comparison scope:
- Robustness input:
- License / access:
- Preprocessing / normalization:
- Comparability risks:

### Additional Experiment Variants
