Vibe Science

Adversarial agent loops for verifiable vibe research. Claude Code plugin. Falsification-first, not production-first.


An AI-native research engine that loops until discovery — with adversarial review, quality gates, and serendipity tracking.

Vibe Science turns an LLM into a disciplined research agent. It provides a structured methodology (OTAE loop), an adversarial review system (Reviewer 2 Ensemble), typed evidence tracking, and quality gates — while preserving room for unexpected discoveries.

This repository tracks the evolution of Vibe Science across four major releases, from the original OTAE loop to a fully fault-injected verification framework. Each version is self-contained and independently installable.


The Problem

AI agents are dangerous in science. Not because they hallucinate — that's the easy problem.

The dangerous problem is that they find real patterns in real data and construct plausible narratives around them, without ever asking: "What if this is an artifact?"

What the agent does:

  • Optimizes for completion, not truth
  • Gets excited by strong signals (p < 10⁻¹⁰⁰!)
  • Constructs narratives around artifacts
  • Never searches for what kills its own claims
  • Declares "done" after 1 sprint

What actually happened (21 sprints, CRISPR):

  • OR = 2.30 → reversed sign under propensity matching
  • "Bidirectional effects" → biologically impossible
  • "Regime switch" → Cohen's d = 0.07 (noise)
  • "Generalizable rankings" → don't generalize between assays

None of these were hallucinations. The data was real. The statistics were correct. The problem was dispositional: the agent never tried to destroy its own claims.


The Solution

                    Builder (Researcher)              Destroyer (Reviewer 2)
                    ───────────────────               ─────────────────────
  Optimizes for:    Completion                        Survival
  Default stance:   "This looks promising"            "This is probably an artifact"
  Strong signal:    Excitement → narrative → paper     Suspicion → confounders → controls
  Web search for:   Supporting evidence               Contradictions, prior art, known artifacts
  Says "done":      When results look good            When ALL counter-verifications pass

Vibe Science embeds both dispositions in the same system. The builder builds. The destroyer destroys. Only what survives both gets published.


Version Evolution

| Version | Codename | Architecture | Key Innovation | Laws | Gates |
|---|---|---|---|---|---|
| v3.5 | TERTIUM DATUR | OTAE Loop | R2 double-pass, typed claims, evidence formula | 7 | 12 |
| v4.0 | ARBOR VITAE | OTAE-Tree | Tree search, branch scoring, serendipity branches | 10 | 26 |
| v4.5 | ARBOR VITAE (Pruned) | OTAE-Tree + Brainstorm | Phase 0 brainstorm, R2 6 modes, 5-stage pipeline | 10 | 25 |
| v5.0 | IUDEX | OTAE-Tree + Verification | SFI, blind-first pass, R3 judge, schema-validated gates | 10 | 27 |
| v5.0 Codex | IUDEX | Same as v5.0 | OpenAI Codex port (condensed SKILL.md, no hooks/TEAM) | 10 | 27 |



v3.5 — TERTIUM DATUR

The foundation. Field-tested over 21 sprints of CRISPR-Cas9 research (VibeX 2026).

v3.5 introduces the OTAE (Observe-Think-Act-Evaluate) research loop — a six-phase cycle adapted from the OpenAI Codex unrolled agent loop. Each cycle executes exactly one action, evaluates the result, and persists state to files before looping back.

Architecture: OTAE Loop

╔══════════════════════════════════════════════════════════════╗
║                     OTAE-SCIENCE LOOP                        ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  OBSERVE     →  Read STATE.md + PROGRESS. Identify delta.    ║
║       ↓                                                      ║
║  THINK       →  Plan highest-value next action.              ║
║       ↓         Which skill to dispatch? What to falsify?    ║
║  ACT         →  Execute ONE action                           ║
║       ↓         (search / analyze / extract / compute)       ║
║  EVALUATE    →  Extract claims → score confidence → gate     ║
║       ↓         Detect serendipity → flag for triage         ║
║  CHECKPOINT  →  R2 trigger? Serendipity triage? Stop?        ║
║       ↓                                                      ║
║  CRYSTALLIZE →  Update STATE.md, PROGRESS.md, CLAIM-LEDGER   ║
║       ↓         → LOOP BACK TO OBSERVE                       ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

7 Constitutional Laws

| # | Law | Rule |
|---|---|---|
| 1 | DATA-FIRST | No thesis without evidence from data |
| 2 | EVIDENCE DISCIPLINE | Every claim: claim_id + evidence chain + confidence + status |
| 3 | GATES BLOCK | Quality gates are hard stops, not suggestions |
| 4 | R2 ALWAYS-ON | Every milestone passes adversarial review |
| 5 | SERENDIPITY PRESERVED | Unexpected discoveries are features, not distractions |
| 6 | ARTIFACTS OVER PROSE | If it can produce a file, it MUST |
| 7 | FRESH CONTEXT RESILIENCE | Resumable from STATE.md alone |

Reviewer 2 Ensemble

Not a gate you pass — a co-pilot you can't fire. Four specialist reviewers (Methods, Stats, Bio, Engineering) run a double-pass workflow:

  1. Fatal Hunt (purely destructive): find what's broken
  2. Method Repair (constructive): propose what would fix it

Every flaw gets a numeric severity score (0-100):

| Range | Level | Action |
|---|---|---|
| 0-29 | MINOR | Note, continue |
| 30-59 | MAJOR | Must address before next cycle |
| 60-79 | SEVERE | Must fix + re-submit to R2 |
| 80-100 | FATAL | REJECT — no re-submission without new evidence |

3-level orthogonal attack: L1-Logic · L2-Statistics · L3-Data

Evidence Engine

Every claim is quantified, not felt:

confidence = E×0.30 + R×0.25 + C×0.20 + K×0.15 + D×0.10

FLOOR: E < 0.2 → capped at 0.20

4 typed claims: descriptive · correlative · causal · predictive — evidence standard scales with claim type.
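
To make the scoring concrete, here is a minimal sketch of the v3.5 rule in Python. The five components E, R, C, K, D are not expanded in this README, so they are passed as opaque 0-1 values; the weights and the E-floor come from the formula above, and the function name is illustrative.

```python
def claim_confidence(E: float, R: float, C: float, K: float, D: float) -> float:
    """v3.5 weighted-sum confidence for a claim; all inputs in [0, 1]."""
    score = E * 0.30 + R * 0.25 + C * 0.20 + K * 0.15 + D * 0.10
    # FLOOR rule: weak E caps confidence no matter how strong the rest is.
    if E < 0.2:
        score = min(score, 0.20)
    return score

# Strong corroboration cannot rescue a claim with almost no direct evidence:
print(claim_confidence(E=0.1, R=0.9, C=0.9, K=0.9, D=0.9))  # 0.20, not 0.66
```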

Serendipity Engine

Active scanner, not passive logger. Quantitative triage (0-15 score) with scheduled sprints every 10 cycles:

| Score | Action |
|---|---|
| >= 12 | INTERRUPT — immediate attention |
| >= 8 | QUEUE — next available cycle |
| >= 4 | FILE — track for patterns |
| < 4 | NOISE — discard |
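
A minimal sketch of the triage thresholds as code (the function name is hypothetical; the cut-offs and actions are from the table above):

```python
def triage_serendipity(score: int) -> str:
    """Map a 0-15 serendipity score to its v3.5 triage action."""
    if score >= 12:
        return "INTERRUPT"  # immediate attention
    if score >= 8:
        return "QUEUE"      # next available cycle
    if score >= 4:
        return "FILE"       # track for patterns
    return "NOISE"          # discard
```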

Quality Gates

12 gates organized in 3 categories — each is a hard stop:

  • Pipeline (G0-G5): Input Sanity, Schema, Design, Training, Metrics, Artifacts
  • Literature (L0-L2): Source Validity (DOI verified), Coverage (>= 3 sources), Review Complete
  • Decision (D0-D2): Decision Justified, Claim Promotion (R2 approved), RQ Conclusion

Field Testing: 21 Sprints

| Metric | Value |
|---|---|
| Sprints completed | 21 |
| Total claims registered | 34 |
| Claims killed or downgraded | 11 (32%) |
| Most dangerous claim caught | OR = 2.30, p < 10⁻¹⁰⁰ — sign reversed by propensity matching |
| Paper reference | Vibe Science: Adversarial Epistemic Architecture for LLM-Driven Research (VibeX 2026) |

Protocols (9)

| Protocol | Purpose |
|---|---|
| Loop OTAE | 6-phase cycle with emergency protocols (context rot, state corruption, infinite loop) |
| Evidence Engine | Claim Ledger, confidence formula, Assumption Register, anti-hallucination rules |
| Reviewer 2 Ensemble | 4-domain adversarial review, double-pass, typed claims, tool-use obligation |
| Search Protocol | Source priority (Scopus > PubMed > OpenAlex > bioRxiv > web), DOI verification |
| Analysis Orchestrator | Artifact contract (manifest + report + figures + metrics + scripts) |
| Serendipity Engine | Quantitative triage (0-15), scheduled sprints, PURSUE/QUEUE/FILE/DISCARD |
| Knowledge Base | Cross-RQ persistence: library.json, patterns.md, dead-ends.md |
| Data Extraction | NO TRUNCATION rule, AnnData schema contract, GEO/SRA/ENA handling |
| Audit & Reproducibility | Decision log, run comparison, manifests, 10-point reproducibility contract |

v4.0 — ARBOR VITAE

Evolves the flat OTAE loop into a branching tree search over hypotheses.

The biggest architectural change since v1.0. Each OTAE cycle becomes a node in a tree — the agent can branch, score, prune, and backtrack through the hypothesis space.

Architecture: OTAE-Tree

                         root
                        /    \
                    node-A   node-B          ← each = full OTAE cycle
                   / |  \      |
                A1  A2  A3    B1             ← children = variations
               /
             A1a                             ← deeper exploration

     Selection: Score = Evidence×0.6 + Metrics×0.3 + Novelty×0.1
     Pruning:   3 debug fails → prune | 5 non-improving → soft prune
     Health:    good_nodes / total >= 0.2 or EMERGENCY STOP

7 node types: draft · debug · improve · hyperparameter · ablation · replication · serendipity

3 tree modes: LINEAR (literature) · BRANCHING (experiments) · HYBRID (both)
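
The selection and pruning rules above can be sketched in a few lines of Python (class and function names are illustrative; the weights and thresholds are from the box above):

```python
from dataclasses import dataclass

@dataclass
class TreeNode:
    """One OTAE cycle as a node in the v4.0 hypothesis tree."""
    evidence: float        # 0-1
    metrics: float         # 0-1
    novelty: float         # 0-1
    debug_fails: int = 0
    non_improving: int = 0

    @property
    def score(self) -> float:
        return 0.6 * self.evidence + 0.3 * self.metrics + 0.1 * self.novelty

def prune_state(node: TreeNode) -> str | None:
    if node.debug_fails >= 3:
        return "PRUNE"        # hard prune after 3 failed debug attempts
    if node.non_improving >= 5:
        return "SOFT_PRUNE"   # stop expanding, keep in the tree
    return None

def tree_healthy(good_nodes: int, total_nodes: int) -> bool:
    # Health rule: below a 20% good-node ratio the run emergency-stops.
    return total_nodes > 0 and good_nodes / total_nodes >= 0.2
```

Best-first selection then simply expands the live node with the highest score.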

v3.5 → v4.0: What Changed

| Dimension | v3.5 | v4.0 |
|---|---|---|
| Loop | Flat OTAE | OTAE-Tree (nodes in a tree) |
| Exploration | Sequential | Best-first with branching + pruning |
| Serendipity | Linear scanning | Cross-branch pattern detection |
| Laws | 7 | 10 (+Explore, +Confounder Harness, +Crystallize) |
| Gates | 12 | 26 (+Tree T0-T3, +Brainstorm B0, +Stage S1-S5) |
| Protocols | 9 | 16 (+7 new) |
| Stages | None | 5-stage experiment lifecycle |
| Agents | Single context | SOLO + TEAM modes |
| Configuration | Plugin only | CLAUDE.md constitution + hooks enforcement |
| Paper output | Manual | Writeup Engine (IMRAD from verified claims) |
| Figure validation | None | VLM Gate (vision-language model check) |

3 New Constitutional Laws

| # | Law | Rule |
|---|---|---|
| 8 | EXPLORE BEFORE EXPLOIT | Minimum 3 draft nodes before any is promoted |
| 9 | CONFOUNDER HARNESS | Raw → Conditioned → Matched. Sign change = ARTIFACT. NO HARNESS = NO CLAIM |
| 10 | CRYSTALLIZE OR LOSE | Every result written to persistent file. NOT IN FILE = DOESN'T EXIST |

Serendipity Branches

Unexpected findings become first-class tree nodes (serendipity type). Cross-branch pattern detection finds connections that are invisible when exploring linearly — the same variable behaving differently in two branches becomes a discovery signal, not noise.

SOLO vs TEAM Mode

| | SOLO | TEAM |
|---|---|---|
| Context | All roles in one window | Separate agents per role |
| R2 independence | Simulated (double-pass) | True (own context window) |
| Cost | 1x | ~3-4x |
| Best for | Literature, short sessions | Computational experiments, high stakes |

5-Stage Experiment Manager

S1 Preliminary (max 20 iter) → S2 Hyperparameter (max 12)
→ S3 Research Agenda (max 12) → S4 Ablation (max 18) → S5 Synthesis (max 5)

Shortcuts: Literature-only: S1 → S5 | Analysis: S1 → S2 → S4 → S5
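
As a quick reference, the stage caps and shortcut paths can be written as a small config (a sketch; the key names are illustrative):

```python
# Iteration caps per stage, from the pipeline above.
STAGE_LIMITS = {"S1": 20, "S2": 12, "S3": 12, "S4": 18, "S5": 5}

# Shortcut paths for runs that skip experiment-heavy stages.
SHORTCUTS = {
    "literature_only": ["S1", "S5"],
    "analysis": ["S1", "S2", "S4", "S5"],
}
```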

New Protocols (+7)

| Protocol | Purpose |
|---|---|
| Tree Search | 3 modes, 7 node types, best-first selection, pruning rules |
| Experiment Manager | 5-stage lifecycle with iteration limits |
| Auto-Experiment | Code generation → execution → metric parsing pipeline |
| Brainstorm Engine | Phase 0 structured ideation |
| Agent Teams | SOLO/TEAM architecture, shared filesystem, fallback |
| VLM Gate | Optional figure validation via vision-language model |
| Writeup Engine | IMRAD paper drafting from verified claims |

v4.5 — ARBOR VITAE (Pruned)

Adds structured ideation (Phase 0) and a 5-stage research pipeline, in a package 206 lines smaller than v4.0 yet more capable.

v4.5 applies progressive disclosure aggressively — the SKILL.md is shorter, but protocol files are richer. The key innovations are in systematic idea generation and R2 expansion.

Phase 0: Brainstorm Engine

A mandatory 10-step ideation phase before the OTAE loop begins:

UNDERSTAND → LANDSCAPE → GAPS → INVERSION → DATA →
HYPOTHESES → COLLISION-ZONE → TRIAGE → PRODUCTIVE TENSIONS →
R2 REVIEW → COMMIT

Key moves:

  • Inversion Exercise: systematically invert top 3 consensus claims to generate contrarian hypotheses
  • Collision-Zone Thinking: force cross-domain hypotheses (physics × biology, economics × ecology)
  • Productive Tensions: preserve competing paradigms instead of premature convergence

R2 reviews the brainstorm output. Only WEAK_ACCEPT or better locks the research direction.

R2 Expanded to 6 Modes

| Mode | Trigger | Purpose |
|---|---|---|
| BRAINSTORM | Phase 0 | Review ideation quality |
| FORCED | Every 20 cycles | Mandatory scheduled review |
| BATCH | Multiple claims | Group review of pending claims |
| SHADOW | Continuous | Background monitoring |
| VETO | R2-initiated | Emergency stop on a finding |
| REDIRECT | R2-initiated | Force a change of direction |

v4.0 → v4.5: What Changed

| Intervention | What it does |
|---|---|
| Inversion Exercise | Systematically invert consensus claims for contrarian hypotheses |
| Collision-Zone | Force cross-domain hypothesis generation |
| Productive Tensions | Preserve competing paradigms |
| R2 Red Flag Checklist | 12 mandatory flags (6 statistical + 6 methodological) at every review |
| Counter-Evidence Search | Active hunt for contradicting evidence before claim promotion |
| DOI Verification | Verify every citation resolves before trusting it |
| Progressive Disclosure | 381 lines removed, pointers to protocol files, faster context loading |

Evidence Engine Upgrades

  • Counter-evidence search mandatory at confidence >= 0.60
  • DOI verification before any claim promotion
  • Confounder Harness (LAW 9): Raw → Conditioned → Matched — sign change = ARTIFACT (killed), collapse >50% = CONFOUNDED (sketched below)
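
The harness decision rule reduces to comparing an effect estimate before and after adjustment. A minimal sketch, assuming a signed effect (e.g., a log odds ratio) and using the thresholds above; the function name is illustrative:

```python
def harness_verdict(raw_effect: float, matched_effect: float) -> str:
    """LAW 9: compare an effect estimate raw vs. after propensity matching."""
    if raw_effect * matched_effect < 0:
        return "ARTIFACT"     # sign flipped under matching -> claim killed
    if abs(matched_effect) < 0.5 * abs(raw_effect):
        return "CONFOUNDED"   # effect collapsed by more than 50%
    return "SURVIVES"
```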

Auto-Experiment Protocol

Computational hypothesis testing pipeline:

1. Researcher formulates hypothesis as testable code
2. Auto-Experiment generates script with seeds + version info
3. Execution → metric parsing → artifact creation
4. R2 reviews results (not just conclusions)
5. Gate evaluation → tree node scoring (a minimal sketch of steps 2-3 follows)
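
A minimal sketch of steps 2-3, recording seed and version info so any reviewer can re-run the exact setup (the file name and function signature are illustrative):

```python
import json
import platform
import random
import sys

def run_auto_experiment(hypothesis_test, seed: int = 0) -> dict:
    """Execute a testable hypothesis with seeds and versions recorded."""
    random.seed(seed)
    manifest = {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "metrics": hypothesis_test(),  # user-supplied callable returning metrics
    }
    with open("experiment_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)  # persistent artifact for R2 review
    return manifest
```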

v5.0 — IUDEX

The verification release. Every finding is tested before it's trusted. R2 is structurally unbypassable — not just prompted, architecturally enforced.

Huang et al. (ICLR 2024) showed that LLMs cannot self-correct reasoning without external feedback. v4.5's R2 was strong but prompt-enforced. v5.0 makes adversarial review architecturally unbypassable.

Four Innovations

Seeded Fault Injection (SFI)

Inject known errors before R2 reviews. If R2 misses them → review invalid.

Mutation testing for scientific claims.

Gate: V0 (RMS >= 0.80)

Judge Agent (R3)

Meta-reviewer scores R2's review on 6 dimensions.

Reviews the review, not the claims.

Gate: J0 (>= 12/18)

Blind-First Pass (BFP)

R2 reviews claims before seeing justifications. Breaks anchoring bias.

Think first, then compare.

Protocol: 2-phase within one review

Schema-Validated Gates (SVG)

8 gates enforce JSON Schema. Prose claims ignored.

Structure, not promises.

9 schema files (READ-ONLY)

FORCED Review Flow (v5.0)

TRIGGER → SFI (inject faults) → BFP Phase 1 (blind review)
       → BFP Phase 2 (full context) → V0 (vigilance gate)
       → J0 (judge gate) → Schema Validation → Normal Gate Eval

       V0 FAIL or J0 FAIL → restart from BFP Phase 1
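
In code terms, the flow is a loop that only exits when both verification gates pass. A minimal sketch, assuming hypothetical helper callables for fault injection, R2 review, and R3 judging; the gate thresholds are from the diagram above:

```python
def forced_review(claims, inject_faults, r2_review, r3_judge, max_rounds=5):
    """v5.0 FORCED review: SFI -> BFP -> V0 -> J0, restarting on failure."""
    faults = inject_faults(claims)                       # SFI: seed known errors
    for _ in range(max_rounds):
        blind = r2_review(claims, justifications=False)  # BFP phase 1: blind
        review = r2_review(claims, justifications=True, prior=blind)  # phase 2
        caught = sum(f in review.flagged for f in faults)
        rms = caught / len(faults) if faults else 1.0
        if rms < 0.80:             # V0 vigilance gate: missed faults -> invalid
            continue               # restart from BFP phase 1
        if r3_judge(review) < 12:  # J0 judge gate: score out of 18
            continue
        return review              # proceed to schema validation + gate eval
    raise RuntimeError("FORCED review did not pass V0/J0 within max_rounds")
```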

Six Enhancements

| Enhancement | What it does |
|---|---|
| R2 Salvagente | Killed claims with remaining potential are preserved as serendipity seeds |
| Structured Seeds | Schema-validated research objects, not free-form notes |
| Exploration Budget | LAW 8 gains a measurable 20% floor at T3 |
| Confidence Formula | Hard veto + geometric mean with dynamic floor |
| Circuit Breaker | Same objection × 3 rounds → DISPUTED. Frozen, not killed. S5 poison pill. |
| Permission Model | R2 produces verdicts. Orchestrator executes. Separation of powers. |

Agent Permission Model

R2 Ensemble (READ only) → verdict.yaml → Orchestrator (READ+WRITE)
                                              │
                    ┌─────────────────────────┤
                    ↓                         ↓
              R3 Judge (scores)         Claim Ledger (updated)

R2 NEVER writes to the claim ledger.
R3 NEVER modifies R2's report.
Schemas are READ-ONLY for all agents.

v5.0 Structural Innovations (not new Laws — new enforcement mechanisms)

v5.0 keeps the same 10 Laws as v4.0/v4.5. The breakthrough is making them architecturally unbypassable through 4 new structural mechanisms:

| Innovation | What it does | Gate |
|---|---|---|
| Seeded Fault Injection (SFI) | Inject known faults before R2 reviews. Miss = review INVALID. | V0 (RMS >= 0.80) |
| Blind-First Pass (BFP) | R2 reviews claims before seeing justifications. Breaks anchoring bias. | — (within review) |
| Judge Agent (R3) | Meta-reviewer scores R2's review on 6 dimensions. Reviews the review, not the claims. | J0 (>= 12/18) |
| Schema-Validated Gates (SVG) | 8 gates enforce JSON Schema. Prose claims of completion are ignored. | 8 schema gates |

Plus: Circuit Breaker (deadlock → DISPUTED), Agent Permission Model (separation of powers), R2 Salvagente (killed claims must produce serendipity seeds).
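
What "schema-validated" means in practice: a gate accepts only a structured object that validates against a fixed schema, so a prose assertion of completion can never pass. A minimal sketch using the jsonschema package; the schema fields here are hypothetical, not the repository's actual schemas/ files:

```python
from jsonschema import ValidationError, validate

# Hypothetical promotion schema; the real schemas/ files are READ-ONLY.
CLAIM_PROMOTION_SCHEMA = {
    "type": "object",
    "required": ["claim_id", "confidence", "r2_verdict"],
    "properties": {
        "claim_id": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "r2_verdict": {"enum": ["ACCEPT", "WEAK_ACCEPT"]},
    },
}

def gate_claim_promotion(promotion: dict) -> bool:
    """Pass only if the object validates; prose never reaches this point."""
    try:
        validate(promotion, CLAIM_PROMOTION_SCHEMA)
        return True
    except ValidationError:
        return False
```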

Quality Gates: 27 Total

Pipeline (G0-G6)

G0  Input Sanity
G1  Schema Validation
G2  Design Review
G3  Training Health
G4  Metrics Baseline
G5  Artifacts Exist
G6  VLM Validation (opt)

Literature (L0-L2)

L0  Source Validity (DOI)
L1  Coverage (>= 3)
L2  Review Complete

Decision (D0-D2)

D0  Decision Justified
D1  Claim Promotion
D2  RQ Conclusion

Tree (T0-T3)

T0  Node Validity
T1  Debug Limit (max 3)
T2  Branch Diversity
T3  Tree Health (>= 20%)

Brainstorm + Stage

B0  Brainstorm Quality
S1-S5  Stage Gates

Verification (v5.0)

V0  R2 Vigilance (SFI)
J0  Judge Assessment (R3)

v5.0 Protocols (21 total)

| Category | Protocols |
|---|---|
| Core Loop | Loop OTAE, Evidence Engine, Reviewer 2 Ensemble, Search Protocol |
| Tree & Experiment | Tree Search, Experiment Manager, Auto-Experiment, Brainstorm Engine, Agent Teams |
| Research Support | Analysis Orchestrator, Data Extraction, Serendipity Engine, Knowledge Base, Audit & Reproducibility |
| Output | VLM Gate, Writeup Engine |
| v5.0 Structural | Seeded Fault Injection, Judge Agent (R3), Blind-First Pass, Schema Validation, Circuit Breaker |

Key Features Across Versions

Reviewer 2 Ensemble — Evolution

| Feature | v3.5 | v4.0 | v4.5 | v5.0 |
|---|---|---|---|---|
| Reviewers | 4 (Methods, Stats, Bio, Eng) | 4 | 4 | 4 |
| Modes | 3 (standard, batch, forced) | 3 | 6 (+shadow, veto, redirect) | 6 |
| Workflow | Double-pass | Double-pass | Double-pass + red flags | Double-pass + BFP + SFI |
| Independence | Simulated | TEAM mode available | TEAM mode | TEAM + R3 Judge |
| Attack levels | 3-level orthogonal | 3-level | + 12 red flag checklist | + fault injection |
| Schema enforcement | None | None | None | 8 gates schema-validated |

Evidence Engine — Evolution

| Feature | v3.5 | v4.0+ |
|---|---|---|
| Formula | E·R·C·K·D weighted sum | Geometric mean with hard veto |
| Floor | E < 0.2 → cap at 0.20 | E < 0.05 or D < 0.05 → zero |
| Counter-evidence | Not required | Mandatory at confidence >= 0.60 |
| Confounder harness | Not systematic | LAW 9: Raw → Conditioned → Matched |
| Claim types | 4 typed | 4 typed + schema-validated promotion |
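
The later formula can be sketched the same way as the v3.5 one. This is an assumption-laden sketch: the README states the veto thresholds but not the exact aggregation, so an unweighted geometric mean stands in for the real rule:

```python
import math

def claim_confidence_v4(E: float, R: float, C: float, K: float, D: float) -> float:
    """v4.0+ style confidence: hard veto first, then a geometric mean (sketch)."""
    if E < 0.05 or D < 0.05:
        return 0.0  # hard veto: a near-zero critical component kills the claim
    components = [E, R, C, K, D]
    return math.prod(components) ** (1 / len(components))
```

Unlike a weighted sum, a geometric mean cannot let four strong components mask one catastrophic weakness.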

Serendipity Engine — Evolution

| Feature | v3.5 | v4.0+ | v5.0 |
|---|---|---|---|
| Scale | 0-15 | 0-20 | 0-20 |
| Scanning | Every 10 cycles | Every EVALUATE | Every EVALUATE |
| Cross-branch | No | Yes | Yes |
| Salvagente | No | No | Yes — killed claims produce seeds |
| Interrupt threshold | >= 12 | >= 15 | >= 15 |

Academic Foundations

v5.0 IUDEX is grounded in peer-reviewed research. Every architectural decision traces to an empirical finding:

Core: LLM Self-Correction Limitations (3 papers)

| Paper | Key Finding | v5.0 Response |
|---|---|---|
| Huang et al. (ICLR 2024) — "LLMs Cannot Self-Correct Reasoning Yet" | Intrinsic self-correction ineffective; 74.7% retain initial answer | Foundation for entire v5.0 architecture. R2 must be structurally separated, not just prompted |
| Gou et al. (ICLR 2024) — "CRITIC" | Self-correction works ONLY with external tool feedback | Validates R2's mandatory tool-use. But prompts can be circumvented → Schema-Validated Gates |
| Kamoi et al. (TACL 2024) | No prior work demonstrates successful self-correction from prompts alone | Motivates architectural triad (SFI + BFP + R3) |

Multi-Agent Correction (3 papers)

| Paper | Key Finding | v5.0 Response |
|---|---|---|
| Du et al. (ICML 2024) — "Multiagent Debate" | Multiple agents debating reduces factual errors by 30%+ | Direct validation of R2 multi-reviewer architecture |
| Wang et al. (2022) — "Self-Consistency" | Sampling N independent chains and aggregating outperforms single-pass | In SOLO mode, R2 generates N=3 independent assessments |
| Dhuliawala et al. (2023) — "Chain-of-Verification" | Generate verification questions independently from original draft | Strengthens BFP Phase 1 design |

Peer Review as Model (4 papers)

| Paper | Key Finding | v5.0 Response |
|---|---|---|
| Krlev & Spicer (JMS 2023) — "Reining in Reviewer Two" | Epistemic respect = assess on soundness, not origin | R2's calibration: destructive but rigorous. R3 enforces review quality |
| Watling et al. (2021) — "Don't Be Reviewer 2!" | Checklist-only reviews are mechanical | R2 Red Flag Checklist is a floor. R3 ensures R2 goes beyond checklist compliance |
| Jefferson et al. (JAMA 2002) | Interventions to improve peer review were "relatively unsuccessful" | You cannot fix peer review with better instructions alone → SFI + R3 |
| PMC (2024) — "Peer Reviews of Peer Reviews" | Longer reviews rated higher — a length bias | R3 rubric rewards specificity and evidence, not verbosity |

Mutation Testing Theory — SFI Design (2 papers)

| Paper | Key Finding | v5.0 Response |
|---|---|---|
| Jia & Harman (IEEE TSE 2011) — "Mutation Testing" | 10% random sampling ~84% as effective as exhaustive | Justifies 1-3 faults per FORCED review |
| Papadakis et al. (2019) — "Mutation Testing Advances" | Equivalent mutants inflate scores, must be managed | EQUIV state in fault taxonomy |

Concurrent Work: DeepMind Deep Think & Aletheia (4 papers)

| Paper | Key Finding | v5.0 Response |
|---|---|---|
| Snell et al. (2024) — "Scaling LLM Test-Time Compute" | Inference-time compute scaling improves reasoning | Theoretical grounding for OTAE-Tree |
| DeepMind Aletheia (2026) — "Autonomous Mathematics Research" | Generator-Verifier-Reviser architecture | Architecturally isomorphic to Researcher-R2-Researcher loop |
| DeepMind (2026) — "Accelerating Scientific Research with Gemini" | Human-AI collaboration patterns | Maps to OTAE loop, tree search, brainstorm collision-zone |
| Kumar et al. (ICLR 2025) — SCoRe | RL enables genuine self-correction (+15.6% MATH) | Self-correction CAN work with structural change. SFI + R3 are the agent-level analog |

Vibe Science vs Deep Think / Aletheia

| | Deep Think / Aletheia | Vibe Science v5.0 |
|---|---|---|
| Level | Inference-time (within model) | Agent-time (separate agents) |
| Verifier | Process Reward Model (trained) | R2 Ensemble (prompted + tool-grounded) |
| Verification type | Logical (reasoning-based) | Empirical (tool-grounded: PubMed, Scopus, web) |
| Structural enforcement | PRM weights (non-bypassable) | JSON Schema gates + SFI + R3 |
| Cost | Proprietary (Google AI Ultra) | Open source (Apache 2.0, any LLM) |

Complementary, not competing. Deep Think catches what pure reasoning can catch. Vibe Science catches what only external empirical verification can catch.


Design Philosophy

Vibe Science was built by reverse-engineering two complementary approaches:

Agentic research loops (Ralph, GSD, BMAD, AI-Scientist-v2)

Excellent as systems: infinite loop, state management, tree search.

Missing: executability, adversarial review, serendipity.

Scientific toolkits (Anthropic bio-research, Claude Scientific Skills, MCP)

Excellent as tools: CLI scripts, database APIs, analysis pipelines.

Missing: loop, persistence, adversarial review.

Vibe Science fuses both: the systematic rigor of a research loop with the concrete executability of a scientific toolkit, bound together by an adversarial co-pilot that prevents the system from lying to itself.

"A research system that doesn't execute is a wish. A toolkit that doesn't iterate is a toolbox. You need both: loop + tool."

Lineage

| Ancestor | Pattern Taken |
|---|---|
| Ralph Wiggum | Bounded iterative loop (warn@15, forced-R2@20, alert@30) |
| GSD | File-based state persistence (STATE.md, PROGRESS.md) |
| BMAD | Multi-agent ensemble pattern |
| OpenAI Codex loop | OTAE cycle structure, single action per cycle |
| Anthropic bio-research | CLI scripts, MCP endpoints, executability |
| Superpowers | Dispatch/routing architecture |
| AI-Scientist-v2 | Tree search architecture, 4-stage manager |

Repository Structure

vibe-science/
├── README.md                   ← You are here
├── CITATION.cff                ← GitHub citation metadata (DOI)
├── LICENSE                     ← Apache 2.0
├── NOTICE                      ← Academic citation requirement
├── CHANGELOG.md                ← Version history
├── logos/                      ← Version-specific SVG logos
│   ├── logo-v3.5.svg
│   ├── logo-v4.0.svg
│   ├── logo-v4.5.svg
│   └── logo-v5.0.svg
│
├── vibe-science-v3.5/          ← Claude Code skill (v3.5)
│   ├── SKILL.md                    320 lines
│   ├── protocols/ (9)              ~1,500 lines
│   ├── gates/gates.md              272 lines
│   └── assets/ (3)                 ~615 lines
│
├── vibe-science-v4.0/          ← Claude Code skill (v4.0)
│   ├── SKILL.md + CLAUDE.md
│   ├── protocols/ (16)
│   ├── gates/
│   └── assets/ (6)
│
├── vibe-science-v4.5/          ← Claude Code skill (v4.5)
│   ├── SKILL.md + CLAUDE.md
│   ├── protocols/ (16)
│   ├── gates/
│   └── assets/ (6)
│
├── vibe-science-v5.0/          ← Claude Code skill (v5.0 IUDEX)
│   ├── SKILL.md (~1,150 lines)
│   ├── CLAUDE.md (constitution)
│   ├── protocols/ (21)
│   ├── gates/ + schemas/ (9)
│   └── assets/ (8)
│
└── vibe-science-v5.0-codex/    ← OpenAI Codex skill (v5.0 IUDEX)
    ├── SKILL.md (~480 lines)
    ├── agents/openai.yaml
    ├── references/ (23)
    └── assets/ (11)

Installation

Claude Code

git clone https://github.com/th3vib3coder/vibe-science.git

# Install the version you want:
claude plugins add ./vibe-science/vibe-science-v5.0    # latest
claude plugins add ./vibe-science/vibe-science-v3.5    # stable, paper version

OpenAI Codex

# The Codex version is in vibe-science-v5.0-codex/
# Follow instructions in vibe-science-v5.0-codex/README.md

Manual (any LLM interface)

Upload the SKILL.md of your chosen version as a system prompt or project knowledge file. Upload protocols/, gates/, and assets/ directories for on-demand reference loading.


For the Paper

This repository documents the evolution of Vibe Science for the VibeX 2026 publication. Each version directory is a complete, standalone snapshot:

  • v3.5 is the version described in the paper (field-tested, 21 sprints)
  • v4.0 → v4.5 → v5.0 show the progression of ideas
  • Annotated git tags (v3.5.0, v4.0.0, v4.5.0, v5.0.0) and Zenodo archival provide permanent, traceable evolution

Citation & Attribution

If you use Vibe Science in your research, please cite:

Russo, C. & Bertelli, E. (MD) (2026). Vibe Science: an AI-native research engine with adversarial review and serendipity tracking. https://github.com/th3vib3coder/vibe-science · DOI: 10.5281/zenodo.18665031



License

Apache 2.0 — see LICENSE.

© 2026 Carmine Russo & Dr. Elisa Bertelli (MD)

Authors

Carmine Russo · Dr. Elisa Bertelli (MD)