diff --git a/README.md b/README.md
index 447e4c31..3724b8c2 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@
 Codev is an operating system for structured human-AI collaboration. You write specs and plans that AI agents execute reliably.
 
-> **Results**: In head-to-head comparison, SPIR scored 92-95 vs VIBE's 12-15 on the same task. [See case study](#-example-implementations)
+> **Results**: One architect + autonomous AI builders shipped [106 PRs in 14 days](codev/resources/development-analysis-2026-02-17.md), median feature in 57 minutes. In controlled comparison, SPIR consistently outperformed unstructured AI coding across [4 rounds](codev/resources/vibe-vs-spir-r4-comparison-2026-02.md). [Case study](#-example-implementations) | [Production data](#production-metrics-feb-2026)
 
 **Quick Links**: [FAQ](docs/faq.md) | [Tips](docs/tips.md) | [Cheatsheet](codev/resources/cheatsheet.md) | [CLI Reference](codev/resources/commands/overview.md) | [Why Codev?](docs/why.md) | [Discord](https://discord.gg/mJ92DhDa6n)
 
@@ -21,9 +21,9 @@ Codev is an operating system for structured human-AI collaboration. You write sp
 - [Learn About Codev](#learn-about-codev)
 - [What is Codev?](#what-is-codev)
 - [The SPIR Protocol](#the-spir-protocol)
-- [Agent Farm](#agent-farm-optional)
 - [Example Implementations](#-example-implementations)
-- [Quick Start & Prerequisites](#quick-start)
+- [Real-World Performance](#-eating-our-own-dog-food)
+- [Agent Farm](#agent-farm-optional)
 - [Contributing](#contributing)
 
 ## Quick Start
@@ -202,10 +202,9 @@ In much the same way an operating system has a memory hierarchy, Codev repos hav
 Both projects below were given **the exact same prompt** to build a Todo Manager application using **Claude Code with Opus**. The difference? The methodology used:
 
 ### [Todo Manager - VIBE](https://github.com/ansari-project/todo-manager-vibe)
-- Built using a **VIBE-style prompt** approach
-- Shows rapid prototyping with conversational AI interaction
-- Demonstrates how a simple prompt can drive development
-- Results in working code through chat-based iteration
+- Built using a **VIBE-style prompt** approach (same model, same prompt)
+- Produced boilerplate scaffolding but 0% of the specified functionality
+- No tests, no database, no working API — demonstrates how conversational approaches can miss the mark entirely
 
 ### [Todo Manager - SPIR](https://github.com/ansari-project/codev-demo)
 - Built using the **SPIR protocol** with full document-driven development
@@ -215,53 +214,62 @@ Both projects below were given **the exact same prompt** to build a Todo Manager
 - Multi-agent consultation throughout the process
 
-📊 Automated Multi-Agent Analysis (click to expand)
+📊 Multi-Agent Comparison (4 rounds) (click to expand)
 
-**Note**: This comparison was generated through automated analysis by 3 independent AI agents (Claude, GPT-5, and Gemini Pro), not human review.
+**Methodology**: Same prompt, same AI model (Claude Opus). Unstructured (conversational) vs SPIR (structured protocol). Scored by 3 independent AI agents (Claude, Codex, Gemini Pro) on a 1-10 scale. Full auto-approved gates — no human review input — to isolate the protocol's effect.
 
-#### Quality Scores (out of 100)
-
-| Aspect | VIBE | SPIR |
-|--------|------|--------|
-| **Overall Score** | **12-15** | **92-95** |
-| Functionality | 0 | 100 |
-| Test Coverage | 0 | 85 |
-| Documentation | 0 | 95 |
-| Architecture | N/A | 90 |
-| Production Readiness | 0 | 85 |
+#### Latest Results (Round 4, Feb 2026)
 
-#### Key Differences
+| Dimension | Unstructured | SPIR | Delta |
+|-----------|:----------:|:----:|:-----:|
+| **Overall** | **5.8** | **7.0** | **+1.2** |
+| Bugs | 6.7 | 7.3 | +0.7 |
+| Code Quality | 7.0 | 7.7 | +0.7 |
+| Tests | 5.0 | 6.7 | +1.7 |
+| Deployment | 2.7 | 6.7 | +4.0 |
 
-**VIBE**: 3 files (boilerplate only), 0% functionality, 0 tests, no database, no API
+#### Key Findings
 
-**SPIR**: 32 source files, 100% functionality, 5 test suites, SQLite + Drizzle ORM, complete REST API, full component architecture, MCP integration, TypeScript + Zod validation
+- **+1.2 quality advantage consistent across all 4 rounds** (R1: +1.3, R2: +1.2, R4: +1.2)
+- SPIR produced **2.9x more test code** with broader layer coverage
+- SPIR produced **fewer source lines** (1,249 vs 1,294) while being more complete — the first round where structured code was more concise
+- **Deployment readiness** showed the largest delta of any dimension in any round (+4.0): multi-stage Dockerfile, standalone output, deploy instructions
+- Multi-agent consultation caught **5 implementation bugs pre-merge** at a cost of $4.38
 
-#### Why SPIR Won
+**Build time**: SPIR took ~56 min vs ~15 min for unstructured (3.7x). Consultation accounts for 45% of the overhead. Estimated cost: $14-19 vs $4-7 (3-5x). For production code, the deployment readiness and test coverage alone justify the investment.
-
-As GPT-5 noted: *"SPIR's methodology clearly outperformed... Plan-first approach with defined scope, iterative verification, and delivery mindset"*
-
-The verdict: **Context-driven development ensures completeness**, while conversational approaches can miss the mark entirely despite identical prompts and AI models.
+
+See [full Round 4 analysis](codev/resources/vibe-vs-spir-r4-comparison-2026-02.md) for detailed scoring, bug sweeps, and architecture comparison.
 ## 🐕 Eating Our Own Dog Food
 
-Codev is **self-hosted** - we use Codev methodology to build Codev itself. This means:
+Codev is **self-hosted** — we use Codev to build Codev. Every feature goes through SPIR. Every improvement has a spec, plan, and review.
+
+### Production Metrics (Feb 2026)
 
-- **Our test infrastructure** is specified in `codev/specs/0001-test-infrastructure.md`
-- **Our development process** follows the SPIR protocol we advocate
-- **Our improvements** come from lessons learned using our own methodology
+Over a 14-day sprint building Codev with Codev ([full analysis](codev/resources/development-analysis-2026-02-17.md)):
+
+| Metric | Value |
+|--------|-------|
+| Merged PRs | 106 |
+| Closed issues | 105 |
+| Commits | 801 |
+| Median feature implementation | 57 minutes |
+| Fully autonomous builders | 85% (22 of 26) |
+| Pre-merge bugs caught by consultation | 20 |
+| Consultation cost per PR | $1.59 |
+
+One architect with autonomous builders matched the output of a **3-4 person elite engineering team** (benchmarked against 5 PRs/developer/week from LinearB's 2026 analysis of 8.1M PRs). The bugfix pipeline is genuinely autonomous: 66% of fixes ship in under 30 minutes (median 13 min from PR creation to merge).
+
+Multi-agent consultation catches real bugs that single-model review misses. No single reviewer found all 20 bugs — Codex excels at edge-case exhaustiveness, Claude at runtime semantics, Gemini at architecture.
 
 This self-hosting approach ensures:
 1. The methodology is battle-tested on real development
 2. We experience the same workflow we recommend to users
-3. Any pain points are felt by us first and fixed quickly
+3. Pain points are felt by us first and fixed quickly
 4. The framework evolves based on actual usage, not theory
 
-You can see this in practice:
-- Check `codev/specs/` for our feature specifications
-- Review `codev/plans/` for how we break down work
-- Learn from `codev/reviews/` to see what we've discovered
-
 ### Understanding This Repository's Structure
 
 This repository has a dual nature:
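
The scoring procedure this patch describes (three reviewers each assign 1-10 scores per dimension, the scores are averaged, and the SPIR-minus-unstructured delta is reported) can be sketched in a few lines. The per-reviewer numbers below are hypothetical placeholders chosen only to illustrate the aggregation; the real Round 4 scores live in the linked analysis file:

```python
from statistics import mean

# Hypothetical per-reviewer scores (Claude, Codex, Gemini), 1-10 scale.
# Tuples are (unstructured, SPIR); values here are illustrative only.
scores = {
    "Bugs":         ([7, 7, 6], [7, 7, 8]),
    "Code Quality": ([7, 7, 7], [8, 8, 7]),
    "Tests":        ([5, 5, 5], [7, 7, 6]),
    "Deployment":   ([3, 3, 2], [7, 7, 6]),
}

for dim, (unstructured, spir) in scores.items():
    u, s = mean(unstructured), mean(spir)
    # Round to one decimal, as in the comparison table; delta is SPIR minus unstructured.
    print(f"{dim}: {u:.1f} vs {s:.1f} (delta {s - u:+.1f})")
```

With these placeholder inputs the loop reproduces the table's rounding convention, e.g. a 2.7 vs 6.7 Deployment row yielding a +4.0 delta.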