Version: 1.0.0
Status: Draft
Date: 2026-02-24
Authors: DUBSOpenHub
License: MIT
Shadow Score is a framework-agnostic metric for measuring the quality of AI-generated code. It quantifies the difference between what an AI implementation agent tested for itself and what an independent, specification-derived test suite actually requires. A Shadow Score of 0% means the implementation's own tests fully anticipated the acceptance criteria. A high Shadow Score reveals blind spots — requirements the AI "thought" it covered but didn't.
Shadow Score is computed using the Sealed-Envelope Protocol, a testing methodology where acceptance tests are generated from the specification before code is written and hidden from the implementation agent throughout the build process.
Most AI coding tools write code and tests together. The AI generates an implementation, then writes tests that validate what it built — not what was required. These tests almost always pass, creating a false sense of quality.
This is analogous to a student writing both the exam and the answer key. A 100% score is guaranteed but meaningless.
The industry lacks a standardized, quantitative metric for answering: "How well did the AI understand and implement the specification, independent of its own self-assessment?"
Existing metrics fall short:
| Metric | Limitation |
|---|---|
| Test pass rate | AI writes tests to match its own code — circular |
| Code coverage | Measures lines executed, not requirements fulfilled |
| Human code review | Subjective, expensive, doesn't scale |
| LLM-as-judge | Another AI evaluating AI — same blindness risk |
Shadow Score introduces an independent, adversarial quality signal by separating test authorship from code authorship and measuring the delta. It answers a specific question: "What percentage of specification requirements did the implementation fail to satisfy, as measured by tests the implementer never saw?"
| Role | Responsibility | Information Access |
|---|---|---|
| Specifier | Produces the specification (requirements, acceptance criteria) | Full project context |
| Seal Author | Generates specification tests (sealed tests) from the spec | Specification ONLY — never code, never architecture |
| Implementer | Writes code and implementation tests (open tests) | Specification + architecture — NEVER sealed tests |
| Validator | Runs all tests and computes Shadow Score | Full access: code + sealed tests + open tests |
In an AI agent pipeline, each role is typically a separate agent invocation with isolated context. In a human workflow, roles may be assigned to different team members.
- Sealed Tests (`S`): Tests generated from the specification before any implementation exists. These tests are hidden from the Implementer. They validate requirements, not implementation details.
- Open Tests (`O`): Tests written by the Implementer alongside the code. These validate the Implementer's own understanding of the requirements.
`Shadow Score = (Sf / St) × 100`

Where:

- `Sf` = Number of sealed tests that failed
- `St` = Total number of sealed tests
- Result is expressed as a percentage (0–100)
| Score | Level | Indicator | Meaning |
|---|---|---|---|
| 0% | Perfect | ✅ | Implementer's tests covered everything the sealed tests checked |
| 1–15% | Minor | 🟢 | Small blind spots — likely edge cases or boundary conditions |
| 16–30% | Moderate | 🟡 | Meaningful gaps — Implementer missed some scenarios |
| 31–50% | Significant | 🟠 | Major gaps — review the Implementer's testing approach |
| >50% | Critical | 🔴 | Fundamental quality issues — consider re-implementation |
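The formula and the interpretation scale above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the specification; the band boundaries follow one reading of the scale (scores above 0% and at or below 15% count as minor, and so on):

```python
def shadow_score(sealed_failed: int, sealed_total: int) -> float:
    """Shadow Score = (Sf / St) x 100, expressed as a percentage."""
    if sealed_total <= 0:
        raise ValueError("at least one sealed test is required")
    return sealed_failed / sealed_total * 100


def interpretation_level(score: float) -> str:
    """Map a Shadow Score onto the interpretation scale (perfect..critical)."""
    if score == 0:
        return "perfect"
    if score <= 15:
        return "minor"
    if score <= 30:
        return "moderate"
    if score <= 50:
        return "significant"
    return "critical"
```

For example, 2 failures out of 18 sealed tests yields a score of about 11.1%, which falls in the minor band.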
These optional metrics provide additional context alongside the primary Shadow Score:
- Coverage Delta: `|sealed_categories_tested - open_categories_tested|` — measures how many types of scenarios (happy path, edge case, error handling, security) differ between suites.
- Overlap Ratio: `matching_scenarios / St` — measures how many sealed test scenarios the Implementer independently anticipated.
- Hardening Velocity: `gap_reduction_per_cycle` — measures how quickly the Shadow Score decreases during iterative fixing.
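These supplementary metrics can be computed as follows. This is a hedged sketch under stated assumptions: the spec does not pin down the exact formulas, so Coverage Delta is read here as the count of categories exercised by exactly one suite, and Hardening Velocity as the average score reduction per completed cycle:

```python
def coverage_delta(sealed_categories: set, open_categories: set) -> int:
    # One reading of |sealed_categories_tested - open_categories_tested|:
    # scenario categories exercised by exactly one of the two suites.
    return len(sealed_categories ^ open_categories)


def overlap_ratio(matching_scenarios: int, sealed_total: int) -> float:
    # matching_scenarios / St: fraction of sealed scenarios the
    # Implementer independently anticipated.
    return matching_scenarios / sealed_total


def hardening_velocity(scores_per_cycle: list) -> float:
    # Average Shadow Score reduction per hardening cycle, given the
    # sequence of scores recorded before/after each cycle.
    if len(scores_per_cycle) < 2:
        return 0.0
    return (scores_per_cycle[0] - scores_per_cycle[-1]) / (len(scores_per_cycle) - 1)
```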
The Sealed-Envelope Protocol is the testing methodology that produces a valid Shadow Score. Implementations MUST follow this protocol to claim Shadow Score conformance.
Input: Specification document (requirements, acceptance criteria, user stories).
Output: Sealed test files written to an isolated location.
Requirements:
- The Seal Author receives ONLY the specification — never code, architecture, or design documents.
- Sealed tests MUST validate behavior, not implementation details (no testing internal functions, private APIs, or data structures).
- Sealed tests MUST cover:
  - Happy path scenarios (expected inputs → expected outputs)
  - Edge cases and boundary conditions
  - Error handling and invalid inputs
  - Security scenarios (where applicable)
- Sealed tests MUST be executable by a standard test runner for the target language/framework.
- After generation, the sealed test directory SHOULD be hashed (SHA-256) as tamper evidence.
This is the critical invariant of the protocol. Breaking isolation invalidates the Shadow Score.
| Phase | Seal Author Sees | Implementer Sees | Validator Sees |
|---|---|---|---|
| Seal Generation | Specification | — | — |
| Implementation | — | Specification, Architecture | — |
| Validation | — | — | Code, Sealed Tests, Open Tests |
| Hardening | — | Failure messages ONLY | Code, All Tests |
Isolation mechanisms (in order of strength):
- Process isolation: Separate OS processes with no shared filesystem access (strongest)
- Context isolation: Separate AI agent invocations with no shared context (recommended for AI pipelines)
- Role isolation: Different humans/teams with access controls (acceptable for human workflows)
- Honor system: Trust-based separation (weakest — not recommended for conformance claims)
Input: Implementation code, sealed tests, open tests.
Output: Gap Report (see §5).
Procedure:
- Copy sealed tests into the implementation workspace
- Install dependencies and build the project
- Run sealed tests using the appropriate test runner
- Run open tests using the appropriate test runner
- Record: total sealed tests, passed, failed (with failure messages)
- Record: total open tests, passed, failed
- Compute Shadow Score: `(sealed_failures / sealed_total) × 100`
- Categorize failures by type (happy path, edge case, error handling, security)
- Produce Gap Report
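The record-and-compute steps of the procedure can be sketched as a report-assembly helper. This is an illustrative sketch: parsing raw test-runner output is runner-specific and omitted, the `build_gap_report` name is hypothetical, and only a subset of the Gap Report fields from §5 is assembled:

```python
def build_gap_report(sealed: dict, open_tests: dict, failures: list) -> dict:
    """Assemble a minimal Gap Report from already-parsed runner results.

    sealed / open_tests: dicts with "total", "passed", "failed" counts.
    failures: list of {test_name, category, expected, actual, message} dicts.
    """
    score = round(sealed["failed"] / sealed["total"] * 100, 1)
    # Interpretation scale from this spec: upper bound of each band, in order.
    scale = [(0, "perfect"), (15, "minor"), (30, "moderate"), (50, "significant")]
    level = next((name for bound, name in scale if score <= bound), "critical")
    return {
        "shadow_score_spec_version": "1.0.0",
        "report": {"shadow_score": score, "level": level},
        "sealed_tests": sealed,
        "open_tests": open_tests,
        "failures": failures,
    }
```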
When Shadow Score > 0%, the Implementer may fix the code iteratively. The hardening loop preserves information isolation:
- Extract from the Gap Report: test name, expected result, actual result, failure message
- Do NOT share the sealed test source code with the Implementer
- The Implementer fixes the implementation based on failure descriptions only
- Re-run validation (§4.3)
- Repeat up to a configured maximum number of cycles
- If Shadow Score remains > 0% after max cycles, escalate to human review
Rationale: Sharing only failure messages (not test code) forces the Implementer to fix the root cause rather than pattern-match against specific test assertions.
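The hardening loop can be sketched as follows. The `implementer_fix` and `validate` callbacks are hypothetical stand-ins for agent invocations; only failure messages (never sealed test source) reach the fixer, per the isolation rule above:

```python
def hardening_loop(implementer_fix, validate, max_cycles: int = 3):
    """Re-validate until Shadow Score reaches 0% or cycles are exhausted.

    validate() returns (shadow_score, failure_messages). Only the failure
    messages, never sealed test code, are passed to implementer_fix.
    """
    score, messages = validate()
    cycles = 0
    while score > 0 and cycles < max_cycles:
        implementer_fix(messages)  # sees failure descriptions only
        score, messages = validate()
        cycles += 1
    if score > 0:
        raise RuntimeError(
            f"Shadow Score still {score:.1f}% after {cycles} cycles: "
            "escalate to human review"
        )
    return score, cycles
```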
To ensure sealed tests are not modified after generation:
- After seal generation, compute: `hash = SHA-256(sorted_file_contents_of_sealed_directory)`
- Store the hash in a tamper-evident log (e.g., state file, database, version control)
- Before validation, recompute the hash and compare
- If hashes differ, the Shadow Score is invalid — sealed tests were tampered with
Recommended implementation:

```shell
find <sealed_dir> -type f | sort | xargs shasum -a 256 | shasum -a 256
```

Implementations SHOULD produce a Gap Report in at least one of these formats:

```json
{
  "shadow_score_spec_version": "1.0.0",
  "report": {
    "id": "run-20260224-1200",
    "timestamp": "2026-02-24T12:00:00Z",
    "specification": "PRD.md",
    "shadow_score": 11.1,
    "level": "minor",
    "sealed_hash": "sha256:a1b2c3d4..."
  },
  "sealed_tests": {
    "total": 18,
    "passed": 16,
    "failed": 2
  },
  "open_tests": {
    "total": 12,
    "passed": 12,
    "failed": 0
  },
  "failures": [
    {
      "test_name": "test_rejects_gpl_dependency",
      "category": "security",
      "expected": "CLI exits with code 2",
      "actual": "CLI exits with code 0",
      "message": "GPL dependency not blocked"
    },
    {
      "test_name": "test_csv_report_includes_risk",
      "category": "edge_case",
      "expected": "CSV contains risk column",
      "actual": "Column missing",
      "message": "Report missing risk metadata"
    }
  ],
  "coverage_comparison": {
    "happy_path": { "open": 6, "sealed": 6, "delta": 0 },
    "edge_cases": { "open": 3, "sealed": 5, "delta": 2 },
    "error_handling": { "open": 3, "sealed": 4, "delta": 1 },
    "security": { "open": 0, "sealed": 3, "delta": 3 }
  },
  "hardening": {
    "cycles_completed": 1,
    "max_cycles": 3,
    "initial_shadow_score": 22.2,
    "final_shadow_score": 11.1
  }
}
```

See `examples/` for complete rendered examples.
| Field | Type | Required | Description |
|---|---|---|---|
| `shadow_score_spec_version` | string | ✅ | Spec version this report conforms to |
| `shadow_score` | number | ✅ | The computed Shadow Score (0–100) |
| `level` | string | ✅ | One of: perfect, minor, moderate, significant, critical |
| `sealed_tests.total` | integer | ✅ | Total sealed tests run |
| `sealed_tests.passed` | integer | ✅ | Sealed tests that passed |
| `sealed_tests.failed` | integer | ✅ | Sealed tests that failed |
| `failures` | array | ✅ | List of failure objects (test_name, expected, actual, message) |
| `sealed_hash` | string | RECOMMENDED | SHA-256 hash of sealed test directory |
| `open_tests.*` | object | RECOMMENDED | Open test results for comparison |
| `coverage_comparison` | object | OPTIONAL | Category-level breakdown |
| `hardening` | object | OPTIONAL | Hardening loop metadata |
Implementations SHOULD categorize each sealed test into one of:
| Category | Description |
|---|---|
| `happy_path` | Standard expected-input → expected-output scenarios |
| `edge_case` | Boundary conditions, empty inputs, max values, unicode |
| `error_handling` | Invalid inputs, missing data, malformed requests |
| `security` | Injection, overflow, unauthorized access, data leakage |
Implementations may claim conformance at three levels:
Requirements:
- Computes Shadow Score using the formula in §3.3
- Produces a Gap Report with all required fields (§5.2)
- Uses the interpretation scale in §3.4
Does NOT require: Sealed-envelope isolation (tests may be authored with knowledge of the implementation).
Use case: Retrofitting Shadow Score onto existing test suites for comparative analysis.
Requirements:
- All of Level 1
- Sealed tests are generated before implementation begins
- Information isolation (§4.2) is enforced at context-isolation level or stronger
- Tamper evidence hash is computed and stored (§4.5)
Use case: AI agent pipelines and multi-agent build systems.
Requirements:
- All of Level 2
- Hardening loop (§4.4) is implemented
- Failure messages shared with Implementer do NOT include test source code
- Hardening velocity is tracked
- Shadow Score is recomputed after each hardening cycle
Use case: Production-grade autonomous build systems.
The reference implementation of the Shadow Score Specification is Dark Factory, an autonomous agentic build system for the GitHub Copilot CLI.
Dark Factory implements Level 3 conformance:
- Sealed tests generated by QA Sealed agent from PRD only (§4.1)
- Context-isolated agents via separate `task()` invocations (§4.2)
- QA Validator runs both suites and produces Gap Report (§4.3)
- Hardening loop with failure-message-only feedback (§4.4)
- SHA-256 hash computed and stored in `state.json` (§4.5)
The reference Level 2 implementation is Terminal Stampede, a parallel agent runtime for CLI coding agents.
Terminal Stampede implements Level 2 conformance:
- Sealed tests generated before agents launch, stored in `sealed-tests/` (§4.1)
- Information isolation — agents never see sealed tests or know they're being scored (§4.2)
- Merger agent runs both suites and integrates Shadow Score into merge report (§4.3)
- SHA-256 tamper hash verified via `.seal-hash` before validation (§4.5)
Lightweight reference validators for computing Shadow Score from test output are available in the validators/ directory.
Scenario: Build a Fibonacci calculator CLI.
| Suite | Total | Passed | Failed |
|---|---|---|---|
| Sealed | 5 | 5 | 0 |
| Open | 8 | 8 | 0 |
Shadow Score: 0 / 5 × 100 = 0% ✅ Perfect
The Implementer's tests covered all scenarios the sealed tests checked, plus 3 additional cases. This indicates thorough understanding of the specification.
Scenario: Build a license scanner CLI tool.
| Suite | Total | Passed | Failed |
|---|---|---|---|
| Sealed | 18 | 16 | 2 |
| Open | 12 | 12 | 0 |
Shadow Score: 2 / 18 × 100 = 11.1% 🟢 Minor
Failures:
- `test_rejects_gpl_dependency` — GPL dependency not blocked (security gap)
- `test_csv_report_includes_risk` — Report missing risk column (edge case)
The Implementer built solid core functionality but missed a security edge case and a reporting detail. One hardening cycle resolved both.
Scenario: Build a user registration API.
| Suite | Total | Passed | Failed |
|---|---|---|---|
| Sealed | 15 | 6 | 9 |
| Open | 4 | 4 | 0 |
Shadow Score: 9 / 15 × 100 = 60% 🔴 Critical
The Implementer wrote only 4 tests — all happy path. Sealed tests caught: missing email validation, no password hashing, duplicate user returns 500 instead of 409, SQL injection vulnerability, missing rate limiting, and more. This indicates the Implementer built to the "golden path" without considering real-world edge cases.
Recommendation: Re-implement with explicit attention to error handling and security requirements.
Q: Can I use Shadow Score without AI agents?
A: Yes. Shadow Score works with any workflow where one party writes specification-based tests and another party implements the code. Human teams can use it for code review, pair programming assessment, or contractor evaluation.
Q: What if the sealed tests themselves are wrong?
A: Sealed tests should be reviewed by a human (or a separate validator) before being finalized. Buggy sealed tests inflate the Shadow Score incorrectly. The tamper evidence hash (§4.5) ensures they aren't changed after the fact — but it doesn't guarantee correctness.
Q: Does a 0% Shadow Score mean the code is perfect?
A: No. It means the code passes all sealed tests. The sealed tests may not cover every possible scenario. Shadow Score measures specification compliance, not absolute correctness.
Q: How many sealed tests should I write?
A: Enough to cover every acceptance criterion in the specification, including happy path, edge cases, error handling, and security. As a guideline: 3–5 sealed tests per acceptance criterion.
Q: Can the Implementer game the system?
A: If information isolation (§4.2) is properly enforced, no. The Implementer cannot see the sealed tests and therefore cannot write code that specifically targets them. If isolation is broken, the Shadow Score is invalid.
Q: How does Shadow Score compare to code coverage?
A: Code coverage measures lines of code executed by tests. Shadow Score measures specification requirements satisfied by the implementation. You can have 100% code coverage and a 50% Shadow Score (the code runs but produces wrong results for half the requirements).
Q: Is Shadow Score useful for non-AI development?
A: Yes. Any team practicing independent verification and validation (IV&V) can benefit. The concept originates from quality engineering practices used in aerospace, medical devices, and safety-critical systems.
- Initial specification release
- Shadow Score formula and interpretation scale
- Sealed-Envelope Protocol (4 phases)
- Reporting format (JSON + Markdown)
- Three conformance levels
- Reference implementation: Dark Factory