Version: 1.0.0
Status: Draft
Date: 2026-02-24
Authors: DUBSOpenHub
License: MIT
Shadow Score is a framework-agnostic metric for measuring the quality of AI-generated code. It quantifies the difference between what an AI implementation agent tested for itself and what an independent, specification-derived test suite actually requires. A Shadow Score of 0% means the implementation's own tests fully anticipated the acceptance criteria. A high Shadow Score reveals blind spots — requirements the AI "thought" it covered but didn't.
Shadow Score is computed using the Sealed-Envelope Protocol, a testing methodology where acceptance tests are generated from the specification before code is written and hidden from the implementation agent throughout the build process.
Most AI coding tools write code and tests together. The AI generates an implementation, then writes tests that validate what it built — not what was required. These tests almost always pass, creating a false sense of quality.
This is analogous to a student writing both the exam and the answer key. A 100% score is guaranteed but meaningless.
The industry lacks a standardized, quantitative metric for answering: "How well did the AI understand and implement the specification, independent of its own self-assessment?"
Existing metrics fall short:
| Metric | Limitation |
|---|---|
| Test pass rate | AI writes tests to match its own code — circular |
| Code coverage | Measures lines executed, not requirements fulfilled |
| Human code review | Subjective, expensive, doesn't scale |
| LLM-as-judge | Another AI evaluating AI — same blindness risk |
Shadow Score introduces an independent, adversarial quality signal by separating test authorship from code authorship and measuring the delta. It answers a specific question: "What percentage of specification requirements did the implementation fail to satisfy, as measured by tests the implementer never saw?"
| Role | Responsibility | Information Access |
|---|---|---|
| Specifier | Produces the specification (requirements, acceptance criteria) | Full project context |
| Seal Author | Generates specification tests (sealed tests) from the spec | Specification ONLY — never code, never architecture |
| Implementer | Writes code and implementation tests (open tests) | Specification + architecture — NEVER sealed tests |
| Validator | Runs all tests and computes Shadow Score | Full access: code + sealed tests + open tests |
In an AI agent pipeline, each role is typically a separate agent invocation with isolated context. In a human workflow, roles may be assigned to different team members.
- Sealed Tests (`S`): Tests generated from the specification before any implementation exists. These tests are hidden from the Implementer. They validate requirements, not implementation details.
- Open Tests (`O`): Tests written by the Implementer alongside the code. These validate the Implementer's own understanding of the requirements.
`Shadow Score = (Sf / St) × 100`

Where:

- `Sf` = Number of sealed tests that failed
- `St` = Total number of sealed tests
- Result is expressed as a percentage (0–100)
| Score | Level | Indicator | Meaning |
|---|---|---|---|
| 0% | Perfect | ✅ | Implementer's tests covered everything the sealed tests checked |
| 1–15% | Minor | 🟢 | Small blind spots — likely edge cases or boundary conditions |
| 16–30% | Moderate | 🟡 | Meaningful gaps — Implementer missed some scenarios |
| 31–50% | Significant | 🟠 | Major gaps — review the Implementer's testing approach |
| >50% | Critical | 🔴 | Fundamental quality issues — consider re-implementation |
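The formula and the interpretation scale above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the specification; the band boundaries follow one reading of the scale (scores above 0% and at or below 15% count as minor, and so on):

```python
def shadow_score(sealed_failed: int, sealed_total: int) -> float:
    """Shadow Score = (Sf / St) x 100, expressed as a percentage."""
    if sealed_total <= 0:
        raise ValueError("at least one sealed test is required")
    return sealed_failed / sealed_total * 100


def interpretation_level(score: float) -> str:
    """Map a Shadow Score onto the interpretation scale (perfect..critical)."""
    if score == 0:
        return "perfect"
    if score <= 15:
        return "minor"
    if score <= 30:
        return "moderate"
    if score <= 50:
        return "significant"
    return "critical"
```

For example, 2 failures out of 18 sealed tests yields a score of about 11.1%, which falls in the minor band.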
These optional metrics provide additional context alongside the primary Shadow Score:
- Coverage Delta: `|sealed_categories_tested - open_categories_tested|` — measures how many types of scenarios (happy path, edge case, error handling, security) differ between suites.
- Overlap Ratio: `matching_scenarios / St` — measures how many sealed test scenarios the Implementer independently anticipated.
- Hardening Velocity: `gap_reduction_per_cycle` — measures how quickly the Shadow Score decreases during iterative fixing.
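These supplementary metrics can be computed as follows. This is a hedged sketch under stated assumptions: the spec does not pin down the exact formulas, so Coverage Delta is read here as the count of categories exercised by exactly one suite, and Hardening Velocity as the average score reduction per completed cycle:

```python
def coverage_delta(sealed_categories: set, open_categories: set) -> int:
    # One reading of |sealed_categories_tested - open_categories_tested|:
    # scenario categories exercised by exactly one of the two suites.
    return len(sealed_categories ^ open_categories)


def overlap_ratio(matching_scenarios: int, sealed_total: int) -> float:
    # matching_scenarios / St: fraction of sealed scenarios the
    # Implementer independently anticipated.
    return matching_scenarios / sealed_total


def hardening_velocity(scores_per_cycle: list) -> float:
    # Average Shadow Score reduction per hardening cycle, given the
    # sequence of scores recorded before/after each cycle.
    if len(scores_per_cycle) < 2:
        return 0.0
    return (scores_per_cycle[0] - scores_per_cycle[-1]) / (len(scores_per_cycle) - 1)
```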
The Sealed-Envelope Protocol is the testing methodology that produces a valid Shadow Score. Implementations MUST follow this protocol to claim Shadow Score conformance.
Input: Specification document (requirements, acceptance criteria, user stories).
Output: Sealed test files written to an isolated location.
Requirements:
- The Seal Author receives ONLY the specification — never code, architecture, or design documents.
- Sealed tests MUST validate behavior, not implementation details (no testing internal functions, private APIs, or data structures).
- Sealed tests MUST cover:
  - Happy path scenarios (expected inputs → expected outputs)
  - Edge cases and boundary conditions
  - Error handling and invalid inputs
  - Security scenarios (where applicable)
- Sealed tests MUST be executable by a standard test runner for the target language/framework.
- After generation, the sealed test directory SHOULD be hashed (SHA-256) as tamper evidence.
This is the critical invariant of the protocol. Breaking isolation invalidates the Shadow Score.
| Phase | Seal Author Sees | Implementer Sees | Validator Sees |
|---|---|---|---|
| Seal Generation | Specification | — | — |
| Implementation | — | Specification, Architecture | — |
| Validation | — | — | Code, Sealed Tests, Open Tests |
| Hardening | — | Failure messages ONLY | Code, All Tests |
Isolation mechanisms (in order of strength):
- Process isolation: Separate OS processes with no shared filesystem access (strongest)
- Context isolation: Separate AI agent invocations with no shared context (recommended for AI pipelines)
- Role isolation: Different humans/teams with access controls (acceptable for human workflows)
- Honor system: Trust-based separation (weakest — not recommended for conformance claims)
Input: Implementation code, sealed tests, open tests.
Output: Gap Report (see §5).
Procedure:
- Copy sealed tests into the implementation workspace
- Install dependencies and build the project
- Run sealed tests using the appropriate test runner
- Run open tests using the appropriate test runner
- Record: total sealed tests, passed, failed (with failure messages)
- Record: total open tests, passed, failed
- Compute Shadow Score: `(sealed_failures / sealed_total) × 100`
- Categorize failures by type (happy path, edge case, error handling, security)
- Produce Gap Report
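The record-and-compute steps of the procedure can be sketched as a report-assembly helper. This is an illustrative sketch: parsing raw test-runner output is runner-specific and omitted, the `build_gap_report` name is hypothetical, and only a subset of the Gap Report fields from §5 is assembled:

```python
def build_gap_report(sealed: dict, open_tests: dict, failures: list) -> dict:
    """Assemble a minimal Gap Report from already-parsed runner results.

    sealed / open_tests: dicts with "total", "passed", "failed" counts.
    failures: list of {test_name, category, expected, actual, message} dicts.
    """
    score = round(sealed["failed"] / sealed["total"] * 100, 1)
    # Interpretation scale from this spec: upper bound of each band, in order.
    scale = [(0, "perfect"), (15, "minor"), (30, "moderate"), (50, "significant")]
    level = next((name for bound, name in scale if score <= bound), "critical")
    return {
        "shadow_score_spec_version": "1.0.0",
        "report": {"shadow_score": score, "level": level},
        "sealed_tests": sealed,
        "open_tests": open_tests,
        "failures": failures,
    }
```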
When Shadow Score > 0%, the Implementer may fix the code iteratively. The hardening loop preserves information isolation:
- Extract from the Gap Report: test name, expected result, actual result, failure message
- Do NOT share the sealed test source code with the Implementer
- The Implementer fixes the implementation based on failure descriptions only
- Re-run validation (§4.3)
- Repeat up to a configured maximum number of cycles
- If Shadow Score remains > 0% after max cycles, escalate to human review
Rationale: Sharing only failure messages (not test code) forces the Implementer to fix the root cause rather than pattern-match against specific test assertions.
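The hardening loop can be sketched as follows. The `implementer_fix` and `validate` callbacks are hypothetical stand-ins for agent invocations; only failure messages (never sealed test source) reach the fixer, per the isolation rule above:

```python
def hardening_loop(implementer_fix, validate, max_cycles: int = 3):
    """Re-validate until Shadow Score reaches 0% or cycles are exhausted.

    validate() returns (shadow_score, failure_messages). Only the failure
    messages, never sealed test code, are passed to implementer_fix.
    """
    score, messages = validate()
    cycles = 0
    while score > 0 and cycles < max_cycles:
        implementer_fix(messages)  # sees failure descriptions only
        score, messages = validate()
        cycles += 1
    if score > 0:
        raise RuntimeError(
            f"Shadow Score still {score:.1f}% after {cycles} cycles: "
            "escalate to human review"
        )
    return score, cycles
```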
To ensure sealed tests are not modified after generation:
- After seal generation, compute: `hash = SHA-256(sorted_file_contents_of_sealed_directory)`
- Store the hash in a tamper-evident log (e.g., state file, database, version control)
- Before validation, recompute the hash and compare
- If hashes differ, the Shadow Score is invalid — sealed tests were tampered with
Recommended implementation:

```shell
find <sealed_dir> -type f | sort | xargs shasum -a 256 | shasum -a 256
```

Implementations SHOULD produce a Gap Report in at least one of these formats:

```json
{
  "shadow_score_spec_version": "1.0.0",
  "report": {
    "id": "run-20260224-1200",
    "timestamp": "2026-02-24T12:00:00Z",
    "specification": "PRD.md",
    "shadow_score": 11.1,
    "level": "minor",
    "sealed_hash": "sha256:a1b2c3d4..."
  },
  "sealed_tests": {
    "total": 18,
    "passed": 16,
    "failed": 2
  },
  "open_tests": {
    "total": 12,
    "passed": 12,
    "failed": 0
  },
  "failures": [
    {
      "test_name": "test_rejects_gpl_dependency",
      "category": "security",
      "expected": "CLI exits with code 2",
      "actual": "CLI exits with code 0",
      "message": "GPL dependency not blocked"
    },
    {
      "test_name": "test_csv_report_includes_risk",
      "category": "edge_case",
      "expected": "CSV contains risk column",
      "actual": "Column missing",
      "message": "Report missing risk metadata"
    }
  ],
  "coverage_comparison": {
    "happy_path": { "open": 6, "sealed": 6, "delta": 0 },
    "edge_cases": { "open": 3, "sealed": 5, "delta": 2 },
    "error_handling": { "open": 3, "sealed": 4, "delta": 1 },
    "security": { "open": 0, "sealed": 3, "delta": 3 }
  },
  "hardening": {
    "cycles_completed": 1,
    "max_cycles": 3,
    "initial_shadow_score": 22.2,
    "final_shadow_score": 11.1
  }
}
```

See `examples/` for complete rendered examples.
| Field | Type | Required | Description |
|---|---|---|---|
| `shadow_score_spec_version` | string | ✅ | Spec version this report conforms to |
| `shadow_score` | number | ✅ | The computed Shadow Score (0–100) |
| `level` | string | ✅ | One of: perfect, minor, moderate, significant, critical |
| `sealed_tests.total` | integer | ✅ | Total sealed tests run |
| `sealed_tests.passed` | integer | ✅ | Sealed tests that passed |
| `sealed_tests.failed` | integer | ✅ | Sealed tests that failed |
| `failures` | array | ✅ | List of failure objects (test_name, expected, actual, message) |
| `sealed_hash` | string | RECOMMENDED | SHA-256 hash of sealed test directory |
| `open_tests.*` | object | RECOMMENDED | Open test results for comparison |
| `coverage_comparison` | object | OPTIONAL | Category-level breakdown |
| `hardening` | object | OPTIONAL | Hardening loop metadata |
Implementations SHOULD categorize each sealed test into one of:
| Category | Description |
|---|---|
| `happy_path` | Standard expected-input → expected-output scenarios |
| `edge_case` | Boundary conditions, empty inputs, max values, unicode |
| `error_handling` | Invalid inputs, missing data, malformed requests |
| `security` | Injection, overflow, unauthorized access, data leakage |
Implementations may claim conformance at three levels:
Requirements:
- Computes Shadow Score using the formula in §3.3
- Produces a Gap Report with all required fields (§5.2)
- Uses the interpretation scale in §3.4
Does NOT require: Sealed-envelope isolation (tests may be authored with knowledge of the implementation).
Use case: Retrofitting Shadow Score onto existing test suites for comparative analysis.
Requirements:
- All of Level 1
- Sealed tests are generated before implementation begins
- Information isolation (§4.2) is enforced at context-isolation level or stronger
- Tamper evidence hash is computed and stored (§4.5)
Use case: AI agent pipelines and multi-agent build systems.
Requirements:
- All of Level 2
- Hardening loop (§4.4) is implemented
- Failure messages shared with Implementer do NOT include test source code
- Hardening velocity is tracked
- Shadow Score is recomputed after each hardening cycle
Use case: Production-grade autonomous build systems.
The reference implementation of the Shadow Score Specification is Dark Factory, an autonomous agentic build system for the GitHub Copilot CLI.
Dark Factory implements Level 3 conformance:
- Sealed tests generated by QA Sealed agent from PRD only (§4.1)
- Context-isolated agents via separate `task()` invocations (§4.2)
- QA Validator runs both suites and produces Gap Report (§4.3)
- Hardening loop with failure-message-only feedback (§4.4)
- SHA-256 hash computed and stored in `state.json` (§4.5)
The reference Level 2 implementation is Terminal Stampede, a parallel agent runtime for CLI coding agents.
Terminal Stampede implements Level 2 conformance:
- Sealed tests generated before agents launch, stored in `sealed-tests/` (§4.1)
- Information isolation — agents never see sealed tests or know they're being scored (§4.2)
- Merger agent runs both suites and integrates Shadow Score into merge report (§4.3)
- SHA-256 tamper hash verified via `.seal-hash` before validation (§4.5)
Lightweight reference validators for computing Shadow Score from test output are available in the validators/ directory.
Scenario: Build a Fibonacci calculator CLI.
| Suite | Total | Passed | Failed |
|---|---|---|---|
| Sealed | 5 | 5 | 0 |
| Open | 8 | 8 | 0 |
Shadow Score: 0 / 5 × 100 = 0% ✅ Perfect
The Implementer's tests covered all scenarios the sealed tests checked, plus 3 additional cases. This indicates thorough understanding of the specification.
Scenario: Build a license scanner CLI tool.
| Suite | Total | Passed | Failed |
|---|---|---|---|
| Sealed | 18 | 16 | 2 |
| Open | 12 | 12 | 0 |
Shadow Score: 2 / 18 × 100 = 11.1% 🟢 Minor
Failures:
- `test_rejects_gpl_dependency` — GPL dependency not blocked (security gap)
- `test_csv_report_includes_risk` — Report missing risk column (edge case)
The Implementer built solid core functionality but missed a security edge case and a reporting detail. One hardening cycle resolved both.
Scenario: Build a user registration API.
| Suite | Total | Passed | Failed |
|---|---|---|---|
| Sealed | 15 | 6 | 9 |
| Open | 4 | 4 | 0 |
Shadow Score: 9 / 15 × 100 = 60% 🔴 Critical
The Implementer wrote only 4 tests — all happy path. Sealed tests caught: missing email validation, no password hashing, duplicate user returns 500 instead of 409, SQL injection vulnerability, missing rate limiting, and more. This indicates the Implementer built to the "golden path" without considering real-world edge cases.
Recommendation: Re-implement with explicit attention to error handling and security requirements.
Q: Can I use Shadow Score without AI agents?
A: Yes. Shadow Score works with any workflow where one party writes specification-based tests and another party implements the code. Human teams can use it for code review, pair programming assessment, or contractor evaluation.
Q: What if the sealed tests themselves are wrong?
A: Sealed tests should be reviewed by a human (or a separate validator) before being finalized. Buggy sealed tests inflate the Shadow Score incorrectly. The tamper evidence hash (§4.5) ensures they aren't changed after the fact — but it doesn't guarantee correctness.
Q: Does a 0% Shadow Score mean the code is perfect?
A: No. It means the code passes all sealed tests. The sealed tests may not cover every possible scenario. Shadow Score measures specification compliance, not absolute correctness.
Q: How many sealed tests should I write?
A: Enough to cover every acceptance criterion in the specification, including happy path, edge cases, error handling, and security. As a guideline: 3–5 sealed tests per acceptance criterion.
Q: Can the Implementer game the system?
A: If information isolation (§4.2) is properly enforced, no. The Implementer cannot see the sealed tests and therefore cannot write code that specifically targets them. If isolation is broken, the Shadow Score is invalid.
Q: How does Shadow Score compare to code coverage?
A: Code coverage measures lines of code executed by tests. Shadow Score measures specification requirements satisfied by the implementation. You can have 100% code coverage and a 50% Shadow Score (the code runs but produces wrong results for half the requirements).
Q: Is Shadow Score useful for non-AI development?
A: Yes. Any team practicing independent verification and validation (IV&V) can benefit. The concept originates from quality engineering practices used in aerospace, medical devices, and safety-critical systems.
- Initial specification release
- Shadow Score formula and interpretation scale
- Sealed-Envelope Protocol (4 phases)
- Reporting format (JSON + Markdown)
- Three conformance levels
- Reference implementation: Dark Factory