Skip to content

Codex Document Format — an open specification for documents that unify viewing and editing, with modern security and machine readability

License

Notifications You must be signed in to change notification settings

Entrolution/codex-file-format-spec

Codex Document Format

Validate Schemas License

An open specification for documents that unify viewing and editing, with modern security and machine readability.

Status: Draft Specification (v0.1)

Problem Statement

The document landscape is fundamentally divided:

  • View-optimized formats (PDF, DJVU) offer layout fidelity but poor editability
  • Edit-optimized formats (DOCX, ODF, Markdown) enable rich editing but render inconsistently
  • No existing format excels at both viewing and editing

This divide creates workflow friction, format conversion overhead, and lost fidelity. Additionally:

  • PDF's security model is broken: 21 of 22 desktop viewers vulnerable to signature attacks
  • Machine readability is an afterthought: 73-96% accuracy even with state-of-the-art extraction
  • Compression is outdated: PDF uses 30-year-old DEFLATE, missing modern algorithms
  • No clear "frozen" semantics: Signatures don't truly lock documents
  • No verifiable history: Document lineage and provenance are external concerns, not built-in
  • Appearance varies by viewer: Even PDF renders differently across implementations

Design Goals

Primary Goals

  1. Unified View/Edit Mode: One format from draft to archive, no conversion step
  2. Semantic-First: Content stored as meaning, presentation derived
  3. Modern Security: Algorithm-agile cryptography, post-quantum ready
  4. Machine Readable: AI/ML extraction works reliably by design
  5. Content-Addressable: Document hash is its identity; modifications create new versions
  6. Verifiable Provenance: Built-in hash chains and Merkle trees for tamper-evident history

Secondary Goals

  • Efficient compression using modern algorithms (Zstandard, AVIF)
  • Clear document state machine (draft → review → frozen/signed)
  • State-aware presentation (reactive for drafts, precise for frozen/published)
  • Block-level proofs (prove a section exists without revealing the whole document)
  • Timestamp anchoring (RFC 3161, blockchain)
  • Accessibility built-in (WCAG-aligned)

Non-Goals

  • Replacing PDF for legacy/archival use cases
  • Supporting scripting/executable content (security risk)
  • Achieving 100% PDF import fidelity

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    DOCUMENT CONTAINER                        │
│  (ZIP archive, content-addressable hash = identifier)        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────┐  ┌──────────────────────────────────┐ │
│  │ MANIFEST         │  │ CONTENT STREAMS                  │ │
│  │ - Version        │  │ ┌────────────────────────────┐   │ │
│  │ - State          │  │ │ Semantic Document Layer    │   │ │
│  │ - Signatures[]   │  │ │ (JSON blocks + Merkle tree)│   │ │
│  │ - Lineage        │  │ └────────────────────────────┘   │ │
│  │ - Merkle root    │  │ ┌────────────────────────────┐   │ │
│  └──────────────────┘  │ │ Presentation Layer(s)      │   │ │
│                        │ │ - Reactive (hints/styles)  │   │ │
│  ┌──────────────────┐  │ │ - Precise (exact coords)   │   │ │
│  │ SECURITY LAYER   │  │ └────────────────────────────┘   │ │
│  │ - Signatures     │  │ ┌────────────────────────────┐   │ │
│  │ - Encryption     │  │ │ Assets                     │   │ │
│  │ - Timestamps     │  │ │ (images, fonts, embeds)    │   │ │
│  └──────────────────┘  │ └────────────────────────────┘   │ │
│                        └──────────────────────────────────┘ │
│  ┌──────────────────┐                                       │
│  │ PROVENANCE       │  ← Links to parent document hash      │
│  │ - Parent hash    │    (documents form a hash chain)      │
│  │ - Ancestors      │                                       │
│  │ - Timestamps     │                                       │
│  └──────────────────┘                                       │
└─────────────────────────────────────────────────────────────┘

Key Feature: Verifiable Provenance

Codex documents form a cryptographic hash chain — each document's identity IS its content hash, and each document can reference its parent by hash:

doc-v1 (sha256:aaa)  ←──  doc-v2 (sha256:bbb)  ←──  doc-v3 (sha256:ccc)
      │                        │                        │
      └── parent: null         └── parent: aaa          └── parent: bbb

This enables:

  • Tamper-evident history: Forging an intermediate version is computationally infeasible
  • No external infrastructure: Documents themselves ARE the chain
  • Block-level Merkle proofs: Prove a specific section exists without revealing the whole document
  • Selective disclosure: Share redacted documents with cryptographic proof of what was removed
  • Timestamp anchoring: Anchor document hashes to RFC 3161 TSAs or public blockchains

Key Feature: State-Aware Presentation

Codex uses progressive enhancement for presentation — the level of layout precision evolves with document maturity:

Document State Presentation Requirement What's Frozen
DRAFT Reactive only (hints/styles) Nothing — content flows freely
REVIEW Reactive (precise optional) Nothing — still editing
FROZEN Reactive + precise required Content immutable; appearance locked
PUBLISHED Same as FROZEN Authoritative, immutable record

Why This Matters

When a document reaches FROZEN or PUBLISHED state, its precise layout (exact coordinates for every element) becomes part of the immutable record:

  • Semantic content is the hash: The document ID covers semantic content only — what it says, not how it looks
  • Appearance is locked alongside content: Precise layouts are immutable when frozen, but separate from the content hash. Use scoped signatures (Security Extension) for appearance attestation
  • Citations are reliable: "Page 7, line 23" means the same thing in every viewer
  • Legal/archival integrity: Frozen documents render pixel-perfectly, forever
  • No viewer inconsistency: Unlike PDF, whose appearance varies by renderer

This is the key insight: the state machine isn't just about workflow — it enforces the relationship between content stability and presentation precision. You can't freeze a document without committing to exactly how it looks.

DRAFT                    FROZEN/PUBLISHED
┌─────────────────┐      ┌─────────────────┐
│ Semantic content│      │ Semantic content│  ← Document ID
│ (JSON blocks)   │      │ (JSON blocks)   │    (content hash)
├─────────────────┤      ├─────────────────┤
│ Reactive hints  │  →   │ Reactive hints  │
│ (optional)      │      │ + Precise layout│ ← Required, immutable
└─────────────────┘      │ (exact coords)  │    (scoped signatures
                         └─────────────────┘     for attestation)

Specification Structure

The specification is modular:

Core Specification (Required)

Extension Specifications (Optional)

Quick Start

Document Structure

A Codex document is a ZIP archive with this structure:

document.cdx
├── manifest.json          # Document manifest (includes lineage, Merkle root)
├── content/
│   ├── document.json      # Semantic content blocks
│   └── block-index.json   # Block hashes for Merkle proofs
├── presentation/
│   ├── defaults.json      # Base styles
│   ├── paginated.json     # Print hints (reactive)
│   ├── continuous.json    # Screen hints (reactive)
│   └── layouts/           # Precise layouts (required for FROZEN/PUBLISHED)
│       ├── letter.json    # US Letter format coordinates
│       └── a4.json        # A4 format coordinates
├── assets/
│   ├── images/
│   ├── fonts/
│   └── embeds/
├── security/
│   └── signatures.json    # Digital signatures
├── provenance/
│   └── record.json        # Lineage, timestamps, derivations
└── metadata/
    └── dublin-core.json   # Standard metadata

Minimal Example

{
  "codex": "0.1",
  "id": "sha256:a1b2c3...",
  "state": "draft",
  "content": {
    "blocks": [
      {
        "type": "heading",
        "level": 1,
        "children": [{ "type": "text", "value": "Hello, World" }]
      },
      {
        "type": "paragraph",
        "children": [
          { "type": "text", "value": "This is a " },
          { "type": "text", "value": "Codex", "marks": ["bold"] },
          { "type": "text", "value": " document." }
        ]
      }
    ]
  }
}

File Extension and MIME Type

  • Extension: .cdx
  • MIME Type: application/vnd.codex+json (canonical JSON form)
  • Alternative: application/vnd.codex (binary/packed form)

Roadmap

Phase 1: Core Specification

  • Complete core specification documents
  • JSON Schema for validation
  • Example documents

Phase 2: Extensions

  • Security extension (signatures, encryption)
  • Collaboration extension (CRDT integration)
  • Presentation extension (advanced layout)
  • Forms extension (input fields, validation)
  • Semantic extension (JSON-LD, citations)

Phase 3: Reference Implementation (Current)

  • TypeScript parser/writer library
  • Web-based viewer
  • Basic editor integration

Phase 4: Ecosystem

  • Conversion tools (PDF, DOCX import/export)
  • CLI tools (validate, sign, verify)
  • Standards body submission

Contributing

See CONTRIBUTING.md for guidelines.

License

This specification and all associated code are licensed under the Apache License 2.0. See LICENSE for details.

Acknowledgments

This specification draws inspiration from:

  • PDF (ISO 32000) for document packaging concepts
  • Portable Text (Sanity) for semantic content modeling
  • EPUB for reflowable document structure
  • CRDTs (Yjs, Automerge) for collaboration patterns
  • Git for content-addressable storage and hash chain lineage
  • Merkle trees for efficient integrity proofs
  • OpenTimestamps for decentralized timestamp anchoring

About

Codex Document Format — an open specification for documents that unify viewing and editing, with modern security and machine readability

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

  •  

Packages

No packages published