
RFC: Scaling Beyond the 50MB Monolithic CRDT Limit #3

@kavinsood

Description

YAOS currently uses a monolithic vault model: one vault maps to one shared Y.Doc containing markdown content, metadata, folder
structure, blob references, and tombstones. This is a deliberate V1 choice, not an accident.

That architecture gives YAOS its strongest product properties:

  • atomic folder renames
  • cross-file structural consistency
  • simple sync semantics
  • a single shared collaboration surface
  • straightforward snapshotting and recovery

It also creates a real ceiling. The current docs describe roughly 40-50 MB of raw markdown text as a comfortable target for the
monolith. That is not a hard crash line, but it is the point where CPU cost, memory pressure, and mobile startup behavior become worth
treating as first-class design concerns.

This RFC does not propose re-architecting YAOS immediately.
It defines:

  • what problem we are actually solving
  • what the current architecture already guarantees
  • which scaling paths are real
  • which ones are mostly illusions
  • and what the pragmatic roadmap should be if the monolith stops being the right default

Why This RFC Exists

monolith.md already explains why YAOS chose the monolith.
That document is the rationale for the current design.

This RFC is different.

Its purpose is to answer the next question:

If YAOS succeeds and users push beyond the comfortable monolithic ceiling, what is the least-wrong path forward that preserves YAOS’s
core moat of atomic structural integrity?

This is a future-scaling decision framework, not a restatement of current architecture.

Motivation

There are three separate pressures here.

1. Server-side memory and compute ceilings

The current checkpoint+journal storage engine solved write amplification. It did not remove the cost of holding and operating on one
large in-memory Y.Doc.

Large vaults still pay for:

  • Y.encodeStateAsUpdate(...)
  • Y.encodeStateVector(...)
  • merge/apply work during load and reconnect
  • larger cold-start replay and checkpoint operations

2. Client-side startup and mobile cost

Even if transport and storage are efficient, a large monolithic update still has to be:

  • downloaded
  • parsed
  • applied
  • indexed into editor/runtime state

Mobile devices will feel this first.

3. Competitive ceiling

YAOS is intentionally optimized for normal human note vaults, not 20 GB archival datasets. That is fine. But if YAOS becomes the
default recommendation for serious PKM users, the “what happens when my vault gets huge?” question stops being theoretical and becomes
a product-boundary question.

Current Architecture

Today YAOS uses:

  • one shared Y.Doc per vault
  • markdown text as fileId -> Y.Text
  • metadata as CRDT maps
  • intentional markdown/blob tombstones to block resurrection
  • chunked checkpoint+journal persistence on the server
  • schema-version gating and local reset paths on the client

Important current properties:

  • folder renames are batched into a single transaction
  • file IDs stay stable across rename waves
  • deleted paths remain tombstoned to prevent stale offline resurrection
  • local cache reset already exists
  • full “nuclear reset” already exists

That means YAOS already has two important ingredients for a future scaling plan:

  • a tested structural-consistency baseline
  • an existing UX precedent for “your local state is too stale, throw it away and resync”

Problem Statement

The scaling problem is often described too vaguely. In practice, there are three different kinds of “history” involved:

A. Storage history

This is the checkpoint+journal layer on the server.

YAOS already compacts this. That solved the old “rewrite the whole doc on every save” failure mode.

B. Yjs causal/history state

This is the in-memory CRDT state that grows with long-lived editing churn.

This is the real monolith ceiling.

C. YAOS application-level tombstones

These are explicit file/blob deletion markers stored in the CRDT to prevent resurrection from stale devices.

These are intentional correctness data. They are not the same thing as generic Yjs causal history, and they cannot be casually
vacuumed away.

Any RFC that talks about “garbage collection” must keep these three layers separate.
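To make layer C concrete, here is a minimal sketch (hypothetical names, not YAOS's actual data model) of why application-level tombstones are correctness data rather than garbage: any upsert arriving from a stale device must be checked against them before it can touch vault state.

```typescript
// Toy model of application-level tombstones as part of shared vault state.
// A stale device pushing a create/update for a tombstoned fileId must be
// ignored, regardless of how Yjs-level history is later compacted.
type FileId = string;

interface VaultState {
  files: Map<FileId, string>; // fileId -> markdown text
  tombstones: Set<FileId>;    // explicit deletion markers (correctness data)
}

function applyRemoteUpsert(vault: VaultState, id: FileId, text: string): boolean {
  if (vault.tombstones.has(id)) {
    // Deleted paths stay dead: a stale offline device cannot resurrect them.
    return false;
  }
  vault.files.set(id, text);
  return true;
}

function deleteFile(vault: VaultState, id: FileId): void {
  vault.files.delete(id);
  vault.tombstones.add(id); // must survive compaction of storage history
}
```

This is exactly why tombstones cannot be "vacuumed" casually: dropping the entry in `tombstones` silently re-enables resurrection.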

Critical Observation

A naive “vacuum” is not actually a vacuum.

It is tempting to think this works:

  1. call Y.encodeStateAsUpdate(doc)
  2. apply that update to a fresh Y.Doc
  3. replace the old doc

That does not meaningfully reset causal history by itself.

In a local synthetic Yjs experiment, re-encoding and re-applying the same update preserved essentially the same encoded size, while
rebuilding a brand new doc from the materialized final text shrank it dramatically. In other words:

  • checkpoint rewrite is storage compaction
  • snapshotting is backup/recovery
  • neither automatically implies CRDT-history vacuuming

If YAOS wants an actual epoch reset, it must rebuild semantic state into a causally new document, not just replay the old update into
a fresh shell.
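The difference can be illustrated with a deliberately simplified model. This is a toy op log standing in for CRDT history, not real Yjs semantics, but it shows the same shape of result as the synthetic experiment above: replay carries all history, rebuild does not.

```typescript
// Toy model (NOT real Yjs): a doc's causal history as an append-only op log.
// "Replaying" the encoded update copies every historical op, deletions included;
// "rebuilding" creates a causally new doc from the materialized final text only.
type Op = { kind: "insert" | "delete"; char?: string };

function materialize(ops: Op[]): string {
  // Extremely simplified: each delete cancels the most recent insert.
  const chars: string[] = [];
  for (const op of ops) {
    if (op.kind === "insert") chars.push(op.char!);
    else chars.pop();
  }
  return chars.join("");
}

function replayIntoFreshDoc(ops: Op[]): Op[] {
  return [...ops]; // same encoded size: the history travels with the update
}

function rebuildFromFinalText(ops: Op[]): Op[] {
  // New causal era: one insert per surviving character, no deletion history.
  return materialize(ops)
    .split("")
    .map((c) => ({ kind: "insert" as const, char: c }));
}
```

In real Yjs terms, `replayIntoFreshDoc` corresponds to `Y.applyUpdate(freshDoc, Y.encodeStateAsUpdate(oldDoc))`, and `rebuildFromFinalText` corresponds to reading the materialized text and inserting it into a brand-new `Y.Doc`.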

Non-Goals

This RFC does not propose:

  • replacing the monolith in V1
  • sacrificing atomic folder renames by default
  • claiming “infinite scale” for YAOS today
  • solving distributed sharding across multiple servers
  • treating LiveSync’s architecture as the target to copy
  • adding a high-complexity multiplexed subdocument architecture immediately

Design Principles

Any future scaling path must be judged against these rules:

  • Preserve atomic structural operations by default.
  • Never casually reintroduce deleted files from stale clients.
  • Prefer explicit operator-controlled escape hatches over magical background GC.
  • Use measured thresholds, not fear-driven refactors.
  • Separate “support larger vaults” from “support arbitrarily large vaults.”
  • Do not trade correctness for scale unless the tradeoff is explicit and user-visible.

Approaches Considered

1. Stay Monolithic, Add Instrumentation and Thresholds

This is the immediate path and should happen first regardless of everything else.

Add measurement for:

  • encoded CRDT bytes
  • checkpoint bytes
  • journal bytes
  • active markdown path count
  • tombstoned path count
  • startup sync duration
  • cold-start replay mode and journal size
  • client-side local load / provider sync timing

Also add user-facing thresholds:

  • healthy
  • warning
  • danger zone

This does not solve the ceiling, but it turns “50MB-ish” into an actual operational signal instead of a vibe.
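A minimal sketch of that signal, with illustrative cutoffs only (the actual metric and thresholds are an open question in this RFC):

```typescript
// Hypothetical health classifier. The byte thresholds below are assumptions:
// DANGER_BYTES echoes the "50MB-ish" comfort ceiling discussed above, and
// WARNING_BYTES is an arbitrary earlier checkpoint.
type VaultHealth = "healthy" | "warning" | "danger";

const MB = 1024 * 1024;
const WARNING_BYTES = 30 * MB; // assumed
const DANGER_BYTES = 50 * MB;  // assumed, matches the documented comfort target

function classifyVaultHealth(encodedCrdtBytes: number): VaultHealth {
  if (encodedCrdtBytes >= DANGER_BYTES) return "danger";
  if (encodedCrdtBytes >= WARNING_BYTES) return "warning";
  return "healthy";
}
```

A composite score over several metrics (startup latency, tombstone count, etc.) could replace the single-byte input later without changing the three-state output.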

Verdict: Mandatory first step.

2. Epoch-Fenced Rebuild

This is the most pragmatic short-term escape hatch.

Mechanically, this would mean:

  • materialize the current semantic vault state
  • rebuild it into a causally new Y.Doc
  • increment a syncEpoch
  • reject clients from older epochs
  • force stale clients to clear local cache and rehydrate from the new epoch

This is important:

  • the epoch cutover must be explicit
  • stale clients must not be allowed to delta-merge old causal history into the new epoch
  • the server and client protocol would need to carry epoch information, not just schema version

What this buys us:

  • preserves monolithic semantics within an epoch
  • gives an operational reset button when a vault becomes unhealthy
  • avoids a full architectural rewrite as the first response

What it costs:

  • cold or long-idle devices will require a reset: the client-side plugin detects the epoch mismatch, automatically backs up local
    unsynced changes to a recovery folder, and pulls the new epoch state
  • “automatic safe GC when all devices are past X” is not something YAOS can currently prove, because there is no durable per-device
    acknowledged high-water-mark registry today
  • operator tooling and UX will need to explain epoch cutovers clearly

The crucial nuance is that this is not “garbage collect in place.”
It is “rebuild and start a new causal era.”

Verdict: Best short-term scaling escape hatch.

3. Two-Tier Hybrid Model: Graph Doc + Leaf Docs

This is the strongest long-term scalable architecture currently on the table.

Structure:

  • a small monolithic Graph Doc contains file IDs, paths, metadata, structural state, blob refs, and deletion markers
  • each markdown file’s text lives in its own Leaf Doc
  • the client loads and subscribes only to the leaf docs it needs
  • the graph remains always-on and small

What this preserves:

  • atomic folder/path structure inside the graph
  • stable file IDs
  • scalable text capacity by paging content docs in and out

What this gives up:

  • truly atomic multi-note content mutations across leaf docs
  • simple single-doc mental model
  • trivial server routing

What it requires:

  • multiplexed room/subscription transport
  • lazy load/unload policies
  • LRU or similar eviction for inactive leaf docs
  • consistency handling between graph updates and content updates
  • explicit handling for intermediate “tearing” states during partial sync
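The eviction requirement can be sketched with a small LRU cache for leaf docs. Names are illustrative; a real implementation would also flush pending updates before eviction and coordinate with the always-loaded Graph Doc:

```typescript
// Minimal LRU sketch for paging leaf (per-file content) docs in and out of
// memory. Relies on the fact that a JS Map iterates in insertion order, so the
// first key is always the least recently used entry.
class LeafDocCache<T> {
  private docs = new Map<string, T>();

  constructor(private readonly capacity: number) {}

  get(fileId: string): T | undefined {
    const doc = this.docs.get(fileId);
    if (doc !== undefined) {
      // Refresh recency by reinserting at the end of the Map's order.
      this.docs.delete(fileId);
      this.docs.set(fileId, doc);
    }
    return doc;
  }

  load(fileId: string, doc: T): void {
    this.docs.delete(fileId);
    this.docs.set(fileId, doc);
    if (this.docs.size > this.capacity) {
      // Evict the least recently used leaf doc.
      const oldest = this.docs.keys().next().value as string;
      this.docs.delete(oldest);
    }
  }

  has(fileId: string): boolean {
    return this.docs.has(fileId);
  }
}
```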

This is probably the correct V2 architecture if YAOS ever needs to scale beyond the monolith while preserving its structural moat.

Verdict: Best long-term research direction.

4. Pure Per-File Sharding / Subdocuments Everywhere

This is the most obvious answer and the most dangerous to hand-wave.

Yes, it scales text capacity.
No, it does not preserve YAOS’s strongest guarantee.

Problems:

  • folder rename becomes multi-doc orchestration
  • cross-file mutations tear
  • partial sync exposes semantically invalid intermediate states
  • WebSocket/provider complexity rises sharply
  • native Yjs subdocuments do not magically solve transactionality

This may still be viable for a future “scale mode,” but it should not be the default path, and it should not be described as
equivalent to the current architecture.

Verdict: Not recommended as the first scale-up move.

Comparative View

| Approach | Preserves atomic structure | Extends ceiling | Complexity | Recommended |
|---|---|---|---|---|
| Instrumentation + thresholds | Yes | No | Low | Yes, now |
| Epoch-fenced rebuild | Yes, within epoch | Medium | Medium | Yes, next |
| Graph + leaf docs | Mostly (graph-level) | High | High | Yes, research |
| Pure per-file/subdocs | No, not fully | High | High | Not first |

Recommendation

The recommendation is:

1. Do not replace the monolith now

The current monolith is coherent, tested, and still the correct default for YAOS.

2. Add observability first

Before any refactor, teach YAOS to measure and report monolith health.

3. Build an epoch-fenced rebuild path as the first real escape hatch

This should be the first scaling feature that actually changes behavior.

Not because it is glamorous, but because it preserves YAOS’s strongest product property: atomic structural integrity.

4. Treat the Graph + Leaf model as V2 research

That is the serious long-term architecture if YAOS needs to serve much larger text vaults without giving up its identity.

Proposed Roadmap

Phase 1: Instrumentation

Add server and client metrics for:

  • encoded document size
  • checkpoint size
  • journal size
  • replay mode on load
  • active markdown paths
  • tombstoned markdown paths
  • startup sync duration
  • local IndexedDB load time
  • provider sync time

Add a debug/diagnostics surface that makes vault health visible.

Phase 2: Danger-Zone UX

Add warning thresholds and user-facing messaging for large vaults.

Possible actions:

  • show current vault text footprint
  • warn when vault is entering a monolithic danger zone
  • suggest snapshots before risky operations
  • explain that old idle devices may need a reset after future maintenance

Phase 3: Epoch-Fenced Rebuild

Implement:

  • syncEpoch
  • epoch-aware handshake
  • stale-epoch rejection path
  • “reset local cache and rejoin” UX
  • operator-triggered rebuild flow
  • clear safety docs around what happens to stale devices

This should initially be manual and explicit.

Phase 4: Hybrid Research Track

Explore:

  • graph document scope
  • leaf document storage layout
  • multiplexed transport design
  • eviction policy for inactive content docs
  • consistency model for graph/content races
  • whether this is opt-in per vault or a future default for large vaults

Acceptance Criteria

This RFC should be considered meaningfully implemented when:

  • YAOS can measure monolith health instead of guessing
  • users get explicit visibility before a vault becomes unhealthy
  • operators have a safe rebuild path that does not silently corrupt semantics
  • stale clients cannot resurrect pre-epoch history into a rebuilt vault
  • current rename atomicity remains intact for the default path
  • the codebase has a documented research direction for post-monolith scaling

Open Questions

  1. What exact metric should define the danger zone:
    encoded CRDT bytes, live markdown bytes, startup latency, or some composite score?
  2. Should epoch rebuild be:
    manual only, suggested, or eventually automatic under strict thresholds?
  3. In a hybrid model, what belongs in the Graph Doc:
    just path/file ID mappings, or also metadata, blob refs, and tombstones?
  4. If YAOS ever introduces a “scale mode,” should that be:
    automatic, per-vault, or explicit at vault creation time?
