
RFC: Scaling Beyond the 50MB Monolithic CRDT Limit #3

@kavinsood

Description

YAOS currently uses a monolithic vault model: one vault maps to one shared Y.Doc containing markdown content, metadata, folder
structure, blob references, and tombstones. This is a deliberate V1 choice, not an accident.

That architecture gives YAOS its strongest product properties:

  • atomic folder renames
  • cross-file structural consistency
  • simple sync semantics
  • a single shared collaboration surface
  • straightforward snapshotting and recovery

It also creates a real ceiling. The current docs describe roughly 40-50 MB of raw markdown text as a comfortable target for the
monolith. That is not a hard crash line, but it is the point where CPU cost, memory pressure, and mobile startup behavior become worth
treating as first-class design concerns.

This RFC does not propose re-architecting YAOS immediately.
It defines:

  • what problem we are actually solving
  • what the current architecture already guarantees
  • which scaling paths are real
  • which ones are mostly illusions
  • and what the pragmatic roadmap should be if the monolith stops being the right default

Why This RFC Exists

monolith.md already explains why YAOS chose the monolith.
That document is the rationale for the current design.

This RFC is different.

Its purpose is to answer the next question:

If YAOS succeeds and users push beyond the comfortable monolithic ceiling, what is the least-wrong path forward that preserves YAOS’s
core moat of atomic structural integrity?

This is a future-scaling decision framework, not a restatement of current architecture.

Motivation

There are three separate pressures here.

1. Server-side memory and compute ceilings

The current checkpoint+journal storage engine solved write amplification. It did not remove the cost of holding and operating on one
large in-memory Y.Doc.

Large vaults still pay for:

  • Y.encodeStateAsUpdate(...)
  • Y.encodeStateVector(...)
  • merge/apply work during load and reconnect
  • larger cold-start replay and checkpoint operations

2. Client-side startup and mobile cost

Even if transport and storage are efficient, a large monolithic update still has to be:

  • downloaded
  • parsed
  • applied
  • indexed into editor/runtime state

Mobile devices will feel this first.

3. Competitive ceiling

YAOS is intentionally optimized for normal human note vaults, not 20 GB archival datasets. That is fine. But if YAOS becomes the
default recommendation for serious PKM users, the “what happens when my vault gets huge?” question stops being theoretical and becomes
a product-boundary question.

Current Architecture

Today YAOS uses:

  • one shared Y.Doc per vault
  • markdown text as fileId -> Y.Text
  • metadata as CRDT maps
  • intentional markdown/blob tombstones to block resurrection
  • chunked checkpoint+journal persistence on the server
  • schema-version gating and local reset paths on the client

Important current properties:

  • folder renames are batched into a single transaction
  • file IDs stay stable across rename waves
  • deleted paths remain tombstoned to prevent stale offline resurrection
  • local cache reset already exists
  • full “nuclear reset” already exists

That means YAOS already has two important ingredients for a future scaling plan:

  • a tested structural-consistency baseline
  • an existing UX precedent for “your local state is too stale, throw it away and resync”

Problem Statement

The scaling problem is often described too vaguely. In practice, there are three different kinds of “history” involved:

A. Storage history

This is the checkpoint+journal layer on the server.

YAOS already compacts this. That solved the old “rewrite the whole doc on every save” failure mode.

B. Yjs causal/history state

This is the in-memory CRDT state that grows with long-lived editing churn.

This is the real monolith ceiling.

C. YAOS application-level tombstones

These are explicit file/blob deletion markers stored in the CRDT to prevent resurrection from stale devices.

These are intentional correctness data. They are not the same thing as generic Yjs causal history, and they cannot be casually
vacuumed away.

Any RFC that talks about “garbage collection” must keep these three layers separate.
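To make layer C concrete, here is a minimal sketch (hypothetical names, not YAOS's actual data model) of why application-level tombstones are correctness data rather than garbage: any upsert arriving from a stale device must be checked against them before it can touch vault state.

```typescript
// Toy model of application-level tombstones as part of shared vault state.
// A stale device pushing a create/update for a tombstoned fileId must be
// ignored, regardless of how Yjs-level history is later compacted.
type FileId = string;

interface VaultState {
  files: Map<FileId, string>; // fileId -> markdown text
  tombstones: Set<FileId>;    // explicit deletion markers (correctness data)
}

function applyRemoteUpsert(vault: VaultState, id: FileId, text: string): boolean {
  if (vault.tombstones.has(id)) {
    // Deleted paths stay dead: a stale offline device cannot resurrect them.
    return false;
  }
  vault.files.set(id, text);
  return true;
}

function deleteFile(vault: VaultState, id: FileId): void {
  vault.files.delete(id);
  vault.tombstones.add(id); // must survive compaction of storage history
}
```

This is exactly why tombstones cannot be "vacuumed" casually: dropping the entry in `tombstones` silently re-enables resurrection.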

Critical Observation

A naive “vacuum” is not actually a vacuum.

It is tempting to think this works:

  1. call Y.encodeStateAsUpdate(doc)
  2. apply that update to a fresh Y.Doc
  3. replace the old doc

That does not meaningfully reset causal history by itself.

In a local synthetic Yjs experiment, re-encoding and re-applying the same update preserved essentially the same encoded size, while
rebuilding a brand new doc from the materialized final text shrank it dramatically. In other words:

  • checkpoint rewrite is storage compaction
  • snapshotting is backup/recovery
  • neither automatically implies CRDT-history vacuuming

If YAOS wants an actual epoch reset, it must rebuild semantic state into a causally new document, not just replay the old update into
a fresh shell.
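The difference can be illustrated with a deliberately simplified model. This is a toy op log standing in for CRDT history, not real Yjs semantics, but it shows the same shape of result as the synthetic experiment above: replay carries all history, rebuild does not.

```typescript
// Toy model (NOT real Yjs): a doc's causal history as an append-only op log.
// "Replaying" the encoded update copies every historical op, deletions included;
// "rebuilding" creates a causally new doc from the materialized final text only.
type Op = { kind: "insert" | "delete"; char?: string };

function materialize(ops: Op[]): string {
  // Extremely simplified: each delete cancels the most recent insert.
  const chars: string[] = [];
  for (const op of ops) {
    if (op.kind === "insert") chars.push(op.char!);
    else chars.pop();
  }
  return chars.join("");
}

function replayIntoFreshDoc(ops: Op[]): Op[] {
  return [...ops]; // same encoded size: the history travels with the update
}

function rebuildFromFinalText(ops: Op[]): Op[] {
  // New causal era: one insert per surviving character, no deletion history.
  return materialize(ops)
    .split("")
    .map((c) => ({ kind: "insert" as const, char: c }));
}
```

In real Yjs terms, `replayIntoFreshDoc` corresponds to `Y.applyUpdate(freshDoc, Y.encodeStateAsUpdate(oldDoc))`, and `rebuildFromFinalText` corresponds to reading the materialized text and inserting it into a brand-new `Y.Doc`.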

Non-Goals

This RFC does not propose:

  • replacing the monolith in V1
  • sacrificing atomic folder renames by default
  • claiming “infinite scale” for YAOS today
  • solving distributed sharding across multiple servers
  • treating LiveSync’s architecture as the target to copy
  • adding a high-complexity multiplexed subdocument architecture immediately

Design Principles

Any future scaling path must be judged against these rules:

  • Preserve atomic structural operations by default.
  • Never casually reintroduce deleted files from stale clients.
  • Prefer explicit operator-controlled escape hatches over magical background GC.
  • Use measured thresholds, not fear-driven refactors.
  • Separate “support larger vaults” from “support arbitrarily large vaults.”
  • Do not trade correctness for scale unless the tradeoff is explicit and user-visible.

Approaches Considered

1. Stay Monolithic, Add Instrumentation and Thresholds

This is the immediate path and should happen first regardless of everything else.

Add measurement for:

  • encoded CRDT bytes
  • checkpoint bytes
  • journal bytes
  • active markdown path count
  • tombstoned path count
  • startup sync duration
  • cold-start replay mode and journal size
  • client-side local load / provider sync timing

Also add user-facing thresholds:

  • healthy
  • warning
  • danger zone

This does not solve the ceiling, but it turns “50MB-ish” into an actual operational signal instead of a vibe.
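A minimal sketch of that signal, with illustrative cutoffs only (the actual metric and thresholds are an open question in this RFC):

```typescript
// Hypothetical health classifier. The byte thresholds below are assumptions:
// DANGER_BYTES echoes the "50MB-ish" comfort ceiling discussed above, and
// WARNING_BYTES is an arbitrary earlier checkpoint.
type VaultHealth = "healthy" | "warning" | "danger";

const MB = 1024 * 1024;
const WARNING_BYTES = 30 * MB; // assumed
const DANGER_BYTES = 50 * MB;  // assumed, matches the documented comfort target

function classifyVaultHealth(encodedCrdtBytes: number): VaultHealth {
  if (encodedCrdtBytes >= DANGER_BYTES) return "danger";
  if (encodedCrdtBytes >= WARNING_BYTES) return "warning";
  return "healthy";
}
```

A composite score over several metrics (startup latency, tombstone count, etc.) could replace the single-byte input later without changing the three-state output.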

Verdict: Mandatory first step.

2. Epoch-Fenced Rebuild

This is the most pragmatic short-term escape hatch.

Mechanically, this would mean:

  • materialize the current semantic vault state
  • rebuild it into a causally new Y.Doc
  • increment a syncEpoch
  • reject clients from older epochs
  • force stale clients to clear local cache and rehydrate from the new epoch

This is important:

  • the epoch cutover must be explicit
  • stale clients must not be allowed to delta-merge old causal history into the new epoch
  • the server and client protocol would need to carry epoch information, not just schema version

What this buys us:

  • preserves monolithic semantics within an epoch
  • gives an operational reset button when a vault becomes unhealthy
  • avoids a full architectural rewrite as the first response

What it costs:

  • cold or long-idle devices will require a reset: the client-side plugin detects the epoch mismatch, automatically backs up local
    unsynced changes to a recovery folder, and pulls the new epoch state
  • “automatic safe GC when all devices are past X” is not something YAOS can currently prove, because there is no durable per-device
    acknowledged high-water-mark registry today
  • operator tooling and UX will need to explain epoch cutovers clearly

The crucial nuance is that this is not “garbage collect in place.”
It is “rebuild and start a new causal era.”

Verdict: Best short-term scaling escape hatch.

3. Two-Tier Hybrid Model: Graph Doc + Leaf Docs

This is the strongest long-term scalable architecture currently on the table.

Structure:

  • a small monolithic Graph Doc contains file IDs, paths, metadata, structural state, blob refs, and deletion markers
  • each markdown file’s text lives in its own Leaf Doc
  • the client loads and subscribes only to the leaf docs it needs
  • the graph remains always-on and small

What this preserves:

  • atomic folder/path structure inside the graph
  • stable file IDs
  • scalable text capacity by paging content docs in and out

What this gives up:

  • truly atomic multi-note content mutations across leaf docs
  • simple single-doc mental model
  • trivial server routing

What it requires:

  • multiplexed room/subscription transport
  • lazy load/unload policies
  • LRU or similar eviction for inactive leaf docs
  • consistency handling between graph updates and content updates
  • explicit handling for intermediate “tearing” states during partial sync
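The eviction requirement can be sketched with a small LRU cache for leaf docs. Names are illustrative; a real implementation would also flush pending updates before eviction and coordinate with the always-loaded Graph Doc:

```typescript
// Minimal LRU sketch for paging leaf (per-file content) docs in and out of
// memory. Relies on the fact that a JS Map iterates in insertion order, so the
// first key is always the least recently used entry.
class LeafDocCache<T> {
  private docs = new Map<string, T>();

  constructor(private readonly capacity: number) {}

  get(fileId: string): T | undefined {
    const doc = this.docs.get(fileId);
    if (doc !== undefined) {
      // Refresh recency by reinserting at the end of the Map's order.
      this.docs.delete(fileId);
      this.docs.set(fileId, doc);
    }
    return doc;
  }

  load(fileId: string, doc: T): void {
    this.docs.delete(fileId);
    this.docs.set(fileId, doc);
    if (this.docs.size > this.capacity) {
      // Evict the least recently used leaf doc.
      const oldest = this.docs.keys().next().value as string;
      this.docs.delete(oldest);
    }
  }

  has(fileId: string): boolean {
    return this.docs.has(fileId);
  }
}
```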

This is probably the correct V2 architecture if YAOS ever needs to scale beyond the monolith while preserving its structural moat.

Verdict: Best long-term research direction.

4. Pure Per-File Sharding / Subdocuments Everywhere

This is the most obvious answer and the most dangerous to hand-wave.

Yes, it scales text capacity.
No, it does not preserve YAOS’s strongest guarantee.

Problems:

  • folder rename becomes multi-doc orchestration
  • cross-file mutations tear
  • partial sync exposes semantically invalid intermediate states
  • WebSocket/provider complexity rises sharply
  • native Yjs subdocuments do not magically solve transactionality

This may still be viable for a future “scale mode,” but it should not be the default path, and it should not be described as
equivalent to the current architecture.

Verdict: Not recommended as the first scale-up move.

Comparative View

| Approach | Preserves atomic structure | Extends ceiling | Complexity | Recommended |
|---|---|---|---|---|
| Instrumentation + thresholds | Yes | No | Low | Yes, now |
| Epoch-fenced rebuild | Yes, within epoch | Medium | Medium | Yes, next |
| Graph + leaf docs | Mostly (graph-level) | High | High | Yes, research |
| Pure per-file/subdocs | No, not fully | High | High | Not first |

Recommendation

The recommendation is:

1. Do not replace the monolith now

The current monolith is coherent, tested, and still the correct default for YAOS.

2. Add observability first

Before any refactor, teach YAOS to measure and report monolith health.

3. Build an epoch-fenced rebuild path as the first real escape hatch

This should be the first scaling feature that actually changes behavior.

Not because it is glamorous, but because it preserves YAOS’s strongest product property: atomic structural integrity.

4. Treat the Graph + Leaf model as V2 research

That is the serious long-term architecture if YAOS needs to serve much larger text vaults without giving up its identity.

Proposed Roadmap

Phase 1: Instrumentation

Add server and client metrics for:

  • encoded document size
  • checkpoint size
  • journal size
  • replay mode on load
  • active markdown paths
  • tombstoned markdown paths
  • startup sync duration
  • local IndexedDB load time
  • provider sync time

Add a debug/diagnostics surface that makes vault health visible.

Phase 2: Danger-Zone UX

Add warning thresholds and user-facing messaging for large vaults.

Possible actions:

  • show current vault text footprint
  • warn when vault is entering a monolithic danger zone
  • suggest snapshots before risky operations
  • explain that old idle devices may need a reset after future maintenance

Phase 3: Epoch-Fenced Rebuild

Implement:

  • syncEpoch
  • epoch-aware handshake
  • stale-epoch rejection path
  • “reset local cache and rejoin” UX
  • operator-triggered rebuild flow
  • clear safety docs around what happens to stale devices

This should initially be manual and explicit.

Phase 4: Hybrid Research Track

Explore:

  • graph document scope
  • leaf document storage layout
  • multiplexed transport design
  • eviction policy for inactive content docs
  • consistency model for graph/content races
  • whether this is opt-in per vault or a future default for large vaults

Acceptance Criteria

This RFC should be considered meaningfully implemented when:

  • YAOS can measure monolith health instead of guessing
  • users get explicit visibility before a vault becomes unhealthy
  • operators have a safe rebuild path that does not silently corrupt semantics
  • stale clients cannot resurrect pre-epoch history into a rebuilt vault
  • current rename atomicity remains intact for the default path
  • the codebase has a documented research direction for post-monolith scaling

Open Questions

  1. What exact metric should define the danger zone:
    encoded CRDT bytes, live markdown bytes, startup latency, or some composite score?
  2. Should epoch rebuild be:
    manual only, suggested, or eventually automatic under strict thresholds?
  3. In a hybrid model, what belongs in the Graph Doc:
    just path/file ID mappings, or also metadata, blob refs, and tombstones?
  4. If YAOS ever introduces a “scale mode,” should that be:
    automatic, per-vault, or explicit at vault creation time?
