
feat: multi-tenant gRPC storage with embedded Rust in NativeAOT #48

Merged
valdo404 merged 85 commits into main from
feat/sse-grpc-multi-tenant-20
Feb 19, 2026

Conversation

@valdo404
Owner

@valdo404 valdo404 commented Feb 5, 2026

Summary

Complete multi-tenant architecture with gRPC storage, SSE proxy with PAT auth, Google Drive integration, and cloud source support — replacing direct filesystem access with a proper storage abstraction layer.

Architecture

  • 3 gRPC services (proto/storage.proto): StorageService, SourceSyncService, ExternalWatchService
  • Dual-server storage split: IHistoryStorage (sessions/WAL/index/checkpoints — can be remote R2) + ISyncStorage (file sync/watch — always local embedded or GDrive)
  • Caller-orchestrated auto-save: Tool classes call sync.MaybeAutoSave() after sessions.AppendWal(). SessionManager and SyncManager are fully independent.
  • Rust storage server (docx-storage-local): local filesystem backend with tenant-aware paths, file locking, SHA256-based change detection
  • Embedded mode (new): Rust staticlib linked directly into .NET NativeAOT binaries — single binary, no separate process needed. In-memory DuplexStream transport for gRPC.
  • Remote mode: standalone docx-storage-local binary over TCP/Unix socket for server deployments
  • Cloudflare backend (docx-storage-cloudflare): R2-only storage with ETag-based optimistic locking (CAS), no KV dependency
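
The ETag-based optimistic locking mentioned for the Cloudflare backend can be illustrated with a minimal in-memory sketch. Everything here is hypothetical (`ObjectStore`, `put_if_match`, `append_line`, and the numeric ETag counter are stand-ins, not the actual `docx-storage-cloudflare` code or the R2 API); it only shows the compare-and-swap pattern: read the object with its ETag, modify it, and write back only if the ETag is unchanged.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for an object store with conditional writes.
/// A real store returns opaque string ETags; a counter is used here for brevity.
#[derive(Default)]
pub struct ObjectStore {
    objects: HashMap<String, (u64, Vec<u8>)>, // key -> (etag, body)
    next_etag: u64,
}

#[derive(Debug, PartialEq)]
pub enum PutError {
    /// The stored ETag no longer matches the one the caller read.
    PreconditionFailed,
}

impl ObjectStore {
    /// Returns the current ETag and body, if the key exists.
    pub fn get(&self, key: &str) -> Option<(u64, Vec<u8>)> {
        self.objects.get(key).cloned()
    }

    /// Writes `body` only if the current ETag equals `expected`.
    /// `None` means "the key must not exist yet".
    pub fn put_if_match(
        &mut self,
        key: &str,
        expected: Option<u64>,
        body: Vec<u8>,
    ) -> Result<u64, PutError> {
        let current = self.objects.get(key).map(|(etag, _)| *etag);
        if current != expected {
            return Err(PutError::PreconditionFailed);
        }
        self.next_etag += 1;
        self.objects.insert(key.to_string(), (self.next_etag, body));
        Ok(self.next_etag)
    }
}

/// Read-modify-write loop: re-read and retry until the conditional write succeeds.
pub fn append_line(store: &mut ObjectStore, key: &str, line: &str) -> u64 {
    loop {
        let (etag, mut body) = match store.get(key) {
            Some((etag, body)) => (Some(etag), body),
            None => (None, Vec::new()),
        };
        body.extend_from_slice(line.as_bytes());
        body.push(b'\n');
        if let Ok(new_etag) = store.put_if_match(key, etag, body) {
            return new_etag;
        }
    }
}
```

A writer holding a stale ETag gets `PreconditionFailed` instead of silently overwriting a concurrent update, which is why no separate lock store is needed under this scheme.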

Key changes

  • IStorageClient split: monolithic IStorageClient/StorageClient → IHistoryStorage/HistoryStorageClient + ISyncStorage/SyncStorageClient
  • SessionManager refactored: depends only on IHistoryStorage (removed all sync/watch/tracker logic)
  • SyncManager created: depends only on ISyncStorage, handles RegisterAndWatch, MaybeAutoSave, Save, StopWatch
  • 13 mutation call sites updated: PatchTool, StyleTools (×3), CommentTools (×2), RevisionTools (×3), HistoryTools (×3), ElementTools pass-through
  • Embedded Rust FFI: lib.rs exposes C entry points (docx_storage_init, docx_pipe_read/write/flush, docx_storage_shutdown), .NET calls via P/Invoke (NativeStorage.cs, InMemoryPipeStream.cs)
  • External change flow: Rust detects file changes (SHA256), streams events to .NET, .NET creates WAL entries (requires Open XML SDK for DocumentSnapshot)
  • Rust register_source fix: creates SessionIndexEntry if absent (dual-server mode: AddSessionToIndex goes to remote, but RegisterSource goes to local)
  • Format compatibility: 8-byte .NET header on DOCX/checkpoints preserved
  • publish.sh: builds Rust staticlib → links into NativeAOT binaries + optional standalone binary
  • Docker: multi-stage build (Rust staticlib → .NET NativeAOT), single binary runtime with build cache
  • Infra: Pulumi IaC for Cloudflare resources, CI path filters, macOS/Windows builds restricted to workflow_dispatch

Phase H — SSE Proxy multi-tenant ✅

  • HTTP reverse proxy (Rust/Axum) with PAT auth via Cloudflare D1
  • Transparent SSE/JSON forwarding with X-Tenant-Id injection
  • SessionManagerPool for multi-tenant single-process MCP
  • Session recovery: proxy detects 404 (backend restart), transparently re-initializes MCP session and retries — per-tenant recovery lock prevents duplicate init under concurrent load
  • Docker Compose: proxy → mcp-http → storage (local) or proxy → mcp-http → storage + gdrive (cloud)
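
The session-recovery behaviour described above (detect a 404, re-initialize, retry, with a per-tenant lock so concurrent requests do not each re-initialize) can be sketched in synchronous Rust as follows. This is not the proxy's code: the real proxy is Axum-based and asynchronous, and `TenantSessions`, `forward`, and the `send`/`init` callbacks are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// One slot per tenant: the backend session id, guarded by its own lock.
type Slot = Arc<Mutex<Option<String>>>;

#[derive(Default)]
pub struct TenantSessions {
    slots: Mutex<HashMap<String, Slot>>,
}

impl TenantSessions {
    /// Returns the tenant's slot, creating an empty one on first use.
    fn slot(&self, tenant: &str) -> Slot {
        let mut slots = self.slots.lock().unwrap();
        slots.entry(tenant.to_string()).or_default().clone()
    }

    /// Forwards a request with the tenant's session id. On 404 (or when no
    /// session exists yet) it re-initializes under the tenant lock and retries once.
    /// `send` returns an HTTP status code; `init` returns a fresh session id.
    pub fn forward<S, I>(&self, tenant: &str, mut send: S, mut init: I) -> u16
    where
        S: FnMut(&str) -> u16,
        I: FnMut() -> String,
    {
        let slot = self.slot(tenant);
        let seen = slot.lock().unwrap().clone();
        if let Some(id) = &seen {
            let status = send(id);
            if status != 404 {
                return status;
            }
        }
        let id = {
            let mut current = slot.lock().unwrap();
            // Only re-initialize if nobody replaced the session while we waited.
            if *current == seen {
                *current = Some(init());
            }
            current.clone().expect("session id was set above or by another caller")
        };
        send(&id)
    }
}
```

The comparison against `seen` is what keeps the initialization single-shot: a second caller that observed the same stale id finds a different value once it acquires the lock and simply reuses it.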

Phase G — Google Drive gRPC server ✅

  • docx-storage-gdrive: SourceSyncService + ExternalWatchService for Google Drive
  • Per-tenant OAuth tokens from D1 (oauth_connection table), auto-refresh via TokenManager
  • URI format: gdrive://{connection_id}/{file_id}
  • File download, upload (update existing), and create new file on first sync
  • Website OAuth consent flow (connect, callback, connections API)
  • ConnectionsManager component in dashboard for browsing external storage
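
As an illustration of the `gdrive://{connection_id}/{file_id}` URI format listed above, here is a small parsing sketch. The names `GdriveUri`, `parse_gdrive_uri`, and `format_gdrive_uri` are hypothetical, and the validation rules (non-empty parts, no extra `/` in the file id) are assumptions rather than documented behaviour of `docx-storage-gdrive`.

```rust
/// Parsed form of a `gdrive://{connection_id}/{file_id}` source URI.
#[derive(Debug, PartialEq)]
pub struct GdriveUri {
    pub connection_id: String,
    pub file_id: String,
}

/// Splits the URI into its two parts; returns `None` for anything malformed.
pub fn parse_gdrive_uri(uri: &str) -> Option<GdriveUri> {
    let rest = uri.strip_prefix("gdrive://")?;
    let (connection_id, file_id) = rest.split_once('/')?;
    // Assumed rule: both parts are required and the file id has no path separators.
    if connection_id.is_empty() || file_id.is_empty() || file_id.contains('/') {
        return None;
    }
    Some(GdriveUri {
        connection_id: connection_id.to_string(),
        file_id: file_id.to_string(),
    })
}

/// Builds the URI from its parts (the inverse of `parse_gdrive_uri`).
pub fn format_gdrive_uri(connection_id: &str, file_id: &str) -> String {
    format!("gdrive://{}/{}", connection_id, file_id)
}
```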

Cloud Source Support ✅

  • DocumentTools.DocumentOpen supports cloud sources (Google Drive) via source_type + connection_id + file_id
  • ResolveSourceType() infers source type and blocks local sources in cloud mode
  • SyncManager.ReadSourceBytes() abstracts local vs cloud file reads
  • ExternalChangeGate + ExternalChangeTools work with cloud sources (not just local disk)
  • DocumentSave preserves existing source type on save-as
  • ExternalChangeGate state persisted in gRPC storage index (survives backend restart)

Website ✅

  • Astro 6 + Cloudflare Pages + Better Auth + D1
  • OAuth connections management (Google Drive)
  • Dashboard with ConnectionsManager for browsing cloud storage
  • Mobile responsive layout
  • i18n (FR/EN)

Pending

Phase D — Validation

  • Large document validation (>50MB) — streaming checkpoints/WAL

Phase W — WAL/Sessions viewer

  • HTTP server for visualizing sessions and WAL entries stored on R2

Phase K — Koyeb deployment

  • Deploy SSE proxy, .NET MCP server, Cloudflare storage on Koyeb

Infra

  • Cross-platform Docker builds (linux-x64, linux-arm64)

Relates to

Test plan

  • 428 .NET unit tests passing (embedded mode)
  • 428 .NET unit tests passing (dual-server mode with STORAGE_GRPC_URL → Cloudflare R2)
  • 0 skipped tests — all auto-save, external sync, and external change tracker tests run in both modes
  • 31 Rust unit tests passing
  • 63 MCP integration tests passing (via mcptools)
  • CLI lifecycle test (open, add, query, replace-text, undo, redo, history, export, save, close)
  • NativeAOT publish with embedded Rust (35MB single binary)
  • Cloudflare R2 storage backend (ETag CAS, no KV dependency)
  • Dual-server TestHelpers (IHistoryStorage→remote, ISyncStorage→local embedded)
  • Docker Compose full stack (storage + gdrive + mcp-http + proxy) — all 4 services healthy
  • Session recovery verified (proxy detects 404 on backend restart, re-initializes transparently)
  • Cross-platform Docker builds (linux-x64, linux-arm64)
  • Large document stress tests (>50MB)

🤖 Generated with Claude Code

Laurent Valdes and others added 27 commits February 5, 2026 20:13
Define the StorageService gRPC interface for multi-tenant document storage:
- Session lifecycle (load, save, delete, list, exists)
- Index operations for session metadata
- WAL operations with streaming support for large entries
- Checkpoint management
- Distributed lock operations with TTL

All operations include TenantContext for multi-tenant isolation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New docx-mcp-storage crate implementing multi-tenant storage:
- Cargo workspace setup with proto compilation (tonic-build)
- StorageBackend trait with LocalStorage implementation
- LockManager trait with FileLock implementation
- gRPC service supporting TCP and Unix socket transports
- Tenant-aware file organization: {base}/{tenant_id}/sessions/

Supports all storage operations: sessions, WAL, checkpoints, index, locks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major refactor of .NET components for multi-tenant architecture:

New DocxMcp.Grpc project:
- IStorageClient interface for storage abstraction
- StorageClient implementation with gRPC streaming
- GrpcLauncher for auto-launching local gRPC server
- TenantContextHelper with AsyncLocal for per-request tenant
- StorageClientOptions for configuration

SessionManager rewrite:
- All operations now tenant-aware via TenantContextHelper
- Delegates storage to IStorageClient (no local persistence)
- Removed direct file system access

Removed local storage code:
- Deleted SessionStore.cs (replaced by gRPC)
- Deleted MappedWal.cs (WAL managed by storage server)
- Deleted SessionLock.cs (locks managed by storage server)

CLI updates:
- Global --tenant flag support
- Auto-launch gRPC server via Unix socket

Test infrastructure:
- MockStorageClient for unit testing without gRPC
- Updated project references

Version bump to 1.6.0 across all projects.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New docx-mcp-proxy crate for remote MCP client access:
- Axum-based HTTP server with Streamable HTTP transport
- Configuration for D1 database credentials
- Environment-based configuration for PAT validation
- Placeholder for D1 PAT validation and tenant routing

The proxy validates Bearer tokens against Cloudflare D1 and
extracts tenant_id for multi-tenant request routing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Multi-stage Dockerfiles for production deployment:

Main Dockerfile (docx-mcp + storage):
- Rust builder stage for docx-mcp-storage
- .NET builder stage with NativeAOT
- Runtime stage with all binaries

docx-mcp-storage/Dockerfile:
- Standalone gRPC storage server
- TCP transport on port 50051
- Health check via grpc_health_probe

docx-mcp-proxy/Dockerfile:
- SSE/HTTP proxy server
- HTTP port 8080
- Health check via curl

All images use non-root users for security.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker Compose configuration for local development:

Services:
- storage: gRPC storage server (port 50051)
- mcp: MCP stdio server (interactive)
- cli: CLI tool (profile: cli)
- proxy: SSE/HTTP proxy (profile: proxy, port 8080)

Volumes:
- storage-data: persistent session storage
- sessions-data: MCP session data

Usage examples in comments for common scenarios.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit moves locking from the client (.NET) to the server (Rust)
by replacing explicit lock RPCs with atomic index operations that handle
concurrency internally.

Changes:
- Remove AcquireLock/ReleaseLock/RenewLock RPCs from proto
- Add atomic index operations: AddSessionToIndex, UpdateSessionInIndex,
  RemoveSessionFromIndex (server acquires/releases locks internally)
- Remove WithLockedIndex methods and _holderId/_indexLock from SessionManager
- Rename DTO types with Dto suffix to avoid proto-generated type conflicts
  (SessionInfoDto, WalEntryDto, CheckpointInfoDto, SessionIndexEntryDto)
- Fix GrpcLauncher to find Rust binary via correct relative paths
- Update .gitignore for Rust artifacts and Claude Code files

This fixes the ParallelCreation_NoLostSessions race condition where
parallel session creation could lose sessions due to client-side
lock/load/save/unlock races. Server-side atomic operations ensure
index updates are serialized correctly.

All 428 tests pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
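
The move from client-side lock/load/save/unlock to server-side atomic index operations described in the commit above can be sketched like this. It is a simplified assumption of the idea, not the server's code: an in-process `Mutex` stands in for the file lock the Rust server acquires internally, and the type and method names (`SessionIndex`, `add_session`, and so on) are hypothetical.

```rust
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
pub struct SessionIndexEntry {
    pub id: String,
    pub wal_count: u64,
}

/// Index whose operations take the lock internally, so callers never
/// perform a separate lock / load / save / unlock sequence.
#[derive(Default)]
pub struct SessionIndex {
    entries: Mutex<Vec<SessionIndexEntry>>,
}

impl SessionIndex {
    /// Adds the entry unless the id already exists. Returns true if added.
    pub fn add_session(&self, entry: SessionIndexEntry) -> bool {
        let mut entries = self.entries.lock().unwrap();
        if entries.iter().any(|e| e.id == entry.id) {
            return false;
        }
        entries.push(entry);
        true
    }

    /// Updates the WAL count of an existing entry. Returns false if absent.
    pub fn update_session(&self, id: &str, wal_count: u64) -> bool {
        let mut entries = self.entries.lock().unwrap();
        match entries.iter_mut().find(|e| e.id == id) {
            Some(entry) => {
                entry.wal_count = wal_count;
                true
            }
            None => false,
        }
    }

    /// Removes the entry with the given id. Returns true if one was removed.
    pub fn remove_session(&self, id: &str) -> bool {
        let mut entries = self.entries.lock().unwrap();
        let before = entries.len();
        entries.retain(|e| e.id != id);
        entries.len() != before
    }

    /// Number of sessions currently in the index.
    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

Because each operation is a single critical section, parallel creators cannot overwrite each other's view of the index, which is the failure mode behind the `ParallelCreation_NoLostSessions` race.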
- Add build-storage job that builds docx-mcp-storage for 6 targets:
  linux-x64, linux-arm64, macos-x64, macos-arm64, windows-x64, windows-arm64
- Tests now download linux-x64 storage server before running
- Windows installer downloads platform-specific storage server
- macOS installer downloads both arch binaries and creates universal binary
- Implement fork/join semantics: parent kills child via ProcessExit event
- Add unique PID-based socket paths to prevent conflicts
- Add parent death monitoring (prctl on Linux, polling fallback on macOS/Windows)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Only trigger Build, Test & Release workflow when:
- src/, tests/, crates/ code changes
- Cargo.toml/Cargo.lock changes
- Dockerfile, docker-compose files change
- installers/ or publish.sh changes
- Workflow itself changes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add protobuf compiler installation for all platforms:
- Linux: apt-get install protobuf-compiler
- macOS: brew install protobuf
- Windows: choco install protoc

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add #[cfg(unix)] to UnixListener import and Unix transport handling
- Define SYNCHRONIZE constant locally to avoid Windows feature issues
- Return error on Windows when Unix transport is requested
- Add Win32_Security feature to windows-sys

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The native GitHub paths filter evaluates against entire PR diff,
not per-push. Use dorny/paths-filter to check actual changes in
each push and skip website build when unrelated files change.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure session storage path is consistent between .NET and Rust:
- Add LocalStorageDir to StorageClientOptions
- Support both LOCAL_STORAGE_DIR and DOCX_SESSIONS_DIR env vars
- Pass --local-storage-dir when launching storage server
- Default: LocalApplicationData/docx-mcp/sessions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests now use a unique temp directory for session storage, ensuring
complete isolation from production data and other test runs.
The temp directory is cleaned up when DisposeStorageAsync is called.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The file doesn't exist in the repo - was likely removed previously.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…llers

- Windows: Add docx-mcp-storage.exe to Inno Setup script
- macOS: Add docx-mcp-storage to PKG installer and sign it
- Update documentation in installers to mention storage server

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…collision

Fixes Docker build error due to unstable_name_collisions warning.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change default tenant from "local" to "" (empty string) so sessions
are stored directly in {base_dir}/sessions/ rather than
{base_dir}/local/sessions/, maintaining compatibility with the
legacy session storage layout.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
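
The path layout implied by the commit above (tenant-scoped directories, with the empty default tenant mapping to the legacy location) could look like the following. `sessions_dir` is a hypothetical helper written for illustration; a real implementation would presumably also validate `tenant_id` against path traversal.

```rust
use std::path::{Path, PathBuf};

/// Resolves the sessions directory for a tenant.
/// An empty tenant id maps to the legacy `{base_dir}/sessions/` layout;
/// any other tenant gets `{base_dir}/{tenant_id}/sessions/`.
pub fn sessions_dir(base_dir: &Path, tenant_id: &str) -> PathBuf {
    if tenant_id.is_empty() {
        base_dir.join("sessions")
    } else {
        base_dir.join(tenant_id).join("sessions")
    }
}
```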
The Rust gRPC storage server now correctly reads and writes files in
the exact same format as the .NET code:

**WAL format (.wal files)**:
- 8-byte little-endian i64 header = data length (NOT including header)
- JSONL content: each entry is a JSON line ending with \n
- Raw .NET WalEntry JSON bytes are stored/retrieved as-is

**Session/Checkpoint DOCX files**:
- Strip 8-byte .NET header prefix when loading (detects PK signature)
- Returns pure DOCX content starting with PK\x03\x04

**Session Index (index.json)**:
- Changed from HashMap to Vec<SessionIndexEntry> to match .NET format
- Added version field, id per entry, docx_file, wal_count, cursor_position
- Uses serde aliases for field name compatibility (modified_at/last_modified_at)

**Other changes**:
- Added serde_bytes for efficient binary serialization of patch_json
- Added tonic-reflection for gRPC service introspection
- Allow empty tenant_id for backward compatibility with legacy paths
- Comprehensive tests for .NET format compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
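
The file formats described in the commit above can be made concrete with a short sketch. It assumes only what the message states (an 8-byte little-endian i64 length that excludes the header, JSONL content, and an 8-byte prefix in front of DOCX bytes that begin with `PK\x03\x04`). The function names and the exact detection order are hypothetical, not taken from the storage crate.

```rust
const HEADER_LEN: usize = 8;
const ZIP_MAGIC: &[u8] = b"PK\x03\x04";

/// Encodes WAL entries as: 8-byte little-endian i64 payload length
/// (header not included), followed by one JSON line per entry.
pub fn encode_wal(entries: &[&str]) -> Vec<u8> {
    let mut payload = Vec::new();
    for entry in entries {
        payload.extend_from_slice(entry.as_bytes());
        payload.push(b'\n');
    }
    let mut out = (payload.len() as i64).to_le_bytes().to_vec();
    out.extend_from_slice(&payload);
    out
}

/// Decodes a WAL file. Returns `None` if the header is missing, the declared
/// length is negative or exceeds the available bytes, or the payload is not UTF-8.
pub fn decode_wal(bytes: &[u8]) -> Option<Vec<String>> {
    let mut header = [0u8; HEADER_LEN];
    header.copy_from_slice(bytes.get(..HEADER_LEN)?);
    let declared = i64::from_le_bytes(header);
    if declared < 0 {
        return None;
    }
    let end = HEADER_LEN.checked_add(declared as usize)?;
    let payload = bytes.get(HEADER_LEN..end)?;
    let text = std::str::from_utf8(payload).ok()?;
    Some(text.lines().filter(|l| !l.is_empty()).map(str::to_string).collect())
}

/// Returns pure DOCX bytes: if the ZIP signature appears right after an
/// 8-byte prefix, the prefix is dropped; otherwise the input is returned as-is.
pub fn strip_dotnet_header(bytes: &[u8]) -> &[u8] {
    if bytes.starts_with(ZIP_MAGIC) {
        return bytes;
    }
    match bytes.get(HEADER_LEN..) {
        Some(rest) if rest.starts_with(ZIP_MAGIC) => rest,
        _ => bytes,
    }
}
```

Storing the declared length separately from the file size lets a reader ignore any bytes past the payload, which is one plausible reason for a length header on a memory-mapped or preallocated WAL file.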
- SessionIndex now uses List<SessionIndexEntry> instead of Dictionary
- Added Id field to SessionIndexEntry for array-based format
- Added helper methods: GetById, TryGetValue, ContainsKey, Upsert, Remove
- Fixed checkpoint positions to use int (matching .NET WAL format)
- Added WalPosition property for backward compatibility
- StorageClientOptions: clarified base directory vs sessions directory
- SessionManager: handle legacy WAL formats gracefully during restore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The truncate_wal function was using "keep_from" semantics (keep entries
from position N onwards) but .NET expected "keep_count" semantics (keep
first N entries).

This caused the Undo_ThenNewPatch_DiscardsRedoHistory test to fail because:
- After undo to position 1, cursor = 1
- Applying new patch called truncate_wal(1)
- Old behavior: keep entries with position >= 1 (all entries kept)
- New behavior: keep entries with position <= 1 (only first entry kept)

Changes:
- Renamed parameter from keep_from to keep_count
- Changed partition logic: keep entries where position <= keep_count
- Updated test to use correct value (1 instead of 2)
- Updated trait documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
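
The difference between the two truncation semantics in the commit above is easiest to see side by side. The sketch below assumes 1-based positions (as the commit's example suggests) and uses a simplified, hypothetical `WalEntry` and an in-memory `Vec` rather than the storage backend's actual types.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct WalEntry {
    pub position: u64, // assumed 1-based position in the log
    pub patch_json: String,
}

/// New behaviour: keep the first `keep_count` entries and drop the rest.
/// Returns the number of entries removed.
pub fn truncate_wal(entries: &mut Vec<WalEntry>, keep_count: u64) -> usize {
    let before = entries.len();
    entries.retain(|e| e.position <= keep_count);
    before - entries.len()
}

/// Old behaviour, shown only for contrast: keep entries from `keep_from` onwards.
pub fn truncate_wal_keep_from(entries: &mut Vec<WalEntry>, keep_from: u64) -> usize {
    let before = entries.len();
    entries.retain(|e| e.position >= keep_from);
    before - entries.len()
}
```

With three entries and a cursor at 1 after an undo, `truncate_wal(.., 1)` leaves only the first entry and discards the redo history, whereas the old `keep_from` variant keeps all three.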
@valdo404
Owner Author

valdo404 commented Feb 6, 2026

Still missing:

the whole stack running in Docker Compose
the whole stack running on Koyeb

integration with Claude Desktop / claude.ai

Laurent Valdes and others added 2 commits February 6, 2026 17:31
…cloudflare S3 client

Provision R2 bucket, KV namespace, D1 database, and R2 API token via
Pulumi Python. Import existing resources (D1 auth, KV session). Add
env-setup.sh to source all Cloudflare env vars from Pulumi outputs.
Fix aws-sdk-s3 BehaviorVersion panic in storage-cloudflare.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… limits

send_with_retry wraps all KV HTTP calls with up to 5 retries,
starting at 200ms and doubling each attempt. Prevents cascading
failures under heavy load from Cloudflare KV rate limiting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
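
The retry policy described in the commit above (up to 5 retries, starting at 200ms and doubling each time) can be sketched as a generic helper. This is a synchronous illustration under stated assumptions: the real `send_with_retry` wraps KV HTTP calls and is presumably asynchronous, "5 retries" is read here as five retries after the initial attempt, and the injectable `sleep` parameter is a hypothetical addition that keeps the example testable.

```rust
use std::time::Duration;

/// Calls `attempt` until it succeeds or 5 retries have been used.
/// The delay starts at 200ms and doubles after every failed attempt.
pub fn send_with_retry<T, E, F, S>(mut attempt: F, mut sleep: S) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    S: FnMut(Duration),
{
    const MAX_RETRIES: u32 = 5;
    let mut delay = Duration::from_millis(200);
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            // Retries exhausted: surface the last error to the caller.
            Err(err) if retries == MAX_RETRIES => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 2;
                retries += 1;
            }
        }
    }
}
```

In real use the second argument would be `std::thread::sleep` (or an async timer in an async variant). With this schedule the worst case waits 200 + 400 + 800 + 1600 + 3200 = 6200ms before giving up.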
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 17:59 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 18:18 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/gdrive February 19, 2026 18:18 Inactive
Scale-to-zero applied via `koyeb services update --min-scale 0` since
the Pulumi provider incorrectly requires routes for scale-to-zero
(Koyeb API/CLI accepts it on mesh-only services).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@koyeb koyeb bot temporarily deployed to production-docx-mcp/gdrive February 19, 2026 18:19 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 18:19 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/mcp-http February 19, 2026 18:19 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/proxy February 19, 2026 18:19 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/proxy February 19, 2026 18:49 Inactive
…s to CLAUDE.md

Add comprehensive operational documentation: Koyeb CLI cheat sheet, mcptools
usage for local and production proxy testing via mcp-remote, Dockerfile local
testing workflow, and Koyeb container debugging. Install grpcurl in mcp-http
Dockerfile for gRPC debugging inside containers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@koyeb koyeb bot temporarily deployed to production-docx-mcp/gdrive February 19, 2026 19:23 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 19:24 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/proxy February 19, 2026 19:24 Inactive
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@koyeb koyeb bot temporarily deployed to production-docx-mcp/mcp-http February 19, 2026 19:26 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/gdrive February 19, 2026 19:26 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/proxy February 19, 2026 19:26 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 19:26 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/mcp-http February 19, 2026 19:39 Inactive
/health now only checks that the proxy itself is running — no upstream
dependency. Koyeb health checks were failing when mcp-http was slow to
start, causing the edge to return 502 for all traffic.

Added /upstream-health for deep health checks (proxy + mcp-http backend).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@koyeb koyeb bot temporarily deployed to production-docx-mcp/mcp-http February 19, 2026 20:50 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/proxy February 19, 2026 20:50 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/gdrive February 19, 2026 20:50 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 20:50 Inactive
Replace axum::serve (HTTP/1.1 only) with hyper-util auto::Builder
which negotiates HTTP/1.1 or HTTP/2 (h2c) per connection. This fixes
502 errors on Koyeb where the edge may connect via HTTP/2.

Also split /health (liveness, no upstream dep) from /upstream-health
(deep check including mcp-http backend).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@koyeb koyeb bot temporarily deployed to production-docx-mcp/gdrive February 19, 2026 20:59 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/proxy February 19, 2026 20:59 Inactive
@koyeb koyeb bot temporarily deployed to production-docx-mcp/storage February 19, 2026 20:59 Inactive
Root cause: all 4 services in the same Koyeb app shared route "/" on the
same domain. Koyeb's edge routed traffic to the wrong service (e.g. gRPC
storage instead of the HTTP proxy) → 502.

Fix: internal services (mcp-http, storage, gdrive) now use protocol=tcp
with no public routes — they are only reachable via Koyeb service mesh.
Only the proxy keeps protocol=http with route "/".

- infra/__main__.py: public→http+route, internal→tcp+no routes+min=1
- infra/koyeb-fix-routes.sh: script to fix routes via Koyeb API
- CLAUDE.md: document tcp/mesh architecture, PAT token warning, 502 debug
- Cargo.lock: hyper/hyper-util/tower deps for dual-stack h2c proxy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>