fix(core): lifecycle management, bounded caches, and memory safety across 11 crates #13

Open
dviejokfs wants to merge 9 commits into main from feat/postgres-backup-sidecar
dviejokfs commented on Feb 19, 2026

Summary

Comprehensive memory safety, lifecycle management, and performance improvements across 11 crates. Fixes OOM risks, resource leaks, unbounded caches, and adds proper shutdown/cleanup semantics to long-running background tasks.

Changes

Backup & Data Safety

  • pg_dump sidecar pattern — Streams backups through a sidecar container instead of loading into memory, preventing OOM on large TimescaleDB databases
  • Branch uniqueness validation — Environments now enforce unique branch names per project

Memory & Performance Fixes

  • SQL percentile_cont() — Replaces in-memory percentile sorting in analytics-performance
  • select_only() for source maps — Avoids fetching blob content when listing source maps
  • Temp file tar streaming — Docker build context uses temp files instead of in-memory buffers
  • Tail-limited build logs — Caps stored build output to prevent unbounded growth
  • Bounded session replay — LIMIT + bulk UPDATE on large replay queries
  • LazyLock regexes — Error tracking regexes compiled once instead of per-request
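
The tail-limited build log item above can be pictured as a small ring buffer that discards the oldest line once a cap is reached. The following is a minimal sketch of that general technique, not the deployer's actual code: `TailLog`, `max_lines`, and `dropped` are hypothetical names chosen for illustration, and the PR does not state what cap it uses.

```rust
use std::collections::VecDeque;

/// Hypothetical helper: keeps only the most recent `max_lines` lines of output.
pub struct TailLog {
    max_lines: usize,
    lines: VecDeque<String>,
    dropped: usize,
}

impl TailLog {
    pub fn new(max_lines: usize) -> Self {
        Self { max_lines, lines: VecDeque::new(), dropped: 0 }
    }

    /// Appends a line, discarding the oldest stored line when the cap is hit.
    pub fn push(&mut self, line: impl Into<String>) {
        if self.max_lines == 0 {
            // Nothing can be stored; only count the line as dropped.
            self.dropped += 1;
            return;
        }
        if self.lines.len() == self.max_lines {
            self.lines.pop_front();
            self.dropped += 1;
        }
        self.lines.push_back(line.into());
    }

    /// Number of lines discarded so far.
    pub fn dropped(&self) -> usize {
        self.dropped
    }

    /// Joins the retained tail into one newline-separated string.
    pub fn render(&self) -> String {
        self.lines
            .iter()
            .map(String::as_str)
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

Memory stays proportional to `max_lines` regardless of how much output the build produces, which is the property the bullet describes.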

Lifecycle & Resource Management (6 fixes)

  • DigestScheduler Arc cycle (temps-notifications) — new() returns plain Self instead of Arc<Self>, added start()/shutdown() methods + Drop for clean cancellation
  • RouteTableListener leaked spawn (temps-routes) — Stores JoinHandle in Mutex, added shutdown() + Drop. Spawned task captures only Arc<CachedPeerTable> instead of Arc<Self>
  • ProjectChangeListener no cancellation (temps-routes) — Changed start_listening from consuming self to &self, spawns internal task with stored JoinHandle + Drop
  • MetricsCache unbounded (temps-analytics-funnels) — Added max_capacity (default 1000). Evicts expired entries first, then oldest-by-expiration on overflow
  • OutageDetectionService unbounded monitor_states (temps-monitoring) — scan_monitors() now prunes entries for monitors no longer in the active DB set
  • WebhookEventListener no Drop (temps-webhooks) — Added Drop impl using try_write() to abort spawned task handle
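
The MetricsCache eviction order described above (expired entries first, then oldest-by-expiration on overflow) can be sketched as follows. This is an illustration under assumptions, not the crate's implementation: `BoundedCache`, `insert_at`, and `get_at` are hypothetical names, only `max_capacity` comes from the PR text, and the current time is passed in explicitly so the behavior is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical bounded cache: each entry stores its value and expiration time.
pub struct BoundedCache<V> {
    max_capacity: usize,
    entries: HashMap<String, (V, Instant)>,
}

impl<V> BoundedCache<V> {
    pub fn new(max_capacity: usize) -> Self {
        Self { max_capacity, entries: HashMap::new() }
    }

    /// Inserts `value` with a time-to-live, evicting entries if the cache is full.
    pub fn insert_at(&mut self, key: String, value: V, ttl: Duration, now: Instant) {
        if self.max_capacity == 0 {
            return;
        }
        // Overwriting an existing key never grows the map, so skip eviction then.
        if !self.entries.contains_key(&key) && self.entries.len() >= self.max_capacity {
            // Step 1: drop everything that has already expired.
            self.entries.retain(|_, (_, expires)| *expires > now);
            // Step 2: if still full, evict the entry with the earliest expiration.
            while self.entries.len() >= self.max_capacity {
                let oldest = self
                    .entries
                    .iter()
                    .min_by_key(|(_, (_, expires))| *expires)
                    .map(|(k, _)| k.clone());
                match oldest {
                    Some(k) => {
                        self.entries.remove(&k);
                    }
                    None => break,
                }
            }
        }
        self.entries.insert(key, (value, now + ttl));
    }

    /// Returns the value only if it has not expired at `now`.
    pub fn get_at(&self, key: &str, now: Instant) -> Option<&V> {
        self.entries
            .get(key)
            .filter(|(_, expires)| *expires > now)
            .map(|(v, _)| v)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The linear scan in step 2 is a deliberate simplification; at a capacity around the stated default of 1000 it is cheap, though a real implementation may track expiration order differently.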

Other Improvements

  • Source map CLI commands — temps errors sourcemaps upload/list/delete
  • Analytics events — OpenAPI spec improvements and handler refactoring
  • Deployments — Workflow execution and job processor improvements

Testing

28 new tests covering lifecycle, bounded cache, and cleanup behavior:

  • 6 tests for DigestScheduler lifecycle (64 total in crate)
  • 3 tests for RouteTableListener shutdown/Drop (42 total in crate)
  • 3 tests for ProjectChangeListener shutdown/Drop (42 total in crate)
  • 8 tests for MetricsCache bounded capacity and eviction (17 total in crate)
  • 4 tests for OutageDetectionService state pruning (21 total in crate)
  • 4 tests for WebhookEventListener Drop behavior (12 total in crate)
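
The state pruning that the OutageDetectionService tests exercise can be sketched as a `retain` over the in-memory map, keyed against the set of monitors still active in the database. The names below (`MonitorState`, `prune_stale_states`, the `i32` monitor id) are assumptions for illustration; the PR only states that `scan_monitors()` prunes entries for monitors no longer in the active set.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-monitor state kept in memory between scans.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct MonitorState {
    pub consecutive_failures: u32,
}

/// Removes state for monitors that are no longer active.
/// Returns how many entries were pruned.
pub fn prune_stale_states(
    states: &mut HashMap<i32, MonitorState>,
    active_ids: &HashSet<i32>,
) -> usize {
    let before = states.len();
    // Keep only entries whose monitor id is still in the active set.
    states.retain(|id, _| active_ids.contains(id));
    before - states.len()
}
```

Running this on every scan bounds the map by the number of active monitors, so deleted monitors no longer accumulate state.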

Crates Modified (30 files)

temps-backup, temps-deployer, temps-deployments, temps-environments, temps-error-tracking, temps-analytics-events, temps-analytics-funnels, temps-analytics-performance, temps-analytics-session-replay, temps-monitoring, temps-notifications, temps-routes, temps-webhooks, temps-projects, temps-status-page, temps-cli

… and error hardening

- Replace pg_dumpall exec-in-container with a disposable sidecar container running
  the same image, connected via Docker network; fixes OOM kills (exit 137) under load
  and supports TimescaleDB databases (--format=custom, streaming gzip, no memory spike)
- Wire up all preset providers (Next.js, Vite, Rsbuild, Docusaurus v1/v2, NestJS,
  Angular, Astro, Dockerfile, Nixpacks) in PresetProviderRegistry::new(); Dockerfile
  and Nixpacks registered first for highest detection precedence
- Convert ServiceRegistry and PluginStateRegistry from Mutex to RwLock for concurrent reads
- Harden BackupError: structured NotFound/Internal variants with named fields, remove
  DatabaseConnectionError and Operation variants, exhaustive From<BackupError> for Problem
- Replace hand-rolled CORS middleware helper with tower_http::cors::CorsLayer doc comment
- Remove legacy CreateService.tsx and CreateServiceRefactored.tsx frontend pages
…ross 11 crates

- Backup: pg_dump sidecar pattern to prevent OOM on large TimescaleDB databases
- Environments: branch uniqueness validation per project
- CLI: source map upload/list/delete commands for error tracking
- Performance: SQL percentile_cont() instead of in-memory sorting
- Session replay: LIMIT + bulk UPDATE to bound memory on large replays
- Source maps: select_only() to avoid fetching blob content in listings
- Error tracking: LazyLock regexes to avoid recompilation per request
- Deployer: temp file tar streaming + tail-limited build logs
- Notifications: DigestScheduler Arc cycle fix with proper start/shutdown
- Routes: RouteTableListener + ProjectChangeListener stored JoinHandle + Drop
- Funnels: MetricsCache bounded to max_capacity; evicts expired entries first, then oldest-by-expiration
- Monitoring: OutageDetectionService prunes stale monitor states
- Webhooks: WebhookEventListener Drop impl aborts spawned task
- Analytics events: OpenAPI spec improvements and handler refactoring
- Deployments: workflow execution and job processor improvements

Includes 28 new tests covering lifecycle, bounded cache, and cleanup behavior.
dviejokfs changed the title from "feat(providers,backup,core): PostgreSQL backup via pg_dump sidecar, preset registry wiring, and error hardening" to "fix(core): lifecycle management, bounded caches, and memory safety across 11 crates" on Feb 20, 2026
…logging

pg_dump with --format=custom buffered per-table data in both the sidecar
container and the temps process (via Bollard's HTTP stream), causing 2-3 GB
memory peaks and OOM kills (exit code 137) on large TimescaleDB databases.

Switch to --format=plain which streams SQL text (COPY statements) row-by-row
with constant memory usage. Update both the main database backup path and the
external service postgres provider.

Additional fixes:
- Stream S3 uploads via ByteStream::from_path instead of reading entire file
  into memory in upload_single_part
- Add error! logging for backup failures in handlers and service layer that
  were previously silently converted to HTTP Problem responses
- Update restore paths to detect format from S3 location and use psql for
  new .sql.gz backups or pg_restore for old .pgdump.gz backups, preserving
  backward compatibility with existing backups
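
The format detection for restores described in the last item can be sketched as a suffix check on the S3 location. This is a minimal illustration, assuming the decision depends only on the file extension; `RestoreTool` and `restore_tool_for` are hypothetical names, and the example keys in the usage note are made up.

```rust
/// Which client program a backup should be restored with (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestoreTool {
    /// Plain-format SQL dumps (`.sql.gz`) are replayed through psql.
    Psql,
    /// Older custom-format dumps (`.pgdump.gz`) need pg_restore.
    PgRestore,
}

/// Picks the restore tool from the backup's S3 location.
/// Returns `None` when the extension is not a recognized backup format.
pub fn restore_tool_for(location: &str) -> Option<RestoreTool> {
    if location.ends_with(".sql.gz") {
        Some(RestoreTool::Psql)
    } else if location.ends_with(".pgdump.gz") {
        Some(RestoreTool::PgRestore)
    } else {
        None
    }
}
```

For example, a key such as `backups/db.sql.gz` would map to `RestoreTool::Psql`, while `backups/db.pgdump.gz` would map to `RestoreTool::PgRestore`, so backups taken before the format switch remain restorable.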
…emory management

- Change backup file extension from .postgresql.gz to .sql.gz to align with plain format pg_dump output.
- Update metadata location handling to accommodate both .sql.gz and legacy .backup.postgresql.gz formats.
- Enhance sidecar container configuration to prevent OOM kills by overriding the entrypoint and adjusting oom_score_adj.
- Update tests to validate the new plain format output structure.
… improved error handling

- Implement a host directory for bind mounting, allowing pg_dump to write directly to disk, reducing memory usage during backups.
- Update container configuration to prevent OOM kills by adjusting entrypoint and adding bind mounts.
- Introduce a helper function to ensure sidecar removal on error paths.
- Streamline pg_dump command execution to pipe output directly to a gzip-compressed file on the host filesystem, maintaining low memory usage.
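
One common way to guarantee cleanup on every exit path, including early returns on error, is a drop guard. The commit describes a helper function rather than a guard, so the sketch below is a swapped-in technique shown for illustration only: `CleanupGuard` and `run_with_cleanup` are hypothetical names, and the closures stand in for the real backup and container removal steps. Real container removal through Docker is asynchronous and cannot be awaited inside `Drop`, which may be one reason to prefer an explicit helper.

```rust
/// Hypothetical guard that runs a cleanup closure when it goes out of scope.
pub struct CleanupGuard<F: FnOnce()> {
    cleanup: Option<F>,
}

impl<F: FnOnce()> CleanupGuard<F> {
    pub fn new(cleanup: F) -> Self {
        Self { cleanup: Some(cleanup) }
    }
}

impl<F: FnOnce()> Drop for CleanupGuard<F> {
    fn drop(&mut self) {
        // `take` ensures the closure runs at most once.
        if let Some(cleanup) = self.cleanup.take() {
            cleanup();
        }
    }
}

/// Runs `work` and always runs `cleanup` afterwards, whether `work`
/// returned `Ok` or `Err`.
pub fn run_with_cleanup<T, E>(
    work: impl FnOnce() -> Result<T, E>,
    cleanup: impl FnOnce(),
) -> Result<T, E> {
    let _guard = CleanupGuard::new(cleanup);
    work()
}
```

In this shape the caller cannot forget the cleanup on a new error branch, because it is tied to scope exit rather than to each return statement.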
…M issues

Updated the command duration from 300 seconds to 86400 seconds (24 hours) in both BackupService and PostgresService configurations to ensure the sidecar outlives pg_dump operations on large databases, preventing potential OOM kills during backups.
…access and clarity

- Change bind mount path from /tmp to /backup in the container configuration to ensure proper write access for the postgres user.
- Set the container user to root to guarantee write permissions on the bind mount.
- Enhance comments for clarity regarding the backup process and container setup.
- Update pg_dump command execution to run fully detached, avoiding memory growth by not streaming stdout/stderr through the Temps process.
- Redirect stderr to a file within the container for error logging, improving error handling during backup operations.
- Adjust exec options to disable stdout/stderr attachment, enhancing performance and stability during large database backups.