fix(core): lifecycle management, bounded caches, and memory safety across 11 crates #13
Open
… and error hardening

- Replace pg_dumpall exec-in-container with a disposable sidecar container running the same image, connected via Docker network; fixes OOM kills (exit 137) under load and supports TimescaleDB databases (--format=custom, streaming gzip, no memory spike)
- Wire up all preset providers (Next.js, Vite, Rsbuild, Docusaurus v1/v2, NestJS, Angular, Astro, Dockerfile, Nixpacks) in PresetProviderRegistry::new(); Dockerfile and Nixpacks registered first for highest detection precedence
- Convert ServiceRegistry and PluginStateRegistry from Mutex to RwLock for concurrent reads
- Harden BackupError: structured NotFound/Internal variants with named fields, remove DatabaseConnectionError and Operation variants, exhaustive From<BackupError> for Problem
- Replace hand-rolled CORS middleware helper with tower_http::cors::CorsLayer doc comment
- Remove legacy CreateService.tsx and CreateServiceRefactored.tsx frontend pages
…ross 11 crates

- Backup: pg_dump sidecar pattern to prevent OOM on large TimescaleDB databases
- Environments: branch uniqueness validation per project
- CLI: source map upload/list/delete commands for error tracking
- Performance: SQL percentile_cont() instead of in-memory sorting
- Session replay: LIMIT + bulk UPDATE to bound memory on large replays
- Source maps: select_only() to avoid fetching blob content in listings
- Error tracking: LazyLock regexes to avoid recompilation per request
- Deployer: temp file tar streaming + tail-limited build logs
- Notifications: DigestScheduler Arc cycle fix with proper start/shutdown
- Routes: RouteTableListener + ProjectChangeListener stored JoinHandle + Drop
- Funnels: MetricsCache bounded to max_capacity with LRU-style eviction
- Monitoring: OutageDetectionService prunes stale monitor states
- Webhooks: WebhookEventListener Drop impl aborts spawned task
- Analytics events: OpenAPI spec improvements and handler refactoring
- Deployments: workflow execution and job processor improvements

Includes 28 new tests covering lifecycle, bounded cache, and cleanup behavior.
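The "prunes stale monitor states" item can be pictured with a small sketch. This is not the OutageDetectionService code: `MonitorState`, `prune_stale_states`, and the `i32` ID type are hypothetical names chosen for illustration. It assumes the per-monitor state lives in a map keyed by monitor ID and that each scan knows the set of currently active IDs.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-monitor state; the real struct is not shown in this PR text.
#[derive(Debug, Default)]
struct MonitorState {
    consecutive_failures: u32,
}

/// Drops state for monitors that are no longer in the active set and
/// returns how many entries were removed.
fn prune_stale_states(
    states: &mut HashMap<i32, MonitorState>,
    active_ids: &HashSet<i32>,
) -> usize {
    let before = states.len();
    // `retain` keeps only entries whose key is still active, so the map
    // cannot grow without bound as monitors are deleted over time.
    states.retain(|id, _| active_ids.contains(id));
    before - states.len()
}
```

Calling a helper like this once per scan keeps the map size proportional to the number of live monitors rather than to every monitor ever seen.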
…logging

pg_dump with --format=custom buffered per-table data in both the sidecar container and the temps process (via Bollard's HTTP stream), causing 2-3 GB memory peaks and OOM kills (exit code 137) on large TimescaleDB databases. Switch to --format=plain which streams SQL text (COPY statements) row-by-row with constant memory usage. Update both the main database backup path and the external service postgres provider.

Additional fixes:
- Stream S3 uploads via ByteStream::from_path instead of reading entire file into memory in upload_single_part
- Add error! logging for backup failures in handlers and service layer that were previously silently converted to HTTP Problem responses
- Update restore paths to detect format from S3 location and use psql for new .sql.gz backups or pg_restore for old .pgdump.gz backups, preserving backward compatibility with existing backups
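The restore-path format detection described in this commit could look roughly like the sketch below. It only covers the two suffixes named above; `RestoreTool` and `restore_tool_for` are hypothetical names, and the actual code in temps-backup may structure this differently.

```rust
/// Which client a restore would use, chosen from the backup's S3 location.
#[derive(Debug, PartialEq, Eq)]
enum RestoreTool {
    /// Plain SQL dump (`.sql.gz`): decompress and feed to `psql`.
    Psql,
    /// Older custom-format archive (`.pgdump.gz`): decompress and feed to `pg_restore`.
    PgRestore,
}

/// Returns the restore tool for a location, or `None` if the suffix is
/// not one of the two formats named in the commit message.
fn restore_tool_for(location: &str) -> Option<RestoreTool> {
    if location.ends_with(".sql.gz") {
        Some(RestoreTool::Psql)
    } else if location.ends_with(".pgdump.gz") {
        Some(RestoreTool::PgRestore)
    } else {
        None
    }
}
```

Returning `None` for unknown suffixes lets the caller report an explicit error instead of guessing a format.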
…emory management

- Change backup file extension from .postgresql.gz to .sql.gz to align with plain format pg_dump output.
- Update metadata location handling to accommodate both .sql.gz and legacy .backup.postgresql.gz formats.
- Enhance sidecar container configuration to prevent OOM kills by overriding the entrypoint and adjusting oom_score_adj.
- Update tests to validate the new plain format output structure.
… improved error handling

- Implement a host directory for bind mounting, allowing pg_dump to write directly to disk, reducing memory usage during backups.
- Update container configuration to prevent OOM kills by adjusting entrypoint and adding bind mounts.
- Introduce a helper function to ensure sidecar removal on error paths.
- Streamline pg_dump command execution to pipe output directly to a gzip-compressed file on the host filesystem, maintaining low memory usage.
…M issues

Updated the command duration from 300 seconds to 86400 seconds (24 hours) in both BackupService and PostgresService configurations to ensure the sidecar outlives pg_dump operations on large databases, preventing potential OOM kills during backups.
…access and clarity

- Change bind mount path from /tmp to /backup in the container configuration to ensure proper write access for the postgres user.
- Set the container user to root to guarantee write permissions on the bind mount.
- Enhance comments for clarity regarding the backup process and container setup.
- Update pg_dump command execution to run fully detached, avoiding memory growth by not streaming stdout/stderr through the Temps process.
- Redirect stderr to a file within the container for error logging, improving error handling during backup operations.
- Adjust exec options to disable stdout/stderr attachment, enhancing performance and stability during large database backups.
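As a rough illustration of the detached exec described in the last few commits, the sketch below builds an `sh -c` argv that pipes pg_dump through gzip into the bind mount and sends stderr to a file. The function names, the output file names under /backup, and the assumption that the password arrives via the container environment are all hypothetical; only the /backup mount path, plain format, gzip piping, and stderr redirection come from the commit messages.

```rust
/// Wraps a value in single quotes for POSIX sh, escaping embedded quotes.
fn sh_quote(s: &str) -> String {
    format!("'{}'", s.replace('\'', "'\\''"))
}

/// Builds the argv for a detached `sh -c` exec inside the sidecar.
/// Nothing is written to the exec's stdout/stderr, so the calling
/// process has no stream to buffer.
fn pg_dump_exec_argv(host: &str, user: &str, database: &str) -> Vec<String> {
    let host = sh_quote(host);
    let user = sh_quote(user);
    let database = sh_quote(database);
    // File names below are illustrative, not taken from the PR.
    let script = format!(
        "pg_dump --format=plain -h {host} -U {user} -d {database} \
         2> /backup/pg_dump.stderr | gzip > /backup/dump.sql.gz"
    );
    vec!["sh".to_string(), "-c".to_string(), script]
}
```

Because the pipeline's exit status is that of gzip in plain POSIX sh, a caller using this shape would also need to inspect the stderr file to detect pg_dump failures.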
Summary
Comprehensive memory safety, lifecycle management, and performance improvements across 11 crates. Fixes OOM risks, resource leaks, unbounded caches, and adds proper shutdown/cleanup semantics to long-running background tasks.
Changes
Backup & Data Safety
Memory & Performance Fixes
- percentile_cont(): Replaces in-memory percentile sorting in analytics-performance
- select_only() for source maps: Avoids fetching blob content when listing source maps
- Session replay: LIMIT + bulk UPDATE to bound memory on large replay queries
- LazyLock regexes: Error tracking regexes compiled once instead of per-request
Lifecycle & Resource Management (6 fixes)
- DigestScheduler (temps-notifications): new() returns plain Self instead of Arc<Self>, added start()/shutdown() methods + Drop for clean cancellation
- RouteTableListener (temps-routes): Stores JoinHandle in Mutex, added shutdown() + Drop. Spawned task captures only Arc<CachedPeerTable> instead of Arc<Self>
- ProjectChangeListener (temps-routes): Changed start_listening from consuming self to &self, spawns internal task with stored JoinHandle + Drop
- MetricsCache (temps-analytics-funnels): Added max_capacity (default 1000). Evicts expired entries first, then oldest-by-expiration on overflow
- OutageDetectionService (temps-monitoring): scan_monitors() now prunes entries for monitors no longer in the active DB set
- WebhookEventListener (temps-webhooks): Added Drop impl using try_write() to abort spawned task handle
Other Improvements
- CLI source map commands: temps errors sourcemaps upload/list/delete
Testing
28 new tests covering lifecycle, bounded cache, and cleanup behavior.
Crates Modified (30 files)
temps-backup, temps-deployer, temps-deployments, temps-environments, temps-error-tracking, temps-analytics-events, temps-analytics-funnels, temps-analytics-performance, temps-analytics-session-replay, temps-monitoring, temps-notifications, temps-routes, temps-webhooks, temps-projects, temps-status-page, temps-cli
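For reviewers, the MetricsCache eviction policy described above (expired entries first, then oldest-by-expiration on overflow) can be sketched as follows. `BoundedTtlCache` is a hypothetical stand-in, not the type in temps-analytics-funnels; it takes `now` as a parameter so the behavior is easy to test, which the real cache may not do.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache with a hard size limit (illustrative only).
struct BoundedTtlCache<V> {
    entries: HashMap<String, (V, Instant)>, // value plus its expiry time
    max_capacity: usize,
}

impl<V> BoundedTtlCache<V> {
    fn new(max_capacity: usize) -> Self {
        Self { entries: HashMap::new(), max_capacity }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns the value only if it has not expired at `now`.
    fn get(&self, key: &str, now: Instant) -> Option<&V> {
        self.entries
            .get(key)
            .filter(|(_, expires_at)| *expires_at > now)
            .map(|(value, _)| value)
    }

    fn insert(&mut self, key: String, value: V, ttl: Duration, now: Instant) {
        // Overwriting an existing key never grows the map.
        if !self.entries.contains_key(&key) && self.entries.len() >= self.max_capacity {
            // Step 1: drop everything that has already expired.
            self.entries.retain(|_, (_, expires_at)| *expires_at > now);
            // Step 2: if still full, evict the entry closest to expiring.
            while self.entries.len() >= self.max_capacity {
                let oldest = self
                    .entries
                    .iter()
                    .min_by_key(|(_, (_, expires_at))| *expires_at)
                    .map(|(k, _)| k.clone());
                match oldest {
                    Some(k) => {
                        self.entries.remove(&k);
                    }
                    None => break,
                }
            }
        }
        self.entries.insert(key, (value, now + ttl));
    }
}
```

The linear scan for the oldest entry is acceptable at a capacity around the stated default of 1000; a larger cache would likely want an ordered index instead.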