feat(runtime)!: persist response chains on neon postgresql#211
feat(runtime)!: persist response chains on neon postgresql#211xirothedev wants to merge 7 commits intoSoju06:mainfrom
Conversation
Add durable response snapshots and replay previous_response_id across HTTP and websocket responses. Prefer the original serving account when it is still eligible, add migration coverage, and sync the OpenSpec change and main specs.
Require PostgreSQL runtime configuration, add a dedicated migration DSN, and remove SQLite-specific startup behavior from the main runtime path. Normalize asyncpg SSL query parameters for Docker deployments and update OpenSpec, docs, compose, and tests for the Neon-first database contract.
There was a problem hiding this comment.
Pull request overview
Implements durable previous_response_id replay for /v1/responses (HTTP + WebSocket) by persisting response snapshots, adds routing continuity by preferring the prior account when eligible, and shifts runtime/test expectations to Neon PostgreSQL with a dedicated migration DSN. Also fixes API key expiration persistence by normalizing timezone-aware datetimes to UTC-naive.
Changes:
- Persist and resolve durable response snapshots to rebuild prior input/output history for
previous_response_id, with HTTP/WebSocket parity and integration coverage. - Prefer the prior serving account for replayed requests (with fallback), and add regression coverage for the new selection behavior.
- Require explicit PostgreSQL configuration (runtime + tests), introduce migration DSN resolution, and normalize asyncpg SSL URL query params; plus API key expiry timezone normalization.
Reviewed changes
Copilot reviewed 46 out of 47 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Bumps local package version in lockfile. |
| app/modules/proxy/service.py | Adds snapshot persistence, previous_response_id resolution, replay input reconstruction, and preferred-account routing for Responses (HTTP + WS). |
| app/modules/proxy/response_snapshots_repository.py | New repository for response snapshot get/upsert. |
| app/modules/proxy/request_policy.py | Adds helper to build OpenAI invalid_request_error envelopes with param. |
| app/modules/proxy/repo_bundle.py | Extends proxy repo bundle to include response_snapshots. |
| app/modules/proxy/load_balancer.py | Adds optional preferred_account_id to bias account selection when eligible. |
| app/modules/api_keys/service.py | Normalizes timezone-aware expires_at values to UTC-naive before persistence. |
| app/dependencies.py | Wires ResponseSnapshotsRepository into proxy dependencies. |
| app/db/models.py | Adds ResponseSnapshot ORM model + index. |
| app/db/alembic/versions/20260315_120000_add_response_snapshots.py | Adds migration creating response_snapshots table + index. |
| app/db/alembic/env.py | Alembic now uses migration DSN when configured; removes sqlite batch mode toggles. |
| app/db/migrate.py | CLI defaults to migration DSN (or runtime DSN) when --db-url not supplied. |
| app/db/session.py | Removes SQLite startup logic; adds asyncpg URL normalization and runs startup migrations against migration DSN. |
| app/core/openai/requests.py | Stops rejecting previous_response_id; enforces mutual exclusivity with conversation. |
| app/core/config/settings.py | Requires CODEX_LB_DATABASE_URL; adds CODEX_LB_DATABASE_MIGRATION_URL with fallback to runtime DSN. |
| tests/unit/test_proxy_utils.py | Adds unit coverage for rebuilding upstream input from stored snapshots in WS prepare flow. |
| tests/unit/test_openai_requests.py | Updates validation tests for previous_response_id acceptance + conflict with conversation. |
| tests/unit/test_db_session.py | Updates session/init_db tests for required DB URL, migration DSN behavior, and asyncpg URL normalization. |
| tests/unit/test_api_keys_service.py | Adds regression tests for timezone-aware expires_at normalization on create/update. |
| tests/integration/test_openai_compat_features.py | Adds integration coverage for unknown previous_response_id and replay after restart for HTTP. |
| tests/integration/test_proxy_websocket_responses.py | Adds integration coverage for replay after restart for WebSocket Responses. |
| tests/integration/test_migrations.py | Verifies migrations create the response_snapshots table/columns. |
| tests/integration/test_load_balancer_integration.py | Adds integration coverage for preferred-account selection + fallback. |
| tests/conftest.py | Requires explicit PostgreSQL test DSN env vars; switches temp key file to tmp_path. |
| README.md | Updates runtime instructions/docs to require Neon PostgreSQL and separate migration DSN. |
| .env.example | Updates env template to Neon PostgreSQL runtime + migration DSNs (and sets encryption key file). |
| docker-compose.yml | Removes local Postgres service/volume. |
| openspec/specs/responses-api-compat/spec.md | Updates spec to define durable previous_response_id replay + routing continuity requirements. |
| openspec/specs/responses-api-compat/context.md | Updates context to reflect proxy-local previous_response_id support and error semantics. |
| openspec/specs/api-keys/spec.md | Updates spec to require timezone-aware expiresAt acceptance + normalization. |
| openspec/specs/database-migrations/spec.md | Updates migration spec to use resolved migration DSN. |
| openspec/specs/database-migrations/context.md | Updates migration DSN resolution and removes sqlite backup notes. |
| openspec/specs/database-backends/spec.md | Updates backend spec to require Neon PostgreSQL and explicit test DSNs. |
| openspec/specs/database-backends/context.md | Updates backend context for Neon pooled vs direct DSNs. |
| openspec/specs/query-caching/context.md | Updates context wording for PostgreSQL-backed runtime. |
| openspec/changes/support-previous-response-id-persistence/* | Adds proposal/design/tasks + modified specs for durable previous_response_id support. |
| openspec/changes/neon-postgresql-runtime/* | Adds proposal/tasks + modified specs for Neon-first runtime behavior. |
| openspec/changes/fix-api-key-expiration-timezone/* | Adds proposal/tasks + modified spec for expiry timezone normalization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
app/db/alembic/versions/20260315_120000_add_response_snapshots.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 972251abce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 72cd7e8885
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| resolved_request = await self._resolve_previous_response_request( | ||
| payload, | ||
| api_key_id=api_key.id if api_key else None, | ||
| ) |
There was a problem hiding this comment.
Move previous_response lookup inside settlement guard
_resolve_previous_response_request() now runs before the try/finally that releases stream API-key reservations, so any non-ProxyResponseError raised while loading snapshots (for example a transient DB failure during repos.response_snapshots.get) exits _stream_with_retry() without releasing api_key_reservation. Because reservations are created before streaming starts, repeated failures on requests that use previous_response_id can accumulate leaked reservations and block that key from further traffic.
Useful? React with 👍 / 👎.
|
Follow-up fix for the first-turn WebSocket transport fallback issue. This patch updates the non-terminal upstream disconnect path to:
Regression coverage added for:
Verification run locally:
|
|
Added an image-based VPS deploy path in this branch: manual branch image build workflow () and for pull-based deploys. Triggering the first image build now to avoid VPS OOM during local Docker builds. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a80159f1d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| await self._handle_stream_error(account, {"message": disconnect_message}, "stream_incomplete") | ||
| retried = await self._retry_websocket_request_after_disconnect( |
There was a problem hiding this comment.
Avoid penalizing account on idle websocket disconnects
This disconnect path now records stream_incomplete against the current account before checking whether any requests are actually in flight. When the upstream websocket closes during an idle period (for example, normal upstream idle timeout between response.create calls), _handle_stream_error still runs and record_error marks a healthy account as failing, which can skew later account selection and cause avoidable failover. The penalty should only be applied when a pending request was interrupted (or after retry failure for such a request).
Useful? React with 👍 / 👎.
|
I don't understand the clear reason for removing sqlite |
|
does this fix the invalid_request_error errors in codex 106 mentioned in #221 ? |
|
thanks, this makes sense. i went through #211 again against current main and i agree sqlite should stay the default and postgres should stay optional. also, the cache/continuity side of this branch is no longer best carried by #211 as-is. main already moved forward with the bridge-based path in #220, #232, #236, and #239, so i’m not going to push the neon/sqlite runtime rewrite or the snapshot persistence approach from this branch. i split out the remaining unmerged piece that still looked useful: timezone-aware api key expiration normalization. that pr is here: #249 so from my side, #211 should be treated as superseded rather than trimmed down in place. |
|
Following up on the split request here: I opened #270 as the standalone SQLite-first version of the useful continuity/cache work from this branch. What #270 does:
So if the goal is to keep the session-cache/continuity improvement without changing the repo's DB default, please review #270 instead of trimming #211 in place. Full credit to @xirothedev for the original continuity implementation and follow-up fixes in #211. #270 just isolates the SQLite-compatible subset onto current |
Closes #210
Summary
response_idand replayprevious_response_idacross collected HTTP responses, streaming HTTP responses, and WebSocket Responses trafficCODEX_LB_DATABASE_MIGRATION_URL, and remove SQLite-first runtime assumptions from the main execution pathRoot Cause
previous_response_idcontinuity gapThe proxy could not delegate
previous_response_idto ChatGPT upstream, but it also did not persist enough local state to rebuild prior turns. That left/v1/responsesincompatible with clients that continue a conversation by referencing the previous response id.Runtime database contract drift
The repo still defaulted to SQLite-oriented runtime behavior in config, startup, and docs even though the intended deployment model is Neon/PostgreSQL. That mismatch made the runtime contract and operator guidance diverge.
Timezone-aware API key expiration failures
expiresAtvalues from the dashboard include timezone information, but the backend wrote them into atimestamp without time zonecolumn without first normalizing them, which caused PostgreSQL writes to fail.Changes
previous_response_idreplay, shared output-item accumulation, and preferred-account reuse/fallback for/v1/responsesover HTTP and WebSocketTesting
.venv/bin/python -m compileall app tests.venv/bin/pytest tests/unit/test_api_keys_service.py tests/unit/test_db_session.py tests/unit/test_openai_requests.py tests/unit/test_proxy_utils.py tests/integration/test_migrations.py tests/integration/test_openai_compat_features.py tests/integration/test_proxy_websocket_responses.py tests/integration/test_load_balancer_integration.pycould not run in this environment becauseCODEX_LB_TEST_DATABASE_URLis not setopenspec validate --specscould not run in this environment because theopenspecCLI is not installed