Skip to content

fix: harden data room uploads and backfill view schema#1

Open
joelev wants to merge 1 commit intosdamico:mainfrom
joelev:fix/data-room-upload-backfill
Open

fix: harden data room uploads and backfill view schema#1
joelev wants to merge 1 commit intosdamico:mainfrom
joelev:fix/data-room-upload-backfill

Conversation

@joelev
Copy link
Contributor

@joelev joelev commented Mar 2, 2026

Problem

Data room upload/access paths had multiple correctness and security gaps:

  • Chunked uploads enforced the wrong size limit and could exceed the intended 50MB file cap.
  • Chunk append operations did not consistently update size_bytes and could succeed without checking post-append size constraints atomically.
  • Entry-page metadata updates did not persist mime_type.
  • Protected data-room page responses were cacheable (public, max-age=3600), which is unsafe for access-controlled content.
  • Existing environments could reference views/view_files/data_room_access.view_id from newer code paths without guaranteed schema presence.

Root Cause

  • Validation logic drift between upload initialization and chunk append paths.
  • Missing atomic guardrail in SQL update for append-size enforcement.
  • Incomplete schema evolution for legacy databases where views artifacts were not present.
  • Cache headers on protected endpoints were optimized for performance rather than confidentiality.

What Changed

api/admin/data-room-upload.js

  • Added shared MIME validation helper and enforced MIME checks in upload init path.
  • Added validation for invalid sizeBytes values.
  • Switched chunk request-size cutoff from MAX_FILE_SIZE to MAX_UPLOAD_BYTES (base64 transport overhead aware).
  • Made chunk append update atomic: append + recompute size_bytes + enforce max size in one SQL statement.
  • Added explicit 404 vs 413 handling when append update does not return a row.
  • Ensured finalize path synchronizes size_bytes from actual bytea length.
  • Entry-page init update now persists mime_type.

api/data/pages.js

  • Replaced cacheable headers with strict no-cache headers for protected file responses:
    • Cache-Control: no-store, no-cache, must-revalidate
    • Pragma: no-cache
    • Expires: 0

migrations/012_view_schema_backfill.sql

  • Added idempotent backfill for view access schema:
    • views table
    • view_files table + indexes
    • data_room_access.view_id + index
  • Designed as safe no-op on fresh installs and corrective for older deployments.

Migration Impact

  • Forward-compatible and idempotent (IF NOT EXISTS / ADD COLUMN IF NOT EXISTS).
  • No destructive DDL.
  • Existing data remains intact; only missing structures are created.

Risk Assessment

  • Low to medium: upload path behavior is stricter and may reject malformed previously-tolerated requests.
  • Low: cache policy becomes more restrictive (expected for sensitive content).
  • Low: migration is additive/idempotent.

Rollback Plan

  1. Revert commit a219ac0 if regressions are found.
  2. Application rollback is immediate; migration objects are additive and can remain in place safely.
  3. If required, isolate rollback to API behavior by reverting only api/admin/data-room-upload.js and api/data/pages.js while leaving schema backfill intact.

Validation

  • npm run build passes:
    • Built content/page.html (63624 bytes, 9 slides)
  • No test, lint, or typecheck scripts are currently defined in package.json.

@joelev
Copy link
Contributor Author

joelev commented Mar 2, 2026

Reviewer checklist for this PR:

  1. Upload correctness
  • Confirm chunked uploads now enforce transport size using MAX_UPLOAD_BYTES and final stored size using atomic SQL guard (<= MAX_FILE_SIZE).
  • Confirm size_bytes remains accurate after append/finalize paths.
  1. Validation/security
  • Confirm MIME validation is enforced consistently in upload init and standard upload flows.
  • Confirm invalid sizeBytes inputs fail fast with 400.
  1. Access-controlled content caching
  • Confirm protected responses in api/data/pages.js are non-cacheable (no-store/no-cache headers) and this is acceptable for UX/perf tradeoff.
  1. Schema/migration safety
  • Confirm migrations/012_view_schema_backfill.sql is additive and idempotent in production-like DB states.
  • Confirm no conflict with existing migrations that reference view_id/views.
  1. Regression spot-check
  • Manual smoke test: standard upload, chunked upload, and data-room page load for an authorized user.
  • Manual negative test: over-limit chunked upload should fail with 413.

@joelev
Copy link
Contributor Author

joelev commented Mar 2, 2026

Validation update (local smoke harness):

Executed 4 targeted scenarios against the changed handlers with mocked auth/DB dependencies to isolate behavior in this PR:

  1. action=simple upload path
  • Result: PASS (201), persisted expected size_bytes.
  1. Chunked upload flow (action=init -> action=chunk -> action=finish)
  • Result: PASS (200 chain), final size_bytes synchronized correctly.
  1. Over-limit chunked upload (> 50MB request payload)
  • Result: PASS (413), request stream aborted as expected.
  1. Authorized api/data/pages asset fetch caching policy
  • Result: PASS (200), response headers include:
    • Cache-Control: no-store, no-cache, must-revalidate
    • Pragma: no-cache
    • Expires: 0

Additional validation:

  • npm run build passes (Built content/page.html (63624 bytes, 9 slides)).

Scope note:

  • This repo currently has no test/lint/typecheck scripts. The smoke checks were run via a local Node harness with mocked DB/auth modules to validate the exact changed code paths deterministically.

Copy link
Owner

@sdamico sdamico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work overall — the cache-control fix on protected assets is an important security improvement, the atomic chunk enforcement (enforcing MAX_FILE_SIZE in the WHERE clause) is a solid approach, MIME validation is well-factored into isAllowedMimeType, and the migration backfill is clean and safe.

Two things worth addressing before merging:


Required: MAX_UPLOAD_BYTES must account for base64 overhead

MAX_UPLOAD_BYTES and MAX_FILE_SIZE are both set to 50 * 1024 * 1024:

// api/admin/data-room-upload.js, lines 4-5
const MAX_FILE_SIZE   = 50 * 1024 * 1024; // 50MB
const MAX_UPLOAD_BYTES = 50 * 1024 * 1024; // 50MB  ← bug

The chunk action reads raw base64 text from the request body and gates it with MAX_UPLOAD_BYTES. But a 50 MB binary file, when base64-encoded, is ~66.7 MB on the wire (~33% overhead). Using the same cap for both means a valid 50 MB file will be rejected mid-transfer with a 413 before it even reaches the SQL layer.

The fix is to size MAX_UPLOAD_BYTES to cover the largest base64-encoded payload that could represent a MAX_FILE_SIZE file:

const MAX_FILE_SIZE    = 50 * 1024 * 1024;
const MAX_UPLOAD_BYTES = Math.ceil(MAX_FILE_SIZE * 4 / 3); // ~66.7 MB, covers base64 overhead

This keeps the actual stored size enforced at the DB layer (the WHERE clause you added), and MAX_UPLOAD_BYTES becomes a transport-layer guard that won't prematurely kill valid uploads.


Minor: decode(chunkBase64, 'base64') is called twice in the chunk UPDATE

In the new chunk append query:

SET
  content = content || decode(${chunkBase64}, 'base64'),
  size_bytes = octet_length(content) + octet_length(decode(${chunkBase64}, 'base64'))
WHERE id = ${id}
  AND octet_length(content) + octet_length(decode(${chunkBase64}, 'base64')) <= ${MAX_FILE_SIZE}

decode(chunkBase64, 'base64') appears three times — once in SET, once in the size_bytes expression, and once in the WHERE predicate. Postgres may optimize this away, but it's not guaranteed to. A CTE avoids the ambiguity:

WITH decoded AS (
  SELECT decode(${chunkBase64}, 'base64') AS chunk_bytes
)
UPDATE data_room_files
SET
  content    = content || (SELECT chunk_bytes FROM decoded),
  size_bytes = octet_length(content) + octet_length((SELECT chunk_bytes FROM decoded))
WHERE id = ${id}
  AND octet_length(content) + octet_length((SELECT chunk_bytes FROM decoded)) <= ${MAX_FILE_SIZE}
RETURNING id

This is a minor optimization — the correctness issue with MAX_UPLOAD_BYTES is the one that needs fixing before ship.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants