
task(dataset): Redirect multipart upload through File Service #4110

@carloea2

Task Summary

The current dataset upload flow exposes MinIO/LakeFS presigned URLs to the client and lets the browser upload file parts directly to object storage. The new design moves the File Service into the data path and introduces server-side upload session state keyed by filePath, uid, and did.
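As a rough illustration of session state keyed by (uid, did, filePath), here is a minimal in-memory sketch. The class and method names (`SessionRegistry`, `init_upload`, `record_part`, `is_complete`) are hypothetical and not taken from the File Service; the real design persists this state in the database tables described in the next section.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical key shape: (uid, did, file_path)
SessionKey = Tuple[int, int, str]


@dataclass
class UploadSession:
    upload_id: str
    num_parts_requested: int
    # part_number -> etag reported by object storage for that part
    etags: Dict[int, str] = field(default_factory=dict)


class SessionRegistry:
    """Illustrative only: at most one active session per (uid, did, file_path)."""

    def __init__(self) -> None:
        self._sessions: Dict[SessionKey, UploadSession] = {}

    def init_upload(self, uid: int, did: int, file_path: str,
                    upload_id: str, num_parts: int) -> UploadSession:
        key = (uid, did, file_path)
        if key in self._sessions:
            raise ValueError("an upload session already exists for this key")
        session = UploadSession(upload_id, num_parts)
        self._sessions[key] = session
        return session

    def record_part(self, uid: int, did: int, file_path: str,
                    part_number: int, etag: str) -> None:
        session = self._sessions[(uid, did, file_path)]
        if not 1 <= part_number <= session.num_parts_requested:
            raise ValueError("part number out of range")
        # Re-uploading a part simply overwrites its etag (retry-friendly).
        session.etags[part_number] = etag

    def is_complete(self, uid: int, did: int, file_path: str) -> bool:
        session = self._sessions[(uid, did, file_path)]
        return len(session.etags) == session.num_parts_requested
```

Because the client only ever refers to the session by its key, no presigned URL has to leave the server in this model.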

New DB schema (multipart upload session + parts)

| Field / Property | dataset_upload_session | dataset_upload_session_part |
| --- | --- | --- |
| Purpose | Tracks one active multipart upload session for a dataset file (per user + dataset + file path). | Tracks per-part completion state for a multipart upload (stores etag needed for finalize). |
| Primary key | (uid, did, file_path) | (upload_id, part_number) |
| Key columns | upload_id (UNIQUE), physical_address, num_parts_requested | etag |
| Defaults | | etag TEXT NOT NULL DEFAULT '' |
| Checks | | CHECK (part_number > 0) |
| Foreign keys | did → dataset(did) ON DELETE CASCADE; uid → "user"(uid) ON DELETE CASCADE | upload_id → dataset_upload_session(upload_id) ON DELETE CASCADE |
| Cleanup behavior | Deleting a session deletes the session row (and cascades to parts). | Deleted automatically when the parent session is deleted. |
| Why this matters | Keeps server-side state (no presigned URLs). Enforces expected total parts. | Enables per-part locking, retries, and DB-based completeness validation (no listParts() call). |
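The schema in the table above could be sketched roughly as follows. This uses an in-memory SQLite database purely so the example runs on its own; the project's actual database engine, the column types, the stub `"user"` and `dataset` tables, and the helper names `open_db` and `is_complete` are all assumptions. It also assumes that part rows are created up front with the default empty etag and that a non-empty etag marks a finished part, which the table suggests but does not state.

```python
import sqlite3

# Column types are guesses; only keys, defaults, checks and cascades
# come from the table in the issue.
DDL = """
CREATE TABLE "user" (uid INTEGER PRIMARY KEY);   -- stub parent table
CREATE TABLE dataset (did INTEGER PRIMARY KEY);  -- stub parent table

CREATE TABLE dataset_upload_session (
    uid INTEGER NOT NULL REFERENCES "user"(uid) ON DELETE CASCADE,
    did INTEGER NOT NULL REFERENCES dataset(did) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    upload_id TEXT NOT NULL UNIQUE,
    physical_address TEXT NOT NULL,
    num_parts_requested INTEGER NOT NULL,
    PRIMARY KEY (uid, did, file_path)
);

CREATE TABLE dataset_upload_session_part (
    upload_id TEXT NOT NULL
        REFERENCES dataset_upload_session(upload_id) ON DELETE CASCADE,
    part_number INTEGER NOT NULL CHECK (part_number > 0),
    etag TEXT NOT NULL DEFAULT '',
    PRIMARY KEY (upload_id, part_number)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def is_complete(conn: sqlite3.Connection, upload_id: str) -> bool:
    """DB-based completeness check: every requested part has a non-empty etag."""
    row = conn.execute(
        """
        SELECT s.num_parts_requested,
               (SELECT COUNT(*)
                  FROM dataset_upload_session_part p
                 WHERE p.upload_id = s.upload_id AND p.etag <> '')
          FROM dataset_upload_session s
         WHERE s.upload_id = ?
        """,
        (upload_id,),
    ).fetchone()
    return row is not None and row[0] == row[1]
```

With this layout, finalizing an upload needs only the stored etags, and deleting the session row removes its part rows through the cascade.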

Current Behavior

[2 images: current upload flow]

New Behavior

[4 images: new upload flow]

Priority

P3 – Low

Task Type

  • Code Implementation
  • Documentation
  • Refactor / Cleanup
  • Testing / QA
  • DevOps / Deployment
