design: reconsider blob-level deduplication architecture #603

Background

In PR #595, we implemented direct S3 upload to bypass Vercel's 4.5MB limit. The original design included:

  • Version-level deduplication (sketched below): if the content hash matches an existing version, skip the upload ✅
  • Blob-level deduplication: upload individual file blobs separately to enable cross-version deduplication ❌
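
For context, version-level deduplication is essentially a content-hash lookup before any presigned URL is issued. A minimal sketch, assuming a SHA-256 hash and in-memory stand-ins for the real version table and S3 presigner (all names here are illustrative, not the actual schema or endpoint):

```typescript
import { createHash } from "node:crypto";

// Illustrative stand-ins for the real version table and S3 presigner.
const versionsByHash = new Map<string, { archiveKey: string }>();

async function presignArchiveUpload(key: string): Promise<string> {
  return `https://s3.example.com/${key}?signed`; // placeholder presigned URL
}

// Version-level dedup: if a version with the same content hash already
// exists, skip the upload entirely and reuse the existing archive.
export async function prepareUpload(archive: Buffer) {
  const contentHash = createHash("sha256").update(archive).digest("hex");

  const existing = versionsByHash.get(contentHash);
  if (existing) {
    return { deduplicated: true, archiveKey: existing.archiveKey };
  }

  const archiveKey = `archives/${contentHash}.tar.gz`;
  const uploadUrl = await presignArchiveUpload(archiveKey);
  versionsByHash.set(contentHash, { archiveKey });
  return { deduplicated: false, archiveKey, uploadUrl };
}
```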

Problem

The blob-level deduplication creates a paradox:

| Approach | Upload | Download | Issue |
| --- | --- | --- | --- |
| Only archive | archive.tar.gz | archive.tar.gz | No file-level dedup |
| Only blobs | Individual blobs | Server reconstructs | Vercel serverless limits |
| Both blobs + archive | Blobs + archive | archive | Double upload (wasteful) |

Current state:

  • The prepare endpoint generates presigned URLs for blobs that are never used
  • The CLI and Sandbox upload only the archive and manifest, not blobs
  • Blob table records are created, but no actual blob files exist in S3

Decision

For now, we will:

  1. Remove blob presigned URL generation - saves AWS API calls (see the response-shape sketch after this list)
  2. Remove blob ref counting - eliminates a race-condition risk
  3. Keep version-level deduplication - it already works and provides value
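
To make the first point concrete, the prepare response shrinks roughly as below. These shapes are hypothetical (the actual endpoint fields may differ); the point is that per-blob presigning, one S3 API call per file, goes away:

```typescript
// Hypothetical response shapes; actual field names may differ.

// Before: one presigned URL per blob, plus archive and manifest.
interface PrepareResponseBefore {
  archiveUrl: string;
  manifestUrl: string;
  blobUrls: Record<string, string>; // content hash -> presigned URL (never used)
}

// After: blob URLs dropped entirely; presign calls drop from N + 2 to 2.
interface PrepareResponseAfter {
  archiveUrl: string;
  manifestUrl: string;
}
```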

Future Consideration

If file-level deduplication becomes necessary (e.g., large repos with incremental changes), consider:

Option: Lambda async archive generation

  1. Client uploads only changed blobs
  2. Trigger AWS Lambda to generate archive from blobs
  3. Archive available after async processing

This requires additional infrastructure (Lambda, SQS) and is out of scope for now; a rough sketch of the Lambda side follows.
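
A minimal sketch of what that Lambda could look like, assuming the event carries a manifest mapping archive paths to blob keys. The bucket name, event shape, and key layout are all illustrative, and it uses tar-stream plus the AWS SDK v3 streaming upload rather than anything from this repo:

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { createGzip } from "node:zlib";
import * as tar from "tar-stream";

const s3 = new S3Client({});
const BUCKET = "example-bucket"; // illustrative

// Hypothetical event shape: the manifest maps archive paths to blob keys.
interface ManifestEntry {
  path: string;    // file path inside the archive
  blobKey: string; // S3 key of the already-uploaded blob
}

export async function handler(event: { versionId: string; entries: ManifestEntry[] }) {
  const archiveKey = `archives/${event.versionId}.tar.gz`;
  const pack = tar.pack();

  // Stream the gzipped tar straight back to S3 via multipart upload,
  // so the Lambda never buffers the whole archive at once.
  const upload = new Upload({
    client: s3,
    params: { Bucket: BUCKET, Key: archiveKey, Body: pack.pipe(createGzip()) },
  });
  const uploading = upload.done(); // start consuming before appending entries

  // Append each blob to the tar in manifest order.
  for (const { path, blobKey } of event.entries) {
    const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: blobKey }));
    const body = Buffer.from(await obj.Body!.transformToByteArray());
    pack.entry({ name: path, size: body.length }, body);
  }
  pack.finalize();

  await uploading;
  return { archiveKey };
}
```

The trigger (SQS message vs. direct invoke) and retry semantics are exactly the extra infrastructure the decision above defers.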
