design: reconsider blob-level deduplication architecture #603

Background

In PR #595, we implemented direct S3 upload to bypass Vercel's 4.5MB limit. The original design included:

  • Version-level deduplication (sketched below): if the content hash matches an existing version, skip the upload ✅
  • Blob-level deduplication: upload individual file blobs separately to enable cross-version deduplication ❌
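
For context, version-level deduplication is essentially a content-hash lookup before any presigned URL is issued. A minimal sketch, assuming a SHA-256 hash and in-memory stand-ins for the real version table and S3 presigner (all names here are illustrative, not the actual schema or endpoint):

```typescript
import { createHash } from "node:crypto";

// Illustrative stand-ins for the real version table and S3 presigner.
const versionsByHash = new Map<string, { archiveKey: string }>();

async function presignArchiveUpload(key: string): Promise<string> {
  return `https://s3.example.com/${key}?signed`; // placeholder presigned URL
}

// Version-level dedup: if a version with the same content hash already
// exists, skip the upload entirely and reuse the existing archive.
export async function prepareUpload(archive: Buffer) {
  const contentHash = createHash("sha256").update(archive).digest("hex");

  const existing = versionsByHash.get(contentHash);
  if (existing) {
    return { deduplicated: true, archiveKey: existing.archiveKey };
  }

  const archiveKey = `archives/${contentHash}.tar.gz`;
  const uploadUrl = await presignArchiveUpload(archiveKey);
  versionsByHash.set(contentHash, { archiveKey });
  return { deduplicated: false, archiveKey, uploadUrl };
}
```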

Problem

The blob-level deduplication creates a paradox:

| Approach | Upload | Download | Issue |
| --- | --- | --- | --- |
| Only archive | archive.tar.gz | archive.tar.gz | No file-level dedup |
| Only blobs | Individual blobs | Server reconstructs | Vercel serverless limits |
| Both blobs + archive | Blobs + archive | archive | Double upload (wasteful) |

Current state:

  • The prepare endpoint generates presigned URLs for blobs that are never used
  • The CLI and Sandbox upload only the archive and manifest, not blobs
  • Blob table records are created, but no actual blob files exist in S3

Decision

For now, we will:

  1. Remove blob presigned URL generation - saves AWS API calls (see the response-shape sketch after this list)
  2. Remove blob ref counting - eliminates a race-condition risk
  3. Keep version-level deduplication - it already works and provides value
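
To make the first point concrete, the prepare response shrinks roughly as below. These shapes are hypothetical (the actual endpoint fields may differ); the point is that per-blob presigning, one S3 API call per file, goes away:

```typescript
// Hypothetical response shapes; actual field names may differ.

// Before: one presigned URL per blob, plus archive and manifest.
interface PrepareResponseBefore {
  archiveUrl: string;
  manifestUrl: string;
  blobUrls: Record<string, string>; // content hash -> presigned URL (never used)
}

// After: blob URLs dropped entirely; presign calls drop from N + 2 to 2.
interface PrepareResponseAfter {
  archiveUrl: string;
  manifestUrl: string;
}
```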

Future Consideration

If file-level deduplication becomes necessary (e.g., large repos with incremental changes), consider:

Option: Lambda async archive generation

  1. Client uploads only changed blobs
  2. Trigger AWS Lambda to generate archive from blobs
  3. Archive available after async processing

This requires additional infrastructure (Lambda, SQS) and is out of scope for now; a rough sketch of the Lambda side follows.
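
A minimal sketch of what that Lambda could look like, assuming the event carries a manifest mapping archive paths to blob keys. The bucket name, event shape, and key layout are all illustrative, and it uses tar-stream plus the AWS SDK v3 streaming upload rather than anything from this repo:

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { createGzip } from "node:zlib";
import * as tar from "tar-stream";

const s3 = new S3Client({});
const BUCKET = "example-bucket"; // illustrative

// Hypothetical event shape: the manifest maps archive paths to blob keys.
interface ManifestEntry {
  path: string;    // file path inside the archive
  blobKey: string; // S3 key of the already-uploaded blob
}

export async function handler(event: { versionId: string; entries: ManifestEntry[] }) {
  const archiveKey = `archives/${event.versionId}.tar.gz`;
  const pack = tar.pack();

  // Stream the gzipped tar straight back to S3 via multipart upload,
  // so the Lambda never buffers the whole archive at once.
  const upload = new Upload({
    client: s3,
    params: { Bucket: BUCKET, Key: archiveKey, Body: pack.pipe(createGzip()) },
  });
  const uploading = upload.done(); // start consuming before appending entries

  // Append each blob to the tar in manifest order.
  for (const { path, blobKey } of event.entries) {
    const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: blobKey }));
    const body = Buffer.from(await obj.Body!.transformToByteArray());
    pack.entry({ name: path, size: body.length }, body);
  }
  pack.finalize();

  await uploading;
  return { archiveKey };
}
```

The trigger (SQS message vs. direct invoke) and retry semantics are exactly the extra infrastructure the decision above defers.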
