@pingSubhajit

Summary

This PR refactors the document ingestion logic to use a proper UNIQUE constraint on documents.source_id, enabling atomic upsert-by-sourceId semantics that are safe under concurrent writes.

Problem

The previous implementation used a "delete existing document by source_id, then insert" pattern. This approach is NOT safe under concurrent writes: if two ingest calls for the same sourceId run simultaneously, both can complete their delete before either one inserts, and both then insert, leaving duplicate documents for the same source_id.
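The interleaving can be reproduced with a small in-memory model. This is only a sketch: `NaiveDocumentsTable` and `simulateRace` are hypothetical names that stand in for a `documents` table without a UNIQUE constraint, and the two concurrent calls are replayed step by step rather than run in parallel.

```typescript
// Hypothetical in-memory stand-in for a documents table WITHOUT a UNIQUE constraint.
type DocRow = { id: string; sourceId: string };

class NaiveDocumentsTable {
  rows: DocRow[] = [];
  private nextId = 1;

  // Step 1 of the old pattern: delete any document with this source_id.
  deleteBySourceId(sourceId: string): void {
    this.rows = this.rows.filter((row) => row.sourceId !== sourceId);
  }

  // Step 2 of the old pattern: insert a fresh document row.
  insert(sourceId: string): string {
    const id = `doc-${this.nextId++}`;
    this.rows.push({ id, sourceId });
    return id;
  }
}

// Replays two ingest calls (A and B) for the same sourceId in the
// problematic order: A deletes, B deletes, A inserts, B inserts.
function simulateRace(sourceId: string): DocRow[] {
  const table = new NaiveDocumentsTable();
  table.insert(sourceId); // document left over from an earlier ingest
  table.deleteBySourceId(sourceId); // call A: delete
  table.deleteBySourceId(sourceId); // call B: delete (nothing left to remove)
  table.insert(sourceId); // call A: insert
  table.insert(sourceId); // call B: insert
  return table.rows;
}

const raceResult = simulateRace("docs/readme.md");
console.log(raceResult.length); // 2: two rows now share one source_id
```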

Solution

  • Add a UNIQUE constraint to documents.source_id in the database schema
  • Replace the delete-then-insert pattern with ON CONFLICT (source_id) DO UPDATE for atomic upsert
  • Change upsert() to return { documentId: string } - the canonical (stable) document ID
  • Delete chunks by document_id after upserting the document, instead of deleting documents by source_id
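As a rough sketch of the steps above (not the code from the adapters in this PR), the upsert and chunk cleanup could look like the following. The `SqlClient` interface, the `candidateId` parameter, the `chunks` table name, and every column other than `source_id` and `document_id` are assumptions made for illustration.

```typescript
// Minimal client shape assumed for this sketch; the real adapters use
// Drizzle, Prisma, or raw SQL and will differ.
interface SqlClient {
  query(sql: string, params: unknown[]): Promise<{ rows: Array<{ id: string }> }>;
}

// Column names other than source_id are assumptions.
const UPSERT_DOCUMENT_SQL = `
  INSERT INTO documents (id, source_id, content)
  VALUES ($1, $2, $3)
  ON CONFLICT (source_id) DO UPDATE
    SET content = EXCLUDED.content
  RETURNING id`;

// Table name "chunks" is an assumption; document_id comes from the PR text.
const DELETE_CHUNKS_SQL = `DELETE FROM chunks WHERE document_id = $1`;

async function upsertDocument(
  client: SqlClient,
  doc: { candidateId: string; sourceId: string; content: string },
): Promise<{ documentId: string }> {
  // When source_id already exists, the existing row is updated and ITS id is
  // returned; candidateId is only used when a new row is inserted.
  const result = await client.query(UPSERT_DOCUMENT_SQL, [
    doc.candidateId,
    doc.sourceId,
    doc.content,
  ]);
  const row = result.rows[0];
  if (!row) throw new Error("upsert returned no row");

  // Remove stale chunks for the canonical document before new ones are written.
  await client.query(DELETE_CHUNKS_SQL, [row.id]);
  return { documentId: row.id };
}
```

In practice both statements would likely run inside one transaction so that readers never see a document without its chunks; that wiring is left out here.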

Changes

Schema Updates

  • documents.source_id now requires a UNIQUE constraint
  • Documented a recommended HNSW index for vector search performance

Store Adapters (all updated with same pattern)

  • drizzle-postgres-pgvector
  • prisma-postgres-pgvector
  • raw-sql-postgres-pgvector

Core

  • ingest.ts: Uses the canonical documentId returned from store.upsert()
  • types.ts: Updated VectorStore.upsert() signature to return { documentId: string }
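To illustrate why the returned ID matters, here is a minimal sketch of the contract with an in-memory store. Only the `{ documentId: string }` return shape comes from this PR; the `Chunk` type, the `upsert` input shape, the `Promise` wrapper, and `InMemoryStore` are assumptions and do not mirror the real `types.ts` or `ingest.ts`.

```typescript
// Assumed chunk shape, for illustration only.
type Chunk = { content: string; embedding: number[] };

interface VectorStore {
  // Returns the canonical (stable) document ID for the given sourceId.
  upsert(input: { sourceId: string; chunks: Chunk[] }): Promise<{ documentId: string }>;
}

// Hypothetical in-memory store that mimics upsert-by-sourceId semantics.
class InMemoryStore implements VectorStore {
  private documentIds = new Map<string, string>();
  readonly chunksByDocument = new Map<string, Chunk[]>();
  private counter = 0;

  async upsert(input: { sourceId: string; chunks: Chunk[] }): Promise<{ documentId: string }> {
    // Reuse the existing ID when the sourceId is already known.
    let documentId = this.documentIds.get(input.sourceId);
    if (documentId === undefined) {
      this.counter += 1;
      documentId = `doc-${this.counter}`;
      this.documentIds.set(input.sourceId, documentId);
    }
    // Replace the chunks stored for that document.
    this.chunksByDocument.set(documentId, input.chunks.slice());
    return { documentId };
  }
}

// The caller relies on the ID the store returns instead of one it generated itself.
async function ingest(store: VectorStore, sourceId: string, chunks: Chunk[]): Promise<string> {
  const { documentId } = await store.upsert({ sourceId, chunks });
  return documentId;
}
```

Ingesting the same sourceId twice through this sketch yields the same documentId both times, with the second call's chunks replacing the first.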

Doctor Command

  • New check: db-sourceid-unique - Verifies UNIQUE constraint exists on documents.source_id
  • New check: db-sourceid-duplicates - Detects duplicate source_id values that must be resolved before adding the constraint
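One way such checks could be structured is sketched below. The check IDs are taken from this PR, but `CheckResult`, both function names, and the idea of matching against `pg_indexes.indexdef` strings are assumptions; the actual doctor command may detect the constraint differently.

```typescript
type CheckResult = { id: string; ok: boolean; detail: string };

// Input: indexdef strings, e.g. from
//   SELECT indexdef FROM pg_indexes WHERE tablename = 'documents';
// A UNIQUE constraint is backed by a unique index, so a plain unique index
// on (source_id) is treated as satisfying the check.
function checkSourceIdUnique(indexDefs: string[]): CheckResult {
  const pattern = /^CREATE UNIQUE INDEX .+\(source_id\)$/i;
  const ok = indexDefs.some((def) => pattern.test(def.trim()));
  return {
    id: "db-sourceid-unique",
    ok,
    detail: ok
      ? "UNIQUE index found on documents.source_id"
      : "No UNIQUE index found on documents.source_id",
  };
}

// Input: rows from a GROUP BY source_id query that include the per-value count.
function checkSourceIdDuplicates(
  rows: Array<{ source_id: string; count: number }>,
): CheckResult {
  const duplicated = rows.filter((row) => row.count > 1);
  return {
    id: "db-sourceid-duplicates",
    ok: duplicated.length === 0,
    detail:
      duplicated.length === 0
        ? "No duplicate source_id values"
        : `Duplicate source_id values: ${duplicated.map((row) => row.source_id).join(", ")}`,
  };
}
```

Keeping the checks as pure functions over query results makes them easy to unit test without a database connection.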

Documentation

  • Updated quickstart guide with new schema requirements
  • Updated internal docs (unrag.md)

Tests

  • Updated store tests to verify new upsert behavior
  • Added test for empty chunks array validation

Migration Notes

Existing users will need to:

  1. Check for duplicate source_id values:
    SELECT source_id, COUNT(*) FROM documents GROUP BY source_id HAVING COUNT(*) > 1;

@pingSubhajit pingSubhajit self-assigned this Jan 9, 2026
@pingSubhajit pingSubhajit merged commit 0635fed into main Jan 9, 2026
3 checks passed
@pingSubhajit pingSubhajit deleted the fix/refactor-sourcid-as-unique-key branch January 9, 2026 17:39