
Incremental per-file change detection #20

@johnlanda

Description

Summary

Currently, if any file in a dependency changes, all files are re-chunked and re-embedded because the content hash covers all files as a unit. Implement per-file content hashing so only changed files trigger re-processing, reducing incremental sync time by 50-90%.

Context

The hasher in internal/hasher/hasher.go computes a single ContentHash by sorting all files by path and hashing the concatenation. This means changing one file in a 200-file dependency invalidates the entire store key, forcing re-chunking and re-embedding of all 200 files. The store key (StoreKey) is derived from content hash + embedding model + chunking config.
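The current behavior can be sketched roughly as follows. This is a minimal illustration of "sort all files by path and hash the concatenation", not the actual `hasher.go` API; the function name `aggregateHash` is made up for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// aggregateHash illustrates the current scheme: one hash over every file
// in the dependency, in sorted path order. Changing any single file
// changes the result, invalidating the whole store key.
func aggregateHash(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths)

	h := sha256.New()
	for _, p := range paths {
		h.Write([]byte(p)) // include the path so renames are detected
		h.Write(files[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	files := map[string][]byte{"a.go": []byte("A"), "b.go": []byte("B")}
	before := aggregateHash(files)
	files["b.go"] = []byte("B changed") // touch one file…
	after := aggregateHash(files)
	fmt.Println(before != after) // …and the dependency-wide hash changes
}
```

Because the store key is derived from this single hash, the store has no way to tell which file changed, so everything downstream (chunking, embedding, upsert) runs for all files.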

Key files:

  • internal/hasher/hasher.go — ContentHash() (line ~26) hashes all files together; StoreKey() (line ~48) derives the store key
  • internal/pipeline/pipeline.go — ProcessFiles() processes all files for a dependency as one unit
  • internal/lockfile/lockfile.go — stores one content_hash and store_key per source
  • internal/store/lancedb.go — Upsert() deletes all chunks for a store key then re-inserts

Acceptance Criteria

  • Per-file content hashes are computed and stored (in lockfile or separate metadata)
  • On sync, only files whose content hash changed are re-chunked and re-embedded
  • Unchanged file chunks are preserved in the store (not deleted and re-inserted)
  • The overall dependency store key still changes when any file changes (for lockfile correctness)
  • Incremental sync of a 100-file dependency with 1 file changed re-processes only that file
  • Full re-sync still works correctly when embedding model or chunking config changes

Technical Approach

  1. Extend lockfile to store per-file content hashes (e.g., file_hashes map in source entry)
  2. On sync, compute per-file hashes and compare with lockfile
  3. Only process changed/new/deleted files through the chunk→embed pipeline
  4. For the store: use per-file store keys (e.g., sha256(dep_store_key + file_path)) so individual files can be upserted/deleted independently
  5. Keep the top-level store_key as the aggregate for lockfile change detection
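Steps 2–4 above can be sketched as follows. This is a hedged illustration, not the final implementation: the function names (`fileHash`, `diff`, `fileStoreKey`) and the shape of the lockfile map are assumptions for the example; only the key-derivation formula `sha256(dep_store_key + file_path)` comes from step 4.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fileHash computes a per-file content hash (step 2).
func fileHash(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// diff compares freshly computed per-file hashes against the hashes
// recorded in the lockfile's file_hashes map, returning the files that
// need re-processing and the files whose chunks should be deleted (step 3).
func diff(current, lockfile map[string]string) (changed, deleted []string) {
	for path, h := range current {
		if lockfile[path] != h { // new or modified file
			changed = append(changed, path)
		}
	}
	for path := range lockfile {
		if _, ok := current[path]; !ok { // removed file
			deleted = append(deleted, path)
		}
	}
	sort.Strings(changed)
	sort.Strings(deleted)
	return
}

// fileStoreKey derives a per-file store key from the dependency-level
// store key (step 4), so each file's chunks can be upserted or deleted
// independently in the store.
func fileStoreKey(depStoreKey, path string) string {
	sum := sha256.Sum256([]byte(depStoreKey + path))
	return hex.EncodeToString(sum[:])
}

func main() {
	lock := map[string]string{
		"a.go": fileHash([]byte("A")),
		"b.go": fileHash([]byte("B")),
		"c.go": fileHash([]byte("C")),
	}
	current := map[string]string{
		"a.go": fileHash([]byte("A")),          // unchanged
		"b.go": fileHash([]byte("B modified")), // modified
		"d.go": fileHash([]byte("D")),          // new
	}
	changed, deleted := diff(current, lock)
	fmt.Println(changed) // [b.go d.go]
	fmt.Println(deleted) // [c.go]
	fmt.Println(len(fileStoreKey("dep-store-key", "b.go")) == 64)
}
```

With this split, only `changed` flows through the chunk→embed pipeline and only the keys for `deleted` are removed from the store, while the top-level store_key (step 5) remains the aggregate signal for lockfile change detection.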

Dependencies

None — but interacts with the store's upsert/delete patterns.

Out of Scope

  • Changes to the chunker implementations
  • Parallel processing of changed files (covered by parallel chunking issue)
