Summary
Currently, if any file in a dependency changes, all files are re-chunked and re-embedded because the content hash covers all files as a unit. Implement per-file content hashing so only changed files trigger re-processing, reducing incremental sync time by 50-90%.
Context
The hasher in `internal/hasher/hasher.go` computes a single `ContentHash` by sorting all files by path and hashing the concatenation. This means changing one file in a 200-file dependency invalidates the entire store key, forcing re-chunking and re-embedding of all 200 files. The store key (`StoreKey`) is derived from content hash + embedding model + chunking config.
Key files:
- `internal/hasher/hasher.go` — `ContentHash()` (line ~26) hashes all files together; `StoreKey()` (line ~48) derives the store key
- `internal/pipeline/pipeline.go` — `ProcessFiles()` processes all files for a dependency as one unit
- `internal/lockfile/lockfile.go` — stores one `content_hash` and `store_key` per source
- `internal/store/lancedb.go` — `Upsert()` deletes all chunks for a store key then re-inserts
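To make the failure mode concrete, here is a minimal sketch of the aggregate scheme described above: sort by path, hash the concatenation. The function name and signature are illustrative, not the actual `hasher.go` API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// contentHash sketches the current aggregate scheme: sort files by
// path, then feed each path and its contents into one SHA-256 digest.
// A change to any single file therefore changes the whole digest.
func contentHash(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	h := sha256.New()
	for _, p := range paths {
		h.Write([]byte(p))
		h.Write(files[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := map[string][]byte{"a.go": []byte("x"), "b.go": []byte("y")}
	after := map[string][]byte{"a.go": []byte("x"), "b.go": []byte("CHANGED")}
	// One modified file out of two flips the aggregate digest,
	// invalidating the store key for every file in the dependency.
	fmt.Println(contentHash(before) != contentHash(after)) // prints true
}
```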
Acceptance Criteria
Technical Approach
- Extend the lockfile to store per-file content hashes (e.g., a `file_hashes` map in each source entry)
- On sync, compute per-file hashes and compare them with the lockfile
- Only process changed/new/deleted files through the chunk→embed pipeline
- For the store: use per-file store keys (e.g., `sha256(dep_store_key + file_path)`) so individual files can be upserted/deleted independently
- Keep the top-level `store_key` as the aggregate for lockfile change detection
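The diff-and-derive steps above can be sketched as follows. `diffFiles` and `fileStoreKey` are hypothetical helper names, assuming the lockfile's `file_hashes` map is path → hex digest as proposed; this is a sketch of the approach, not the real API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fileStoreKey derives an independent store key per file from the
// dependency's store key, per the proposal: sha256(dep_store_key + file_path).
func fileStoreKey(depStoreKey, filePath string) string {
	sum := sha256.Sum256([]byte(depStoreKey + filePath))
	return hex.EncodeToString(sum[:])
}

// diffFiles compares freshly computed per-file hashes against the
// lockfile's file_hashes map: entries with a new or missing locked hash
// need (re)processing; locked entries absent from current need deletion
// from the store.
func diffFiles(current, locked map[string]string) (changed, deleted []string) {
	for path, h := range current {
		if locked[path] != h {
			changed = append(changed, path) // new or modified file
		}
	}
	for path := range locked {
		if _, ok := current[path]; !ok {
			deleted = append(deleted, path) // removed file
		}
	}
	return changed, deleted
}

func main() {
	locked := map[string]string{"a.go": "h1", "b.go": "h2"}
	current := map[string]string{"a.go": "h1", "b.go": "h2-new", "c.go": "h3"}
	changed, deleted := diffFiles(current, locked)
	// Only b.go (modified) and c.go (new) go through chunk→embed;
	// each is upserted under its own per-file store key.
	fmt.Println("reprocess:", changed, "delete:", deleted)
	fmt.Println("b.go key:", fileStoreKey("depkey", "b.go"))
}
```

Because each file has its own store key, the store's delete-then-insert upsert pattern can be applied per file instead of wiping every chunk under the dependency's key.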
Dependencies
None — but interacts with the store's upsert/delete patterns.
Out of Scope
- Changes to the chunker implementations
- Parallel processing of changed files (covered by parallel chunking issue)