Conversation
Pull request overview
Refactors the scanning/dedupe/restore pipelines to reduce redb write-transaction overhead by batching state updates and (for scanning) moving DB I/O onto a dedicated writer thread, aiming to keep worker threads CPU-bound on hashing.
Changes:
- Introduces `DbOp` + `State::batch_write`/`batch_write_from_channel` to batch and apply DB mutations.
- Updates `scan_pipeline`, `dedupe_groups`, and `restore_pipeline` to buffer DB operations and flush in ~1,000-op batches.
- Adds benchmark setup + README/CHANGELOG updates describing performance results and how to reproduce them.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/state.rs | Adds DbOp and batched DB write APIs, including a channel-driven writer loop. |
| src/main.rs | Routes state mutations through batching/channel writer and adds batch flush logic in pipelines. |
| benchmarks/setup_bench.sh | Creates benchmark data “arenas” for performance comparisons. |
| README.md | Documents benchmark results and steps to reproduce. |
| CHANGELOG.md | Adds a 0.1.2 entry describing the performance-oriented DB refactor. |
```bash
hyperfine \
    --warmup 1 \
    --prepare 'rm -rf ~/.bdstorage && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
```
The benchmark hyperfine --prepare step wipes ~/.bdstorage, but the codebase stores state under ~/.imprint (see src/state.rs and src/vault.rs). As written, reruns may not reset the real DB/vault and could skew results. Update the prepare command (and any other benchmark docs) to clear ~/.imprint instead.
Suggested change:

```diff
- --prepare 'rm -rf ~/.bdstorage && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
+ --prepare 'rm -rf ~/.imprint && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
```
```rust
    db: std::sync::Arc<Database>,
}

#[allow(dead_code)]
```
#[allow(dead_code)] on the entire impl State suppresses dead-code warnings for all methods, which can hide genuinely unused APIs and makes clippy less effective. It doesn’t appear necessary here since State methods are referenced from main.rs; consider removing it or scoping the allow to the specific unused item (if any).
Suggested change:

```diff
- #[allow(dead_code)]
```
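A minimal sketch of the scoping the comment suggests. `State` and its methods are simplified stand-ins for the real types in src/state.rs, and `legacy_lookup` is a hypothetical placeholder for whichever item actually triggers the lint:

```rust
// Simplified stand-in for the real `State` in src/state.rs.
struct State;

impl State {
    // Referenced from main.rs, so no attribute is needed here.
    fn batch_write(&self) -> usize {
        1
    }

    // Scope the allow to the single unused item instead of the whole impl,
    // so clippy still flags any other method that becomes dead.
    #[allow(dead_code)]
    fn legacy_lookup(&self) {}
}

fn main() {
    let s = State;
    assert_eq!(s.batch_write(), 1);
    println!("ok");
}
```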
```rust
for op in ops {
    match op {
        DbOp::UpsertFile(path, metadata) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let value = bincode::serialize(&metadata)
                .with_context(|| "serialize file metadata")?;
            let mut table = txn.open_table(FILE_INDEX)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::SetCasRefcount(hash, count) => {
            let key = hash.to_vec();
            let value = count.to_le_bytes().to_vec();
            let mut table = txn.open_table(CAS_INDEX)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::MarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let value = 1u8;
            let mut table = txn.open_table(VAULTED_INODES)?;
            table.insert(key.as_slice(), std::slice::from_ref(&value))?;
        }
        DbOp::RemoveFileFromIndex(path) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let mut table = txn.open_table(FILE_INDEX)?;
            table.remove(key.as_slice())?;
        }
        DbOp::UnmarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let mut table = txn.open_table(VAULTED_INODES)?;
            table.remove(key.as_slice())?;
        }
        DbOp::RemoveCasRefcount(hash) => {
            let key = hash.to_vec();
            let mut table = txn.open_table(CAS_INDEX)?;
            table.remove(key.as_slice())?;
```
batch_write opens the same tables repeatedly inside the per-op loop (txn.open_table(FILE_INDEX/CAS_INDEX/VAULTED_INODES) for every op). This adds overhead and may reduce/negate the intended batching performance gains; consider opening each table once per transaction (or lazily caching the table handles) and reusing them while iterating over ops.
Suggested change:

```rust
let mut file_index = txn.open_table(FILE_INDEX)?;
let mut cas_index = txn.open_table(CAS_INDEX)?;
let mut vaulted_inodes = txn.open_table(VAULTED_INODES)?;
for op in ops {
    match op {
        DbOp::UpsertFile(path, metadata) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let value = bincode::serialize(&metadata)
                .with_context(|| "serialize file metadata")?;
            file_index.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::SetCasRefcount(hash, count) => {
            let key = hash.to_vec();
            let value = count.to_le_bytes().to_vec();
            cas_index.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::MarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let value = 1u8;
            vaulted_inodes.insert(key.as_slice(), std::slice::from_ref(&value))?;
        }
        DbOp::RemoveFileFromIndex(path) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            file_index.remove(key.as_slice())?;
        }
        DbOp::UnmarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            vaulted_inodes.remove(key.as_slice())?;
        }
        DbOp::RemoveCasRefcount(hash) => {
            let key = hash.to_vec();
            cas_index.remove(key.as_slice())?;
```
```rust
if !buffer.is_empty() {
    let _ = self.batch_write(std::mem::take(&mut buffer));
```
std::mem::take(&mut buffer) replaces buffer with an empty Vec (capacity 0). On the next loop iteration, pushes will reallocate, which undermines the “zero-copy swap” goal and can add significant allocation churn under load. Consider swapping with a preallocated replacement (e.g., mem::replace with Vec::with_capacity(BATCH_SIZE) or swap into a separate vec) to preserve capacity across batches.
Suggested change:

```diff
- let _ = self.batch_write(std::mem::take(&mut buffer));
+ let to_write = std::mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
+ let _ = self.batch_write(to_write);
```
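A runnable sketch of the capacity behavior the comment describes, using plain `Vec<u64>` buffers in place of the real `Vec<DbOp>` (`BATCH_SIZE` is the batch threshold assumed by this review):

```rust
use std::mem;

const BATCH_SIZE: usize = 1000;

fn main() {
    // mem::take leaves behind Vec::new(), i.e. capacity 0:
    let mut buffer: Vec<u64> = Vec::with_capacity(BATCH_SIZE);
    buffer.push(42);
    let taken = mem::take(&mut buffer);
    assert_eq!(taken.len(), 1);
    assert_eq!(buffer.capacity(), 0); // the next push must reallocate

    // mem::replace swaps in a preallocated Vec, preserving headroom:
    let mut buffer: Vec<u64> = Vec::with_capacity(BATCH_SIZE);
    buffer.push(42);
    let to_write = mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
    assert_eq!(to_write.len(), 1);
    assert!(buffer.capacity() >= BATCH_SIZE); // subsequent pushes stay in place

    println!("capacity preserved");
}
```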
```rust
if !buffer.is_empty() {
    let _ = self.batch_write(std::mem::take(&mut buffer));
}
```
batch_write_from_channel discards all database write errors (let _ = self.batch_write(...)). If a commit fails, the in-memory scan/dedupe results will diverge from persisted state with no signal to the caller. Consider returning a Result<()> from this method and propagating/handling the error (or at least logging and terminating the writer thread on failure).
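One way the writer loop could propagate errors, as a hedged sketch: `DbOp`, `State`, and `batch_write` are simplified stand-ins for the types in src/state.rs, `std::sync::mpsc` stands in for the crossbeam channel used in the PR, and the `String` error type replaces the crate's real error type.

```rust
use std::sync::mpsc::Receiver;

const BATCH_SIZE: usize = 1000;

enum DbOp {
    UpsertFile(String),
}

struct State;

impl State {
    // Stand-in for the real batched commit; always succeeds here.
    fn batch_write(&self, _ops: Vec<DbOp>) -> Result<(), String> {
        Ok(())
    }

    // Returns Result so a failed commit reaches the caller instead of
    // being discarded with `let _ =`.
    fn batch_write_from_channel(&self, rx: Receiver<DbOp>) -> Result<(), String> {
        let mut buffer = Vec::with_capacity(BATCH_SIZE);
        for op in rx {
            buffer.push(op);
            if buffer.len() >= BATCH_SIZE {
                let batch =
                    std::mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
                self.batch_write(batch)?; // propagate commit failures
            }
        }
        // Channel closed: flush the remainder.
        if !buffer.is_empty() {
            self.batch_write(buffer)?;
        }
        Ok(())
    }
}

fn main() {
    let (tx, rx) = std::sync::mpsc::channel();
    tx.send(DbOp::UpsertFile("a.txt".into())).unwrap();
    drop(tx); // close the channel so the writer loop exits
    let state = State;
    assert!(state.batch_write_from_channel(rx).is_ok());
    println!("writer finished cleanly");
}
```

The writer thread would then return this `Result` through its `JoinHandle`, letting the pipeline surface the first failed batch.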
```rust
hash_bar.finish_and_clear();

let _ = db_writer_handle.join();
```
The DB writer thread join() result is ignored. If the writer panics (or if you later change it to return Result), this will silently proceed with potentially incomplete DB state. Consider handling the JoinHandle result (e.g., propagate panic, or return an error from scan_pipeline).
Suggested change:

```diff
- let _ = db_writer_handle.join();
+ if let Err(err) = db_writer_handle.join() {
+     std::panic::resume_unwind(err);
+ }
```
```rust
global_db_ops.extend(db_ops);
if global_db_ops.len() >= 1000 {
    state.batch_write(std::mem::take(&mut global_db_ops))?;
}
```
Batch flush uses a hard-coded 1000 and std::mem::take(&mut global_db_ops), which drops the vector’s capacity and can reintroduce allocation churn on the next accumulation cycle. Consider using a shared BATCH_SIZE constant (single source of truth) and swapping with a preallocated vec so capacity is retained between flushes.
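A sketch of what a single source of truth could look like: a shared `BATCH_SIZE` constant and a hypothetical `flush_if_full` helper (not in the PR) that owns both the threshold check and the capacity-preserving swap, so the scan/dedupe/restore call sites cannot drift apart.

```rust
const BATCH_SIZE: usize = 1000;

// Hypothetical helper: flushes the buffer through `write` once it reaches
// BATCH_SIZE, swapping in a preallocated Vec so capacity is retained.
fn flush_if_full<T>(
    buf: &mut Vec<T>,
    write: impl FnOnce(Vec<T>) -> Result<(), String>,
) -> Result<(), String> {
    if buf.len() >= BATCH_SIZE {
        let batch = std::mem::replace(buf, Vec::with_capacity(BATCH_SIZE));
        write(batch)?;
    }
    Ok(())
}

fn main() {
    let mut ops: Vec<u32> = (0..BATCH_SIZE as u32).collect();
    flush_if_full(&mut ops, |batch| {
        // In the real code this would be state.batch_write(batch).
        assert_eq!(batch.len(), BATCH_SIZE);
        Ok(())
    })
    .unwrap();
    assert!(ops.is_empty());
    assert!(ops.capacity() >= BATCH_SIZE); // headroom preserved for the next cycle
    println!("flushed");
}
```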
```rust
global_restore_ops.extend(restore_ops);
if global_restore_ops.len() >= 1000 {
    let _ = state.batch_write(std::mem::take(&mut global_restore_ops));
}
```
Like in other batching sites, std::mem::take(&mut global_restore_ops) drops capacity and let _ = state.batch_write(...) suppresses write/commit failures. For restore, silent DB failures can leave refcounts/inode markers inconsistent with the filesystem. Prefer swapping with a preallocated vec and propagating (or at least logging) batch_write errors.
Completely decoupled the worker threads from database I/O to allow them to hash at maximum CPU speed:

- Added a `crossbeam` channel to send `DbOp` enums from the workers to a single, dedicated background database writer.
- Batched DB operations in `dedupe_groups`, `scan_pipeline`, and `restore_pipeline`. Operations are now buffered and flushed to `redb` in batches of 1,000.
- Used `std::mem::take` to swap the buffer into the database writer without expensive memory allocations.

Closes #36