
Perf/db transaction batching#37

Merged
Rakshat28 merged 2 commits into main from perf/db-transaction-batching
Feb 28, 2026

Conversation

Rakshat28 (Owner) commented Feb 28, 2026

Completely decoupled the worker threads from database I/O so they can hash at maximum CPU speed:

  • Dedicated Writer Thread: Created a new crossbeam channel to send DbOp enums from the workers to a single, dedicated background database writer.
  • Transaction Batching: Implemented an accumulator pattern in dedupe_groups, scan_pipeline, and restore_pipeline. Operations are now buffered and flushed to redb in batches of 1,000.
  • Zero-Copy Memory Swap: Utilized std::mem::take to swap the buffer into the database writer without expensive memory allocations.
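The three pieces above can be sketched end to end. This is a minimal, dependency-free illustration rather than the PR's code: `DbOp`, `BATCH_SIZE`, and `run_batched` are hypothetical stand-ins, std's `mpsc` channel is used in place of crossbeam so the sketch compiles on its own, and the flush is mocked with a counter instead of a redb transaction.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for the PR's DbOp; the real enum carries
// paths, hashes, and file metadata.
enum DbOp {
    UpsertFile(String),
}

const BATCH_SIZE: usize = 1_000;

// Workers send ops over a channel; a single dedicated writer thread
// buffers them and flushes in batches instead of one write per op.
fn run_batched(n: usize) -> usize {
    let (tx, rx) = mpsc::channel::<DbOp>();

    let writer = thread::spawn(move || {
        let mut buffer = Vec::with_capacity(BATCH_SIZE);
        let mut flushed = 0;
        for op in rx {
            buffer.push(op);
            if buffer.len() >= BATCH_SIZE {
                // One transaction per batch, not per op (mocked here).
                flushed += buffer.len();
                buffer.clear();
            }
        }
        flushed + buffer.len() // final flush once the channel closes
    });

    // Workers only send; they never block on database I/O.
    for i in 0..n {
        tx.send(DbOp::UpsertFile(format!("file_{i}"))).unwrap();
    }
    drop(tx); // closing the channel lets the writer drain and exit

    writer.join().unwrap()
}

fn main() {
    assert_eq!(run_batched(2_500), 2_500);
    println!("ok");
}
```

Dropping the sender is what terminates the writer's `for op in rx` loop, so the final partial batch is flushed exactly once before `join()` returns.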

Closes #36

Copilot AI review requested due to automatic review settings February 28, 2026 11:19

Copilot AI left a comment


Pull request overview

Refactors the scanning/dedupe/restore pipelines to reduce redb write-transaction overhead by batching state updates and (for scanning) moving DB I/O onto a dedicated writer thread, aiming to keep worker threads CPU-bound on hashing.

Changes:

  • Introduces DbOp + State::batch_write / batch_write_from_channel to batch and apply DB mutations.
  • Updates scan_pipeline, dedupe_groups, and restore_pipeline to buffer DB operations and flush in ~1,000-op batches.
  • Adds benchmark setup + README/CHANGELOG updates describing performance results and how to reproduce them.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Summary per file:

  • src/state.rs: Adds DbOp and batched DB write APIs, including a channel-driven writer loop.
  • src/main.rs: Routes state mutations through the batching/channel writer and adds batch-flush logic to the pipelines.
  • benchmarks/setup_bench.sh: Creates benchmark data “arenas” for performance comparisons.
  • README.md: Documents benchmark results and steps to reproduce them.
  • CHANGELOG.md: Adds a 0.1.2 entry describing the performance-oriented DB refactor.


```bash
hyperfine \
  --warmup 1 \
  --prepare 'rm -rf ~/.bdstorage && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
```

Copilot AI Feb 28, 2026


The benchmark hyperfine --prepare step wipes ~/.bdstorage, but the codebase stores state under ~/.imprint (see src/state.rs and src/vault.rs). As written, reruns may not reset the real DB/vault and could skew results. Update the prepare command (and any other benchmark docs) to clear ~/.imprint instead.

Suggested change

```diff
- --prepare 'rm -rf ~/.bdstorage && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
+ --prepare 'rm -rf ~/.imprint && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
```

```rust
    db: std::sync::Arc<Database>,
}

#[allow(dead_code)]
```

Copilot AI Feb 28, 2026


#[allow(dead_code)] on the entire impl State suppresses dead-code warnings for all methods, which can hide genuinely unused APIs and makes clippy less effective. It doesn’t appear necessary here since State methods are referenced from main.rs; consider removing it or scoping the allow to the specific unused item (if any).

Suggested change

```diff
- #[allow(dead_code)]
```

Comment on lines +234 to +268
```rust
for op in ops {
    match op {
        DbOp::UpsertFile(path, metadata) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let value = bincode::serialize(&metadata)
                .with_context(|| "serialize file metadata")?;
            let mut table = txn.open_table(FILE_INDEX)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::SetCasRefcount(hash, count) => {
            let key = hash.to_vec();
            let value = count.to_le_bytes().to_vec();
            let mut table = txn.open_table(CAS_INDEX)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::MarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let value = 1u8;
            let mut table = txn.open_table(VAULTED_INODES)?;
            table.insert(key.as_slice(), std::slice::from_ref(&value))?;
        }
        DbOp::RemoveFileFromIndex(path) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let mut table = txn.open_table(FILE_INDEX)?;
            table.remove(key.as_slice())?;
        }
        DbOp::UnmarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let mut table = txn.open_table(VAULTED_INODES)?;
            table.remove(key.as_slice())?;
        }
        DbOp::RemoveCasRefcount(hash) => {
            let key = hash.to_vec();
            let mut table = txn.open_table(CAS_INDEX)?;
            table.remove(key.as_slice())?;
```

Copilot AI Feb 28, 2026


batch_write opens the same tables repeatedly inside the per-op loop (txn.open_table(FILE_INDEX/CAS_INDEX/VAULTED_INODES) for every op). This adds overhead and may reduce/negate the intended batching performance gains; consider opening each table once per transaction (or lazily caching the table handles) and reusing them while iterating over ops.

Suggested change (replacing the per-op `open_table` calls quoted above):

```rust
let mut file_index = txn.open_table(FILE_INDEX)?;
let mut cas_index = txn.open_table(CAS_INDEX)?;
let mut vaulted_inodes = txn.open_table(VAULTED_INODES)?;
for op in ops {
    match op {
        DbOp::UpsertFile(path, metadata) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let value = bincode::serialize(&metadata)
                .with_context(|| "serialize file metadata")?;
            file_index.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::SetCasRefcount(hash, count) => {
            let key = hash.to_vec();
            let value = count.to_le_bytes().to_vec();
            cas_index.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::MarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let value = 1u8;
            vaulted_inodes.insert(key.as_slice(), std::slice::from_ref(&value))?;
        }
        DbOp::RemoveFileFromIndex(path) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            file_index.remove(key.as_slice())?;
        }
        DbOp::UnmarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            vaulted_inodes.remove(key.as_slice())?;
        }
        DbOp::RemoveCasRefcount(hash) => {
            let key = hash.to_vec();
            cas_index.remove(key.as_slice())?;
```
```rust
    }

    if !buffer.is_empty() {
        let _ = self.batch_write(std::mem::take(&mut buffer));
```

Copilot AI Feb 28, 2026


std::mem::take(&mut buffer) replaces buffer with an empty Vec (capacity 0). On the next loop iteration, pushes will reallocate, which undermines the “zero-copy swap” goal and can add significant allocation churn under load. Consider swapping with a preallocated replacement (e.g., mem::replace with Vec::with_capacity(BATCH_SIZE) or swap into a separate vec) to preserve capacity across batches.

Suggested change

```diff
- let _ = self.batch_write(std::mem::take(&mut buffer));
+ let to_write = std::mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
+ let _ = self.batch_write(to_write);
```
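The capacity difference behind this comment is easy to verify in isolation. A minimal sketch (the function name is illustrative): `mem::take` swaps in `Vec::new()`, whose capacity is guaranteed to be 0, while `mem::replace` lets the caller supply a preallocated replacement.

```rust
// Compare buffer capacity after mem::take vs mem::replace.
fn capacities() -> (usize, usize) {
    let mut taken: Vec<u32> = Vec::with_capacity(1_000);
    taken.extend(0..1_000);
    let _batch = std::mem::take(&mut taken); // swaps in Vec::new()
    let after_take = taken.capacity(); // 0: the next push reallocates

    let mut replaced: Vec<u32> = Vec::with_capacity(1_000);
    replaced.extend(0..1_000);
    let _batch2 = std::mem::replace(&mut replaced, Vec::with_capacity(1_000));
    let after_replace = replaced.capacity(); // at least 1_000: preserved

    (after_take, after_replace)
}

fn main() {
    let (after_take, after_replace) = capacities();
    assert_eq!(after_take, 0);
    assert!(after_replace >= 1_000);
    println!("ok");
}
```

Note that `Vec::with_capacity(n)` guarantees capacity of at least `n`, so the second assertion uses `>=` rather than `==`.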

Comment on lines +307 to +310
```rust
if !buffer.is_empty() {
    let _ = self.batch_write(std::mem::take(&mut buffer));
}
}
```

Copilot AI Feb 28, 2026


batch_write_from_channel discards all database write errors (let _ = self.batch_write(...)). If a commit fails, the in-memory scan/dedupe results will diverge from persisted state with no signal to the caller. Consider returning a Result<()> from this method and propagating/handling the error (or at least logging and terminating the writer thread on failure).
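A minimal sketch of what a `Result`-propagating writer loop could look like, under stated assumptions: `DbOp`, `batch_write`, the poison-value failure, and `run` are mocks for illustration (the real method writes to redb), and `BATCH_SIZE` is shrunk to 4 for the demo.

```rust
use std::sync::mpsc;
use std::thread;

type DbOp = u32;
const BATCH_SIZE: usize = 4; // tiny for the demo; the PR uses 1_000

// Mock flush standing in for State::batch_write: fails on a poison value.
fn batch_write(ops: Vec<DbOp>) -> Result<(), String> {
    if ops.contains(&99) {
        return Err("commit failed".to_string());
    }
    Ok(())
}

// Writer loop that propagates flush errors instead of discarding them.
fn batch_write_from_channel(rx: mpsc::Receiver<DbOp>) -> Result<usize, String> {
    let mut buffer = Vec::with_capacity(BATCH_SIZE);
    let mut written = 0;
    for op in rx {
        buffer.push(op);
        if buffer.len() >= BATCH_SIZE {
            let batch = std::mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
            written += batch.len();
            batch_write(batch)?; // error reaches the caller via join()
        }
    }
    if !buffer.is_empty() {
        written += buffer.len();
        batch_write(buffer)?;
    }
    Ok(written)
}

// Spawn a writer, feed it ops, and surface both panics and write errors.
fn run(ops: Vec<DbOp>) -> Result<usize, String> {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || batch_write_from_channel(rx));
    for op in ops {
        let _ = tx.send(op); // send fails once the writer has bailed out
    }
    drop(tx);
    handle.join().expect("writer thread panicked")
}

fn main() {
    assert_eq!(run((0..10).collect()), Ok(10));
    assert!(run(vec![99, 0, 1, 2, 3]).is_err());
    println!("ok");
}
```

With this shape, a failed commit ends the writer thread immediately and the error surfaces to whoever joins the handle, rather than the scan silently diverging from persisted state.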


```rust
hash_bar.finish_and_clear();

let _ = db_writer_handle.join();
```

Copilot AI Feb 28, 2026


The DB writer thread join() result is ignored. If the writer panics (or if you later change it to return Result), this will silently proceed with potentially incomplete DB state. Consider handling the JoinHandle result (e.g., propagate panic, or return an error from scan_pipeline).

Suggested change

```diff
- let _ = db_writer_handle.join();
+ if let Err(err) = db_writer_handle.join() {
+     std::panic::resume_unwind(err);
+ }
```

Comment on lines +473 to +476
```rust
global_db_ops.extend(db_ops);
if global_db_ops.len() >= 1000 {
    state.batch_write(std::mem::take(&mut global_db_ops))?;
}
```

Copilot AI Feb 28, 2026


Batch flush uses a hard-coded 1000 and std::mem::take(&mut global_db_ops), which drops the vector’s capacity and can reintroduce allocation churn on the next accumulation cycle. Consider using a shared BATCH_SIZE constant (single source of truth) and swapping with a preallocated vec so capacity is retained between flushes.

Comment on lines +622 to +625
```rust
global_restore_ops.extend(restore_ops);
if global_restore_ops.len() >= 1000 {
    let _ = state.batch_write(std::mem::take(&mut global_restore_ops));
}
```

Copilot AI Feb 28, 2026


Like in other batching sites, std::mem::take(&mut global_restore_ops) drops capacity and let _ = state.batch_write(...) suppresses write/commit failures. For restore, silent DB failures can leave refcounts/inode markers inconsistent with the filesystem. Prefer swapping with a preallocated vec and propagating (or at least logging) batch_write errors.

@Rakshat28 Rakshat28 merged commit ff30424 into main Feb 28, 2026
6 checks passed
@Rakshat28 Rakshat28 deleted the perf/db-transaction-batching branch March 11, 2026 07:23


Development

Successfully merging this pull request may close these issues.

Performance: High latency on deep directories of tiny files (Database transaction bottleneck)

2 participants