Conversation
Pull request overview
Refactors the scanning/dedupe/restore pipelines to reduce redb write-transaction overhead by batching state updates and (for scanning) moving DB I/O onto a dedicated writer thread, aiming to keep worker threads CPU-bound on hashing.
Changes:
- Introduces `DbOp` + `State::batch_write`/`batch_write_from_channel` to batch and apply DB mutations.
- Updates `scan_pipeline`, `dedupe_groups`, and `restore_pipeline` to buffer DB operations and flush in ~1,000-op batches.
- Adds benchmark setup + README/CHANGELOG updates describing performance results and how to reproduce them.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/state.rs | Adds DbOp and batched DB write APIs, including a channel-driven writer loop. |
| src/main.rs | Routes state mutations through batching/channel writer and adds batch flush logic in pipelines. |
| benchmarks/setup_bench.sh | Creates benchmark data “arenas” for performance comparisons. |
| README.md | Documents benchmark results and steps to reproduce. |
| CHANGELOG.md | Adds a 0.1.2 entry describing the performance-oriented DB refactor. |
```bash
hyperfine \
    --warmup 1 \
    --prepare 'rm -rf ~/.bdstorage && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
```
The benchmark hyperfine --prepare step wipes ~/.bdstorage, but the codebase stores state under ~/.imprint (see src/state.rs and src/vault.rs). As written, reruns may not reset the real DB/vault and could skew results. Update the prepare command (and any other benchmark docs) to clear ~/.imprint instead.
Suggested change:

```diff
- --prepare 'rm -rf ~/.bdstorage && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
+ --prepare 'rm -rf ~/.imprint && rm -rf /tmp/bench_data/arena_tiny/test && cp -r /tmp/bench_data/arena_tiny/pristine /tmp/bench_data/arena_tiny/test' \
```
```rust
    db: std::sync::Arc<Database>,
}

#[allow(dead_code)]
```
#[allow(dead_code)] on the entire impl State suppresses dead-code warnings for all methods, which can hide genuinely unused APIs and makes clippy less effective. It doesn’t appear necessary here since State methods are referenced from main.rs; consider removing it or scoping the allow to the specific unused item (if any).
Suggested change:

```diff
- #[allow(dead_code)]
```
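A minimal sketch of the scoping the comment suggests. `State` and its methods are simplified stand-ins for the real types in src/state.rs, and `legacy_lookup` is a hypothetical placeholder for whichever item actually triggers the lint:

```rust
// Simplified stand-in for the real `State` in src/state.rs.
struct State;

impl State {
    // Referenced from main.rs, so no attribute is needed here.
    fn batch_write(&self) -> usize {
        1
    }

    // Scope the allow to the single unused item instead of the whole impl,
    // so clippy still flags any other method that becomes dead.
    #[allow(dead_code)]
    fn legacy_lookup(&self) {}
}

fn main() {
    let s = State;
    assert_eq!(s.batch_write(), 1);
    println!("ok");
}
```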
```rust
for op in ops {
    match op {
        DbOp::UpsertFile(path, metadata) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let value = bincode::serialize(&metadata)
                .with_context(|| "serialize file metadata")?;
            let mut table = txn.open_table(FILE_INDEX)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::SetCasRefcount(hash, count) => {
            let key = hash.to_vec();
            let value = count.to_le_bytes().to_vec();
            let mut table = txn.open_table(CAS_INDEX)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::MarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let value = 1u8;
            let mut table = txn.open_table(VAULTED_INODES)?;
            table.insert(key.as_slice(), std::slice::from_ref(&value))?;
        }
        DbOp::RemoveFileFromIndex(path) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let mut table = txn.open_table(FILE_INDEX)?;
            table.remove(key.as_slice())?;
        }
        DbOp::UnmarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let mut table = txn.open_table(VAULTED_INODES)?;
            table.remove(key.as_slice())?;
        }
        DbOp::RemoveCasRefcount(hash) => {
            let key = hash.to_vec();
            let mut table = txn.open_table(CAS_INDEX)?;
            table.remove(key.as_slice())?;
```
batch_write opens the same tables repeatedly inside the per-op loop (txn.open_table(FILE_INDEX/CAS_INDEX/VAULTED_INODES) for every op). This adds overhead and may reduce/negate the intended batching performance gains; consider opening each table once per transaction (or lazily caching the table handles) and reusing them while iterating over ops.
Suggested change:

```rust
let mut file_index = txn.open_table(FILE_INDEX)?;
let mut cas_index = txn.open_table(CAS_INDEX)?;
let mut vaulted_inodes = txn.open_table(VAULTED_INODES)?;
for op in ops {
    match op {
        DbOp::UpsertFile(path, metadata) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            let value = bincode::serialize(&metadata)
                .with_context(|| "serialize file metadata")?;
            file_index.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::SetCasRefcount(hash, count) => {
            let key = hash.to_vec();
            let value = count.to_le_bytes().to_vec();
            cas_index.insert(key.as_slice(), value.as_slice())?;
        }
        DbOp::MarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            let value = 1u8;
            vaulted_inodes.insert(key.as_slice(), std::slice::from_ref(&value))?;
        }
        DbOp::RemoveFileFromIndex(path) => {
            let key = path.to_string_lossy().as_bytes().to_vec();
            file_index.remove(key.as_slice())?;
        }
        DbOp::UnmarkInodeVaulted(inode) => {
            let key = inode.to_le_bytes();
            vaulted_inodes.remove(key.as_slice())?;
        }
        DbOp::RemoveCasRefcount(hash) => {
            let key = hash.to_vec();
            cas_index.remove(key.as_slice())?;
```
```rust
if !buffer.is_empty() {
    let _ = self.batch_write(std::mem::take(&mut buffer));
```
std::mem::take(&mut buffer) replaces buffer with an empty Vec (capacity 0). On the next loop iteration, pushes will reallocate, which undermines the “zero-copy swap” goal and can add significant allocation churn under load. Consider swapping with a preallocated replacement (e.g., mem::replace with Vec::with_capacity(BATCH_SIZE) or swap into a separate vec) to preserve capacity across batches.
Suggested change:

```diff
- let _ = self.batch_write(std::mem::take(&mut buffer));
+ let to_write = std::mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
+ let _ = self.batch_write(to_write);
```
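A runnable sketch of the capacity behavior the comment describes, using plain `Vec<u64>` buffers in place of the real `Vec<DbOp>` (`BATCH_SIZE` is the batch threshold assumed by this review):

```rust
use std::mem;

const BATCH_SIZE: usize = 1000;

fn main() {
    // mem::take leaves behind Vec::new(), i.e. capacity 0:
    let mut buffer: Vec<u64> = Vec::with_capacity(BATCH_SIZE);
    buffer.push(42);
    let taken = mem::take(&mut buffer);
    assert_eq!(taken.len(), 1);
    assert_eq!(buffer.capacity(), 0); // the next push must reallocate

    // mem::replace swaps in a preallocated Vec, preserving headroom:
    let mut buffer: Vec<u64> = Vec::with_capacity(BATCH_SIZE);
    buffer.push(42);
    let to_write = mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
    assert_eq!(to_write.len(), 1);
    assert!(buffer.capacity() >= BATCH_SIZE); // subsequent pushes stay in place

    println!("capacity preserved");
}
```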
```rust
if !buffer.is_empty() {
    let _ = self.batch_write(std::mem::take(&mut buffer));
}
```
batch_write_from_channel discards all database write errors (let _ = self.batch_write(...)). If a commit fails, the in-memory scan/dedupe results will diverge from persisted state with no signal to the caller. Consider returning a Result<()> from this method and propagating/handling the error (or at least logging and terminating the writer thread on failure).
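One way the writer loop could propagate errors, as a hedged sketch: `DbOp`, `State`, and `batch_write` are simplified stand-ins for the types in src/state.rs, `std::sync::mpsc` stands in for the crossbeam channel used in the PR, and the `String` error type replaces the crate's real error type.

```rust
use std::sync::mpsc::Receiver;

const BATCH_SIZE: usize = 1000;

enum DbOp {
    UpsertFile(String),
}

struct State;

impl State {
    // Stand-in for the real batched commit; always succeeds here.
    fn batch_write(&self, _ops: Vec<DbOp>) -> Result<(), String> {
        Ok(())
    }

    // Returns Result so a failed commit reaches the caller instead of
    // being discarded with `let _ =`.
    fn batch_write_from_channel(&self, rx: Receiver<DbOp>) -> Result<(), String> {
        let mut buffer = Vec::with_capacity(BATCH_SIZE);
        for op in rx {
            buffer.push(op);
            if buffer.len() >= BATCH_SIZE {
                let batch =
                    std::mem::replace(&mut buffer, Vec::with_capacity(BATCH_SIZE));
                self.batch_write(batch)?; // propagate commit failures
            }
        }
        // Channel closed: flush the remainder.
        if !buffer.is_empty() {
            self.batch_write(buffer)?;
        }
        Ok(())
    }
}

fn main() {
    let (tx, rx) = std::sync::mpsc::channel();
    tx.send(DbOp::UpsertFile("a.txt".into())).unwrap();
    drop(tx); // close the channel so the writer loop exits
    let state = State;
    assert!(state.batch_write_from_channel(rx).is_ok());
    println!("writer finished cleanly");
}
```

The writer thread would then return this `Result` through its `JoinHandle`, letting the pipeline surface the first failed batch.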
```rust
hash_bar.finish_and_clear();

let _ = db_writer_handle.join();
```
The DB writer thread join() result is ignored. If the writer panics (or if you later change it to return Result), this will silently proceed with potentially incomplete DB state. Consider handling the JoinHandle result (e.g., propagate panic, or return an error from scan_pipeline).
Suggested change:

```diff
- let _ = db_writer_handle.join();
+ if let Err(err) = db_writer_handle.join() {
+     std::panic::resume_unwind(err);
+ }
```
```rust
global_db_ops.extend(db_ops);
if global_db_ops.len() >= 1000 {
    state.batch_write(std::mem::take(&mut global_db_ops))?;
}
```
Batch flush uses a hard-coded 1000 and std::mem::take(&mut global_db_ops), which drops the vector’s capacity and can reintroduce allocation churn on the next accumulation cycle. Consider using a shared BATCH_SIZE constant (single source of truth) and swapping with a preallocated vec so capacity is retained between flushes.
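A sketch of what a single source of truth could look like: a shared `BATCH_SIZE` constant and a hypothetical `flush_if_full` helper (not in the PR) that owns both the threshold check and the capacity-preserving swap, so the scan/dedupe/restore call sites cannot drift apart.

```rust
const BATCH_SIZE: usize = 1000;

// Hypothetical helper: flushes the buffer through `write` once it reaches
// BATCH_SIZE, swapping in a preallocated Vec so capacity is retained.
fn flush_if_full<T>(
    buf: &mut Vec<T>,
    write: impl FnOnce(Vec<T>) -> Result<(), String>,
) -> Result<(), String> {
    if buf.len() >= BATCH_SIZE {
        let batch = std::mem::replace(buf, Vec::with_capacity(BATCH_SIZE));
        write(batch)?;
    }
    Ok(())
}

fn main() {
    let mut ops: Vec<u32> = (0..BATCH_SIZE as u32).collect();
    flush_if_full(&mut ops, |batch| {
        // In the real code this would be state.batch_write(batch).
        assert_eq!(batch.len(), BATCH_SIZE);
        Ok(())
    })
    .unwrap();
    assert!(ops.is_empty());
    assert!(ops.capacity() >= BATCH_SIZE); // headroom preserved for the next cycle
    println!("flushed");
}
```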
```rust
global_restore_ops.extend(restore_ops);
if global_restore_ops.len() >= 1000 {
    let _ = state.batch_write(std::mem::take(&mut global_restore_ops));
}
```
Like in other batching sites, std::mem::take(&mut global_restore_ops) drops capacity and let _ = state.batch_write(...) suppresses write/commit failures. For restore, silent DB failures can leave refcounts/inode markers inconsistent with the filesystem. Prefer swapping with a preallocated vec and propagating (or at least logging) batch_write errors.
Completely decoupled the worker threads from database I/O to allow them to hash at maximum CPU speed:

- Added a `crossbeam` channel to send `DbOp` enums from the workers to a single, dedicated background database writer.
- Batched DB operations in `dedupe_groups`, `scan_pipeline`, and `restore_pipeline`. Operations are now buffered and flushed to `redb` in batches of 1,000.
- Used `std::mem::take` to swap the buffer into the database writer without expensive memory allocations.

Closes #36