Conversation
Pull request overview
This PR implements a comprehensive integration test suite and CI/CD pipeline for the bdstorage file deduplication tool. The changes establish automated testing infrastructure to ensure data safety and correctness across dedupe, restore, and scan operations. The PR includes 9 integration tests covering happy paths, edge cases (zero-byte files, deep nesting, sparse hashing collisions), metadata preservation, safety features (paranoid mode, dry-run), and hardlink fallback behavior. A GitHub Actions workflow enforces code quality through formatting checks, linting, testing, and coverage reporting.
Changes:
- Added comprehensive integration test suite using assert_cmd and tempfile for isolated HOME directory sandboxing
- Implemented GitHub Actions CI pipeline with formatting, linting, testing (sequential with --test-threads=1), and code coverage
- Applied rustfmt formatting improvements across source files (line breaking, import ordering)
- Added dev-dependencies: assert_cmd, tempfile, predicates, walkdir
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/integration_tests.rs | New integration test suite with 9 tests covering deduplication, restoration, metadata integrity, dry-run mode, paranoid mode, and edge cases |
| .github/workflows/ci.yml | CI pipeline configuration with test, format, lint, and coverage jobs running on ubuntu-latest |
| Cargo.toml | Added dev-dependencies for integration testing framework |
| Cargo.lock | Dependency lockfile updates for new dev-dependencies |
| src/main.rs | Formatting improvements: multi-line formatting for long function calls and string literals |
| src/state.rs | Formatting improvements: line breaking for method chains |
| src/vault.rs | Import reordering (alphabetical) and multi-line formatting for conditionals |
| src/dedupe.rs | Import reordering and formatting improvements for function signatures and method chains |
| src/hasher.rs | Formatting improvements for method chains |
```diff
@@ -0,0 +1,346 @@
+use assert_cmd::Command;
+use std::fs;
+use std::os::unix::fs::{MetadataExt, PermissionsExt};
```
The integration tests import Unix-specific APIs (std::os::unix::fs) without platform guards. While the PR description mentions testing on ubuntu-latest, the code should either have cfg attributes or documentation noting these tests are Unix-only. This is consistent with the main codebase which is Unix-only, but should be explicit in the test file.
tests/integration_tests.rs (Outdated)
```rust
let vault = home.join(".imprint").join("store");
assert!(
```
The dry-run test verifies that the vault is not created, but doesn't check if the state database is created. In true dry-run mode, the state database at ~/.imprint/state.redb should also not be created or modified. Consider adding an assertion to verify the entire ~/.imprint directory doesn't exist after a dry-run to ensure complete isolation.
Suggested change:

```diff
-let vault = home.join(".imprint").join("store");
-assert!(
+let imprint_dir = home.join(".imprint");
+assert!(
+    !imprint_dir.exists(),
+    ".imprint directory should not be created in dry-run mode"
+);
+let vault = imprint_dir.join("store");
+assert!(
```
.github/workflows/ci.yml (Outdated)
```yaml
        uses: actions/cache@v3
        with:
          path: ~/.cargo/registry
          key: ${{ runner.os }}-cargo-registry-${{ hashFiles('**/Cargo.lock') }}

      - name: Cache cargo index
        uses: actions/cache@v3
        with:
          path: ~/.cargo/git
          key: ${{ runner.os }}-cargo-git-${{ hashFiles('**/Cargo.lock') }}

      - name: Cache cargo build
        uses: actions/cache@v3
```
The workflow uses actions/cache@v3 alongside actions/checkout@v4. Consider updating to actions/cache@v4 for consistency and to pick up the latest cache improvements and bug fixes; v4 is the current major version of the cache action.
Suggested change:

```diff
-        uses: actions/cache@v3
+        uses: actions/cache@v4
         with:
           path: ~/.cargo/registry
           key: ${{ runner.os }}-cargo-registry-${{ hashFiles('**/Cargo.lock') }}
       - name: Cache cargo index
-        uses: actions/cache@v3
+        uses: actions/cache@v4
         with:
           path: ~/.cargo/git
           key: ${{ runner.os }}-cargo-git-${{ hashFiles('**/Cargo.lock') }}
       - name: Cache cargo build
-        uses: actions/cache@v3
+        uses: actions/cache@v4
```
tests/integration_tests.rs (Outdated)
```rust
assert_eq!(
    file1_meta.ino(),
    file2_meta.ino(),
    "Hardlinked files should have the same inode"
);
```
The test_hardlink_fallback test assumes that reflinks will not be available and hardlinks will be used. However, on systems that support reflinks (like modern Linux filesystems with CoW support), this test will fail because the code will create reflinks (different inodes) instead of hardlinks (same inode). The test should either check which type of link was created, or ensure reflinks are unavailable for this specific test scenario.
Suggested change:

```diff
-assert_eq!(
-    file1_meta.ino(),
-    file2_meta.ino(),
-    "Hardlinked files should have the same inode"
-);
+if file1_meta.ino() == file2_meta.ino() {
+    // On filesystems without reflink support (or when hardlinks are used),
+    // deduplication should result in hardlinked files with the same inode.
+    assert_eq!(
+        file1_meta.ino(),
+        file2_meta.ino(),
+        "Hardlinked files should have the same inode"
+    );
+} else {
+    // On filesystems with reflink (CoW) support, the implementation may create
+    // reflinks instead of hardlinks, which results in different inodes while
+    // still sharing storage. In that case, we at least verify that the
+    // deduplicated files have identical contents.
+    let file1_content =
+        fs::read(target.join("file1.txt")).expect("Failed to read file1 contents");
+    let file2_content =
+        fs::read(target.join("file2.txt")).expect("Failed to read file2 contents");
+    assert_eq!(
+        file1_content, file2_content,
+        "Deduplicated files should have identical contents regardless of link type"
+    );
+}
```
.github/workflows/ci.yml (Outdated)
```yaml
      - name: Cache cargo registry
        uses: actions/cache@v3
        with:
          path: ~/.cargo/registry
          key: ${{ runner.os }}-cargo-registry-${{ hashFiles('**/Cargo.lock') }}

      - name: Cache cargo index
        uses: actions/cache@v3
        with:
          path: ~/.cargo/git
          key: ${{ runner.os }}-cargo-git-${{ hashFiles('**/Cargo.lock') }}

      - name: Cache cargo build
        uses: actions/cache@v3
        with:
          path: target
          key: ${{ runner.os }}-cargo-build-target-${{ hashFiles('**/Cargo.lock') }}
```
The cache paths use ~/.cargo which may not be portable across different runner configurations. Consider using Swatinem/rust-cache action instead, which handles Rust-specific caching more robustly and includes both ~/.cargo and target directories automatically with better cache key management.
Suggested change:

```diff
-      - name: Cache cargo registry
-        uses: actions/cache@v3
-        with:
-          path: ~/.cargo/registry
-          key: ${{ runner.os }}-cargo-registry-${{ hashFiles('**/Cargo.lock') }}
-      - name: Cache cargo index
-        uses: actions/cache@v3
-        with:
-          path: ~/.cargo/git
-          key: ${{ runner.os }}-cargo-git-${{ hashFiles('**/Cargo.lock') }}
-      - name: Cache cargo build
-        uses: actions/cache@v3
-        with:
-          path: target
-          key: ${{ runner.os }}-cargo-build-target-${{ hashFiles('**/Cargo.lock') }}
+      - name: Cache Rust build
+        uses: Swatinem/rust-cache@v2
```
```rust
let mut restore_cmd = run_cmd(home, &["restore", &target.to_string_lossy()]);
restore_cmd.assert().success();

let vault_files: Vec<_> = walkdir::WalkDir::new(&vault)
    .into_iter()
    .filter_map(|e| e.ok())
    .filter(|e| e.file_type().is_file())
    .collect();

assert!(
    vault_files.is_empty(),
    "Vault should be empty after restore and GC"
);
```
The restore test verifies the vault is empty after restore but doesn't verify that the restored files have the correct content. The test should verify that at least one of the deduplicated files can be read and contains the expected content after restoration to ensure the restore operation actually works correctly.
tests/integration_tests.rs (Outdated)
```rust
if vault.exists() {
    let vault_files: Vec<_> = walkdir::WalkDir::new(&vault)
        .into_iter()
        .filter_map(|e| e.ok())
        .filter(|e| e.file_type().is_file())
        .collect();

    assert!(
        vault_files.len() >= 1,
        "Vault should contain master files after dedupe of large files"
    );
}
```
The test creates two 15KB files with a single byte difference at position 7KB. This tests the sparse hashing logic, but the assertion only checks if the vault exists and has at least 1 file. The test should verify that exactly 2 vault files exist (one for each unique content) to properly validate that the sparse hashing correctly distinguished between the two files.
Suggested change:

```diff
-if vault.exists() {
-    let vault_files: Vec<_> = walkdir::WalkDir::new(&vault)
-        .into_iter()
-        .filter_map(|e| e.ok())
-        .filter(|e| e.file_type().is_file())
-        .collect();
-
-    assert!(
-        vault_files.len() >= 1,
-        "Vault should contain master files after dedupe of large files"
-    );
-}
+assert!(
+    vault.exists(),
+    "Vault store directory should exist after dedupe of large files"
+);
+let vault_files: Vec<_> = walkdir::WalkDir::new(&vault)
+    .into_iter()
+    .filter_map(|e| e.ok())
+    .filter(|e| e.file_type().is_file())
+    .collect();
+assert_eq!(
+    vault_files.len(),
+    2,
+    "Vault should contain exactly 2 master files (one per unique large file content)"
+);
```
```yaml
        run: cargo tarpaulin --out Xml --timeout 300 -- --test-threads=1

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
```
The workflow uses codecov/codecov-action@v3, but v4 is available and recommended as of late 2023. Consider updating to v4 which includes improvements to upload reliability and tokenless uploads for public repositories. Note that v4 may require explicit token configuration for private repositories.
Suggested change:

```diff
-        uses: codecov/codecov-action@v3
+        uses: codecov/codecov-action@v4
```
tests/integration_tests.rs (Outdated)
```rust
file_content.push(0xFF);
fs::write(&vault_path, file_content).expect("Failed to corrupt vault file");
```
The paranoid mode test corrupts a vault file by appending data, but this changes the file size which may be detected before the hash comparison. The test should verify that paranoid mode actually performs byte-by-byte comparison. Consider corrupting a byte in the middle of the file instead of appending to better test the paranoid verification logic.
Suggested change:

```diff
-file_content.push(0xFF);
-fs::write(&vault_path, file_content).expect("Failed to corrupt vault file");
+if !file_content.is_empty() {
+    let idx = file_content.len() / 2;
+    file_content[idx] ^= 0xFF;
+    fs::write(&vault_path, file_content).expect("Failed to corrupt vault file");
+}
```
tests/integration_tests.rs (Outdated)
```rust
if let Ok(Some(attr_val)) = xattr::get(&dup_path, "user.test_attr") {
    assert_eq!(attr_val, b"test_value", "Extended attribute value should match");
```
The xattr assertion is wrapped in an if let Ok(Some(...)) which silently passes the test if xattrs are not supported or fail to be retrieved. This means the test could pass even if xattr preservation is completely broken. Consider using expect() or at minimum logging a warning when xattrs cannot be verified, or skip the xattr assertion on systems where user xattrs are not supported.
Suggested change:

```diff
-if let Ok(Some(attr_val)) = xattr::get(&dup_path, "user.test_attr") {
-    assert_eq!(attr_val, b"test_value", "Extended attribute value should match");
+match xattr::get(&dup_path, "user.test_attr") {
+    Ok(Some(attr_val)) => {
+        assert_eq!(attr_val, b"test_value", "Extended attribute value should match");
+    }
+    Ok(None) => {
+        panic!("Extended attribute 'user.test_attr' not found on duplicate file");
+    }
+    Err(e) => {
+        panic!(
+            "Failed to retrieve extended attribute 'user.test_attr' on duplicate file: {e}"
+        );
+    }
+}
```
test(#28): implement robust integration test suite and CI pipeline

- Integration tests use `assert_cmd` and `tempfile` for safe `$HOME` sandboxing.
- Coverage includes `--allow-unsafe-hardlinks` and `--paranoid` bit-rot detection.
- GitHub Actions workflow (`ci.yml`) enforces formatting, linting, and sequential testing on `ubuntu-latest`.

Closes #28