Conversation
| for (int batch = 0; batch < (num_batches + 1000); batch++) { | ||
| std::vector<Doc> docs; | ||
| for (int i = 0; i < batch_size; i++) { | ||
| docs.push_back(CreateTestDoc(batch * batch_size + i, 0)); | ||
| } | ||
| auto write_result = collection->Insert(docs); | ||
| ASSERT_TRUE(write_result); | ||
| for (auto &s : write_result.value()) { | ||
| ASSERT_TRUE(s.ok()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Final verification loop inserts instead of verifying
The comment on line 178 says "verify data integrity", but the loop body calls collection->Insert() instead of collection->Fetch() + comparing results. This is a copy-paste error from the data population block above. The test will modify the collection instead of validating it, and any data-integrity bug would go undetected here.
Compare this to the correctly-written verification block at lines 142–157, which uses collection->Fetch(pks) and then checks *actual_doc == expected_doc.
| for (int batch = 0; batch < (num_batches + 1000); batch++) { | |
| std::vector<Doc> docs; | |
| for (int i = 0; i < batch_size; i++) { | |
| docs.push_back(CreateTestDoc(batch * batch_size + i, 0)); | |
| } | |
| auto write_result = collection->Insert(docs); | |
| ASSERT_TRUE(write_result); | |
| for (auto &s : write_result.value()) { | |
| ASSERT_TRUE(s.ok()); | |
| } | |
| } | |
| for (int batch = 0; batch < (num_batches + 1000); batch++) { | |
| for (int i = 0; i < batch_size; i++) { | |
| Doc expected_doc = CreateTestDoc(batch * batch_size + i, 0); | |
| std::vector<std::string> pks{expected_doc.pk()}; | |
| if (auto res = collection->Fetch(pks); res) { | |
| auto map = res.value(); | |
| if (map.find(expected_doc.pk()) == map.end()) { | |
| FAIL() << "Returned map does not contain doc[" << expected_doc.pk() << "]"; | |
| } | |
| const auto actual_doc = map.at(expected_doc.pk()); | |
| ASSERT_EQ(*actual_doc, expected_doc) | |
| << "Data mismatch for doc[" << expected_doc.pk() << "]"; | |
| } else { | |
| FAIL() << "Failed to fetch doc[" << expected_doc.pk() << "]"; | |
| } | |
| } | |
| } |
| auto result = | ||
| zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true}); |
There was a problem hiding this comment.
CollectionOptions mismatch with test configuration
The test file optimize_recovery_test.cc creates and reopens the collection with zvec::CollectionOptions{false, true, 256 * 1024} (a max-doc-per-segment limit of 256 * 1024), but the collection_optimizer binary opens the same on-disk collection with only zvec::CollectionOptions{false, true} (no segment size limit). This mismatch means the optimizer may behave differently than expected during the crash recovery test — for example, it might not trigger segment compaction at the same thresholds, leading to a test that doesn't actually exercise the intended code paths.
| auto result = | |
| zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true}); | |
| zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true, 256 * 1024}); |
| ASSERT_TRUE(WIFSIGNALED(status)) | ||
| << "Child process was not killed by a signal. It exited normally?"; |
There was a problem hiding this comment.
Flaky
ASSERT_TRUE(WIFSIGNALED) when process exits before kill
If the optimizer finishes naturally within the seconds window, kill(pid, 0) will return -1 (process gone), the kill is skipped, and waitpid will collect a normal-exit status — causing ASSERT_TRUE(WIFSIGNALED(status)) to fail with "Child process was not killed by a signal." The same pattern already exists in write_recovery_test.cc (RunGeneratorAndCrash), but the concern is amplified here because on faster machines the optimize of 2000 * 50 = 100 000 documents might complete quickly.
Consider using a softer assertion (e.g., accept either WIFEXITED or WIFSIGNALED) or asserting that the remaining verification logic still holds regardless of which exit path occurred.
| if (s.ok()) { | ||
| std::cout << "Optimize completed successfully" << std::endl; | ||
| // Print final stats | ||
| std::cout << "Final stats: " << collection->Stats()->to_string_formatted(); |
There was a problem hiding this comment.
Missing newline after final stats output
The final stats line has no std::endl or "\n", so the output may be buffered and mixed with subsequent output (e.g., the process's exit trace) without a line break. All other output statements in this file include std::endl.
| std::cout << "Final stats: " << collection->Stats()->to_string_formatted(); | |
| std::cout << "Final stats: " << collection->Stats()->to_string_formatted() << std::endl; |
| // VectorQuery query; | ||
| // query.output_fields_ = {"name", "age"}; | ||
| // query.filter_ = "invert_id >= 6000 and id < 6080"; | ||
| // query.topk_ = 100; | ||
| // std::vector<float> feature(4, 0.0); | ||
| // query.query_vector_.assign((const char *)feature.data(), | ||
| // feature.size() * sizeof(float)); | ||
| // query.field_name_ = "dense"; | ||
| // collection->Query(); |
There was a problem hiding this comment.
Commented-out dead code should be removed
Leaving a commented-out VectorQuery block in the test adds noise without providing value. If this functionality is intended to be tested, it should be enabled; otherwise, it should be deleted.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Greptile Summary
This PR adds a crash recovery test for the
Optimizeoperation, introduces a newcollection_optimizerhelper binary, and bundles several housekeeping fixes: typo corrections (CreateComapctTask→CreateCompactTask,new_segmnet_meta→new_segment_meta, etc.), a parameter rename for clarity (create→create_if_missing), and a relaxation of the floating-point comparison tolerance inDoc::operator==.Key changes:
collection_optimizerbinary that opens a collection and callsOptimize(), used as a subprocess that can be killed to simulate a crash.OptimizeRecoveryTest::CrashDuringOptimizetest that inserts 100 000 documents, crashes the optimizer mid-run, and then verifies the collection can be re-opened and data is intact.CreateComapctTaskcall-sites (source + tests) toCreateCompactTask.segment.cc,segment.h, andindex_storage.h.doc.ccloosened (1e-6f→1e-4f,1e-9→1e-6).Issues found:
CrashDuringOptimize(after the secondRunOptimizercall) callscollection->Insert()instead ofcollection->Fetch()+ compare — it inserts documents instead of verifying them, so data corruption would go undetected.write_recovery_test.ccSetUp/TearDownstill reference the oldcrash_test_dbpath after the variable was renamed towrite_recovery_test_db, leaving the test directory uncleaned between runs.collection_optimizer.ccopens the collection withCollectionOptions{false, true}(no segment limit), while the test creates/reopens it withCollectionOptions{false, true, 256 * 1024}, causing a configuration mismatch that may affect whether compaction is triggered.Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram participant T as Test participant C as Collection in-process participant O as collection_optimizer subprocess participant D as On-disk Collection T->>C: CreateAndOpen(dir_path_) T->>C: Insert 100k docs C->>D: Persist segments T->>C: collection.reset() T->>O: fork + execvp RunOptimizerAndCrash 4s O->>D: Open with CollectionOptions{false,true} O->>D: Optimize compaction starts T-->>O: SIGKILL after 4s T->>C: Open for recovery T->>C: Fetch all 100k docs and verify integrity T->>C: Insert 50k more docs T->>C: collection.reset() T->>O: fork + execvp RunOptimizer no crash O->>D: Open and Optimize completes cleanly O-->>T: exit 0 T->>C: Open final T->>C: Stats assert 150k docs T->>C: BUG Insert 150k docs instead of Fetch and verifyComments Outside Diff (2)
tests/db/crash_recovery/write_recovery_test.cc, line 129 (link)SetUp/TearDownreference stale path after renamedir_path_was renamed from"crash_test_db"to"write_recovery_test_db"in this PR, but bothSetUp(line 129) andTearDown(line 134) still attempt to delete the old"./crash_test_db"path. As a result:write_recovery_test_dbis never cleaned up between test runs, causing test pollution and potential false-positives on repeated runs.crash_test_dbis a no-op since that directory no longer exists.tests/db/crash_recovery/write_recovery_test.cc, line 134 (link)TearDownafter renameSame rename miss as in
SetUp. TheTearDownmust also reference the updated directory name.Last reviewed commit: "fix: fix some typos"