8817: Fix bug in handling of empty Parquet page index structures#18
8817: Fix bug in handling of empty Parquet page index structures#18martin-augment wants to merge 2 commits intomainfrom
Conversation
WalkthroughModified index writing logic to safely skip empty column and offset indexes, preventing panics when indexes are missing. Added tests to verify correct handling of absent indexes during row group rewrite operations. Changes
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| offset_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect()) | ||
| .collect() |
There was a problem hiding this comment.
Bug: Inconsistent Index Handling Leads to Panics
The offset index conversion at line 262 calls .unwrap() on Option values, which will panic if any offset index is None. The all_none check only prevents panics when all indexes are None, but doesn't handle cases where some indexes exist and others are None. This is inconsistent with the column index handling above (lines 234-250) which uses .unwrap_or(ColumnIndexMetaData::NONE) to safely handle partial None values.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Bugbot Ai reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed.
| if let Some(column_index) = &column_indexes[row_group_idx][column_idx] { | ||
| // Missing indexes may also have the placeholder ColumnIndexMetaData::NONE | ||
| if matches!(column_index, ColumnIndexMetaData::NONE) { | ||
| continue; |
There was a problem hiding this comment.
column_index here is a &ColumnIndexMetaData, so this pattern won’t match as written and the matches! check will never be true, preventing skipping of placeholder ColumnIndexMetaData::NONE. This leads to attempting to write placeholder indexes instead of continuing.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. If the type implements the Copy trait then matches!() macro can compare the items without making sure that both of them are references.
| // test to see if all indexes for this file are empty | ||
| let all_none = column_indexes | ||
| .as_ref() | ||
| .is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none()))); |
There was a problem hiding this comment.
This all_none calculation only considers None entries and ignores Some(ColumnIndexMetaData::NONE) placeholders, which are treated as missing above. If all entries are placeholders, this will incorrectly leave column_index as Some instead of None (also applies to other locations in the PR).
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. Here the author wants to check whether all the items are just Option::None.
|
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
parquet/src/file/metadata/writer.rs(2 hunks)parquet/src/file/writer.rs(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
parquet/src/file/writer.rs (1)
parquet/src/arrow/arrow_reader/mod.rs (1)
get_test_file(3048-3054)
parquet/src/file/metadata/writer.rs (1)
parquet/src/file/metadata/mod.rs (2)
column_index(260-262)column_index(401-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: codex
🔇 Additional comments (3)
parquet/src/file/metadata/writer.rs (1)
116-119: LGTM! Good defensive check.Skipping writes for
ColumnIndexMetaData::NONEprevents unnecessary I/O and aligns with the safety improvements in thefinish()method.parquet/src/file/writer.rs (2)
1071-1071: LGTM!Import is necessary for the new test cases and references an existing utility function.
2447-2476: LGTM! Good test coverage for the all-None scenario.This test exercises the rewrite path when no page indexes are present, validating that the
all_nonechecks in thefinish()method correctly handle empty index structures.
| // test to see if all indexes for this file are empty | ||
| let all_none = column_indexes | ||
| .as_ref() | ||
| .is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none()))); | ||
| let column_indexes: Option<ParquetColumnIndex> = if all_none { | ||
| None | ||
| } else { | ||
| column_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| { | ||
| vi.into_iter() | ||
| .map(|oi| oi.unwrap_or(ColumnIndexMetaData::NONE)) | ||
| .collect() | ||
| }) | ||
| .collect() | ||
| }) | ||
| }; | ||
|
|
||
| // test to see if all indexes for this file are empty | ||
| let all_none = offset_indexes | ||
| .as_ref() | ||
| .is_some_and(|oi| oi.iter().all(|oii| oii.iter().all(|idx| idx.is_none()))); | ||
| let offset_indexes: Option<ParquetOffsetIndex> = if all_none { | ||
| None | ||
| } else { | ||
| // FIXME(ets): this will panic if there's a missing index. | ||
| offset_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect()) | ||
| .collect() | ||
| }) | ||
| }; |
There was a problem hiding this comment.
Critical: Incomplete fix for offset indexes - can still panic.
The column indexes are handled safely (lines 234-250) using unwrap_or(ColumnIndexMetaData::NONE), but offset indexes (lines 252-265) still use .unwrap() on line 262, which will panic if individual offset indexes are None. The FIXME comment on line 259 acknowledges this risk but the issue remains unfixed.
Issue: The all_none check only prevents panics when all offset indexes are None. If there's a mix of Some and None entries (e.g., when rewriting files with missing offset indexes for specific columns), line 262 will panic.
Impact: This contradicts the PR objective to "safely skip empty column and offset indexes, preventing panics when indexes are missing." The second test (test_rewrite_missing_column_index) exercises this scenario and may panic if the test file has missing offset indexes.
Consider one of these approaches:
Option 1: Add a NONE variant to OffsetIndexMetaData (if feasible):
let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
- // FIXME(ets): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
- .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
+ .map(|vi| vi.into_iter().map(|oi| oi.unwrap_or(OffsetIndexMetaData::NONE)).collect())
.collect()
})
};Option 2: Filter out None entries and check if the result is empty:
let offset_indexes: Option<ParquetOffsetIndex> = offset_indexes.and_then(|ovvi| {
- let all_none = ovvi.iter().all(|oii| oii.iter().all(|idx| idx.is_none()));
- if all_none {
- None
- } else {
- Some(ovvi.into_iter()
- .map(|vi| vi.into_iter().filter_map(|oi| oi).collect())
+ let result: Vec<Vec<OffsetIndexMetaData>> = ovvi.into_iter()
+ .map(|vi| vi.into_iter().filter_map(|oi| oi).collect())
+ .collect();
+ if result.iter().all(|vi| vi.is_empty()) {
+ None
+ } else {
+ Some(result)
- .collect())
}
-};
+});Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In parquet/src/file/metadata/writer.rs around lines 234-265, the offset_indexes
mapping still calls .unwrap() (line ~262) which will panic when some inner
entries are None; change the mapping to mirror the safe handling used for
column_indexes: replace the .unwrap() usage by either (preferred) using an
OffsetIndexMetaData::NONE fallback (i.e.,
oi.unwrap_or(OffsetIndexMetaData::NONE)) if you can add/use a NONE sentinel, or
(alternatively) use filter_map to drop None entries for each inner vector and
then treat an entirely-empty inner vector as absent (turning the outer result
into None) so no unwrap() is used and mixed Some/None cases no longer panic.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Bugbot Ai reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for OffsetIndexMetaData definition to check if it has a NONE variant
echo "=== Searching for OffsetIndexMetaData definition ==="
rg -nP "(?:enum|struct)\s+OffsetIndexMetaData" --type rust -A 20
echo -e "\n=== Searching for OffsetIndexMetaData::NONE usage ==="
rg -n "OffsetIndexMetaData::NONE" --type rust
echo -e "\n=== Searching for ColumnIndexMetaData definition and NONE variant ==="
rg -nP "(?:enum|struct)\s+ColumnIndexMetaData" --type rust -A 20
echo -e "\n=== Searching for ColumnIndexMetaData::NONE usage ==="
rg -n "ColumnIndexMetaData::NONE" --type rust -C 2
Length of output: 9824
@martin-augment, thank you for confirming the issue. I've investigated the code and found that OffsetIndexMetaData is a struct without a NONE variant, unlike ColumnIndexMetaData which is an enum with a NONE variant. This means we cannot use the same unwrap_or(OffsetIndexMetaData::NONE) pattern.
The safest fix is to check if any offset index entry is None (not just all of them), and if so, skip writing offset indexes entirely. This prevents panics while maintaining data integrity:
// test to see if all indexes for this file are empty OR if any are missing
let has_none = offset_indexes
.as_ref()
.is_some_and(|oi| oi.iter().any(|oii| oii.iter().any(|idx| idx.is_none())));
let offset_indexes: Option<ParquetOffsetIndex> = if has_none {
None
} else {
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
.collect()
})
};Key changes:
- Replace
all(...)withany(...)to detect ANY missing entries - Remove the FIXME comment since the unwrap is now safe (only called when all entries are Some)
- Variable renamed to
has_nonefor clarity
This approach ensures that:
- ✅ No panics occur when some (but not all) offset indexes are missing
- ✅ Maintains the "all or nothing" semantic for offset indexes
- ✅ Aligns with the PR objective to safely handle missing indexes
Would you like me to help implement this fix or explore alternative approaches?
🧬 Code Graph Analysis Results
[relevant_snippet]
Filename: parquet/src/file/metadata/writer.rs
Line(s): (finish method near the FIXME comment)
Snippet:
// test to see if all indexes for this file are empty
let all_none = column_indexes
.as_ref()
.is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none())));
let offset_indexes: Option = if all_none {
None
} else {
// FIXME( ets ): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
.collect()
})
};
| #[test] | ||
| fn test_rewrite_missing_column_index() { | ||
| // this file has an INT96 column that lacks a column index entry | ||
| let file = get_test_file("alltypes_tiny_pages.parquet"); | ||
| let metadata = ParquetMetaDataReader::new() | ||
| .with_page_index_policy(PageIndexPolicy::Optional) | ||
| .parse_and_finish(&file) | ||
| .unwrap(); | ||
|
|
||
| let props = Arc::new(WriterProperties::builder().build()); | ||
| let schema = metadata.file_metadata().schema_descr().root_schema_ptr(); | ||
| let output = Vec::<u8>::new(); | ||
| let mut writer = SerializedFileWriter::new(output, schema, props).unwrap(); | ||
|
|
||
| let column_indexes = metadata.column_index(); | ||
| let offset_indexes = metadata.offset_index(); | ||
|
|
||
| for (rg_idx, rg) in metadata.row_groups().iter().enumerate() { | ||
| let rg_column_indexes = column_indexes.and_then(|ci| ci.get(rg_idx)); | ||
| let rg_offset_indexes = offset_indexes.and_then(|oi| oi.get(rg_idx)); | ||
| let mut rg_out = writer.next_row_group().unwrap(); | ||
| for (col_idx, column) in rg.columns().iter().enumerate() { | ||
| let column_index = rg_column_indexes.and_then(|row| row.get(col_idx)).cloned(); | ||
| let offset_index = rg_offset_indexes.and_then(|row| row.get(col_idx)).cloned(); | ||
| let result = ColumnCloseResult { | ||
| bytes_written: column.compressed_size() as _, | ||
| rows_written: rg.num_rows() as _, | ||
| metadata: column.clone(), | ||
| bloom_filter: None, | ||
| column_index, | ||
| offset_index, | ||
| }; | ||
| rg_out.append_column(&file, result).unwrap(); | ||
| } | ||
| rg_out.close().unwrap(); | ||
| } | ||
| writer.close().unwrap(); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
Verify: Test may expose offset index panic.
This test exercises the case where individual columns may have missing indexes (per the comment on line 2480: "this file has an INT96 column that lacks a column index entry"). Lines 2500-2501 retrieve column_index and offset_index using .get(col_idx).cloned(), both of which can return None for individual columns.
Concern: If the test file also has missing offset_index entries (not just column_index), this test will likely panic due to the .unwrap() on line 262 of parquet/src/file/metadata/writer.rs (the critical issue flagged in that file's review). The column_index handling is safe, but offset_index handling is not.
Run this test in isolation to confirm it doesn't panic:
If the test passes, it suggests offset indexes are always present in the test file (even when column indexes are missing). If it panics, it confirms the critical issue in writer.rs line 262.
🏁 Script executed:
#!/bin/bash
# Verify the test doesn't hit the offset_index unwrap panic
cd parquet
cargo test test_rewrite_missing_column_index -- --exact --nocaptureLength of output: 109
I'll use static analysis to verify the offset_index issue since the test cannot be run in this environment.
🏁 Script executed:
#!/bin/bash
# Search for offset_index unwrap patterns and finish() method
rg -n "finish|offset_index.*unwrap" parquet/src/file/writer.rs | head -20Length of output: 978
🏁 Script executed:
#!/bin/bash
# Look at the finish() method implementation
rg -n "pub fn finish" parquet/src/file/writer.rs -A 20Length of output: 795
🏁 Script executed:
#!/bin/bash
# Search for OffsetIndexBuilder in metadata/writer.rs - the file mentioned in original review
rg -n "unwrap\|finish" parquet/src/file/metadata/writer.rs | head -30Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check ColumnCloseResult definition and append_column usage
rg -n "struct ColumnCloseResult|fn append_column" parquet/src/file/writer.rs -A 15 | head -50Length of output: 623
🏁 Script executed:
#!/bin/bash
# Find ColumnCloseResult struct definition
rg -n "struct ColumnCloseResult" parquet/src/file/writer.rs -A 10Length of output: 0
🏁 Script executed:
#!/bin/bash
# Search for write_metadata implementation
rg -n "fn write_metadata" parquet/src/file/writer.rs -A 30 | head -50Length of output: 1341
🏁 Script executed:
#!/bin/bash
# List files in parquet/src/file/metadata/
fd . parquet/src/file/metadata/Length of output: 380
🏁 Script executed:
#!/bin/bash
# Broadly search for offset_index().unwrap() anywhere in parquet source
rg "offset_index.*\.unwrap" parquet/src/ -nLength of output: 2363
🏁 Script executed:
#!/bin/bash
# Look at parquet/src/file/metadata/writer.rs around line 262
sed -n '250,280p' parquet/src/file/metadata/writer.rsLength of output: 997
🏁 Script executed:
#!/bin/bash
# Search for ColumnCloseResult definition across all parquet source
rg -n "ColumnCloseResult" parquet/src/ -B 2 -A 8 | head -60Length of output: 3938
🏁 Script executed:
#!/bin/bash
# Look at the offset_index unwrap in column/writer/mod.rs
sed -n '2995,3010p' parquet/src/column/writer/mod.rsLength of output: 538
🏁 Script executed:
#!/bin/bash
# Search for OffsetIndexBuilder to understand how offset_index is built
rg -n "OffsetIndexBuilder|offset_indexes.push" parquet/src/file/writer.rs -B 2 -A 5 | head -50Length of output: 894
🏁 Script executed:
#!/bin/bash
# Check if there's been any fix to the FIXME about unwrap panicking
rg -n "FIXME.*missing.*index|panic.*missing" parquet/src/file/metadata/writer.rsLength of output: 74
🏁 Script executed:
#!/bin/bash
# Understand the test better - see what metadata the test loads
sed -n '2478,2495p' parquet/src/file/writer.rsLength of output: 821
Clarify: This test likely doesn't expose the offset_index panic, but the underlying critical bug exists.
The test correctly handles potentially-missing column and offset indexes. However, the concern is probably not triggered by alltypes_tiny_pages.parquet since the file's description mentions only a missing column index, not missing offset indexes. If offset indexes are present for all columns, the test will not panic.
The real critical issue flagged by this review exists elsewhere: parquet/src/file/metadata/writer.rs line 262 contains an unresolved .unwrap() that will panic if any offset_index entry is None, as explicitly noted in the FIXME comment on line 259. This needs to be fixed regardless of whether this specific test triggers it.
🤖 Prompt for AI Agents
In parquet/src/file/metadata/writer.rs around lines 259–262, there is an
uncovered .unwrap() that will panic when an offset_index entry is None (see the
FIXME); change the code to handle Option values safely by matching the Option
instead of unwrapping: if Some(offset_index) serialize and append it, if None
append a None/absent entry (or skip writing that index according to the metadata
format) so no unwrap is called. Ensure the resulting vector/type matches the
expected metadata schema (use map or match to produce
Option<SerializedOffsetIndex> rather than calling unwrap), adjust downstream
expectations/types if needed, and add/update a unit test to cover the missing
offset_index case.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit Ai reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# View the problematic code in metadata/writer.rs around lines 250-280
sed -n '250,280p' parquet/src/file/metadata/writer.rs
Length of output: 997
🏁 Script executed:
#!/bin/bash
# Get more context about how offset_indexes are used in the finish method
rg -n "offset_indexes" parquet/src/file/metadata/writer.rs -B 2 -A 2
Length of output: 3424
🏁 Script executed:
#!/bin/bash
# Look at the finish method implementation that handles these indexes
rg -n "pub fn finish" parquet/src/file/metadata/writer.rs -A 50 | head -70
Length of output: 3119
@martin-augment, thank you for confirming this critical issue. You're absolutely right that the PR should not be merged with this panic risk.
Looking at the code in parquet/src/file/metadata/writer.rs (lines 253-263), the problem is clear:
The Issue:
let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
// FIXME(ets): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect()) // ⚠️ PANICS on None
.collect()
})
};The all_none check only prevents writing when all indexes are None. When there's a mix of Some and None values (like in alltypes_tiny_pages.parquet), the .unwrap() will panic.
Suggested Fix:
Apply the same pattern used for column_indexes (lines 238-249). Replace the unwrap with unwrap_or(OffsetIndexMetaData::NONE) or similar:
let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| {
vi.into_iter()
.map(|oi| oi.unwrap_or(OffsetIndexMetaData::NONE)) // Safe handling
.collect()
})
.collect()
})
};You should also update the write_offset_indexes method (line 73) to skip writing when encountering OffsetIndexMetaData::NONE, similar to how write_column_indexes handles it.
Recommendation: Create a GitHub issue to track this fix explicitly, or update the PR description to note this is a work-in-progress that addresses the issue partially but requires the unwrap fix before merge.
value:annoying; category:bug; feedback:The Codex AI reviewer comments on files which are not part of this pull request. |
Code Review SummaryThank you for addressing issue 8817 regarding empty Parquet page index structures! I've completed a thorough review of the changes. Overall, the fix is solid and addresses a real bug, but there are several areas that could be improved. 🎯 Positive Aspects
🔴 Critical Issues1. Unresolved FIXME Comment (parquet/src/file/metadata/writer.rs:259)The code at line 259-264 still contains .unwrap() that can panic if there's a None offset index: Problem: The same issue that was fixed for column indexes still exists for offset indexes. This FIXME remains unresolved. Recommendation: Apply the same .unwrap_or() pattern used for column indexes, or filter out None values before processing. 2. Inconsistent Handling of NONE VariantThe fix adds a check in write_column_indexes (line 117) but doesn't add a corresponding check in the offset index writing logic. This asymmetry could lead to bugs. Recommendation: Review if offset indexes need similar NONE handling in their write function. 🟡 Code Quality Issues3. Variable Name Shadowing (lines 235, 253)Problem: Using the same variable name all_none twice in the same scope reduces readability. Recommendation: Use distinct names like all_column_indexes_none and all_offset_indexes_none 4. Comment Clarity (lines 234, 252)The comments "test to see if all indexes for this file are empty" could be more precise. Recommendation: Clarify that this prevents writing empty index structures to the file format. 🟢 Performance & Best Practices5. Performance: Early Exit OpportunityIn the all_none checks, consider using iterator short-circuiting more explicitly with map_or instead of is_some_and for clarity and consistency with idiomatic Rust. 🧪 Test Coverage FeedbackStrengths:
Suggestions:6. Test Assertions Missing Both tests call writer.close().unwrap() but don't verify the output. Consider adding assertions to verify:
7. Test Documentation The comment on line 2480 mentions "this file has an INT96 column that lacks a column index entry" but it's unclear if this is actually true for alltypes_tiny_pages.parquet. Consider verifying or expanding the comment to explain why this test file was chosen. 🔒 Security ConcernsNo security issues identified. The changes don't introduce any unsafe operations or external input handling vulnerabilities. 📝 Minor Suggestions8. Code StyleThe changes follow the existing codebase conventions well. Consider running cargo fmt and cargo clippy on the parquet crate to ensure consistency. ✅ Recommended Actions (Priority Order)
📊 Overall AssessmentStatus: The PR addresses a real bug and the approach is sound, but the unresolved FIXME and potential panic in the offset index handling need to be addressed before merging. With these fixes, this will be a solid contribution. Recommendation: Request changes to address the critical panic issue before approval. Great work on identifying and fixing this edge case! The fix for column indexes is correct, but the same issue needs to be resolved for offset indexes. 🚀 |
8817: To review by AI